* [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
@ 2024-06-27  0:54 Alistair Popple
  2024-06-27  0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
                   ` (14 more replies)
  0 siblings, 15 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular, pages are considered free when the refcount hits one rather
than zero, and refcounts are not added when mapping the page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.
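
To make the difference concrete, here is a sketch of the two lifetime
rules (the helpers are illustrative only and are not part of this
series):

  /* Today: an FS DAX page is "free" while its refcount is still one,
   * because no reference is taken for the page table mapping. */
  static inline bool fsdax_page_is_free_old(struct page *page)
  {
          return page_ref_count(page) == 1;
  }

  /* After this series: FS DAX pages follow the normal rule and are
   * free once the refcount drops to zero, with a reference held for
   * every mapping just like any other page. */
  static inline bool fsdax_page_is_free_new(struct page *page)
  {
          return page_ref_count(page) == 0;
  }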

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular, pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable software-defined PTE bit on
architectures that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap-managed
functions, but I have left that as a future improvement.

This is an update to the original RFC rebased onto v6.10-rc5. Unlike
the original RFC it passes the same number of ndctl test suite
(https://github.com/pmem/ndctl) tests as my current development
environment does without these patches.

I am not intimately familiar with the FS DAX code, so would appreciate
some careful review there. In particular, I have not given any thought
at all to CONFIG_FS_DAX_LIMITED.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

Alistair Popple (13):
  mm/gup.c: Remove redundant check for PCI P2PDMA page
  pci/p2pdma: Don't initialise page refcount to one
  fs/dax: Refactor wait for dax idle page
  fs/dax: Add dax_page_free callback
  mm: Allow compound zone device pages
  mm/memory: Add dax_insert_pfn
  huge_memory: Allow mappings of PUD sized pages
  huge_memory: Allow mappings of PMD sized pages
  gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  fs/dax: Properly refcount fs dax pages
  huge_memory: Remove dead vmf_insert_pXd code
  mm: Remove pXX_devmap callers
  mm: Remove devmap related functions and page table bits

 Documentation/mm/arch_pgtable_helpers.rst     |   6 +-
 arch/arm64/Kconfig                            |   1 +-
 arch/arm64/include/asm/pgtable-prot.h         |   1 +-
 arch/arm64/include/asm/pgtable.h              |  24 +--
 arch/powerpc/Kconfig                          |   1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   7 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  52 +----
 arch/powerpc/include/asm/book3s/64/radix.h    |  14 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c       |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c            |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   5 +-
 arch/powerpc/mm/pgtable.c                     |   2 +-
 arch/x86/Kconfig                              |   1 +-
 arch/x86/include/asm/pgtable.h                |  50 +----
 arch/x86/include/asm/pgtable_types.h          |   5 +-
 drivers/dax/device.c                          |  12 +-
 drivers/dax/super.c                           |   2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c        |   2 +-
 drivers/nvdimm/pmem.c                         |   9 +-
 drivers/pci/p2pdma.c                          |   4 +-
 fs/dax.c                                      | 204 +++++++---------
 fs/ext4/inode.c                               |   5 +-
 fs/fuse/dax.c                                 |   4 +-
 fs/fuse/virtio_fs.c                           |   8 +-
 fs/userfaultfd.c                              |   2 +-
 fs/xfs/xfs_inode.c                            |   4 +-
 include/linux/dax.h                           |  11 +-
 include/linux/huge_mm.h                       |  17 +-
 include/linux/memremap.h                      |  23 +-
 include/linux/migrate.h                       |   2 +-
 include/linux/mm.h                            |  40 +---
 include/linux/page-flags.h                    |   6 +-
 include/linux/pfn_t.h                         |  20 +--
 include/linux/pgtable.h                       |  21 +--
 include/linux/rmap.h                          |  14 +-
 lib/test_hmm.c                                |   2 +-
 mm/Kconfig                                    |   4 +-
 mm/debug_vm_pgtable.c                         |  59 +-----
 mm/gup.c                                      | 178 +--------------
 mm/hmm.c                                      |  12 +-
 mm/huge_memory.c                              | 248 +++++++------------
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |   2 +-
 mm/mapping_dirty_helpers.c                    |   4 +-
 mm/memory-failure.c                           |   6 +-
 mm/memory.c                                   | 114 ++++++---
 mm/memremap.c                                 |  38 +---
 mm/migrate_device.c                           |   6 +-
 mm/mlock.c                                    |   2 +-
 mm/mm_init.c                                  |   5 +-
 mm/mprotect.c                                 |   2 +-
 mm/mremap.c                                   |   5 +-
 mm/page_vma_mapped.c                          |   5 +-
 mm/pgtable-generic.c                          |   7 +-
 mm/rmap.c                                     |  48 ++++-
 mm/swap.c                                     |   2 +-
 mm/userfaultfd.c                              |   2 +-
 mm/vmscan.c                                   |   5 +-
 59 files changed, 485 insertions(+), 869 deletions(-)

base-commit: f2661062f16b2de5d7b6a5c42a9a5c96326b8454
-- 
git-series 0.9.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  6:36   ` Dan Williams
  2024-06-27  0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple, Jason Gunthorpe

PCI P2PDMA pages are not mapped with pXX_devmap PTEs, so the check
in __gup_device_huge() is redundant. Remove it.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ca0f5ce..669583e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3044,11 +3044,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
 			break;
 		}
 
-		if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-			break;
-		}
-
 		folio = try_grab_folio(page, 1, flags);
 		if (!folio) {
 			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
  2024-06-27  0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:30   ` Christoph Hellwig
  2024-06-29 21:28   ` Bjorn Helgaas
  2024-06-27  0:54 ` [PATCH 03/13] fs/dax: Refactor wait for dax idle page Alistair Popple
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

The reference counts for ZONE_DEVICE pages should be initialised by
the driver when the page is actually allocated by the driver
allocator, not when it is first created. This is currently the case
for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages but not
for MEMORY_DEVICE_PCI_P2PDMA pages, so fix that up.
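
For context, the drivers that already get this right initialise the
refcount at allocation time via zone_device_page_init(), which takes a
pgmap reference, sets the page count to one and locks the page. A rough
sketch of that driver-side flow (the chunk allocator is hypothetical):

  static struct page *devmem_alloc_page(struct my_chunk *chunk)
  {
          /* Pick a free pfn from the device's memory - driver specific. */
          struct page *page = my_chunk_pick_free_page(chunk);

          if (!page)
                  return NULL;
          /* Sets the refcount to 1 and locks the page for the caller. */
          zone_device_page_init(page);
          return page;
  }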

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/pci/p2pdma.c | 2 ++
 mm/memremap.c        | 8 ++++----
 mm/mm_init.c         | 4 +++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13..1e9ea32 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
 		goto out;
 	}
 
+	set_page_count(virt_to_page(kaddr), 1);
+
 	/*
 	 * vm_insert_page() can sleep, so a reference is taken to mapping
 	 * such that rcu_read_unlock() can be done before inserting the
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..caccbd8 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,15 @@ void free_zone_device_folio(struct folio *folio)
 	folio->mapping = NULL;
 	folio->page.pgmap->ops->page_free(folio_page(folio, 0));
 
-	if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
-	    folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+	if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
+		put_dev_pagemap(folio->page.pgmap);
+	else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
 		/*
 		 * Reset the refcount to 1 to prepare for handing out the page
 		 * again.
 		 */
 		folio_set_count(folio, 1);
-	else
-		put_dev_pagemap(folio->page.pgmap);
 }
 
 void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3ec0493..b7e1599 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -6,6 +6,7 @@
  * Author Mel Gorman <mel@csn.ul.ie>
  *
  */
+#include <linux/memremap.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/kobject.h>
@@ -1014,7 +1015,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 * which will set the page count to 1 when allocating the page.
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_COHERENT)
+	    pgmap->type == MEMORY_DEVICE_COHERENT ||
+	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
 		set_page_count(page, 0);
 }
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 03/13] fs/dax: Refactor wait for dax idle page
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
  2024-06-27  0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
  2024-06-27  0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:31   ` Christoph Hellwig
  2024-06-27  0:54 ` [PATCH 04/13] fs/dax: Add dax_page_free callback Alistair Popple
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

An FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.
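
For illustration, the callback a filesystem passes in is expected to
drop whatever lock prevents the page from becoming idle, reschedule,
and then retake it; roughly (a sketch of such a callback, names are
hypothetical):

  static void myfs_wait_dax_page(struct inode *inode)
  {
          filemap_invalidate_unlock_shared(inode->i_mapping);
          schedule();
          filemap_invalidate_lock_shared(inode->i_mapping);
  }

  /* ... in the break-layouts loop ... */
  error = dax_wait_page_idle(page, myfs_wait_dax_page, inode);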

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c     | 5 +----
 fs/fuse/dax.c       | 4 +---
 fs/xfs/xfs_inode.c  | 4 +---
 include/linux/dax.h | 8 ++++++++
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bae9cc..4737450 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3844,10 +3844,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(inode));
+		error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
 	} while (error == 0);
 
 	return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d..da50595 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, fuse_wait_dax_page(inode));
+	return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f36091e..b5742aa 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -4243,9 +4243,7 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e332..773dfc4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -213,6 +213,14 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 
+static inline int dax_wait_page_idle(struct page *page,
+				void (cb)(struct inode *),
+				struct inode *inode)
+{
+	return ___wait_var_event(page, page_ref_count(page) == 1,
+				TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 04/13] fs/dax: Add dax_page_free callback
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (2 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 03/13] fs/dax: Refactor wait for dax idle page Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:33   ` Christoph Hellwig
  2024-06-27  0:54 ` [PATCH 05/13] mm: Allow compound zone device pages Alistair Popple
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

When a fs dax page is freed the filesystem needs to be notified that
the page has been unpinned/unmapped and is free. Currently this
involves special code in the page free paths to detect a refcount
transition from 2 to 1 and to call fs dax specific code.

A future change will require this to happen when the page refcount
drops to zero. In this case we can use the existing
pgmap->ops->page_free() callback so wire that up for all devices that
support FS DAX (nvdimm and virtio).
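
In other words the special-cased put path goes away in favour of the
normal free path; schematically (a sketch of the two models, not code
from this patch):

  /* Old model: unmap/unpin paths special-case fs dax, watch for the
   * 2 -> 1 refcount transition and wake waiters themselves. */
  static void old_fsdax_put(struct folio *folio, int refs)
  {
          if (folio_ref_sub_return(folio, refs) == 1)
                  wake_up_var(&folio->_refcount);
  }

  /* New model: nothing special in the put path. Once the refcount hits
   * zero the generic ZONE_DEVICE free path runs and calls
   * pgmap->ops->page_free(), i.e. dax_page_free() -> wake_up_var(). */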

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/nvdimm/pmem.c | 1 +
 fs/dax.c              | 6 ++++++
 fs/fuse/virtio_fs.c   | 5 +++++
 include/linux/dax.h   | 1 +
 4 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 598fe2e..cafadd0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -444,6 +444,7 @@ static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
 
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
 	.memory_failure		= pmem_pagemap_memory_failure,
+	.page_free	        = dax_page_free,
 };
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/fs/dax.c b/fs/dax.c
index becb4a6..f93afd7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -2065,3 +2065,9 @@ int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					       pos_out, len, remap_flags, ops);
 }
 EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
+
+void dax_page_free(struct page *page)
+{
+	wake_up_var(page);
+}
+EXPORT_SYMBOL_GPL(dax_page_free);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 1a52a51..6e90a4b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -909,6 +909,10 @@ static void virtio_fs_cleanup_dax(void *data)
 
 DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
 
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+	.page_free = dax_page_free,
+};
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 {
 	struct dax_device *dax_dev __free(cleanup_dax) = NULL;
@@ -948,6 +952,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 		return -ENOMEM;
 
 	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->ops = &fsdax_pagemap_ops;
 
 	/* Ideally we would directly use the PCI BAR resource but
 	 * devm_memremap_pages() wants its own copy in pgmap.  So
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 773dfc4..adbafc8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -213,6 +213,7 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 
+void dax_page_free(struct page *page);
 static inline int dax_wait_page_idle(struct page *page,
 				void (cb)(struct inode *),
 				struct inode *inode)
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 05/13] mm: Allow compound zone device pages
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (3 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 04/13] fs/dax: Add dax_page_free callback Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:35   ` Christoph Hellwig
  2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple, Jason Gunthorpe

Zone device pages are used to represent various types of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.

The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages and we can
use the same scheme to distinguish tail pages. To obtain the pgmap for
a tail page a new accessor is introduced to fetch it from
compound_head.
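
For reference, the encoding this relies on looks roughly like the
following (a simplified sketch of what the new accessor does via
compound_head()):

  static inline struct dev_pagemap *sketch_page_pgmap(const struct page *page)
  {
          unsigned long head = READ_ONCE(page->compound_head);

          /* Bit zero set means tail page; the remaining bits point to
           * the head page, which carries the shared pgmap pointer. */
          if (head & 1)
                  return ((const struct page *)(head - 1))->pgmap;

          /* Head or order-0 page: pgmap is stored directly. */
          return page->pgmap;
  }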

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

---

In response to the RFC Matthew Wilcox pointed out that we could move
the pgmap field to the folio. Morally I think that's where pgmap
belongs, so it's a good idea that I just haven't had a chance to
implement yet. I suspect there will be at least a v2 of this series
though so will probably do it then.
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 drivers/pci/p2pdma.c                   |  2 +-
 include/linux/memremap.h               | 12 +++++++++---
 include/linux/migrate.h                |  2 +-
 lib/test_hmm.c                         |  2 +-
 mm/hmm.c                               |  2 +-
 mm/memory.c                            |  2 +-
 mm/memremap.c                          |  8 ++++----
 mm/migrate_device.c                    |  4 ++--
 9 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 6fb65b0..18d74a7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,7 @@ struct nouveau_dmem {
 
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
-	return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+	return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, pagemap);
 }
 
 static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 1e9ea32..d9b422a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -195,7 +195,7 @@ static const struct attribute_group p2pmem_group = {
 
 static void p2pdma_page_free(struct page *page)
 {
-	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
 	struct pci_p2pdma *p2pdma =
 		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..6505713 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -140,6 +140,12 @@ struct dev_pagemap {
 	};
 };
 
+static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
+{
+	WARN_ON(!is_zone_device_page(page));
+	return compound_head(page)->pgmap;
+}
+
 static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap)
 {
 	return pgmap->ops && pgmap->ops->memory_failure;
@@ -161,7 +167,7 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
 		is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+		page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +179,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
 		is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+		page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
 static inline bool is_device_coherent_page(const struct page *page)
 {
 	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_COHERENT;
+		page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
 }
 
 static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2ce13e8..e31acc0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -200,7 +200,7 @@ struct migrate_vma {
 	unsigned long		end;
 
 	/*
-	 * Set to the owner value also stored in page->pgmap->owner for
+	 * Set to the owner value also stored in page_dev_pagemap(page)->owner for
 	 * migrating out of device private memory. The flags also need to
 	 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
 	 * The caller should always set this field when using mmu notifier
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index b823ba7..a02d709 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,7 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
 
 static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
 {
-	return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+	return container_of(page_dev_pagemap(page), struct dmirror_chunk, pagemap);
 }
 
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 93aebd9..26e1905 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * just report the PFN.
 		 */
 		if (is_device_private_entry(entry) &&
-		    pfn_swap_entry_to_page(entry)->pgmap->owner ==
+		    page_dev_pagemap(pfn_swap_entry_to_page(entry))->owner ==
 		    range->dev_private_owner) {
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index 25a77c4..ce48a05 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3994,7 +3994,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 */
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+			ret = page_dev_pagemap(vmf->page)->ops->migrate_to_ram(vmf);
 			put_page(vmf->page);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index caccbd8..13c1d5b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_folio(struct folio *folio)
 {
-	if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
-			!folio->page.pgmap->ops->page_free))
+	if (WARN_ON_ONCE(!page_dev_pagemap(&folio->page)->ops ||
+			!page_dev_pagemap(&folio->page)->ops->page_free))
 		return;
 
 	mem_cgroup_uncharge(folio);
@@ -486,7 +486,7 @@ void free_zone_device_folio(struct folio *folio)
 	 * to clear folio->mapping.
 	 */
 	folio->mapping = NULL;
-	folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+	page_dev_pagemap(&folio->page)->ops->page_free(folio_page(folio, 0));
 
 	if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
 	    folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
@@ -505,7 +505,7 @@ void zone_device_page_init(struct page *page)
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_dev_pagemap(page)->ref));
 	set_page_count(page, 1);
 	lock_page(page);
 }
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index aecc719..4fdd8fa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -135,7 +135,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			page = pfn_swap_entry_to_page(entry);
 			if (!(migrate->flags &
 				MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
-			    page->pgmap->owner != migrate->pgmap_owner)
+			    page_dev_pagemap(page)->owner != migrate->pgmap_owner)
 				goto next;
 
 			mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -156,7 +156,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 			else if (page && is_device_coherent_page(page) &&
 			    (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
-			     page->pgmap->owner != migrate->pgmap_owner))
+			     page_dev_pagemap(page)->owner != migrate->pgmap_owner))
 				goto next;
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (4 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 05/13] mm: Allow compound zone device pages Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:22   ` Christoph Hellwig
                     ` (2 more replies)
  2024-06-27  0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
                   ` (8 subsequent siblings)
  14 siblings, 3 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.

To allow DAX page refcounts to be managed the same as normal page
refcounts introduce dax_insert_pfn. This will take a reference on the
underlying page much the same as vmf_insert_page, except it also
permits upgrading an existing mapping to be writable if
requested/possible.
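
A DAX fault handler then uses it much like vmf_insert_pfn(), but
passing whether the fault is a write so an existing read-only mapping
can be upgraded; a sketch of a PTE-sized fault (mirroring the
device-dax conversion later in this series):

  static vm_fault_t my_dax_pte_fault(struct vm_fault *vmf, phys_addr_t phys)
  {
          pfn_t pfn = phys_to_pfn_t(phys, 0);

          /* Takes a reference on the page backing @pfn for the mapping. */
          return dax_insert_pfn(vmf->vma, vmf->address, pfn,
                                vmf->flags & FAULT_FLAG_WRITE);
  }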

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/mm.h |  4 ++-
 mm/memory.c        | 79 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a5652c..b84368b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1080,6 +1080,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
+extern void prep_compound_page(struct page *page, unsigned int order);
+
 /*
  * compound_order() can be called without holding a reference, which means
  * that niceties like page_folio() don't work.  These callers should be
@@ -3624,6 +3626,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
 				unsigned long num);
 int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 				unsigned long num);
+vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
+		unsigned long addr, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index ce48a05..4f26a1f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1989,14 +1989,42 @@ static int validate_page_before_insert(struct page *page)
 }
 
 static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
-			unsigned long addr, struct page *page, pgprot_t prot)
+			unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite)
 {
 	struct folio *folio = page_folio(page);
+	pte_t entry = ptep_get(pte);
 
-	if (!pte_none(ptep_get(pte)))
+	if (!pte_none(entry)) {
+		if (mkwrite) {
+			/*
+			 * For read faults on private mappings the PFN passed
+			 * in may not match the PFN we have mapped if the
+			 * mapped PFN is a writeable COW page.  In the mkwrite
+			 * case we are creating a writable PTE for a shared
+			 * mapping and we expect the PFNs to match. If they
+			 * don't match, we are likely racing with block
+			 * allocation and mapping invalidation so just skip the
+			 * update.
+			 */
+			if (pte_pfn(entry) != page_to_pfn(page)) {
+				WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
+				return -EFAULT;
+			}
+			entry = maybe_mkwrite(entry, vma);
+			entry = pte_mkyoung(entry);
+			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+				update_mmu_cache(vma, addr, pte);
+			return 0;
+		}
 		return -EBUSY;
+	}
+
 	/* Ok, finally just insert the thing.. */
 	folio_get(folio);
+	if (mkwrite)
+		entry = maybe_mkwrite(mk_pte(page, prot), vma);
+	else
+		entry = mk_pte(page, prot);
 	inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
 	folio_add_file_rmap_pte(folio, page, vma);
-	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
+	set_pte_at(vma->vm_mm, addr, pte, entry);
@@ -2011,7 +2039,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
  * pages reserved for the old functions anyway.
  */
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
-			struct page *page, pgprot_t prot)
+			struct page *page, pgprot_t prot, bool mkwrite)
 {
 	int retval;
 	pte_t *pte;
@@ -2024,7 +2052,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	pte = get_locked_pte(vma->vm_mm, addr, &ptl);
 	if (!pte)
 		goto out;
-	retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+	retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, mkwrite);
 	pte_unmap_unlock(pte, ptl);
 out:
 	return retval;
@@ -2040,7 +2068,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
 	err = validate_page_before_insert(page);
 	if (err)
 		return err;
-	return insert_page_into_pte_locked(vma, pte, addr, page, prot);
+	return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
 }
 
 /* insert_pages() amortizes the cost of spinlock operations
@@ -2177,7 +2205,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 		BUG_ON(vma->vm_flags & VM_PFNMAP);
 		vm_flags_set(vma, VM_MIXEDMAP);
 	}
-	return insert_page(vma, addr, page, vma->vm_page_prot);
+	return insert_page(vma, addr, page, vma->vm_page_prot, false);
 }
 EXPORT_SYMBOL(vm_insert_page);
 
@@ -2451,7 +2479,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 		 * result in pfn_t_has_page() == false.
 		 */
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
-		err = insert_page(vma, addr, page, pgprot);
+		err = insert_page(vma, addr, page, pgprot, mkwrite);
 	} else {
 		return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
 	}
@@ -2464,6 +2492,43 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	return VM_FAULT_NOPAGE;
 }
 
+vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
+		unsigned long addr, pfn_t pfn_t, bool write)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+	unsigned long pfn = pfn_t_to_pfn(pfn_t);
+	struct page *page = pfn_to_page(pfn);
+	int err;
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	track_pfn_insert(vma, &pgprot, pfn_t);
+
+	if (!pfn_modify_allowed(pfn, pgprot))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * We refcount the page normally so make sure pfn_valid is true.
+	 */
+	if (!pfn_t_valid(pfn_t))
+		return VM_FAULT_SIGBUS;
+
+	WARN_ON_ONCE(pfn_t_devmap(pfn_t));
+
+	if (WARN_ON(is_zero_pfn(pfn) && write))
+		return VM_FAULT_SIGBUS;
+
+	err = insert_page(vma, addr, page, pgprot, write);
+	if (err == -ENOMEM)
+		return VM_FAULT_OOM;
+	if (err < 0 && err != -EBUSY)
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn);
+
 vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		pfn_t pfn)
 {
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (5 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27 22:26   ` kernel test robot
  2024-07-02  7:16   ` David Hildenbrand
  2024-06-27  0:54 ` [PATCH 08/13] huge_memory: Allow mappings of PMD " Alistair Popple
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Currently DAX folio/page reference counts are managed differently to
those of normal pages. To allow these to be managed the same as normal
pages, introduce dax_insert_pfn_pud. This will map the entire PUD-sized
folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pud, which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/huge_mm.h |   4 ++-
 include/linux/rmap.h    |  14 +++++-
 mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++---
 mm/rmap.c               |  48 ++++++++++++++++++-
 4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2aa986a..b98a3cc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_UNSUPPORTED,
@@ -106,6 +107,9 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
 
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 extern unsigned long transparent_hugepage_flags;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7229b9b..c5a0205 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
 enum rmap_level {
 	RMAP_LEVEL_PTE = 0,
 	RMAP_LEVEL_PMD,
+	RMAP_LEVEL_PUD,
 };
 
 static inline void __folio_rmap_sanity_checks(struct folio *folio,
@@ -225,6 +226,13 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
 		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
 		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
 		break;
+	case RMAP_LEVEL_PUD:
+		/*
+		 * Assume that we are creating a single "entire" mapping of the folio.
+		 */
+		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
+		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
+		break;
 	default:
 		VM_WARN_ON_ONCE(true);
 	}
@@ -248,12 +256,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
 	folio_add_file_rmap_ptes(folio, page, 1, vma)
 void folio_add_file_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *);
+void folio_add_file_rmap_pud(struct folio *, struct page *,
+		struct vm_area_struct *);
 void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
 		struct vm_area_struct *);
 #define folio_remove_rmap_pte(folio, page, vma) \
 	folio_remove_rmap_ptes(folio, page, 1, vma)
 void folio_remove_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *);
+void folio_remove_rmap_pud(struct folio *, struct page *,
+		struct vm_area_struct *);
 
 void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
@@ -338,6 +350,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
 		atomic_add(orig_nr_pages, &folio->_large_mapcount);
 		break;
 	case RMAP_LEVEL_PMD:
+	case RMAP_LEVEL_PUD:
 		atomic_inc(&folio->_entire_mapcount);
 		atomic_inc(&folio->_large_mapcount);
 		break;
@@ -434,6 +447,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
 		atomic_add(orig_nr_pages, &folio->_large_mapcount);
 		break;
 	case RMAP_LEVEL_PMD:
+	case RMAP_LEVEL_PUD:
 		if (PageAnonExclusive(page)) {
 			if (unlikely(maybe_pinned))
 				return -EBUSY;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db7946a..e1f053e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1283,6 +1283,70 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+/**
+ * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
+ * @vmf: Structure describing the fault
+ * @pfn: pfn of the page to insert
+ * @write: whether it's a write fault
+ *
+ * Return: vm_fault_t value.
+ */
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long addr = vmf->address & PUD_MASK;
+	pud_t *pud = vmf->pud;
+	pgprot_t prot = vma->vm_page_prot;
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t entry;
+	spinlock_t *ptl;
+	struct folio *folio;
+	struct page *page;
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	track_pfn_insert(vma, &prot, pfn);
+
+	ptl = pud_lock(mm, pud);
+	if (!pud_none(*pud)) {
+		if (write) {
+			if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
+				WARN_ON_ONCE(!is_huge_zero_pud(*pud));
+				goto out_unlock;
+			}
+			entry = pud_mkyoung(*pud);
+			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
+				update_mmu_cache_pud(vma, addr, pud);
+		}
+		goto out_unlock;
+	}
+
+	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
+	if (pfn_t_devmap(pfn))
+		entry = pud_mkdevmap(entry);
+	if (write) {
+		entry = pud_mkyoung(pud_mkdirty(entry));
+		entry = maybe_pud_mkwrite(entry, vma);
+	}
+
+	page = pfn_t_to_page(pfn);
+	folio = page_folio(page);
+	folio_get(folio);
+	folio_add_file_rmap_pud(folio, page, vma);
+	add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
+
+	set_pud_at(mm, addr, pud, entry);
+	update_mmu_cache_pud(vma, addr, pud);
+
+out_unlock:
+	spin_unlock(ptl);
+
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1836,7 +1900,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		zap_deposited_table(tlb->mm, pmd);
+		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 	} else {
 		struct folio *folio = NULL;
@@ -2268,20 +2333,34 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
+	pud_t orig_pud;
 	spinlock_t *ptl;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
 
-	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
-	if (vma_is_special_huge(vma)) {
+	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
 	} else {
-		/* No support for anonymous PUD pages yet */
-		BUG();
+		struct page *page = NULL;
+		struct folio *folio;
+
+		/* No support for anonymous PUD pages or migration yet */
+		BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
+
+		page = pud_page(orig_pud);
+		folio = page_folio(page);
+		folio_remove_rmap_pud(folio, page, vma);
+		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+		VM_BUG_ON_PAGE(!PageHead(page), page);
+		add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+
+		spin_unlock(ptl);
+		tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
 	}
 	return 1;
 }
@@ -2289,6 +2368,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long haddr)
 {
+	pud_t old_pud;
+
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2296,7 +2377,22 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	count_vm_event(THP_SPLIT_PUD);
 
-	pudp_huge_clear_flush(vma, haddr, pud);
+	old_pud = pudp_huge_clear_flush(vma, haddr, pud);
+	if (is_huge_zero_pud(old_pud))
+		return;
+
+	if (vma_is_dax(vma)) {
+		struct page *page = pud_page(old_pud);
+		struct folio *folio = page_folio(page);
+
+		if (!folio_test_dirty(folio) && pud_dirty(old_pud))
+			folio_mark_dirty(folio);
+		if (!folio_test_referenced(folio) && pud_young(old_pud))
+			folio_set_referenced(folio);
+		folio_remove_rmap_pud(folio, page, vma);
+		folio_put(folio);
+		add_mm_counter(vma->vm_mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+	}
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ec..e949e4f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1165,6 +1165,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		atomic_add(orig_nr_pages, &folio->_large_mapcount);
 		break;
 	case RMAP_LEVEL_PMD:
+	case RMAP_LEVEL_PUD:
 		first = atomic_inc_and_test(&folio->_entire_mapcount);
 		if (first) {
 			nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
@@ -1306,6 +1307,12 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 		case RMAP_LEVEL_PMD:
 			SetPageAnonExclusive(page);
 			break;
+		case RMAP_LEVEL_PUD:
+			/*
+			 * Keep the compiler happy, we don't support anonymous PUD mappings.
+			 */
+			WARN_ON_ONCE(1);
+			break;
 		}
 	}
 	for (i = 0; i < nr_pages; i++) {
@@ -1489,6 +1496,26 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/**
+ * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
+ * @folio:	The folio to add the mapping to
+ * @page:	The first page to add
+ * @vma:	The vm area in which the mapping is added
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
+		struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	__folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+	WARN_ON_ONCE(true);
+#endif
+}
+
 static __always_inline void __folio_remove_rmap(struct folio *folio,
 		struct page *page, int nr_pages, struct vm_area_struct *vma,
 		enum rmap_level level)
@@ -1521,6 +1548,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		partially_mapped = nr && atomic_read(mapped);
 		break;
 	case RMAP_LEVEL_PMD:
+	case RMAP_LEVEL_PUD:
 		atomic_dec(&folio->_large_mapcount);
 		last = atomic_add_negative(-1, &folio->_entire_mapcount);
 		if (last) {
@@ -1615,6 +1643,26 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/**
+ * folio_remove_rmap_pud - remove a PUD mapping from a page range of a folio
+ * @folio:	The folio to remove the mapping from
+ * @page:	The first page to remove
+ * @vma:	The vm area from which the mapping is removed
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_remove_rmap_pud(struct folio *folio, struct page *page,
+		struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	__folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+	WARN_ON_ONCE(true);
+#endif
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 08/13] huge_memory: Allow mappings of PMD sized pages
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (6 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  0:54 ` [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Currently DAX folio/page reference counts are managed differently to
those of normal pages. To allow these to be managed the same as normal
pages, introduce dax_insert_pfn_pmd. This will map the entire PMD-sized
folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/huge_mm.h |  1 +-
 mm/huge_memory.c        | 70 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 71 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b98a3cc..9207d8e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
 enum transparent_hugepage_flag {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e1f053e..a9874ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1202,6 +1202,76 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long addr = vmf->address & PMD_MASK;
+	pmd_t *pmd = vmf->pmd;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t entry;
+	spinlock_t *ptl;
+	pgtable_t pgtable = NULL;
+	struct folio *folio;
+	struct page *page;
+
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable)
+			return VM_FAULT_OOM;
+	}
+
+	track_pfn_insert(vma, &vma->vm_page_prot, pfn);
+
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_none(*pmd)) {
+		if (write) {
+			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
+				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
+				goto out_unlock;
+			}
+			entry = pmd_mkyoung(*pmd);
+			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
+				update_mmu_cache_pmd(vma, addr, pmd);
+		}
+
+		goto out_unlock;
+	}
+
+	entry = pmd_mkhuge(pfn_t_pmd(pfn, vma->vm_page_prot));
+	if (pfn_t_devmap(pfn))
+		entry = pmd_mkdevmap(entry);
+	if (write) {
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+		entry = maybe_pmd_mkwrite(entry, vma);
+	}
+
+	if (pgtable) {
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		mm_inc_nr_ptes(mm);
+		pgtable = NULL;
+	}
+
+	page = pfn_t_to_page(pfn);
+	folio = page_folio(page);
+	folio_get(folio);
+	folio_add_file_rmap_pmd(folio, page, vma);
+	add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
+
+out_unlock:
+	spin_unlock(ptl);
+	if (pgtable)
+		pte_free(mm, pgtable);
+
+	return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
+
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 {
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (7 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 08/13] huge_memory: Allow mappings of PMD " Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-07-01  8:59   ` David Hildenbrand
  2024-06-27  0:54 ` [PATCH 10/13] fs/dax: Properly refcount fs dax pages Alistair Popple
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Longterm pinning of FS DAX pages should already be disallowed by
various pXX_devmap checks. However, a future change will make these
checks invalid for FS DAX pages, so have
folio_is_longterm_pinnable() return false for them.
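
The effect is that the generic pin path, rather than the devmap PTE
checks, refuses the pin; schematically (a simplified sketch, not the
exact gup.c logic):

  static bool keep_longterm_pin(struct folio *folio, unsigned int gup_flags)
  {
          if (!(gup_flags & FOLL_LONGTERM))
                  return true;
          /* FS DAX folios must stay evictable so the filesystem can
           * break layouts, hence never allow a longterm pin on them. */
          return folio_is_longterm_pinnable(folio);
  }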

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/memremap.h | 11 +++++++++++
 include/linux/mm.h       |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6505713..19a448e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -193,6 +193,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
 	return is_device_coherent_page(&folio->page);
 }
 
+static inline bool is_device_dax_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
+}
+
+static inline bool folio_is_device_dax(const struct folio *folio)
+{
+	return is_device_dax_page(&folio->page);
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84368b..4d1cdea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2032,6 +2032,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
 	if (folio_is_device_coherent(folio))
 		return false;
 
+	/* DAX must also always allow eviction. */
+	if (folio_is_device_dax(folio))
+		return false;
+
 	/* Otherwise, non-movable zone folios can be pinned. */
 	return !folio_is_zone_movable(folio);
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 10/13] fs/dax: Properly refcount fs dax pages
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (8 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  5:44   ` Christoph Hellwig
  2024-06-27  0:54 ` [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code Alistair Popple
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Currently fs dax pages are considered free when the refcount drops to
one, and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
avoid refcounting these pages in the normal way, and to detect when the
refcount drops to one instead of zero.

On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.

Tracking this special behaviour requires extra PTE bits
(e.g. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped, as is currently the case for normal pages.

This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.
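
As a rough sketch of the new model (simplified from the hunks below,
with error handling elided), the fault path takes a reference for the
duration of the insert and the page table then holds its own
reference, while the busy/truncate paths treat a refcount of zero as
idle:

  /* Fault path: pin the page across the insert. */
  page = pfn_t_to_page(pfn);
  page_ref_inc(page);
  ret = pmd ? dax_insert_pfn_pmd(vmf, pfn, write)
            : dax_insert_pfn(vmf->vma, vmf->address, pfn, write);
  put_page(page);               /* the page table now holds a reference */

  /* Busy/truncate path: the page is idle once the refcount hits zero. */
  if (page_ref_count(page))
          return page;          /* still mapped or pinned, caller must wait */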

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/dax/device.c       |  12 +-
 drivers/dax/super.c        |   2 +-
 drivers/nvdimm/pmem.c      |   8 +--
 fs/dax.c                   | 193 +++++++++++++++++---------------------
 fs/fuse/virtio_fs.c        |   3 +-
 include/linux/dax.h        |   4 +-
 include/linux/mm.h         |  25 +-----
 include/linux/page-flags.h |   6 +-
 mm/gup.c                   |   9 +--
 mm/huge_memory.c           |   6 +-
 mm/internal.h              |   2 +-
 mm/memory-failure.c        |   6 +-
 mm/memremap.c              |  24 +-----
 mm/mlock.c                 |   2 +-
 mm/mm_init.c               |   3 +-
 mm/swap.c                  |   2 +-
 16 files changed, 123 insertions(+), 184 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index eb61598..b7a31ae 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return dax_insert_pfn(vmf->vma, vmf->address, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -169,11 +169,11 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+	return dax_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -214,11 +214,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 		return VM_FAULT_SIGBUS;
 	}
 
-	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+	pfn = phys_to_pfn_t(phys, 0);
 
 	dax_set_mapping(vmf, pfn, fault_size);
 
-	return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+	return dax_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index aca71d7..d83196e 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -257,7 +257,7 @@ EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 {
-	if (unlikely(!dax_write_cache_enabled(dax_dev)))
+	if (unlikely(dax_dev && !dax_write_cache_enabled(dax_dev)))
 		return;
 
 	arch_wb_cache_pmem(addr, size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index cafadd0..da13dc1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -510,7 +510,7 @@ static int pmem_attach_disk(struct device *dev,
 
 	pmem->disk = disk;
 	pmem->pgmap.owner = pmem;
-	pmem->pfn_flags = PFN_DEV;
+	pmem->pfn_flags = 0;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -519,7 +519,7 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) -
 			range_len(&pmem->pgmap.range);
-		pmem->pfn_flags |= PFN_MAP;
+		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 		bb_range = pmem->pgmap.range;
 		bb_range.start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
@@ -529,7 +529,7 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
-		pmem->pfn_flags |= PFN_MAP;
+		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 		bb_range = pmem->pgmap.range;
 	} else {
 		addr = devm_memremap(dev, pmem->phys_addr,
@@ -547,8 +547,6 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_write_cache(q, true, fua);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
-	if (pmem->pfn_flags & PFN_MAP)
-		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
diff --git a/fs/dax.c b/fs/dax.c
index f93afd7..862af24 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
 	return xa_to_value(entry) >> DAX_SHIFT;
 }
 
+static struct folio *dax_to_folio(void *entry)
+{
+	return page_folio(pfn_to_page(dax_to_pfn(entry)));
+}
+
 static void *dax_make_entry(pfn_t pfn, unsigned long flags)
 {
 	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
@@ -318,85 +323,51 @@ static unsigned long dax_end_pfn(void *entry)
  */
 #define for_each_mapped_pfn(entry, pfn) \
 	for (pfn = dax_to_pfn(entry); \
-			pfn < dax_end_pfn(entry); pfn++)
+		pfn < dax_end_pfn(entry); pfn++)
 
-static inline bool dax_page_is_shared(struct page *page)
+static void dax_device_folio_init(struct folio *folio, int order)
 {
-	return page->mapping == PAGE_MAPPING_DAX_SHARED;
-}
-
-/*
- * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
- * refcount.
- */
-static inline void dax_page_share_get(struct page *page)
-{
-	if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
-		/*
-		 * Reset the index if the page was already mapped
-		 * regularly before.
-		 */
-		if (page->mapping)
-			page->share = 1;
-		page->mapping = PAGE_MAPPING_DAX_SHARED;
-	}
-	page->share++;
-}
-
-static inline unsigned long dax_page_share_put(struct page *page)
-{
-	return --page->share;
-}
+	int orig_order = folio_order(folio);
+	int i;
 
-/*
- * When it is called in dax_insert_entry(), the shared flag will indicate that
- * whether this entry is shared by multiple files.  If so, set the page->mapping
- * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
- */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address, bool shared)
-{
-	unsigned long size = dax_entry_size(entry), pfn, index;
-	int i = 0;
+	if (orig_order != order) {
+		struct dev_pagemap *pgmap = page_dev_pagemap(&folio->page);
 
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		for (i = 0; i < (1UL << orig_order); i++) {
+			ClearPageHead(folio_page(folio, i));
+			clear_compound_head(folio_page(folio, i));
 
-	index = linear_page_index(vma, address & ~(size - 1));
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
+			/* Reset pgmap which was over-written by prep_compound_page() */
+			folio_page(folio, i)->pgmap = pgmap;
+		}
+	}
 
-		if (shared) {
-			dax_page_share_get(page);
-		} else {
-			WARN_ON_ONCE(page->mapping);
-			page->mapping = mapping;
-			page->index = index + i++;
+	if (order > 0) {
+		prep_compound_page(&folio->page, order);
+		if (order > 1) {
+			VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
+			INIT_LIST_HEAD(&folio->_deferred_list);
 		}
 	}
 }
 
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-		bool trunc)
+static void dax_associate_new_entry(void *entry, struct address_space *mapping, pgoff_t index)
 {
-	unsigned long pfn;
+	unsigned long order = dax_entry_order(entry);
+	struct folio *folio = dax_to_folio(entry);
 
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+	if (!dax_entry_size(entry))
 		return;
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		if (dax_page_is_shared(page)) {
-			/* keep the shared flag if this page is still shared */
-			if (dax_page_share_put(page) > 0)
-				continue;
-		} else
-			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		page->mapping = NULL;
-		page->index = 0;
-	}
+	/*
+	 * We don't hold a reference for the DAX pagecache entry for the
+	 * page. But we need to initialise the folio so we can hand it
+	 * out. Nothing else should have a reference either.
+	 */
+	WARN_ON_ONCE(folio_ref_count(folio));
+	dax_device_folio_init(folio, order);
+	folio->mapping = mapping;
+	folio->index = index;
 }
 
 static struct page *dax_busy_page(void *entry)
@@ -406,7 +377,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (page_ref_count(page))
 			return page;
 	}
 	return NULL;
@@ -620,7 +591,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
 			xas_lock_irq(xas);
 		}
 
-		dax_disassociate_entry(entry, mapping, false);
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, WAKE_ALL);
 		mapping->nrpages -= PG_PMD_NR;
@@ -743,7 +713,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
-					  pgoff_t index, bool trunc)
+				  pgoff_t index, bool trunc)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
 	int ret = 0;
@@ -757,7 +727,6 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
 	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
 		goto out;
-	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrpages -= 1UL << dax_entry_order(entry);
 	ret = 1;
@@ -894,9 +863,11 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
-		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-				shared);
+		if (!shared) {
+			dax_associate_new_entry(new_entry, mapping,
+				linear_page_index(vmf->vma, vmf->address));
+		}
+
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -1084,9 +1055,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 		goto out;
 	if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
 		goto out;
-	/* For larger pages we need devmap */
-	if (length > 1 && !pfn_t_devmap(*pfnp))
-		goto out;
+
 	rc = 0;
 
 out_check_addr:
@@ -1189,11 +1158,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	struct inode *inode = iter->inode;
 	unsigned long vaddr = vmf->address;
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
+	struct page *page = pfn_t_to_page(pfn);
 	vm_fault_t ret;
 
 	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
 
-	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+	page_ref_inc(page);
+	ret = dax_insert_pfn(vmf->vma, vaddr, pfn, false);
+	put_page(page);
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
 }
@@ -1212,8 +1184,13 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pmd_t pmd_entry;
 	pfn_t pfn;
 
-	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable)
+			return VM_FAULT_OOM;
+	}
 
+	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
 	if (unlikely(!zero_folio))
 		goto fallback;
 
@@ -1221,29 +1198,25 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
 				  DAX_PMD | DAX_ZERO_PAGE);
 
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
 	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
-	if (!pmd_none(*(vmf->pmd))) {
-		spin_unlock(ptl);
-		goto fallback;
-	}
+	if (!pmd_none(*vmf->pmd))
+		goto fallback_unlock;
 
-	if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
-		mm_inc_nr_ptes(vma->vm_mm);
-	}
-	pmd_entry = mk_pmd(&zero_folio->page, vmf->vma->vm_page_prot);
+	pmd_entry = mk_pmd(&zero_folio->page, vma->vm_page_prot);
 	pmd_entry = pmd_mkhuge(pmd_entry);
-	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
+	if (pgtable) {
+		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		mm_inc_nr_ptes(vma->vm_mm);
+	}
+	set_pmd_at(vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
 	spin_unlock(ptl);
 	trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
 	return VM_FAULT_NOPAGE;
 
+fallback_unlock:
+	spin_unlock(ptl);
+	mm_put_huge_zero_folio(vma->vm_mm);
+
 fallback:
 	if (pgtable)
 		pte_free(vma->vm_mm, pgtable);
@@ -1649,9 +1620,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
-	int err = 0;
+	int ret, err = 0;
 	pfn_t pfn;
 	void *kaddr;
+	struct page *page;
 
 	if (!pmd && vmf->cow_page)
 		return dax_fault_cow_page(vmf, iter);
@@ -1684,14 +1656,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (dax_fault_is_synchronous(iter, vmf->vma))
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
-	/* insert PMD pfn */
+	page = pfn_t_to_page(pfn);
+	page_ref_inc(page);
+
 	if (pmd)
-		return vmf_insert_pfn_pmd(vmf, pfn, write);
+		ret = dax_insert_pfn_pmd(vmf, pfn, write);
+	else
+		ret = dax_insert_pfn(vmf->vma, vmf->address, pfn, write);
+
+	/*
+	 * Insert PMD/PTE will have a reference on the page when mapping it so drop
+	 * ours.
+	 */
+	put_page(page);
 
-	/* insert PTE pfn */
-	if (write)
-		return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
-	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+	return ret;
 }
 
 static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
@@ -1932,6 +1911,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
 	void *entry;
 	vm_fault_t ret;
+	struct page *page;
 
 	xas_lock_irq(&xas);
 	entry = get_unlocked_entry(&xas, order);
@@ -1947,14 +1927,17 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
 	dax_lock_entry(&xas, entry);
 	xas_unlock_irq(&xas);
+	page = pfn_t_to_page(pfn);
+	page_ref_inc(page);
 	if (order == 0)
-		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+		ret = dax_insert_pfn(vmf->vma, vmf->address, pfn, true);
 #ifdef CONFIG_FS_DAX_PMD
 	else if (order == PMD_ORDER)
-		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+		ret = dax_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
 		ret = VM_FAULT_FALLBACK;
+	put_page(page);
 	dax_unlock_entry(&xas, entry);
 	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
 	return ret;
@@ -2068,6 +2051,12 @@ EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
 
 void dax_page_free(struct page *page)
 {
+	/*
+	 * Make sure we flush any cached data to the page now that it's free.
+	 */
+	if (PageDirty(page))
+		dax_flush(NULL, page_address(page), page_size(page));
+
 	wake_up_var(page);
 }
 EXPORT_SYMBOL_GPL(dax_page_free);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 6e90a4b..4462ff6 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -873,8 +873,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 	if (kaddr)
 		*kaddr = fs->window_kaddr + offset;
 	if (pfn)
-		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
-					PFN_DEV | PFN_MAP);
+		*pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
 	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index adbafc8..02dc580 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -218,7 +218,7 @@ static inline int dax_wait_page_idle(struct page *page,
 				void (cb)(struct inode *),
 				struct inode *inode)
 {
-	return ___wait_var_event(page, page_ref_count(page) == 1,
+	return ___wait_var_event(page, !page_ref_count(page),
 				TASK_INTERRUPTIBLE, 0, 0, cb(inode));
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4d1cdea..47d8923 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1440,25 +1440,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf);
  *   back into memory.
  */
 
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs);
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!folio_is_zone_device(folio))
-		return false;
-	return __put_devmap_managed_folio_refs(folio, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
-	return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
 /* 127: arbitrary random number, small enough to assemble well */
 #define folio_ref_zero_or_close_to_overflow(folio) \
 	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1573,12 +1554,6 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
-	/*
-	 * For some devmap managed pages we need to catch refcount transition
-	 * from 2 to 1:
-	 */
-	if (put_devmap_managed_folio_refs(folio, 1))
-		return;
 	folio_put(folio);
 }
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 104078a..72c48af 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -682,12 +682,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
 #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 #define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
-/*
- * Different with flags above, this flag is used only for fsdax mode.  It
- * indicates that this page->mapping is now under reflink case.
- */
-#define PAGE_MAPPING_DAX_SHARED	((void *)0x1)
-
 static __always_inline bool folio_mapping_flags(const struct folio *folio)
 {
 	return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0;
diff --git a/mm/gup.c b/mm/gup.c
index 669583e..ce80ff6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -89,8 +89,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	 * belongs to this folio.
 	 */
 	if (unlikely(page_folio(page) != folio)) {
-		if (!put_devmap_managed_folio_refs(folio, refs))
-			folio_put_refs(folio, refs);
+		folio_put_refs(folio, refs);
 		goto retry;
 	}
 
@@ -156,8 +155,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	 */
 	if (unlikely((flags & FOLL_LONGTERM) &&
 		     !folio_is_longterm_pinnable(folio))) {
-		if (!put_devmap_managed_folio_refs(folio, refs))
-			folio_put_refs(folio, refs);
+		folio_put_refs(folio, refs);
 		return NULL;
 	}
 
@@ -198,8 +196,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	if (!put_devmap_managed_folio_refs(folio, refs))
-		folio_put_refs(folio, refs);
+	folio_put_refs(folio, refs);
 }
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a9874ac..5191f91 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1965,7 +1965,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 						tlb->fullmm);
 	arch_check_zapped_pmd(vma, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-	if (vma_is_special_huge(vma)) {
+	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
@@ -2557,13 +2557,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(mm, pmd);
-		if (vma_is_special_huge(vma))
+		if (!vma_is_dax(vma) && vma_is_special_huge(vma))
 			return;
 		if (unlikely(is_pmd_migration_entry(old_pmd))) {
 			swp_entry_t entry;
 
 			entry = pmd_to_swp_entry(old_pmd);
 			folio = pfn_swap_entry_folio(entry);
+		} else if (is_huge_zero_pmd(old_pmd)) {
+			return;
 		} else {
 			page = pmd_page(old_pmd);
 			folio = page_folio(page);
diff --git a/mm/internal.h b/mm/internal.h
index c72c306..b07e70e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -637,8 +637,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
 	set_page_private(p, 0);
 }
 
-extern void prep_compound_page(struct page *page, unsigned int order);
-
 extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern bool free_pages_prepare(struct page *page, unsigned int order);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d3c830e..47491ef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -411,18 +411,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, address);
 	if (!pud_present(*pud))
 		return 0;
-	if (pud_devmap(*pud))
+	if (pud_trans_huge(*pud))
 		return PUD_SHIFT;
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return 0;
-	if (pmd_devmap(*pmd))
+	if (pmd_trans_huge(*pmd))
 		return PMD_SHIFT;
 	pte = pte_offset_map(pmd, address);
 	if (!pte)
 		return 0;
 	ptent = ptep_get(pte);
-	if (pte_present(ptent) && pte_devmap(ptent))
+	if (pte_present(ptent))
 		ret = PAGE_SHIFT;
 	pte_unmap(pte);
 	return ret;
diff --git a/mm/memremap.c b/mm/memremap.c
index 13c1d5b..2476aad 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -485,18 +485,20 @@ void free_zone_device_folio(struct folio *folio)
 	 * handled differently or not done at all, so there is no need
 	 * to clear folio->mapping.
 	 */
-	folio->mapping = NULL;
 	page_dev_pagemap(&folio->page)->ops->page_free(folio_page(folio, 0));
 
 	if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
 	    folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
 		put_dev_pagemap(folio->page.pgmap);
-	else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
+	else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA &&
+		 folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
 		/*
 		 * Reset the refcount to 1 to prepare for handing out the page
 		 * again.
 		 */
 		folio_set_count(folio, 1);
+
+	folio->mapping = NULL;
 }
 
 void zone_device_page_init(struct page *page)
@@ -510,21 +512,3 @@ void zone_device_page_init(struct page *page)
 	lock_page(page);
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
-	if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
-		return false;
-
-	/*
-	 * fsdax page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (folio_ref_sub_return(folio, refs) == 1)
-		wake_up_var(&folio->_refcount);
-	return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/mlock.c b/mm/mlock.c
index 30b51cd..03fa9e9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -373,6 +373,6 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long start = addr;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = vma_is_dax(vma) ? NULL : pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (!pmd_present(*pmd))
 			goto out;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index b7e1599..f11ee0d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1016,7 +1016,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
 	    pgmap->type == MEMORY_DEVICE_COHERENT ||
-	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
+	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA ||
+	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		set_page_count(page, 0);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 67786cb..041cda6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -983,8 +983,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_folio_refs(folio, nr_refs))
-				continue;
 			if (folio_ref_sub_and_test(folio, nr_refs))
 				free_zone_device_folio(folio);
 			continue;
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (9 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 10/13] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-07-05 14:24   ` Peter Xu
  2024-06-27  0:54 ` [PATCH 12/13] mm: Remove pXX_devmap callers Alistair Popple
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Now that DAX is managing page reference counts the same way as normal
pages, there are no remaining callers of the vmf_insert_pXd functions,
so remove them.
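
For context, the only remaining users were the DAX fault paths, which
an earlier patch in this series switched over along these lines:

  -	return vmf_insert_pfn_pmd(vmf, pfn, write);
  +	return dax_insert_pfn_pmd(vmf, pfn, write);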

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/huge_mm.h |   2 +-
 mm/huge_memory.c        | 165 +-----------------------------------------
 2 files changed, 167 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9207d8e..0fb6bff 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
 		    unsigned long cp_flags);
 
-vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
-vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5191f91..de39af4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1111,97 +1111,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
 }
 
-static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
-		pgtable_t pgtable)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pmd_t entry;
-	spinlock_t *ptl;
-
-	ptl = pmd_lock(mm, pmd);
-	if (!pmd_none(*pmd)) {
-		if (write) {
-			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
-				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-				goto out_unlock;
-			}
-			entry = pmd_mkyoung(*pmd);
-			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
-				update_mmu_cache_pmd(vma, addr, pmd);
-		}
-
-		goto out_unlock;
-	}
-
-	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
-	if (pfn_t_devmap(pfn))
-		entry = pmd_mkdevmap(entry);
-	if (write) {
-		entry = pmd_mkyoung(pmd_mkdirty(entry));
-		entry = maybe_pmd_mkwrite(entry, vma);
-	}
-
-	if (pgtable) {
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
-		mm_inc_nr_ptes(mm);
-		pgtable = NULL;
-	}
-
-	set_pmd_at(mm, addr, pmd, entry);
-	update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
-	spin_unlock(ptl);
-	if (pgtable)
-		pte_free(mm, pgtable);
-}
-
-/**
- * vmf_insert_pfn_pmd - insert a pmd size pfn
- * @vmf: Structure describing the fault
- * @pfn: pfn to insert
- * @write: whether it's a write fault
- *
- * Insert a pmd size pfn. See vmf_insert_pfn() for additional info.
- *
- * Return: vm_fault_t value.
- */
-vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
-{
-	unsigned long addr = vmf->address & PMD_MASK;
-	struct vm_area_struct *vma = vmf->vma;
-	pgprot_t pgprot = vma->vm_page_prot;
-	pgtable_t pgtable = NULL;
-
-	/*
-	 * If we had pmd_special, we could avoid all these restrictions,
-	 * but we need to be consistent with PTEs and architectures that
-	 * can't support a 'special' bit.
-	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
-	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
-						(VM_PFNMAP|VM_MIXEDMAP));
-	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
-	track_pfn_insert(vma, &pgprot, pfn);
-
-	insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
-	return VM_FAULT_NOPAGE;
-}
-EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
-
 vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -1280,80 +1189,6 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 	return pud;
 }
 
-static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, pfn_t pfn, bool write)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgprot_t prot = vma->vm_page_prot;
-	pud_t entry;
-	spinlock_t *ptl;
-
-	ptl = pud_lock(mm, pud);
-	if (!pud_none(*pud)) {
-		if (write) {
-			if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
-				WARN_ON_ONCE(!is_huge_zero_pud(*pud));
-				goto out_unlock;
-			}
-			entry = pud_mkyoung(*pud);
-			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
-			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
-				update_mmu_cache_pud(vma, addr, pud);
-		}
-		goto out_unlock;
-	}
-
-	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
-	if (pfn_t_devmap(pfn))
-		entry = pud_mkdevmap(entry);
-	if (write) {
-		entry = pud_mkyoung(pud_mkdirty(entry));
-		entry = maybe_pud_mkwrite(entry, vma);
-	}
-	set_pud_at(mm, addr, pud, entry);
-	update_mmu_cache_pud(vma, addr, pud);
-
-out_unlock:
-	spin_unlock(ptl);
-}
-
-/**
- * vmf_insert_pfn_pud - insert a pud size pfn
- * @vmf: Structure describing the fault
- * @pfn: pfn to insert
- * @write: whether it's a write fault
- *
- * Insert a pud size pfn. See vmf_insert_pfn() for additional info.
- *
- * Return: vm_fault_t value.
- */
-vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
-{
-	unsigned long addr = vmf->address & PUD_MASK;
-	struct vm_area_struct *vma = vmf->vma;
-	pgprot_t pgprot = vma->vm_page_prot;
-
-	/*
-	 * If we had pud_special, we could avoid all these restrictions,
-	 * but we need to be consistent with PTEs and architectures that
-	 * can't support a 'special' bit.
-	 */
-	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
-	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
-						(VM_PFNMAP|VM_MIXEDMAP));
-	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
-	track_pfn_insert(vma, &pgprot, pfn);
-
-	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
-	return VM_FAULT_NOPAGE;
-}
-EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
-
 /**
  * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
  * @vmf: Structure describing the fault
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 12/13] mm: Remove pXX_devmap callers
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (10 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

The devmap PTE special bit was used to detect mappings of FS DAX
pages. This tracking was required to ensure the generic mm did not
manipulate the page reference counts as FS DAX implemented its own
reference counting scheme.

Now that FS DAX pages have their references counted the same way as
normal pages this tracking is no longer needed and can be
removed.

Almost all existing uses of pmd_devmap() are paired with a check of
pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
dropping the check in these cases doesn't change anything.

However care needs to be taken in places that relied on pXX_devmap() to
distinguish FS DAX pages from regular transparent huge pages, because
pmd_trans_huge() no longer makes that distinction. This is dealt with
either by checking vma_is_dax() or by relying on the fact that the page
pointer was obtained
from a page list. This is possible because zone device pages cannot
appear in any page list due to sharing page->lru with page->pgmap.
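
The bulk of the conversion is mechanical and follows one of two
patterns, both of which appear in the diff below: dropping the now
redundant pXd_devmap() half of a combined check, or substituting a
vma_is_dax() test where the devmap bit was the only way to identify a
DAX mapping, e.g.:

  -	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
  +	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))

  -	if (pud_leaf(pud) && pud_devmap(pud)) {
  +	if (pud_leaf(pud) && vma_is_dax(walk->vma)) {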

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 arch/powerpc/mm/book3s64/hash_pgtable.c  |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c       |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |   5 +-
 arch/powerpc/mm/pgtable.c                |   2 +-
 fs/dax.c                                 |   5 +-
 fs/userfaultfd.c                         |   2 +-
 include/linux/huge_mm.h                  |  10 +-
 include/linux/pgtable.h                  |   2 +-
 mm/gup.c                                 | 164 +-----------------------
 mm/hmm.c                                 |   7 +-
 mm/huge_memory.c                         |  61 +--------
 mm/khugepaged.c                          |   2 +-
 mm/mapping_dirty_helpers.c               |   4 +-
 mm/memory.c                              |  37 +----
 mm/migrate_device.c                      |   2 +-
 mm/mprotect.c                            |   2 +-
 mm/mremap.c                              |   5 +-
 mm/page_vma_mapped.c                     |   5 +-
 mm/pgtable-generic.c                     |   7 +-
 mm/userfaultfd.c                         |   2 +-
 mm/vmscan.c                              |   5 +-
 21 files changed, 53 insertions(+), 287 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d..82d3117 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr
 	unsigned long old;
 
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	WARN_ON(!hash__pmd_trans_huge(*pmdp));
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	VM_BUG_ON(pmd_trans_huge(*pmdp));
-	VM_BUG_ON(pmd_devmap(*pmdp));
 
 	pmd = *pmdp;
 	pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 2975ea0..65dd1fe 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -50,7 +50,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 {
 	int changed;
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	WARN_ON(!pmd_trans_huge(*pmdp));
 	assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp));
 #endif
 	changed = !pmd_same(*(pmdp), entry);
@@ -70,7 +70,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 {
 	int changed;
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!pud_devmap(*pudp));
 	assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
 #endif
 	changed = !pud_same(*(pudp), entry);
@@ -182,7 +181,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 	pmd_t pmd;
 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
-		   !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
+	VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp)) ||
+		  !pmd_present(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
 	/*
 	 * if it not a fullmm flush, then we can possibly end up converting
@@ -200,8 +199,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
 	pud_t pud;
 
 	VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-	VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
-		  !pud_present(*pudp));
+	VM_BUG_ON(!pud_present(*pudp));
 	pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
 	/*
 	 * if it not a fullmm flush, then we can possibly end up converting
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15e88f1..1c195bc 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1348,7 +1348,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
 	unsigned long old;
 
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!radix__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	WARN_ON(!radix__pmd_trans_huge(*pmdp));
 	assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
@@ -1365,7 +1365,7 @@ unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long add
 	unsigned long old;
 
 #ifdef CONFIG_DEBUG_VM
-	WARN_ON(!pud_devmap(*pudp));
+	WARN_ON(!pud_trans_huge(*pudp));
 	assert_spin_locked(pud_lockptr(mm, pudp));
 #endif
 
@@ -1383,7 +1383,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	VM_BUG_ON(radix__pmd_trans_huge(*pmdp));
-	VM_BUG_ON(pmd_devmap(*pmdp));
 	/*
 	 * khugepaged calls this for normal pmd
 	 */
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c..11d3b40 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -464,7 +464,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 		return NULL;
 #endif
 
-	if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+	if (pmd_trans_huge(pmd)) {
 		if (is_thp)
 			*is_thp = true;
 		ret_pte = (pte_t *)pmdp;
diff --git a/fs/dax.c b/fs/dax.c
index 862af24..7edd18c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1714,7 +1714,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * the PTE we need to set up.  If so just return and the fault will be
 	 * retried.
 	 */
-	if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
+	if (pmd_trans_huge(*vmf->pmd)) {
 		ret = VM_FAULT_NOPAGE;
 		goto unlock_entry;
 	}
@@ -1835,8 +1835,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * the PMD we need to set up.  If so just return and the fault will be
 	 * retried.
 	 */
-	if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
-			!pmd_devmap(*vmf->pmd)) {
+	if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd)) {
 		ret = 0;
 		goto unlock_entry;
 	}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index eee7320..094401f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -319,7 +319,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	ret = false;
-	if (!pmd_present(_pmd) || pmd_devmap(_pmd))
+	if (!pmd_present(_pmd) || vma_is_dax(vmf->vma))
 		goto out;
 
 	if (pmd_trans_huge(_pmd)) {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0fb6bff..eb3f444 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -322,8 +322,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
-					|| pmd_devmap(*____pmd))	\
+		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd))	\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
 	}  while (0)
@@ -338,8 +337,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
-		if (pud_trans_huge(*____pud)				\
-					|| pud_devmap(*____pud))	\
+		if (pud_trans_huge(*____pud))				\
 			__split_huge_pud(__vma, __pud, __address);	\
 	}  while (0)
 
@@ -362,7 +360,7 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
@@ -370,7 +368,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma)
 {
-	if (pud_trans_huge(*pud) || pud_devmap(*pud))
+	if (pud_trans_huge(*pud))
 		return __pud_trans_huge_lock(pud, vma);
 	else
 		return NULL;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 18019f0..91e06bb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1620,7 +1620,7 @@ static inline int pud_trans_unstable(pud_t *pud)
 	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	pud_t pudval = READ_ONCE(*pud);
 
-	if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+	if (pud_none(pudval) || pud_trans_huge(pudval))
 		return 1;
 	if (unlikely(pud_bad(pudval))) {
 		pud_clear_bad(pud);
diff --git a/mm/gup.c b/mm/gup.c
index ce80ff6..88fb92b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,31 +699,9 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 		return NULL;
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-
-	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
-	    pud_devmap(pud)) {
-		/*
-		 * device mapped pages can only be returned if the caller
-		 * will manage the page reference count.
-		 *
-		 * At least one of FOLL_GET | FOLL_PIN must be set, so
-		 * assert that here:
-		 */
-		if (!(flags & (FOLL_GET | FOLL_PIN)))
-			return ERR_PTR(-EEXIST);
-
-		if (flags & FOLL_TOUCH)
-			touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
-
-		ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
-		if (!ctx->pgmap)
-			return ERR_PTR(-EFAULT);
-	}
-
 	page = pfn_to_page(pfn);
 
-	if (!pud_devmap(pud) && !pud_write(pud) &&
-	    gup_must_unshare(vma, flags, page))
+	if (!pud_write(pud) && gup_must_unshare(vma, flags, page))
 		return ERR_PTR(-EMLINK);
 
 	ret = try_grab_page(page, flags);
@@ -921,8 +899,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	page = vm_normal_page(vma, address, pte);
 
 	/*
-	 * We only care about anon pages in can_follow_write_pte() and don't
-	 * have to worry about pte_devmap() because they are never anon.
+	 * We only care about anon pages in can_follow_write_pte().
 	 */
 	if ((flags & FOLL_WRITE) &&
 	    !can_follow_write_pte(pte, page, vma, flags)) {
@@ -930,18 +907,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
-		/*
-		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
-		 * case since they are only valid while holding the pgmap
-		 * reference.
-		 */
-		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-		if (*pgmap)
-			page = pte_page(pte);
-		else
-			goto no_page;
-	} else if (unlikely(!page)) {
+	if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
 			page = ERR_PTR(-EFAULT);
@@ -1025,14 +991,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval)))))
 		return follow_hugepd(vma, __hugepd(pmd_val(pmdval)),
 				     address, PMD_SHIFT, flags, ctx);
-	if (pmd_devmap(pmdval)) {
-		ptl = pmd_lock(mm, pmd);
-		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
-		spin_unlock(ptl);
-		if (page)
-			return page;
-		return no_page_table(vma, flags, address);
-	}
 	if (likely(!pmd_leaf(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
@@ -2920,7 +2878,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		int *nr)
 {
 	struct dev_pagemap *pgmap = NULL;
-	int nr_start = *nr, ret = 0;
+	int ret = 0;
 	pte_t *ptep, *ptem;
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
@@ -2944,16 +2902,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 		if (!pte_access_permitted(pte, flags & FOLL_WRITE))
 			goto pte_unmap;
 
-		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
-				gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-				goto pte_unmap;
-			}
-		} else if (pte_special(pte))
+		if (pte_special(pte))
 			goto pte_unmap;
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -3024,91 +2973,6 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
-	unsigned long end, unsigned int flags, struct page **pages, int *nr)
-{
-	int nr_start = *nr;
-	struct dev_pagemap *pgmap = NULL;
-
-	do {
-		struct folio *folio;
-		struct page *page = pfn_to_page(pfn);
-
-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
-			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-			break;
-		}
-
-		folio = try_grab_folio(page, 1, flags);
-		if (!folio) {
-			gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-			break;
-		}
-		folio_set_referenced(folio);
-		pages[*nr] = page;
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
-
-	put_dev_pagemap(pgmap);
-	return addr == end;
-}
-
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages,
-		int *nr)
-{
-	unsigned long fault_pfn;
-	int nr_start = *nr;
-
-	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
-		return 0;
-
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-		return 0;
-	}
-	return 1;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages,
-		int *nr)
-{
-	unsigned long fault_pfn;
-	int nr_start = *nr;
-
-	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
-		return 0;
-
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-		return 0;
-	}
-	return 1;
-}
-#else
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages,
-		int *nr)
-{
-	BUILD_BUG();
-	return 0;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages,
-		int *nr)
-{
-	BUILD_BUG();
-	return 0;
-}
-#endif
-
 static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, unsigned int flags, struct page **pages,
 		int *nr)
@@ -3120,13 +2984,7 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
-		return gup_fast_devmap_pmd_leaf(orig, pmdp, addr, end, flags,
-					        pages, nr);
-	}
-
+	/* FOLL_LONGTERM pins of FS DAX pages are rejected by try_grab_folio() */
 	page = pmd_page(orig);
 	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
@@ -3164,13 +3022,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
-		return gup_fast_devmap_pud_leaf(orig, pudp, addr, end, flags,
-					        pages, nr);
-	}
-
+	/* FOLL_LONGTERM pins of FS DAX pages are rejected by try_grab_folio() */
 	page = pud_page(orig);
 	refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
@@ -3209,8 +3061,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
-	BUILD_BUG_ON(pgd_devmap(orig));
-
 	page = pgd_page(orig);
 	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 26e1905..7f78b0b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -298,7 +298,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	 * fall through and treat it like a normal page.
 	 */
 	if (!vm_normal_page(walk->vma, addr, pte) &&
-	    !pte_devmap(pte) &&
 	    !is_zero_pfn(pte_pfn(pte))) {
 		if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
 			pte_unmap(ptep);
@@ -351,7 +350,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+	if (pmd_trans_huge(pmd)) {
 		/*
 		 * No need to take pmd_lock here, even if some other thread
 		 * is splitting the huge pmd we will get that event through
@@ -362,7 +361,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		 * values.
 		 */
 		pmd = pmdp_get_lockless(pmdp);
-		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+		if (!pmd_trans_huge(pmd))
 			goto again;
 
 		return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd);
@@ -429,7 +428,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 		return hmm_vma_walk_hole(start, end, -1, walk);
 	}
 
-	if (pud_leaf(pud) && pud_devmap(pud)) {
+	if (pud_leaf(pud) && vma_is_dax(walk->vma)) {
 		unsigned long i, npages, pfn;
 		unsigned int required_fault;
 		unsigned long *hmm_pfns;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de39af4..2e164c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1151,8 +1151,6 @@ vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 	}
 
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, vma->vm_page_prot));
-	if (pfn_t_devmap(pfn))
-		entry = pmd_mkdevmap(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -1230,8 +1228,6 @@ vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	}
 
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
-	if (pfn_t_devmap(pfn))
-		entry = pud_mkdevmap(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -1267,46 +1263,6 @@ void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pmd(vma, addr, pmd);
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pmd_pfn(*pmd);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int ret;
-
-	assert_spin_locked(pmd_lockptr(mm, pmd));
-
-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
-		return NULL;
-
-	if (pmd_present(*pmd) && pmd_devmap(*pmd))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-	ret = try_grab_page(page, flags);
-	if (ret)
-		page = ERR_PTR(ret);
-
-	return page;
-}
-
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1438,7 +1394,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pud = *src_pud;
-	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+	if (unlikely(!pud_trans_huge(pud)))
 		goto out_unlock;
 
 	/*
@@ -2210,8 +2166,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -2228,7 +2183,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+	if (likely(pud_trans_huge(*pud)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -2278,7 +2233,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
-	VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
+	VM_BUG_ON(!pud_trans_huge(*pud));
 
 	count_vm_event(THP_SPLIT_PUD);
 
@@ -2311,7 +2266,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 				(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+	if (unlikely(!pud_trans_huge(*pud)))
 		goto out;
 	__split_huge_pud_locked(vma, pud, range.start);
 
@@ -2379,8 +2334,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
-				&& !pmd_devmap(*pmd));
+	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -2603,8 +2557,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	VM_BUG_ON(freeze && !folio);
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
 
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
+	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd)) {
 		/*
 		 * It's safe to call pmd_page when folio is set because it's
 		 * guaranteed that pmd is present.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 774a97e..d4996ca 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -942,8 +942,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
-	if (pmd_devmap(pmde))
-		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b..208b428 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pmd_t pmdval = pmdp_get_lockless(pmd);
 
 	/* Do not split a huge pmd, present or migrated */
-	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+	if (pmd_trans_huge(pmdval)) {
 		WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
 		walk->action = ACTION_CONTINUE;
 	}
@@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
 	pud_t pudval = READ_ONCE(*pud);
 
 	/* Do not split a huge pud */
-	if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
+	if (pud_trans_huge(pudval)) {
 		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
 		walk->action = ACTION_CONTINUE;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 4f26a1f..bd80198 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -595,16 +595,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
-		if (pte_devmap(pte))
-		/*
-		 * NOTE: New users of ZONE_DEVICE will not set pte_devmap()
-		 * and will have refcounts incremented on their struct pages
-		 * when they are inserted into PTEs, thus they are safe to
-		 * return here. Legacy ZONE_DEVICE pages that set pte_devmap()
-		 * do not have refcounts. Example of legacy ZONE_DEVICE is
-		 * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers.
-		 */
-			return NULL;
 
 		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
@@ -680,8 +670,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
-	if (pmd_devmap(pmd))
-		return NULL;
 	if (is_huge_zero_pmd(pmd))
 		return NULL;
 	if (unlikely(pfn > highest_memmap_pfn))
@@ -1223,8 +1211,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1260,7 +1247,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pud = pud_offset(src_p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+		if (pud_trans_huge(*src_pud)) {
 			int err;
 
 			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1698,7 +1685,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
@@ -1740,7 +1727,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+		if (pud_trans_huge(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
 				split_huge_pud(vma, pud, addr);
@@ -2326,10 +2313,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	/* Ok, finally just insert the thing.. */
-	if (pfn_t_devmap(pfn))
-		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
-	else
-		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+	entry = pte_mkspecial(pfn_t_pte(pfn, prot));
 
 	if (mkwrite) {
 		entry = pte_mkyoung(entry);
@@ -2437,8 +2421,6 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 	/* these checks mirror the abort conditions in vm_normal_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
 		return true;
-	if (pfn_t_devmap(pfn))
-		return true;
 	if (pfn_t_special(pfn))
 		return true;
 	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
@@ -2469,8 +2451,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 	 * without pte special, it would there be refcounted as a normal page.
 	 */
-	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
-	    !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_t_valid(pfn)) {
 		struct page *page;
 
 		/*
@@ -2514,8 +2495,6 @@ vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
 	if (!pfn_t_valid(pfn_t))
 		return VM_FAULT_SIGBUS;
 
-	WARN_ON_ONCE(pfn_t_devmap(pfn_t));
-
 	if (WARN_ON(is_zero_pfn(pfn) && write))
 		return VM_FAULT_SIGBUS;
 
@@ -5528,7 +5507,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		pud_t orig_pud = *vmf.pud;
 
 		barrier();
-		if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
+		if (pud_trans_huge(orig_pud)) {
 
 			/*
 			 * TODO once we support anonymous PUDs: NUMA case and
@@ -5569,7 +5548,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
-		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+		if (pmd_trans_huge(vmf.orig_pmd)) {
 			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf);
 
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 4fdd8fa..4277516 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -596,7 +596,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmdp = pmd_alloc(mm, pudp, addr);
 	if (!pmdp)
 		goto abort;
-	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+	if (pmd_trans_huge(*pmdp))
 		goto abort;
 	if (pte_alloc(mm, pmdp))
 		goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd88..c717c74 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -391,7 +391,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		}
 
 		_pmd = pmdp_get_lockless(pmd);
-		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
+		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
diff --git a/mm/mremap.c b/mm/mremap.c
index 5f96bc5..57bb0b9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -587,7 +587,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
 		if (!new_pud)
 			break;
-		if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+		if (pud_trans_huge(*old_pud)) {
 			if (extent == HPAGE_PUD_SIZE) {
 				move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
 					       old_pud, new_pud, need_rmap_locks);
@@ -609,8 +609,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		if (!new_pmd)
 			break;
 again:
-		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
-		    pmd_devmap(*old_pmd)) {
+		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
 					   old_pmd, new_pmd, need_rmap_locks))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42..77da636 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -235,8 +235,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		 */
 		pmde = pmdp_get_lockless(pvmw->pmd);
 
-		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
-		    (pmd_present(pmde) && pmd_devmap(pmde))) {
+		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 			pmde = *pvmw->pmd;
 			if (!pmd_present(pmde)) {
@@ -251,7 +250,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					return not_found(pvmw);
 				return true;
 			}
-			if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) {
+			if (likely(pmd_trans_huge(pmde))) {
 				if (pvmw->flags & PVMW_MIGRATION)
 					return not_found(pvmw);
 				if (!check_pmd(pmd_pfn(pmde), pvmw))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4ad..093c435 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
-			   !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
@@ -153,7 +152,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	pud_t pud;
 
 	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
-	VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+	VM_BUG_ON(!pud_trans_huge(*pudp));
 	pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
 	return pud;
@@ -293,7 +292,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
 		goto nomap;
-	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+	if (unlikely(pmd_trans_huge(pmdval)))
 		goto nomap;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index defa510..dfd95e0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1685,7 +1685,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 
 		ptl = pmd_trans_huge_lock(src_pmd, src_vma);
 		if (ptl) {
-			if (pmd_devmap(*src_pmd)) {
+			if (vma_is_dax(src_vma)) {
 				spin_unlock(ptl);
 				err = -ENOENT;
 				break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9..e8badb4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3285,7 +3285,7 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
 	if (!pte_present(pte) || is_zero_pfn(pfn))
 		return -1;
 
-	if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+	if (WARN_ON_ONCE(pte_special(pte)))
 		return -1;
 
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
@@ -3303,9 +3303,6 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 	if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
 		return -1;
 
-	if (WARN_ON_ONCE(pmd_devmap(pmd)))
-		return -1;
-
 	if (WARN_ON_ONCE(!pfn_valid(pfn)))
 		return -1;
 
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 13/13] mm: Remove devmap related functions and page table bits
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (11 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 12/13] mm: Remove pXX_devmap callers Alistair Popple
@ 2024-06-27  0:54 ` Alistair Popple
  2024-06-27 23:04   ` kernel test robot
                     ` (2 more replies)
  2024-06-27  6:58 ` [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Dan Williams
  2024-07-01  4:24 ` Dave Chinner
  14 siblings, 3 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  0:54 UTC (permalink / raw)
  To: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Now that DAX and all other reference counts for ZONE_DEVICE pages are
managed normally, there is no need for the special devmap PTE/PMD/PUD
page table bits. So drop all references to these, freeing up a
software-defined page table bit on the architectures that support it.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 Documentation/mm/arch_pgtable_helpers.rst     |  6 +--
 arch/arm64/Kconfig                            |  1 +-
 arch/arm64/include/asm/pgtable-prot.h         |  1 +-
 arch/arm64/include/asm/pgtable.h              | 24 +--------
 arch/powerpc/Kconfig                          |  1 +-
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  6 +--
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  7 +--
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 52 +------------------
 arch/powerpc/include/asm/book3s/64/radix.h    | 14 +-----
 arch/x86/Kconfig                              |  1 +-
 arch/x86/include/asm/pgtable.h                | 50 +-----------------
 arch/x86/include/asm/pgtable_types.h          |  5 +--
 include/linux/mm.h                            |  7 +--
 include/linux/pfn_t.h                         | 20 +-------
 include/linux/pgtable.h                       | 19 +------
 mm/Kconfig                                    |  4 +-
 mm/debug_vm_pgtable.c                         | 59 +--------------------
 mm/hmm.c                                      |  3 +-
 18 files changed, 11 insertions(+), 269 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index ad50ca6..9230bc7 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -30,8 +30,6 @@ PTE Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_protnone              | Tests a PROT_NONE PTE                            |
 +---------------------------+--------------------------------------------------+
-| pte_devmap                | Tests a ZONE_DEVICE mapped PTE                   |
-+---------------------------+--------------------------------------------------+
 | pte_soft_dirty            | Tests a soft dirty PTE                           |
 +---------------------------+--------------------------------------------------+
 | pte_swp_soft_dirty        | Tests a soft dirty swapped PTE                   |
@@ -106,8 +104,6 @@ PMD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pmd_protnone              | Tests a PROT_NONE PMD                            |
 +---------------------------+--------------------------------------------------+
-| pmd_devmap                | Tests a ZONE_DEVICE mapped PMD                   |
-+---------------------------+--------------------------------------------------+
 | pmd_soft_dirty            | Tests a soft dirty PMD                           |
 +---------------------------+--------------------------------------------------+
 | pmd_swp_soft_dirty        | Tests a soft dirty swapped PMD                   |
@@ -181,8 +177,6 @@ PUD Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pud_write                 | Tests a writable PUD                             |
 +---------------------------+--------------------------------------------------+
-| pud_devmap                | Tests a ZONE_DEVICE mapped PUD                   |
-+---------------------------+--------------------------------------------------+
 | pud_mkyoung               | Creates a young PUD                              |
 +---------------------------+--------------------------------------------------+
 | pud_mkold                 | Creates an old PUD                               |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5d91259..beb8c3c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -35,7 +35,6 @@ config ARM64
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
-	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
 	select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index b11cfb9..043b102 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -17,7 +17,6 @@
 #define PTE_SWP_EXCLUSIVE	(_AT(pteval_t, 1) << 2)	 /* only for swp ptes */
 #define PTE_DIRTY		(_AT(pteval_t, 1) << 55)
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
-#define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 
 /*
  * PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should be
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc1..9193537 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -107,7 +107,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 #define pte_user(pte)		(!!(pte_val(pte) & PTE_USER))
 #define pte_user_exec(pte)	(!(pte_val(pte) & PTE_UXN))
 #define pte_cont(pte)		(!!(pte_val(pte) & PTE_CONT))
-#define pte_devmap(pte)		(!!(pte_val(pte) & PTE_DEVMAP))
 #define pte_tagged(pte)		((pte_val(pte) & PTE_ATTRINDX_MASK) == \
 				 PTE_ATTRINDX(MT_NORMAL_TAGGED))
 
@@ -269,11 +268,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
 	return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
 }
 
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
-	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
-}
-
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -569,14 +563,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
 
 #define pmd_mkhuge(pmd)		(__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_devmap(pmd)		pte_devmap(pmd_pte(pmd))
-#endif
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
-	return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
-}
-
 #define __pmd_to_phys(pmd)	__pte_to_phys(pmd_pte(pmd))
 #define __phys_to_pmd_val(phys)	__phys_to_pte_val(phys)
 #define pmd_pfn(pmd)		((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT)
@@ -1114,16 +1100,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
 	return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
 							pmd_pte(entry), dirty);
 }
-
-static inline int pud_devmap(pud_t pud)
-{
-	return 0;
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-	return 0;
-}
 #endif
 
 #ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c88c6d4..c56a078 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -145,7 +145,6 @@ config PPC
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PHYS_TO_DMA
 	select ARCH_HAS_PMEM_API
-	select ARCH_HAS_PTE_DEVMAP		if PPC_BOOK3S_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 6472b08..51d868c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -155,12 +155,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 extern int hash__has_transparent_hugepage(void);
 #endif
 
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
-	BUG();
-	return pmd;
-}
-
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_BOOK3S_64_HASH_4K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0bf6fd0..0fb5b7d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -259,7 +259,7 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
  */
 static inline int hash__pmd_trans_huge(pmd_t pmd)
 {
-	return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)) ==
+	return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE)) ==
 		  (_PAGE_PTE | H_PAGE_THP_HUGE));
 }
 
@@ -281,11 +281,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 extern int hash__has_transparent_hugepage(void);
 #endif /*  CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
-	return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP));
-}
-
 #endif	/* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_BOOK3S_64_HASH_64K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8f9432e..a48ad02 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -88,7 +88,6 @@
 
 #define _PAGE_SOFT_DIRTY	_RPAGE_SW3 /* software: software dirty tracking */
 #define _PAGE_SPECIAL		_RPAGE_SW2 /* software: special page */
-#define _PAGE_DEVMAP		_RPAGE_SW1 /* software: ZONE_DEVICE page */
 
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
@@ -109,7 +108,7 @@
  */
 #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
 			 _PAGE_ACCESSED | H_PAGE_THP_HUGE | _PAGE_PTE | \
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+			 _PAGE_SOFT_DIRTY)
 /*
  * user access blocked by key
  */
@@ -123,7 +122,7 @@
  */
 #define _PAGE_CHG_MASK	(PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
 			 _PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+			 _PAGE_SOFT_DIRTY)
 
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
@@ -619,24 +618,6 @@ static inline pte_t pte_mkhuge(pte_t pte)
 	return pte;
 }
 
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
-	return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP));
-}
-
-/*
- * This is potentially called with a pmd as the argument, in which case it's not
- * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set.
- * That's because the bit we use for _PAGE_DEVMAP is not reserved for software
- * use in page directory entries (ie. non-ptes).
- */
-static inline int pte_devmap(pte_t pte)
-{
-	__be64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
-
-	return (pte_raw(pte) & mask) == mask;
-}
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	/* FIXME!! check whether this need to be a conditional */
@@ -1387,35 +1368,6 @@ static inline bool arch_needs_pgtable_deposit(void)
 }
 extern void serialize_against_pte_lookup(struct mm_struct *mm);
 
-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
-	if (radix_enabled())
-		return radix__pmd_mkdevmap(pmd);
-	return hash__pmd_mkdevmap(pmd);
-}
-
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
-	if (radix_enabled())
-		return radix__pud_mkdevmap(pud);
-	BUG();
-	return pud;
-}
-
-static inline int pmd_devmap(pmd_t pmd)
-{
-	return pte_devmap(pmd_pte(pmd));
-}
-
-static inline int pud_devmap(pud_t pud)
-{
-	return pte_devmap(pud_pte(pud));
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-	return 0;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8f55ff7..df23a82 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -264,7 +264,7 @@ static inline int radix__p4d_bad(p4d_t p4d)
 
 static inline int radix__pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+	return (pmd_val(pmd) & _PAGE_PTE) == _PAGE_PTE;
 }
 
 static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
@@ -274,7 +274,7 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
 
 static inline int radix__pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+	return (pud_val(pud) & _PAGE_PTE) == _PAGE_PTE;
 }
 
 static inline pud_t radix__pud_mkhuge(pud_t pud)
@@ -315,16 +315,6 @@ static inline int radix__has_transparent_pud_hugepage(void)
 }
 #endif
 
-static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
-{
-	return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
-static inline pud_t radix__pud_mkdevmap(pud_t pud)
-{
-	return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
 struct vmem_altmap;
 struct dev_pagemap;
 extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d7122a..8702076 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,7 +91,6 @@ config X86
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
-	select ARCH_HAS_PTE_DEVMAP		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
 	select ARCH_HAS_NONLEAF_PMD_YOUNG	if PGTABLE_LEVELS > 2
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5b..5220f5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -268,16 +268,15 @@ static inline bool pmd_leaf(pmd_t pte)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_leaf */
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pud_val(pud) & _PAGE_PSE) == _PAGE_PSE;
 }
 #endif
 
@@ -287,29 +286,6 @@ static inline int has_transparent_hugepage(void)
 	return boot_cpu_has(X86_FEATURE_PSE);
 }
 
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pmd_devmap(pmd_t pmd)
-{
-	return !!(pmd_val(pmd) & _PAGE_DEVMAP);
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static inline int pud_devmap(pud_t pud)
-{
-	return !!(pud_val(pud) & _PAGE_DEVMAP);
-}
-#else
-static inline int pud_devmap(pud_t pud)
-{
-	return 0;
-}
-#endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
-	return 0;
-}
-#endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
@@ -470,11 +446,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
-	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
-}
-
 static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
 {
 	pmdval_t v = native_pmd_val(pmd);
@@ -560,11 +531,6 @@ static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DIRTY);
 }
 
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_DEVMAP);
-}
-
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
@@ -644,11 +610,6 @@ static inline pud_t pud_mkdirty(pud_t pud)
 	return pud_mksaveddirty(pud);
 }
 
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
-	return pud_set_flags(pud, _PAGE_DEVMAP);
-}
-
 static inline pud_t pud_mkhuge(pud_t pud)
 {
 	return pud_set_flags(pud, _PAGE_PSE);
@@ -953,13 +914,6 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t a)
-{
-	return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
-}
-#endif
-
 #define pte_accessible pte_accessible
 static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
 {
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b786449..1885ac2 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -33,7 +33,6 @@
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
 #ifdef CONFIG_X86_64
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
@@ -117,11 +116,9 @@
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
-#define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
 #define _PAGE_SOFTW4	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
-#define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
@@ -148,7 +145,7 @@
 #define _COMMON_PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |	\
 				 _PAGE_SPECIAL | _PAGE_ACCESSED |	\
 				 _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY |	\
-				 _PAGE_DEVMAP | _PAGE_CC | _PAGE_UFFD_WP)
+				 _PAGE_CC | _PAGE_UFFD_WP)
 #define _PAGE_CHG_MASK	(_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
 #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47d8923..5e9b754 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2709,13 +2709,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
 }
 #endif
 
-#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t pte)
-{
-	return 0;
-}
-#endif
-
 extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 			       spinlock_t **ptl);
 static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 2d91482..0100ad8 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -97,26 +97,6 @@ static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot)
 #endif
 #endif
 
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
-	const u64 flags = PFN_DEV|PFN_MAP;
-
-	return (pfn.val & flags) == flags;
-}
-#else
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
-	return false;
-}
-pte_t pte_mkdevmap(pte_t pte);
-pmd_t pmd_mkdevmap(pmd_t pmd);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
-	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
-pud_t pud_mkdevmap(pud_t pud);
-#endif
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static inline bool pfn_t_special(pfn_t pfn)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 91e06bb..2fa40c8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1591,21 +1591,6 @@ static inline int pud_write(pud_t pud)
 }
 #endif /* pud_write */
 
-#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline int pmd_devmap(pmd_t pmd)
-{
-	return 0;
-}
-static inline int pud_devmap(pud_t pud)
-{
-	return 0;
-}
-static inline int pgd_devmap(pgd_t pgd)
-{
-	return 0;
-}
-#endif
-
 #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
 	!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static inline int pud_trans_huge(pud_t pud)
@@ -1860,8 +1845,8 @@ typedef unsigned int pgtbl_mod_mask;
  * - It should contain a huge PFN, which points to a huge page larger than
  *   PAGE_SIZE of the platform.  The PFN format isn't important here.
  *
- * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
- *   pXd_devmap(), or hugetlb mappings).
+ * - It should cover all kinds of huge mappings (i.e. pXd_trans_huge()
+ *   or hugetlb mappings).
  */
 #ifndef pgd_leaf
 #define pgd_leaf(x)	false
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb452..832a320 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -995,9 +995,6 @@ config ARCH_HAS_CURRENT_STACK_POINTER
 	  register alias named "current_stack_pointer", this config can be
 	  selected.
 
-config ARCH_HAS_PTE_DEVMAP
-	bool
-
 config ARCH_HAS_ZONE_DMA_SET
 	bool
 
@@ -1015,7 +1012,6 @@ config ZONE_DEVICE
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on SPARSEMEM_VMEMMAP
-	depends on ARCH_HAS_PTE_DEVMAP
 	select XARRAY_MULTI
 
 	help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index e4969fb..1262148 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -348,12 +348,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
 	vaddr &= HPAGE_PUD_MASK;
 
 	pud = pfn_pud(args->pud_pfn, args->page_prot);
-	/*
-	 * Some architectures have debug checks to make sure
-	 * huge pud mapping are only found with devmap entries
-	 * For now test with only devmap entries.
-	 */
-	pud = pud_mkdevmap(pud);
 	set_pud_at(args->mm, vaddr, args->pudp, pud);
 	flush_dcache_page(page);
 	pudp_set_wrprotect(args->mm, vaddr, args->pudp);
@@ -366,7 +360,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
 	WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
 	pud = pfn_pud(args->pud_pfn, args->page_prot);
-	pud = pud_mkdevmap(pud);
 	pud = pud_wrprotect(pud);
 	pud = pud_mkclean(pud);
 	set_pud_at(args->mm, vaddr, args->pudp, pud);
@@ -384,7 +377,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
 	pud = pfn_pud(args->pud_pfn, args->page_prot);
-	pud = pud_mkdevmap(pud);
 	pud = pud_mkyoung(pud);
 	set_pud_at(args->mm, vaddr, args->pudp, pud);
 	flush_dcache_page(page);
@@ -693,53 +685,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
 static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static void __init pte_devmap_tests(struct pgtable_debug_args *args)
-{
-	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
-
-	pr_debug("Validating PTE devmap\n");
-	WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args)
-{
-	pmd_t pmd;
-
-	if (!has_transparent_hugepage())
-		return;
-
-	pr_debug("Validating PMD devmap\n");
-	pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
-	WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void __init pud_devmap_tests(struct pgtable_debug_args *args)
-{
-	pud_t pud;
-
-	if (!has_transparent_pud_hugepage())
-		return;
-
-	pr_debug("Validating PUD devmap\n");
-	pud = pfn_pud(args->fixed_pud_pfn, args->page_prot);
-	WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
-}
-#else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-#else  /* CONFIG_TRANSPARENT_HUGEPAGE */
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#else
-static void __init pte_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
 static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args)
 {
 	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
@@ -1341,10 +1286,6 @@ static int __init debug_vm_pgtable(void)
 	pte_protnone_tests(&args);
 	pmd_protnone_tests(&args);
 
-	pte_devmap_tests(&args);
-	pmd_devmap_tests(&args);
-	pud_devmap_tests(&args);
-
 	pte_soft_dirty_tests(&args);
 	pmd_soft_dirty_tests(&args);
 	pte_swap_soft_dirty_tests(&args);
diff --git a/mm/hmm.c b/mm/hmm.c
index 7f78b0b..fa442b4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -395,8 +395,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \
-    defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
 						 pud_t pud)
 {
-- 
git-series 0.9.1

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
@ 2024-06-27  5:22   ` Christoph Hellwig
  2024-06-27 11:33   ` Jan Kara
  2024-07-02  7:18   ` David Hildenbrand
  2 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:22 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu, Jun 27, 2024 at 10:54:21AM +1000, Alistair Popple wrote:
> +extern void prep_compound_page(struct page *page, unsigned int order);

No need for the extern.

>  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> -			unsigned long addr, struct page *page, pgprot_t prot)
> +			unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite)

Overly long line.

> +	retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, mkwrite);

.. same here.

> +vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
> +		unsigned long addr, pfn_t pfn_t, bool write)

This could probably use a kerneldoc comment.
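Something along these lines maybe (just a sketch - the exact wording and
the return value description would need checking against the actual
behaviour):

/**
 * dax_insert_pfn - insert a reference counted DAX mapping
 * @vma: user vma to map into
 * @addr: target user address of this page
 * @pfn_t: pfn to insert
 * @write: whether to insert (or upgrade to) a writable mapping
 *
 * Insert a mapping for the page backing @pfn_t, taking a reference on the
 * underlying struct page much like vmf_insert_page(), and upgrading an
 * existing read-only mapping to writable if @write is set.
 *
 * Return: vm_fault_t value.
 */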


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one
  2024-06-27  0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
@ 2024-06-27  5:30   ` Christoph Hellwig
  2024-06-29 21:28   ` Bjorn Helgaas
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:30 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu, Jun 27, 2024 at 10:54:17AM +1000, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  drivers/pci/p2pdma.c | 2 ++
>  mm/memremap.c        | 8 ++++----
>  mm/mm_init.c         | 4 +++-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..1e9ea32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		goto out;
>  	}
>  
> +	set_page_count(virt_to_page(kaddr), 1);

Can we have a comment here?  Without that it feels a bit too much like
black magic when reading the code.
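Maybe something like this (untested sketch, wording based on my reading
of the changelog):

	/*
	 * ZONE_DEVICE pages now start out with a zero refcount, so take
	 * the initial reference here when the allocator hands the page
	 * out, just like a normal page allocator would.
	 */
	set_page_count(virt_to_page(kaddr), 1);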

> +	if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
> +		put_dev_pagemap(folio->page.pgmap);
> +	else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
>  		/*
>  		 * Reset the refcount to 1 to prepare for handing out the page
>  		 * again.
>  		 */
>  		folio_set_count(folio, 1);

The else if here evaluates to MEMORY_DEVICE_FS_DAX ||
MEMORY_DEVICE_GENERIC.  Maybe make this a switch statement handling
all cases of the enum to make that clear, and to have the compiler
generate a warning when a new type is added without being handled here?
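I.e. something like this completely untested sketch (the exact set of
enum values handled in each case would need double checking):

	switch (folio->page.pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
		put_dev_pagemap(folio->page.pgmap);
		break;
	case MEMORY_DEVICE_FS_DAX:
	case MEMORY_DEVICE_GENERIC:
		/*
		 * Reset the refcount to 1 to prepare for handing out the
		 * page again.
		 */
		folio_set_count(folio, 1);
		break;
	case MEMORY_DEVICE_PCI_P2PDMA:
		/* The allocator takes the reference when handing pages out. */
		break;
	}

With no default case the compiler will then warn about any newly added
but unhandled pgmap type.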

> @@ -1014,7 +1015,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  	 * which will set the page count to 1 when allocating the page.
>  	 */
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
> +	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
>  		set_page_count(page, 0);

Similarly here, a switch with an explanation of what is handled and
what isn't would be nice.
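Again only a sketch, but roughly:

	switch (pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
	case MEMORY_DEVICE_PCI_P2PDMA:
		/* The driver takes a reference when it allocates the page. */
		set_page_count(page, 0);
		break;
	case MEMORY_DEVICE_FS_DAX:
	case MEMORY_DEVICE_GENERIC:
		/* These still rely on the initial refcount of one for now. */
		break;
	}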


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 03/13] fs/dax: Refactor wait for dax idle page
  2024-06-27  0:54 ` [PATCH 03/13] fs/dax: Refactor wait for dax idle page Alistair Popple
@ 2024-06-27  5:31   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:31 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu, Jun 27, 2024 at 10:54:18AM +1000, Alistair Popple wrote:
> A FS DAX page is considered idle when its refcount drops to one. This
> is currently open-coded in all file systems supporting FS DAX. Move
> the idle detection to a common function to make future changes easier.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

I'm pretty sure I already reviewed this ages ago, but:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 04/13] fs/dax: Add dax_page_free callback
  2024-06-27  0:54 ` [PATCH 04/13] fs/dax: Add dax_page_free callback Alistair Popple
@ 2024-06-27  5:33   ` Christoph Hellwig
  2024-06-27 23:48     ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:33 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu, Jun 27, 2024 at 10:54:19AM +1000, Alistair Popple wrote:
> When a fs dax page is freed it has to notify filesystems that the page
> has been unpinned/unmapped and is free. Currently this involves
> special code in the page free paths to detect a transition of refcount
> from 2 to 1 and to call some fs dax specific code.
> 
> A future change will require this to happen when the page refcount
> drops to zero. In this case we can use the existing
> pgmap->ops->page_free() callback so wire that up for all devices that
> support FS DAX (nvdimm and virtio).

Given that ->page_free is only called from free_zone_device_folio
and right next to a switch on the type, can't we just do the
wake_up_var there without the somewhat confusing indirect call that
just ends up back in common code without any driver logic?
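I.e. something like the below in the free_zone_device_folio() switch
(untested, and &folio->page is just a guess at what the fs/dax wait side
would sleep on - it needs to match whatever wait_var_event() uses there):

	case MEMORY_DEVICE_FS_DAX:
		/* The page is now idle, wake up anyone waiting on it. */
		wake_up_var(&folio->page);
		break;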


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/13] mm: Allow compound zone device pages
  2024-06-27  0:54 ` [PATCH 05/13] mm: Allow compound zone device pages Alistair Popple
@ 2024-06-27  5:35   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:35 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david, Jason Gunthorpe

On Thu, Jun 27, 2024 at 10:54:20AM +1000, Alistair Popple wrote:
>  static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
>  {
> -	return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
> +	return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, pagemap);

Overly long line here (and quite a few more).


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/13] fs/dax: Properly refcount fs dax pages
  2024-06-27  0:54 ` [PATCH 10/13] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2024-06-27  5:44   ` Christoph Hellwig
  2024-09-06  6:00     ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2024-06-27  5:44 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index eb61598..b7a31ae 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> -	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> +	pfn = phys_to_pfn_t(phys, 0);
>  
>  	dax_set_mapping(vmf, pfn, fault_size);
>  
> -	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> +	return dax_insert_pfn(vmf->vma, vmf->address, pfn, vmf->flags & FAULT_FLAG_WRITE);

Plenty of overly long lines here and later.

Q: should dax_insert_pfn take a vm_fault structure instead of the vma?
Or are there potential use cases that aren't from the fault path?
Similarly, instead of the bool write, passing the fault flags might
actually make things more readable.
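E.g. assuming all callers really do come from the fault path, the
prototype could become something like:

vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t);

and the call above would then collapse to:

	return dax_insert_pfn(vmf, pfn);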

Also at least currently it seems like there are no modular users despite
the export, or am I missing something?

> +		blk_queue_flag_set(QUEUE_FLAG_DAX, q);

Just as a heads up, setting of these flags has changed a lot in
linux-next.

>  {
> +	/*
> +	 * Make sure we flush any cached data to the page now that it's free.
> +	 */
> +	if (PageDirty(page))
> +		dax_flush(NULL, page_address(page), page_size(page));
> +

Adding the magic dax_dev == NULL case to dax_flush and going through it
vs just calling arch_wb_cache_pmem directly here seems odd.
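I.e. just (assuming this path only ever sees pmem-backed pages):

	if (PageDirty(page))
		arch_wb_cache_pmem(page_address(page), page_size(page));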

But I also don't quite understand how it is related to the rest
of the patch anyway.

> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -373,6 +373,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
>  	unsigned long start = addr;
>  
>  	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (vma_is_dax(vma))
> +		ptl = NULL;
>  	if (ptl) {

This feels sufficiently magic to warrant a comment.
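Maybe something like the below, assuming I've understood why the check
is there (that huge DAX mappings shouldn't be accounted through the THP
mlock path):

	ptl = pmd_trans_huge_lock(pmd, vma);
	/*
	 * Huge DAX folios are not accounted as mlocked, so don't take
	 * the THP path for them.
	 */
	if (vma_is_dax(vma))
		ptl = NULL;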

>  		if (!pmd_present(*pmd))
>  			goto out;
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index b7e1599..f11ee0d 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1016,7 +1016,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  	 */
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
>  	    pgmap->type == MEMORY_DEVICE_COHERENT ||
> -	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
> +	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA ||
> +	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>  		set_page_count(page, 0);
>  }

So we'll skip this for MEMORY_DEVICE_GENERIC only.  Does anyone remember
if that's actively harmful or just not needed?  If the latter, it might
be simpler to just set the page count unconditionally here.
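I.e. just the below, if dropping the condition really is harmless for
MEMORY_DEVICE_GENERIC:

	/*
	 * ZONE_DEVICE pages are released directly to the driver page
	 * allocator which is then responsible for setting the refcount
	 * when handing the page out.
	 */
	set_page_count(page, 0);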


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page
  2024-06-27  0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
@ 2024-06-27  6:36   ` Dan Williams
  0 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2024-06-27  6:36 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple, Jason Gunthorpe

Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
> check in __gup_device_huge() is redundant. Remove it
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (12 preceding siblings ...)
  2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
@ 2024-06-27  6:58 ` Dan Williams
  2024-06-27  7:15   ` Alistair Popple
  2024-07-01  4:24 ` Dave Chinner
  14 siblings, 1 reply; 54+ messages in thread
From: Dan Williams @ 2024-06-27  6:58 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alistair Popple

Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.
> 
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
> 
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW define PTE bit on architectures
> that have devmap PTE bits defined.
> 
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvment.
> 
> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> the original RFC it passes the same number of ndctl test suite
> (https://github.com/pmem/ndctl) tests as my current development
> environment does without these patches.

Are you seeing the 'mmap.sh' test fail even without these patches?

I see this with the patches, will try without in the morning.

 EXT4-fs (pmem0): unmounting filesystem 26ea1463-343a-464f-9f16-91cb176dbdc7.
 XFS (pmem0): Mounting V5 Filesystem 554953fd-c9f4-460f-bc37-f43979986b68
 XFS (pmem0): Ending clean mount
 Oops: general protection fault, probably for non-canonical address 0xdead000000000518: 00
T SMP PTI
 CPU: 15 PID: 1295 Comm: mmap Tainted: G           OE    N 6.10.0-rc5+ #261
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240524-3.fc40 05/24/2024
 RIP: 0010:folio_mark_dirty+0x25/0x60
 Code: 90 90 90 90 90 0f 1f 44 00 00 53 48 89 fb e8 22 18 02 00 48 85 c0 74 26 48 89 c7 48
0 02 00 74 05 f0 80 63 02 fd <48> 8b 87 18 01 00 00 48 89 de 5b 48 8b 40 18 e9 77 90 c0 00
 RSP: 0018:ffffb073022f7b08 EFLAGS: 00010246
 RAX: 004ffff800002000 RBX: ffffd0d005000300 RCX: 0400000000000040
 RDX: 0000000000000000 RSI: 00007f4006200000 RDI: dead000000000400
 RBP: 0000000000000000 R08: ffff9a4b04504a30 R09: 000fffffffffffff
 R10: ffffd0d005000300 R11: 0000000000000000 R12: 00007f4006200000
 R13: ffff9a4b7c96c000 R14: ffff9a4b7daba440 R15: ffffb073022f7cb0
 FS:  00007f4046351740(0000) GS:ffff9a4d77780000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f40461ff000 CR3: 000000027aea6000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x26
  ? die_addr+0x38/0x60
  ? exc_general_protection+0x143/0x420
  ? asm_exc_general_protection+0x22/0x30
  ? folio_mark_dirty+0x25/0x60
  ? folio_mark_dirty+0xe/0x60
  unmap_page_range+0xea5/0x1550
  unmap_vmas+0xf8/0x1e0
  unmap_region.constprop.0+0xd7/0x150
  ? lock_is_held_type+0xd5/0x130
  do_vmi_align_munmap.isra.0+0x3f4/0x580
  ? mas_walk+0x101/0x1b0
  __vm_munmap+0xa6/0x170
  __x64_sys_munmap+0x17/0x20
  do_syscall_64+0x75/0x190
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

$ faddr2line vmlinux folio_mark_dirty+0x25
folio_mark_dirty+0x25/0x58:
folio_mark_dirty at mm/page-writeback.c:2860

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-06-27  6:58 ` [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Dan Williams
@ 2024-06-27  7:15   ` Alistair Popple
  2024-06-27 20:24     ` Dan Williams
  0 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-06-27  7:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
	catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>> 
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>> 
>> By treating the refcounts on these pages the same way as normal pages
>> we can remove a lot of special checks. In particular pXd_trans_huge()
>> becomes the same as pXd_leaf(), although I haven't made that change
>> here. It also frees up a valuable SW define PTE bit on architectures
>> that have devmap PTE bits defined.
>> 
>> It also almost certainly allows further clean-up of the devmap managed
>> functions, but I have left that as a future improvment.
>> 
>> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> the original RFC it passes the same number of ndctl test suite
>> (https://github.com/pmem/ndctl) tests as my current development
>> environment does without these patches.
>
> Are you seeing the 'mmap.sh' test fail even without these patches?

No. But I also don't see it failing with these patches :)

For reference this is what I see on my test machine with or without:

[1/70] Generating version.h with a custom command
 1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
 2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
 3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77
 4/13 ndctl:dax / dax-dev                   OK               0.02s
 5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
 6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
 7/13 ndctl:dax / device-dax                OK              13.40s
 8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
>>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem

 9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
>>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh

13/13 ndctl:dax / mmap.sh                   OK             107.57s

Ok:                 6   
Expected Fail:      0   
Fail:               2   
Unexpected Pass:    0   
Skipped:            5   
Timeout:            0   

I have been using QEMU for my testing. Maybe I missed some condition in
the unmap path though, so I will take another look.

> I see this with the patches, will try without in the morning.
>
>  EXT4-fs (pmem0): unmounting filesystem 26ea1463-343a-464f-9f16-91cb176dbdc7.
>  XFS (pmem0): Mounting V5 Filesystem 554953fd-c9f4-460f-bc37-f43979986b68
>  XFS (pmem0): Ending clean mount
>  Oops: general protection fault, probably for non-canonical address 0xdead000000000518: 00
> T SMP PTI
>  CPU: 15 PID: 1295 Comm: mmap Tainted: G           OE    N 6.10.0-rc5+ #261
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20240524-3.fc40 05/24/2024
>  RIP: 0010:folio_mark_dirty+0x25/0x60
>  Code: 90 90 90 90 90 0f 1f 44 00 00 53 48 89 fb e8 22 18 02 00 48 85 c0 74 26 48 89 c7 48
> 0 02 00 74 05 f0 80 63 02 fd <48> 8b 87 18 01 00 00 48 89 de 5b 48 8b 40 18 e9 77 90 c0 00
>  RSP: 0018:ffffb073022f7b08 EFLAGS: 00010246
>  RAX: 004ffff800002000 RBX: ffffd0d005000300 RCX: 0400000000000040
>  RDX: 0000000000000000 RSI: 00007f4006200000 RDI: dead000000000400
>  RBP: 0000000000000000 R08: ffff9a4b04504a30 R09: 000fffffffffffff
>  R10: ffffd0d005000300 R11: 0000000000000000 R12: 00007f4006200000
>  R13: ffff9a4b7c96c000 R14: ffff9a4b7daba440 R15: ffffb073022f7cb0
>  FS:  00007f4046351740(0000) GS:ffff9a4d77780000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f40461ff000 CR3: 000000027aea6000 CR4: 00000000000006f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? __die_body.cold+0x19/0x26
>   ? die_addr+0x38/0x60
>   ? exc_general_protection+0x143/0x420
>   ? asm_exc_general_protection+0x22/0x30
>   ? folio_mark_dirty+0x25/0x60
>   ? folio_mark_dirty+0xe/0x60
>   unmap_page_range+0xea5/0x1550
>   unmap_vmas+0xf8/0x1e0
>   unmap_region.constprop.0+0xd7/0x150
>   ? lock_is_held_type+0xd5/0x130
>   do_vmi_align_munmap.isra.0+0x3f4/0x580
>   ? mas_walk+0x101/0x1b0
>   __vm_munmap+0xa6/0x170
>   __x64_sys_munmap+0x17/0x20
>   do_syscall_64+0x75/0x190
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> $ faddr2line vmlinux folio_mark_dirty+0x25
> folio_mark_dirty+0x25/0x58:
> folio_mark_dirty at mm/page-writeback.c:2860


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
  2024-06-27  5:22   ` Christoph Hellwig
@ 2024-06-27 11:33   ` Jan Kara
  2024-09-06  6:21     ` Alistair Popple
  2024-07-02  7:18   ` David Hildenbrand
  2 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2024-06-27 11:33 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu 27-06-24 10:54:21, Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
> 
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce dax_insert_pfn. This will take a reference on the
> underlying page much the same as vmf_insert_page, except it also
> permits upgrading an existing mapping to be writable if
> requested/possible.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>

Overall this looks good to me. Some comments below.

> ---
>  include/linux/mm.h |  4 ++-
>  mm/memory.c        | 79 ++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9a5652c..b84368b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1080,6 +1080,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
>  struct mmu_gather;
>  struct inode;
>  
> +extern void prep_compound_page(struct page *page, unsigned int order);
> +

You don't seem to use this function in this patch?

> diff --git a/mm/memory.c b/mm/memory.c
> index ce48a05..4f26a1f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1989,14 +1989,42 @@ static int validate_page_before_insert(struct page *page)
>  }
>  
>  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> -			unsigned long addr, struct page *page, pgprot_t prot)
> +			unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite)
>  {
>  	struct folio *folio = page_folio(page);
> +	pte_t entry = ptep_get(pte);
>  
> -	if (!pte_none(ptep_get(pte)))
> +	if (!pte_none(entry)) {
> +		if (mkwrite) {
> +			/*
> +			 * For read faults on private mappings the PFN passed
> +			 * in may not match the PFN we have mapped if the
> +			 * mapped PFN is a writeable COW page.  In the mkwrite
> +			 * case we are creating a writable PTE for a shared
> +			 * mapping and we expect the PFNs to match. If they
> +			 * don't match, we are likely racing with block
> +			 * allocation and mapping invalidation so just skip the
> +			 * update.
> +			 */
> +			if (pte_pfn(entry) != page_to_pfn(page)) {
> +				WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
> +				return -EFAULT;
> +			}
> +			entry = maybe_mkwrite(entry, vma);
> +			entry = pte_mkyoung(entry);
> +			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> +				update_mmu_cache(vma, addr, pte);
> +			return 0;
> +		}
>  		return -EBUSY;

If you do this like:

		if (!mkwrite)
			return -EBUSY;

You can reduce the indentation of the big block and also make the flow more
obvious...
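
For illustration, a rough (untested) sketch of that restructuring, just
reusing the logic from your patch:

	if (!pte_none(entry)) {
		if (!mkwrite)
			return -EBUSY;

		/* mkwrite case, now one level of indentation shallower */
		if (pte_pfn(entry) != page_to_pfn(page)) {
			WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
			return -EFAULT;
		}
		entry = maybe_mkwrite(entry, vma);
		entry = pte_mkyoung(entry);
		if (ptep_set_access_flags(vma, addr, pte, entry, 1))
			update_mmu_cache(vma, addr, pte);
		return 0;
	}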

> +	}
> +
>  	/* Ok, finally just insert the thing.. */
>  	folio_get(folio);
> +	if (mkwrite)
> +		entry = maybe_mkwrite(mk_pte(page, prot), vma);
> +	else
> +		entry = mk_pte(page, prot);

I'd prefer:

	entry = mk_pte(page, prot);
	if (mkwrite)
		entry = maybe_mkwrite(entry, vma);

but I don't insist. Also, insert_pfn() additionally has pte_mkyoung() and
pte_mkdirty(). Why were those left out here?
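
Something along these lines if the insert_pfn() behaviour were mirrored
here (an untested sketch, assuming the young/dirty bits are wanted for the
mkwrite case):

	entry = mk_pte(page, prot);
	if (mkwrite) {
		entry = pte_mkyoung(entry);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	}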

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-06-27  7:15   ` Alistair Popple
@ 2024-06-27 20:24     ` Dan Williams
  2024-06-28  0:06       ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: Dan Williams @ 2024-06-27 20:24 UTC (permalink / raw)
  To: Alistair Popple, Dan Williams
  Cc: vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
	catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david

Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > Alistair Popple wrote:
> >> FS DAX pages have always maintained their own page reference counts
> >> without following the normal rules for page reference counting. In
> >> particular pages are considered free when the refcount hits one rather
> >> than zero and refcounts are not added when mapping the page.
> >> 
> >> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> >> mechanism for allowing GUP to hold references on the page (see
> >> get_dev_pagemap). However there doesn't seem to be any reason why FS
> >> DAX pages need their own reference counting scheme.
> >> 
> >> By treating the refcounts on these pages the same way as normal pages
> >> we can remove a lot of special checks. In particular pXd_trans_huge()
> >> becomes the same as pXd_leaf(), although I haven't made that change
> >> here. It also frees up a valuable SW-defined PTE bit on architectures
> >> that have devmap PTE bits defined.
> >> 
> >> It also almost certainly allows further clean-up of the devmap managed
> >> functions, but I have left that as a future improvement.
> >> 
> >> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> >> the original RFC it passes the same number of ndctl test suite
> >> (https://github.com/pmem/ndctl) tests as my current development
> >> environment does without these patches.
> >
> > Are you seeing the 'mmap.sh' test fail even without these patches?
> 
> No. But I also don't see it failing with these patches :)
> 
> For reference this is what I see on my test machine with or without:
> 
> [1/70] Generating version.h with a custom command
>  1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
>  2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
>  3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77

I really need to get this test built as a service as this shows a
pre-req is missing, and it's not quite fair to expect submitters to put
it all together.

>  4/13 ndctl:dax / dax-dev                   OK               0.02s
>  5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
>  6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
>  7/13 ndctl:dax / device-dax                OK              13.40s
>  8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
> >>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem
> 
>  9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
> 11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
> 12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
> >>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh
> 
> 13/13 ndctl:dax / mmap.sh                   OK             107.57s

I need to think through why this one might falsely succeed, but that can
wait until we get this series reviewed. For now my failure is stable
which allows it to be bisected.

> 
> Ok:                 6   
> Expected Fail:      0   
> Fail:               2   
> Unexpected Pass:    0   
> Skipped:            5   
> Timeout:            0   
> 
> I have been using QEMU for my testing. Maybe I missed some condition in
> the unmap path though so will take another look.

I was able to bisect to:

[PATCH 10/13] fs/dax: Properly refcount fs dax pages

...I will prioritize that one in my review queue.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-06-27  0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
@ 2024-06-27 22:26   ` kernel test robot
  2024-07-02  7:16   ` David Hildenbrand
  1 sibling, 0 replies; 54+ messages in thread
From: kernel test robot @ 2024-06-27 22:26 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: llvm, oe-kbuild-all, catalin.marinas, will, mpe, npiggin,
	dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
	peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
	nvdimm, linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard

Hi Alistair,

kernel test robot noticed the following build errors:

[auto build test ERROR on f2661062f16b2de5d7b6a5c42a9a5c96326b8454]

url:    https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240627-191709
base:   f2661062f16b2de5d7b6a5c42a9a5c96326b8454
patch link:    https://lore.kernel.org/r/bd332b0d3971b03152b3541f97470817c5147b51.1719386613.git-series.apopple%40nvidia.com
patch subject: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20240628/202406280637.147dyRrV-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280637.147dyRrV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406280637.147dyRrV-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/rmap.c:1513:37: error: call to '__compiletime_assert_279' declared with 'error' attribute: BUILD_BUG failed
    1513 |         __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
         |                                            ^
   include/linux/huge_mm.h:111:26: note: expanded from macro 'HPAGE_PUD_NR'
     111 | #define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
         |                          ^
   include/linux/huge_mm.h:110:26: note: expanded from macro 'HPAGE_PUD_ORDER'
     110 | #define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
         |                          ^
   include/linux/huge_mm.h:97:28: note: expanded from macro 'HPAGE_PUD_SHIFT'
      97 | #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
         |                            ^
   note: (skipping 3 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   include/linux/compiler_types.h:498:2: note: expanded from macro '_compiletime_assert'
     498 |         __compiletime_assert(condition, msg, prefix, suffix)
         |         ^
   include/linux/compiler_types.h:491:4: note: expanded from macro '__compiletime_assert'
     491 |                         prefix ## suffix();                             \
         |                         ^
   <scratch space>:72:1: note: expanded from here
      72 | __compiletime_assert_279
         | ^
   mm/rmap.c:1660:35: error: call to '__compiletime_assert_280' declared with 'error' attribute: BUILD_BUG failed
    1660 |         __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
         |                                          ^
   include/linux/huge_mm.h:111:26: note: expanded from macro 'HPAGE_PUD_NR'
     111 | #define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
         |                          ^
   include/linux/huge_mm.h:110:26: note: expanded from macro 'HPAGE_PUD_ORDER'
     110 | #define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
         |                          ^
   include/linux/huge_mm.h:97:28: note: expanded from macro 'HPAGE_PUD_SHIFT'
      97 | #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
         |                            ^
   note: (skipping 3 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
   include/linux/compiler_types.h:498:2: note: expanded from macro '_compiletime_assert'
     498 |         __compiletime_assert(condition, msg, prefix, suffix)
         |         ^
   include/linux/compiler_types.h:491:4: note: expanded from macro '__compiletime_assert'
     491 |                         prefix ## suffix();                             \
         |                         ^
   <scratch space>:79:1: note: expanded from here
      79 | __compiletime_assert_280
         | ^
   2 errors generated.


vim +1513 mm/rmap.c

  1498	
  1499	/**
  1500	 * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
  1501	 * @folio:	The folio to add the mapping to
  1502	 * @page:	The first page to add
  1503	 * @vma:	The vm area in which the mapping is added
  1504	 *
  1505	 * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
  1506	 *
  1507	 * The caller needs to hold the page table lock.
  1508	 */
  1509	void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
  1510			struct vm_area_struct *vma)
  1511	{
  1512	#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 1513		__folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
  1514	#else
  1515		WARN_ON_ONCE(true);
  1516	#endif
  1517	}
  1518	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/13] mm: Remove devmap related functions and page table bits
  2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
@ 2024-06-27 23:04   ` kernel test robot
  2024-06-28  2:12   ` kernel test robot
  2024-07-08 11:35   ` Will Deacon
  2 siblings, 0 replies; 54+ messages in thread
From: kernel test robot @ 2024-06-27 23:04 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: oe-kbuild-all, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard

Hi Alistair,

kernel test robot noticed the following build errors:

[auto build test ERROR on f2661062f16b2de5d7b6a5c42a9a5c96326b8454]

url:    https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240627-191709
base:   f2661062f16b2de5d7b6a5c42a9a5c96326b8454
patch link:    https://lore.kernel.org/r/47c26640cd85f3db2e0a2796047199bb984d1b3f.1719386613.git-series.apopple%40nvidia.com
patch subject: [PATCH 13/13] mm: Remove devmap related functions and page table bits
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20240628/202406280658.1pp5cW2f-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280658.1pp5cW2f-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406280658.1pp5cW2f-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/powerpc/include/asm/book3s/64/mmu-hash.h:20,
                    from arch/powerpc/include/asm/book3s/64/mmu.h:32,
                    from arch/powerpc/include/asm/mmu.h:385,
                    from arch/powerpc/include/asm/paca.h:18,
                    from arch/powerpc/include/asm/current.h:13,
                    from include/linux/thread_info.h:23,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/powerpc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:79,
                    from include/linux/alloc_tag.h:11,
                    from include/linux/rhashtable-types.h:12,
                    from include/linux/ipc.h:7,
                    from include/uapi/linux/sem.h:5,
                    from include/linux/sem.h:5,
                    from include/linux/compat.h:14,
                    from arch/powerpc/kernel/asm-offsets.c:12:
>> arch/powerpc/include/asm/book3s/64/pgtable.h:1371:1: error: expected identifier or '(' before '}' token
    1371 | }
         | ^
   make[3]: *** [scripts/Makefile.build:117: arch/powerpc/kernel/asm-offsets.s] Error 1
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1208: prepare0] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:240: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:240: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +1371 arch/powerpc/include/asm/book3s/64/pgtable.h

953c66c2b22a30 Aneesh Kumar K.V  2016-12-12  1370  
ebd31197931d75 Oliver O'Halloran 2017-06-28 @1371  }
6a1ea36260f69f Aneesh Kumar K.V  2016-04-29  1372  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
ebd31197931d75 Oliver O'Halloran 2017-06-28  1373  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 04/13] fs/dax: Add dax_page_free callback
  2024-06-27  5:33   ` Christoph Hellwig
@ 2024-06-27 23:48     ` Alistair Popple
  0 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-27 23:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, david


Christoph Hellwig <hch@lst.de> writes:

> On Thu, Jun 27, 2024 at 10:54:19AM +1000, Alistair Popple wrote:
>> When a fs dax page is freed it has to notify filesystems that the page
>> has been unpinned/unmapped and is free. Currently this involves
>> special code in the page free paths to detect a transition of refcount
>> from 2 to 1 and to call some fs dax specific code.
>> 
>> A future change will require this to happen when the page refcount
>> drops to zero. In this case we can use the existing
>> pgmap->ops->page_free() callback so wire that up for all devices that
>> support FS DAX (nvdimm and virtio).
>
> Given that ->page_free is only called from free_zone_device_folio
> and right next to a switch on the type, can't we just do the
> wake_up_var there without the somewhat confusing indirect call that
> just ends up back in common code without any driver logic?

Longer term I'm hoping we can get rid of that switch on type entirely, as
I don't think the whole get/put_dev_pagemap() thing is very useful. Less
indirection is good though, so I will move the wake_up_var there.
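
Roughly what I have in mind for free_zone_device_folio(), assuming the
wait side keeps keying off the folio refcount (an untested sketch, not
the final code):

	if (folio->page.pgmap->type == MEMORY_DEVICE_FS_DAX)
		/*
		 * Wake anyone waiting for the page to become idle directly,
		 * rather than bouncing through pgmap->ops->page_free().
		 */
		wake_up_var(&folio->_refcount);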

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-06-27 20:24     ` Dan Williams
@ 2024-06-28  0:06       ` Alistair Popple
  0 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-06-28  0:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
	catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> 
>> Dan Williams <dan.j.williams@intel.com> writes:
>> 
>> > Alistair Popple wrote:
>> >> FS DAX pages have always maintained their own page reference counts
>> >> without following the normal rules for page reference counting. In
>> >> particular pages are considered free when the refcount hits one rather
>> >> than zero and refcounts are not added when mapping the page.
>> >> 
>> >> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> >> mechanism for allowing GUP to hold references on the page (see
>> >> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> >> DAX pages need their own reference counting scheme.
>> >> 
>> >> By treating the refcounts on these pages the same way as normal pages
>> >> we can remove a lot of special checks. In particular pXd_trans_huge()
>> >> becomes the same as pXd_leaf(), although I haven't made that change
>> >> here. It also frees up a valuable SW-defined PTE bit on architectures
>> >> that have devmap PTE bits defined.
>> >> 
>> >> It also almost certainly allows further clean-up of the devmap managed
>> >> functions, but I have left that as a future improvement.
>> >> 
>> >> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> >> the original RFC it passes the same number of ndctl test suite
>> >> (https://github.com/pmem/ndctl) tests as my current development
>> >> environment does without these patches.
>> >
>> > Are you seeing the 'mmap.sh' test fail even without these patches?
>> 
>> No. But I also don't see it failing with these patches :)
>> 
>> For reference this is what I see on my test machine with or without:
>> 
>> [1/70] Generating version.h with a custom command
>>  1/13 ndctl:dax / daxdev-errors.sh          SKIP             0.06s   exit status 77
>>  2/13 ndctl:dax / multi-dax.sh              SKIP             0.05s   exit status 77
>>  3/13 ndctl:dax / sub-section.sh            SKIP             0.14s   exit status 77
>
> I really need to get this test built as a service as this shows a
> pre-req is missing, and it's not quite fair to expect submitters to put
> it all together.

Ok. I didn't dig into why this was being skipped, but I might if I find
some time. The rest of the tests seemed more relevant anyway, and they
turned up enough bugs with my initial implementation to keep me busy,
which gave me some confidence.

If I'm being honest though I found the whole test setup a bit of a
pain. In particular remembering you have to manually (re)build the
special test versions of the modules tripped me up a few times until I
updated my build scripts. But I got there in the end.

>>  4/13 ndctl:dax / dax-dev                   OK               0.02s
>>  5/13 ndctl:dax / dax-ext4.sh               OK              12.97s
>>  6/13 ndctl:dax / dax-xfs.sh                OK              12.44s
>>  7/13 ndctl:dax / device-dax                OK              13.40s
>>  8/13 ndctl:dax / revoke-devmem             FAIL             0.31s   (exit status 250 or signal 122 SIGinvalid)
>> >>> TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl MALLOC_PERTURB_=227 DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/build/test/revoke_devmem
>> 
>>  9/13 ndctl:dax / device-dax-fio.sh         OK              32.43s
>> 10/13 ndctl:dax / daxctl-devices.sh         SKIP             0.07s   exit status 77
>> 11/13 ndctl:dax / daxctl-create.sh          SKIP             0.04s   exit status 77
>> 12/13 ndctl:dax / dm.sh                     FAIL             0.08s   exit status 1
>> >>> MALLOC_PERTURB_=209 TEST_PATH=/home/apopple/ndctl/build/test LD_LIBRARY_PATH=/home/apopple/ndctl/build/cxl/lib:/home/apopple/ndctl/build/daxctl/lib:/home/apopple/ndctl/build/ndctl/lib NDCTL=/home/apopple/ndctl/build/ndctl/ndctl DATA_PATH=/home/apopple/ndctl/test DAXCTL=/home/apopple/ndctl/build/daxctl/daxctl /home/apopple/ndctl/test/dm.sh
>> 
>> 13/13 ndctl:dax / mmap.sh                   OK             107.57s
>
> I need to think through why this one might false succeed, but that can
> wait until we get this series reviewed. For now my failure is stable
> which allows it to be bisected.
>
>> 
>> Ok:                 6   
>> Expected Fail:      0   
>> Fail:               2   
>> Unexpected Pass:    0   
>> Skipped:            5   
>> Timeout:            0   
>> 
>> I have been using QEMU for my testing. Maybe I missed some condition in
>> the unmap path though so will take another look.
>
> I was able to bisect to:

I could have guessed that one, as it's pretty much the crux of this
series given it's the one that switches everything away from
pXX_devmap. That means pXX_leaf/_trans_huge will start returning true
for DAX pages.

Based on your dump I'm guessing I missed some case in the
zap_pXX_range() path. It could be helpful to narrow down which of the
pXX paths is crashing but I will take another look there.

> [PATCH 10/13] fs/dax: Properly refcount fs dax pages
>
> ...I will prioritize that one in my review queue.

Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/13] mm: Remove devmap related functions and page table bits
  2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
  2024-06-27 23:04   ` kernel test robot
@ 2024-06-28  2:12   ` kernel test robot
  2024-07-08 11:35   ` Will Deacon
  2 siblings, 0 replies; 54+ messages in thread
From: kernel test robot @ 2024-06-28  2:12 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: llvm, oe-kbuild-all, catalin.marinas, will, mpe, npiggin,
	dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
	peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
	nvdimm, linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard

Hi Alistair,

kernel test robot noticed the following build errors:

[auto build test ERROR on f2661062f16b2de5d7b6a5c42a9a5c96326b8454]

url:    https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240627-191709
base:   f2661062f16b2de5d7b6a5c42a9a5c96326b8454
patch link:    https://lore.kernel.org/r/47c26640cd85f3db2e0a2796047199bb984d1b3f.1719386613.git-series.apopple%40nvidia.com
patch subject: [PATCH 13/13] mm: Remove devmap related functions and page table bits
config: powerpc-allyesconfig (https://download.01.org/0day-ci/archive/20240628/202406280920.VNwSTzZT-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 326ba38a991250a8587a399a260b0f7af2c9166a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240628/202406280920.VNwSTzZT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406280920.VNwSTzZT-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:14:
   In file included from include/linux/sem.h:5:
   In file included from include/uapi/linux/sem.h:5:
   In file included from include/linux/ipc.h:7:
   In file included from include/linux/rhashtable-types.h:12:
   In file included from include/linux/alloc_tag.h:11:
   In file included from include/linux/preempt.h:79:
   In file included from ./arch/powerpc/include/generated/asm/preempt.h:1:
   In file included from include/asm-generic/preempt.h:5:
   In file included from include/linux/thread_info.h:23:
   In file included from arch/powerpc/include/asm/current.h:13:
   In file included from arch/powerpc/include/asm/paca.h:18:
   In file included from arch/powerpc/include/asm/mmu.h:385:
   In file included from arch/powerpc/include/asm/book3s/64/mmu.h:32:
   In file included from arch/powerpc/include/asm/book3s/64/mmu-hash.h:20:
>> arch/powerpc/include/asm/book3s/64/pgtable.h:1371:1: error: extraneous closing brace ('}')
    1371 | }
         | ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[1]') [-Warray-bounds]
      98 |                 return (set->sig[3] | set->sig[2] |
         |                         ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[1]') [-Warray-bounds]
      98 |                 return (set->sig[3] | set->sig[2] |
         |                                       ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:99:4: warning: array index 1 is past the end of the array (that has type 'unsigned long[1]') [-Warray-bounds]
      99 |                         set->sig[1] | set->sig[0]) == 0;
         |                         ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:101:11: warning: array index 1 is past the end of the array (that has type 'unsigned long[1]') [-Warray-bounds]
     101 |                 return (set->sig[1] | set->sig[0]) == 0;
         |                         ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[1]') [-Warray-bounds]
     114 |                 return  (set1->sig[3] == set2->sig[3]) &&
         |                          ^         ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[1]') [-Warray-bounds]
     114 |                 return  (set1->sig[3] == set2->sig[3]) &&
         |                                          ^         ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[1]') [-Warray-bounds]
     115 |                         (set1->sig[2] == set2->sig[2]) &&
         |                          ^         ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[1]') [-Warray-bounds]
     115 |                         (set1->sig[2] == set2->sig[2]) &&
         |                                          ^         ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
      18 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:


vim +1371 arch/powerpc/include/asm/book3s/64/pgtable.h

953c66c2b22a30 Aneesh Kumar K.V  2016-12-12  1370  
ebd31197931d75 Oliver O'Halloran 2017-06-28 @1371  }
6a1ea36260f69f Aneesh Kumar K.V  2016-04-29  1372  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
ebd31197931d75 Oliver O'Halloran 2017-06-28  1373  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one
  2024-06-27  0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
  2024-06-27  5:30   ` Christoph Hellwig
@ 2024-06-29 21:28   ` Bjorn Helgaas
  1 sibling, 0 replies; 54+ messages in thread
From: Bjorn Helgaas @ 2024-06-29 21:28 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Thu, Jun 27, 2024 at 10:54:17AM +1000, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.

If you tag the subject line with PCI, please run "git log --oneline
drivers/pci/p2pdma.c" and make yours look like previous ones
("PCI/P2PDMA").

Also recast it to say something semantically useful about what it
*does*, not what it *doesn't* do.  Maybe something about initializing
the refcount where the page is allocated?  Especially since the only
p2pdma.c change here is to "set_page_count(..., 1)", which looks like
exactly the opposite of "don't initialize refcount to one".

> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  drivers/pci/p2pdma.c | 2 ++
>  mm/memremap.c        | 8 ++++----
>  mm/mm_init.c         | 4 +++-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..1e9ea32 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>  		goto out;
>  	}
>  
> +	set_page_count(virt_to_page(kaddr), 1);
> +
>  	/*
>  	 * vm_insert_page() can sleep, so a reference is taken to mapping
>  	 * such that rcu_read_unlock() can be done before inserting the
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 40d4547..caccbd8 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -488,15 +488,15 @@ void free_zone_device_folio(struct folio *folio)
>  	folio->mapping = NULL;
>  	folio->page.pgmap->ops->page_free(folio_page(folio, 0));
>  
> -	if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
> -	    folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
> +	if (folio->page.pgmap->type == MEMORY_DEVICE_PRIVATE ||
> +	    folio->page.pgmap->type == MEMORY_DEVICE_COHERENT)
> +		put_dev_pagemap(folio->page.pgmap);
> +	else if (folio->page.pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
>  		/*
>  		 * Reset the refcount to 1 to prepare for handing out the page
>  		 * again.
>  		 */
>  		folio_set_count(folio, 1);
> -	else
> -		put_dev_pagemap(folio->page.pgmap);
>  }
>  
>  void zone_device_page_init(struct page *page)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 3ec0493..b7e1599 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -6,6 +6,7 @@
>   * Author Mel Gorman <mel@csn.ul.ie>
>   *
>   */
> +#include "linux/memremap.h"
>  #include <linux/kernel.h>
>  #include <linux/init.h>
>  #include <linux/kobject.h>
> @@ -1014,7 +1015,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  	 * which will set the page count to 1 when allocating the page.
>  	 */
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
> -	    pgmap->type == MEMORY_DEVICE_COHERENT)
> +	    pgmap->type == MEMORY_DEVICE_COHERENT ||
> +	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
>  		set_page_count(page, 0);
>  }
>  
> -- 
> git-series 0.9.1

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
                   ` (13 preceding siblings ...)
  2024-06-27  6:58 ` [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Dan Williams
@ 2024-07-01  4:24 ` Dave Chinner
  2024-07-01  8:33   ` Alistair Popple
  14 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2024-07-01  4:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch

On Thu, Jun 27, 2024 at 10:54:15AM +1000, Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.
> 
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
> 
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW-defined PTE bit on architectures
> that have devmap PTE bits defined.
> 
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvment.
> 
> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> the original RFC it passes the same number of ndctl test suite
> (https://github.com/pmem/ndctl) tests as my current development
> environment does without these patches.

I strongly suggest running fstests on pmem devices with '-o
dax=always' mount options to get much more comprehensive fsdax test
coverage. That exercises a lot of the weird mmap corner cases that
cause problems so it would be good to actually test that nothing new
got broken in FSDAX by this patchset.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts
  2024-07-01  4:24 ` Dave Chinner
@ 2024-07-01  8:33   ` Alistair Popple
  0 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-07-01  8:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch


Dave Chinner <david@fromorbit.com> writes:

> On Thu, Jun 27, 2024 at 10:54:15AM +1000, Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>> 
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>> 
>> By treating the refcounts on these pages the same way as normal pages
>> we can remove a lot of special checks. In particular pXd_trans_huge()
>> becomes the same as pXd_leaf(), although I haven't made that change
>> here. It also frees up a valuable SW-defined PTE bit on architectures
>> that have devmap PTE bits defined.
>> 
>> It also almost certainly allows further clean-up of the devmap managed
>> functions, but I have left that as a future improvement.
>> 
>> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
>> the original RFC it passes the same number of ndctl test suite
>> (https://github.com/pmem/ndctl) tests as my current development
>> environment does without these patches.
>
> I strongly suggest running fstests on pmem devices with '-o
> dax=always' mount options to get much more comprehensive fsdax test
> coverage. That exercises a lot of the weird mmap corner cases that
> cause problems so it would be good to actually test that nothing new
> got broken in FSDAX by this patchset.

Thanks Dave, I will do that and report back. I suspect it will turn up
something, given Dan was seeing a crash with these patches.

 - Alistair

> -Dave.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  2024-06-27  0:54 ` [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2024-07-01  8:59   ` David Hildenbrand
  2024-07-01 23:47     ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2024-07-01  8:59 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
	linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
	linux-mm, linux-ext4, linux-xfs, jhubbard, hch, david

On 27.06.24 02:54, Alistair Popple wrote:
> Longterm pinning of FS DAX pages should already be disallowed by
> various pXX_devmap checks. However a future change will cause these
> checks to be invalid for FS DAX pages so make
> folio_is_longterm_pinnable() return false for FS DAX pages.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   include/linux/memremap.h | 11 +++++++++++
>   include/linux/mm.h       |  4 ++++
>   2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 6505713..19a448e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -193,6 +193,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>   	return is_device_coherent_page(&folio->page);
>   }
>   
> +static inline bool is_device_dax_page(const struct page *page)
> +{
> +	return is_zone_device_page(page) &&
> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
> +}
> +
> +static inline bool folio_is_device_dax(const struct folio *folio)
> +{
> +	return is_device_dax_page(&folio->page);
> +}
> +
>   #ifdef CONFIG_ZONE_DEVICE
>   void zone_device_page_init(struct page *page);
>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b84368b..4d1cdea 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2032,6 +2032,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>   	if (folio_is_device_coherent(folio))
>   		return false;
>   
> +	/* DAX must also always allow eviction. */
> +	if (folio_is_device_dax(folio))
> +		return false;
> +
>   	/* Otherwise, non-movable zone folios can be pinned. */
>   	return !folio_is_zone_movable(folio);
>   

Why is the check in check_vma_flags() insufficient? GUP-fast maybe?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  2024-07-01  8:59   ` David Hildenbrand
@ 2024-07-01 23:47     ` Alistair Popple
  2024-07-02 10:48       ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-07-01 23:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


David Hildenbrand <david@redhat.com> writes:

> On 27.06.24 02:54, Alistair Popple wrote:
>> Longterm pinning of FS DAX pages should already be disallowed by
>> various pXX_devmap checks. However a future change will cause these
>> checks to be invalid for FS DAX pages so make
>> folio_is_longterm_pinnable() return false for FS DAX pages.
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>   include/linux/memremap.h | 11 +++++++++++
>>   include/linux/mm.h       |  4 ++++
>>   2 files changed, 15 insertions(+)
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 6505713..19a448e 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -193,6 +193,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>>   	return is_device_coherent_page(&folio->page);
>>   }
>>   +static inline bool is_device_dax_page(const struct page *page)
>> +{
>> +	return is_zone_device_page(page) &&
>> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
>> +}
>> +
>> +static inline bool folio_is_device_dax(const struct folio *folio)
>> +{
>> +	return is_device_dax_page(&folio->page);
>> +}
>> +
>>   #ifdef CONFIG_ZONE_DEVICE
>>   void zone_device_page_init(struct page *page);
>>   void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b84368b..4d1cdea 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2032,6 +2032,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>>   	if (folio_is_device_coherent(folio))
>>   		return false;
>>   +	/* DAX must also always allow eviction. */
>> +	if (folio_is_device_dax(folio))
>> +		return false;
>> +
>>   	/* Otherwise, non-movable zone folios can be pinned. */
>>   	return !folio_is_zone_movable(folio);
>>   
>
> Why is the check in check_vma_flags() insufficient? GUP-fast maybe?

Right. This came up when I was changing the code for GUP-fast, but these
pages also shouldn't be longterm pinnable, and adding the case to
folio_is_longterm_pinnable() is an excellent way of documenting that.
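
For context, the VMA-level check you are referring to is roughly the
following (quoting from memory, so treat as approximate), and it is only
reached on the slow GUP path; GUP-fast never inspects the VMA, hence the
folio-level check:

	/* in check_vma_flags() */
	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
		return -EOPNOTSUPP;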

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-06-27  0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
  2024-06-27 22:26   ` kernel test robot
@ 2024-07-02  7:16   ` David Hildenbrand
  2024-07-02 10:19     ` Alistair Popple
  1 sibling, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02  7:16 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
	linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
	linux-mm, linux-ext4, linux-xfs, jhubbard, hch, david

On 27.06.24 02:54, Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
> and take references as it would for a normally mapped page.
> 
> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> simply inserts a special devmap PUD entry into the page table without
> holding a reference to the page for the mapping.

Do we really have to involve mapcounts/rmap for daxfs pages at this 
point? Or is this only "to make it look more like other pages"?

I'm asking this because:

(A) We don't support mixing PUD+PMD mappings yet. I have plans to change
     that in the future, but for now you can only map using a single PUD
     or by PTEs. I suspect that's good enough for now for fs-dax?
(B) As long as we have subpage mapcounts, this prevents vmemmap
     optimizations [1]. Is that only used for device-dax for now and are
     there no plans to make use of that for fs-dax?
(C) We managed without so far :)


Having that said, with folio->_large_mapcount things like 
folio_mapcount() are no longer terribly slow once we would PTE-map a 
PUD-sized folio.

Also, all ZONE_DEVICE pages should currently be marked PG_reserved, 
translating to "don't touch the memmap". I think we might want to tackle 
that first.

[1] https://lwn.net/Articles/860218/
[2] 
https://lkml.kernel.org/r/b0adbb0c-ad59-4bc5-ba0b-0af464b94557@redhat.com

> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   include/linux/huge_mm.h |   4 ++-
>   include/linux/rmap.h    |  14 +++++-
>   mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++---
>   mm/rmap.c               |  48 ++++++++++++++++++-
>   4 files changed, 168 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2aa986a..b98a3cc 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   
>   vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>   vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>   
>   enum transparent_hugepage_flag {
>   	TRANSPARENT_HUGEPAGE_UNSUPPORTED,
> @@ -106,6 +107,9 @@ extern struct kobj_attribute shmem_enabled_attr;
>   #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
>   #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
>   
> +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   
>   extern unsigned long transparent_hugepage_flags;
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7229b9b..c5a0205 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
>   enum rmap_level {
>   	RMAP_LEVEL_PTE = 0,
>   	RMAP_LEVEL_PMD,
> +	RMAP_LEVEL_PUD,
>   };
>   
>   static inline void __folio_rmap_sanity_checks(struct folio *folio,
> @@ -225,6 +226,13 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
>   		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
>   		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
>   		break;
> +	case RMAP_LEVEL_PUD:
> +		/*
> +		 * Assume that we are creating a single "entire" mapping of the folio.
> +		 */
> +		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
> +		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
> +		break;
>   	default:
>   		VM_WARN_ON_ONCE(true);
>   	}
> @@ -248,12 +256,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>   	folio_add_file_rmap_ptes(folio, page, 1, vma)
>   void folio_add_file_rmap_pmd(struct folio *, struct page *,
>   		struct vm_area_struct *);
> +void folio_add_file_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *);
>   void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>   		struct vm_area_struct *);
>   #define folio_remove_rmap_pte(folio, page, vma) \
>   	folio_remove_rmap_ptes(folio, page, 1, vma)
>   void folio_remove_rmap_pmd(struct folio *, struct page *,
>   		struct vm_area_struct *);
> +void folio_remove_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *);
>   
>   void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address, rmap_t flags);
> @@ -338,6 +350,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>   		break;
>   	case RMAP_LEVEL_PMD:
> +	case RMAP_LEVEL_PUD:
>   		atomic_inc(&folio->_entire_mapcount);
>   		atomic_inc(&folio->_large_mapcount);
>   		break;
> @@ -434,6 +447,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>   		break;
>   	case RMAP_LEVEL_PMD:
> +	case RMAP_LEVEL_PUD:
>   		if (PageAnonExclusive(page)) {
>   			if (unlikely(maybe_pinned))
>   				return -EBUSY;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index db7946a..e1f053e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1283,6 +1283,70 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>   	return VM_FAULT_NOPAGE;
>   }
>   EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
> +
> +/**
> + * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
> + * @vmf: Structure describing the fault
> + * @pfn: pfn of the page to insert
> + * @write: whether it's a write fault
> + *
> + * Return: vm_fault_t value.
> + */
> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long addr = vmf->address & PUD_MASK;
> +	pud_t *pud = vmf->pud;
> +	pgprot_t prot = vma->vm_page_prot;
> +	struct mm_struct *mm = vma->vm_mm;
> +	pud_t entry;
> +	spinlock_t *ptl;
> +	struct folio *folio;
> +	struct page *page;
> +
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return VM_FAULT_SIGBUS;
> +
> +	track_pfn_insert(vma, &prot, pfn);
> +
> +	ptl = pud_lock(mm, pud);
> +	if (!pud_none(*pud)) {
> +		if (write) {
> +			if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
> +				WARN_ON_ONCE(!is_huge_zero_pud(*pud));
> +				goto out_unlock;
> +			}
> +			entry = pud_mkyoung(*pud);
> +			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
> +			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
> +				update_mmu_cache_pud(vma, addr, pud);
> +		}
> +		goto out_unlock;
> +	}
> +
> +	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
> +	if (pfn_t_devmap(pfn))
> +		entry = pud_mkdevmap(entry);
> +	if (write) {
> +		entry = pud_mkyoung(pud_mkdirty(entry));
> +		entry = maybe_pud_mkwrite(entry, vma);
> +	}
> +
> +	page = pfn_t_to_page(pfn);
> +	folio = page_folio(page);
> +	folio_get(folio);
> +	folio_add_file_rmap_pud(folio, page, vma);
> +	add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> +
> +	set_pud_at(mm, addr, pud, entry);
> +	update_mmu_cache_pud(vma, addr, pud);
> +
> +out_unlock:
> +	spin_unlock(ptl);
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>   
>   void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> @@ -1836,7 +1900,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else if (is_huge_zero_pmd(orig_pmd)) {
> -		zap_deposited_table(tlb->mm, pmd);
> +		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> +			zap_deposited_table(tlb->mm, pmd);
>   		spin_unlock(ptl);
>   	} else {
>   		struct folio *folio = NULL;
> @@ -2268,20 +2333,34 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
>   int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   		 pud_t *pud, unsigned long addr)
>   {
> +	pud_t orig_pud;
>   	spinlock_t *ptl;
>   
>   	ptl = __pud_trans_huge_lock(pud, vma);
>   	if (!ptl)
>   		return 0;
>   
> -	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
> +	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
>   	tlb_remove_pud_tlb_entry(tlb, pud, addr);
> -	if (vma_is_special_huge(vma)) {
> +	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
>   		spin_unlock(ptl);
>   		/* No zero page support yet */
>   	} else {
> -		/* No support for anonymous PUD pages yet */
> -		BUG();
> +		struct page *page = NULL;
> +		struct folio *folio;
> +
> +		/* No support for anonymous PUD pages or migration yet */
> +		BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
> +
> +		page = pud_page(orig_pud);
> +		folio = page_folio(page);
> +		folio_remove_rmap_pud(folio, page, vma);
> +		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> +		VM_BUG_ON_PAGE(!PageHead(page), page);
> +		add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
> +
> +		spin_unlock(ptl);
> +		tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
>   	}
>   	return 1;
>   }
> @@ -2289,6 +2368,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>   		unsigned long haddr)
>   {
> +	pud_t old_pud;
> +
>   	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
>   	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>   	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
> @@ -2296,7 +2377,22 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>   
>   	count_vm_event(THP_SPLIT_PUD);
>   
> -	pudp_huge_clear_flush(vma, haddr, pud);
> +	old_pud = pudp_huge_clear_flush(vma, haddr, pud);
> +	if (is_huge_zero_pud(old_pud))
> +		return;
> +
> +	if (vma_is_dax(vma)) {
> +		struct page *page = pud_page(old_pud);
> +		struct folio *folio = page_folio(page);
> +
> +		if (!folio_test_dirty(folio) && pud_dirty(old_pud))
> +			folio_mark_dirty(folio);
> +		if (!folio_test_referenced(folio) && pud_young(old_pud))
> +			folio_set_referenced(folio);
> +		folio_remove_rmap_pud(folio, page, vma);
> +		folio_put(folio);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(folio), -HPAGE_PUD_NR);
> +	}
>   }
>   
>   void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ec..e949e4f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1165,6 +1165,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>   		break;
>   	case RMAP_LEVEL_PMD:
> +	case RMAP_LEVEL_PUD:
>   		first = atomic_inc_and_test(&folio->_entire_mapcount);
>   		if (first) {
>   			nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
> @@ -1306,6 +1307,12 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>   		case RMAP_LEVEL_PMD:
>   			SetPageAnonExclusive(page);
>   			break;
> +		case RMAP_LEVEL_PUD:
> +			/*
> +			 * Keep the compiler happy, we don't support anonymous PUD mappings.
> +			 */
> +			WARN_ON_ONCE(1);
> +			break;
>   		}
>   	}
>   	for (i = 0; i < nr_pages; i++) {
> @@ -1489,6 +1496,26 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
>   #endif
>   }
>   
> +/**
> + * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
> + * @folio:	The folio to add the mapping to
> + * @page:	The first page to add
> + * @vma:	The vm area in which the mapping is added
> + *
> + * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock.
> + */
> +void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	__folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}
> +
>   static __always_inline void __folio_remove_rmap(struct folio *folio,
>   		struct page *page, int nr_pages, struct vm_area_struct *vma,
>   		enum rmap_level level)
> @@ -1521,6 +1548,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>   		partially_mapped = nr && atomic_read(mapped);
>   		break;
>   	case RMAP_LEVEL_PMD:
> +	case RMAP_LEVEL_PUD:
>   		atomic_dec(&folio->_large_mapcount);
>   		last = atomic_add_negative(-1, &folio->_entire_mapcount);
>   		if (last) {
> @@ -1615,6 +1643,26 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
>   #endif
>   }
>   
> +/**
> + * folio_remove_rmap_pud - remove a PUD mapping from a page range of a folio
> + * @folio:	The folio to remove the mapping from
> + * @page:	The first page to remove
> + * @vma:	The vm area from which the mapping is removed
> + *
> + * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock.
> + */
> +void folio_remove_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma)
> +{
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	__folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}
> +
>   /*
>    * @arg: enum ttu_flags will be passed to this argument
>    */

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
  2024-06-27  5:22   ` Christoph Hellwig
  2024-06-27 11:33   ` Jan Kara
@ 2024-07-02  7:18   ` David Hildenbrand
  2024-07-02 10:47     ` Alistair Popple
  2024-07-02 11:46     ` Christoph Hellwig
  2 siblings, 2 replies; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02  7:18 UTC (permalink / raw)
  To: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg
  Cc: catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
	linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
	linux-mm, linux-ext4, linux-xfs, jhubbard, hch, david

On 27.06.24 02:54, Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
> 
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce dax_insert_pfn. This will take a reference on the
> underlying page much the same as vmf_insert_page, except it also
> permits upgrading an existing mapping to be writable if
> requested/possible.

We have this comparably nasty vmf_insert_mixed() that FS dax abused to 
insert into !VM_MIXED VMAs. Is that abuse now stopping and are there 
maybe ways to get rid of vmf_insert_mixed()?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-07-02  7:16   ` David Hildenbrand
@ 2024-07-02 10:19     ` Alistair Popple
  2024-07-02 11:02       ` David Hildenbrand
  2024-07-02 11:51       ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Alistair Popple @ 2024-07-02 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


David Hildenbrand <david@redhat.com> writes:

> On 27.06.24 02:54, Alistair Popple wrote:
>> Currently DAX folio/page reference counts are managed differently to
>> normal pages. To allow these to be managed the same as normal pages
>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>> and take references as it would for a normally mapped page.
>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>> which
>> simply inserts a special devmap PUD entry into the page table without
>> holding a reference to the page for the mapping.
>
> Do we really have to involve mapcounts/rmap for daxfs pages at this
> point? Or is this only "to make it look more like other pages" ?

The aim of the series is make FS DAX and other ZONE_DEVICE pages look
like other pages, at least with regards to the way they are refcounted.

At the moment they are not refcounted - instead their refcounts are
basically statically initialised to one and there are all these special
cases and functions requiring magic PTE bits (pXX_devmap) to do the
special DAX reference counting. This then adds some cruft to manage
pgmap references and to catch the 2->1 page refcount transition. All
this just goes away if we manage the page references the same as other
pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
the same as normal pages).
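
To make that concrete, here is a minimal sketch of the special put path
(simplified, the function name is made up, loosely modelled on the existing
devmap-managed put):

static void fsdax_put_page_special(struct page *page)
{
	/*
	 * Current scheme: the refcount is effectively pinned at one, so the
	 * page becomes idle on the 2->1 transition rather than 1->0, and we
	 * wake anyone waiting for the page to go idle.
	 */
	if (page_ref_dec_return(page) == 1)
		wake_up_var(&page->_refcount);
}

With the series applied the page is simply put with folio_put() and freed on
the normal 1->0 transition like any other page.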

So I think to make this work we at least need the mapcounts.

> I'm asking this because:
>
> (A) We don't support mixing PUD+PMD mappings yet. I have plans to change
>     that in the future, but for now you can only map using a single PUD
>     or by PTEs. I suspect that's good enough for now for dax fs?

Yep, that's all we support.

> (B) As long as we have subpage mapcounts, this prevents vmemmap
>     optimizations [1]. Is that only used for device-dax for now and are
>     there no plans to make use of that for fs-dax?

I don't have any plans to. This is purely focussed on refcounting pages
"like normal" so we can get rid of all the DAX special casing.

> (C) We managed without so far :)

Indeed, although Christoph has asked repeatedly ([1], [2] and likely
others) that this gets fixed and I finally got sick of it coming up
every time I need to touch something with ZONE_DEVICE pages :)

Also it removes the need for people to understand the special DAX page
refcounting scheme and ends up removing a bunch of cruft as a bonus:

 59 files changed, 485 insertions(+), 869 deletions(-)

And that's before I clean up all the pgmap reference handling. It also
removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
things could be better IMHO.

> Having that said, with folio->_large_mapcount things like
> folio_mapcount() are no longer terribly slow once we would PTE-map a
> PUD-sized folio.
>
> Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
> translating to "don't touch the memmap". I think we might want to
> tackle that first.

Ok. I'm keen to get this series finished and I don't quite get the
connection here, what needs to change there?

[1] - https://lore.kernel.org/linux-mm/20201106080322.GE31341@lst.de/
[2] - https://lore.kernel.org/linux-mm/20220209135351.GA20631@lst.de/

>
> [1] https://lwn.net/Articles/860218/
> [2]
> https://lkml.kernel.org/r/b0adbb0c-ad59-4bc5-ba0b-0af464b94557@redhat.com
>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>   include/linux/huge_mm.h |   4 ++-
>>   include/linux/rmap.h    |  14 +++++-
>>   mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++---
>>   mm/rmap.c               |  48 ++++++++++++++++++-
>>   4 files changed, 168 insertions(+), 6 deletions(-)
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2aa986a..b98a3cc 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>     vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn,
>> bool write);
>>   vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>>     enum transparent_hugepage_flag {
>>   	TRANSPARENT_HUGEPAGE_UNSUPPORTED,
>> @@ -106,6 +107,9 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
>>   #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
>>   +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
>> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
>> +
>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>     extern unsigned long transparent_hugepage_flags;
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 7229b9b..c5a0205 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
>>   enum rmap_level {
>>   	RMAP_LEVEL_PTE = 0,
>>   	RMAP_LEVEL_PMD,
>> +	RMAP_LEVEL_PUD,
>>   };
>>     static inline void __folio_rmap_sanity_checks(struct folio
>> *folio,
>> @@ -225,6 +226,13 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
>>   		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
>>   		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
>>   		break;
>> +	case RMAP_LEVEL_PUD:
>> +		/*
>> +		 * Assume that we are creating a single "entire" mapping of the folio.
>> +		 */
>> +		VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
>> +		VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
>> +		break;
>>   	default:
>>   		VM_WARN_ON_ONCE(true);
>>   	}
>> @@ -248,12 +256,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>   	folio_add_file_rmap_ptes(folio, page, 1, vma)
>>   void folio_add_file_rmap_pmd(struct folio *, struct page *,
>>   		struct vm_area_struct *);
>> +void folio_add_file_rmap_pud(struct folio *, struct page *,
>> +		struct vm_area_struct *);
>>   void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>   		struct vm_area_struct *);
>>   #define folio_remove_rmap_pte(folio, page, vma) \
>>   	folio_remove_rmap_ptes(folio, page, 1, vma)
>>   void folio_remove_rmap_pmd(struct folio *, struct page *,
>>   		struct vm_area_struct *);
>> +void folio_remove_rmap_pud(struct folio *, struct page *,
>> +		struct vm_area_struct *);
>>     void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct
>> *,
>>   		unsigned long address, rmap_t flags);
>> @@ -338,6 +350,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>>   		break;
>>   	case RMAP_LEVEL_PMD:
>> +	case RMAP_LEVEL_PUD:
>>   		atomic_inc(&folio->_entire_mapcount);
>>   		atomic_inc(&folio->_large_mapcount);
>>   		break;
>> @@ -434,6 +447,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>>   		break;
>>   	case RMAP_LEVEL_PMD:
>> +	case RMAP_LEVEL_PUD:
>>   		if (PageAnonExclusive(page)) {
>>   			if (unlikely(maybe_pinned))
>>   				return -EBUSY;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index db7946a..e1f053e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1283,6 +1283,70 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>>   	return VM_FAULT_NOPAGE;
>>   }
>>   EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
>> +
>> +/**
>> + * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
>> + * @vmf: Structure describing the fault
>> + * @pfn: pfn of the page to insert
>> + * @write: whether it's a write fault
>> + *
>> + * Return: vm_fault_t value.
>> + */
>> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	unsigned long addr = vmf->address & PUD_MASK;
>> +	pud_t *pud = vmf->pud;
>> +	pgprot_t prot = vma->vm_page_prot;
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	pud_t entry;
>> +	spinlock_t *ptl;
>> +	struct folio *folio;
>> +	struct page *page;
>> +
>> +	if (addr < vma->vm_start || addr >= vma->vm_end)
>> +		return VM_FAULT_SIGBUS;
>> +
>> +	track_pfn_insert(vma, &prot, pfn);
>> +
>> +	ptl = pud_lock(mm, pud);
>> +	if (!pud_none(*pud)) {
>> +		if (write) {
>> +			if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
>> +				WARN_ON_ONCE(!is_huge_zero_pud(*pud));
>> +				goto out_unlock;
>> +			}
>> +			entry = pud_mkyoung(*pud);
>> +			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
>> +			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
>> +				update_mmu_cache_pud(vma, addr, pud);
>> +		}
>> +		goto out_unlock;
>> +	}
>> +
>> +	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>> +	if (pfn_t_devmap(pfn))
>> +		entry = pud_mkdevmap(entry);
>> +	if (write) {
>> +		entry = pud_mkyoung(pud_mkdirty(entry));
>> +		entry = maybe_pud_mkwrite(entry, vma);
>> +	}
>> +
>> +	page = pfn_t_to_page(pfn);
>> +	folio = page_folio(page);
>> +	folio_get(folio);
>> +	folio_add_file_rmap_pud(folio, page, vma);
>> +	add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
>> +
>> +	set_pud_at(mm, addr, pud, entry);
>> +	update_mmu_cache_pud(vma, addr, pud);
>> +
>> +out_unlock:
>> +	spin_unlock(ptl);
>> +
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
>>   #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>     void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> @@ -1836,7 +1900,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>   			zap_deposited_table(tlb->mm, pmd);
>>   		spin_unlock(ptl);
>>   	} else if (is_huge_zero_pmd(orig_pmd)) {
>> -		zap_deposited_table(tlb->mm, pmd);
>> +		if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
>> +			zap_deposited_table(tlb->mm, pmd);
>>   		spin_unlock(ptl);
>>   	} else {
>>   		struct folio *folio = NULL;
>> @@ -2268,20 +2333,34 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
>>   int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>   		 pud_t *pud, unsigned long addr)
>>   {
>> +	pud_t orig_pud;
>>   	spinlock_t *ptl;
>>     	ptl = __pud_trans_huge_lock(pud, vma);
>>   	if (!ptl)
>>   		return 0;
>>   -	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
>> +	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
>>   	tlb_remove_pud_tlb_entry(tlb, pud, addr);
>> -	if (vma_is_special_huge(vma)) {
>> +	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
>>   		spin_unlock(ptl);
>>   		/* No zero page support yet */
>>   	} else {
>> -		/* No support for anonymous PUD pages yet */
>> -		BUG();
>> +		struct page *page = NULL;
>> +		struct folio *folio;
>> +
>> +		/* No support for anonymous PUD pages or migration yet */
>> +		BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
>> +
>> +		page = pud_page(orig_pud);
>> +		folio = page_folio(page);
>> +		folio_remove_rmap_pud(folio, page, vma);
>> +		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>> +		VM_BUG_ON_PAGE(!PageHead(page), page);
>> +		add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
>> +
>> +		spin_unlock(ptl);
>> +		tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
>>   	}
>>   	return 1;
>>   }
>> @@ -2289,6 +2368,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>   static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>   		unsigned long haddr)
>>   {
>> +	pud_t old_pud;
>> +
>>   	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
>>   	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>>   	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
>> @@ -2296,7 +2377,22 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>     	count_vm_event(THP_SPLIT_PUD);
>>   -	pudp_huge_clear_flush(vma, haddr, pud);
>> +	old_pud = pudp_huge_clear_flush(vma, haddr, pud);
>> +	if (is_huge_zero_pud(old_pud))
>> +		return;
>> +
>> +	if (vma_is_dax(vma)) {
>> +		struct page *page = pud_page(old_pud);
>> +		struct folio *folio = page_folio(page);
>> +
>> +		if (!folio_test_dirty(folio) && pud_dirty(old_pud))
>> +			folio_mark_dirty(folio);
>> +		if (!folio_test_referenced(folio) && pud_young(old_pud))
>> +			folio_set_referenced(folio);
>> +		folio_remove_rmap_pud(folio, page, vma);
>> +		folio_put(folio);
>> +		add_mm_counter(vma->vm_mm, mm_counter_file(folio), -HPAGE_PUD_NR);
>> +	}
>>   }
>>     void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e8fc5ec..e949e4f 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1165,6 +1165,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
>>   		atomic_add(orig_nr_pages, &folio->_large_mapcount);
>>   		break;
>>   	case RMAP_LEVEL_PMD:
>> +	case RMAP_LEVEL_PUD:
>>   		first = atomic_inc_and_test(&folio->_entire_mapcount);
>>   		if (first) {
>>   			nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
>> @@ -1306,6 +1307,12 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>   		case RMAP_LEVEL_PMD:
>>   			SetPageAnonExclusive(page);
>>   			break;
>> +		case RMAP_LEVEL_PUD:
>> +			/*
>> +			 * Keep the compiler happy, we don't support anonymous PUD mappings.
>> +			 */
>> +			WARN_ON_ONCE(1);
>> +			break;
>>   		}
>>   	}
>>   	for (i = 0; i < nr_pages; i++) {
>> @@ -1489,6 +1496,26 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
>>   #endif
>>   }
>>   +/**
>> + * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
>> + * @folio:	The folio to add the mapping to
>> + * @page:	The first page to add
>> + * @vma:	The vm area in which the mapping is added
>> + *
>> + * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
>> + *
>> + * The caller needs to hold the page table lock.
>> + */
>> +void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
>> +		struct vm_area_struct *vma)
>> +{
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	__folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
>> +#else
>> +	WARN_ON_ONCE(true);
>> +#endif
>> +}
>> +
>>   static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   		struct page *page, int nr_pages, struct vm_area_struct *vma,
>>   		enum rmap_level level)
>> @@ -1521,6 +1548,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>>   		partially_mapped = nr && atomic_read(mapped);
>>   		break;
>>   	case RMAP_LEVEL_PMD:
>> +	case RMAP_LEVEL_PUD:
>>   		atomic_dec(&folio->_large_mapcount);
>>   		last = atomic_add_negative(-1, &folio->_entire_mapcount);
>>   		if (last) {
>> @@ -1615,6 +1643,26 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
>>   #endif
>>   }
>>   +/**
>> + * folio_remove_rmap_pud - remove a PUD mapping from a page range of a folio
>> + * @folio:	The folio to remove the mapping from
>> + * @page:	The first page to remove
>> + * @vma:	The vm area from which the mapping is removed
>> + *
>> + * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
>> + *
>> + * The caller needs to hold the page table lock.
>> + */
>> +void folio_remove_rmap_pud(struct folio *folio, struct page *page,
>> +		struct vm_area_struct *vma)
>> +{
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	__folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
>> +#else
>> +	WARN_ON_ONCE(true);
>> +#endif
>> +}
>> +
>>   /*
>>    * @arg: enum ttu_flags will be passed to this argument
>>    */


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-07-02  7:18   ` David Hildenbrand
@ 2024-07-02 10:47     ` Alistair Popple
  2024-07-02 11:46     ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-07-02 10:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


David Hildenbrand <david@redhat.com> writes:

> On 27.06.24 02:54, Alistair Popple wrote:
>> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
>> creates a special devmap PTE entry for the pfn but does not take a
>> reference on the underlying struct page for the mapping. This is
>> because DAX page refcounts are treated specially, as indicated by the
>> presence of a devmap entry.
>> To allow DAX page refcounts to be managed the same as normal page
>> refcounts introduce dax_insert_pfn. This will take a reference on the
>> underlying page much the same as vmf_insert_page, except it also
>> permits upgrading an existing mapping to be writable if
>> requested/possible.
>
> We have this comparably nasty vmf_insert_mixed() that FS dax abused to
> insert into !VM_MIXED VMAs. Is that abuse now stopping and are there
> maybe ways to get rid of vmf_insert_mixed()?

It's not something I've looked at but quite possibly - there are a
couple of other users of vmf_insert_mixed() that would need to be
removed though. I just added this as dax_insert_pfn() because as an API
it is really specific to the DAX use case. For example a more general
API would likely pass a page/folio.
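
Something along these lines, say (a hypothetical sketch only; neither the
name nor the wrapper exists, and it assumes dax_insert_pfn() takes
(vmf, pfn, write) like the dax_insert_pfn_pud() quoted elsewhere in the
thread):

static vm_fault_t vmf_insert_folio(struct vm_fault *vmf, struct folio *folio,
				   bool write)
{
	/* a folio-based wrapper would hide the pfn_t conversion from callers */
	return dax_insert_pfn(vmf, page_to_pfn_t(folio_page(folio, 0)), write);
}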

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
  2024-07-01 23:47     ` Alistair Popple
@ 2024-07-02 10:48       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02 10:48 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david

On 02.07.24 01:47, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 27.06.24 02:54, Alistair Popple wrote:
>>> Longterm pinning of FS DAX pages should already be disallowed by
>>> various pXX_devmap checks. However a future change will cause these
>>> checks to be invalid for FS DAX pages so make
>>> folio_is_longterm_pinnable() return false for FS DAX pages.
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> ---
>>>    include/linux/memremap.h | 11 +++++++++++
>>>    include/linux/mm.h       |  4 ++++
>>>    2 files changed, 15 insertions(+)
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 6505713..19a448e 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -193,6 +193,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>>>    	return is_device_coherent_page(&folio->page);
>>>    }
>>>    +static inline bool is_device_dax_page(const struct page *page)
>>> +{
>>> +	return is_zone_device_page(page) &&
>>> +		page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
>>> +}
>>> +
>>> +static inline bool folio_is_device_dax(const struct folio *folio)
>>> +{
>>> +	return is_device_dax_page(&folio->page);
>>> +}
>>> +
>>>    #ifdef CONFIG_ZONE_DEVICE
>>>    void zone_device_page_init(struct page *page);
>>>    void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index b84368b..4d1cdea 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2032,6 +2032,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>>>    	if (folio_is_device_coherent(folio))
>>>    		return false;
>>>    +	/* DAX must also always allow eviction. */
>>> +	if (folio_is_device_dax(folio))
>>> +		return false;
>>> +
>>>    	/* Otherwise, non-movable zone folios can be pinned. */
>>>    	return !folio_is_zone_movable(folio);
>>>    
>>
>> Why is the check in check_vma_flags() insufficient? GUP-fast maybe?
> 
> Right. This came up when I was changing the code for GUP-fast, but also
> they shouldn't be longterm pinnable and adding the case to
> folio_is_longterm_pinnable() is an excellent way of documenting that.

Makes sense, might be worth adding that (GUP-fast and check_vma_flags() 
covering GUP-slow) to the patch description.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-07-02 10:19     ` Alistair Popple
@ 2024-07-02 11:02       ` David Hildenbrand
  2024-07-02 11:30         ` Alistair Popple
  2024-07-02 11:51       ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02 11:02 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david

On 02.07.24 12:19, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 27.06.24 02:54, Alistair Popple wrote:
>>> Currently DAX folio/page reference counts are managed differently to
>>> normal pages. To allow these to be managed the same as normal pages
>>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>>> and take references as it would for a normally mapped page.
>>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>>> which
>>> simply inserts a special devmap PUD entry into the page table without
>>> holding a reference to the page for the mapping.
>>
>> Do we really have to involve mapcounts/rmap for daxfs pages at this
>> point? Or is this only "to make it look more like other pages" ?
> 
> The aim of the series is make FS DAX and other ZONE_DEVICE pages look
> like other pages, at least with regards to the way they are refcounted.
> 
> At the moment they are not refcounted - instead their refcounts are
> basically statically initialised to one and there are all these special
> cases and functions requiring magic PTE bits (pXX_devmap) to do the
> special DAX reference counting. This then adds some cruft to manage
> pgmap references and to catch the 2->1 page refcount transition. All
> this just goes away if we manage the page references the same as other
> pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
> the same as normal pages).
> 
> So I think to make this work we at least need the mapcounts.
> 

We only really need the mapcounts if we intend to do something like 
folio_mapcount() == folio_ref_count() to detect unexpected folio 
references, and if we have to have things like folio_mapped() working. 
For now that was not required, that's why I am asking.

Background also being that in a distant future folios will be decoupled 
more from other compound pages, and only folios (or "struct anon_folio" 
/ "struct file_folio") would even have mapcounts.

For example, most stuff we map (and refcount!) via vm_insert_page() 
really must stop involving mapcounts. These won't be "ordinary" 
mapcount-tracked folios in the future, they are simply some refcounted 
pages some ordinary driver allocated.

For FS-DAX, if we'll be using the same "struct file_folio" approach as 
for ordinary pagecache memory, then this is the right thing to do here.


>> I'm asking this because:
>>
>> (A) We don't support mixing PUD+PMD mappings yet. I have plans to change
>>      that in the future, but for now you can only map using a single PUD
>>      or by PTEs. I suspect that's good enough for now for dax fs?
> 
> Yep, that's all we support.
> 
>> (B) As long as we have subpage mapcounts, this prevents vmemmap
>>      optimizations [1]. Is that only used for device-dax for now and are
>>      there no plans to make use of that for fs-dax?
> 
> I don't have any plans to. This is purely focussed on refcounting pages
> "like normal" so we can get rid of all the DAX special casing.
> 
>> (C) We managed without so far :)
> 
> Indeed, although Christoph has asked repeatedly ([1], [2] and likely
> others) that this gets fixed and I finally got sick of it coming up
> everytime I need to touch something with ZONE_DEVICE pages :)
> 
> Also it removes the need for people to understand the special DAX page
> recounting scheme and ends up removing a bunch of cruft as a bonus:
> 
>   59 files changed, 485 insertions(+), 869 deletions(-)

I'm not challenging the refcounting scheme. I'm purely asking about 
mapcount handling, which is something related but different.

> 
> And that's before I clean up all the pgmap reference handling. It also
> removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
> things could be better IMHO.
> 

Again, all nice things.

>> Having that said, with folio->_large_mapcount things like
>> folio_mapcount() are no longer terribly slow once we would PTE-map a
>> PUD-sized folio.
>>
>> Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
>> translating to "don't touch the memmap". I think we might want to
>> tackle that first.

Missed to add a pointer to [2].

> 
> Ok. I'm keen to get this series finished and I don't quite get the
> connection here, what needs to change there?

include/linux/page-flags.h

"PG_reserved is set for special pages. The "struct page" of such a page 
should in general not be touched (e.g. set dirty) except by its owner. 
Pages marked as PG_reserved include:

...

- Device memory (e.g. PMEM, DAX, HMM)
"

I think we already entered that domain with other ZONE_DEVICE pages 
being returned from vm_normal_folio(), unfortunately. But that really 
must be cleaned up for these pages to not look special anymore.

Agreed that it likely is something that is not blocking this series.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-07-02 11:02       ` David Hildenbrand
@ 2024-07-02 11:30         ` Alistair Popple
  2024-07-02 13:01           ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-07-02 11:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


David Hildenbrand <david@redhat.com> writes:

> On 02.07.24 12:19, Alistair Popple wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 27.06.24 02:54, Alistair Popple wrote:
>>>> Currently DAX folio/page reference counts are managed differently to
>>>> normal pages. To allow these to be managed the same as normal pages
>>>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>>>> and take references as it would for a normally mapped page.
>>>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>>>> which
>>>> simply inserts a special devmap PUD entry into the page table without
>>>> holding a reference to the page for the mapping.
>>>
>>> Do we really have to involve mapcounts/rmap for daxfs pages at this
>>> point? Or is this only "to make it look more like other pages" ?
>> The aim of the series is make FS DAX and other ZONE_DEVICE pages
>> look
>> like other pages, at least with regards to the way they are refcounted.
>> At the moment they are not refcounted - instead their refcounts are
>> basically statically initialised to one and there are all these special
>> cases and functions requiring magic PTE bits (pXX_devmap) to do the
>> special DAX reference counting. This then adds some cruft to manage
>> pgmap references and to catch the 2->1 page refcount transition. All
>> this just goes away if we manage the page references the same as other
>> pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
>> the same as normal pages).
>> So I think to make this work we at least need the mapcounts.
>> 
>
> We only really need the mapcounts if we intend to do something like
> folio_mapcount() == folio_ref_count() to detect unexpected folio
> references, and if we have to have things like folio_mapped()
> working. For now that was not required, that's why I am asking.

Oh I see, thanks for pointing that out. In that case I agree, we don't
need the mapcounts. As you say we don't currently need to detect
unexpected references for FS DAX and this series doesn't seek to introduce
any new behaviour/features.

> Background also being that in a distant future folios will be
> decoupled more from other compound pages, and only folios (or "struct
> anon_folio" / "struct file_folio") would even have mapcounts.
>
> For example, most stuff we map (and refcount!) via vm_insert_page()
> really must stop involving mapcounts. These won't be "ordinary"
> mapcount-tracked folios in the future, they are simply some refcounted
> pages some ordinary driver allocated.

Ok, so for FS DAX we should take a reference on the folio for the
mapping but not a mapcount?

> For FS-DAX, if we'll be using the same "struct file_folio" approach as
> for ordinary pagecache memory, then this is the right thing to do here.
>
>
>>> I'm asking this because:
>>>
>>> (A) We don't support mixing PUD+PMD mappings yet. I have plans to change
>>>      that in the future, but for now you can only map using a single PUD
>>>      or by PTEs. I suspect that's good enough for now for dax fs?
>> Yep, that's all we support.
>> 
>>> (B) As long as we have subpage mapcounts, this prevents vmemmap
>>>      optimizations [1]. Is that only used for device-dax for now and are
>>>      there no plans to make use of that for fs-dax?
>> I don't have any plans to. This is purely focussed on refcounting
>> pages
>> "like normal" so we can get rid of all the DAX special casing.
>> 
>>> (C) We managed without so far :)
>> Indeed, although Christoph has asked repeatedly ([1], [2] and likely
>> others) that this gets fixed and I finally got sick of it coming up
>> every time I need to touch something with ZONE_DEVICE pages :)
>> Also it removes the need for people to understand the special DAX
>> page
>> refcounting scheme and ends up removing a bunch of cruft as a bonus:
>>   59 files changed, 485 insertions(+), 869 deletions(-)
>
> I'm not challenging the refcounting scheme. I'm purely asking about
> mapcount handling, which is something related but different.

Got it, thanks. I hadn't quite picked up on the mapcount vs. refcount
distinction you were asking about.

>> And that's before I clean up all the pgmap reference handling. It
>> also
>> removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
>> things could be better IMHO.
>> 
>
> Again, all nice things.
>
>>> Having that said, with folio->_large_mapcount things like
>>> folio_mapcount() are no longer terribly slow once we would PTE-map a
>>> PUD-sized folio.
>>>
>>> Also, all ZONE_DEVICE pages should currently be marked PG_reserved,
>>> translating to "don't touch the memmap". I think we might want to
>>> tackle that first.
>
> Missed to add a pointer to [2].
>
>> Ok. I'm keen to get this series finished and I don't quite get the
>> connection here, what needs to change there?
>
> include/linux/page-flags.h
>
> "PG_reserved is set for special pages. The "struct page" of such a
> page should in general not be touched (e.g. set dirty) except by its
> owner. Pages marked as PG_reserved include:
>
> ...
>
> - Device memory (e.g. PMEM, DAX, HMM)
> "
>
> I think we already entered that domain with other ZONE_DEVICE pages
> being returned from vm_normal_folio(), unfortunately. But that really
> must be cleaned up for these pages to not look special anymore.
>
> Agreed that it likely is something that is not blocking this series.

Great. I'd like to see that cleaned up a little too or at least made
more understandable. The first time I looked at this it took a
surprising amount of time to figure out what constituted a "normal"
page so I'd be happy to help. This series does, in a small way, clean
that up by removing the pte_devmap special case.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-07-02  7:18   ` David Hildenbrand
  2024-07-02 10:47     ` Alistair Popple
@ 2024-07-02 11:46     ` Christoph Hellwig
  2024-07-02 11:53       ` David Hildenbrand
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2024-07-02 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
	dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Tue, Jul 02, 2024 at 09:18:31AM +0200, David Hildenbrand wrote:
> We have this comparably nasty vmf_insert_mixed() that FS dax abused to 
> insert into !VM_MIXED VMAs. Is that abuse now stopping and are there maybe 
> ways to get rid of vmf_insert_mixed()?

Unfortunately it is also used by a few drm drivers and not just DAX.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-07-02 10:19     ` Alistair Popple
  2024-07-02 11:02       ` David Hildenbrand
@ 2024-07-02 11:51       ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2024-07-02 11:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
	dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, hch, david

On Tue, Jul 02, 2024 at 08:19:01PM +1000, Alistair Popple wrote:
> > (B) As long as we have subpage mapcounts, this prevents vmemmap
> >     optimizations [1]. Is that only used for device-dax for now and are
> >     there no plans to make use of that for fs-dax?
> 
> I don't have any plans to. This is purely focussed on refcounting pages
> "like normal" so we can get rid of all the DAX special casing.
> 
> > (C) We managed without so far :)
> 
> Indeed, although Christoph has asked repeatedly ([1], [2] and likely
> others) that this gets fixed and I finally got sick of it coming up
> every time I need to touch something with ZONE_DEVICE pages :)
> 
> Also it removes the need for people to understand the special DAX page
> refcounting scheme and ends up removing a bunch of cruft as a bonus:
> 
>  59 files changed, 485 insertions(+), 869 deletions(-)
> 
> And that's before I clean up all the pgmap reference handling. It also
> removes the pXX_trans_huge and pXX_leaf distinction. So we managed, but
> things could be better IMHO.

Yes.  I can't wait for this series making the finish line.  There might
be more chance for cleanups and optimizations around ZONE_DEVICE, but
this alone is a huge step forward.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-07-02 11:46     ` Christoph Hellwig
@ 2024-07-02 11:53       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02 11:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alistair Popple, dan.j.williams, vishal.l.verma, dave.jiang,
	logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
	dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, david

On 02.07.24 13:46, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 09:18:31AM +0200, David Hildenbrand wrote:
>> We have this comparably nasty vmf_insert_mixed() that FS dax abused to
>> insert into !VM_MIXED VMAs. Is that abuse now stopping and are there maybe
>> ways to get rid of vmf_insert_mixed()?
> 
> Unfortunately it is also used by a few drm drivers and not just DAX.

At least they all seem to set VM_MIXED:

* fs/cramfs/inode.c does
* drivers/gpu/drm/gma500/fbdev.c does
* drivers/gpu/drm/omapdrm/omap_gem.c does

Only DAX (including drivers/dax/device.c) doesn't.

VM_MIXEDMAP handling for DAX was changed in

commit e1fb4a0864958fac2fb1b23f9f4562a9f90e3e8f
Author: Dave Jiang <dave.jiang@intel.com>
Date:   Fri Aug 17 15:43:40 2018 -0700

     dax: remove VM_MIXEDMAP for fsdax and device dax

After prepared by

commit 785a3fab4adbf91b2189c928a59ae219c54ba95e
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Mon Oct 23 07:20:00 2017 -0700

     mm, dax: introduce pfn_t_special()

     In support of removing the VM_MIXEDMAP indication from DAX VMAs,
     introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
     should be used for DAX ptes.

I wonder if there are ways forward to either remove vmf_insert_mixed() 
or at least require it (as the name suggests) to have VM_MIXEDMAP set.
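
The latter could be as simple as the following (an assumption, not an
existing change; __vm_insert_mixed() is the internal helper in mm/memory.c):

vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
		pfn_t pfn)
{
	/* insist on VM_MIXEDMAP, as the function name suggests */
	if (WARN_ON_ONCE(!(vma->vm_flags & VM_MIXEDMAP)))
		return VM_FAULT_SIGBUS;

	return __vm_insert_mixed(vma, addr, pfn, false);
}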

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages
  2024-07-02 11:30         ` Alistair Popple
@ 2024-07-02 13:01           ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2024-07-02 13:01 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david

On 02.07.24 13:30, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 02.07.24 12:19, Alistair Popple wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 27.06.24 02:54, Alistair Popple wrote:
>>>>> Currently DAX folio/page reference counts are managed differently to
>>>>> normal pages. To allow these to be managed the same as normal pages
>>>>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>>>>> and take references as it would for a normally mapped page.
>>>>> This is distinct from the current mechanism, vmf_insert_pfn_pud,
>>>>> which
>>>>> simply inserts a special devmap PUD entry into the page table without
>>>>> holding a reference to the page for the mapping.
>>>>
>>>> Do we really have to involve mapcounts/rmap for daxfs pages at this
>>>> point? Or is this only "to make it look more like other pages" ?
>>> The aim of the series is make FS DAX and other ZONE_DEVICE pages
>>> look
>>> like other pages, at least with regards to the way they are refcounted.
>>> At the moment they are not refcounted - instead their refcounts are
>>> basically statically initialised to one and there are all these special
>>> cases and functions requiring magic PTE bits (pXX_devmap) to do the
>>> special DAX reference counting. This then adds some cruft to manage
>>> pgmap references and to catch the 2->1 page refcount transition. All
>>> this just goes away if we manage the page references the same as other
>>> pages (and indeed we already manage DEVICE_PRIVATE and COHERENT pages
>>> the same as normal pages).
>>> So I think to make this work we at least need the mapcounts.
>>>
>>
>> We only really need the mapcounts if we intend to do something like
>> folio_mapcount() == folio_ref_count() to detect unexpected folio
>> references, and if we have to have things like folio_mapped()
>> working. For now that was not required, that's why I am asking.
> 
> Oh I see, thanks for pointing that out. In that case I agree, we don't
> need the mapcounts. As you say we don't currently need to detect
> unexpected references for FS DAX and this series doesn't seek to introduce
> any new behaviour/features.
> 
>> Background also being that in a distant future folios will be
>> decoupled more from other compound pages, and only folios (or "struct
>> anon_folio" / "struct file_folio") would even have mapcounts.
>>
>> For example, most stuff we map (and refcount!) via vm_insert_page()
>> really must stop involving mapcounts. These won't be "ordinary"
>> mapcount-tracked folios in the future, they are simply some refcounted
>> pages some ordinary driver allocated.
> 
> Ok, so for FS DAX we should take a reference on the folio for the
> mapping but not a mapcount?

That's what we could do, yes. But if we really just want to track page 
table mappings like we do with other folios (anon/pagecache/shmem), and 
want to keep doing that in the future, then maybe we should just do it 
right away. The rmap code you add will be required in the future either way.

Using that code for device dax would likely be impossible right now due 
to the vmemmap optimizations I guess, if we would end up writing subpage 
mapcounts. But for FSDAX purposes (no vmemmap optimization) it would work.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-06-27  0:54 ` [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code Alistair Popple
@ 2024-07-05 14:24   ` Peter Xu
  2024-07-09  4:07     ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-07-05 14:24 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alex Williamson

Hi, Alistair,

On Thu, Jun 27, 2024 at 10:54:26AM +1000, Alistair Popple wrote:
> Now that DAX is managing page reference counts the same as normal
> pages there are no callers for vmf_insert_pXd functions so remove
> them.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  include/linux/huge_mm.h |   2 +-
>  mm/huge_memory.c        | 165 +-----------------------------------------
>  2 files changed, 167 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9207d8e..0fb6bff 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
>  		    unsigned long cp_flags);
>  
> -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);

There's a plan to support huge pfnmaps in VFIO, which may still make good
use of these functions.  I think it's fine to remove them but it may mean
we'll need to add them back when supporting pfnmaps with no memmap.

Is it still possible to make the old API generic, so that it services the
new dax refcount plan but at the same time works for pfn injections when
there's no page struct?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 13/13] mm: Remove devmap related functions and page table bits
  2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
  2024-06-27 23:04   ` kernel test robot
  2024-06-28  2:12   ` kernel test robot
@ 2024-07-08 11:35   ` Will Deacon
  2 siblings, 0 replies; 54+ messages in thread
From: Will Deacon @ 2024-07-08 11:35 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david

On Thu, Jun 27, 2024 at 10:54:28AM +1000, Alistair Popple wrote:
> Now that DAX and all other reference counts to ZONE_DEVICE pages are
> managed normally there is no need for the special devmap PTE/PMD/PUD
> page table bits. So drop all references to these, freeing up a
> software defined page table bit on architectures supporting it.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>  Documentation/mm/arch_pgtable_helpers.rst     |  6 +--
>  arch/arm64/Kconfig                            |  1 +-
>  arch/arm64/include/asm/pgtable-prot.h         |  1 +-
>  arch/arm64/include/asm/pgtable.h              | 24 +--------

Not only do you exclusively remove code, but you also give us back a
pte bit! What's not to like?

Acked-by: Will Deacon <will@kernel.org> # arm64

Will

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-07-05 14:24   ` Peter Xu
@ 2024-07-09  4:07     ` Alistair Popple
  2024-07-09 15:56       ` Peter Xu
  0 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-07-09  4:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alex Williamson


Peter Xu <peterx@redhat.com> writes:

> Hi, Alistair,
>
> On Thu, Jun 27, 2024 at 10:54:26AM +1000, Alistair Popple wrote:
>> Now that DAX is managing page reference counts the same as normal
>> pages there are no callers for vmf_insert_pXd functions so remove
>> them.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |   2 +-
>>  mm/huge_memory.c        | 165 +-----------------------------------------
>>  2 files changed, 167 deletions(-)
>> 
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 9207d8e..0fb6bff 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
>>  		    unsigned long cp_flags);
>>  
>> -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>>  vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>>  vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>
> There's a plan to support huge pfnmaps in VFIO, which may still make good
> use of these functions.  I think it's fine to remove them but it may mean
> we'll need to add them back when supporting pfnmaps with no memmap.

I'm ok with that. If we need them back in future it shouldn't be too
hard to add them back again. I just couldn't find any callers of them
once DAX stopped using them and the usual policy is to remove unused
functions.

> Is it still possible to make the old API generic enough to service the new
> dax refcount plan while also working for pfn injections when there's no
> page struct?

I don't think so - this new dax refcount plan relies on having a struct
page to take references on so I don't think it makes much sense to
combine it with something that doesn't have a struct page. It sounds
like the situation is the analogue of vm_insert_page()
vs. vmf_insert_pfn() - it's possible for both to exist but there's not
really anything that can be shared between the two APIs as one has a
page and the other is just a raw PFN.
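
For reference, the two existing PTE-level prototypes being contrasted are
roughly the following (both are long-standing kernel APIs, quoted only to
illustrate the page-vs-raw-PFN split):

	/* page-backed: takes a reference on the struct page */
	int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
			   struct page *page);

	/* raw PFN: no struct page, no refcounting */
	vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma,
				  unsigned long addr, unsigned long pfn);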

> Thanks,


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-07-09  4:07     ` Alistair Popple
@ 2024-07-09 15:56       ` Peter Xu
  2024-07-12  2:40         ` Alistair Popple
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Xu @ 2024-07-09 15:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alex Williamson

On Tue, Jul 09, 2024 at 02:07:31PM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > Hi, Alistair,
> >
> > On Thu, Jun 27, 2024 at 10:54:26AM +1000, Alistair Popple wrote:
> >> Now that DAX is managing page reference counts the same as normal
> >> pages there are no callers for vmf_insert_pXd functions so remove
> >> them.
> >> 
> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> ---
> >>  include/linux/huge_mm.h |   2 +-
> >>  mm/huge_memory.c        | 165 +-----------------------------------------
> >>  2 files changed, 167 deletions(-)
> >> 
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 9207d8e..0fb6bff 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> >>  		    unsigned long cp_flags);
> >>  
> >> -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> >> -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> >>  vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> >>  vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> >
> > There's a plan to support huge pfnmaps in VFIO, which may still make good
> > use of these functions.  I think it's fine to remove them but it may mean
> > we'll need to add them back when supporting pfnmaps with no memmap.
> 
> I'm ok with that. If we need them back in future it shouldn't be too
> hard to add them back again. I just couldn't find any callers of them
> once DAX stopped using them and the usual policy is to remove unused
> functions.

True.  Currently the pmd/pud helpers are only used in dax.

> 
> > Is it still possible to make the old API generic enough to service the new
> > dax refcount plan while also working for pfn injections when there's no
> > page struct?
> 
> I don't think so - this new dax refcount plan relies on having a struct
> page to take references on so I don't think it makes much sense to
> combine it with something that doesn't have a struct page. It sounds
> like the situation is the analogue of vm_insert_page()
> vs. vmf_insert_pfn() - it's possible for both to exist but there's not
> really anything that can be shared between the two APIs as one has a
> page and the other is just a raw PFN.

I still think most of the code could be shared, e.g. most of the sanity
checks, pgtable injections, pgtable deposits (for pmd) and so on.

To be explicit, I wonder whether something like the diff below would be
applicable on top of the patch "huge_memory: Allow mappings of PMD sized
pages" in this series, which introduced dax_insert_pfn_pmd() for dax:

$ diff origin new
1c1
< vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
---
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
55,58c55,60
<       folio = page_folio(page);
<       folio_get(folio);
<       folio_add_file_rmap_pmd(folio, page, vma);
<       add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
---
>         if (page) {
>                 folio = page_folio(page);
>                 folio_get(folio);
>                 folio_add_file_rmap_pmd(folio, page, vma);
>                 add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
>         }

Most of the rest looks very similar to what pfn injections would need, and
in our PoC we're using vmf_insert_pfn_pmd/pud().

That also makes me wonder whether it'll be easier to implement the new dax
support for page structs on top of vmf_insert_pfn_pmd/pud, rather than
removing the old ones first and then adding the new ones.  Maybe it'll
reduce code churn, and would that also make reviews easier?

It's also possible I missed something important so the old function must be
removed.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-07-09 15:56       ` Peter Xu
@ 2024-07-12  2:40         ` Alistair Popple
  2024-07-12 15:52           ` Peter Xu
  0 siblings, 1 reply; 54+ messages in thread
From: Alistair Popple @ 2024-07-12  2:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alex Williamson


Peter Xu <peterx@redhat.com> writes:

> On Tue, Jul 09, 2024 at 02:07:31PM +1000, Alistair Popple wrote:
>> 
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > Hi, Alistair,
>> >
>> > On Thu, Jun 27, 2024 at 10:54:26AM +1000, Alistair Popple wrote:
>> >> Now that DAX is managing page reference counts the same as normal
>> >> pages there are no callers for vmf_insert_pXd functions so remove
>> >> them.
>> >> 
>> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> >> ---
>> >>  include/linux/huge_mm.h |   2 +-
>> >>  mm/huge_memory.c        | 165 +-----------------------------------------
>> >>  2 files changed, 167 deletions(-)
>> >> 
>> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> >> index 9207d8e..0fb6bff 100644
>> >> --- a/include/linux/huge_mm.h
>> >> +++ b/include/linux/huge_mm.h
>> >> @@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> >>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
>> >>  		    unsigned long cp_flags);
>> >>  
>> >> -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> >> -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>> >>  vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> >>  vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>> >
>> > There's a plan to support huge pfnmaps in VFIO, which may still make good
>> > use of these functions.  I think it's fine to remove them but it may mean
>> > we'll need to add them back when supporting pfnmaps with no memmap.
>> 
>> I'm ok with that. If we need them back in future it shouldn't be too
>> hard to add them back again. I just couldn't find any callers of them
>> once DAX stopped using them and the usual policy is to remove unused
>> functions.
>
> True.  Currently the pmd/pud helpers are only used in dax.
>
>> 
>> > Is it still possible to make the old API generic enough to service the new
>> > dax refcount plan while also working for pfn injections when there's no
>> > page struct?
>> 
>> I don't think so - this new dax refcount plan relies on having a struct
>> page to take references on so I don't think it makes much sense to
>> combine it with something that doesn't have a struct page. It sounds
>> like the situation is the analogue of vm_insert_page()
>> vs. vmf_insert_pfn() - it's possible for both to exist but there's not
>> really anything that can be shared between the two APIs as one has a
>> page and the other is just a raw PFN.
>
> I still think most of the code could be shared, e.g. most of the sanity
> checks, pgtable injections, pgtable deposits (for pmd) and so on.

Yeah, it was mostly the BUG_ON's that weren't applicable once pXd_devmap
went away.

> To be explicit, I wonder whether something like the diff below would be
> applicable on top of the patch "huge_memory: Allow mappings of PMD sized
> pages" in this series, which introduced dax_insert_pfn_pmd() for dax:
>
> $ diff origin new
> 1c1
> < vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> ---
>> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> 55,58c55,60
> <       folio = page_folio(page);
> <       folio_get(folio);
> <       folio_add_file_rmap_pmd(folio, page, vma);
> <       add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> ---
>>         if (page) {
>>                 folio = page_folio(page);
>>                 folio_get(folio);
>>                 folio_add_file_rmap_pmd(folio, page, vma);
>>                 add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
>>         }

We get the page from calling pfn_t_to_page(pfn). This is safe for the
DAX case but is it safe to use a page returned by this more generally?

From an API perspective it would make more sense for the DAX code to
pass the page rather than the pfn. I didn't do that because device DAX
just had the PFN and this was DAX-specific code. But if we want to make
it generic I'd rather have callers pass the page in.

Of course that probably doesn't help you, because then the call would be
vmf_insert_page_pmd() rather than a raw pfn, but as you point out there
might be some common code we could share.
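
To illustrate, the two shapes being weighed are roughly the following; the
second prototype is only hypothetical, nothing in this series implements it:

	/* what this series adds: the caller passes a raw pfn */
	vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);

	/* hypothetical page-based variant sketched above */
	vm_fault_t vmf_insert_page_pmd(struct vm_fault *vmf, struct page *page,
				       bool write);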

>
> Most of the rest looks very similar to what pfn injections would need, and
> in our PoC we're using vmf_insert_pfn_pmd/pud().

Do you have the PoC posted anywhere so I can get an understanding of how
this might be used?

> That also makes me wonder whether it'll be easier to implement the new dax
> support for page structs on top of vmf_insert_pfn_pmd/pud, rather than
> removing the old ones first and then adding the new ones.  Maybe it'll
> reduce code churn, and would that also make reviews easier?

Yeah, that's a good observation. I think it was just a quirk of how I
was developing this and also not caring about the PFN case so I'll see
what that looks like.

> It's also possible I missed something important so the old function must be
> removed.
>
> Thanks,


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code
  2024-07-12  2:40         ` Alistair Popple
@ 2024-07-12 15:52           ` Peter Xu
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Xu @ 2024-07-12 15:52 UTC (permalink / raw)
  To: Alistair Popple
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david, Alex Williamson

On Fri, Jul 12, 2024 at 12:40:39PM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Jul 09, 2024 at 02:07:31PM +1000, Alistair Popple wrote:
> >> 
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > Hi, Alistair,
> >> >
> >> > On Thu, Jun 27, 2024 at 10:54:26AM +1000, Alistair Popple wrote:
> >> >> Now that DAX is managing page reference counts the same as normal
> >> >> pages there are no callers for vmf_insert_pXd functions so remove
> >> >> them.
> >> >> 
> >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> >> ---
> >> >>  include/linux/huge_mm.h |   2 +-
> >> >>  mm/huge_memory.c        | 165 +-----------------------------------------
> >> >>  2 files changed, 167 deletions(-)
> >> >> 
> >> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> >> index 9207d8e..0fb6bff 100644
> >> >> --- a/include/linux/huge_mm.h
> >> >> +++ b/include/linux/huge_mm.h
> >> >> @@ -37,8 +37,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >> >>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> >> >>  		    unsigned long cp_flags);
> >> >>  
> >> >> -vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> >> >> -vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> >> >>  vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> >> >>  vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> >> >
> >> > There's a plan to support huge pfnmaps in VFIO, which may still make good
> >> > use of these functions.  I think it's fine to remove them but it may mean
> >> > we'll need to add them back when supporting pfnmaps with no memmap.
> >> 
> >> I'm ok with that. If we need them back in future it shouldn't be too
> >> hard to add them back again. I just couldn't find any callers of them
> >> once DAX stopped using them and the usual policy is to remove unused
> >> functions.
> >
> > True.  Currently the pmd/pud helpers are only used in dax.
> >
> >> 
> >> > Is it still possible to make the old API generic enough to service the new
> >> > dax refcount plan while also working for pfn injections when there's no
> >> > page struct?
> >> 
> >> I don't think so - this new dax refcount plan relies on having a struct
> >> page to take references on so I don't think it makes much sense to
> >> combine it with something that doesn't have a struct page. It sounds
> >> like the situation is the analogue of vm_insert_page()
> >> vs. vmf_insert_pfn() - it's possible for both to exist but there's not
> >> really anything that can be shared between the two APIs as one has a
> >> page and the other is just a raw PFN.
> >
> > I still think most of the code could be shared, e.g. most of the sanity
> > checks, pgtable injections, pgtable deposits (for pmd) and so on.
> 
> Yeah, it was mostly the BUG_ON's that weren't applicable once pXd_devmap
> went away.
> 
> > To be explicit, I wonder whether something like the diff below would be
> > applicable on top of the patch "huge_memory: Allow mappings of PMD sized
> > pages" in this series, which introduced dax_insert_pfn_pmd() for dax:
> >
> > $ diff origin new
> > 1c1
> > < vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > ---
> >> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > 55,58c55,60
> > <       folio = page_folio(page);
> > <       folio_get(folio);
> > <       folio_add_file_rmap_pmd(folio, page, vma);
> > <       add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> > ---
> >>         if (page) {
> >>                 folio = page_folio(page);
> >>                 folio_get(folio);
> >>                 folio_add_file_rmap_pmd(folio, page, vma);
> >>                 add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> >>         }
> 
> We get the page from calling pfn_t_to_page(pfn). This is safe for the
> DAX case but is it safe to use a page returned by this more generally?

Good question.  I thought it should work when the caller doesn't set any
bit in PFN_FLAGS_MASK, but it turns out that's not the case?  As I just
noticed:

static inline bool pfn_t_has_page(pfn_t pfn)
{
	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
}

So it looks like the "no PFN_FLAGS" case also falls into this category of
"(pfn.val & PFN_DEV) == 0"...

I'm not sure whether my understanding is correct, though.  Maybe we'd want
to double check with pfn_valid() when it's a generic function.
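
Roughly, for a generic helper, the extra check might look like this (an
untested sketch only):

	struct page *page = NULL;

	/* only treat the pfn as page-backed when the flags say so and
	 * there is actually a memmap entry behind it */
	if (pfn_t_has_page(pfn) && pfn_valid(pfn_t_to_pfn(pfn)))
		page = pfn_t_to_page(pfn);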

> 
> From an API perspective it would make more sense for the DAX code to
> pass the page rather than the pfn. I didn't do that because device DAX
> just had the PFN and this was DAX-specific code. But if we want to make
> it generic I'd rather have callers pass the page in.
> 
> Of course that probably doesn't help you, because then the call would be
> vmf_insert_page_pmd() rather than a raw pfn, but as you point out there
> might be some common code we could share.

It'll be fine if it needs a page *; it'll just be NULL for VFIO.

So far it looks cleaner to me if it takes the pfn anyway, as that's what
the pgtable entry contains.  But I'd trust you more on what it should look
like, as I didn't read the whole series here.

> 
> >
> > Most of the rest looks very similar to what pfn injections would need, and
> > in our PoC we're using vmf_insert_pfn_pmd/pud().
> 
> Do you have the PoC posted anywhere so I can get an understanding of how
> this might be used?

https://github.com/xzpeter/linux/commits/vfio-pfnmap-all/

Specifically Alex's commit here:

https://github.com/xzpeter/linux/commit/afd05f1082bc78738e280f1fc1937da52b2572ed

Just a note that it's still a work in progress.  Alex did run it through (not
this tree, but an older one) and it works pretty well so far.

I think it's because so far nothing involves the pfn flags; the only place
that does is (taking pmd as an example):

insert_pfn_pmd():
	if (!pmd_none(*pmd)) {
		if (write) {
			if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
				goto out_unlock;
			}
			entry = pmd_mkyoung(*pmd);
			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
				update_mmu_cache_pmd(vma, addr, pmd);
		}

		goto out_unlock;
	}

But for VFIO it'll definitely be pmd_none() here, so I assume the whole
path ignores the pfn flags so far.

> 
> > That also makes me wonder whether it'll be easier to implement the new dax
> > support for page structs on top of vmf_insert_pfn_pmd/pud, rather than
> > removing the old ones first and then adding the new ones.  Maybe it'll
> > reduce code churn, and would that also make reviews easier?
> 
> Yeah, that's a good observation. I think it was just a quirk of how I
> was developing this and also not caring about the PFN case so I'll see
> what that looks like.

Great!  I hope it'll reduce the diff for this series too, so it could be a
win-win.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/13] fs/dax: Properly refcount fs dax pages
  2024-06-27  5:44   ` Christoph Hellwig
@ 2024-09-06  6:00     ` Alistair Popple
  0 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-09-06  6:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas,
	jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
	ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
	linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
	linux-cxl, linux-fsdevel, linux-mm, linux-ext4, linux-xfs,
	jhubbard, david


Christoph Hellwig <hch@lst.de> writes:

>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index eb61598..b7a31ae 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
>>  		return VM_FAULT_SIGBUS;
>>  	}
>>  
>> -	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
>> +	pfn = phys_to_pfn_t(phys, 0);
>>  
>>  	dax_set_mapping(vmf, pfn, fault_size);
>>  
>> -	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
>> +	return dax_insert_pfn(vmf->vma, vmf->address, pfn, vmf->flags & FAULT_FLAG_WRITE);
>
> Plenty overly long lines here and later.
>
> Q: should dax_insert_pfn take a vm_fault structure instead of the vma?
> Or are the potential use cases that aren't from the fault path?

Nope, good idea. I will update it to take a vm_fault struct for the next
version.

> Similarly, instead of the bool write, passing the fault flags might
> actually make things more readable than the bool.
>
> Also at least currently it seems like there are no modular users despite
> the export, or am I missing something?

It gets used in drivers/dax/device.c which I think is built into
device_dax.ko:

obj-$(CONFIG_DEV_DAX) += device_dax.o

...

device_dax-y := device.o

>>  {
>> +	/*
>> +	 * Make sure we flush any cached data to the page now that it's free.
>> +	 */
>> +	if (PageDirty(page))
>> +		dax_flush(NULL, page_address(page), page_size(page));
>> +
>
> Adding the magic dax_dev == NULL case to dax_flush and going through it
> vs just calling arch_wb_cache_pmem directly here seems odd.
>
> But I also don't quite understand how it is related to the rest
> of the patch anyway.

Yeah, that should be unnecessary as it gets called elsewhere as needed, so
I will remove it.

>>  		if (!pmd_present(*pmd))
>>  			goto out;
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index b7e1599..f11ee0d 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -1016,7 +1016,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>>  	 */
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
>>  	    pgmap->type == MEMORY_DEVICE_COHERENT ||
>> -	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
>> +	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA ||
>> +	    pgmap->type == MEMORY_DEVICE_FS_DAX)
>>  		set_page_count(page, 0);
>>  }
>
> So we'll skip this for MEMORY_DEVICE_GENERIC only.  Does anyone remember
> if that's actively harmful or just not needed?  If the latter it might
> be simpler to just set the page count unconditionally here.

Yeah, I'm not sure, but the switch statement you suggested at least makes
this much clearer. Once I get this series finished I can chase down the
MEMORY_DEVICE_GENERIC differences. I suspect we can just do it
unconditionally.
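
A rough sketch of that switch, only to show the shape (whether
MEMORY_DEVICE_GENERIC can join the zero-refcount cases is exactly the open
question above):

	switch (pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
	case MEMORY_DEVICE_PCI_P2PDMA:
	case MEMORY_DEVICE_FS_DAX:
		/* refcounts start at zero; pages are "free" until mapped */
		set_page_count(page, 0);
		break;
	default:
		/* MEMORY_DEVICE_GENERIC left as-is for now */
		break;
	}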

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 06/13] mm/memory: Add dax_insert_pfn
  2024-06-27 11:33   ` Jan Kara
@ 2024-09-06  6:21     ` Alistair Popple
  0 siblings, 0 replies; 54+ messages in thread
From: Alistair Popple @ 2024-09-06  6:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: dan.j.williams, vishal.l.verma, dave.jiang, logang, bhelgaas, jgg,
	catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
	willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
	linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
	linux-fsdevel, linux-mm, linux-ext4, linux-xfs, jhubbard, hch,
	david


Jan Kara <jack@suse.cz> writes:

> On Thu 27-06-24 10:54:21, Alistair Popple wrote:
>> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
>> creates a special devmap PTE entry for the pfn but does not take a
>> reference on the underlying struct page for the mapping. This is
>> because DAX page refcounts are treated specially, as indicated by the
>> presence of a devmap entry.
>> 
>> To allow DAX page refcounts to be managed the same as normal page
>> refcounts introduce dax_insert_pfn. This will take a reference on the
>> underlying page much the same as vmf_insert_page, except it also
>> permits upgrading an existing mapping to be writable if
>> requested/possible.
>> 
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> Overall this looks good to me. Some comments below.
>
>> ---
>>  include/linux/mm.h |  4 ++-
>>  mm/memory.c        | 79 ++++++++++++++++++++++++++++++++++++++++++-----
>>  2 files changed, 76 insertions(+), 7 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9a5652c..b84368b 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1080,6 +1080,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
>>  struct mmu_gather;
>>  struct inode;
>>  
>> +extern void prep_compound_page(struct page *page, unsigned int order);
>> +
>
> You don't seem to use this function in this patch?

Thanks, a bad rebase split this up. It belongs later in the series.

>> diff --git a/mm/memory.c b/mm/memory.c
>> index ce48a05..4f26a1f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1989,14 +1989,42 @@ static int validate_page_before_insert(struct page *page)
>>  }
>>  
>>  static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>> -			unsigned long addr, struct page *page, pgprot_t prot)
>> +			unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite)
>>  {
>>  	struct folio *folio = page_folio(page);
>> +	pte_t entry = ptep_get(pte);
>>  
>> -	if (!pte_none(ptep_get(pte)))
>> +	if (!pte_none(entry)) {
>> +		if (mkwrite) {
>> +			/*
>> +			 * For read faults on private mappings the PFN passed
>> +			 * in may not match the PFN we have mapped if the
>> +			 * mapped PFN is a writeable COW page.  In the mkwrite
>> +			 * case we are creating a writable PTE for a shared
>> +			 * mapping and we expect the PFNs to match. If they
>> +			 * don't match, we are likely racing with block
>> +			 * allocation and mapping invalidation so just skip the
>> +			 * update.
>> +			 */
>> +			if (pte_pfn(entry) != page_to_pfn(page)) {
>> +				WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
>> +				return -EFAULT;
>> +			}
>> +			entry = maybe_mkwrite(entry, vma);
>> +			entry = pte_mkyoung(entry);
>> +			if (ptep_set_access_flags(vma, addr, pte, entry, 1))
>> +				update_mmu_cache(vma, addr, pte);
>> +			return 0;
>> +		}
>>  		return -EBUSY;
>
> If you do this like:
>
> 		if (!mkwrite)
> 			return -EBUSY;
>
> You can reduce indentation of the big block and also make the flow more
> obvious...

Good idea.

>> +	}
>> +
>>  	/* Ok, finally just insert the thing.. */
>>  	folio_get(folio);
>> +	if (mkwrite)
>> +		entry = maybe_mkwrite(mk_pte(page, prot), vma);
>> +	else
>> +		entry = mk_pte(page, prot);
>
> I'd prefer:
>
> 	entry = mk_pte(page, prot);
> 	if (mkwrite)
> 		entry = maybe_mkwrite(entry, vma);
>
> but I don't insist. Also insert_pfn() additionally has pte_mkyoung() and
> pte_mkdirty(). Why was it left out here?

An oversight by me, thanks for pointing it out!
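
For completeness, a rough sketch of how the tail of
insert_page_into_pte_locked() might read with both suggestions and the
insert_pfn()-style young/dirty handling folded in (illustrative only, not
the final patch):

	/* Ok, finally just insert the thing.. */
	folio_get(folio);
	inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
	folio_add_file_rmap_pte(folio, page, vma);

	entry = mk_pte(page, prot);
	if (mkwrite) {
		entry = pte_mkyoung(entry);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	}
	set_pte_at(vma->vm_mm, addr, pte, entry);
	return 0;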

> 								Honza


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2024-09-06  6:25 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-27  0:54 [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-06-27  0:54 ` [PATCH 01/13] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-06-27  6:36   ` Dan Williams
2024-06-27  0:54 ` [PATCH 02/13] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-06-27  5:30   ` Christoph Hellwig
2024-06-29 21:28   ` Bjorn Helgaas
2024-06-27  0:54 ` [PATCH 03/13] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-06-27  5:31   ` Christoph Hellwig
2024-06-27  0:54 ` [PATCH 04/13] fs/dax: Add dax_page_free callback Alistair Popple
2024-06-27  5:33   ` Christoph Hellwig
2024-06-27 23:48     ` Alistair Popple
2024-06-27  0:54 ` [PATCH 05/13] mm: Allow compound zone device pages Alistair Popple
2024-06-27  5:35   ` Christoph Hellwig
2024-06-27  0:54 ` [PATCH 06/13] mm/memory: Add dax_insert_pfn Alistair Popple
2024-06-27  5:22   ` Christoph Hellwig
2024-06-27 11:33   ` Jan Kara
2024-09-06  6:21     ` Alistair Popple
2024-07-02  7:18   ` David Hildenbrand
2024-07-02 10:47     ` Alistair Popple
2024-07-02 11:46     ` Christoph Hellwig
2024-07-02 11:53       ` David Hildenbrand
2024-06-27  0:54 ` [PATCH 07/13] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-06-27 22:26   ` kernel test robot
2024-07-02  7:16   ` David Hildenbrand
2024-07-02 10:19     ` Alistair Popple
2024-07-02 11:02       ` David Hildenbrand
2024-07-02 11:30         ` Alistair Popple
2024-07-02 13:01           ` David Hildenbrand
2024-07-02 11:51       ` Christoph Hellwig
2024-06-27  0:54 ` [PATCH 08/13] huge_memory: Allow mappings of PMD " Alistair Popple
2024-06-27  0:54 ` [PATCH 09/13] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-07-01  8:59   ` David Hildenbrand
2024-07-01 23:47     ` Alistair Popple
2024-07-02 10:48       ` David Hildenbrand
2024-06-27  0:54 ` [PATCH 10/13] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-06-27  5:44   ` Christoph Hellwig
2024-09-06  6:00     ` Alistair Popple
2024-06-27  0:54 ` [PATCH 11/13] huge_memory: Remove dead vmf_insert_pXd code Alistair Popple
2024-07-05 14:24   ` Peter Xu
2024-07-09  4:07     ` Alistair Popple
2024-07-09 15:56       ` Peter Xu
2024-07-12  2:40         ` Alistair Popple
2024-07-12 15:52           ` Peter Xu
2024-06-27  0:54 ` [PATCH 12/13] mm: Remove pXX_devmap callers Alistair Popple
2024-06-27  0:54 ` [PATCH 13/13] mm: Remove devmap related functions and page table bits Alistair Popple
2024-06-27 23:04   ` kernel test robot
2024-06-28  2:12   ` kernel test robot
2024-07-08 11:35   ` Will Deacon
2024-06-27  6:58 ` [PATCH 00/13] fs/dax: Fix FS DAX page reference counts Dan Williams
2024-06-27  7:15   ` Alistair Popple
2024-06-27 20:24     ` Dan Williams
2024-06-28  0:06       ` Alistair Popple
2024-07-01  4:24 ` Dave Chinner
2024-07-01  8:33   ` Alistair Popple

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).