* [PATCH 00/12] fs/dax: Fix FS DAX page reference counts
@ 2024-09-10 4:14 Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
` (11 more replies)
0 siblings, 12 replies; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Main updates since v1:
- Now passes the same number of xfstests with dax=always as without
  this series (some fail on my setup even without it). Thanks Dave for
  the suggestion; there were some deadlocks/crashes in v1 due to
  mishandling of write-protect faults and truncation which should now
  be fixed.
- The pgmap field has been moved to the folio (thanks Matthew for the
suggestion).
- No longer remove the vmf_insert_pfn_pXd() functions; instead
  refactor them for use by DAX, as Peter Xu suggested they will be
  needed in future and there are already patches in linux-next that
  call them.
FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular, pages are considered free when the refcount hits one rather
than zero, and a refcount is not taken when the page is mapped.
Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.
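To illustrate the difference (a rough sketch only, not code from this
series; page_ref_count() is the generic helper):

        /* Existing FS DAX convention: the pgmap infrastructure holds the
         * final reference, so a page is considered free/idle once its
         * refcount falls back to one. */
        static bool fsdax_page_idle_sketch(struct page *page)
        {
                return page_ref_count(page) == 1;
        }

        /* Normal pages (and FS DAX pages after this series): a page is
         * only free once the last reference has been dropped. */
        static bool page_free_sketch(struct page *page)
        {
                return page_ref_count(page) == 0;
        }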
By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable software-defined PTE bit on
architectures that have devmap PTE bits defined.
It also almost certainly allows further clean-up of the devmap-managed
functions, but I have left that as a future improvement.
I am not intimately familiar with the FS DAX code so would appreciate
some careful review there. In particular I have not given any thought
at all to CONFIG_FS_DAX_LIMITED.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Cc: dan.j.williams@intel.com
Cc: vishal.l.verma@intel.com
Cc: dave.jiang@intel.com
Cc: logang@deltatee.com
Cc: bhelgaas@google.com
Cc: jack@suse.cz
Cc: jgg@ziepe.ca
Cc: catalin.marinas@arm.com
Cc: will@kernel.org
Cc: mpe@ellerman.id.au
Cc: npiggin@gmail.com
Cc: dave.hansen@linux.intel.com
Cc: ira.weiny@intel.com
Cc: willy@infradead.org
Cc: djwong@kernel.org
Cc: tytso@mit.edu
Cc: linmiaohe@huawei.com
Cc: david@redhat.com
Cc: peterx@redhat.com
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: jhubbard@nvidia.com
Cc: hch@lst.de
Cc: david@fromorbit.com
Alistair Popple (12):
mm/gup.c: Remove redundant check for PCI P2PDMA page
pci/p2pdma: Don't initialise page refcount to one
fs/dax: Refactor wait for dax idle page
mm: Allow compound zone device pages
mm/memory: Add dax_insert_pfn
huge_memory: Allow mappings of PUD sized pages
huge_memory: Allow mappings of PMD sized pages
gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
mm: Update vm_normal_page() callers to accept FS DAX pages
fs/dax: Properly refcount fs dax pages
mm: Remove pXX_devmap callers
mm: Remove devmap related functions and page table bits
Documentation/mm/arch_pgtable_helpers.rst | 6 +-
arch/arm64/Kconfig | 1 +-
arch/arm64/include/asm/pgtable-prot.h | 1 +-
arch/arm64/include/asm/pgtable.h | 24 +--
arch/powerpc/Kconfig | 1 +-
arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +-
arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 52 +----
arch/powerpc/include/asm/book3s/64/radix.h | 14 +-
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/pgtable.c | 8 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/pgtable.h | 50 +----
arch/x86/include/asm/pgtable_types.h | 5 +-
arch/x86/mm/pat/memtype.c | 4 +-
drivers/dax/device.c | 12 +-
drivers/dax/super.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 +-
drivers/nvdimm/pmem.c | 4 +-
drivers/pci/p2pdma.c | 12 +-
fs/dax.c | 197 ++++++++---------
fs/ext4/inode.c | 5 +-
fs/fuse/dax.c | 4 +-
fs/fuse/virtio_fs.c | 3 +-
fs/proc/task_mmu.c | 16 +-
fs/userfaultfd.c | 2 +-
fs/xfs/xfs_inode.c | 4 +-
include/linux/dax.h | 12 +-
include/linux/huge_mm.h | 15 +-
include/linux/memremap.h | 17 +-
include/linux/migrate.h | 4 +-
include/linux/mm.h | 39 +---
include/linux/mm_types.h | 9 +-
include/linux/mmzone.h | 8 +-
include/linux/page-flags.h | 6 +-
include/linux/pfn_t.h | 20 +--
include/linux/pgtable.h | 21 +--
include/linux/rmap.h | 15 +-
lib/test_hmm.c | 3 +-
mm/Kconfig | 4 +-
mm/debug_vm_pgtable.c | 59 +-----
mm/gup.c | 177 +---------------
mm/hmm.c | 12 +-
mm/huge_memory.c | 221 +++++++++++--------
mm/internal.h | 2 +-
mm/khugepaged.c | 2 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memcontrol-v1.c | 2 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 126 +++++++----
mm/memremap.c | 53 ++---
mm/migrate_device.c | 9 +-
mm/mlock.c | 2 +-
mm/mm_init.c | 23 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pagewalk.c | 8 +-
mm/pgtable-generic.c | 7 +-
mm/rmap.c | 49 ++++-
mm/swap.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 5 +-
65 files changed, 591 insertions(+), 819 deletions(-)
base-commit: 6f1833b8208c3b9e59eff10792667b6639365146
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
` (10 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
PCI P2PDMA pages are not mapped with pXX_devmap PTEs, therefore the
check in gup_fast_devmap_leaf() is redundant. Remove it.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index d19884e..5d2fc9a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2954,11 +2954,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
break;
}
- if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
folio = try_grab_folio_fast(page, 1, flags);
if (!folio) {
gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
` (2 more replies)
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
` (9 subsequent siblings)
11 siblings, 3 replies; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
The reference counts for ZONE_DEVICE pages should be initialised by
the driver when the page is actually allocated by the driver's
allocator, not when the pages are first created. This is currently the
case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages but not
MEMORY_DEVICE_PCI_P2PDMA pages, so fix that up.
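As a rough sketch of the intended lifecycle (illustrative only; the
real changes are in the diff below):

        /* memmap initialisation: P2PDMA pages now start life with a zero
         * refcount, like other driver-allocated ZONE_DEVICE pages */
        set_page_count(page, 0);

        /* p2pmem mmap/allocation path: the driver takes the first
         * reference when it hands the freshly allocated page out */
        set_page_count(virt_to_page(kaddr), 1);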
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
drivers/pci/p2pdma.c | 6 ++++++
mm/memremap.c | 17 +++++++++++++----
mm/mm_init.c | 22 ++++++++++++++++++----
3 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13..210b9f4 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
}
/*
+ * Initialise the refcount for the freshly allocated page. As we have
+ * just allocated the page no one else should be using it.
+ */
+ set_page_count(virt_to_page(kaddr), 1);
+
+ /*
* vm_insert_page() can sleep, so a reference is taken to mapping
* such that rcu_read_unlock() can be done before inserting the
* pages
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..07bbe0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
folio->mapping = NULL;
folio->page.pgmap->ops->page_free(folio_page(folio, 0));
- if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
- folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+ switch (folio->page.pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
folio_set_count(folio, 1);
- else
- put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_PCI_P2PDMA:
+ break;
+ }
}
void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ba5607..0489820 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1015,12 +1015,26 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
/*
- * ZONE_DEVICE pages are released directly to the driver page allocator
- * which will set the page count to 1 when allocating the page.
+ * ZONE_DEVICE pages other than MEMORY_DEVICE_GENERIC and
+ * MEMORY_DEVICE_FS_DAX pages are released directly to the driver page
+ * allocator which will set the page count to 1 when allocating the
+ * page.
+ *
+ * MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX pages automatically
+ * have their refcount reset to one whenever they are freed (i.e. after
+ * their refcount drops to 0).
*/
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_COHERENT)
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ case MEMORY_DEVICE_PCI_P2PDMA:
set_page_count(page, 0);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
+ break;
+ }
}
/*
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 03/12] fs/dax: Refactor wait for dax idle page
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
` (8 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
A FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.
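The pattern each filesystem ends up with is roughly the following
(sketch based on the ext4 hunk below; dax_layout_busy_page() is the
existing helper for finding a busy page):

        do {
                struct page *page = dax_layout_busy_page(inode->i_mapping);

                if (!page)
                        return 0;

                error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
        } while (error == 0);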
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_inode.c | 4 +---
include/linux/dax.h | 8 ++++++++
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 941c1c0..367832a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3923,10 +3923,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(inode));
+ error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
} while (error == 0);
return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d..da50595 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;
*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
}
/* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f32..7e27ba1 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3071,9 +3071,7 @@ xfs_break_dax_layouts(
return 0;
*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
}
int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e332..773dfc4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -213,6 +213,14 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);
+static inline int dax_wait_page_idle(struct page *page,
+ void (cb)(struct inode *),
+ struct inode *inode)
+{
+ return ___wait_var_event(page, page_ref_count(page) == 1,
+ TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
#if IS_ENABLED(CONFIG_DAX)
int dax_read_lock(void);
void dax_read_unlock(int id);
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (2 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
` (3 more replies)
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
` (7 subsequent siblings)
11 siblings, 4 replies; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
Zone device pages are used to represent various types of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.
A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages, page->compound_head is shared with
page->pgmap.
The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages; it can be
moved into the folio, and the standard scheme can be used to find the
compound_head from a tail page.
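Concretely, the first word of the ZONE_DEVICE part of the struct page
union now does double duty: on a tail page it holds compound_head with
bit zero set, while the folio (head page) carries pgmap. A lookup
therefore always goes via the head page, which is what the new
page_dev_pagemap() helper in the diff below does:

        /* sketch of the lookup; page_folio() follows compound_head for
         * tail pages, so this works for both head and tail pages */
        return page_folio(page)->pgmap;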
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
Changes since v1:
- Move pgmap to the folio as suggested by Matthew Wilcox
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
drivers/pci/p2pdma.c | 6 +++---
include/linux/memremap.h | 6 +++---
include/linux/migrate.h | 4 ++--
include/linux/mm_types.h | 9 +++++++--
include/linux/mmzone.h | 8 +++++++-
lib/test_hmm.c | 3 ++-
mm/hmm.c | 2 +-
mm/memory.c | 4 +++-
mm/memremap.c | 14 +++++++-------
mm/migrate_device.c | 7 +++++--
mm/mm_init.c | 2 +-
12 files changed, 43 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 6fb65b0..58d308c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,8 @@ struct nouveau_dmem {
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+ return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk,
+ pagemap);
}
static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 210b9f4..a58f2c1 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -199,7 +199,7 @@ static const struct attribute_group p2pmem_group = {
static void p2pdma_page_free(struct page *page)
{
- struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1022,8 +1022,8 @@ enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
struct scatterlist *sg)
{
- if (state->pgmap != sg_page(sg)->pgmap) {
- state->pgmap = sg_page(sg)->pgmap;
+ if (state->pgmap != page_dev_pagemap(sg_page(sg))) {
+ state->pgmap = page_dev_pagemap(sg_page(sg));
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..14273e6 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
}
static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}
static inline bool is_device_coherent_page(const struct page *page)
{
return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_COHERENT;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
}
static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 002e49b..9a85a82 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -207,8 +207,8 @@ struct migrate_vma {
unsigned long end;
/*
- * Set to the owner value also stored in page->pgmap->owner for
- * migrating out of device private memory. The flags also need to
+ * Set to the owner value also stored in page_dev_pagemap(page)->owner
+ * for migrating out of device private memory. The flags also need to
* be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
* The caller should always set this field when using mmu notifier
* callbacks to avoid device MMU invalidations for device private
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8..c2f1d53 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -129,8 +129,11 @@ struct page {
unsigned long compound_head; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
- /** @pgmap: Points to the hosting device page map. */
- struct dev_pagemap *pgmap;
+ /*
+ * The first word is used for compound_head or folio
+ * pgmap
+ */
+ void *_unused;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
@@ -299,6 +302,7 @@ typedef struct {
* @_refcount: Do not access this member directly. Use folio_ref_count()
* to find how many references there are to this folio.
* @memcg_data: Memory Control Group data.
+ * @pgmap: Metadata for ZONE_DEVICE mappings
* @virtual: Virtual address in the kernel direct map.
* @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
@@ -337,6 +341,7 @@ struct folio {
/* private: */
};
/* public: */
+ struct dev_pagemap *pgmap;
};
struct address_space *mapping;
pgoff_t index;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4..e191434 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
return page_zonenum(page) == ZONE_DEVICE;
}
+static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
+{
+ WARN_ON(!is_zone_device_page(page));
+ return page_folio(page)->pgmap;
+}
+
/*
* Consecutive zone device pages should not be merged into the same sgl
* or bvec segment with other types of pages or if they belong to different
@@ -1149,7 +1155,7 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
return false;
if (!is_zone_device_page(a))
return true;
- return a->pgmap == b->pgmap;
+ return page_dev_pagemap(a) == page_dev_pagemap(b);
}
extern void memmap_init_zone_device(struct zone *, unsigned long,
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e4..b072ca9 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,8 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+ return container_of(page_dev_pagemap(page), struct dmirror_chunk,
+ pagemap);
}
static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229a..a11807c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* just report the PFN.
*/
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
+ page_dev_pagemap(pfn_swap_entry_to_page(entry))->owner ==
range->dev_private_owner) {
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index c31ea30..d2785fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4024,6 +4024,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
+ struct dev_pagemap *pgmap;
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
/*
* migrate_to_ram is not yet ready to operate
@@ -4048,7 +4049,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ pgmap = page_dev_pagemap(vmf->page);
+ ret = pgmap->ops->migrate_to_ram(vmf);
put_page(vmf->page);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 07bbe0e..e885bc9 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
- !folio->page.pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!folio->pgmap->ops ||
+ !folio->pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -486,12 +486,12 @@ void free_zone_device_folio(struct folio *folio)
* to clear folio->mapping.
*/
folio->mapping = NULL;
- folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+ folio->pgmap->ops->page_free(folio_page(folio, 0));
- switch (folio->page.pgmap->type) {
+ switch (folio->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- put_dev_pagemap(folio->page.pgmap);
+ put_dev_pagemap(folio->pgmap);
break;
case MEMORY_DEVICE_FS_DAX:
@@ -514,7 +514,7 @@ void zone_device_page_init(struct page *page)
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_live(&page_dev_pagemap(page)->ref));
set_page_count(page, 1);
lock_page(page);
}
@@ -523,7 +523,7 @@ EXPORT_SYMBOL_GPL(zone_device_page_init);
#ifdef CONFIG_FS_DAX
bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
{
- if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+ if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1..9d30107 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
arch_enter_lazy_mmu_mode();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
+ struct dev_pagemap *pgmap;
unsigned long mpfn = 0, pfn;
struct folio *folio;
struct page *page;
@@ -133,9 +134,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
page = pfn_swap_entry_to_page(entry);
+ pgmap = page_dev_pagemap(page);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- page->pgmap->owner != migrate->pgmap_owner)
+ pgmap->owner != migrate->pgmap_owner)
goto next;
mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -151,12 +153,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+ pgmap = (page && is_zone_device_page(page)) ? page_dev_pagemap(page) : NULL;
if (page && !is_zone_device_page(page) &&
!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
goto next;
else if (page && is_device_coherent_page(page) &&
(!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
- page->pgmap->owner != migrate->pgmap_owner))
+ pgmap->owner != migrate->pgmap_owner))
goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 0489820..3d0611e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -996,7 +996,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* and zone_device_data. It is a bug if a ZONE_DEVICE page is
* ever freed or placed on a driver-private list.
*/
- page->pgmap = pgmap;
+ page_folio(page)->pgmap = pgmap;
page->zone_device_data = NULL;
/*
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 05/12] mm/memory: Add dax_insert_pfn
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (3 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-22 1:41 ` Dan Williams
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
` (6 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Currently, to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.
To allow DAX page refcounts to be managed the same as normal page
refcounts, introduce dax_insert_pfn. This will take a reference on the
underlying page much the same as vmf_insert_page does, except it also
permits upgrading an existing mapping to be writable if
requested/possible.
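For a fault handler the conversion looks roughly like this (sketch
only; the actual fs/dax call sites are converted later in the series):

        /* before: install a special devmap PTE, no reference taken on
         * the page backing the mapping */
        ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);

        /* after: take a reference on the underlying page and, when write
         * is set, allow an existing read-only mapping to be upgraded */
        ret = dax_insert_pfn(vmf, pfn, write);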
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Updates from v1:
- Re-arrange code in insert_page_into_pte_locked() based on comments
from Jan Kara.
- Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
---
include/linux/mm.h | 1 +-
mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 76 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b0ff06d..ae6d713 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3463,6 +3463,7 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
+vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write);
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index d2785fb..368e15d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2039,19 +2039,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma,
}
static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
- unsigned long addr, struct page *page, pgprot_t prot)
+ unsigned long addr, struct page *page,
+ pgprot_t prot, bool mkwrite)
{
struct folio *folio = page_folio(page);
+ pte_t entry = ptep_get(pte);
pte_t pteval;
- if (!pte_none(ptep_get(pte)))
- return -EBUSY;
+ if (!pte_none(entry)) {
+ if (!mkwrite)
+ return -EBUSY;
+
+ /*
+ * For read faults on private mappings the PFN passed in may not
+ * match the PFN we have mapped if the mapped PFN is a writeable
+ * COW page. In the mkwrite case we are creating a writable PTE
+ * for a shared mapping and we expect the PFNs to match. If they
+ * don't match, we are likely racing with block allocation and
+ * mapping invalidation so just skip the update.
+ */
+ if (pte_pfn(entry) != page_to_pfn(page)) {
+ WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
+ return -EFAULT;
+ }
+ entry = maybe_mkwrite(entry, vma);
+ entry = pte_mkyoung(entry);
+ if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+ update_mmu_cache(vma, addr, pte);
+ return 0;
+ }
+
/* Ok, finally just insert the thing.. */
pteval = mk_pte(page, prot);
if (unlikely(is_zero_folio(folio))) {
pteval = pte_mkspecial(pteval);
} else {
folio_get(folio);
+ /* apply the mkwrite adjustments to the pteval installed below */
+ if (mkwrite) {
+ pteval = pte_mkyoung(pteval);
+ pteval = maybe_mkwrite(pte_mkdirty(pteval), vma);
+ }
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
folio_add_file_rmap_pte(folio, page, vma);
}
@@ -2060,7 +2088,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
}
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
- struct page *page, pgprot_t prot)
+ struct page *page, pgprot_t prot, bool mkwrite)
{
int retval;
pte_t *pte;
@@ -2073,7 +2101,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot,
+ mkwrite);
pte_unmap_unlock(pte, ptl);
out:
return retval;
@@ -2087,7 +2116,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
err = validate_page_before_insert(vma, page);
if (err)
return err;
- return insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
}
/* insert_pages() amortizes the cost of spinlock operations
@@ -2223,7 +2252,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
BUG_ON(vma->vm_flags & VM_PFNMAP);
vm_flags_set(vma, VM_MIXEDMAP);
}
- return insert_page(vma, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot, false);
}
EXPORT_SYMBOL(vm_insert_page);
@@ -2503,7 +2532,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* result in pfn_t_has_page() == false.
*/
page = pfn_to_page(pfn_t_to_pfn(pfn));
- err = insert_page(vma, addr, page, pgprot);
+ err = insert_page(vma, addr, page, pgprot, mkwrite);
} else {
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
}
@@ -2516,6 +2545,44 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
return VM_FAULT_NOPAGE;
}
+vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ pgprot_t pgprot = vma->vm_page_prot;
+ unsigned long pfn = pfn_t_to_pfn(pfn_t);
+ struct page *page = pfn_to_page(pfn);
+ unsigned long addr = vmf->address;
+ int err;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ track_pfn_insert(vma, &pgprot, pfn_t);
+
+ if (!pfn_modify_allowed(pfn, pgprot))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * We refcount the page normally so make sure pfn_valid is true.
+ */
+ if (!pfn_t_valid(pfn_t))
+ return VM_FAULT_SIGBUS;
+
+ WARN_ON_ONCE(pfn_t_devmap(pfn_t));
+
+ if (WARN_ON(is_zero_pfn(pfn) && write))
+ return VM_FAULT_SIGBUS;
+
+ err = insert_page(vma, addr, page, pgprot, write);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (err < 0 && err != -EBUSY)
+ return VM_FAULT_SIGBUS;
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn);
+
vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn)
{
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (4 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-22 2:07 ` Dan Williams
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
` (5 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
and take references as it would for a normally mapped page.
This is distinct from the current mechanism, vmf_insert_pfn_pud, which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.
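For callers the conversion is roughly (sketch; the device-dax and
fs/dax call sites are updated later in the series):

        /* before: special devmap PUD entry, no reference held */
        return vmf_insert_pfn_pud(vmf, pfn, write);

        /* after: the whole PUD-sized folio is refcounted and added to
         * the file rmap, as for a normally mapped page */
        return dax_insert_pfn_pud(vmf, pfn, write);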
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
include/linux/huge_mm.h | 4 ++-
include/linux/rmap.h | 15 +++++++-
mm/huge_memory.c | 93 ++++++++++++++++++++++++++++++++++++------
mm/rmap.c | 49 ++++++++++++++++++++++-
4 files changed, 149 insertions(+), 12 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6370026..d3a1872 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
@@ -114,6 +115,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern unsigned long transparent_hugepage_flags;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 91b5935..c465694 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
enum rmap_level {
RMAP_LEVEL_PTE = 0,
RMAP_LEVEL_PMD,
+ RMAP_LEVEL_PUD,
};
static inline void __folio_rmap_sanity_checks(struct folio *folio,
@@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
break;
+ case RMAP_LEVEL_PUD:
+ /*
+ * Assume that we are creating a single "entire" mapping of the
+ * folio.
+ */
+ VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
+ VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
+ break;
default:
VM_WARN_ON_ONCE(true);
}
@@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
folio_add_file_rmap_ptes(folio, page, 1, vma)
void folio_add_file_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_add_file_rmap_pud(struct folio *, struct page *,
+ struct vm_area_struct *);
void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
struct vm_area_struct *);
#define folio_remove_rmap_pte(folio, page, vma) \
folio_remove_rmap_ptes(folio, page, 1, vma)
void folio_remove_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_remove_rmap_pud(struct folio *, struct page *,
+ struct vm_area_struct *);
void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
@@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
atomic_inc(&folio->_entire_mapcount);
atomic_inc(&folio->_large_mapcount);
break;
@@ -437,6 +451,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
if (PageAnonExclusive(page)) {
if (unlikely(maybe_pinned))
return -EBUSY;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4b45ad..e8985a4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1336,21 +1336,19 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
struct mm_struct *mm = vma->vm_mm;
pgprot_t prot = vma->vm_page_prot;
pud_t entry;
- spinlock_t *ptl;
- ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
if (write) {
if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
WARN_ON_ONCE(!is_huge_zero_pud(*pud));
- goto out_unlock;
+ return;
}
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
update_mmu_cache_pud(vma, addr, pud);
}
- goto out_unlock;
+ return;
}
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
@@ -1362,9 +1360,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
}
set_pud_at(mm, addr, pud, entry);
update_mmu_cache_pud(vma, addr, pud);
-
-out_unlock:
- spin_unlock(ptl);
}
/**
@@ -1382,6 +1377,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
unsigned long addr = vmf->address & PUD_MASK;
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
+ spinlock_t *ptl;
/*
* If we had pud_special, we could avoid all these restrictions,
@@ -1399,10 +1395,52 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
track_pfn_insert(vma, &pgprot, pfn);
+ ptl = pud_lock(vma->vm_mm, vmf->pud);
insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+ spin_unlock(ptl);
+
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+/**
+ * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
+ * @vmf: Structure describing the fault
+ * @pfn: pfn of the page to insert
+ * @write: whether it's a write fault
+ *
+ * Return: vm_fault_t value.
+ */
+vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long addr = vmf->address & PUD_MASK;
+ pud_t *pud = vmf->pud;
+ pgprot_t prot = vma->vm_page_prot;
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ struct folio *folio;
+ struct page *page;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ track_pfn_insert(vma, &prot, pfn);
+
+ ptl = pud_lock(mm, pud);
+ if (pud_none(*vmf->pud)) {
+ page = pfn_t_to_page(pfn);
+ folio = page_folio(page);
+ folio_get(folio);
+ folio_add_file_rmap_pud(folio, page, vma);
+ add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
+ }
+ insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+ spin_unlock(ptl);
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1947,7 +1985,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else if (is_huge_zero_pmd(orig_pmd)) {
- zap_deposited_table(tlb->mm, pmd);
+ if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else {
struct folio *folio = NULL;
@@ -2435,12 +2474,24 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
arch_check_zapped_pud(vma, orig_pud);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
- if (vma_is_special_huge(vma)) {
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
spin_unlock(ptl);
/* No zero page support yet */
} else {
- /* No support for anonymous PUD pages yet */
- BUG();
+ struct page *page = NULL;
+ struct folio *folio;
+
+ /* No support for anonymous PUD pages or migration yet */
+ BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
+
+ page = pud_page(orig_pud);
+ folio = page_folio(page);
+ folio_remove_rmap_pud(folio, page, vma);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+
+ spin_unlock(ptl);
+ tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
}
return 1;
}
@@ -2448,6 +2499,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
unsigned long haddr)
{
+ pud_t old_pud;
+
VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2455,7 +2508,23 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
count_vm_event(THP_SPLIT_PUD);
- pudp_huge_clear_flush(vma, haddr, pud);
+ old_pud = pudp_huge_clear_flush(vma, haddr, pud);
+ if (is_huge_zero_pud(old_pud))
+ return;
+
+ if (vma_is_dax(vma)) {
+ struct page *page = pud_page(old_pud);
+ struct folio *folio = page_folio(page);
+
+ if (!folio_test_dirty(folio) && pud_dirty(old_pud))
+ folio_mark_dirty(folio);
+ if (!folio_test_referenced(folio) && pud_young(old_pud))
+ folio_set_referenced(folio);
+ folio_remove_rmap_pud(folio, page, vma);
+ folio_put(folio);
+ add_mm_counter(vma->vm_mm, mm_counter_file(folio),
+ -HPAGE_PUD_NR);
+ }
}
void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
diff --git a/mm/rmap.c b/mm/rmap.c
index 1103a53..274641c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1180,6 +1180,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
first = atomic_inc_and_test(&folio->_entire_mapcount);
if (first) {
nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
@@ -1330,6 +1331,13 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
case RMAP_LEVEL_PMD:
SetPageAnonExclusive(page);
break;
+ case RMAP_LEVEL_PUD:
+ /*
+ * Keep the compiler happy, we don't support anonymous
+ * PUD mappings.
+ */
+ WARN_ON_ONCE(1);
+ break;
}
}
for (i = 0; i < nr_pages; i++) {
@@ -1522,6 +1530,26 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
#endif
}
+/**
+ * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
+ * @folio: The folio to add the mapping to
+ * @page: The first page to add
+ * @vma: The vm area in which the mapping is added
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
+ struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+ WARN_ON_ONCE(true);
+#endif
+}
+
static __always_inline void __folio_remove_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
enum rmap_level level)
@@ -1551,6 +1579,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
partially_mapped = nr && atomic_read(mapped);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
atomic_dec(&folio->_large_mapcount);
last = atomic_add_negative(-1, &folio->_entire_mapcount);
if (last) {
@@ -1630,6 +1659,26 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
#endif
}
+/**
+ * folio_remove_rmap_pud - remove a PUD mapping from a page range of a folio
+ * @folio: The folio to remove the mapping from
+ * @page: The first page to remove
+ * @vma: The vm area from which the mapping is removed
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_remove_rmap_pud(struct folio *folio, struct page *page,
+ struct vm_area_struct *vma)
+{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+ __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+ WARN_ON_ONCE(true);
+#endif
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (5 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-27 2:48 ` Dan Williams
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
` (4 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce dax_insert_pfn_pmd. This will map the entire PMD-sized folio
and take references as it would for a normally mapped page.
This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
include/linux/huge_mm.h | 1 +-
mm/huge_memory.c | 57 ++++++++++++++++++++++++++++++++++--------
2 files changed, 48 insertions(+), 10 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3a1872..eaf3f78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
enum transparent_hugepage_flag {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8985a4..790041e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1237,14 +1237,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
{
struct mm_struct *mm = vma->vm_mm;
pmd_t entry;
- spinlock_t *ptl;
- ptl = pmd_lock(mm, pmd);
if (!pmd_none(*pmd)) {
if (write) {
if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
- goto out_unlock;
+ return;
}
entry = pmd_mkyoung(*pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1252,7 +1250,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
- goto out_unlock;
+ return;
}
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
@@ -1271,11 +1269,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
- spin_unlock(ptl);
- if (pgtable)
- pte_free(mm, pgtable);
}
/**
@@ -1294,6 +1287,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
pgtable_t pgtable = NULL;
+ spinlock_t *ptl;
/*
* If we had pmd_special, we could avoid all these restrictions,
@@ -1316,12 +1310,55 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
}
track_pfn_insert(vma, &pgprot, pfn);
-
+ ptl = pmd_lock(vma->vm_mm, vmf->pmd);
insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
+
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long addr = vmf->address & PMD_MASK;
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pgtable_t pgtable = NULL;
+ struct folio *folio;
+ struct page *page;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+
+ track_pfn_insert(vma, &vma->vm_page_prot, pfn);
+
+ ptl = pmd_lock(mm, vmf->pmd);
+ if (pmd_none(*vmf->pmd)) {
+ page = pfn_t_to_page(pfn);
+ folio = page_folio(page);
+ folio_get(folio);
+ folio_add_file_rmap_pmd(folio, page, vma);
+ add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
+ }
+ insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
+ write, pgtable);
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(mm, pgtable);
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
+
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
{
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (6 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-25 0:17 ` Dan Williams
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
` (3 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Longterm pinning of FS DAX pages should already be disallowed by the
various pXX_devmap checks. However a future change will cause these
checks to be invalid for FS DAX pages, so make
folio_is_longterm_pinnable() return false for them explicitly.
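For callers the effect is roughly the following (illustrative sketch;
the exact error returned depends on which GUP path is taken):

        struct page *page;
        int ret;

        /* addr lies in an FS DAX mapping */
        ret = pin_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_LONGTERM, &page);
        if (ret <= 0) {
                /* the longterm pin is refused because
                 * folio_is_longterm_pinnable() now returns false */
        }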
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
include/linux/memremap.h | 11 +++++++++++
include/linux/mm.h | 4 ++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 14273e6..6a1406a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -187,6 +187,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
return is_device_coherent_page(&folio->page);
}
+static inline bool is_device_dax_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
+}
+
+static inline bool folio_is_device_dax(const struct folio *folio)
+{
+ return is_device_dax_page(&folio->page);
+}
+
#ifdef CONFIG_ZONE_DEVICE
void zone_device_page_init(struct page *page);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ae6d713..935e493 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1989,6 +1989,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
if (folio_is_device_coherent(folio))
return false;
+ /* DAX must also always allow eviction. */
+ if (folio_is_device_dax(folio))
+ return false;
+
/* Otherwise, non-movable zone folios can be pinned. */
return !folio_is_zone_movable(folio);
--
git-series 0.9.1
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (7 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-27 7:15 ` Dan Williams
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
` (2 subsequent siblings)
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Currently, if a PTE points to an FS DAX page, vm_normal_page() will
return NULL because these pages have their own special refcounting
scheme. A future change will allow FS DAX pages to be refcounted the
same as any other normal page, which means vm_normal_page() will start
returning FS DAX pages.
To avoid any change in behaviour, callers that don't expect FS DAX
pages need to check for them explicitly. As vm_normal_page() can
already return ZONE_DEVICE pages, most callers already include a check
for any ZONE_DEVICE page.
However some callers don't, so add explicit checks where required.
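The pattern added by the hunks below is simply:

        page = vm_normal_page(vma, addr, ptent);
        if (page && is_device_dax_page(page))
                page = NULL;    /* preserve the old behaviour for FS DAX */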
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
arch/x86/mm/pat/memtype.c | 4 +++-
fs/proc/task_mmu.c | 16 ++++++++++++----
mm/memcontrol-v1.c | 2 +-
3 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 1fa0bf6..eb84593 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -951,6 +951,7 @@ static void free_pfn_range(u64 paddr, unsigned long size)
static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
resource_size_t *phys)
{
+ struct folio *folio;
pte_t *ptep, pte;
spinlock_t *ptl;
@@ -960,7 +961,8 @@ static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
pte = ptep_get(ptep);
/* Never return PFNs of anon folios in COW mappings. */
- if (vm_normal_folio(vma, vma->vm_start, pte)) {
+ folio = vm_normal_folio(vma, vma->vm_start, pte);
+ if (folio && !folio_is_device_dax(folio)) {
pte_unmap_unlock(ptep, ptl);
return -EINVAL;
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5f171ad..456b010 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -816,6 +816,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (pte_present(ptent)) {
page = vm_normal_page(vma, addr, ptent);
+ if (page && is_device_dax_page(page))
+ page = NULL;
young = pte_young(ptent);
dirty = pte_dirty(ptent);
present = true;
@@ -864,6 +866,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
if (pmd_present(*pmd)) {
page = vm_normal_page_pmd(vma, addr, *pmd);
+ if (page && is_device_dax_page(page))
+ page = NULL;
present = true;
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);
@@ -1385,7 +1389,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
return false;
folio = vm_normal_folio(vma, addr, pte);
- if (!folio)
+ if (!folio || folio_is_device_dax(folio))
return false;
return folio_maybe_dma_pinned(folio);
}
@@ -1710,6 +1714,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
frame = pte_pfn(pte);
flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
+ if (page && is_device_dax_page(page))
+ page = NULL;
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
if (pte_uffd_wp(pte))
@@ -2096,7 +2102,8 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page(vma, addr, pte);
- if (page && !PageAnon(page))
+ if (page && !PageAnon(page) &&
+ !is_device_dax_page(page))
categories |= PAGE_IS_FILE;
}
@@ -2158,7 +2165,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
if (p->masks_of_interest & PAGE_IS_FILE) {
page = vm_normal_page_pmd(vma, addr, pmd);
- if (page && !PageAnon(page))
+ if (page && !PageAnon(page) &&
+ !is_device_dax_page(page))
categories |= PAGE_IS_FILE;
}
@@ -2919,7 +2927,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
return NULL;
page = vm_normal_page_pmd(vma, addr, pmd);
- if (!page)
+ if (!page || is_device_dax_page(page))
return NULL;
if (PageReserved(page))
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index b37c0d8..e16053c 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -667,7 +667,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
{
struct page *page = vm_normal_page(vma, addr, ptent);
- if (!page)
+ if (!page || is_device_dax_page(page))
return NULL;
if (PageAnon(page)) {
if (!(mc.flags & MOVE_ANON))
--
git-series 0.9.1
* [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (8 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-27 7:59 ` Dan Williams
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Currently fs dax pages are considered free when the refcount drops to
one and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
detect that these pages should not be properly refcounted, and to
detect when the refcount drops to one instead of zero.
On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.
Tracking this special behaviour requires extra PTE bits
(e.g. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped, as is currently the case for normal pages.
This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.
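For reference, the refcounting convention the fault paths follow after
this change is sketched below (illustration only; vmf, pfn and write
come from the fault handler, see dax_fault_iter() in the diff):
	struct page *page = pfn_t_to_page(pfn);
	vm_fault_t ret;

	page_ref_inc(page);			/* temporary reference held across the insert */
	ret = dax_insert_pfn(vmf, pfn, write);	/* mapping the page takes its own reference */
	put_page(page);				/* drop the temporary reference */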
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
drivers/dax/device.c | 12 +-
drivers/dax/super.c | 2 +-
drivers/nvdimm/pmem.c | 4 +-
fs/dax.c | 192 ++++++++++++++++++--------------------
fs/fuse/virtio_fs.c | 3 +-
include/linux/dax.h | 6 +-
include/linux/mm.h | 27 +-----
include/linux/page-flags.h | 6 +-
mm/gup.c | 9 +--
mm/huge_memory.c | 6 +-
mm/internal.h | 2 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 6 +-
mm/memremap.c | 40 +++-----
mm/mlock.c | 2 +-
mm/mm_init.c | 9 +--
mm/swap.c | 2 +-
17 files changed, 143 insertions(+), 191 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 9c1a729..4d3ddd1 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ return dax_insert_pfn(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}
static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -169,11 +169,11 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+ return dax_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -214,11 +214,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+ return dax_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
}
#else
static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e16d1d4..57a94a6 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -257,7 +257,7 @@ EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
- if (unlikely(!dax_write_cache_enabled(dax_dev)))
+ if (unlikely(dax_dev && !dax_write_cache_enabled(dax_dev)))
return;
arch_wb_cache_pmem(addr, size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 210fb77..451cd0f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->disk = disk;
pmem->pgmap.owner = pmem;
- pmem->pfn_flags = PFN_DEV;
+ pmem->pfn_flags = 0;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
- pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
- pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
} else {
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/fs/dax.c b/fs/dax.c
index becb4a6..05f7b88 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
return xa_to_value(entry) >> DAX_SHIFT;
}
+static struct folio *dax_to_folio(void *entry)
+{
+ return page_folio(pfn_to_page(dax_to_pfn(entry)));
+}
+
static void *dax_make_entry(pfn_t pfn, unsigned long flags)
{
return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
@@ -318,85 +323,58 @@ static unsigned long dax_end_pfn(void *entry)
*/
#define for_each_mapped_pfn(entry, pfn) \
for (pfn = dax_to_pfn(entry); \
- pfn < dax_end_pfn(entry); pfn++)
+ pfn < dax_end_pfn(entry); pfn++)
-static inline bool dax_page_is_shared(struct page *page)
+static void dax_device_folio_init(struct folio *folio, int order)
{
- return page->mapping == PAGE_MAPPING_DAX_SHARED;
-}
+ int orig_order = folio_order(folio);
+ int i;
-/*
- * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
- * refcount.
- */
-static inline void dax_page_share_get(struct page *page)
-{
- if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
- /*
- * Reset the index if the page was already mapped
- * regularly before.
- */
- if (page->mapping)
- page->share = 1;
- page->mapping = PAGE_MAPPING_DAX_SHARED;
- }
- page->share++;
-}
+ if (orig_order != order) {
+ struct dev_pagemap *pgmap = page_dev_pagemap(&folio->page);
-static inline unsigned long dax_page_share_put(struct page *page)
-{
- return --page->share;
-}
+ for (i = 0; i < (1UL << orig_order); i++) {
+ struct page *page = folio_page(folio, i);
-/*
- * When it is called in dax_insert_entry(), the shared flag will indicate that
- * whether this entry is shared by multiple files. If so, set the page->mapping
- * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
- */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
- struct vm_area_struct *vma, unsigned long address, bool shared)
-{
- unsigned long size = dax_entry_size(entry), pfn, index;
- int i = 0;
+ ClearPageHead(page);
+ clear_compound_head(page);
- if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
- return;
-
- index = linear_page_index(vma, address & ~(size - 1));
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
+ /*
+ * Reset pgmap which was over-written by
+ * prep_compound_page().
+ */
+ page_folio(page)->pgmap = pgmap;
- if (shared) {
- dax_page_share_get(page);
- } else {
- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
+ /* Make sure this isn't set to TAIL_MAPPING */
+ page->mapping = NULL;
}
}
+
+ if (order > 0) {
+ prep_compound_page(&folio->page, order);
+ if (order > 1)
+ INIT_LIST_HEAD(&folio->_deferred_list);
+ }
}
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
- bool trunc)
+static void dax_associate_new_entry(void *entry, struct address_space *mapping,
+ pgoff_t index)
{
- unsigned long pfn;
+ unsigned long order = dax_entry_order(entry);
+ struct folio *folio = dax_to_folio(entry);
- if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+ if (!dax_entry_size(entry))
return;
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
-
- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- if (dax_page_is_shared(page)) {
- /* keep the shared flag if this page is still shared */
- if (dax_page_share_put(page) > 0)
- continue;
- } else
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
- }
+ /*
+ * We don't hold a reference for the DAX pagecache entry for the
+ * page. But we need to initialise the folio so we can hand it
+ * out. Nothing else should have a reference either.
+ */
+ WARN_ON_ONCE(folio_ref_count(folio));
+ dax_device_folio_init(folio, order);
+ folio->mapping = mapping;
+ folio->index = index;
}
static struct page *dax_busy_page(void *entry)
@@ -406,7 +384,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
- if (page_ref_count(page) > 1)
+ if (page_ref_count(page))
return page;
}
return NULL;
@@ -620,7 +598,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_lock_irq(xas);
}
- dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL); /* undo the PMD join */
dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrpages -= PG_PMD_NR;
@@ -743,7 +720,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
EXPORT_SYMBOL_GPL(dax_layout_busy_page);
static int __dax_invalidate_entry(struct address_space *mapping,
- pgoff_t index, bool trunc)
+ pgoff_t index, bool trunc)
{
XA_STATE(xas, &mapping->i_pages, index);
int ret = 0;
@@ -757,7 +734,6 @@ static int __dax_invalidate_entry(struct address_space *mapping,
(xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
goto out;
- dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
mapping->nrpages -= 1UL << dax_entry_order(entry);
ret = 1;
@@ -894,9 +870,11 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;
- dax_disassociate_entry(entry, mapping, false);
- dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
- shared);
+ if (!shared) {
+ dax_associate_new_entry(new_entry, mapping,
+ linear_page_index(vmf->vma, vmf->address));
+ }
+
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
@@ -1084,9 +1062,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
goto out;
if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
goto out;
- /* For larger pages we need devmap */
- if (length > 1 && !pfn_t_devmap(*pfnp))
- goto out;
+
rc = 0;
out_check_addr:
@@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
struct inode *inode = iter->inode;
unsigned long vaddr = vmf->address;
pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
+ struct page *page = pfn_t_to_page(pfn);
vm_fault_t ret;
*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
- ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+ page_ref_inc(page);
+ ret = dax_insert_pfn(vmf, pfn, false);
+ put_page(page);
trace_dax_load_hole(inode, vmf, ret);
return ret;
}
@@ -1212,8 +1191,13 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
pmd_t pmd_entry;
pfn_t pfn;
- zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+ zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
if (unlikely(!zero_folio))
goto fallback;
@@ -1221,29 +1205,23 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
DAX_PMD | DAX_ZERO_PAGE);
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
- if (!pmd_none(*(vmf->pmd))) {
- spin_unlock(ptl);
- goto fallback;
- }
+ if (!pmd_none(*vmf->pmd))
+ goto fallback_unlock;
- if (pgtable) {
- pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
- mm_inc_nr_ptes(vma->vm_mm);
- }
- pmd_entry = mk_pmd(&zero_folio->page, vmf->vma->vm_page_prot);
+ pmd_entry = mk_pmd(&zero_folio->page, vma->vm_page_prot);
pmd_entry = pmd_mkhuge(pmd_entry);
- set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
+ if (pgtable)
+ pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+ set_pmd_at(vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
spin_unlock(ptl);
trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
return VM_FAULT_NOPAGE;
+fallback_unlock:
+ spin_unlock(ptl);
+ mm_put_huge_zero_folio(vma->vm_mm);
+
fallback:
if (pgtable)
pte_free(vma->vm_mm, pgtable);
@@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
bool write = iter->flags & IOMAP_WRITE;
unsigned long entry_flags = pmd ? DAX_PMD : 0;
- int err = 0;
+ int ret, err = 0;
pfn_t pfn;
void *kaddr;
+ struct page *page;
if (!pmd && vmf->cow_page)
return dax_fault_cow_page(vmf, iter);
@@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
if (dax_fault_is_synchronous(iter, vmf->vma))
return dax_fault_synchronous_pfnp(pfnp, pfn);
- /* insert PMD pfn */
+ page = pfn_t_to_page(pfn);
+ page_ref_inc(page);
+
if (pmd)
- return vmf_insert_pfn_pmd(vmf, pfn, write);
+ ret = dax_insert_pfn_pmd(vmf, pfn, write);
+ else
+ ret = dax_insert_pfn(vmf, pfn, write);
- /* insert PTE pfn */
- if (write)
- return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ /*
+ * Insert PMD/PTE will have a reference on the page when mapping it so
+ * drop ours.
+ */
+ put_page(page);
+
+ return ret;
}
static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
@@ -1932,6 +1918,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
void *entry;
vm_fault_t ret;
+ struct page *page;
xas_lock_irq(&xas);
entry = get_unlocked_entry(&xas, order);
@@ -1947,14 +1934,17 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
dax_lock_entry(&xas, entry);
xas_unlock_irq(&xas);
+ page = pfn_t_to_page(pfn);
+ page_ref_inc(page);
if (order == 0)
- ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+ ret = dax_insert_pfn(vmf, pfn, true);
#ifdef CONFIG_FS_DAX_PMD
else if (order == PMD_ORDER)
- ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+ ret = dax_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
#endif
else
ret = VM_FAULT_FALLBACK;
+ put_page(page);
dax_unlock_entry(&xas, entry);
trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
return ret;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index dd52601..f79a94d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -875,8 +875,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
- *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
- PFN_DEV | PFN_MAP);
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 773dfc4..0f6f355 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -217,8 +217,12 @@ static inline int dax_wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
{
- return ___wait_var_event(page, page_ref_count(page) == 1,
+ int ret;
+
+ ret = ___wait_var_event(page, !page_ref_count(page),
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+
+ return ret;
}
#if IS_ENABLED(CONFIG_DAX)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 935e493..592b992 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1071,6 +1071,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
struct mmu_gather;
struct inode;
+extern void prep_compound_page(struct page *page, unsigned int order);
+
/*
* compound_order() can be called without holding a reference, which means
* that niceties like page_folio() don't work. These callers should be
@@ -1394,25 +1396,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf);
* back into memory.
*/
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs);
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!folio_is_zone_device(folio))
- return false;
- return __put_devmap_managed_folio_refs(folio, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
/* 127: arbitrary random number, small enough to assemble well */
#define folio_ref_zero_or_close_to_overflow(folio) \
((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1527,12 +1510,6 @@ static inline void put_page(struct page *page)
{
struct folio *folio = page_folio(page);
- /*
- * For some devmap managed pages we need to catch refcount transition
- * from 2 to 1:
- */
- if (put_devmap_managed_folio_refs(folio, 1))
- return;
folio_put(folio);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2175ebc..0326a41 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -667,12 +667,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-/*
- * Different with flags above, this flag is used only for fsdax mode. It
- * indicates that this page->mapping is now under reflink case.
- */
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
-
static __always_inline bool folio_mapping_flags(const struct folio *folio)
{
return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0;
diff --git a/mm/gup.c b/mm/gup.c
index 5d2fc9a..798c92b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -91,8 +91,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
* belongs to this folio.
*/
if (unlikely(page_folio(page) != folio)) {
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
goto retry;
}
@@ -111,8 +110,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
refs *= GUP_PIN_COUNTING_BIAS;
}
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
}
/**
@@ -543,8 +541,7 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs,
*/
if (unlikely((flags & FOLL_LONGTERM) &&
!folio_is_longterm_pinnable(folio))) {
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
return NULL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 790041e..ab2cd4e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2017,7 +2017,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb->fullmm);
arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- if (vma_is_special_huge(vma)) {
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
@@ -2661,13 +2661,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (arch_needs_pgtable_deposit())
zap_deposited_table(mm, pmd);
- if (vma_is_special_huge(vma))
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma))
return;
if (unlikely(is_pmd_migration_entry(old_pmd))) {
swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd);
folio = pfn_swap_entry_folio(entry);
+ } else if (is_huge_zero_pmd(old_pmd)) {
+ return;
} else {
page = pmd_page(old_pmd);
folio = page_folio(page);
diff --git a/mm/internal.h b/mm/internal.h
index b00ea45..08123c2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -680,8 +680,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
set_page_private(p, 0);
}
-extern void prep_compound_page(struct page *page, unsigned int order);
-
extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
extern bool free_pages_prepare(struct page *page, unsigned int order);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 96ce31e..80dd2a7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -419,18 +419,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;
- if (pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return PUD_SHIFT;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return 0;
- if (pmd_devmap(*pmd))
+ if (pmd_trans_huge(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
if (!pte)
return 0;
ptent = ptep_get(pte);
- if (pte_present(ptent) && pte_devmap(ptent))
+ if (pte_present(ptent))
ret = PAGE_SHIFT;
pte_unmap(pte);
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 368e15d..cc692d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3752,13 +3752,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
- * VM_PFNMAP VMA.
+ * VM_PFNMAP VMA. FS DAX also wants ops->pfn_mkwrite called.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable and/or call ops->pfn_mkwrite.
*/
- if (!vmf->page)
+ if (!vmf->page || is_device_dax_page(vmf->page)) {
+ vmf->page = NULL;
return wp_pfn_shared(vmf);
+ }
return wp_page_shared(vmf, folio);
}
diff --git a/mm/memremap.c b/mm/memremap.c
index e885bc9..89c0c3b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,13 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->pgmap->ops ||
- !folio->pgmap->ops->page_free))
+ struct dev_pagemap *pgmap = folio->pgmap;
+
+ if (WARN_ON_ONCE(!pgmap->ops))
+ return;
+
+ if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
+ !pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -486,24 +491,29 @@ void free_zone_device_folio(struct folio *folio)
* to clear folio->mapping.
*/
folio->mapping = NULL;
- folio->pgmap->ops->page_free(folio_page(folio, 0));
- switch (folio->pgmap->type) {
+ switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- put_dev_pagemap(folio->pgmap);
+ pgmap->ops->page_free(folio_page(folio, 0));
+ put_dev_pagemap(pgmap);
break;
- case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_GENERIC:
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
+ pgmap->ops->page_free(folio_page(folio, 0));
folio_set_count(folio, 1);
break;
+ case MEMORY_DEVICE_FS_DAX:
+ wake_up_var(&folio->page);
+ break;
+
case MEMORY_DEVICE_PCI_P2PDMA:
+ pgmap->ops->page_free(folio_page(folio, 0));
break;
}
}
@@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
lock_page(page);
}
EXPORT_SYMBOL_GPL(zone_device_page_init);
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
- return false;
-
- /*
- * fsdax page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (folio_ref_sub_return(folio, refs) == 1)
- wake_up_var(&folio->_refcount);
- return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/mlock.c b/mm/mlock.c
index e3e3dc2..5352b00 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -362,6 +362,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long start = addr;
ptl = pmd_trans_huge_lock(pmd, vma);
+ if (vma_is_dax(vma))
+ ptl = NULL;
if (ptl) {
if (!pmd_present(*pmd))
goto out;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3d0611e..3c32190 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1015,23 +1015,22 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
/*
- * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
- * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
- * allocator which will set the page count to 1 when allocating the
- * page.
+ * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
+ * directly to the driver page allocator which will set the page count
+ * to 1 when allocating the page.
*
* MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
* their refcount reset to one whenever they are freed (ie. after
* their refcount drops to 0).
*/
switch (pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
case MEMORY_DEVICE_PCI_P2PDMA:
set_page_count(page, 0);
break;
- case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_GENERIC:
break;
}
diff --git a/mm/swap.c b/mm/swap.c
index 6b83898..0b90b61 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -969,8 +969,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (put_devmap_managed_folio_refs(folio, nr_refs))
- continue;
if (folio_ref_sub_and_test(folio, nr_refs))
free_zone_device_folio(folio);
continue;
--
git-series 0.9.1
* [PATCH 11/12] mm: Remove pXX_devmap callers
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (9 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-27 12:29 ` Alexander Gordeev
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
11 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
The devmap PTE special bit was used to detect mappings of FS DAX
pages. This tracking was required to ensure the generic mm did not
manipulate the page reference counts as FS DAX implemented its own
reference counting scheme.
Now that FS DAX pages have their references counted the same way as
normal pages, this tracking is no longer needed and can be
removed.
Almost all existing uses of pmd_devmap() are paired with a check of
pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages,
dropping the check in these cases doesn't change anything.
However care needs to be taken because some callers assume that a
pmd_trans_huge() entry maps a normal THP folio, which is no longer
guaranteed now that FS DAX mappings also satisfy the check. This is
dealt with either by checking
!vma_is_dax() or relying on the fact that the page pointer was obtained
from a page list. This is possible because zone device pages cannot
appear in any page list due to sharing page->lru with page->pgmap.
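The two conversion patterns used throughout are sketched below
(illustration only; pmd case shown, the pud case is analogous):
	/* Paired checks simply drop the devmap half ... */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
		ptl = __pmd_trans_huge_lock(pmd, vma);

	/* ... while paths that must still exclude DAX gain an explicit VMA check */
	if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
		if (arch_needs_pgtable_deposit())
			zap_deposited_table(tlb->mm, pmd);
	}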
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/pgtable.c | 8 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
fs/dax.c | 5 +-
fs/userfaultfd.c | 2 +-
include/linux/huge_mm.h | 10 +-
include/linux/pgtable.h | 2 +-
mm/gup.c | 163 +------------------------
mm/hmm.c | 7 +-
mm/huge_memory.c | 65 +---------
mm/khugepaged.c | 2 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory.c | 37 +----
mm/migrate_device.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pagewalk.c | 8 +-
mm/pgtable-generic.c | 7 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 5 +-
22 files changed, 59 insertions(+), 292 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d..82d3117 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!hash__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif
@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));
pmd = *pmdp;
pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 5a4a753..4537a29 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -50,7 +50,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp));
#endif
changed = !pmd_same(*(pmdp), entry);
@@ -70,7 +70,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
#endif
changed = !pud_same(*(pudp), entry);
@@ -193,7 +192,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
pmd_t pmd;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
- VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
+ VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp)) ||
+ !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
/*
* if it not a fullmm flush, then we can possibly end up converting
@@ -211,8 +210,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
pud_t pud;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
- VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
- !pud_present(*pudp));
+ VM_BUG_ON(!pud_present(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
/*
* if it not a fullmm flush, then we can possibly end up converting
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index b0d9270..78907b6 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1424,7 +1424,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!radix__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!radix__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif
@@ -1441,7 +1441,7 @@ unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
+ WARN_ON(!pud_trans_huge(*pudp));
assert_spin_locked(pud_lockptr(mm, pudp));
#endif
@@ -1459,7 +1459,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(radix__pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));
/*
* khugepaged calls this for normal pmd
*/
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396..c8cba4d 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -509,7 +509,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
return NULL;
#endif
- if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+ if (pmd_trans_huge(pmd)) {
if (is_thp)
*is_thp = true;
ret_pte = (pte_t *)pmdp;
diff --git a/fs/dax.c b/fs/dax.c
index 05f7b88..6933ff3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1721,7 +1721,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PTE we need to set up. If so just return and the fault will be
* retried.
*/
- if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
+ if (pmd_trans_huge(*vmf->pmd)) {
ret = VM_FAULT_NOPAGE;
goto unlock_entry;
}
@@ -1842,8 +1842,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PMD we need to set up. If so just return and the fault will be
* retried.
*/
- if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
- !pmd_devmap(*vmf->pmd)) {
+ if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd)) {
ret = 0;
goto unlock_entry;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 68cdd89..1c90913 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -304,7 +304,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;
ret = false;
- if (!pmd_present(_pmd) || pmd_devmap(_pmd))
+ if (!pmd_present(_pmd) || vma_is_dax(vmf->vma))
goto out;
if (pmd_trans_huge(_pmd)) {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index eaf3f78..79a24ac 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -334,8 +334,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
} while (0)
@@ -361,8 +360,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
#define split_huge_pud(__vma, __pud, __address) \
do { \
pud_t *____pud = (__pud); \
- if (pud_trans_huge(*____pud) \
- || pud_devmap(*____pud)) \
+ if (pud_trans_huge(*____pud)) \
__split_huge_pud(__vma, __pud, __address); \
} while (0)
@@ -385,7 +383,7 @@ static inline int is_swap_pmd(pmd_t pmd)
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -393,7 +391,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma)
{
- if (pud_trans_huge(*pud) || pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return __pud_trans_huge_lock(pud, vma);
else
return NULL;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 780f3b4..a68e279 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1645,7 +1645,7 @@ static inline int pud_trans_unstable(pud_t *pud)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
- if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+ if (pud_none(pudval) || pud_trans_huge(pudval))
return 1;
if (unlikely(pud_bad(pudval))) {
pud_clear_bad(pud);
diff --git a/mm/gup.c b/mm/gup.c
index 798c92b..74b0234 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -616,31 +616,9 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
return NULL;
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-
- if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
- pud_devmap(pud)) {
- /*
- * device mapped pages can only be returned if the caller
- * will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so
- * assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- if (flags & FOLL_TOUCH)
- touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
-
- ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
- if (!ctx->pgmap)
- return ERR_PTR(-EFAULT);
- }
-
page = pfn_to_page(pfn);
- if (!pud_devmap(pud) && !pud_write(pud) &&
- gup_must_unshare(vma, flags, page))
+ if (!pud_write(pud) && gup_must_unshare(vma, flags, page))
return ERR_PTR(-EMLINK);
ret = try_grab_folio(page_folio(page), 1, flags);
@@ -839,8 +817,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = vm_normal_page(vma, address, pte);
/*
- * We only care about anon pages in can_follow_write_pte() and don't
- * have to worry about pte_devmap() because they are never anon.
+ * We only care about anon pages in can_follow_write_pte().
*/
if ((flags & FOLL_WRITE) &&
!can_follow_write_pte(pte, page, vma, flags)) {
@@ -848,18 +825,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
goto out;
}
- if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
- /*
- * Only return device mapping pages in the FOLL_GET or FOLL_PIN
- * case since they are only valid while holding the pgmap
- * reference.
- */
- *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
- if (*pgmap)
- page = pte_page(pte);
- else
- goto no_page;
- } else if (unlikely(!page)) {
+ if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
page = ERR_PTR(-EFAULT);
@@ -941,14 +907,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
return no_page_table(vma, flags, address);
- if (pmd_devmap(pmdval)) {
- ptl = pmd_lock(mm, pmd);
- page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
- spin_unlock(ptl);
- if (page)
- return page;
- return no_page_table(vma, flags, address);
- }
if (likely(!pmd_leaf(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
@@ -2830,7 +2788,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
int *nr)
{
struct dev_pagemap *pgmap = NULL;
- int nr_start = *nr, ret = 0;
+ int ret = 0;
pte_t *ptep, *ptem;
ptem = ptep = pte_offset_map(&pmd, addr);
@@ -2854,16 +2812,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
goto pte_unmap;
- if (pte_devmap(pte)) {
- if (unlikely(flags & FOLL_LONGTERM))
- goto pte_unmap;
-
- pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
- if (unlikely(!pgmap)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- goto pte_unmap;
- }
- } else if (pte_special(pte))
+ if (pte_special(pte))
goto pte_unmap;
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2934,91 +2883,6 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
}
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages, int *nr)
-{
- int nr_start = *nr;
- struct dev_pagemap *pgmap = NULL;
-
- do {
- struct folio *folio;
- struct page *page = pfn_to_page(pfn);
-
- pgmap = get_dev_pagemap(pfn, pgmap);
- if (unlikely(!pgmap)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
- folio = try_grab_folio_fast(page, 1, flags);
- if (!folio) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
- folio_set_referenced(folio);
- pages[*nr] = page;
- (*nr)++;
- pfn++;
- } while (addr += PAGE_SIZE, addr != end);
-
- put_dev_pagemap(pgmap);
- return addr == end;
-}
-
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-#else
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t pud, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-#endif
-
static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
unsigned long end, unsigned int flags, struct page **pages,
int *nr)
@@ -3030,13 +2894,6 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
- if (pmd_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return gup_fast_devmap_pmd_leaf(orig, pmdp, addr, end, flags,
- pages, nr);
- }
-
page = pmd_page(orig);
refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
@@ -3074,13 +2931,7 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
if (!pud_access_permitted(orig, flags & FOLL_WRITE))
return 0;
- if (pud_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return gup_fast_devmap_pud_leaf(orig, pudp, addr, end, flags,
- pages, nr);
- }
-
+ // TODO: FOLL_LONGTERM?
page = pud_page(orig);
refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
@@ -3119,8 +2970,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
- BUILD_BUG_ON(pgd_devmap(orig));
-
page = pgd_page(orig);
refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
diff --git a/mm/hmm.c b/mm/hmm.c
index a11807c..1b85ed6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -298,7 +298,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* fall through and treat it like a normal page.
*/
if (!vm_normal_page(walk->vma, addr, pte) &&
- !pte_devmap(pte) &&
!is_zero_pfn(pte_pfn(pte))) {
if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
pte_unmap(ptep);
@@ -351,7 +350,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}
- if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+ if (pmd_trans_huge(pmd)) {
/*
* No need to take pmd_lock here, even if some other thread
* is splitting the huge pmd we will get that event through
@@ -362,7 +361,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
* values.
*/
pmd = pmdp_get_lockless(pmdp);
- if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+ if (!pmd_trans_huge(pmd))
goto again;
return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd);
@@ -429,7 +428,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
return hmm_vma_walk_hole(start, end, -1, walk);
}
- if (pud_leaf(pud) && pud_devmap(pud)) {
+ if (pud_leaf(pud) && vma_is_dax(walk->vma)) {
unsigned long i, npages, pfn;
unsigned int required_fault;
unsigned long *hmm_pfns;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ab2cd4e..7c39950 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1254,8 +1254,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
}
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
- if (pfn_t_devmap(pfn))
- entry = pmd_mkdevmap(entry);
if (write) {
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
@@ -1294,8 +1292,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1389,8 +1385,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
}
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
- if (pfn_t_devmap(pfn))
- entry = pud_mkdevmap(entry);
if (write) {
entry = pud_mkyoung(pud_mkdirty(entry));
entry = maybe_pud_mkwrite(entry, vma);
@@ -1421,8 +1415,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1493,46 +1485,6 @@ void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pmd_pfn(*pmd);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pmd_lockptr(mm, pmd));
-
- if (flags & FOLL_WRITE && !pmd_write(*pmd))
- return NULL;
-
- if (pmd_present(*pmd) && pmd_devmap(*pmd))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
- ret = try_grab_folio(page_folio(page), 1, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1664,7 +1616,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pud = *src_pud;
- if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+ if (unlikely(!pud_trans_huge(pud)))
goto out_unlock;
/*
@@ -2473,8 +2425,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2491,7 +2442,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
spinlock_t *ptl;
ptl = pud_lock(vma->vm_mm, pud);
- if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+ if (likely(pud_trans_huge(*pud)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2541,7 +2492,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
- VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
+ VM_BUG_ON(!pud_trans_huge(*pud));
count_vm_event(THP_SPLIT_PUD);
@@ -2575,7 +2526,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pud_lock(vma->vm_mm, pud);
- if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+ if (unlikely(!pud_trans_huge(*pud)))
goto out;
__split_huge_pud_locked(vma, pud, range.start);
@@ -2648,8 +2599,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
- && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
count_vm_event(THP_SPLIT_PMD);
@@ -2866,8 +2816,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
* require a folio to check the PMD against. Otherwise, there
* is a risk of replacing the wrong folio.
*/
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
+ if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd)) {
if (folio && folio != pmd_folio(*pmd))
return;
__split_huge_pmd_locked(vma, pmd, address, freeze);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4a83c40..4e3ed2f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -961,8 +961,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
return SCAN_PMD_MAPPED;
- if (pmd_devmap(pmde))
- return SCAN_PMD_NULL;
if (pmd_bad(pmde))
return SCAN_PMD_NULL;
return SCAN_SUCCEED;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b..208b428 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
pmd_t pmdval = pmdp_get_lockless(pmd);
/* Do not split a huge pmd, present or migrated */
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+ if (pmd_trans_huge(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
walk->action = ACTION_CONTINUE;
}
@@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
pud_t pudval = READ_ONCE(*pud);
/* Do not split a huge pud */
- if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
+ if (pud_trans_huge(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
walk->action = ACTION_CONTINUE;
}
diff --git a/mm/memory.c b/mm/memory.c
index cc692d6..0008735 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -604,16 +604,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return NULL;
if (is_zero_pfn(pfn))
return NULL;
- if (pte_devmap(pte))
- /*
- * NOTE: New users of ZONE_DEVICE will not set pte_devmap()
- * and will have refcounts incremented on their struct pages
- * when they are inserted into PTEs, thus they are safe to
- * return here. Legacy ZONE_DEVICE pages that set pte_devmap()
- * do not have refcounts. Example of legacy ZONE_DEVICE is
- * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers.
- */
- return NULL;
print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -692,8 +682,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
}
- if (pmd_devmap(pmd))
- return NULL;
if (is_huge_zero_pmd(pmd))
return NULL;
if (unlikely(pfn > highest_memmap_pfn))
@@ -1235,8 +1223,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1272,7 +1259,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+ if (pud_trans_huge(*src_pud)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1710,7 +1697,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
@@ -1752,7 +1739,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+ if (pud_trans_huge(*pud)) {
if (next - addr != HPAGE_PUD_SIZE) {
mmap_assert_locked(tlb->mm);
split_huge_pud(vma, pud, addr);
@@ -2375,10 +2362,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
}
/* Ok, finally just insert the thing.. */
- if (pfn_t_devmap(pfn))
- entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
- else
- entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+ entry = pte_mkspecial(pfn_t_pte(pfn, prot));
if (mkwrite) {
entry = pte_mkyoung(entry);
@@ -2489,8 +2473,6 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite)
/* these checks mirror the abort conditions in vm_normal_page */
if (vma->vm_flags & VM_MIXEDMAP)
return true;
- if (pfn_t_devmap(pfn))
- return true;
if (pfn_t_special(pfn))
return true;
if (is_zero_pfn(pfn_t_to_pfn(pfn)))
@@ -2522,8 +2504,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
* without pte special, it would there be refcounted as a normal page.
*/
- if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
- !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_t_valid(pfn)) {
struct page *page;
/*
@@ -2568,8 +2549,6 @@ vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
if (!pfn_t_valid(pfn_t))
return VM_FAULT_SIGBUS;
- WARN_ON_ONCE(pfn_t_devmap(pfn_t));
-
if (WARN_ON(is_zero_pfn(pfn) && write))
return VM_FAULT_SIGBUS;
@@ -5678,7 +5657,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pud_t orig_pud = *vmf.pud;
barrier();
- if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
+ if (pud_trans_huge(orig_pud)) {
/*
* TODO once we support anonymous PUDs: NUMA case and
@@ -5719,7 +5698,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
- if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+ if (pmd_trans_huge(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 9d30107..f8c4baf 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -599,7 +599,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+ if (pmd_trans_huge(*pmdp))
goto abort;
if (pte_alloc(mm, pmdp))
goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d0..e6b721d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -382,7 +382,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
goto next;
_pmd = pmdp_get_lockless(pmd);
- if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
diff --git a/mm/mremap.c b/mm/mremap.c
index 24712f8..a0f111c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -587,7 +587,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
if (!new_pud)
break;
- if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+ if (pud_trans_huge(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
old_pud, new_pud, need_rmap_locks);
@@ -609,8 +609,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
again:
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42..77da636 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -235,8 +235,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = pmdp_get_lockless(pvmw->pmd);
- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
- (pmd_present(pmde) && pmd_devmap(pmde))) {
+ if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
if (!pmd_present(pmde)) {
@@ -251,7 +250,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
return true;
}
- if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) {
+ if (likely(pmd_trans_huge(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (!check_pmd(pmd_pfn(pmde), pvmw))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cd79fb3..09a3ee4 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -753,7 +753,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
fw->pudp = pudp;
fw->pud = pud;
- if (!pud_present(pud) || pud_devmap(pud)) {
+ if (!pud_present(pud)) {
spin_unlock(ptl);
goto not_found;
} else if (!pud_leaf(pud)) {
@@ -765,6 +765,12 @@ struct folio *folio_walk_start(struct folio_walk *fw,
* support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
*/
page = pud_page(pud);
+
+ if (is_device_dax_page(page)) {
+ spin_unlock(ptl);
+ goto not_found;
+ }
+
goto found;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4ad..093c435 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
@@ -153,7 +152,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
pud_t pud;
VM_BUG_ON(address & ~HPAGE_PUD_MASK);
- VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+ VM_BUG_ON(!pud_trans_huge(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
return pud;
@@ -293,7 +292,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
- if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 966e6c8..1ac83b3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1679,7 +1679,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
ptl = pmd_trans_huge_lock(src_pmd, src_vma);
if (ptl) {
- if (pmd_devmap(*src_pmd)) {
+ if (vma_is_dax(src_vma)) {
spin_unlock(ptl);
err = -ENOENT;
break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b1fad0..c4d261b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3322,7 +3322,7 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (!pte_present(pte) || is_zero_pfn(pfn))
return -1;
- if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+ if (WARN_ON_ONCE(pte_special(pte)))
return -1;
if (WARN_ON_ONCE(!pfn_valid(pfn)))
@@ -3340,9 +3340,6 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
return -1;
- if (WARN_ON_ONCE(pmd_devmap(pmd)))
- return -1;
-
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;
--
git-series 0.9.1
* [PATCH 12/12] mm: Remove devmap related functions and page table bits
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
` (10 preceding siblings ...)
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
@ 2024-09-10 4:14 ` Alistair Popple
2024-09-11 7:47 ` Chunyan Zhang
2024-09-12 12:55 ` kernel test robot
11 siblings, 2 replies; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 4:14 UTC (permalink / raw)
To: dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD
page table bits. So drop all references to these, freeing up a
software defined page table bit on architectures supporting it.
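As a concrete illustration (taken from the x86 hunk in the diff below), the
transparent-hugepage predicates reduce to a plain PSE check once there is no
devmap bit left to mask out:
static inline int pmd_trans_huge(pmd_t pmd)
{
	/* before: (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE */
	return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
}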
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Will Deacon <will@kernel.org> # arm64
---
Documentation/mm/arch_pgtable_helpers.rst | 6 +--
arch/arm64/Kconfig | 1 +-
arch/arm64/include/asm/pgtable-prot.h | 1 +-
arch/arm64/include/asm/pgtable.h | 24 +--------
arch/powerpc/Kconfig | 1 +-
arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +--
arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +--
arch/powerpc/include/asm/book3s/64/pgtable.h | 52 +------------------
arch/powerpc/include/asm/book3s/64/radix.h | 14 +-----
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/pgtable.h | 50 +-----------------
arch/x86/include/asm/pgtable_types.h | 5 +--
include/linux/mm.h | 7 +--
include/linux/pfn_t.h | 20 +-------
include/linux/pgtable.h | 19 +------
mm/Kconfig | 4 +-
mm/debug_vm_pgtable.c | 59 +--------------------
mm/hmm.c | 3 +-
18 files changed, 11 insertions(+), 269 deletions(-)
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index af24516..c88c7fa 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -30,8 +30,6 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_protnone | Tests a PROT_NONE PTE |
+---------------------------+--------------------------------------------------+
-| pte_devmap | Tests a ZONE_DEVICE mapped PTE |
-+---------------------------+--------------------------------------------------+
| pte_soft_dirty | Tests a soft dirty PTE |
+---------------------------+--------------------------------------------------+
| pte_swp_soft_dirty | Tests a soft dirty swapped PTE |
@@ -104,8 +102,6 @@ PMD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pmd_protnone | Tests a PROT_NONE PMD |
+---------------------------+--------------------------------------------------+
-| pmd_devmap | Tests a ZONE_DEVICE mapped PMD |
-+---------------------------+--------------------------------------------------+
| pmd_soft_dirty | Tests a soft dirty PMD |
+---------------------------+--------------------------------------------------+
| pmd_swp_soft_dirty | Tests a soft dirty swapped PMD |
@@ -177,8 +173,6 @@ PUD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pud_write | Tests a writable PUD |
+---------------------------+--------------------------------------------------+
-| pud_devmap | Tests a ZONE_DEVICE mapped PUD |
-+---------------------------+--------------------------------------------------+
| pud_mkyoung | Creates a young PUD |
+---------------------------+--------------------------------------------------+
| pud_mkold | Creates an old PUD |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6494848..b6d9c29 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -36,7 +36,6 @@ config ARM64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
- select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index b11cfb9..043b102 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -17,7 +17,6 @@
#define PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 2) /* only for swp ptes */
#define PTE_DIRTY (_AT(pteval_t, 1) << 55)
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
-#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
/*
* PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should be
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f560..d21b6c9 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -107,7 +107,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
#define pte_user(pte) (!!(pte_val(pte) & PTE_USER))
#define pte_user_exec(pte) (!(pte_val(pte) & PTE_UXN))
#define pte_cont(pte) (!!(pte_val(pte) & PTE_CONT))
-#define pte_devmap(pte) (!!(pte_val(pte) & PTE_DEVMAP))
#define pte_tagged(pte) ((pte_val(pte) & PTE_ATTRINDX_MASK) == \
PTE_ATTRINDX(MT_NORMAL_TAGGED))
@@ -269,11 +268,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
-}
-
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
@@ -569,14 +563,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
-#endif
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
-}
-
#define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
#define __phys_to_pmd_val(phys) __phys_to_pte_val(phys)
#define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT)
@@ -1136,16 +1122,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
pmd_pte(entry), dirty);
}
-
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7b09b0..8094e48 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -144,7 +144,6 @@ config PPC
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
- select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index c654c37..d23244b 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -140,12 +140,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
extern int hash__has_transparent_hugepage(void);
#endif
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
- BUG();
- return pmd;
-}
-
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_HASH_4K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0bf6fd0..0fb5b7d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -259,7 +259,7 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
*/
static inline int hash__pmd_trans_huge(pmd_t pmd)
{
- return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)) ==
+ return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE)) ==
(_PAGE_PTE | H_PAGE_THP_HUGE));
}
@@ -281,11 +281,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
extern int hash__has_transparent_hugepage(void);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
- return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP));
-}
-
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_HASH_64K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5da92ba..9bf54b5 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -88,7 +88,6 @@
#define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */
#define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
-#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */
/*
* Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
@@ -109,7 +108,7 @@
*/
#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
_PAGE_ACCESSED | H_PAGE_THP_HUGE | _PAGE_PTE | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+ _PAGE_SOFT_DIRTY)
/*
* user access blocked by key
*/
@@ -123,7 +122,7 @@
*/
#define _PAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
_PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+ _PAGE_SOFT_DIRTY)
/*
* We define 2 sets of base prot bits, one for basic pages (ie,
@@ -635,24 +634,6 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP));
-}
-
-/*
- * This is potentially called with a pmd as the argument, in which case it's not
- * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set.
- * That's because the bit we use for _PAGE_DEVMAP is not reserved for software
- * use in page directory entries (ie. non-ptes).
- */
-static inline int pte_devmap(pte_t pte)
-{
- __be64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
-
- return (pte_raw(pte) & mask) == mask;
-}
-
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
/* FIXME!! check whether this need to be a conditional */
@@ -1406,35 +1387,6 @@ static inline bool arch_needs_pgtable_deposit(void)
}
extern void serialize_against_pte_lookup(struct mm_struct *mm);
-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- if (radix_enabled())
- return radix__pmd_mkdevmap(pmd);
- return hash__pmd_mkdevmap(pmd);
-}
-
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
- if (radix_enabled())
- return radix__pud_mkdevmap(pud);
- BUG();
- return pud;
-}
-
-static inline int pmd_devmap(pmd_t pmd)
-{
- return pte_devmap(pmd_pte(pmd));
-}
-
-static inline int pud_devmap(pud_t pud)
-{
- return pte_devmap(pud_pte(pud));
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8f55ff7..df23a82 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -264,7 +264,7 @@ static inline int radix__p4d_bad(p4d_t p4d)
static inline int radix__pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+ return (pmd_val(pmd) & _PAGE_PTE) == _PAGE_PTE;
}
static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
@@ -274,7 +274,7 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
static inline int radix__pud_trans_huge(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+ return (pud_val(pud) & _PAGE_PTE) == _PAGE_PTE;
}
static inline pud_t radix__pud_mkhuge(pud_t pud)
@@ -315,16 +315,6 @@ static inline int radix__has_transparent_pud_hugepage(void)
}
#endif
-static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
-{
- return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
-static inline pud_t radix__pud_mkdevmap(pud_t pud)
-{
- return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
struct vmem_altmap;
struct dev_pagemap;
extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b74b9ee..9177fe3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,7 +91,6 @@ config X86
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
- select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8d12bfa..b576a14 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -280,16 +280,15 @@ static inline bool pmd_leaf(pmd_t pte)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_leaf */
static inline int pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_trans_huge(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pud_val(pud) & _PAGE_PSE) == _PAGE_PSE;
}
#endif
@@ -299,29 +298,6 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
}
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pmd_devmap(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_DEVMAP);
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static inline int pud_devmap(pud_t pud)
-{
- return !!(pud_val(pud) & _PAGE_DEVMAP);
-}
-#else
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-#endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
@@ -482,11 +458,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
return pte_set_flags(pte, _PAGE_SPECIAL);
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
-}
-
static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
{
pmdval_t v = native_pmd_val(pmd);
@@ -572,11 +543,6 @@ static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_DIRTY);
}
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pmd_set_flags(pmd, _PAGE_DEVMAP);
-}
-
static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_PSE);
@@ -656,11 +622,6 @@ static inline pud_t pud_mkdirty(pud_t pud)
return pud_mksaveddirty(pud);
}
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
- return pud_set_flags(pud, _PAGE_DEVMAP);
-}
-
static inline pud_t pud_mkhuge(pud_t pud)
{
return pud_set_flags(pud, _PAGE_PSE);
@@ -988,13 +949,6 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t a)
-{
- return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
-}
-#endif
-
#define pte_accessible pte_accessible
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2f32113..37c250f 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -33,7 +33,6 @@
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
#ifdef CONFIG_X86_64
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit */
@@ -117,11 +116,9 @@
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
-#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
#define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
#else
#define _PAGE_NX (_AT(pteval_t, 0))
-#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif
@@ -148,7 +145,7 @@
#define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | \
_PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \
- _PAGE_DEVMAP | _PAGE_CC | _PAGE_UFFD_WP)
+ _PAGE_CC | _PAGE_UFFD_WP)
#define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
#define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 592b992..5976276 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2624,13 +2624,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
}
#endif
-#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t pte)
-{
- return 0;
-}
-#endif
-
extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl);
static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 2d91482..0100ad8 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -97,26 +97,6 @@ static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot)
#endif
#endif
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
- const u64 flags = PFN_DEV|PFN_MAP;
-
- return (pfn.val & flags) == flags;
-}
-#else
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
- return false;
-}
-pte_t pte_mkdevmap(pte_t pte);
-pmd_t pmd_mkdevmap(pmd_t pmd);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
-pud_t pud_mkdevmap(pud_t pud);
-#endif
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
static inline bool pfn_t_special(pfn_t pfn)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a68e279..f3a95e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1616,21 +1616,6 @@ static inline int pud_write(pud_t pud)
}
#endif /* pud_write */
-#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline int pmd_devmap(pmd_t pmd)
-{
- return 0;
-}
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
-
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static inline int pud_trans_huge(pud_t pud)
@@ -1885,8 +1870,8 @@ typedef unsigned int pgtbl_mod_mask;
* - It should contain a huge PFN, which points to a huge page larger than
* PAGE_SIZE of the platform. The PFN format isn't important here.
*
- * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
- * pXd_devmap(), or hugetlb mappings).
+ * - It should cover all kinds of huge mappings (i.e. pXd_trans_huge()
+ * or hugetlb mappings).
*/
#ifndef pgd_leaf
#define pgd_leaf(x) false
diff --git a/mm/Kconfig b/mm/Kconfig
index 8078a4b..58402d7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1017,9 +1017,6 @@ config ARCH_HAS_CURRENT_STACK_POINTER
register alias named "current_stack_pointer", this config can be
selected.
-config ARCH_HAS_PTE_DEVMAP
- bool
-
config ARCH_HAS_ZONE_DMA_SET
bool
@@ -1037,7 +1034,6 @@ config ZONE_DEVICE
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
- depends on ARCH_HAS_PTE_DEVMAP
select XARRAY_MULTI
help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index e4969fb..1262148 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -348,12 +348,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
vaddr &= HPAGE_PUD_MASK;
pud = pfn_pud(args->pud_pfn, args->page_prot);
- /*
- * Some architectures have debug checks to make sure
- * huge pud mapping are only found with devmap entries
- * For now test with only devmap entries.
- */
- pud = pud_mkdevmap(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
pudp_set_wrprotect(args->mm, vaddr, args->pudp);
@@ -366,7 +360,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
WARN_ON(!pud_none(pud));
#endif /* __PAGETABLE_PMD_FOLDED */
pud = pfn_pud(args->pud_pfn, args->page_prot);
- pud = pud_mkdevmap(pud);
pud = pud_wrprotect(pud);
pud = pud_mkclean(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
@@ -384,7 +377,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
#endif /* __PAGETABLE_PMD_FOLDED */
pud = pfn_pud(args->pud_pfn, args->page_prot);
- pud = pud_mkdevmap(pud);
pud = pud_mkyoung(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
@@ -693,53 +685,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { }
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static void __init pte_devmap_tests(struct pgtable_debug_args *args)
-{
- pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
-
- pr_debug("Validating PTE devmap\n");
- WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args)
-{
- pmd_t pmd;
-
- if (!has_transparent_hugepage())
- return;
-
- pr_debug("Validating PMD devmap\n");
- pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
- WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void __init pud_devmap_tests(struct pgtable_debug_args *args)
-{
- pud_t pud;
-
- if (!has_transparent_pud_hugepage())
- return;
-
- pr_debug("Validating PUD devmap\n");
- pud = pfn_pud(args->fixed_pud_pfn, args->page_prot);
- WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
-}
-#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#else
-static void __init pte_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args)
{
pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
@@ -1341,10 +1286,6 @@ static int __init debug_vm_pgtable(void)
pte_protnone_tests(&args);
pmd_protnone_tests(&args);
- pte_devmap_tests(&args);
- pmd_devmap_tests(&args);
- pud_devmap_tests(&args);
-
pte_soft_dirty_tests(&args);
pmd_soft_dirty_tests(&args);
pte_swap_soft_dirty_tests(&args);
diff --git a/mm/hmm.c b/mm/hmm.c
index 1b85ed6..4fca141 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -395,8 +395,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return 0;
}
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
pud_t pud)
{
--
git-series 0.9.1
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
@ 2024-09-10 4:47 ` Matthew Wilcox
2024-09-10 6:57 ` Alistair Popple
2024-09-12 12:44 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 53+ messages in thread
From: Matthew Wilcox @ 2024-09-10 4:47 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
> @@ -337,6 +341,7 @@ struct folio {
> /* private: */
> };
> /* public: */
> + struct dev_pagemap *pgmap;
Shouldn't that be indented by one more tab stop?
And for ease of reading, perhaps it should be placed either immediately
before or after 'struct list_head lru;'?
> +++ b/include/linux/mmzone.h
> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
> return page_zonenum(page) == ZONE_DEVICE;
> }
>
> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> +{
> + WARN_ON(!is_zone_device_page(page));
> + return page_folio(page)->pgmap;
> +}
I haven't read to the end yet, but presumably we'll eventually want:
static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
{
WARN_ON(!folio_is_zone_device(folio));
return folio->pgmap;
}
and since we'll want it eventually, maybe now is the time to add it,
and make page_dev_pagemap() simply call it?
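A minimal sketch of the layering being suggested, assuming the folio->pgmap
field introduced by this patch (names as used above, not a final
implementation):
static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
{
	WARN_ON(!folio_is_zone_device(folio));
	return folio->pgmap;
}
static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
{
	/* page_folio() resolves the head folio, so tail pages work too */
	return folio_dev_pagemap(page_folio(page));
}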
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:47 ` Matthew Wilcox
@ 2024-09-10 6:57 ` Alistair Popple
2024-09-10 13:41 ` Matthew Wilcox
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-10 6:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
Matthew Wilcox <willy@infradead.org> writes:
> On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
>> @@ -337,6 +341,7 @@ struct folio {
>> /* private: */
>> };
>> /* public: */
>> + struct dev_pagemap *pgmap;
>
> Shouldn't that be indented by one more tab stop?
>
> And for ease of reading, perhaps it should be placed either immediately
> before or after 'struct list_head lru;'?
>
>> +++ b/include/linux/mmzone.h
>> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
>> return page_zonenum(page) == ZONE_DEVICE;
>> }
>>
>> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
>> +{
>> + WARN_ON(!is_zone_device_page(page));
>> + return page_folio(page)->pgmap;
>> +}
>
> I haven't read to the end yet, but presumably we'll eventually want:
>
> static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
> {
> WARN_ON(!folio_is_zone_device(folio));
> return folio->pgmap;
> }
>
> and since we'll want it eventually, maybe now is the time to add it,
> and make page_dev_pagemap() simply call it?
Sounds reasonable. I had open-coded folio->pgmap where it's needed
because at those points it's "obviously" a ZONE_DEVICE folio. Will add
it.
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 6:57 ` Alistair Popple
@ 2024-09-10 13:41 ` Matthew Wilcox
0 siblings, 0 replies; 53+ messages in thread
From: Matthew Wilcox @ 2024-09-10 13:41 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
On Tue, Sep 10, 2024 at 04:57:41PM +1000, Alistair Popple wrote:
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Tue, Sep 10, 2024 at 02:14:29PM +1000, Alistair Popple wrote:
> >> @@ -337,6 +341,7 @@ struct folio {
> >> /* private: */
> >> };
> >> /* public: */
> >> + struct dev_pagemap *pgmap;
> >
> > Shouldn't that be indented by one more tab stop?
> >
> > And for ease of reading, perhaps it should be placed either immediately
> > before or after 'struct list_head lru;'?
> >
> >> +++ b/include/linux/mmzone.h
> >> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
> >> return page_zonenum(page) == ZONE_DEVICE;
> >> }
> >>
> >> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> >> +{
> >> + WARN_ON(!is_zone_device_page(page));
> >> + return page_folio(page)->pgmap;
> >> +}
> >
> > I haven't read to the end yet, but presumably we'll eventually want:
> >
> > static inline struct dev_pagemap *folio_dev_pagemap(const struct folio *folio)
> > {
> > WARN_ON(!folio_is_zone_device(folio));
> > return folio->pgmap;
> > }
> >
> > and since we'll want it eventually, maybe now is the time to add it,
> > and make page_dev_pagemap() simply call it?
>
> Sounds reasonable. I had open-coded folio->pgmap where it's needed
> because at those points it's "obviously" a ZONE_DEVICE folio. Will add
> it.
Oh, if it's obvious then just do the dereference.
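A minimal sketch of the open-coded form at such a call site, using the names
from the do_swap_page() hunk quoted in the build report further down this
thread (illustration only):
/* a device-private swap entry guarantees a ZONE_DEVICE folio here */
pgmap = page_folio(vmf->page)->pgmap;
ret = pgmap->ops->migrate_to_ram(vmf);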
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
@ 2024-09-10 13:47 ` Bjorn Helgaas
2024-09-11 1:07 ` Alistair Popple
2024-09-11 0:48 ` Logan Gunthorpe
2024-09-22 1:00 ` Dan Williams
2 siblings, 1 reply; 53+ messages in thread
From: Bjorn Helgaas @ 2024-09-10 13:47 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
In subject:
PCI/P2PDMA: ...
would match previous history.
On Tue, Sep 10, 2024 at 02:14:27PM +1000, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> drivers/pci/p2pdma.c | 6 ++++++
> mm/memremap.c | 17 +++++++++++++----
> mm/mm_init.c | 22 ++++++++++++++++++----
> 3 files changed, 37 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..210b9f4 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> }
>
> /*
> + * Initialise the refcount for the freshly allocated page. As we have
> + * just allocated the page no one else should be using it.
> + */
> + set_page_count(virt_to_page(kaddr), 1);
No doubt the subject line is true in some overall context, but it does
seem to say the opposite of what happens here.
Bjorn
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
@ 2024-09-11 0:48 ` Logan Gunthorpe
2024-10-11 0:20 ` Alistair Popple
2024-09-22 1:00 ` Dan Williams
2 siblings, 1 reply; 53+ messages in thread
From: Logan Gunthorpe @ 2024-09-11 0:48 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: vishal.l.verma, dave.jiang, bhelgaas, jack, jgg, catalin.marinas,
will, mpe, npiggin, dave.hansen, ira.weiny, willy, djwong, tytso,
linmiaohe, david, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david
On 2024-09-09 22:14, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> drivers/pci/p2pdma.c | 6 ++++++
> mm/memremap.c | 17 +++++++++++++----
> mm/mm_init.c | 22 ++++++++++++++++++----
> 3 files changed, 37 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..210b9f4 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> }
>
> /*
> + * Initialise the refcount for the freshly allocated page. As we have
> + * just allocated the page no one else should be using it.
> + */
> + set_page_count(virt_to_page(kaddr), 1);
> +
> + /*
> * vm_insert_page() can sleep, so a reference is taken to mapping
> * such that rcu_read_unlock() can be done before inserting the
> * pages
This seems to only set the reference count on the first page, while there can
be more than one page referenced by kaddr.
I suspect the page count adjustment should be done in the for loop
that's a few lines lower than this.
I think a similar mistake was made by other recent changes.
Thanks,
Logan
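A minimal sketch of the per-page initialisation being described, assuming the
existing mapping loop walks the allocation one PAGE_SIZE at a time (the loop
variable and length here are hypothetical, not the actual p2pdma code):
/* initialise the refcount of every page backing the allocation */
for (i = 0; i < len; i += PAGE_SIZE)
	set_page_count(virt_to_page(kaddr + i), 1);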
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-10 13:47 ` Bjorn Helgaas
@ 2024-09-11 1:07 ` Alistair Popple
2024-09-11 13:51 ` Bjorn Helgaas
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-09-11 1:07 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 4f47a13..210b9f4 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> }
>>
>> /*
>> + * Initialise the refcount for the freshly allocated page. As we have
>> + * just allocated the page no one else should be using it.
>> + */
>> + set_page_count(virt_to_page(kaddr), 1);
>
> No doubt the subject line is true in some overall context, but it does
> seem to say the opposite of what happens here.
Fair. It made sense to me from the mm context I was coming from (it was
being initialised to 1 there) but not overall. Something like "move page
refcount initialisation to p2pdma driver" would make more sense?
- Alistair
> Bjorn
* Re: [PATCH 12/12] mm: Remove devmap related functions and page table bits
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
@ 2024-09-11 7:47 ` Chunyan Zhang
2024-09-12 12:55 ` kernel test robot
1 sibling, 0 replies; 53+ messages in thread
From: Chunyan Zhang @ 2024-09-11 7:47 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
Hi Alistair,
On Tue, 10 Sept 2024 at 12:21, Alistair Popple <apopple@nvidia.com> wrote:
>
> Now that DAX and all other reference counts to ZONE_DEVICE pages are
> managed normally there is no need for the special devmap PTE/PMD/PUD
> page table bits. So drop all references to these, freeing up a
> software defined page table bit on architectures supporting it.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: Will Deacon <will@kernel.org> # arm64
> ---
> Documentation/mm/arch_pgtable_helpers.rst | 6 +--
> arch/arm64/Kconfig | 1 +-
> arch/arm64/include/asm/pgtable-prot.h | 1 +-
> arch/arm64/include/asm/pgtable.h | 24 +--------
> arch/powerpc/Kconfig | 1 +-
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +--
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +--
> arch/powerpc/include/asm/book3s/64/pgtable.h | 52 +------------------
> arch/powerpc/include/asm/book3s/64/radix.h | 14 +-----
> arch/x86/Kconfig | 1 +-
> arch/x86/include/asm/pgtable.h | 50 +-----------------
> arch/x86/include/asm/pgtable_types.h | 5 +--
RISC-V's references also need to be cleaned up; that can simply be done
by reverting the commit
216e04bf1e4d (riscv: mm: Add support for ZONE_DEVICE)
Thanks,
Chunyan
> include/linux/mm.h | 7 +--
> include/linux/pfn_t.h | 20 +-------
> include/linux/pgtable.h | 19 +------
> mm/Kconfig | 4 +-
> mm/debug_vm_pgtable.c | 59 +--------------------
> mm/hmm.c | 3 +-
> 18 files changed, 11 insertions(+), 269 deletions(-)
>
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-11 1:07 ` Alistair Popple
@ 2024-09-11 13:51 ` Bjorn Helgaas
0 siblings, 0 replies; 53+ messages in thread
From: Bjorn Helgaas @ 2024-09-11 13:51 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
On Wed, Sep 11, 2024 at 11:07:51AM +1000, Alistair Popple wrote:
>
> >> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >> index 4f47a13..210b9f4 100644
> >> --- a/drivers/pci/p2pdma.c
> >> +++ b/drivers/pci/p2pdma.c
> >> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> >> }
> >>
> >> /*
> >> + * Initialise the refcount for the freshly allocated page. As we have
> >> + * just allocated the page no one else should be using it.
> >> + */
> >> + set_page_count(virt_to_page(kaddr), 1);
> >
> > No doubt the subject line is true in some overall context, but it does
> > seem to say the opposite of what happens here.
>
> Fair. It made sense to me from the mm context I was coming from (it was
> being initialised to 1 there) but not overall. Something like "move page
> refcount initialisation to p2pdma driver" would make more sense?
Definitely would, thanks.
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
@ 2024-09-12 12:44 ` kernel test robot
2024-09-12 12:44 ` kernel test robot
2024-09-22 1:01 ` Dan Williams
3 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2024-09-12 12:44 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: llvm, oe-kbuild-all, Alistair Popple, vishal.l.verma, dave.jiang,
logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs
Hi Alistair,
kernel test robot noticed the following build errors:
[auto build test ERROR on 6f1833b8208c3b9e59eff10792667b6639365146]
url: https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240910-121806
base: 6f1833b8208c3b9e59eff10792667b6639365146
patch link: https://lore.kernel.org/r/c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple%40nvidia.com
patch subject: [PATCH 04/12] mm: Allow compound zone device pages
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240912/202409122055.AMlMSljd-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240912/202409122055.AMlMSljd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409122055.AMlMSljd-lkp@intel.com/
All errors (new ones prefixed by >>):
| ^
In file included from mm/memory.c:44:
In file included from include/linux/mm.h:1106:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:163:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
163 | _SIG_SET_BINOP(sigandnsets, _sig_andn)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:141:3: note: expanded from macro '_SIG_SET_BINOP'
141 | r->sig[2] = op(a2, b2); \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from mm/memory.c:44:
In file included from include/linux/mm.h:1106:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:174:27: note: expanded from macro '_SIG_SET_OP'
174 | case 4: set->sig[3] = op(set->sig[3]); \
| ^ ~
include/linux/signal.h:186:24: note: expanded from macro '_sig_not'
186 | #define _sig_not(x) (~(x))
| ^
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from mm/memory.c:44:
In file included from include/linux/mm.h:1106:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:174:10: note: expanded from macro '_SIG_SET_OP'
174 | case 4: set->sig[3] = op(set->sig[3]); \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from mm/memory.c:44:
In file included from include/linux/mm.h:1106:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:175:20: note: expanded from macro '_SIG_SET_OP'
175 | set->sig[2] = op(set->sig[2]); \
| ^ ~
include/linux/signal.h:186:24: note: expanded from macro '_sig_not'
186 | #define _sig_not(x) (~(x))
| ^
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from mm/memory.c:44:
In file included from include/linux/mm.h:1106:
In file included from include/linux/huge_mm.h:8:
In file included from include/linux/fs.h:33:
In file included from include/linux/percpu-rwsem.h:7:
In file included from include/linux/rcuwait.h:6:
In file included from include/linux/sched/signal.h:6:
include/linux/signal.h:187:1: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
187 | _SIG_SET_OP(signotset, _sig_not)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/signal.h:175:3: note: expanded from macro '_SIG_SET_OP'
175 | set->sig[2] = op(set->sig[2]); \
| ^ ~
arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
24 | unsigned long sig[_NSIG_WORDS];
| ^
In file included from mm/memory.c:51:
include/linux/mman.h:158:9: warning: division by zero is undefined [-Wdivision-by-zero]
158 | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:136:21: note: expanded from macro '_calc_vm_trans'
136 | : ((x) & (bit1)) / ((bit1) / (bit2))))
| ^ ~~~~~~~~~~~~~~~~~
include/linux/mman.h:159:9: warning: division by zero is undefined [-Wdivision-by-zero]
159 | _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) |
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:136:21: note: expanded from macro '_calc_vm_trans'
136 | : ((x) & (bit1)) / ((bit1) / (bit2))))
| ^ ~~~~~~~~~~~~~~~~~
>> mm/memory.c:4052:12: error: call to undeclared function 'page_dev_pagemap'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
4052 | pgmap = page_dev_pagemap(vmf->page);
| ^
>> mm/memory.c:4052:10: error: incompatible integer to pointer conversion assigning to 'struct dev_pagemap *' from 'int' [-Wint-conversion]
4052 | pgmap = page_dev_pagemap(vmf->page);
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
42 warnings and 8 errors generated.
vim +/page_dev_pagemap +4052 mm/memory.c
3988
3989 /*
3990 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3991 * but allow concurrent faults), and pte mapped but not yet locked.
3992 * We return with pte unmapped and unlocked.
3993 *
3994 * We return with the mmap_lock locked or unlocked in the same cases
3995 * as does filemap_fault().
3996 */
3997 vm_fault_t do_swap_page(struct vm_fault *vmf)
3998 {
3999 struct vm_area_struct *vma = vmf->vma;
4000 struct folio *swapcache, *folio = NULL;
4001 struct page *page;
4002 struct swap_info_struct *si = NULL;
4003 rmap_t rmap_flags = RMAP_NONE;
4004 bool need_clear_cache = false;
4005 bool exclusive = false;
4006 swp_entry_t entry;
4007 pte_t pte;
4008 vm_fault_t ret = 0;
4009 void *shadow = NULL;
4010 int nr_pages;
4011 unsigned long page_idx;
4012 unsigned long address;
4013 pte_t *ptep;
4014
4015 if (!pte_unmap_same(vmf))
4016 goto out;
4017
4018 entry = pte_to_swp_entry(vmf->orig_pte);
4019 if (unlikely(non_swap_entry(entry))) {
4020 if (is_migration_entry(entry)) {
4021 migration_entry_wait(vma->vm_mm, vmf->pmd,
4022 vmf->address);
4023 } else if (is_device_exclusive_entry(entry)) {
4024 vmf->page = pfn_swap_entry_to_page(entry);
4025 ret = remove_device_exclusive_entry(vmf);
4026 } else if (is_device_private_entry(entry)) {
4027 struct dev_pagemap *pgmap;
4028 if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
4029 /*
4030 * migrate_to_ram is not yet ready to operate
4031 * under VMA lock.
4032 */
4033 vma_end_read(vma);
4034 ret = VM_FAULT_RETRY;
4035 goto out;
4036 }
4037
4038 vmf->page = pfn_swap_entry_to_page(entry);
4039 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
4040 vmf->address, &vmf->ptl);
4041 if (unlikely(!vmf->pte ||
4042 !pte_same(ptep_get(vmf->pte),
4043 vmf->orig_pte)))
4044 goto unlock;
4045
4046 /*
4047 * Get a page reference while we know the page can't be
4048 * freed.
4049 */
4050 get_page(vmf->page);
4051 pte_unmap_unlock(vmf->pte, vmf->ptl);
> 4052 pgmap = page_dev_pagemap(vmf->page);
4053 ret = pgmap->ops->migrate_to_ram(vmf);
4054 put_page(vmf->page);
4055 } else if (is_hwpoison_entry(entry)) {
4056 ret = VM_FAULT_HWPOISON;
4057 } else if (is_pte_marker_entry(entry)) {
4058 ret = handle_pte_marker(vmf);
4059 } else {
4060 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
4061 ret = VM_FAULT_SIGBUS;
4062 }
4063 goto out;
4064 }
4065
4066 /* Prevent swapoff from happening to us. */
4067 si = get_swap_device(entry);
4068 if (unlikely(!si))
4069 goto out;
4070
4071 folio = swap_cache_get_folio(entry, vma, vmf->address);
4072 if (folio)
4073 page = folio_file_page(folio, swp_offset(entry));
4074 swapcache = folio;
4075
4076 if (!folio) {
4077 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
4078 __swap_count(entry) == 1) {
4079 /*
4080 * Prevent parallel swapin from proceeding with
4081 * the cache flag. Otherwise, another thread may
4082 * finish swapin first, free the entry, and swapout
4083 * reusing the same entry. It's undetectable as
4084 * pte_same() returns true due to entry reuse.
4085 */
4086 if (swapcache_prepare(entry, 1)) {
4087 /* Relax a bit to prevent rapid repeated page faults */
4088 schedule_timeout_uninterruptible(1);
4089 goto out;
4090 }
4091 need_clear_cache = true;
4092
4093 /* skip swapcache */
4094 folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
4095 vma, vmf->address, false);
4096 if (folio) {
4097 __folio_set_locked(folio);
4098 __folio_set_swapbacked(folio);
4099
4100 if (mem_cgroup_swapin_charge_folio(folio,
4101 vma->vm_mm, GFP_KERNEL,
4102 entry)) {
4103 ret = VM_FAULT_OOM;
4104 goto out_page;
4105 }
4106 mem_cgroup_swapin_uncharge_swap(entry);
4107
4108 shadow = get_shadow_from_swap_cache(entry);
4109 if (shadow)
4110 workingset_refault(folio, shadow);
4111
4112 folio_add_lru(folio);
4113
4114 /* To provide entry to swap_read_folio() */
4115 folio->swap = entry;
4116 swap_read_folio(folio, NULL);
4117 folio->private = NULL;
4118 }
4119 } else {
4120 folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
4121 vmf);
4122 swapcache = folio;
4123 }
4124
4125 if (!folio) {
4126 /*
4127 * Back out if somebody else faulted in this pte
4128 * while we released the pte lock.
4129 */
4130 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
4131 vmf->address, &vmf->ptl);
4132 if (likely(vmf->pte &&
4133 pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
4134 ret = VM_FAULT_OOM;
4135 goto unlock;
4136 }
4137
4138 /* Had to read the page from swap area: Major fault */
4139 ret = VM_FAULT_MAJOR;
4140 count_vm_event(PGMAJFAULT);
4141 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
4142 page = folio_file_page(folio, swp_offset(entry));
4143 } else if (PageHWPoison(page)) {
4144 /*
4145 * hwpoisoned dirty swapcache pages are kept for killing
4146 * owner processes (which may be unknown at hwpoison time)
4147 */
4148 ret = VM_FAULT_HWPOISON;
4149 goto out_release;
4150 }
4151
4152 ret |= folio_lock_or_retry(folio, vmf);
4153 if (ret & VM_FAULT_RETRY)
4154 goto out_release;
4155
4156 if (swapcache) {
4157 /*
4158 * Make sure folio_free_swap() or swapoff did not release the
4159 * swapcache from under us. The page pin, and pte_same test
4160 * below, are not enough to exclude that. Even if it is still
4161 * swapcache, we need to check that the page's swap has not
4162 * changed.
4163 */
4164 if (unlikely(!folio_test_swapcache(folio) ||
4165 page_swap_entry(page).val != entry.val))
4166 goto out_page;
4167
4168 /*
4169 * KSM sometimes has to copy on read faults, for example, if
4170 * page->index of !PageKSM() pages would be nonlinear inside the
4171 * anon VMA -- PageKSM() is lost on actual swapout.
4172 */
4173 folio = ksm_might_need_to_copy(folio, vma, vmf->address);
4174 if (unlikely(!folio)) {
4175 ret = VM_FAULT_OOM;
4176 folio = swapcache;
4177 goto out_page;
4178 } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
4179 ret = VM_FAULT_HWPOISON;
4180 folio = swapcache;
4181 goto out_page;
4182 }
4183 if (folio != swapcache)
4184 page = folio_page(folio, 0);
4185
4186 /*
4187 * If we want to map a page that's in the swapcache writable, we
4188 * have to detect via the refcount if we're really the exclusive
4189 * owner. Try removing the extra reference from the local LRU
4190 * caches if required.
4191 */
4192 if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
4193 !folio_test_ksm(folio) && !folio_test_lru(folio))
4194 lru_add_drain();
4195 }
4196
4197 folio_throttle_swaprate(folio, GFP_KERNEL);
4198
4199 /*
4200 * Back out if somebody else already faulted in this pte.
4201 */
4202 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
4203 &vmf->ptl);
4204 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
4205 goto out_nomap;
4206
4207 if (unlikely(!folio_test_uptodate(folio))) {
4208 ret = VM_FAULT_SIGBUS;
4209 goto out_nomap;
4210 }
4211
4212 nr_pages = 1;
4213 page_idx = 0;
4214 address = vmf->address;
4215 ptep = vmf->pte;
4216 if (folio_test_large(folio) && folio_test_swapcache(folio)) {
4217 int nr = folio_nr_pages(folio);
4218 unsigned long idx = folio_page_idx(folio, page);
4219 unsigned long folio_start = address - idx * PAGE_SIZE;
4220 unsigned long folio_end = folio_start + nr * PAGE_SIZE;
4221 pte_t *folio_ptep;
4222 pte_t folio_pte;
4223
4224 if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
4225 goto check_folio;
4226 if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
4227 goto check_folio;
4228
4229 folio_ptep = vmf->pte - idx;
4230 folio_pte = ptep_get(folio_ptep);
4231 if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
4232 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
4233 goto check_folio;
4234
4235 page_idx = idx;
4236 address = folio_start;
4237 ptep = folio_ptep;
4238 nr_pages = nr;
4239 entry = folio->swap;
4240 page = &folio->page;
4241 }
4242
4243 check_folio:
4244 /*
4245 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
4246 * must never point at an anonymous page in the swapcache that is
4247 * PG_anon_exclusive. Sanity check that this holds and especially, that
4248 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
4249 * check after taking the PT lock and making sure that nobody
4250 * concurrently faulted in this page and set PG_anon_exclusive.
4251 */
4252 BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
4253 BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
4254
4255 /*
4256 * Check under PT lock (to protect against concurrent fork() sharing
4257 * the swap entry concurrently) for certainly exclusive pages.
4258 */
4259 if (!folio_test_ksm(folio)) {
4260 exclusive = pte_swp_exclusive(vmf->orig_pte);
4261 if (folio != swapcache) {
4262 /*
4263 * We have a fresh page that is not exposed to the
4264 * swapcache -> certainly exclusive.
4265 */
4266 exclusive = true;
4267 } else if (exclusive && folio_test_writeback(folio) &&
4268 data_race(si->flags & SWP_STABLE_WRITES)) {
4269 /*
4270 * This is tricky: not all swap backends support
4271 * concurrent page modifications while under writeback.
4272 *
4273 * So if we stumble over such a page in the swapcache
4274 * we must not set the page exclusive, otherwise we can
4275 * map it writable without further checks and modify it
4276 * while still under writeback.
4277 *
4278 * For these problematic swap backends, simply drop the
4279 * exclusive marker: this is perfectly fine as we start
4280 * writeback only if we fully unmapped the page and
4281 * there are no unexpected references on the page after
4282 * unmapping succeeded. After fully unmapped, no
4283 * further GUP references (FOLL_GET and FOLL_PIN) can
4284 * appear, so dropping the exclusive marker and mapping
4285 * it only R/O is fine.
4286 */
4287 exclusive = false;
4288 }
4289 }
4290
4291 /*
4292 * Some architectures may have to restore extra metadata to the page
4293 * when reading from swap. This metadata may be indexed by swap entry
4294 * so this must be called before swap_free().
4295 */
4296 arch_swap_restore(folio_swap(entry, folio), folio);
4297
4298 /*
4299 * Remove the swap entry and conditionally try to free up the swapcache.
4300 * We're already holding a reference on the page but haven't mapped it
4301 * yet.
4302 */
4303 swap_free_nr(entry, nr_pages);
4304 if (should_try_to_free_swap(folio, vma, vmf->flags))
4305 folio_free_swap(folio);
4306
4307 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
4308 add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
4309 pte = mk_pte(page, vma->vm_page_prot);
4310 if (pte_swp_soft_dirty(vmf->orig_pte))
4311 pte = pte_mksoft_dirty(pte);
4312 if (pte_swp_uffd_wp(vmf->orig_pte))
4313 pte = pte_mkuffd_wp(pte);
4314
4315 /*
4316 * Same logic as in do_wp_page(); however, optimize for pages that are
4317 * certainly not shared either because we just allocated them without
4318 * exposing them to the swapcache or because the swap entry indicates
4319 * exclusivity.
4320 */
4321 if (!folio_test_ksm(folio) &&
4322 (exclusive || folio_ref_count(folio) == 1)) {
4323 if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
4324 !pte_needs_soft_dirty_wp(vma, pte)) {
4325 pte = pte_mkwrite(pte, vma);
4326 if (vmf->flags & FAULT_FLAG_WRITE) {
4327 pte = pte_mkdirty(pte);
4328 vmf->flags &= ~FAULT_FLAG_WRITE;
4329 }
4330 }
4331 rmap_flags |= RMAP_EXCLUSIVE;
4332 }
4333 folio_ref_add(folio, nr_pages - 1);
4334 flush_icache_pages(vma, page, nr_pages);
4335 vmf->orig_pte = pte_advance_pfn(pte, page_idx);
4336
4337 /* ksm created a completely new copy */
4338 if (unlikely(folio != swapcache && swapcache)) {
4339 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
4340 folio_add_lru_vma(folio, vma);
4341 } else if (!folio_test_anon(folio)) {
4342 /*
4343 * We currently only expect small !anon folios, which are either
4344 * fully exclusive or fully shared. If we ever get large folios
4345 * here, we have to be careful.
4346 */
4347 VM_WARN_ON_ONCE(folio_test_large(folio));
4348 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
4349 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
4350 } else {
4351 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
4352 rmap_flags);
4353 }
4354
4355 VM_BUG_ON(!folio_test_anon(folio) ||
4356 (pte_write(pte) && !PageAnonExclusive(page)));
4357 set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
4358 arch_do_swap_page_nr(vma->vm_mm, vma, address,
4359 pte, pte, nr_pages);
4360
4361 folio_unlock(folio);
4362 if (folio != swapcache && swapcache) {
4363 /*
4364 * Hold the lock to avoid the swap entry to be reused
4365 * until we take the PT lock for the pte_same() check
4366 * (to avoid false positives from pte_same). For
4367 * further safety release the lock after the swap_free
4368 * so that the swap count won't change under a
4369 * parallel locked swapcache.
4370 */
4371 folio_unlock(swapcache);
4372 folio_put(swapcache);
4373 }
4374
4375 if (vmf->flags & FAULT_FLAG_WRITE) {
4376 ret |= do_wp_page(vmf);
4377 if (ret & VM_FAULT_ERROR)
4378 ret &= VM_FAULT_ERROR;
4379 goto out;
4380 }
4381
4382 /* No need to invalidate - it was non-present before */
4383 update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
4384 unlock:
4385 if (vmf->pte)
4386 pte_unmap_unlock(vmf->pte, vmf->ptl);
4387 out:
4388 /* Clear the swap cache pin for direct swapin after PTL unlock */
4389 if (need_clear_cache)
4390 swapcache_clear(si, entry, 1);
4391 if (si)
4392 put_swap_device(si);
4393 return ret;
4394 out_nomap:
4395 if (vmf->pte)
4396 pte_unmap_unlock(vmf->pte, vmf->ptl);
4397 out_page:
4398 folio_unlock(folio);
4399 out_release:
4400 folio_put(folio);
4401 if (folio != swapcache && swapcache) {
4402 folio_unlock(swapcache);
4403 folio_put(swapcache);
4404 }
4405 if (need_clear_cache)
4406 swapcache_clear(si, entry, 1);
4407 if (si)
4408 put_swap_device(si);
4409 return ret;
4410 }
4411
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
2024-09-12 12:44 ` kernel test robot
@ 2024-09-12 12:44 ` kernel test robot
2024-09-22 1:01 ` Dan Williams
3 siblings, 0 replies; 53+ messages in thread
From: kernel test robot @ 2024-09-12 12:44 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: oe-kbuild-all, Alistair Popple, vishal.l.verma, dave.jiang,
logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs
Hi Alistair,
kernel test robot noticed the following build errors:
[auto build test ERROR on 6f1833b8208c3b9e59eff10792667b6639365146]
url: https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240910-121806
base: 6f1833b8208c3b9e59eff10792667b6639365146
patch link: https://lore.kernel.org/r/c7026449473790e2844bb82012216c57047c7639.1725941415.git-series.apopple%40nvidia.com
patch subject: [PATCH 04/12] mm: Allow compound zone device pages
config: csky-defconfig (https://download.01.org/0day-ci/archive/20240912/202409122024.PPIwP6vb-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240912/202409122024.PPIwP6vb-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409122024.PPIwP6vb-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from include/linux/mm.h:32,
from mm/gup.c:7:
include/linux/memremap.h: In function 'is_device_private_page':
include/linux/memremap.h:164:17: error: implicit declaration of function 'page_dev_pagemap' [-Wimplicit-function-declaration]
164 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
| ^~~~~~~~~~~~~~~~
include/linux/memremap.h:164:39: error: invalid type argument of '->' (have 'int')
164 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
| ^~
include/linux/memremap.h: In function 'is_pci_p2pdma_page':
include/linux/memremap.h:176:39: error: invalid type argument of '->' (have 'int')
176 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
| ^~
include/linux/memremap.h: In function 'is_device_coherent_page':
include/linux/memremap.h:182:39: error: invalid type argument of '->' (have 'int')
182 | page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
| ^~
include/linux/memremap.h: In function 'is_pci_p2pdma_page':
>> include/linux/memremap.h:177:1: warning: control reaches end of non-void function [-Wreturn-type]
177 | }
| ^
include/linux/memremap.h: In function 'is_device_coherent_page':
include/linux/memremap.h:183:1: warning: control reaches end of non-void function [-Wreturn-type]
183 | }
| ^
--
In file included from include/linux/mm.h:32,
from mm/memory.c:44:
include/linux/memremap.h: In function 'is_device_private_page':
include/linux/memremap.h:164:17: error: implicit declaration of function 'page_dev_pagemap' [-Wimplicit-function-declaration]
164 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
| ^~~~~~~~~~~~~~~~
include/linux/memremap.h:164:39: error: invalid type argument of '->' (have 'int')
164 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
| ^~
include/linux/memremap.h: In function 'is_pci_p2pdma_page':
include/linux/memremap.h:176:39: error: invalid type argument of '->' (have 'int')
176 | page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
| ^~
include/linux/memremap.h: In function 'is_device_coherent_page':
include/linux/memremap.h:182:39: error: invalid type argument of '->' (have 'int')
182 | page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
| ^~
mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:4052:31: error: assignment to 'struct dev_pagemap *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
4052 | pgmap = page_dev_pagemap(vmf->page);
| ^
include/linux/memremap.h: In function 'is_device_private_page':
include/linux/memremap.h:165:1: warning: control reaches end of non-void function [-Wreturn-type]
165 | }
| ^
vim +4052 mm/memory.c
3988
3989 /*
3990 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3991 * but allow concurrent faults), and pte mapped but not yet locked.
3992 * We return with pte unmapped and unlocked.
3993 *
3994 * We return with the mmap_lock locked or unlocked in the same cases
3995 * as does filemap_fault().
3996 */
3997 vm_fault_t do_swap_page(struct vm_fault *vmf)
3998 {
3999 struct vm_area_struct *vma = vmf->vma;
4000 struct folio *swapcache, *folio = NULL;
4001 struct page *page;
4002 struct swap_info_struct *si = NULL;
4003 rmap_t rmap_flags = RMAP_NONE;
4004 bool need_clear_cache = false;
4005 bool exclusive = false;
4006 swp_entry_t entry;
4007 pte_t pte;
4008 vm_fault_t ret = 0;
4009 void *shadow = NULL;
4010 int nr_pages;
4011 unsigned long page_idx;
4012 unsigned long address;
4013 pte_t *ptep;
4014
4015 if (!pte_unmap_same(vmf))
4016 goto out;
4017
4018 entry = pte_to_swp_entry(vmf->orig_pte);
4019 if (unlikely(non_swap_entry(entry))) {
4020 if (is_migration_entry(entry)) {
4021 migration_entry_wait(vma->vm_mm, vmf->pmd,
4022 vmf->address);
4023 } else if (is_device_exclusive_entry(entry)) {
4024 vmf->page = pfn_swap_entry_to_page(entry);
4025 ret = remove_device_exclusive_entry(vmf);
4026 } else if (is_device_private_entry(entry)) {
4027 struct dev_pagemap *pgmap;
4028 if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
4029 /*
4030 * migrate_to_ram is not yet ready to operate
4031 * under VMA lock.
4032 */
4033 vma_end_read(vma);
4034 ret = VM_FAULT_RETRY;
4035 goto out;
4036 }
4037
4038 vmf->page = pfn_swap_entry_to_page(entry);
4039 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
4040 vmf->address, &vmf->ptl);
4041 if (unlikely(!vmf->pte ||
4042 !pte_same(ptep_get(vmf->pte),
4043 vmf->orig_pte)))
4044 goto unlock;
4045
4046 /*
4047 * Get a page reference while we know the page can't be
4048 * freed.
4049 */
4050 get_page(vmf->page);
4051 pte_unmap_unlock(vmf->pte, vmf->ptl);
> 4052 pgmap = page_dev_pagemap(vmf->page);
4053 ret = pgmap->ops->migrate_to_ram(vmf);
4054 put_page(vmf->page);
4055 } else if (is_hwpoison_entry(entry)) {
4056 ret = VM_FAULT_HWPOISON;
4057 } else if (is_pte_marker_entry(entry)) {
4058 ret = handle_pte_marker(vmf);
4059 } else {
4060 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
4061 ret = VM_FAULT_SIGBUS;
4062 }
4063 goto out;
4064 }
4065
4066 /* Prevent swapoff from happening to us. */
4067 si = get_swap_device(entry);
4068 if (unlikely(!si))
4069 goto out;
4070
4071 folio = swap_cache_get_folio(entry, vma, vmf->address);
4072 if (folio)
4073 page = folio_file_page(folio, swp_offset(entry));
4074 swapcache = folio;
4075
4076 if (!folio) {
4077 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
4078 __swap_count(entry) == 1) {
4079 /*
4080 * Prevent parallel swapin from proceeding with
4081 * the cache flag. Otherwise, another thread may
4082 * finish swapin first, free the entry, and swapout
4083 * reusing the same entry. It's undetectable as
4084 * pte_same() returns true due to entry reuse.
4085 */
4086 if (swapcache_prepare(entry, 1)) {
4087 /* Relax a bit to prevent rapid repeated page faults */
4088 schedule_timeout_uninterruptible(1);
4089 goto out;
4090 }
4091 need_clear_cache = true;
4092
4093 /* skip swapcache */
4094 folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
4095 vma, vmf->address, false);
4096 if (folio) {
4097 __folio_set_locked(folio);
4098 __folio_set_swapbacked(folio);
4099
4100 if (mem_cgroup_swapin_charge_folio(folio,
4101 vma->vm_mm, GFP_KERNEL,
4102 entry)) {
4103 ret = VM_FAULT_OOM;
4104 goto out_page;
4105 }
4106 mem_cgroup_swapin_uncharge_swap(entry);
4107
4108 shadow = get_shadow_from_swap_cache(entry);
4109 if (shadow)
4110 workingset_refault(folio, shadow);
4111
4112 folio_add_lru(folio);
4113
4114 /* To provide entry to swap_read_folio() */
4115 folio->swap = entry;
4116 swap_read_folio(folio, NULL);
4117 folio->private = NULL;
4118 }
4119 } else {
4120 folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
4121 vmf);
4122 swapcache = folio;
4123 }
4124
4125 if (!folio) {
4126 /*
4127 * Back out if somebody else faulted in this pte
4128 * while we released the pte lock.
4129 */
4130 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
4131 vmf->address, &vmf->ptl);
4132 if (likely(vmf->pte &&
4133 pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
4134 ret = VM_FAULT_OOM;
4135 goto unlock;
4136 }
4137
4138 /* Had to read the page from swap area: Major fault */
4139 ret = VM_FAULT_MAJOR;
4140 count_vm_event(PGMAJFAULT);
4141 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
4142 page = folio_file_page(folio, swp_offset(entry));
4143 } else if (PageHWPoison(page)) {
4144 /*
4145 * hwpoisoned dirty swapcache pages are kept for killing
4146 * owner processes (which may be unknown at hwpoison time)
4147 */
4148 ret = VM_FAULT_HWPOISON;
4149 goto out_release;
4150 }
4151
4152 ret |= folio_lock_or_retry(folio, vmf);
4153 if (ret & VM_FAULT_RETRY)
4154 goto out_release;
4155
4156 if (swapcache) {
4157 /*
4158 * Make sure folio_free_swap() or swapoff did not release the
4159 * swapcache from under us. The page pin, and pte_same test
4160 * below, are not enough to exclude that. Even if it is still
4161 * swapcache, we need to check that the page's swap has not
4162 * changed.
4163 */
4164 if (unlikely(!folio_test_swapcache(folio) ||
4165 page_swap_entry(page).val != entry.val))
4166 goto out_page;
4167
4168 /*
4169 * KSM sometimes has to copy on read faults, for example, if
4170 * page->index of !PageKSM() pages would be nonlinear inside the
4171 * anon VMA -- PageKSM() is lost on actual swapout.
4172 */
4173 folio = ksm_might_need_to_copy(folio, vma, vmf->address);
4174 if (unlikely(!folio)) {
4175 ret = VM_FAULT_OOM;
4176 folio = swapcache;
4177 goto out_page;
4178 } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
4179 ret = VM_FAULT_HWPOISON;
4180 folio = swapcache;
4181 goto out_page;
4182 }
4183 if (folio != swapcache)
4184 page = folio_page(folio, 0);
4185
4186 /*
4187 * If we want to map a page that's in the swapcache writable, we
4188 * have to detect via the refcount if we're really the exclusive
4189 * owner. Try removing the extra reference from the local LRU
4190 * caches if required.
4191 */
4192 if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
4193 !folio_test_ksm(folio) && !folio_test_lru(folio))
4194 lru_add_drain();
4195 }
4196
4197 folio_throttle_swaprate(folio, GFP_KERNEL);
4198
4199 /*
4200 * Back out if somebody else already faulted in this pte.
4201 */
4202 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
4203 &vmf->ptl);
4204 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
4205 goto out_nomap;
4206
4207 if (unlikely(!folio_test_uptodate(folio))) {
4208 ret = VM_FAULT_SIGBUS;
4209 goto out_nomap;
4210 }
4211
4212 nr_pages = 1;
4213 page_idx = 0;
4214 address = vmf->address;
4215 ptep = vmf->pte;
4216 if (folio_test_large(folio) && folio_test_swapcache(folio)) {
4217 int nr = folio_nr_pages(folio);
4218 unsigned long idx = folio_page_idx(folio, page);
4219 unsigned long folio_start = address - idx * PAGE_SIZE;
4220 unsigned long folio_end = folio_start + nr * PAGE_SIZE;
4221 pte_t *folio_ptep;
4222 pte_t folio_pte;
4223
4224 if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
4225 goto check_folio;
4226 if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
4227 goto check_folio;
4228
4229 folio_ptep = vmf->pte - idx;
4230 folio_pte = ptep_get(folio_ptep);
4231 if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
4232 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
4233 goto check_folio;
4234
4235 page_idx = idx;
4236 address = folio_start;
4237 ptep = folio_ptep;
4238 nr_pages = nr;
4239 entry = folio->swap;
4240 page = &folio->page;
4241 }
4242
4243 check_folio:
4244 /*
4245 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
4246 * must never point at an anonymous page in the swapcache that is
4247 * PG_anon_exclusive. Sanity check that this holds and especially, that
4248 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
4249 * check after taking the PT lock and making sure that nobody
4250 * concurrently faulted in this page and set PG_anon_exclusive.
4251 */
4252 BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
4253 BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
4254
4255 /*
4256 * Check under PT lock (to protect against concurrent fork() sharing
4257 * the swap entry concurrently) for certainly exclusive pages.
4258 */
4259 if (!folio_test_ksm(folio)) {
4260 exclusive = pte_swp_exclusive(vmf->orig_pte);
4261 if (folio != swapcache) {
4262 /*
4263 * We have a fresh page that is not exposed to the
4264 * swapcache -> certainly exclusive.
4265 */
4266 exclusive = true;
4267 } else if (exclusive && folio_test_writeback(folio) &&
4268 data_race(si->flags & SWP_STABLE_WRITES)) {
4269 /*
4270 * This is tricky: not all swap backends support
4271 * concurrent page modifications while under writeback.
4272 *
4273 * So if we stumble over such a page in the swapcache
4274 * we must not set the page exclusive, otherwise we can
4275 * map it writable without further checks and modify it
4276 * while still under writeback.
4277 *
4278 * For these problematic swap backends, simply drop the
4279 * exclusive marker: this is perfectly fine as we start
4280 * writeback only if we fully unmapped the page and
4281 * there are no unexpected references on the page after
4282 * unmapping succeeded. After fully unmapped, no
4283 * further GUP references (FOLL_GET and FOLL_PIN) can
4284 * appear, so dropping the exclusive marker and mapping
4285 * it only R/O is fine.
4286 */
4287 exclusive = false;
4288 }
4289 }
4290
4291 /*
4292 * Some architectures may have to restore extra metadata to the page
4293 * when reading from swap. This metadata may be indexed by swap entry
4294 * so this must be called before swap_free().
4295 */
4296 arch_swap_restore(folio_swap(entry, folio), folio);
4297
4298 /*
4299 * Remove the swap entry and conditionally try to free up the swapcache.
4300 * We're already holding a reference on the page but haven't mapped it
4301 * yet.
4302 */
4303 swap_free_nr(entry, nr_pages);
4304 if (should_try_to_free_swap(folio, vma, vmf->flags))
4305 folio_free_swap(folio);
4306
4307 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
4308 add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
4309 pte = mk_pte(page, vma->vm_page_prot);
4310 if (pte_swp_soft_dirty(vmf->orig_pte))
4311 pte = pte_mksoft_dirty(pte);
4312 if (pte_swp_uffd_wp(vmf->orig_pte))
4313 pte = pte_mkuffd_wp(pte);
4314
4315 /*
4316 * Same logic as in do_wp_page(); however, optimize for pages that are
4317 * certainly not shared either because we just allocated them without
4318 * exposing them to the swapcache or because the swap entry indicates
4319 * exclusivity.
4320 */
4321 if (!folio_test_ksm(folio) &&
4322 (exclusive || folio_ref_count(folio) == 1)) {
4323 if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
4324 !pte_needs_soft_dirty_wp(vma, pte)) {
4325 pte = pte_mkwrite(pte, vma);
4326 if (vmf->flags & FAULT_FLAG_WRITE) {
4327 pte = pte_mkdirty(pte);
4328 vmf->flags &= ~FAULT_FLAG_WRITE;
4329 }
4330 }
4331 rmap_flags |= RMAP_EXCLUSIVE;
4332 }
4333 folio_ref_add(folio, nr_pages - 1);
4334 flush_icache_pages(vma, page, nr_pages);
4335 vmf->orig_pte = pte_advance_pfn(pte, page_idx);
4336
4337 /* ksm created a completely new copy */
4338 if (unlikely(folio != swapcache && swapcache)) {
4339 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
4340 folio_add_lru_vma(folio, vma);
4341 } else if (!folio_test_anon(folio)) {
4342 /*
4343 * We currently only expect small !anon folios, which are either
4344 * fully exclusive or fully shared. If we ever get large folios
4345 * here, we have to be careful.
4346 */
4347 VM_WARN_ON_ONCE(folio_test_large(folio));
4348 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
4349 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
4350 } else {
4351 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
4352 rmap_flags);
4353 }
4354
4355 VM_BUG_ON(!folio_test_anon(folio) ||
4356 (pte_write(pte) && !PageAnonExclusive(page)));
4357 set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
4358 arch_do_swap_page_nr(vma->vm_mm, vma, address,
4359 pte, pte, nr_pages);
4360
4361 folio_unlock(folio);
4362 if (folio != swapcache && swapcache) {
4363 /*
4364 * Hold the lock to avoid the swap entry to be reused
4365 * until we take the PT lock for the pte_same() check
4366 * (to avoid false positives from pte_same). For
4367 * further safety release the lock after the swap_free
4368 * so that the swap count won't change under a
4369 * parallel locked swapcache.
4370 */
4371 folio_unlock(swapcache);
4372 folio_put(swapcache);
4373 }
4374
4375 if (vmf->flags & FAULT_FLAG_WRITE) {
4376 ret |= do_wp_page(vmf);
4377 if (ret & VM_FAULT_ERROR)
4378 ret &= VM_FAULT_ERROR;
4379 goto out;
4380 }
4381
4382 /* No need to invalidate - it was non-present before */
4383 update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
4384 unlock:
4385 if (vmf->pte)
4386 pte_unmap_unlock(vmf->pte, vmf->ptl);
4387 out:
4388 /* Clear the swap cache pin for direct swapin after PTL unlock */
4389 if (need_clear_cache)
4390 swapcache_clear(si, entry, 1);
4391 if (si)
4392 put_swap_device(si);
4393 return ret;
4394 out_nomap:
4395 if (vmf->pte)
4396 pte_unmap_unlock(vmf->pte, vmf->ptl);
4397 out_page:
4398 folio_unlock(folio);
4399 out_release:
4400 folio_put(folio);
4401 if (folio != swapcache && swapcache) {
4402 folio_unlock(swapcache);
4403 folio_put(swapcache);
4404 }
4405 if (need_clear_cache)
4406 swapcache_clear(si, entry, 1);
4407 if (si)
4408 put_swap_device(si);
4409 return ret;
4410 }
4411
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 12/12] mm: Remove devmap related functions and page table bits
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11 7:47 ` Chunyan Zhang
@ 2024-09-12 12:55 ` kernel test robot
1 sibling, 0 replies; 53+ messages in thread
From: kernel test robot @ 2024-09-12 12:55 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: oe-kbuild-all, Alistair Popple, vishal.l.verma, dave.jiang,
logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs
Hi Alistair,
kernel test robot noticed the following build errors:
[auto build test ERROR on 6f1833b8208c3b9e59eff10792667b6639365146]
url: https://github.com/intel-lab-lkp/linux/commits/Alistair-Popple/mm-gup-c-Remove-redundant-check-for-PCI-P2PDMA-page/20240910-121806
base: 6f1833b8208c3b9e59eff10792667b6639365146
patch link: https://lore.kernel.org/r/39b1a78aa16ebe5db1c4b723e44fbdd217d302ac.1725941415.git-series.apopple%40nvidia.com
patch subject: [PATCH 12/12] mm: Remove devmap related functions and page table bits
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20240912/202409122016.5i2hNKRU-lkp@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240912/202409122016.5i2hNKRU-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409122016.5i2hNKRU-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/powerpc/include/asm/book3s/64/mmu-hash.h:20,
from arch/powerpc/include/asm/book3s/64/mmu.h:32,
from arch/powerpc/include/asm/mmu.h:377,
from arch/powerpc/include/asm/paca.h:18,
from arch/powerpc/include/asm/current.h:13,
from include/linux/thread_info.h:23,
from include/asm-generic/preempt.h:5,
from ./arch/powerpc/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:79,
from include/linux/alloc_tag.h:11,
from include/linux/rhashtable-types.h:12,
from include/linux/ipc.h:7,
from include/uapi/linux/sem.h:5,
from include/linux/sem.h:5,
from include/linux/compat.h:14,
from arch/powerpc/kernel/asm-offsets.c:12:
>> arch/powerpc/include/asm/book3s/64/pgtable.h:1390:1: error: expected identifier or '(' before '}' token
1390 | }
| ^
make[3]: *** [scripts/Makefile.build:117: arch/powerpc/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1193: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:224: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:224: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +1390 arch/powerpc/include/asm/book3s/64/pgtable.h
953c66c2b22a30 Aneesh Kumar K.V 2016-12-12 1389
ebd31197931d75 Oliver O'Halloran 2017-06-28 @1390 }
6a1ea36260f69f Aneesh Kumar K.V 2016-04-29 1391 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
ebd31197931d75 Oliver O'Halloran 2017-06-28 1392
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
@ 2024-09-22 1:00 ` Dan Williams
0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2024-09-22 1:00 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs, therefore the
> check in __gup_device_huge() is redundant. Remove it.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> mm/gup.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index d19884e..5d2fc9a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2954,11 +2954,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
> break;
> }
>
> - if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
> - gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
> - break;
> - }
> -
Looks good.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
2024-09-11 0:48 ` Logan Gunthorpe
@ 2024-09-22 1:00 ` Dan Williams
2024-10-11 0:17 ` Alistair Popple
2 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-22 1:00 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> drivers/pci/p2pdma.c | 6 ++++++
> mm/memremap.c | 17 +++++++++++++----
> mm/mm_init.c | 22 ++++++++++++++++++----
> 3 files changed, 37 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13..210b9f4 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> }
>
> /*
> + * Initialise the refcount for the freshly allocated page. As we have
> + * just allocated the page no one else should be using it.
> + */
> + set_page_count(virt_to_page(kaddr), 1);
Perhaps VM_WARN_ONCE to back up that assumption?
I also notice that there are multiple virt_to_page() lookups in this
routine, so maybe time for a local @page variable.
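Roughly something like the following in p2pmem_alloc_mmap() (an
untested sketch just to show the shape; it assumes the refcount really
is zero at this point now that the page is no longer pre-initialised
to one at memmap init time):
	struct page *page = virt_to_page(kaddr);
	/* Back up the "no one else should be using it" assumption. */
	VM_WARN_ON_ONCE(page_ref_count(page));
	set_page_count(page, 1);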
> +
> + /*
> * vm_insert_page() can sleep, so a reference is taken to mapping
> * such that rcu_read_unlock() can be done before inserting the
> * pages
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 40d4547..07bbe0e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
> folio->mapping = NULL;
> folio->page.pgmap->ops->page_free(folio_page(folio, 0));
>
> - if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
> - folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
> + switch (folio->page.pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
> + put_dev_pagemap(folio->page.pgmap);
> + break;
> +
> + case MEMORY_DEVICE_FS_DAX:
> + case MEMORY_DEVICE_GENERIC:
> /*
> * Reset the refcount to 1 to prepare for handing out the page
> * again.
> */
> folio_set_count(folio, 1);
> - else
> - put_dev_pagemap(folio->page.pgmap);
> + break;
> +
> + case MEMORY_DEVICE_PCI_P2PDMA:
> + break;
A follow-on cleanup is that either all implementations should be
put_dev_pagemap(), or none of them. Put the onus on the implementation
to track how many pages it has handed out in the implementation
allocator.
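Purely for illustration, the "all of them" end state in
free_zone_device_folio() might collapse the switch above into
something like this. Not part of this series, and it assumes every
driver-side allocator pairs it with its own get_dev_pagemap() when
handing a page out, which is not true for all types today:
	switch (folio->page.pgmap->type) {
	case MEMORY_DEVICE_PRIVATE:
	case MEMORY_DEVICE_COHERENT:
	case MEMORY_DEVICE_FS_DAX:
	case MEMORY_DEVICE_GENERIC:
	case MEMORY_DEVICE_PCI_P2PDMA:
		/* Drop the reference the allocator took at hand-out. */
		put_dev_pagemap(folio->page.pgmap);
		break;
	}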
> + }
> }
>
> void zone_device_page_init(struct page *page)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 4ba5607..0489820 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1015,12 +1015,26 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> }
>
> /*
> - * ZONE_DEVICE pages are released directly to the driver page allocator
> - * which will set the page count to 1 when allocating the page.
> + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
> + * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
> + * allocator which will set the page count to 1 when allocating the
> + * page.
> + *
> + * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> + * their refcount reset to one whenever they are freed (ie. after
> + * their refcount drops to 0).
I'll send some follow-on patches to clean up device-dax.
For this one:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/12] fs/dax: Refactor wait for dax idle page
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
@ 2024-09-22 1:01 ` Dan Williams
0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2024-09-22 1:01 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> A FS DAX page is considered idle when its refcount drops to one. This
> is currently open-coded in all file systems supporting FS DAX. Move
> the idle detection to a common function to make future changes easier.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Looks good to me:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/12] mm: Allow compound zone device pages
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
` (2 preceding siblings ...)
2024-09-12 12:44 ` kernel test robot
@ 2024-09-22 1:01 ` Dan Williams
3 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2024-09-22 1:01 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, Jason Gunthorpe
Alistair Popple wrote:
> Zone device pages are used to represent various type of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
>
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
>
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
>
> A tail page is distinguished by having bit zero being set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
>
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and can be
> moved into the folio and we can use the standard scheme to find
> compound_head from a tail page.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> ---
>
> Changes since v1:
>
> - Move pgmap to the folio as suggested by Matthew Wilcox
> ---
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
> drivers/pci/p2pdma.c | 6 +++---
> include/linux/memremap.h | 6 +++---
> include/linux/migrate.h | 4 ++--
> include/linux/mm_types.h | 9 +++++++--
> include/linux/mmzone.h | 8 +++++++-
> lib/test_hmm.c | 3 ++-
> mm/hmm.c | 2 +-
> mm/memory.c | 4 +++-
> mm/memremap.c | 14 +++++++-------
> mm/migrate_device.c | 7 +++++--
> mm/mm_init.c | 2 +-
> 12 files changed, 43 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 6fb65b0..58d308c 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -88,7 +88,8 @@ struct nouveau_dmem {
>
> static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
> {
> - return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
> + return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk,
page_dev_pagemap() feels like a mouthful. I would be ok with
page_pgmap() since that is the most common identifier for struct
dev_pagemap instances.
> + pagemap);
> }
>
> static struct nouveau_drm *page_to_drm(struct page *page)
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 210b9f4..a58f2c1 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -199,7 +199,7 @@ static const struct attribute_group p2pmem_group = {
>
> static void p2pdma_page_free(struct page *page)
> {
> - struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
> /* safe to dereference while a reference is held to the percpu ref */
> struct pci_p2pdma *p2pdma =
> rcu_dereference_protected(pgmap->provider->p2pdma, 1);
> @@ -1022,8 +1022,8 @@ enum pci_p2pdma_map_type
> pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> struct scatterlist *sg)
> {
> - if (state->pgmap != sg_page(sg)->pgmap) {
> - state->pgmap = sg_page(sg)->pgmap;
> + if (state->pgmap != page_dev_pagemap(sg_page(sg))) {
> + state->pgmap = page_dev_pagemap(sg_page(sg));
> state->map = pci_p2pdma_map_type(state->pgmap, dev);
> state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
> }
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3f7143a..14273e6 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
> {
> return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
> is_zone_device_page(page) &&
> - page->pgmap->type == MEMORY_DEVICE_PRIVATE;
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
> }
>
> static inline bool folio_is_device_private(const struct folio *folio)
> @@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
> {
> return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
> is_zone_device_page(page) &&
> - page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
> }
>
> static inline bool is_device_coherent_page(const struct page *page)
> {
> return is_zone_device_page(page) &&
> - page->pgmap->type == MEMORY_DEVICE_COHERENT;
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
> }
>
> static inline bool folio_is_device_coherent(const struct folio *folio)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 002e49b..9a85a82 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -207,8 +207,8 @@ struct migrate_vma {
> unsigned long end;
>
> /*
> - * Set to the owner value also stored in page->pgmap->owner for
> - * migrating out of device private memory. The flags also need to
> + * Set to the owner value also stored in page_dev_pagemap(page)->owner
> + * for migrating out of device private memory. The flags also need to
> * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
> * The caller should always set this field when using mmu notifier
> * callbacks to avoid device MMU invalidations for device private
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8..c2f1d53 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -129,8 +129,11 @@ struct page {
> unsigned long compound_head; /* Bit zero is set */
> };
> struct { /* ZONE_DEVICE pages */
> - /** @pgmap: Points to the hosting device page map. */
> - struct dev_pagemap *pgmap;
> + /*
> + * The first word is used for compound_head or folio
> + * pgmap
> + */
> + void *_unused;
I would feel better with "_unused_pgmap_compound_head", similar to how
_unused_slab_obj_exts in 'struct folio' indicates the placeholder
contents.
> void *zone_device_data;
> /*
> * ZONE_DEVICE private pages are counted as being
> @@ -299,6 +302,7 @@ typedef struct {
> * @_refcount: Do not access this member directly. Use folio_ref_count()
> * to find how many references there are to this folio.
> * @memcg_data: Memory Control Group data.
> + * @pgmap: Metadata for ZONE_DEVICE mappings
> * @virtual: Virtual address in the kernel direct map.
> * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
> * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> @@ -337,6 +341,7 @@ struct folio {
> /* private: */
> };
> /* public: */
> + struct dev_pagemap *pgmap;
> };
> struct address_space *mapping;
> pgoff_t index;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 17506e4..e191434 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1134,6 +1134,12 @@ static inline bool is_zone_device_page(const struct page *page)
> return page_zonenum(page) == ZONE_DEVICE;
> }
>
> +static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
> +{
> + WARN_ON(!is_zone_device_page(page));
VM_WARN_ON()?
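For example (guessing at the helper body, which is truncated in the
quoted hunk, and folding in the page_pgmap() rename suggested above):
	static inline struct dev_pagemap *page_pgmap(const struct page *page)
	{
		VM_WARN_ON(!is_zone_device_page(page));
		return page_folio(page)->pgmap;
	}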
With the above fixups:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
@ 2024-09-22 1:41 ` Dan Williams
2024-10-01 10:43 ` Gerald Schaefer
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-22 1:41 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, hca, gor, agordeev, borntraeger, svens, linux-s390,
gerald.schaefer
[ add s390 folks to comment on CONFIG_FS_DAX_LIMITED ]
Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
>
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce dax_insert_pfn. This will take a reference on the
> underlying page much the same as vmf_insert_page, except it also
> permits upgrading an existing mapping to be writable if
> requested/possible.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Updates from v1:
>
> - Re-arrange code in insert_page_into_pte_locked() based on comments
> from Jan Kara.
>
> - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> ---
> include/linux/mm.h | 1 +-
> mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 76 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b0ff06d..ae6d713 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3463,6 +3463,7 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> unsigned long num);
> int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> unsigned long num);
> +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write);
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index d2785fb..368e15d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2039,19 +2039,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma,
> }
>
> static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> - unsigned long addr, struct page *page, pgprot_t prot)
> + unsigned long addr, struct page *page,
> + pgprot_t prot, bool mkwrite)
This upgrade of insert_page_into_pte_locked() to handle write faults
deserves to be its own patch with rationale along the lines of:
"In preparation for using insert_page() for DAX, enhance
insert_page_into_pte_locked() to handle establishing writable mappings.
Recall that DAX returns VM_FAULT_NOPAGE after installing a PTE which
bypasses the typical set_pte_range() in finish_fault."
> {
> struct folio *folio = page_folio(page);
> + pte_t entry = ptep_get(pte);
> pte_t pteval;
>
> - if (!pte_none(ptep_get(pte)))
> - return -EBUSY;
> + if (!pte_none(entry)) {
> + if (!mkwrite)
> + return -EBUSY;
> +
> + /*
> + * For read faults on private mappings the PFN passed in may not
> + * match the PFN we have mapped if the mapped PFN is a writeable
> + * COW page. In the mkwrite case we are creating a writable PTE
> + * for a shared mapping and we expect the PFNs to match. If they
> + * don't match, we are likely racing with block allocation and
> + * mapping invalidation so just skip the update.
> + */
> + if (pte_pfn(entry) != page_to_pfn(page)) {
> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
> + return -EFAULT;
> + }
> + entry = maybe_mkwrite(entry, vma);
> + entry = pte_mkyoung(entry);
> + if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> + update_mmu_cache(vma, addr, pte);
I was going to say that this should be creating a shared helper with
insert_pfn(), but on closer look the mkwrite case in insert_pfn() is now
dead code (almost, *grumbles about dcssblk*). So I would just mention
that in the changelog for this standalone patch and then we can follow
on with a cleanup like the patch at the bottom of this mail (untested).
> + return 0;
> + }
> +
> /* Ok, finally just insert the thing.. */
> pteval = mk_pte(page, prot);
> if (unlikely(is_zero_folio(folio))) {
> pteval = pte_mkspecial(pteval);
> } else {
> folio_get(folio);
> + entry = mk_pte(page, prot);
> + if (mkwrite) {
> + entry = pte_mkyoung(entry);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + }
> inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
> folio_add_file_rmap_pte(folio, page, vma);
> }
> @@ -2060,7 +2088,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> }
>
> static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> - struct page *page, pgprot_t prot)
> + struct page *page, pgprot_t prot, bool mkwrite)
> {
> int retval;
> pte_t *pte;
> @@ -2073,7 +2101,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> pte = get_locked_pte(vma->vm_mm, addr, &ptl);
> if (!pte)
> goto out;
> - retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
> + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot,
> + mkwrite);
> pte_unmap_unlock(pte, ptl);
> out:
> return retval;
> @@ -2087,7 +2116,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
> err = validate_page_before_insert(vma, page);
> if (err)
> return err;
> - return insert_page_into_pte_locked(vma, pte, addr, page, prot);
> + return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
> }
>
> /* insert_pages() amortizes the cost of spinlock operations
> @@ -2223,7 +2252,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
> BUG_ON(vma->vm_flags & VM_PFNMAP);
> vm_flags_set(vma, VM_MIXEDMAP);
> }
> - return insert_page(vma, addr, page, vma->vm_page_prot);
> + return insert_page(vma, addr, page, vma->vm_page_prot, false);
> }
> EXPORT_SYMBOL(vm_insert_page);
>
> @@ -2503,7 +2532,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> * result in pfn_t_has_page() == false.
> */
> page = pfn_to_page(pfn_t_to_pfn(pfn));
> - err = insert_page(vma, addr, page, pgprot);
> + err = insert_page(vma, addr, page, pgprot, mkwrite);
> } else {
> return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
> }
> @@ -2516,6 +2545,44 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> return VM_FAULT_NOPAGE;
> }
>
> +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + pgprot_t pgprot = vma->vm_page_prot;
> + unsigned long pfn = pfn_t_to_pfn(pfn_t);
> + struct page *page = pfn_to_page(pfn);
The problem here is that we stubbornly have __dcssblk_direct_access() to
worry about. That is the only dax driver that does not return
pfn_valid() pfns.
In fact, it looks like __dcssblk_direct_access() is the only thing
standing in the way of the removal of pfn_t.
It turns out it has been 3 years since the last time the question of
bringing s390 fully into the ZONE_DEVICE regime was raised:
https://lore.kernel.org/all/20210820210318.187742e8@thinkpad/
Given that this series removes PTE_DEVMAP which was a stumbling block,
would it be feasible to remove CONFIG_FS_DAX_LIMITED for a few kernel
cycles until someone from the s390 side can circle back to add full
ZONE_DEVICE support?
> + unsigned long addr = vmf->address;
> + int err;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + track_pfn_insert(vma, &pgprot, pfn_t);
> +
> + if (!pfn_modify_allowed(pfn, pgprot))
> + return VM_FAULT_SIGBUS;
> +
> + /*
> + * We refcount the page normally so make sure pfn_valid is true.
> + */
> + if (!pfn_t_valid(pfn_t))
> + return VM_FAULT_SIGBUS;
> +
> + WARN_ON_ONCE(pfn_t_devmap(pfn_t));
> +
> + if (WARN_ON(is_zero_pfn(pfn) && write))
> + return VM_FAULT_SIGBUS;
> +
> + err = insert_page(vma, addr, page, pgprot, write);
> + if (err == -ENOMEM)
> + return VM_FAULT_OOM;
> + if (err < 0 && err != -EBUSY)
> + return VM_FAULT_SIGBUS;
> +
> + return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_insert_pfn);
With insert_page_into_pte_locked() split out into its own patch and the
dcssblk issue resolved to kill that special case, this patch looks good
to me.
> +
> vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> pfn_t pfn)
> {
> --
> git-series 0.9.1
-- >8 --
Subject: mm: Remove vmf_insert_mixed_mkwrite()
From: Dan Williams <dan.j.williams@intel.com>
Now that fsdax has switched to dax_insert_pfn() which uses
insert_page_into_pte_locked() internally, there are no more callers of
vmf_insert_mixed_mkwrite(). This also reveals that all remaining callers
of insert_pfn() never set @mkrite to true, so also cleanup insert_pfn()'s
@mkwrite argument.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
include/linux/mm.h | 2 --
mm/memory.c | 60 +++++++---------------------------------------------
2 files changed, 8 insertions(+), 54 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5976276d4494..d9517e109ac3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3444,8 +3444,6 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, pgprot_t pgprot);
vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn);
-vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
- unsigned long addr, pfn_t pfn);
int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 000873596672..80b07dbd8304 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2327,7 +2327,7 @@ int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
EXPORT_SYMBOL(vm_map_pages_zero);
static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
- pfn_t pfn, pgprot_t prot, bool mkwrite)
+ pfn_t pfn, pgprot_t prot)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, entry;
@@ -2337,38 +2337,12 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
if (!pte)
return VM_FAULT_OOM;
entry = ptep_get(pte);
- if (!pte_none(entry)) {
- if (mkwrite) {
- /*
- * For read faults on private mappings the PFN passed
- * in may not match the PFN we have mapped if the
- * mapped PFN is a writeable COW page. In the mkwrite
- * case we are creating a writable PTE for a shared
- * mapping and we expect the PFNs to match. If they
- * don't match, we are likely racing with block
- * allocation and mapping invalidation so just skip the
- * update.
- */
- if (pte_pfn(entry) != pfn_t_to_pfn(pfn)) {
- WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
- goto out_unlock;
- }
- entry = pte_mkyoung(entry);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, addr, pte, entry, 1))
- update_mmu_cache(vma, addr, pte);
- }
+ if (!pte_none(entry))
goto out_unlock;
- }
/* Ok, finally just insert the thing.. */
entry = pte_mkspecial(pfn_t_pte(pfn, prot));
- if (mkwrite) {
- entry = pte_mkyoung(entry);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- }
-
set_pte_at(mm, addr, pte, entry);
update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
@@ -2433,8 +2407,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
- return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
- false);
+ return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
}
EXPORT_SYMBOL(vmf_insert_pfn_prot);
@@ -2480,8 +2453,8 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite)
return false;
}
-static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
- unsigned long addr, pfn_t pfn, bool mkwrite)
+vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ pfn_t pfn)
{
pgprot_t pgprot = vma->vm_page_prot;
int err;
@@ -2513,9 +2486,9 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* result in pfn_t_has_page() == false.
*/
page = pfn_to_page(pfn_t_to_pfn(pfn));
- err = insert_page(vma, addr, page, pgprot, mkwrite);
+ err = insert_page(vma, addr, page, pgprot, false);
} else {
- return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
+ return insert_pfn(vma, addr, pfn, pgprot);
}
if (err == -ENOMEM)
@@ -2525,6 +2498,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
return VM_FAULT_NOPAGE;
}
+EXPORT_SYMBOL(vmf_insert_mixed);
vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
{
@@ -2562,24 +2536,6 @@ vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
}
EXPORT_SYMBOL_GPL(dax_insert_pfn);
-vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
- pfn_t pfn)
-{
- return __vm_insert_mixed(vma, addr, pfn, false);
-}
-EXPORT_SYMBOL(vmf_insert_mixed);
-
-/*
- * If the insertion of PTE failed because someone else already added a
- * different entry in the mean time, we treat that as success as we assume
- * the same entry was actually inserted.
- */
-vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
- unsigned long addr, pfn_t pfn)
-{
- return __vm_insert_mixed(vma, addr, pfn, true);
-}
-
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
@ 2024-09-22 2:07 ` Dan Williams
2024-10-14 6:33 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-22 2:07 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> simply inserts a special devmap PUD entry into the page table without
> holding a reference to the page for the mapping.
This is missing some description or comment in the code about the
differences. More questions below:
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> include/linux/huge_mm.h | 4 ++-
> include/linux/rmap.h | 15 +++++++-
> mm/huge_memory.c | 93 ++++++++++++++++++++++++++++++++++++------
> mm/rmap.c | 49 ++++++++++++++++++++++-
> 4 files changed, 149 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 6370026..d3a1872 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>
> enum transparent_hugepage_flag {
> TRANSPARENT_HUGEPAGE_UNSUPPORTED,
> @@ -114,6 +115,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
>
> +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> extern unsigned long transparent_hugepage_flags;
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935..c465694 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
> enum rmap_level {
> RMAP_LEVEL_PTE = 0,
> RMAP_LEVEL_PMD,
> + RMAP_LEVEL_PUD,
> };
>
> static inline void __folio_rmap_sanity_checks(struct folio *folio,
> @@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
> VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
> VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
> break;
> + case RMAP_LEVEL_PUD:
> + /*
> + * Assume that we are creating a single "entire" mapping of the
> + * folio.
> + */
> + VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
> + VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
> + break;
> default:
> VM_WARN_ON_ONCE(true);
> }
> @@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> folio_add_file_rmap_ptes(folio, page, 1, vma)
> void folio_add_file_rmap_pmd(struct folio *, struct page *,
> struct vm_area_struct *);
> +void folio_add_file_rmap_pud(struct folio *, struct page *,
> + struct vm_area_struct *);
> void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
> struct vm_area_struct *);
> #define folio_remove_rmap_pte(folio, page, vma) \
> folio_remove_rmap_ptes(folio, page, 1, vma)
> void folio_remove_rmap_pmd(struct folio *, struct page *,
> struct vm_area_struct *);
> +void folio_remove_rmap_pud(struct folio *, struct page *,
> + struct vm_area_struct *);
>
> void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
> unsigned long address, rmap_t flags);
> @@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
> atomic_add(orig_nr_pages, &folio->_large_mapcount);
> break;
> case RMAP_LEVEL_PMD:
> + case RMAP_LEVEL_PUD:
> atomic_inc(&folio->_entire_mapcount);
> atomic_inc(&folio->_large_mapcount);
> break;
> @@ -437,6 +451,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
> atomic_add(orig_nr_pages, &folio->_large_mapcount);
> break;
> case RMAP_LEVEL_PMD:
> + case RMAP_LEVEL_PUD:
> if (PageAnonExclusive(page)) {
> if (unlikely(maybe_pinned))
> return -EBUSY;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4b45ad..e8985a4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1336,21 +1336,19 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> struct mm_struct *mm = vma->vm_mm;
> pgprot_t prot = vma->vm_page_prot;
> pud_t entry;
> - spinlock_t *ptl;
>
> - ptl = pud_lock(mm, pud);
> if (!pud_none(*pud)) {
> if (write) {
> if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
> WARN_ON_ONCE(!is_huge_zero_pud(*pud));
> - goto out_unlock;
> + return;
> }
> entry = pud_mkyoung(*pud);
> entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
> if (pudp_set_access_flags(vma, addr, pud, entry, 1))
> update_mmu_cache_pud(vma, addr, pud);
> }
> - goto out_unlock;
> + return;
> }
>
> entry = pud_mkhuge(pfn_t_pud(pfn, prot));
> @@ -1362,9 +1360,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> }
> set_pud_at(mm, addr, pud, entry);
> update_mmu_cache_pud(vma, addr, pud);
> -
> -out_unlock:
> - spin_unlock(ptl);
> }
>
> /**
> @@ -1382,6 +1377,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
> unsigned long addr = vmf->address & PUD_MASK;
> struct vm_area_struct *vma = vmf->vma;
> pgprot_t pgprot = vma->vm_page_prot;
> + spinlock_t *ptl;
>
> /*
> * If we had pud_special, we could avoid all these restrictions,
> @@ -1399,10 +1395,52 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>
> track_pfn_insert(vma, &pgprot, pfn);
>
> + ptl = pud_lock(vma->vm_mm, vmf->pud);
> insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> + spin_unlock(ptl);
> +
> return VM_FAULT_NOPAGE;
> }
> EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
> +
> +/**
> + * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
> + * @vmf: Structure describing the fault
> + * @pfn: pfn of the page to insert
> + * @write: whether it's a write fault
It strikes me that this documentation is not useful for recalling why
both vmf_insert_pfn_pud() and dax_insert_pfn_pud() exist. It looks like
the only difference is that the "dax_" flavor takes a reference on the
page. So maybe all these dax_insert_pfn{,_pmd,_pud} helpers should be
unified in a common vmf_insert_page() entry point where the caller is
responsible for initializing the compound page metadata before calling
the helper?
> + *
> + * Return: vm_fault_t value.
> + */
> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long addr = vmf->address & PUD_MASK;
> + pud_t *pud = vmf->pud;
> + pgprot_t prot = vma->vm_page_prot;
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> + struct folio *folio;
> + struct page *page;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + track_pfn_insert(vma, &prot, pfn);
> +
> + ptl = pud_lock(mm, pud);
> + if (pud_none(*vmf->pud)) {
> + page = pfn_t_to_page(pfn);
> + folio = page_folio(page);
> + folio_get(folio);
> + folio_add_file_rmap_pud(folio, page, vma);
> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> + }
> + insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> + spin_unlock(ptl);
> +
> + return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> @@ -1947,7 +1985,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> zap_deposited_table(tlb->mm, pmd);
> spin_unlock(ptl);
> } else if (is_huge_zero_pmd(orig_pmd)) {
> - zap_deposited_table(tlb->mm, pmd);
> + if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
> + zap_deposited_table(tlb->mm, pmd);
This looks subtle to me. Why is it needed to skip zap_deposited_table()
(I assume it is some THP assumption about the page being from the page
allocator)? Why is it ok to force the zap if the arch demands it?
> spin_unlock(ptl);
> } else {
> struct folio *folio = NULL;
> @@ -2435,12 +2474,24 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
> arch_check_zapped_pud(vma, orig_pud);
> tlb_remove_pud_tlb_entry(tlb, pud, addr);
> - if (vma_is_special_huge(vma)) {
> + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
If vma_is_special_huge() is true then vma_is_dax() will always be false, so
it's not clear to me why this check is combined?
> spin_unlock(ptl);
> /* No zero page support yet */
> } else {
> - /* No support for anonymous PUD pages yet */
> - BUG();
> + struct page *page = NULL;
> + struct folio *folio;
> +
> + /* No support for anonymous PUD pages or migration yet */
> + BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
> +
> + page = pud_page(orig_pud);
> + folio = page_folio(page);
> + folio_remove_rmap_pud(folio, page, vma);
> + VM_BUG_ON_PAGE(!PageHead(page), page);
> + add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
> +
> + spin_unlock(ptl);
> + tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
> }
> return 1;
> }
> @@ -2448,6 +2499,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> unsigned long haddr)
> {
> + pud_t old_pud;
> +
> VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
> @@ -2455,7 +2508,23 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>
> count_vm_event(THP_SPLIT_PUD);
>
> - pudp_huge_clear_flush(vma, haddr, pud);
> + old_pud = pudp_huge_clear_flush(vma, haddr, pud);
> + if (is_huge_zero_pud(old_pud))
> + return;
> +
> + if (vma_is_dax(vma)) {
> + struct page *page = pud_page(old_pud);
> + struct folio *folio = page_folio(page);
> +
> + if (!folio_test_dirty(folio) && pud_dirty(old_pud))
> + folio_mark_dirty(folio);
> + if (!folio_test_referenced(folio) && pud_young(old_pud))
> + folio_set_referenced(folio);
> + folio_remove_rmap_pud(folio, page, vma);
> + folio_put(folio);
> + add_mm_counter(vma->vm_mm, mm_counter_file(folio),
> + -HPAGE_PUD_NR);
> + }
So this does not split anything (no follow-on set_ptes()); it just clears
and updates some folio metadata. Something is wrong if we get this far
since the only dax mechanism that attempts PUD mappings is device-dax,
and device-dax is not prepared for PUD mappings to be fractured.
Peter Xu recently fixed mprotect() vs DAX PUD mappings; I need to check
how that interacts with this.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2024-09-25 0:17 ` Dan Williams
2024-09-27 2:52 ` Dan Williams
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-25 0:17 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> Longterm pinning of FS DAX pages should already be disallowed by
> various pXX_devmap checks. However a future change will cause these
> checks to be invalid for FS DAX pages so make
> folio_is_longterm_pinnable() return false for FS DAX pages.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> include/linux/memremap.h | 11 +++++++++++
> include/linux/mm.h | 4 ++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 14273e6..6a1406a 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -187,6 +187,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
> return is_device_coherent_page(&folio->page);
> }
>
> +static inline bool is_device_dax_page(const struct page *page)
> +{
> + return is_zone_device_page(page) &&
> + page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
> +}
> +
> +static inline bool folio_is_device_dax(const struct folio *folio)
> +{
> + return is_device_dax_page(&folio->page);
> +}
> +
> #ifdef CONFIG_ZONE_DEVICE
> void zone_device_page_init(struct page *page);
> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ae6d713..935e493 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1989,6 +1989,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
> if (folio_is_device_coherent(folio))
> return false;
>
> + /* DAX must also always allow eviction. */
> + if (folio_is_device_dax(folio))
Why is this called "folio_is_device_dax()" when the check is for fsdax?
I would expect:
if (folio_is_fsdax(folio))
return false;
...and s/device_dax/fsdax/ for the rest of the helpers.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
@ 2024-09-27 2:48 ` Dan Williams
2024-10-14 6:53 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-27 2:48 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce dax_insert_pfn_pmd. This will map the entire PMD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
> simply inserts a special devmap PMD entry into the page table without
> holding a reference to the page for the mapping.
It would be useful to mention the rationale for the locking changes and
your understanding of the new "pgtable deposit" handling, because those
things make this not a trivial conversion.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> include/linux/huge_mm.h | 1 +-
> mm/huge_memory.c | 57 ++++++++++++++++++++++++++++++++++--------
> 2 files changed, 48 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index d3a1872..eaf3f78 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>
> enum transparent_hugepage_flag {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e8985a4..790041e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1237,14 +1237,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> {
> struct mm_struct *mm = vma->vm_mm;
> pmd_t entry;
> - spinlock_t *ptl;
>
> - ptl = pmd_lock(mm, pmd);
> if (!pmd_none(*pmd)) {
> if (write) {
> if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
> WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
> - goto out_unlock;
> + return;
> }
> entry = pmd_mkyoung(*pmd);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -1252,7 +1250,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> update_mmu_cache_pmd(vma, addr, pmd);
> }
>
> - goto out_unlock;
> + return;
> }
>
> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> @@ -1271,11 +1269,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>
> set_pmd_at(mm, addr, pmd, entry);
> update_mmu_cache_pmd(vma, addr, pmd);
> -
> -out_unlock:
> - spin_unlock(ptl);
> - if (pgtable)
> - pte_free(mm, pgtable);
> }
>
> /**
> @@ -1294,6 +1287,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> struct vm_area_struct *vma = vmf->vma;
> pgprot_t pgprot = vma->vm_page_prot;
> pgtable_t pgtable = NULL;
> + spinlock_t *ptl;
>
> /*
> * If we had pmd_special, we could avoid all these restrictions,
> @@ -1316,12 +1310,55 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> }
>
> track_pfn_insert(vma, &pgprot, pfn);
> -
> + ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
> + spin_unlock(ptl);
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> +
> return VM_FAULT_NOPAGE;
> }
> EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>
> +vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long addr = vmf->address & PMD_MASK;
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> + pgtable_t pgtable = NULL;
> + struct folio *folio;
> + struct page *page;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
> + track_pfn_insert(vma, &vma->vm_page_prot, pfn);
> +
> + ptl = pmd_lock(mm, vmf->pmd);
> + if (pmd_none(*vmf->pmd)) {
> + page = pfn_t_to_page(pfn);
> + folio = page_folio(page);
> + folio_get(folio);
> + folio_add_file_rmap_pmd(folio, page, vma);
> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> + }
> + insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
> + write, pgtable);
> + spin_unlock(ptl);
> + if (pgtable)
> + pte_free(mm, pgtable);
Aren't the deposit rules that the extra page table sticks around for the
lifetime of the inserted pte? So would that not require this incremental
change?
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ea65c2db2bb1..5ef1e5d21a96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1232,7 +1232,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned long pfn, pgprot_t prot,
- bool write, pgtable_t pgtable)
+ bool write, pgtable_t *pgtable)
{
struct mm_struct *mm = vma->vm_mm;
pmd_t entry;
@@ -1258,10 +1258,10 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
entry = maybe_pmd_mkwrite(entry, vma);
}
- if (pgtable) {
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ if (*pgtable) {
+ pgtable_trans_huge_deposit(mm, pmd, *pgtable);
mm_inc_nr_ptes(mm);
- pgtable = NULL;
+ *pgtable = NULL;
}
set_pmd_at(mm, addr, pmd, entry);
@@ -1306,7 +1306,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, bool writ
track_pfn_insert(vma, &pgprot, pfn);
ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
+ insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, &pgtable);
spin_unlock(ptl);
if (pgtable)
pte_free(vma->vm_mm, pgtable);
@@ -1344,8 +1344,8 @@ vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, bool writ
folio_add_file_rmap_pmd(folio, page, vma);
add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
}
- insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
- write, pgtable);
+ insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot, write,
+ &pgtable);
spin_unlock(ptl);
if (pgtable)
pte_free(mm, pgtable);
---
Along these lines it would be lovely if someone from the PowerPC side
could test these changes, or if someone has a canned qemu command line
to test radix vs hash with pmem+dax that they can share?
> +
> + return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
Like I mentioned before, let's make the exported function
vmf_insert_folio() and move the pte, pmd and pud variants into private /
static details of the implementation. The "dax_" specific aspect of this was
removed at the conversion of a dax_pfn to a folio.
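For illustration, a rough sketch of what that unified entry point could
look like; the name vmf_insert_folio(), the dispatch on folio_order() and
the VM_FAULT_FALLBACK default are illustrative guesses, not code from
this series:

vm_fault_t vmf_insert_folio(struct vm_fault *vmf, struct folio *folio,
		bool write)
{
	pfn_t pfn = pfn_to_pfn_t(folio_pfn(folio));

	/* Single page: plain PTE insertion */
	if (folio_order(folio) == 0)
		return dax_insert_pfn(vmf, pfn, write);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (folio_order(folio) == HPAGE_PMD_ORDER)
		return dax_insert_pfn_pmd(vmf, pfn, write);
#endif
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
	if (folio_order(folio) == HPAGE_PUD_ORDER)
		return dax_insert_pfn_pud(vmf, pfn, write);
#endif
	return VM_FAULT_FALLBACK;
}

The caller would be responsible for setting up the compound page
metadata before calling it, as suggested for the PUD patch.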
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2024-09-25 0:17 ` Dan Williams
@ 2024-09-27 2:52 ` Dan Williams
2024-10-14 7:03 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-27 2:52 UTC (permalink / raw)
To: Dan Williams, Alistair Popple, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Dan Williams wrote:
> Alistair Popple wrote:
> > Longterm pinning of FS DAX pages should already be disallowed by
> > various pXX_devmap checks. However a future change will cause these
> > checks to be invalid for FS DAX pages so make
> > folio_is_longterm_pinnable() return false for FS DAX pages.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> > include/linux/memremap.h | 11 +++++++++++
> > include/linux/mm.h | 4 ++++
> > 2 files changed, 15 insertions(+)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 14273e6..6a1406a 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -187,6 +187,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
> > return is_device_coherent_page(&folio->page);
> > }
> >
> > +static inline bool is_device_dax_page(const struct page *page)
> > +{
> > + return is_zone_device_page(page) &&
> > + page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
> > +}
> > +
> > +static inline bool folio_is_device_dax(const struct folio *folio)
> > +{
> > + return is_device_dax_page(&folio->page);
> > +}
> > +
> > #ifdef CONFIG_ZONE_DEVICE
> > void zone_device_page_init(struct page *page);
> > void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ae6d713..935e493 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1989,6 +1989,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
> > if (folio_is_device_coherent(folio))
> > return false;
> >
> > + /* DAX must also always allow eviction. */
> > + if (folio_is_device_dax(folio))
>
> Why is this called "folio_is_device_dax()" when the check is for fsdax?
>
> I would expect:
>
> if (folio_is_fsdax(folio))
> return false;
>
> ...and s/device_dax/fsdax/ for the rest of the helpers.
Specifically devdax is ok to allow longterm pinning since it is
statically allocated. fsdax is the only ZONE_DEVICE mode where there is
a higher-level allocator that does not support a 3rd party blocking its
operations indefinitely with a pin. So this needs to be explicit for
that case.
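For clarity, the rename being asked for amounts to something like the
below, mirroring the helpers in the quoted hunk (exact placement in
memremap.h is illustrative):

static inline bool is_fsdax_page(const struct page *page)
{
	return is_zone_device_page(page) &&
	       page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
}

static inline bool folio_is_fsdax(const struct folio *folio)
{
	return is_fsdax_page(&folio->page);
}

Then folio_is_longterm_pinnable() checks folio_is_fsdax() and devdax
folios remain long-term pinnable.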
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
@ 2024-09-27 7:15 ` Dan Williams
2024-10-14 7:16 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-27 7:15 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> Currently if a PTE points to a FS DAX page vm_normal_page() will
> return NULL as these have their own special refcounting scheme. A
> future change will allow FS DAX pages to be refcounted the same as any
> other normal page.
>
> Therefore vm_normal_page() will start returning FS DAX pages. To avoid
> any change in behaviour callers that don't expect FS DAX pages will
> need to explicitly check for this. As vm_normal_page() can already
> return ZONE_DEVICE pages most callers already include a check for any
> ZONE_DEVICE page.
>
> However some callers don't, so add explicit checks where required.
I would expect justification for each of these conversions, and
hopefully with fsdax returning fully formed folios there is less need to
sprinkle these checks around.
At a minimum I think this patch needs to be broken up by file touched.
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> arch/x86/mm/pat/memtype.c | 4 +++-
> fs/proc/task_mmu.c | 16 ++++++++++++----
> mm/memcontrol-v1.c | 2 +-
> 3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 1fa0bf6..eb84593 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -951,6 +951,7 @@ static void free_pfn_range(u64 paddr, unsigned long size)
> static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
> resource_size_t *phys)
> {
> + struct folio *folio;
> pte_t *ptep, pte;
> spinlock_t *ptl;
>
> @@ -960,7 +961,8 @@ static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
> pte = ptep_get(ptep);
>
> /* Never return PFNs of anon folios in COW mappings. */
> - if (vm_normal_folio(vma, vma->vm_start, pte)) {
> + folio = vm_normal_folio(vma, vma->vm_start, pte);
> + if (folio || (folio && !folio_is_device_dax(folio))) {
...for example, I do not immediately see why follow_phys() would need to
be careful with fsdax pages?
...but I do see why copy_page_range() (which calls follow_phys() through
track_pfn_copy()) might care. It just turns out that vma_needs_copy(),
afaics, bypasses dax MAP_SHARED mappings.
So this touch of memtype.c looks like it can be dropped.
> pte_unmap_unlock(ptep, ptl);
> return -EINVAL;
> }
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 5f171ad..456b010 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -816,6 +816,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>
> if (pte_present(ptent)) {
> page = vm_normal_page(vma, addr, ptent);
> + if (page && is_device_dax_page(page))
> + page = NULL;
> young = pte_young(ptent);
> dirty = pte_dirty(ptent);
> present = true;
> @@ -864,6 +866,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>
> if (pmd_present(*pmd)) {
> page = vm_normal_page_pmd(vma, addr, *pmd);
> + if (page && is_device_dax_page(page))
> + page = NULL;
> present = true;
> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(*pmd);
The above can be replaced with a catch like
if (folio_test_device(folio))
return;
...in smaps_account() since ZONE_DEVICE pages are not suitable to
account as they do not reflect any memory pressure on the system memory
pool.
> @@ -1385,7 +1389,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
> if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
> return false;
> folio = vm_normal_folio(vma, addr, pte);
> - if (!folio)
> + if (!folio || folio_is_device_dax(folio))
> return false;
> return folio_maybe_dma_pinned(folio);
The whole point of ZONE_DEVICE is to account for DMA so I see no reason
for pte_is_pinned() to special case dax. The caller of pte_is_pinned()
is doing it for soft_dirty reasons, and I believe soft_dirty is already
disabled for vma_is_dax(). I assume MEMORY_DEVICE_PRIVATE also does not
support soft-dirty, so I expect all ZONE_DEVICE already opt-out of this.
> }
> @@ -1710,6 +1714,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> frame = pte_pfn(pte);
> flags |= PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> + if (page && is_device_dax_page(page))
> + page = NULL;
> if (pte_soft_dirty(pte))
> flags |= PM_SOFT_DIRTY;
> if (pte_uffd_wp(pte))
> @@ -2096,7 +2102,8 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
>
> if (p->masks_of_interest & PAGE_IS_FILE) {
> page = vm_normal_page(vma, addr, pte);
> - if (page && !PageAnon(page))
> + if (page && !PageAnon(page) &&
> + !is_device_dax_page(page))
> categories |= PAGE_IS_FILE;
> }
>
> @@ -2158,7 +2165,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
>
> if (p->masks_of_interest & PAGE_IS_FILE) {
> page = vm_normal_page_pmd(vma, addr, pmd);
> - if (page && !PageAnon(page))
> + if (page && !PageAnon(page) &&
> + !is_device_dax_page(page))
> categories |= PAGE_IS_FILE;
> }
>
> @@ -2919,7 +2927,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
> return NULL;
>
> page = vm_normal_page_pmd(vma, addr, pmd);
> - if (!page)
> + if (!page || is_device_dax_page(page))
> return NULL;
I am not immediately seeing a reason to block pagemap_read() from
interrogating dax-backed virtual mappings. I think these protections can
be dropped.
>
> if (PageReserved(page))
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index b37c0d8..e16053c 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -667,7 +667,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> {
> struct page *page = vm_normal_page(vma, addr, ptent);
>
> - if (!page)
> + if (!page || is_device_dax_page(page))
> return NULL;
> if (PageAnon(page)) {
> if (!(mc.flags & MOVE_ANON))
I think this better handled with something like this to disable all
memcg accounting for ZONE_DEVICE pages:
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index b37c0d870816..cfc43e8c59fe 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -940,8 +940,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
*/
if (folio_memcg(folio) == mc.from) {
ret = MC_TARGET_PAGE;
- if (folio_is_device_private(folio) ||
- folio_is_device_coherent(folio))
+ if (folio_is_device(folio))
ret = MC_TARGET_DEVICE;
if (target)
target->folio = folio;
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2024-09-27 7:59 ` Dan Williams
2024-10-24 7:52 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-09-27 7:59 UTC (permalink / raw)
To: Alistair Popple, dan.j.williams, linux-mm
Cc: Alistair Popple, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Alistair Popple wrote:
> Currently fs dax pages are considered free when the refcount drops to
> one and their refcounts are not increased when mapped via PTEs or
> decreased when unmapped. This requires special logic in mm paths to
> detect that these pages should not be properly refcounted, and to
> detect when the refcount drops to one instead of zero.
>
> On the other hand get_user_pages(), etc. will properly refcount fs dax
> pages by taking a reference and dropping it when the page is
> unpinned.
>
> Tracking this special behaviour requires extra PTE bits
> (eg. pte_devmap) and introduces rules that are potentially confusing
> and specific to FS DAX pages. To fix this, and to possibly allow
> removal of the special PTE bits in future, convert the fs dax page
> refcounts to be zero based and instead take a reference on the page
> each time it is mapped as is currently the case for normal pages.
>
> This may also allow a future clean-up to remove the pgmap refcounting
> that is currently done in mm/gup.c.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> drivers/dax/device.c | 12 +-
> drivers/dax/super.c | 2 +-
> drivers/nvdimm/pmem.c | 4 +-
> fs/dax.c | 192 ++++++++++++++++++--------------------
> fs/fuse/virtio_fs.c | 3 +-
> include/linux/dax.h | 6 +-
> include/linux/mm.h | 27 +-----
> include/linux/page-flags.h | 6 +-
> mm/gup.c | 9 +--
> mm/huge_memory.c | 6 +-
> mm/internal.h | 2 +-
> mm/memory-failure.c | 6 +-
> mm/memory.c | 6 +-
> mm/memremap.c | 40 +++-----
> mm/mlock.c | 2 +-
> mm/mm_init.c | 9 +--
> mm/swap.c | 2 +-
> 17 files changed, 143 insertions(+), 191 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 9c1a729..4d3ddd1 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -126,11 +126,11 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
> return VM_FAULT_SIGBUS;
> }
>
> - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> + pfn = phys_to_pfn_t(phys, 0);
BTW, this is part of what prompted me to do the pfn_t cleanup [1] that I
will rebase on top of your series:
[1]: http://lore.kernel.org/66f34a9caeb97_2a7f294fa@dwillia2-xfh.jf.intel.com.notmuch
[..]
> @@ -318,85 +323,58 @@ static unsigned long dax_end_pfn(void *entry)
> */
> #define for_each_mapped_pfn(entry, pfn) \
> for (pfn = dax_to_pfn(entry); \
> - pfn < dax_end_pfn(entry); pfn++)
> + pfn < dax_end_pfn(entry); pfn++)
>
> -static inline bool dax_page_is_shared(struct page *page)
> +static void dax_device_folio_init(struct folio *folio, int order)
> {
> - return page->mapping == PAGE_MAPPING_DAX_SHARED;
> -}
> + int orig_order = folio_order(folio);
> + int i;
>
> -/*
> - * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> - * refcount.
> - */
> -static inline void dax_page_share_get(struct page *page)
> -{
> - if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
> - /*
> - * Reset the index if the page was already mapped
> - * regularly before.
> - */
> - if (page->mapping)
> - page->share = 1;
> - page->mapping = PAGE_MAPPING_DAX_SHARED;
> - }
> - page->share++;
> -}
> + if (orig_order != order) {
> + struct dev_pagemap *pgmap = page_dev_pagemap(&folio->page);
Was there a discussion I missed about why the conversion to typical
folios allows the page->share accounting to be dropped?
I assume this is because the page->mapping validation was dropped, which
I think might be useful to keep at least for one development cycle to
make sure this conversion is not triggering any of the old warnings.
Otherwise, the ->share field of 'struct page' can also be cleaned up.
> -static inline unsigned long dax_page_share_put(struct page *page)
> -{
> - return --page->share;
> -}
> + for (i = 0; i < (1UL << orig_order); i++) {
> + struct page *page = folio_page(folio, i);
>
> -/*
> - * When it is called in dax_insert_entry(), the shared flag will indicate that
> - * whether this entry is shared by multiple files. If so, set the page->mapping
> - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> - */
> -static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool shared)
> -{
> - unsigned long size = dax_entry_size(entry), pfn, index;
> - int i = 0;
> + ClearPageHead(page);
> + clear_compound_head(page);
>
> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> - return;
> -
> - index = linear_page_index(vma, address & ~(size - 1));
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> + /*
> + * Reset pgmap which was over-written by
> + * prep_compound_page().
> + */
> + page_folio(page)->pgmap = pgmap;
>
> - if (shared) {
> - dax_page_share_get(page);
> - } else {
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> + /* Make sure this isn't set to TAIL_MAPPING */
> + page->mapping = NULL;
> }
> }
> +
> + if (order > 0) {
> + prep_compound_page(&folio->page, order);
> + if (order > 1)
> + INIT_LIST_HEAD(&folio->_deferred_list);
> + }
> }
>
> -static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> - bool trunc)
> +static void dax_associate_new_entry(void *entry, struct address_space *mapping,
> + pgoff_t index)
Let's call this dax_create_folio(), to mirror filemap_create_folio() and
have it transition the folio refcount from 0 to 1 to indicate that it is
allocated.
While I am not sure anything requires that, it seems odd that page cache
pages have an elevated refcount at map time and dax pages do not.
It does have implications for the dax dma-idle tracking though, see
below.
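For illustration, a dax_create_folio() along those lines might look
roughly like this; the name and signature follow the suggestion above,
not the posted series, and the helpers it calls are the ones from the
quoted hunks:

static struct folio *dax_create_folio(void *entry,
		struct address_space *mapping, pgoff_t index)
{
	struct folio *folio = dax_to_folio(entry);

	dax_device_folio_init(folio, dax_entry_order(entry));
	folio->mapping = mapping;
	folio->index = index;
	/* 0 -> 1: the mapping entry now owns a reference, just like the
	 * page cache owns one for its folios. */
	folio_set_count(folio, 1);

	return folio;
}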
> {
> - unsigned long pfn;
> + unsigned long order = dax_entry_order(entry);
> + struct folio *folio = dax_to_folio(entry);
>
> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> + if (!dax_entry_size(entry))
> return;
>
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_page_is_shared(page)) {
> - /* keep the shared flag if this page is still shared */
> - if (dax_page_share_put(page) > 0)
> - continue;
> - } else
> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> - page->mapping = NULL;
> - page->index = 0;
> - }
> + /*
> + * We don't hold a reference for the DAX pagecache entry for the
> + * page. But we need to initialise the folio so we can hand it
> + * out. Nothing else should have a reference either.
> + */
> + WARN_ON_ONCE(folio_ref_count(folio));
Per above I would feel more comfortable if we kept the paranoia around
to ensure that all the pages in this folio have dropped all references
and cleared ->mapping and ->index.
That paranoia can be placed behind a CONFIG_DEBUG_VM check, and we can
delete in a follow-on development cycle, but in the meantime it helps to
prove the correctness of the conversion.
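Something like the below would be enough to keep that paranoia around
for a cycle; the helper is hypothetical and the checks simply mirror the
ones being removed:

static void dax_folio_check_free(struct folio *folio)
{
#ifdef CONFIG_DEBUG_VM
	long i;

	VM_WARN_ON_FOLIO(folio_ref_count(folio), folio);
	for (i = 0; i < folio_nr_pages(folio); i++) {
		struct page *page = folio_page(folio, i);

		VM_WARN_ON_ONCE(page->mapping);
		VM_WARN_ON_ONCE(page->index);
	}
#endif
}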
[..]
> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> struct inode *inode = iter->inode;
> unsigned long vaddr = vmf->address;
> pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
> + struct page *page = pfn_t_to_page(pfn);
> vm_fault_t ret;
>
> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>
> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> + page_ref_inc(page);
> + ret = dax_insert_pfn(vmf, pfn, false);
> + put_page(page);
Per above I think it is problematic to have pages live in the system
without a refcount.
One scenario where this might be needed is invalidate_inode_pages() vs
DMA. The invalidation should pause and wait for DMA pins to be dropped
before the mapping xarray is cleaned up and the dax folio is marked
free.
I think this may be a gap in the current code. I'll attempt to write a
test for this to check.
[..]
> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> bool write = iter->flags & IOMAP_WRITE;
> unsigned long entry_flags = pmd ? DAX_PMD : 0;
> - int err = 0;
> + int ret, err = 0;
> pfn_t pfn;
> void *kaddr;
> + struct page *page;
>
> if (!pmd && vmf->cow_page)
> return dax_fault_cow_page(vmf, iter);
> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> if (dax_fault_is_synchronous(iter, vmf->vma))
> return dax_fault_synchronous_pfnp(pfnp, pfn);
>
> - /* insert PMD pfn */
> + page = pfn_t_to_page(pfn);
I think this is clearer if dax_insert_entry() returns folios with an
elevated reference count that is dropped when the folio is invalidated
out of the mapping.
[..]
> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> lock_page(page);
> }
> EXPORT_SYMBOL_GPL(zone_device_page_init);
> -
> -#ifdef CONFIG_FS_DAX
> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> -{
> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> - return false;
> -
> - /*
> - * fsdax page refcounts are 1-based, rather than 0-based: if
> - * refcount is 1, then the page is free and the refcount is
> - * stable because nobody holds a reference on the page.
> - */
> - if (folio_ref_sub_return(folio, refs) == 1)
> - wake_up_var(&folio->_refcount);
> - return true;
It follows from the refcount discussion above that I think there is an
argument to still keep this wakeup based on the 2->1 transition.
Pagecache pages are refcount==1 when they are dma-idle but still
allocated. To keep the same semantics for dax, a dax_folio would have an
elevated refcount whenever it is referenced by a mapping entry.
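A minimal sketch of keeping that wakeup, assuming the mapping entry
holds one reference as argued above (the helper name is illustrative):

static void dax_folio_put_refs(struct folio *folio, int refs)
{
	/* 2 -> 1: the last GUP/DMA reference is gone while the mapping
	 * entry still holds its reference, so waiters blocked on the
	 * refcount can now make progress. */
	if (folio_ref_sub_return(folio, refs) == 1)
		wake_up_var(&folio->_refcount);
}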
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 11/12] mm: Remove pXX_devmap callers
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
@ 2024-09-27 12:29 ` Alexander Gordeev
2024-10-14 7:14 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Alexander Gordeev @ 2024-09-27 12:29 UTC (permalink / raw)
To: Alistair Popple
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
On Tue, Sep 10, 2024 at 02:14:36PM +1000, Alistair Popple wrote:
Hi Alistair,
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 5a4a753..4537a29 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -193,7 +192,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
> pmd_t pmd;
> VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
> VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> - !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
> + || !pmd_present(*pmdp));
That looks broken.
> pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
> /*
> * if it not a fullmm flush, then we can possibly end up converting
Thanks!
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn
2024-09-22 1:41 ` Dan Williams
@ 2024-10-01 10:43 ` Gerald Schaefer
0 siblings, 0 replies; 53+ messages in thread
From: Gerald Schaefer @ 2024-10-01 10:43 UTC (permalink / raw)
To: Dan Williams
Cc: Alistair Popple, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david, hca, gor, agordeev, borntraeger, svens, linux-s390
On Sun, 22 Sep 2024 03:41:57 +0200
Dan Williams <dan.j.williams@intel.com> wrote:
> [ add s390 folks to comment on CONFIG_FS_DAX_LIMITED ]
[...]
> > @@ -2516,6 +2545,44 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> > return VM_FAULT_NOPAGE;
> > }
> >
> > +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + pgprot_t pgprot = vma->vm_page_prot;
> > + unsigned long pfn = pfn_t_to_pfn(pfn_t);
> > + struct page *page = pfn_to_page(pfn);
>
> The problem here is that we stubbornly have __dcssblk_direct_access() to
> worry about. That is the only dax driver that does not return
> pfn_valid() pfns.
>
> In fact, it looks like __dcssblk_direct_access() is the only thing
> standing in the way of the removal of pfn_t.
>
> It turns out it has been 3 years since the last time the question of
> bringing s390 fully into the ZONE_DEVICE regime was raised:
>
> https://lore.kernel.org/all/20210820210318.187742e8@thinkpad/
>
> Given that this series removes PTE_DEVMAP which was a stumbling block,
> would it be feasible to remove CONFIG_FS_DAX_LIMITED for a few kernel
> cycles until someone from the s390 side can circle back to add full
> ZONE_DEVICE support?
Yes, see also my reply to your "dcssblk: Mark DAX broken" patch.
Thanks Alistair for your effort, making ZONE_DEVICE usable w/o extra
PTE bit!
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-22 1:00 ` Dan Williams
@ 2024-10-11 0:17 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-11 0:17 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
[...]
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 40d4547..07bbe0e 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
>> folio->mapping = NULL;
>> folio->page.pgmap->ops->page_free(folio_page(folio, 0));
>>
>> - if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
>> - folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
>> + switch (folio->page.pgmap->type) {
>> + case MEMORY_DEVICE_PRIVATE:
>> + case MEMORY_DEVICE_COHERENT:
>> + put_dev_pagemap(folio->page.pgmap);
>> + break;
>> +
>> + case MEMORY_DEVICE_FS_DAX:
>> + case MEMORY_DEVICE_GENERIC:
>> /*
>> * Reset the refcount to 1 to prepare for handing out the page
>> * again.
>> */
>> folio_set_count(folio, 1);
>> - else
>> - put_dev_pagemap(folio->page.pgmap);
>> + break;
>> +
>> + case MEMORY_DEVICE_PCI_P2PDMA:
>> + break;
>
> A follow on cleanup is that either all implementations should be
> put_dev_pagemap(), or none of them. Put the onus on the implementation
> to track how many pages it has handed out in the implementation
> allocator.
Agreed. I've ignored the get/put_dev_pagemap() calls for this clean up
but am planning to do a follow up to clean those up too, probably by
removing them entirely as you suggest.
[...]
> For this one:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Thanks.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one
2024-09-11 0:48 ` Logan Gunthorpe
@ 2024-10-11 0:20 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-11 0:20 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david
Logan Gunthorpe <logang@deltatee.com> writes:
> On 2024-09-09 22:14, Alistair Popple wrote:
>> The reference counts for ZONE_DEVICE private pages should be
>> initialised by the driver when the page is actually allocated by the
>> driver allocator, not when they are first created. This is currently
>> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
>> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>> drivers/pci/p2pdma.c | 6 ++++++
>> mm/memremap.c | 17 +++++++++++++----
>> mm/mm_init.c | 22 ++++++++++++++++++----
>> 3 files changed, 37 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 4f47a13..210b9f4 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -129,6 +129,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> }
>>
>> /*
>> + * Initialise the refcount for the freshly allocated page. As we have
>> + * just allocated the page no one else should be using it.
>> + */
>> + set_page_count(virt_to_page(kaddr), 1);
>> +
>> + /*
>> * vm_insert_page() can sleep, so a reference is taken to mapping
>> * such that rcu_read_unlock() can be done before inserting the
>> * pages
> This seems to only set the reference count of the first page, when there
> can be more than one page referenced by kaddr.
Good point.
> I suspect the page count adjustment should be done in the for loop
> that's a few lines lower than this.
Have moved it there for the next version, thanks!
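For reference, the change amounts to initialising every page backing
the allocation inside that loop rather than only the first one; a rough
sketch with illustrative names:

static void p2pmem_init_page_refs(void *kaddr, size_t len)
{
	size_t off;

	/* Freshly allocated p2pdma pages now start with a zero refcount,
	 * so each page of the allocation needs its count set. */
	for (off = 0; off < len; off += PAGE_SIZE)
		set_page_count(virt_to_page(kaddr + off), 1);
}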
> I think a similar mistake was made by other recent changes.
>
> Thanks,
>
> Logan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages
2024-09-22 2:07 ` Dan Williams
@ 2024-10-14 6:33 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-14 6:33 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
>> Currently DAX folio/page reference counts are managed differently to
>> normal pages. To allow these to be managed the same as normal pages
>> introduce dax_insert_pfn_pud. This will map the entire PUD-sized folio
>> and take references as it would for a normally mapped page.
>>
>> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
>> simply inserts a special devmap PUD entry into the page table without
>> holding a reference to the page for the mapping.
>
> This is missing some description or comment in the code about the
> differences. More questions below:
>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 4 ++-
>> include/linux/rmap.h | 15 +++++++-
>> mm/huge_memory.c | 93 ++++++++++++++++++++++++++++++++++++------
>> mm/rmap.c | 49 ++++++++++++++++++++++-
>> 4 files changed, 149 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 6370026..d3a1872 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>
>> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>>
>> enum transparent_hugepage_flag {
>> TRANSPARENT_HUGEPAGE_UNSUPPORTED,
>> @@ -114,6 +115,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>> #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
>> #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
>>
>> +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
>> +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
>> +
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>
>> extern unsigned long transparent_hugepage_flags;
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 91b5935..c465694 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
>> enum rmap_level {
>> RMAP_LEVEL_PTE = 0,
>> RMAP_LEVEL_PMD,
>> + RMAP_LEVEL_PUD,
>> };
>>
>> static inline void __folio_rmap_sanity_checks(struct folio *folio,
>> @@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
>> VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
>> VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
>> break;
>> + case RMAP_LEVEL_PUD:
>> + /*
>> + * Asume that we are creating * a single "entire" mapping of the
>> + * folio.
>> + */
>> + VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
>> + VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
>> + break;
>> default:
>> VM_WARN_ON_ONCE(true);
>> }
>> @@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>> folio_add_file_rmap_ptes(folio, page, 1, vma)
>> void folio_add_file_rmap_pmd(struct folio *, struct page *,
>> struct vm_area_struct *);
>> +void folio_add_file_rmap_pud(struct folio *, struct page *,
>> + struct vm_area_struct *);
>> void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>> struct vm_area_struct *);
>> #define folio_remove_rmap_pte(folio, page, vma) \
>> folio_remove_rmap_ptes(folio, page, 1, vma)
>> void folio_remove_rmap_pmd(struct folio *, struct page *,
>> struct vm_area_struct *);
>> +void folio_remove_rmap_pud(struct folio *, struct page *,
>> + struct vm_area_struct *);
>>
>> void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>> unsigned long address, rmap_t flags);
>> @@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>> atomic_add(orig_nr_pages, &folio->_large_mapcount);
>> break;
>> case RMAP_LEVEL_PMD:
>> + case RMAP_LEVEL_PUD:
>> atomic_inc(&folio->_entire_mapcount);
>> atomic_inc(&folio->_large_mapcount);
>> break;
>> @@ -437,6 +451,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>> atomic_add(orig_nr_pages, &folio->_large_mapcount);
>> break;
>> case RMAP_LEVEL_PMD:
>> + case RMAP_LEVEL_PUD:
>> if (PageAnonExclusive(page)) {
>> if (unlikely(maybe_pinned))
>> return -EBUSY;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c4b45ad..e8985a4 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1336,21 +1336,19 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> struct mm_struct *mm = vma->vm_mm;
>> pgprot_t prot = vma->vm_page_prot;
>> pud_t entry;
>> - spinlock_t *ptl;
>>
>> - ptl = pud_lock(mm, pud);
>> if (!pud_none(*pud)) {
>> if (write) {
>> if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
>> WARN_ON_ONCE(!is_huge_zero_pud(*pud));
>> - goto out_unlock;
>> + return;
>> }
>> entry = pud_mkyoung(*pud);
>> entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
>> if (pudp_set_access_flags(vma, addr, pud, entry, 1))
>> update_mmu_cache_pud(vma, addr, pud);
>> }
>> - goto out_unlock;
>> + return;
>> }
>>
>> entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>> @@ -1362,9 +1360,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> }
>> set_pud_at(mm, addr, pud, entry);
>> update_mmu_cache_pud(vma, addr, pud);
>> -
>> -out_unlock:
>> - spin_unlock(ptl);
>> }
>>
>> /**
>> @@ -1382,6 +1377,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>> unsigned long addr = vmf->address & PUD_MASK;
>> struct vm_area_struct *vma = vmf->vma;
>> pgprot_t pgprot = vma->vm_page_prot;
>> + spinlock_t *ptl;
>>
>> /*
>> * If we had pud_special, we could avoid all these restrictions,
>> @@ -1399,10 +1395,52 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>>
>> track_pfn_insert(vma, &pgprot, pfn);
>>
>> + ptl = pud_lock(vma->vm_mm, vmf->pud);
>> insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
>> + spin_unlock(ptl);
>> +
>> return VM_FAULT_NOPAGE;
>> }
>> EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
>> +
>> +/**
>> + * dax_insert_pfn_pud - insert a pud size pfn backed by a normal page
>> + * @vmf: Structure describing the fault
>> + * @pfn: pfn of the page to insert
>> + * @write: whether it's a write fault
>
> It strikes me that this documentation is not useful for recalling why
> both vmf_insert_pfn_pud() and dax_insert_pfn_pud() exist. It looks like
> the only difference is that the "dax_" flavor takes a reference on the
> page. So maybe all these dax_insert_pfn{,_pmd,_pud} helpers should be
> unified in a common vmf_insert_page() entry point where the caller is
> responsible for initializing the compound page metadata before calling
> the helper?
Honestly I'm impressed I wrote documentation at all :-) Even if it was
just a terrible cut-and-paste job. What exactly do you mean by
"initializing the compound page metadata" though? I assume this bit:
folio_get(folio);
folio_add_file_rmap_pud(folio, page, vma);
add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
The problem with doing that in the caller is that, at the very least, the
folio_add_file_rmap_*() calls need to happen under the PTL. Of course the
point on lack of documentation still stands, and as an aside my original
plan was to remove the vmf_insert_pfn* variants as dead-code once DAX
stopped using them, but Peter pointed out this series which added some
callers:
https://lore.kernel.org/all/20240826204353.2228736-20-peterx@redhat.com/T/
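Coming back to the PTL point, the sequence has to look roughly like the
hunk above - the rmap and counter updates can't be hoisted out from under
the PUD lock (this is just a sketch of the ordering, same names as the
patch):

	ptl = pud_lock(mm, vmf->pud);
	if (pud_none(*vmf->pud)) {
		folio_get(folio);
		/* rmap update must be done while holding the PTL */
		folio_add_file_rmap_pud(folio, page, vma);
		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
	}
	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
	spin_unlock(ptl);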
>> + *
>> + * Return: vm_fault_t value.
>> + */
>> +vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>> +{
>> + struct vm_area_struct *vma = vmf->vma;
>> + unsigned long addr = vmf->address & PUD_MASK;
>> + pud_t *pud = vmf->pud;
>> + pgprot_t prot = vma->vm_page_prot;
>> + struct mm_struct *mm = vma->vm_mm;
>> + spinlock_t *ptl;
>> + struct folio *folio;
>> + struct page *page;
>> +
>> + if (addr < vma->vm_start || addr >= vma->vm_end)
>> + return VM_FAULT_SIGBUS;
>> +
>> + track_pfn_insert(vma, &prot, pfn);
>> +
>> + ptl = pud_lock(mm, pud);
>> + if (pud_none(*vmf->pud)) {
>> + page = pfn_t_to_page(pfn);
>> + folio = page_folio(page);
>> + folio_get(folio);
>> + folio_add_file_rmap_pud(folio, page, vma);
>> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
>> + }
>> + insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
>> + spin_unlock(ptl);
>> +
>> + return VM_FAULT_NOPAGE;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pud);
>> #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>
>> void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> @@ -1947,7 +1985,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> zap_deposited_table(tlb->mm, pmd);
>> spin_unlock(ptl);
>> } else if (is_huge_zero_pmd(orig_pmd)) {
>> - zap_deposited_table(tlb->mm, pmd);
>> + if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
>> + zap_deposited_table(tlb->mm, pmd);
>
> This looks subtle to me. Why is it needed to skip zap_deposited_table()
> (I assume it is some THP assumption about the page being from the page
> allocator)? Why is it ok to to force the zap if the arch demands it?
We don't skip it - it happens in the "else" clause if required. Note
that vma_is_special_huge() is true for a DAX VMA. So the checks are to
maintain existing behaviour - previously a DAX VMA would take this path:
if (vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
Now that DAX pages are treated normally we need to take the "else"
clause. So in the case of a zero pmd we only zap_deposited_table() if it
is not a DAX VMA or if it is a DAX VMA and the arch requires it, which is
the same as what would happen previously.
Of course that's not to say the previous behaviour was correct, I am far
from an expert on the pgtable deposit/withdraw code.
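To make the equivalence concrete, the huge-zero-pmd case in the patch
boils down to (just the hunk above with a comment added, not new logic):

	/*
	 * Previously a DAX VMA took the vma_is_special_huge() branch and
	 * only zapped the deposited table when the arch required it; now
	 * it falls through here, so the same condition is written out.
	 */
	if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
		zap_deposited_table(tlb->mm, pmd);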
>> spin_unlock(ptl);
>> } else {
>> struct folio *folio = NULL;
>> @@ -2435,12 +2474,24 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
>> arch_check_zapped_pud(vma, orig_pud);
>> tlb_remove_pud_tlb_entry(tlb, pud, addr);
>> - if (vma_is_special_huge(vma)) {
>> + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
>
> If vma_is_special_huge() is true vma_is_dax() will always be false, so
> not clear to me why this check is combined?
Because that's not true:
static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
{
return vma_is_dax(vma) || (vma->vm_file &&
(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}
I suppose there is a good argument to change that now though. Thoughts?
>> spin_unlock(ptl);
>> /* No zero page support yet */
>> } else {
>> - /* No support for anonymous PUD pages yet */
>> - BUG();
>> + struct page *page = NULL;
>> + struct folio *folio;
>> +
>> + /* No support for anonymous PUD pages or migration yet */
>> + BUG_ON(vma_is_anonymous(vma) || !pud_present(orig_pud));
>> +
>> + page = pud_page(orig_pud);
>> + folio = page_folio(page);
>> + folio_remove_rmap_pud(folio, page, vma);
>> + VM_BUG_ON_PAGE(!PageHead(page), page);
>> + add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
>> +
>> + spin_unlock(ptl);
>> + tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
>> }
>> return 1;
>> }
>> @@ -2448,6 +2499,8 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> unsigned long haddr)
>> {
>> + pud_t old_pud;
>> +
>> VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
>> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
>> @@ -2455,7 +2508,23 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>
>> count_vm_event(THP_SPLIT_PUD);
>>
>> - pudp_huge_clear_flush(vma, haddr, pud);
>> + old_pud = pudp_huge_clear_flush(vma, haddr, pud);
>> + if (is_huge_zero_pud(old_pud))
>> + return;
>> +
>> + if (vma_is_dax(vma)) {
>> + struct page *page = pud_page(old_pud);
>> + struct folio *folio = page_folio(page);
>> +
>> + if (!folio_test_dirty(folio) && pud_dirty(old_pud))
>> + folio_mark_dirty(folio);
>> + if (!folio_test_referenced(folio) && pud_young(old_pud))
>> + folio_set_referenced(folio);
>> + folio_remove_rmap_pud(folio, page, vma);
>> + folio_put(folio);
>> + add_mm_counter(vma->vm_mm, mm_counter_file(folio),
>> + -HPAGE_PUD_NR);
>> + }
>
> So this does not split anything (no follow-on set_ptes()) it just clears
> and updates some folio metadata. Something is wrong if we get this far
> since the only dax mechanism that attempts PUD mappings is device-dax,
> and device-dax is not prepared for PUD mappings to be fractured.
Ok. Current upstream will just clear them though, and this just
maintains that behaviour along with updating metadata. So are you saying
that we shouldn't be able to get here? Or that just clearing the PUD is
wrong? Because unless I've missed something I think we have been able to
get here since support for transparent PUD with DAX was added.
> Peter Xu recently fixed mprotect() vs DAX PUD mappings, I need to check
> how that interacts with this.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
2024-09-27 2:48 ` Dan Williams
@ 2024-10-14 6:53 ` Alistair Popple
2024-10-23 23:14 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-10-14 6:53 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
>> Currently DAX folio/page reference counts are managed differently to
>> normal pages. To allow these to be managed the same as normal pages
>> introduce dax_insert_pfn_pmd. This will map the entire PMD-sized folio
>> and take references as it would for a normally mapped page.
>>
>> This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
>> simply inserts a special devmap PMD entry into the page table without
>> holding a reference to the page for the mapping.
>
> It would be useful to mention the rationale for the locking changes and
> your understanding of the new "pgtable deposit" handling, because those
> things make this not a trivial conversion.
My intent was not to change the locking for the existing
vmf_insert_pfn_pmd() but just to move it up a level in the stack so
dax_insert_pfn_pmd() could do the metadata manipulation while holding
the lock. Looks like I didn't get that quite right though, so I will
review it for the next version.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>> include/linux/huge_mm.h | 1 +-
>> mm/huge_memory.c | 57 ++++++++++++++++++++++++++++++++++--------
>> 2 files changed, 48 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index d3a1872..eaf3f78 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -40,6 +40,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>
>> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>> +vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>> vm_fault_t dax_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
>>
>> enum transparent_hugepage_flag {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e8985a4..790041e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1237,14 +1237,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> {
>> struct mm_struct *mm = vma->vm_mm;
>> pmd_t entry;
>> - spinlock_t *ptl;
>>
>> - ptl = pmd_lock(mm, pmd);
>> if (!pmd_none(*pmd)) {
>> if (write) {
>> if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
>> WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>> - goto out_unlock;
>> + return;
>> }
>> entry = pmd_mkyoung(*pmd);
>> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>> @@ -1252,7 +1250,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> update_mmu_cache_pmd(vma, addr, pmd);
>> }
>>
>> - goto out_unlock;
>> + return;
>> }
>>
>> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> @@ -1271,11 +1269,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>>
>> set_pmd_at(mm, addr, pmd, entry);
>> update_mmu_cache_pmd(vma, addr, pmd);
>> -
>> -out_unlock:
>> - spin_unlock(ptl);
>> - if (pgtable)
>> - pte_free(mm, pgtable);
>> }
>>
>> /**
>> @@ -1294,6 +1287,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>> struct vm_area_struct *vma = vmf->vma;
>> pgprot_t pgprot = vma->vm_page_prot;
>> pgtable_t pgtable = NULL;
>> + spinlock_t *ptl;
>>
>> /*
>> * If we had pmd_special, we could avoid all these restrictions,
>> @@ -1316,12 +1310,55 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>> }
>>
>> track_pfn_insert(vma, &pgprot, pfn);
>> -
>> + ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
>> + spin_unlock(ptl);
>> + if (pgtable)
>> + pte_free(vma->vm_mm, pgtable);
>> +
>> return VM_FAULT_NOPAGE;
>> }
>> EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>>
>> +vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>> +{
>> + struct vm_area_struct *vma = vmf->vma;
>> + unsigned long addr = vmf->address & PMD_MASK;
>> + struct mm_struct *mm = vma->vm_mm;
>> + spinlock_t *ptl;
>> + pgtable_t pgtable = NULL;
>> + struct folio *folio;
>> + struct page *page;
>> +
>> + if (addr < vma->vm_start || addr >= vma->vm_end)
>> + return VM_FAULT_SIGBUS;
>> +
>> + if (arch_needs_pgtable_deposit()) {
>> + pgtable = pte_alloc_one(vma->vm_mm);
>> + if (!pgtable)
>> + return VM_FAULT_OOM;
>> + }
>> +
>> + track_pfn_insert(vma, &vma->vm_page_prot, pfn);
>> +
>> + ptl = pmd_lock(mm, vmf->pmd);
>> + if (pmd_none(*vmf->pmd)) {
>> + page = pfn_t_to_page(pfn);
>> + folio = page_folio(page);
>> + folio_get(folio);
>> + folio_add_file_rmap_pmd(folio, page, vma);
>> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
>> + }
>> + insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
>> + write, pgtable);
>> + spin_unlock(ptl);
>> + if (pgtable)
>> + pte_free(mm, pgtable);
>
> Are not the deposit rules that the extra page table stick around for the
> lifetime of the inserted pte? So would that not require this incremental
> change?
Yeah, thanks for catching this.
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ea65c2db2bb1..5ef1e5d21a96 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1232,7 +1232,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>
> static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd, unsigned long pfn, pgprot_t prot,
> - bool write, pgtable_t pgtable)
> + bool write, pgtable_t *pgtable)
> {
> struct mm_struct *mm = vma->vm_mm;
> pmd_t entry;
> @@ -1258,10 +1258,10 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> entry = maybe_pmd_mkwrite(entry, vma);
> }
>
> - if (pgtable) {
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + if (*pgtable) {
> + pgtable_trans_huge_deposit(mm, pmd, *pgtable);
> mm_inc_nr_ptes(mm);
> - pgtable = NULL;
> + *pgtable = NULL;
> }
>
> set_pmd_at(mm, addr, pmd, entry);
> @@ -1306,7 +1306,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, bool writ
>
> track_pfn_insert(vma, &pgprot, pfn);
> ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> - insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
> + insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, &pgtable);
> spin_unlock(ptl);
> if (pgtable)
> pte_free(vma->vm_mm, pgtable);
> @@ -1344,8 +1344,8 @@ vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn, bool writ
> folio_add_file_rmap_pmd(folio, page, vma);
> add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> }
> - insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot,
> - write, pgtable);
> + insert_pfn_pmd(vma, addr, vmf->pmd, pfn, vma->vm_page_prot, write,
> + &pgtable);
> spin_unlock(ptl);
> if (pgtable)
> pte_free(mm, pgtable);
> ---
>
> Along these lines it would be lovely if someone from the PowerPC side
> could test these changes, or if someone has a canned qemu command line
> to test radix vs hash with pmem+dax that they can share?
Michael, Nick, do you know of a qemu command or anyone who might?
>> +
>> + return VM_FAULT_NOPAGE;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
>
> Like I mentioned before, lets make the exported function
> vmf_insert_folio() and move the pte, pmd, pud internal private / static
> details of the implementation. The "dax_" specific aspect of this was
> removed at the conversion of a dax_pfn to a folio.
Ok, let me try that. Note that vmf_insert_pfn{_pmd|_pud} will have to
stick around though.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2024-09-27 2:52 ` Dan Williams
@ 2024-10-14 7:03 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-14 7:03 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Dan Williams wrote:
>> Alistair Popple wrote:
>> > Longterm pinning of FS DAX pages should already be disallowed by
>> > various pXX_devmap checks. However a future change will cause these
>> > checks to be invalid for FS DAX pages so make
>> > folio_is_longterm_pinnable() return false for FS DAX pages.
>> >
>> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> > ---
>> > include/linux/memremap.h | 11 +++++++++++
>> > include/linux/mm.h | 4 ++++
>> > 2 files changed, 15 insertions(+)
>> >
>> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> > index 14273e6..6a1406a 100644
>> > --- a/include/linux/memremap.h
>> > +++ b/include/linux/memremap.h
>> > @@ -187,6 +187,17 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
>> > return is_device_coherent_page(&folio->page);
>> > }
>> >
>> > +static inline bool is_device_dax_page(const struct page *page)
>> > +{
>> > + return is_zone_device_page(page) &&
>> > + page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
>> > +}
>> > +
>> > +static inline bool folio_is_device_dax(const struct folio *folio)
>> > +{
>> > + return is_device_dax_page(&folio->page);
>> > +}
>> > +
>> > #ifdef CONFIG_ZONE_DEVICE
>> > void zone_device_page_init(struct page *page);
>> > void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> > diff --git a/include/linux/mm.h b/include/linux/mm.h
>> > index ae6d713..935e493 100644
>> > --- a/include/linux/mm.h
>> > +++ b/include/linux/mm.h
>> > @@ -1989,6 +1989,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
>> > if (folio_is_device_coherent(folio))
>> > return false;
>> >
>> > + /* DAX must also always allow eviction. */
>> > + if (folio_is_device_dax(folio))
>>
>> Why is this called "folio_is_device_dax()" when the check is for fsdax?
>>
>> I would expect:
>>
>> if (folio_is_fsdax(folio))
>> return false;
>>
>> ...and s/device_dax/fsdax/ for the rest of the helpers.
>
> Specifically devdax is ok to allow longterm pinning since it is
> statically allocated. fsdax is the only ZONE_DEVICE mode where there is
> a higher-level allocator that does not support a 3rd party the block its
> operations indefinitely with a pin. So this needs to be explicit for
> that case.
Yeah, that all makes sense. I see what I did - I was thinking in terms
of "is this a zone device page" (is_device) and, if so, what type
((fs)dax). folio_is_fsdax() is much clearer though, thanks!
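For what it's worth, the rename would just be (a sketch based on your
suggestion; the bodies are unchanged from the patch):

	static inline bool is_fsdax_page(const struct page *page)
	{
		return is_zone_device_page(page) &&
			page_dev_pagemap(page)->type == MEMORY_DEVICE_FS_DAX;
	}

	static inline bool folio_is_fsdax(const struct folio *folio)
	{
		return is_fsdax_page(&folio->page);
	}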
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 11/12] mm: Remove pXX_devmap callers
2024-09-27 12:29 ` Alexander Gordeev
@ 2024-10-14 7:14 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-14 7:14 UTC (permalink / raw)
To: Alexander Gordeev
Cc: dan.j.williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
Alexander Gordeev <agordeev@linux.ibm.com> writes:
> On Tue, Sep 10, 2024 at 02:14:36PM +1000, Alistair Popple wrote:
>
> Hi Alistair,
>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index 5a4a753..4537a29 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -193,7 +192,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
>> pmd_t pmd;
>> VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
>> VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
>> - !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
>> + || !pmd_present(*pmdp));
>
> That looks broken.
Thanks! Clearly I'd better go reinstall that PPC cross-compiler...
>> pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
>> /*
>> * if it not a fullmm flush, then we can possibly end up converting
>
> Thanks!
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages
2024-09-27 7:15 ` Dan Williams
@ 2024-10-14 7:16 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-14 7:16 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
>> Currently if a PTE points to a FS DAX page vm_normal_page() will
>> return NULL as these have their own special refcounting scheme. A
>> future change will allow FS DAX pages to be refcounted the same as any
>> other normal page.
>>
>> Therefore vm_normal_page() will start returning FS DAX pages. To avoid
>> any change in behaviour callers that don't expect FS DAX pages will
>> need to explicitly check for this. As vm_normal_page() can already
>> return ZONE_DEVICE pages most callers already include a check for any
>> ZONE_DEVICE page.
>>
>> However some callers don't, so add explicit checks where required.
>
> I would expect justification for each of these conversions, and
> hopefully with fsdax returning fully formed folios there is less need to
> sprinkle these checks around.
>
> At a minimum I think this patch needs to be broken up by file touched.
Good idea.
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> ---
>> arch/x86/mm/pat/memtype.c | 4 +++-
>> fs/proc/task_mmu.c | 16 ++++++++++++----
>> mm/memcontrol-v1.c | 2 +-
>> 3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 1fa0bf6..eb84593 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -951,6 +951,7 @@ static void free_pfn_range(u64 paddr, unsigned long size)
>> static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
>> resource_size_t *phys)
>> {
>> + struct folio *folio;
>> pte_t *ptep, pte;
>> spinlock_t *ptl;
>>
>> @@ -960,7 +961,8 @@ static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
>> pte = ptep_get(ptep);
>>
>> /* Never return PFNs of anon folios in COW mappings. */
>> - if (vm_normal_folio(vma, vma->vm_start, pte)) {
>> + folio = vm_normal_folio(vma, vma->vm_start, pte);
>> + if (folio || (folio && !folio_is_device_dax(folio))) {
>
> ...for example, I do not immediately see why follow_phys() would need to
> be careful with fsdax pages?
The intent was to maintain the original behaviour as much as
possible, partly to reduce the chance of unintended bugs/consequences
and partly to maintain my sanity by not having to dig too deeply into
all the callers.
I see I got this a little bit wrong though - it only filters FSDAX pages
and not device DAX (my intent was to filter both).
> ...but I do see why copy_page_range() (which calls follow_phys() through
> track_pfn_copy()) might care. It just turns out that vma_needs_copy(),
> afaics, bypasses dax MAP_SHARED mappings.
>
> So this touch of memtype.c looks like it can be dropped.
Ok. Although it feels safer to leave it (along with a check for device
DAX). Someone can always remove it in future if they really do want DAX
pages, but this is all x86 specific so I will take your guidance here.
>> pte_unmap_unlock(ptep, ptl);
>> return -EINVAL;
>> }
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 5f171ad..456b010 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -816,6 +816,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>
>> if (pte_present(ptent)) {
>> page = vm_normal_page(vma, addr, ptent);
>> + if (page && is_device_dax_page(page))
>> + page = NULL;
>> young = pte_young(ptent);
>> dirty = pte_dirty(ptent);
>> present = true;
>> @@ -864,6 +866,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>>
>> if (pmd_present(*pmd)) {
>> page = vm_normal_page_pmd(vma, addr, *pmd);
>> + if (page && is_device_dax_page(page))
>> + page = NULL;
>> present = true;
>> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(*pmd);
>
> The above can be replaced with a catch like
>
> if (folio_test_device(folio))
> return;
>
> ...in smaps_account() since ZONE_DEVICE pages are not suitable to
> account as they do not reflect any memory pressure on the system memory
> pool.
Sounds good.
>> @@ -1385,7 +1389,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
>> if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
>> return false;
>> folio = vm_normal_folio(vma, addr, pte);
>> - if (!folio)
>> + if (!folio || folio_is_device_dax(folio))
>> return false;
>> return folio_maybe_dma_pinned(folio);
>
> The whole point of ZONE_DEVICE is to account for DMA so I see no reason
> for pte_is_pinned() to special case dax. The caller of pte_is_pinned()
> is doing it for soft_dirty reasons, and I believe soft_dirty is already
> disabled for vma_is_dax(). I assume MEMORY_DEVICE_PRIVATE also does not
> support soft-dirty, so I expect all ZONE_DEVICE already opt-out of this.
Actually soft-dirty is theoretically supported on DEVICE_PRIVATE pages
in the sense that the soft-dirty bits are copied around. Whether or not
it actually works is a different question though, I've certainly never
tried it.
Again, I was just trying to maintain the previous behaviour but can drop
this check.
>> }
>> @@ -1710,6 +1714,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> frame = pte_pfn(pte);
>> flags |= PM_PRESENT;
>> page = vm_normal_page(vma, addr, pte);
>> + if (page && is_device_dax_page(page))
>> + page = NULL;
>> if (pte_soft_dirty(pte))
>> flags |= PM_SOFT_DIRTY;
>> if (pte_uffd_wp(pte))
>> @@ -2096,7 +2102,8 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
>>
>> if (p->masks_of_interest & PAGE_IS_FILE) {
>> page = vm_normal_page(vma, addr, pte);
>> - if (page && !PageAnon(page))
>> + if (page && !PageAnon(page) &&
>> + !is_device_dax_page(page))
>> categories |= PAGE_IS_FILE;
>> }
>>
>> @@ -2158,7 +2165,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
>>
>> if (p->masks_of_interest & PAGE_IS_FILE) {
>> page = vm_normal_page_pmd(vma, addr, pmd);
>> - if (page && !PageAnon(page))
>> + if (page && !PageAnon(page) &&
>> + !is_device_dax_page(page))
>> categories |= PAGE_IS_FILE;
>> }
>>
>> @@ -2919,7 +2927,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
>> return NULL;
>>
>> page = vm_normal_page_pmd(vma, addr, pmd);
>> - if (!page)
>> + if (!page || is_device_dax_page(page))
>> return NULL;
>
> I am not immediately seeing a reason to block pagemap_read() from
> interrogating dax-backed virtual mappings. I think these protections can
> be dropped.
Ok.
>>
>> if (PageReserved(page))
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index b37c0d8..e16053c 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -667,7 +667,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>> {
>> struct page *page = vm_normal_page(vma, addr, ptent);
>>
>> - if (!page)
>> + if (!page || is_device_dax_page(page))
>> return NULL;
>> if (PageAnon(page)) {
>> if (!(mc.flags & MOVE_ANON))
>
> I think this better handled with something like this to disable all
> memcg accounting for ZONE_DEVICE pages:
Ok, thanks for the review.
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index b37c0d870816..cfc43e8c59fe 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -940,8 +940,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> */
> if (folio_memcg(folio) == mc.from) {
> ret = MC_TARGET_PAGE;
> - if (folio_is_device_private(folio) ||
> - folio_is_device_coherent(folio))
> + if (folio_is_device(folio))
> ret = MC_TARGET_DEVICE;
> if (target)
> target->folio = folio;
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
2024-10-14 6:53 ` Alistair Popple
@ 2024-10-23 23:14 ` Alistair Popple
2024-10-23 23:38 ` Dan Williams
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-10-23 23:14 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
Alistair Popple <apopple@nvidia.com> writes:
> Alistair Popple wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
[...]
>>> +
>>> + return VM_FAULT_NOPAGE;
>>> +}
>>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
>>
>> Like I mentioned before, lets make the exported function
>> vmf_insert_folio() and move the pte, pmd, pud internal private / static
>> details of the implementation. The "dax_" specific aspect of this was
>> removed at the conversion of a dax_pfn to a folio.
>
> Ok, let me try that. Note that vmf_insert_pfn{_pmd|_pud} will have to
> stick around though.
Creating a single vmf_insert_folio() seems somewhat difficult because it
needs to be called from multiple fault paths (either PTE, PMD or PUD
fault) and do something different for each.
Specifically the issue I ran into is that DAX does not downgrade PMD
entries to PTE entries if they are backed by storage. So the PTE fault
handler will get a PMD-sized DAX entry and therefore a PMD-sized folio.
The way I tried implementing vmf_insert_folio() was to look at
folio_order() to determine which internal implementation to call. But
that doesn't work for a PTE fault, because there's no way to determine
if we should PTE map a subpage or PMD map the entire folio.
We could pass down some context as to what type of fault we're handling,
or add it to the vmf struct, but that seems excessive given callers
already know this and could just call a specific
vmf_insert_page_{pte|pmd|pud}.
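For reference, the rough shape of what I tried was (illustrative only -
insert_pte/pmd/pud here stand in for the internal helpers, they aren't
the real names):

	vm_fault_t vmf_insert_folio(struct vm_fault *vmf, struct folio *folio,
				    bool write)
	{
		switch (folio_order(folio)) {
		case 0:
			return insert_pte(vmf, folio, write);
		case HPAGE_PMD_ORDER:
			return insert_pmd(vmf, folio, write);
		case HPAGE_PUD_ORDER:
			return insert_pud(vmf, folio, write);
		default:
			return VM_FAULT_FALLBACK;
		}
	}

The ambiguity is between the first two cases: on a PTE fault the folio
may well be PMD-order because DAX didn't downgrade the entry, and
folio_order() alone can't say whether to PTE-map a subpage or PMD-map
the whole folio.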
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages
2024-10-23 23:14 ` Alistair Popple
@ 2024-10-23 23:38 ` Dan Williams
0 siblings, 0 replies; 53+ messages in thread
From: Dan Williams @ 2024-10-23 23:38 UTC (permalink / raw)
To: Alistair Popple
Cc: Dan Williams, linux-mm, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard,
hch, david
Alistair Popple wrote:
>
> Alistair Popple <apopple@nvidia.com> writes:
>
> > Alistair Popple wrote:
> >> Dan Williams <dan.j.williams@intel.com> writes:
>
> [...]
>
> >>> +
> >>> + return VM_FAULT_NOPAGE;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
> >>
> >> Like I mentioned before, lets make the exported function
> >> vmf_insert_folio() and move the pte, pmd, pud internal private / static
> >> details of the implementation. The "dax_" specific aspect of this was
> >> removed at the conversion of a dax_pfn to a folio.
> >
> > Ok, let me try that. Note that vmf_insert_pfn{_pmd|_pud} will have to
> > stick around though.
>
> Creating a single vmf_insert_folio() seems somewhat difficult because it
> needs to be called from multiple fault paths (either PTE, PMD or PUD
> fault) and do something different for each.
>
> Specifically the issue I ran into is that DAX does not downgrade PMD
> entries to PTE entries if they are backed by storage. So the PTE fault
> handler will get a PMD-sized DAX entry and therefore a PMD size folio.
>
> The way I tried implementing vmf_insert_folio() was to look at
> folio_order() to determine which internal implementation to call. But
> that doesn't work for a PTE fault, because there's no way to determine
> if we should PTE map a subpage or PMD map the entire folio.
Ah, that conflict makes sense.
> We could pass down some context as to what type of fault we're handling,
> or add it to the vmf struct, but that seems excessive given callers
> already know this and could just call a specific
> vmf_insert_page_{pte|pmd|pud}.
Ok, I think it is good to capture that "because dax does not downgrade
entries it may satisfy PTE faults with PMD inserts", or something like
that in comment or changelog.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-09-27 7:59 ` Dan Williams
@ 2024-10-24 7:52 ` Alistair Popple
2024-10-24 23:52 ` Dan Williams
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-10-24 7:52 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
[...]
>> @@ -318,85 +323,58 @@ static unsigned long dax_end_pfn(void *entry)
>> */
>> #define for_each_mapped_pfn(entry, pfn) \
>> for (pfn = dax_to_pfn(entry); \
>> - pfn < dax_end_pfn(entry); pfn++)
>> + pfn < dax_end_pfn(entry); pfn++)
>>
>> -static inline bool dax_page_is_shared(struct page *page)
>> +static void dax_device_folio_init(struct folio *folio, int order)
>> {
>> - return page->mapping == PAGE_MAPPING_DAX_SHARED;
>> -}
>> + int orig_order = folio_order(folio);
>> + int i;
>>
>> -/*
>> - * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
>> - * refcount.
>> - */
>> -static inline void dax_page_share_get(struct page *page)
>> -{
>> - if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
>> - /*
>> - * Reset the index if the page was already mapped
>> - * regularly before.
>> - */
>> - if (page->mapping)
>> - page->share = 1;
>> - page->mapping = PAGE_MAPPING_DAX_SHARED;
>> - }
>> - page->share++;
>> -}
>> + if (orig_order != order) {
>> + struct dev_pagemap *pgmap = page_dev_pagemap(&folio->page);
>
> Was there a discussion I missed about why the conversion to typical
> folios allows the page->share accounting to be dropped.
The problem with keeping it is we now treat DAX pages as "normal"
pages according to vm_normal_page(). As such we use the normal paths
for unmapping pages.
Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
etc. to return true leading to all sorts of issues in at least the
unmap paths.
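To be concrete about the aliasing (values as they currently stand in
page-flags.h):

	#define PAGE_MAPPING_ANON		0x1
	#define PAGE_MAPPING_DAX_SHARED	((void *)0x1)	/* same bit */

	static __always_inline bool folio_test_anon(const struct folio *folio)
	{
		return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
	}

So once FS DAX pages are reachable via vm_normal_page() and go through
the generic unmap paths, a shared DAX page is indistinguishable from an
anonymous one.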
There hasn't been a previous discussion on this, but given this is
only used to print warnings it seemed easier to get rid of it. I
probably should have called that out more clearly in the commit
message though.
> I assume this is because the page->mapping validation was dropped, which
> I think might be useful to keep at least for one development cycle to
> make sure this conversion is not triggering any of the old warnings.
>
> Otherwise, the ->share field of 'struct page' can also be cleaned up.
Yes, we should also clean up the ->share field, unless you have an
alternate suggestion to solve the above issue.
>> -static inline unsigned long dax_page_share_put(struct page *page)
>> -{
>> - return --page->share;
>> -}
>> + for (i = 0; i < (1UL << orig_order); i++) {
>> + struct page *page = folio_page(folio, i);
>>
>> -/*
>> - * When it is called in dax_insert_entry(), the shared flag will indicate that
>> - * whether this entry is shared by multiple files. If so, set the page->mapping
>> - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>> - */
>> -static void dax_associate_entry(void *entry, struct address_space *mapping,
>> - struct vm_area_struct *vma, unsigned long address, bool shared)
>> -{
>> - unsigned long size = dax_entry_size(entry), pfn, index;
>> - int i = 0;
>> + ClearPageHead(page);
>> + clear_compound_head(page);
>>
>> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> - return;
>> -
>> - index = linear_page_index(vma, address & ~(size - 1));
>> - for_each_mapped_pfn(entry, pfn) {
>> - struct page *page = pfn_to_page(pfn);
>> + /*
>> + * Reset pgmap which was over-written by
>> + * prep_compound_page().
>> + */
>> + page_folio(page)->pgmap = pgmap;
>>
>> - if (shared) {
>> - dax_page_share_get(page);
>> - } else {
>> - WARN_ON_ONCE(page->mapping);
>> - page->mapping = mapping;
>> - page->index = index + i++;
>> + /* Make sure this isn't set to TAIL_MAPPING */
>> + page->mapping = NULL;
>> }
>> }
>> +
>> + if (order > 0) {
>> + prep_compound_page(&folio->page, order);
>> + if (order > 1)
>> + INIT_LIST_HEAD(&folio->_deferred_list);
>> + }
>> }
>>
>> -static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>> - bool trunc)
>> +static void dax_associate_new_entry(void *entry, struct address_space *mapping,
>> + pgoff_t index)
>
> Lets call this dax_create_folio(), to mirror filemap_create_folio() and
> have it transition the folio refcount from 0 to 1 to indicate that it is
> allocated.
>
> While I am not sure anything requires that, it seems odd that page cache
> pages have an elevated refcount at map time and dax pages do not.
The refcount gets elevated further up the call stack, but I agree it
would be clearer to move it here.
> It does have implications for the dax dma-idle tracking thought, see
> below.
>
>> {
>> - unsigned long pfn;
>> + unsigned long order = dax_entry_order(entry);
>> + struct folio *folio = dax_to_folio(entry);
>>
>> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> + if (!dax_entry_size(entry))
>> return;
>>
>> - for_each_mapped_pfn(entry, pfn) {
>> - struct page *page = pfn_to_page(pfn);
>> -
>> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>> - if (dax_page_is_shared(page)) {
>> - /* keep the shared flag if this page is still shared */
>> - if (dax_page_share_put(page) > 0)
>> - continue;
>> - } else
>> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>> - page->mapping = NULL;
>> - page->index = 0;
>> - }
>> + /*
>> + * We don't hold a reference for the DAX pagecache entry for the
>> + * page. But we need to initialise the folio so we can hand it
>> + * out. Nothing else should have a reference either.
>> + */
>> + WARN_ON_ONCE(folio_ref_count(folio));
>
> Per above I would feel more comfortable if we kept the paranoia around
> to ensure that all the pages in this folio have dropped all references
> and cleared ->mapping and ->index.
>
> That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
> delete in a follow-on development cycle, but in the meantime it helps to
> prove the correctness of the conversion.
I'm ok with paranoia, but as noted above the issue is that at a minimum
page->mapping (and probably index) now needs to be valid for any code
that might walk the page tables.
> [..]
>> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>> struct inode *inode = iter->inode;
>> unsigned long vaddr = vmf->address;
>> pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>> + struct page *page = pfn_t_to_page(pfn);
>> vm_fault_t ret;
>>
>> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>>
>> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>> + page_ref_inc(page);
>> + ret = dax_insert_pfn(vmf, pfn, false);
>> + put_page(page);
>
> Per above I think it is problematic to have pages live in the system
> without a refcount.
I'm a bit confused by this - the pages have a reference taken on them
when they are mapped. They only live in the system without a refcount
when the mm considers them free (except for the bit between getting
created in dax_associate_entry() and actually getting mapped but as
noted I will fix that).
> One scenario where this might be needed is invalidate_inode_pages() vs
> DMA. The invaldation should pause and wait for DMA pins to be dropped
> before the mapping xarray is cleaned up and the dax folio is marked
> free.
I'm not really following this scenario, or at least how it relates to
the comment above. If the page is pinned for DMA it will have taken a
refcount on it and so the page won't be considered free/idle per
dax_wait_page_idle() or any of the other mm code.
> I think this may be a gap in the current code. I'll attempt to write a
> test for this to check.
Ok, let me know if you come up with anything there as it might help
explain the problem more clearly.
> [..]
>> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
>> bool write = iter->flags & IOMAP_WRITE;
>> unsigned long entry_flags = pmd ? DAX_PMD : 0;
>> - int err = 0;
>> + int ret, err = 0;
>> pfn_t pfn;
>> void *kaddr;
>> + struct page *page;
>>
>> if (!pmd && vmf->cow_page)
>> return dax_fault_cow_page(vmf, iter);
>> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> if (dax_fault_is_synchronous(iter, vmf->vma))
>> return dax_fault_synchronous_pfnp(pfnp, pfn);
>>
>> - /* insert PMD pfn */
>> + page = pfn_t_to_page(pfn);
>
> I think this is clearer if dax_insert_entry() returns folios with an
> elevated refrence count that is dropped when the folio is invalidated
> out of the mapping.
I presume this comment is for the next line:
+ page_ref_inc(page);
I can move that into dax_insert_entry(), but we would still need to
drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
transition when the page is unmapped and therefore
freed. Alternatively we can make it so vmf_insert_*() don't take
references on the page, and instead ownership of the reference is
transfered to the mapping. Personally I prefered having those
functions take their own reference but let me know what you think.
> [..]
>> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>> lock_page(page);
>> }
>> EXPORT_SYMBOL_GPL(zone_device_page_init);
>> -
>> -#ifdef CONFIG_FS_DAX
>> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> -{
>> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> - return false;
>> -
>> - /*
>> - * fsdax page refcounts are 1-based, rather than 0-based: if
>> - * refcount is 1, then the page is free and the refcount is
>> - * stable because nobody holds a reference on the page.
>> - */
>> - if (folio_ref_sub_return(folio, refs) == 1)
>> - wake_up_var(&folio->_refcount);
>> - return true;
>
> It follow from the refcount disvussion above that I think there is an
> argument to still keep this wakeup based on the 2->1 transitition.
> pagecache pages are refcount==1 when they are dma-idle but still
> allocated. To keep the same semantics for dax a dax_folio would have an
> elevated refcount whenever it is referenced by mapping entry.
I'm not sold on keeping it as it doesn't seem to offer any benefit
IMHO. I know both Jason and Christoph were keen to see it go so it be
good to get their feedback too. Also one of the primary goals of this
series was to refcount the page normally so we could remove the whole
"page is free with a refcount of 1" semantics.
- Alistair
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-24 7:52 ` Alistair Popple
@ 2024-10-24 23:52 ` Dan Williams
2024-10-25 2:46 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-10-24 23:52 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Alistair Popple wrote:
[..]
> >
> > Was there a discussion I missed about why the conversion to typical
> > folios allows the page->share accounting to be dropped.
>
> The problem with keeping it is we now treat DAX pages as "normal"
> pages according to vm_normal_page(). As such we use the normal paths
> for unmapping pages.
>
> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
> aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
> etc. to return true leading to all sorts of issues in at least the
> unmap paths.
Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
PAGE_MAPPING_ANON.
> There hasn't been a previous discussion on this, but given this is
> only used to print warnings it seemed easier to get rid of it. I
> probably should have called that out more clearly in the commit
> message though.
>
> > I assume this is because the page->mapping validation was dropped, which
> > I think might be useful to keep at least for one development cycle to
> > make sure this conversion is not triggering any of the old warnings.
> >
> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
>
> Yes, we should also clean up the ->share field, unless you have an
> alternate suggestion to solve the above issue.
kmalloc minimum alignment is 8, so there is room to do this?
---
diff --git a/fs/dax.c b/fs/dax.c
index c62acd2812f8..a70f081c32cb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
static inline bool dax_page_is_shared(struct page *page)
{
- return page->mapping == PAGE_MAPPING_DAX_SHARED;
+ return folio_test_dax_shared(page_folio(page));
}
/*
@@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
*/
static inline void dax_page_share_get(struct page *page)
{
- if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
+ if (!dax_page_is_shared(page)) {
/*
* Reset the index if the page was already mapped
* regularly before.
*/
if (page->mapping)
page->share = 1;
- page->mapping = PAGE_MAPPING_DAX_SHARED;
+ page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
}
page->share++;
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1b3a76710487..21b355999ce0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
#define PAGE_MAPPING_ANON 0x1
#define PAGE_MAPPING_MOVABLE 0x2
#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+/* to be removed once typical page refcounting for dax proves stable */
+#define PAGE_MAPPING_DAX_SHARED 0x4
+#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
/*
* Different with flags above, this flag is used only for fsdax mode. It
* indicates that this page->mapping is now under reflink case.
*/
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
static __always_inline bool folio_mapping_flags(const struct folio *folio)
{
@@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
}
+static __always_inline bool folio_test_dax_shared(const struct folio *folio)
+{
+ return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
+}
+
static __always_inline bool PageAnon(const struct page *page)
{
return folio_test_anon(page_folio(page));
---
...and keep the validation around at least for one post conversion
development cycle?
> > It does have implications for the dax dma-idle tracking thought, see
> > below.
> >
> >> {
> >> - unsigned long pfn;
> >> + unsigned long order = dax_entry_order(entry);
> >> + struct folio *folio = dax_to_folio(entry);
> >>
> >> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> + if (!dax_entry_size(entry))
> >> return;
> >>
> >> - for_each_mapped_pfn(entry, pfn) {
> >> - struct page *page = pfn_to_page(pfn);
> >> -
> >> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> >> - if (dax_page_is_shared(page)) {
> >> - /* keep the shared flag if this page is still shared */
> >> - if (dax_page_share_put(page) > 0)
> >> - continue;
> >> - } else
> >> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> >> - page->mapping = NULL;
> >> - page->index = 0;
> >> - }
> >> + /*
> >> + * We don't hold a reference for the DAX pagecache entry for the
> >> + * page. But we need to initialise the folio so we can hand it
> >> + * out. Nothing else should have a reference either.
> >> + */
> >> + WARN_ON_ONCE(folio_ref_count(folio));
> >
> > Per above I would feel more comfortable if we kept the paranoia around
> > to ensure that all the pages in this folio have dropped all references
> > and cleared ->mapping and ->index.
> >
> > That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
> > delete in a follow-on development cycle, but in the meantime it helps to
> > prove the correctness of the conversion.
>
> I'm ok with paranoia, but as noted above the issue is that at a minimum
> page->mapping (and probably index) now needs to be valid for any code
> that might walk the page tables.
A quick look seems to say the confusion is limited to aliasing
PAGE_MAPPING_ANON.
> > [..]
> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> >> struct inode *inode = iter->inode;
> >> unsigned long vaddr = vmf->address;
> >> pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
> >> + struct page *page = pfn_t_to_page(pfn);
> >> vm_fault_t ret;
> >>
> >> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
> >>
> >> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> >> + page_ref_inc(page);
> >> + ret = dax_insert_pfn(vmf, pfn, false);
> >> + put_page(page);
> >
> > Per above I think it is problematic to have pages live in the system
> > without a refcount.
>
> I'm a bit confused by this - the pages have a reference taken on them
> when they are mapped. They only live in the system without a refcount
> when the mm considers them free (except for the bit between getting
> created in dax_associate_entry() and actually getting mapped but as
> noted I will fix that).
>
> > One scenario where this might be needed is invalidate_inode_pages() vs
> > DMA. The invaldation should pause and wait for DMA pins to be dropped
> > before the mapping xarray is cleaned up and the dax folio is marked
> > free.
>
> I'm not really following this scenario, or at least how it relates to
> the comment above. If the page is pinned for DMA it will have taken a
> refcount on it and so the page won't be considered free/idle per
> dax_wait_page_idle() or any of the other mm code.
[ tl;dr: I think we're ok, analysis below, but I did talk myself into
the proposed dax_busy_page() changes indeed being broken and needing to
remain checking for refcount > 1, not > 0 ]
It's not the mm code I am worried about. It's the filesystem block
allocator staying in-sync with the allocation state of the page.
fs/dax.c is charged with converting idle storage blocks to pfns to
mapped folios. Once they are mapped, DMA can pin the folio, but nothing
in fs/dax.c pins the mapping. In the pagecache case the page reference
is sufficient to keep the DMA-busy page from being reused. In the dax
case something needs to arrange for DMA to be idle before
dax_delete_mapping_entry().
However, looking at XFS it indeed makes that guarantee. First it does
xfs_break_dax_layouts() then it does truncate_inode_pages() =>
dax_delete_mapping_entry().
It follows that the DMA-idle condition still needs to look for the
case where the refcount is > 1 rather than 0 since refcount == 1 is the
page-mapped-but-DMA-idle condition.
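i.e. with the mapping itself holding a reference, the busy check looks
something like (a sketch, the helper name is made up):

	/*
	 * refcount == 0: folio is free
	 * refcount == 1: referenced only by the mapping entry, DMA idle
	 * refcount  > 1: someone else (e.g. GUP/DMA) still holds it
	 */
	static inline bool dax_folio_dma_busy(struct folio *folio)
	{
		return folio_ref_count(folio) > 1;
	}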
[..]
> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> >> bool write = iter->flags & IOMAP_WRITE;
> >> unsigned long entry_flags = pmd ? DAX_PMD : 0;
> >> - int err = 0;
> >> + int ret, err = 0;
> >> pfn_t pfn;
> >> void *kaddr;
> >> + struct page *page;
> >>
> >> if (!pmd && vmf->cow_page)
> >> return dax_fault_cow_page(vmf, iter);
> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> if (dax_fault_is_synchronous(iter, vmf->vma))
> >> return dax_fault_synchronous_pfnp(pfnp, pfn);
> >>
> >> - /* insert PMD pfn */
> >> + page = pfn_t_to_page(pfn);
> >
> > I think this is clearer if dax_insert_entry() returns folios with an
> > elevated refrence count that is dropped when the folio is invalidated
> > out of the mapping.
>
> I presume this comment is for the next line:
>
> + page_ref_inc(page);
>
> I can move that into dax_insert_entry(), but we would still need to
> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> transition when the page is unmapped and therefore
> freed. Alternatively we can make it so vmf_insert_*() don't take
> references on the page, and instead ownership of the reference is
> transfered to the mapping. Personally I prefered having those
> functions take their own reference but let me know what you think.
Oh, the model I was thinking of was that until vmf_insert_XXX() succeeds
the page was never allocated, because it was never mapped. What
happens with the code as proposed is that put_page() triggers page-free
semantics on vmf_insert_XXX() failures, right?
There is no need to invoke the page-free / final-put path on
vmf_insert_XXX() error because the storage-block / pfn never actually
transitioned into a page / folio.
> > [..]
> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >> lock_page(page);
> >> }
> >> EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> -
> >> -#ifdef CONFIG_FS_DAX
> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> -{
> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> - return false;
> >> -
> >> - /*
> >> - * fsdax page refcounts are 1-based, rather than 0-based: if
> >> - * refcount is 1, then the page is free and the refcount is
> >> - * stable because nobody holds a reference on the page.
> >> - */
> >> - if (folio_ref_sub_return(folio, refs) == 1)
> >> - wake_up_var(&folio->_refcount);
> >> - return true;
> >
> > It follows from the refcount discussion above that I think there is an
> > argument to still keep this wakeup based on the 2->1 transition.
> > pagecache pages are refcount==1 when they are dma-idle but still
> > allocated. To keep the same semantics for dax a dax_folio would have an
> > elevated refcount whenever it is referenced by mapping entry.
>
> I'm not sold on keeping it as it doesn't seem to offer any benefit
> IMHO. I know both Jason and Christoph were keen to see it go so it'd be
> good to get their feedback too. Also one of the primary goals of this
> series was to refcount the page normally so we could remove the whole
> "page is free with a refcount of 1" semantics.
The page is still free at refcount 0, no argument there. But, by
introducing a new "page refcount is elevated while mapped" (as it
should), it follows that "page is DMA idle at refcount == 1", right?
Otherwise, the current assumption that filesystems can have
dax_layout_busy_page_range() poll on the state of the pfn in the mapping
is broken because page refcount == 0 also means no page to mapping
association.
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-24 23:52 ` Dan Williams
@ 2024-10-25 2:46 ` Alistair Popple
2024-10-25 4:35 ` Dan Williams
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-10-25 2:46 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
> [..]
>> >
>> > Was there a discussion I missed about why the conversion to typical
>> > folios allows the page->share accounting to be dropped.
>>
>> The problem with keeping it is we now treat DAX pages as "normal"
>> pages according to vm_normal_page(). As such we use the normal paths
>> for unmapping pages.
>>
>> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
>> aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
>> etc. to return true leading to all sorts of issues in at least the
>> unmap paths.
>
> Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
> PAGE_MAPPING_ANON.
>
>> There hasn't been a previous discussion on this, but given this is
>> only used to print warnings it seemed easier to get rid of it. I
>> probably should have called that out more clearly in the commit
>> message though.
>>
>> > I assume this is because the page->mapping validation was dropped, which
>> > I think might be useful to keep at least for one development cycle to
>> > make sure this conversion is not triggering any of the old warnings.
>> >
>> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
>>
>> Yes, we should also clean up the ->share field, unless you have an
>> alternate suggestion to solve the above issue.
>
> kmalloc minimum alignment is 8, so there is room to do this?
Oh right, given the aliasing I had assumed there wasn't room.
> ---
> diff --git a/fs/dax.c b/fs/dax.c
> index c62acd2812f8..a70f081c32cb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
>
> static inline bool dax_page_is_shared(struct page *page)
> {
> - return page->mapping == PAGE_MAPPING_DAX_SHARED;
> + return folio_test_dax_shared(page_folio(page));
> }
>
> /*
> @@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
> */
> static inline void dax_page_share_get(struct page *page)
> {
> - if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
> + if (!dax_page_is_shared(page)) {
> /*
> * Reset the index if the page was already mapped
> * regularly before.
> */
> if (page->mapping)
> page->share = 1;
> - page->mapping = PAGE_MAPPING_DAX_SHARED;
> + page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
> }
> page->share++;
> }
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1b3a76710487..21b355999ce0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
> #define PAGE_MAPPING_ANON 0x1
> #define PAGE_MAPPING_MOVABLE 0x2
> #define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
> -#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
> +/* to be removed once typical page refcounting for dax proves stable */
> +#define PAGE_MAPPING_DAX_SHARED 0x4
> +#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
>
> /*
> * Different with flags above, this flag is used only for fsdax mode. It
> * indicates that this page->mapping is now under reflink case.
> */
> -#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
>
> static __always_inline bool folio_mapping_flags(const struct folio *folio)
> {
> @@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
> return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
> }
>
> +static __always_inline bool folio_test_dax_shared(const struct folio *folio)
> +{
> + return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
> +}
> +
> static __always_inline bool PageAnon(const struct page *page)
> {
> return folio_test_anon(page_folio(page));
> ---
>
> ...and keep the validation around at least for one post conversion
> development cycle?
Looks reasonable, will add that back for at least a development
cycle. In reality it will probably stay forever and I will add a comment
to the PAGE_MAPPING_DAX_SHARED definition saying it can be easily
removed if more flags are needed.
>> > It does have implications for the dax dma-idle tracking thought, see
>> > below.
>> >
>> >> {
>> >> - unsigned long pfn;
>> >> + unsigned long order = dax_entry_order(entry);
>> >> + struct folio *folio = dax_to_folio(entry);
>> >>
>> >> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> >> + if (!dax_entry_size(entry))
>> >> return;
>> >>
>> >> - for_each_mapped_pfn(entry, pfn) {
>> >> - struct page *page = pfn_to_page(pfn);
>> >> -
>> >> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>> >> - if (dax_page_is_shared(page)) {
>> >> - /* keep the shared flag if this page is still shared */
>> >> - if (dax_page_share_put(page) > 0)
>> >> - continue;
>> >> - } else
>> >> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>> >> - page->mapping = NULL;
>> >> - page->index = 0;
>> >> - }
>> >> + /*
>> >> + * We don't hold a reference for the DAX pagecache entry for the
>> >> + * page. But we need to initialise the folio so we can hand it
>> >> + * out. Nothing else should have a reference either.
>> >> + */
>> >> + WARN_ON_ONCE(folio_ref_count(folio));
>> >
>> > Per above I would feel more comfortable if we kept the paranoia around
>> > to ensure that all the pages in this folio have dropped all references
>> > and cleared ->mapping and ->index.
>> >
>> > That paranoia can be placed behind a CONFIG_DEBUG_VM check, and we can
>> > delete in a follow-on development cycle, but in the meantime it helps to
>> > prove the correctness of the conversion.
>>
>> I'm ok with paranoia, but as noted above the issue is that at a minimum
>> page->mapping (and probably index) now needs to be valid for any code
>> that might walk the page tables.
>
> A quick look seems to say the confusion is limited to aliasing
> PAGE_MAPPING_ANON.
Correct. Looks like we can solve that though.
>> > [..]
>> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>> >> struct inode *inode = iter->inode;
>> >> unsigned long vaddr = vmf->address;
>> >> pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>> >> + struct page *page = pfn_t_to_page(pfn);
>> >> vm_fault_t ret;
>> >>
>> >> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>> >>
>> >> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>> >> + page_ref_inc(page);
>> >> + ret = dax_insert_pfn(vmf, pfn, false);
>> >> + put_page(page);
>> >
>> > Per above I think it is problematic to have pages live in the system
>> > without a refcount.
>>
>> I'm a bit confused by this - the pages have a reference taken on them
>> when they are mapped. They only live in the system without a refcount
>> when the mm considers them free (except for the bit between getting
>> created in dax_associate_entry() and actually getting mapped but as
>> noted I will fix that).
>>
>> > One scenario where this might be needed is invalidate_inode_pages() vs
>> > DMA. The invaldation should pause and wait for DMA pins to be dropped
>> > before the mapping xarray is cleaned up and the dax folio is marked
>> > free.
>>
>> I'm not really following this scenario, or at least how it relates to
>> the comment above. If the page is pinned for DMA it will have taken a
>> refcount on it and so the page won't be considered free/idle per
>> dax_wait_page_idle() or any of the other mm code.
>
> [ tl;dr: I think we're ok, analysis below, but I did talk myself into
> the proposed dax_busy_page() changes indeed being broken and needing to
> remain checking for refcount > 1, not > 0 ]
>
> It's not the mm code I am worried about. It's the filesystem block
> allocator staying in-sync with the allocation state of the page.
>
> fs/dax.c is charged with converting idle storage blocks to pfns to
> mapped folios. Once they are mapped, DMA can pin the folio, but nothing
> in fs/dax.c pins the mapping. In the pagecache case the page reference
> is sufficient to keep the DMA-busy page from being reused. In the dax
> case something needs to arrange for DMA to be idle before
> dax_delete_mapping_entry().
Ok. How does that work today? My current mental model is that something
has to call dax_layout_busy_page() whilst holding the correct locks to
prevent a new mapping being established prior to calling
dax_delete_mapping_entry(). Is that correct?
> However, looking at XFS it indeed makes that guarantee. First it does
> xfs_break_dax_layouts() then it does truncate_inode_pages() =>
> dax_delete_mapping_entry().
>
> It follows that the DMA-idle condition still needs to look for the
> case where the refcount is > 1 rather than 0 since refcount == 1 is the
> page-mapped-but-DMA-idle condition.
Sorry, but I'm still not following this line of reasoning. If the
refcount == 1 the page is either mapped xor DMA-busy. So a refcount >= 1
is enough to conclude that the page cannot be reused because it is
either being accessed from userspace via a CPU mapping or from some
device DMA or some other in kernel user.
The current proposal is that dax_busy_page() returns true if refcount >=
1, and dax_wait_page_idle() will wait until the refcount ==
0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
but obviously can't force other pinners to release their reference hence
the need to wait. Callers should already be holding locks to ensure new
mappings can't be established and hence can't become DMA-busy after the
unmap.
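As a sketch of the semantics I'm proposing (not the literal code in this
series; dax_folio_idle() and dax_wait_folio_idle() are just assumed helper
names):

        /* Proposed model: the page is free/idle at refcount == 0. */
        static bool dax_folio_idle(struct folio *folio)
        {
                return folio_ref_count(folio) == 0;
        }

        static void dax_wait_folio_idle(struct folio *folio)
        {
                /*
                 * Pairs with a wake_up_var(&folio->_refcount) issued when
                 * the final reference is dropped in the free path.
                 */
                wait_var_event(&folio->_refcount, dax_folio_idle(folio));
        }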
> [..]
>> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
>> >> bool write = iter->flags & IOMAP_WRITE;
>> >> unsigned long entry_flags = pmd ? DAX_PMD : 0;
>> >> - int err = 0;
>> >> + int ret, err = 0;
>> >> pfn_t pfn;
>> >> void *kaddr;
>> >> + struct page *page;
>> >>
>> >> if (!pmd && vmf->cow_page)
>> >> return dax_fault_cow_page(vmf, iter);
>> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >> if (dax_fault_is_synchronous(iter, vmf->vma))
>> >> return dax_fault_synchronous_pfnp(pfnp, pfn);
>> >>
>> >> - /* insert PMD pfn */
>> >> + page = pfn_t_to_page(pfn);
>> >
>> > I think this is clearer if dax_insert_entry() returns folios with an
>> > elevated reference count that is dropped when the folio is invalidated
>> > out of the mapping.
>>
>> I presume this comment is for the next line:
>>
>> + page_ref_inc(page);
>>
>> I can move that into dax_insert_entry(), but we would still need to
>> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
>> transition when the page is unmapped and therefore
>> freed. Alternatively we can make it so vmf_insert_*() don't take
>> references on the page, and instead ownership of the reference is
>> transferred to the mapping. Personally I preferred having those
>> functions take their own reference but let me know what you think.
>
> Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
> then the page was never allocated because it was never mapped. What
> happens with the code as proposed is that put_page() triggers page-free
> semantics on vmf_insert_XXX() failures, right?
Right. And actually that means I can't move the page_ref_inc(page) into
what will be called dax_create_folio(), because an entry may have been
created previously that had a failed vmf_insert_XXX() which will
therefore have a zero refcount folio associated with it.
But I think that model is wrong. I think the model needs to be the page
gets allocated when the entry is first created (ie. when
dax_create_folio() is called). A subsequent free (ether due to
vmf_insert_XXX() failing or the page being unmapped or becoming
DMA-idle) should then delete the entry.
I think that makes the semantics around dax_busy_page() nicer as well -
no need for the truncate to have a special path to call
dax_delete_mapping_entry().
> There is no need to invoke the page-free / final-put path on
> vmf_insert_XXX() error because the storage-block / pfn never actually
> transitioned into a page / folio.
It's not mapping a page/folio that transitions a pfn into a page/folio;
it is the allocation of the folio that happens in dax_create_folio()
(aka. dax_associate_new_entry()). So we need to delete the entry (as
noted above I don't do that currently) if the insertion fails.
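Roughly what I have in mind for the failure path is something like the
following sketch (dax_fault_insert() is a made-up name here and the real
code will differ):

        static vm_fault_t dax_fault_insert(struct vm_fault *vmf, pfn_t pfn,
                                           struct address_space *mapping)
        {
                vm_fault_t ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);

                /*
                 * If inserting the mapping fails, undo the entry association
                 * so no zero-refcount folio stays tracked by the dax mapping.
                 */
                if (ret & VM_FAULT_ERROR)
                        dax_delete_mapping_entry(mapping, vmf->pgoff);
                return ret;
        }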
>> > [..]
>> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>> >> lock_page(page);
>> >> }
>> >> EXPORT_SYMBOL_GPL(zone_device_page_init);
>> >> -
>> >> -#ifdef CONFIG_FS_DAX
>> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> >> -{
>> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> >> - return false;
>> >> -
>> >> - /*
>> >> - * fsdax page refcounts are 1-based, rather than 0-based: if
>> >> - * refcount is 1, then the page is free and the refcount is
>> >> - * stable because nobody holds a reference on the page.
>> >> - */
>> >> - if (folio_ref_sub_return(folio, refs) == 1)
>> >> - wake_up_var(&folio->_refcount);
>> >> - return true;
>> >
>> > It follows from the refcount discussion above that I think there is an
>> > argument to still keep this wakeup based on the 2->1 transition.
>> > pagecache pages are refcount==1 when they are dma-idle but still
>> > allocated. To keep the same semantics for dax a dax_folio would have an
>> > elevated refcount whenever it is referenced by mapping entry.
>>
>> I'm not sold on keeping it as it doesn't seem to offer any benefit
>> IMHO. I know both Jason and Christoph were keen to see it go so it'd be
>> good to get their feedback too. Also one of the primary goals of this
>> series was to refcount the page normally so we could remove the whole
>> "page is free with a refcount of 1" semantics.
>
> The page is still free at refcount 0, no argument there. But, by
> introducing a new "page refcount is elevated while mapped" (as it
> should), it follows that "page is DMA idle at refcount == 1", right?
No. The page is either mapped xor DMA-busy - ie. not free. If we want
(need?) to tell the difference we can use folio_maybe_dma_pinned(),
assuming the driver doing DMA has called pin_user_pages() as it should.
That said I'm not sure why we care about the distinction between
DMA-idle and mapped? If the page is not free from the mm perspective the
block can't be reallocated by the filesystem.
> Otherwise, the current assumption that filesystems can have
> dax_layout_busy_page_range() poll on the state of the pfn in the mapping
> is broken because page refcount == 0 also means no page to mapping
> association.
And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
is using the page so the page isn't busy and is free to be reallocated
right?
- Alistair
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-25 2:46 ` Alistair Popple
@ 2024-10-25 4:35 ` Dan Williams
2024-10-28 4:24 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-10-25 4:35 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Alistair Popple wrote:
[..]
> >> I'm not really following this scenario, or at least how it relates to
> >> the comment above. If the page is pinned for DMA it will have taken a
> >> refcount on it and so the page won't be considered free/idle per
> >> dax_wait_page_idle() or any of the other mm code.
> >
> > [ tl;dr: I think we're ok, analysis below, but I did talk myself into
> > the proposed dax_busy_page() changes indeed being broken and needing to
> > remain checking for refcount > 1, not > 0 ]
> >
> > It's not the mm code I am worried about. It's the filesystem block
> > allocator staying in-sync with the allocation state of the page.
> >
> > fs/dax.c is charged with converting idle storage blocks to pfns to
> > mapped folios. Once they are mapped, DMA can pin the folio, but nothing
> > in fs/dax.c pins the mapping. In the pagecache case the page reference
> > is sufficient to keep the DMA-busy page from being reused. In the dax
> > case something needs to arrange for DMA to be idle before
> > dax_delete_mapping_entry().
>
> Ok. How does that work today? My current mental model is that something
> has to call dax_layout_busy_page() whilst holding the correct locks to
> prevent a new mapping being established prior to calling
> dax_delete_mapping_entry(). Is that correct?
Correct. dax_delete_mapping_entry() is invoked by the filesystem with
inode locks held. See xfs_file_fallocate() where it takes the lock,
calls xfs_break_layouts() and if that succeeds performs
xfs_free_file_space() with the lock held.
xfs_free_file_space() triggers dax_delete_mapping_entry() with knowledge
that the mapping cannot be re-established until the lock is dropped.
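In heavily simplified pseudo-code of that path (see xfs_file_fallocate()
for the real thing, this just shows the ordering; variables are elided):

        uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;

        xfs_ilock(ip, iolock);
        error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
        if (!error)
                /*
                 * Reaches dax_delete_mapping_entry() via the truncate
                 * path while the locks are still held.
                 */
                error = xfs_free_file_space(ip, offset, len);
        xfs_iunlock(ip, iolock);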
> > However, looking at XFS it indeed makes that guarantee. First it does
> > xfs_break_dax_layouts() then it does truncate_inode_pages() =>
> > dax_delete_mapping_entry().
> >
> > It follows that the DMA-idle condition still needs to look for the
> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
> > page-mapped-but-DMA-idle condition.
>
> Sorry, but I'm still not following this line of reasoning. If the
> refcount == 1 the page is either mapped xor DMA-busy.
No, my expectation is the refcount is 1 while the page has a mapping
entry, analogous to an idle / allocated page cache page, and the
refcount is 2 or more for DMA, get_user_pages(), or any page walker that
takes a transient page pin.
> is enough to conclude that the page cannot be reused because it is
> either being accessed from userspace via a CPU mapping or from some
> device DMA or some other in kernel user.
Userspace access is not a problem, that access can always be safely
revoked by unmapping the page, and that's what dax_layout_busy_page()
does to force a fault and re-take the inode + mmap locks so that the
truncate path knows it has temporary exclusive access to the page, pfn,
and storage-block association.
> The current proposal is that dax_busy_page() returns true if refcount >=
> 1, and dax_wait_page_idle() will wait until the refcount ==
> 0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
> but obviously can't force other pinners to release their reference hence
> the need to wait. Callers should already be holding locks to ensure new
> mappings can't be established and hence can't become DMA-busy after the
> unmap.
Am I missing a page_ref_dec() somewhere? Are you saying that
dax_layout_busy_page() will find entries with ->mapping non-NULL and
refcount == 0?
[..]
> >> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> >> if (dax_fault_is_synchronous(iter, vmf->vma))
> >> >> return dax_fault_synchronous_pfnp(pfnp, pfn);
> >> >>
> >> >> - /* insert PMD pfn */
> >> >> + page = pfn_t_to_page(pfn);
> >> >
> >> > I think this is clearer if dax_insert_entry() returns folios with an
> >> > elevated reference count that is dropped when the folio is invalidated
> >> > out of the mapping.
> >>
> >> I presume this comment is for the next line:
> >>
> >> + page_ref_inc(page);
> >>
> >> I can move that into dax_insert_entry(), but we would still need to
> >> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> >> transition when the page is unmapped and therefore
> >> freed. Alternatively we can make it so vmf_insert_*() don't take
> >> references on the page, and instead ownership of the reference is
> >> transferred to the mapping. Personally I preferred having those
> >> functions take their own reference but let me know what you think.
> >
> > Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
> > then the page was never allocated because it was never mapped. What
> > happens with the code as proposed is that put_page() triggers page-free
> > semantics on vmf_insert_XXX() failures, right?
>
> Right. And actually that means I can't move the page_ref_inc(page) into
> what will be called dax_create_folio(), because an entry may have been
> created previously that had a failed vmf_insert_XXX() which will
> therefore have a zero refcount folio associated with it.
I would expect a full cleanup on vmf_insert_XXX() failure, not
leaving a zero-referenced entry.
> But I think that model is wrong. I think the model needs to be the page
> gets allocated when the entry is first created (ie. when
> dax_create_folio() is called). A subsequent free (either due to
> vmf_insert_XXX() failing or the page being unmapped or becoming
> DMA-idle) should then delete the entry.
>
> I think that makes the semantics around dax_busy_page() nicer as well -
> no need for the truncate to have a special path to call
> dax_delete_mapping_entry().
I agree it would be lovely if the final put could clean up the mapping
entry and not depend on truncate_inode_pages_range() to do that.
...but I do not immediately see how to get there when block, pfn, and
page are so tightly coupled with dax. That's a whole new project to
introduce that paradigm, no? The page cache case gets away with
it by safely disconnecting the pfn+page from the block and then letting
DMA final put_page() take its time.
> > There is no need to invoke the page-free / final-put path on
> > vmf_insert_XXX() error because the storage-block / pfn never actually
> > transitioned into a page / folio.
>
> It's not mapping a page/folio that transitions a pfn into a page/folio
> it is the allocation of the folio that happens in dax_create_folio()
> (aka. dax_associate_new_entry()). So we need to delete the entry (as
> noted above I don't do that currently) if the insertion fails.
Yeah, deletion on insert failure makes sense.
[..]
> >> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >> >> lock_page(page);
> >> >> }
> >> >> EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> >> -
> >> >> -#ifdef CONFIG_FS_DAX
> >> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> >> -{
> >> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> >> - return false;
> >> >> -
> >> >> - /*
> >> >> - * fsdax page refcounts are 1-based, rather than 0-based: if
> >> >> - * refcount is 1, then the page is free and the refcount is
> >> >> - * stable because nobody holds a reference on the page.
> >> >> - */
> >> >> - if (folio_ref_sub_return(folio, refs) == 1)
> >> >> - wake_up_var(&folio->_refcount);
> >> >> - return true;
> >> >
> >> > It follows from the refcount discussion above that I think there is an
> >> > argument to still keep this wakeup based on the 2->1 transition.
> >> > pagecache pages are refcount==1 when they are dma-idle but still
> >> > allocated. To keep the same semantics for dax a dax_folio would have an
> >> > elevated refcount whenever it is referenced by mapping entry.
> >>
> >> I'm not sold on keeping it as it doesn't seem to offer any benefit
> >> IMHO. I know both Jason and Christoph were keen to see it go so it'd be
> >> good to get their feedback too. Also one of the primary goals of this
> >> series was to refcount the page normally so we could remove the whole
> >> "page is free with a refcount of 1" semantics.
> >
> > The page is still free at refcount 0, no argument there. But, by
> > introducing a new "page refcount is elevated while mapped" (as it
> > should), it follows that "page is DMA idle at refcount == 1", right?
>
> No. The page is either mapped xor DMA-busy - ie. not free. If we want
> (need?) to tell the difference we can use folio_maybe_dma_pinned(),
> assuming the driver doing DMA has called pin_user_pages() as it should.
>
> That said I'm not sure why we care about the distinction between
> DMA-idle and mapped? If the page is not free from the mm perspective the
> block can't be reallocated by the filesystem.
"can't be reallocated", what enforces that in your view? I am hoping it
is something I am overlooking.
In my view the filesystem has no idea of this page-to-block
relationship. All it knows is that when it wants to destroy the
page-to-block association, dax notices and says "uh, oh, this is my last
chance to make sure the block can go back into the fs allocation pool so
I need to wait for the mm to say that the page is exclusive to me (dax
core) before dax_delete_mapping_entry() destroys the page-to-block
association and the fs reclaims the allocation".
> > Otherwise, the current assumption that filesystems can have
> > dax_layout_busy_page_range() poll on the state of the pfn in the mapping
> > is broken because page refcount == 0 also means no page to mapping
> > association.
>
> And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
> is using the page so the page isn't busy and is free to be reallocated
> right?
Let's take the 'map => start dma => truncate => end dma' scenario.
At the 'end dma' step, how does the filesystem learn that the block that
it truncated, potentially hours ago, is now a free block? The filesystem
thought it reclaimed the block when truncate completed. I.e. dax says,
thou shalt 'end dma' => 'truncate' in all cases.
Note "dma" can be replaced with "any non dax core page_ref".
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-25 4:35 ` Dan Williams
@ 2024-10-28 4:24 ` Alistair Popple
2024-10-29 2:03 ` Dan Williams
0 siblings, 1 reply; 53+ messages in thread
From: Alistair Popple @ 2024-10-28 4:24 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
> [..]
>> >> I'm not really following this scenario, or at least how it relates to
>> >> the comment above. If the page is pinned for DMA it will have taken a
>> >> refcount on it and so the page won't be considered free/idle per
>> >> dax_wait_page_idle() or any of the other mm code.
>> >
>> > [ tl;dr: I think we're ok, analysis below, but I did talk myself into
>> > the proposed dax_busy_page() changes indeed being broken and needing to
>> > remain checking for refcount > 1, not > 0 ]
>> >
>> > It's not the mm code I am worried about. It's the filesystem block
>> > allocator staying in-sync with the allocation state of the page.
>> >
>> > fs/dax.c is charged with converting idle storage blocks to pfns to
>> > mapped folios. Once they are mapped, DMA can pin the folio, but nothing
>> > in fs/dax.c pins the mapping. In the pagecache case the page reference
>> > is sufficient to keep the DMA-busy page from being reused. In the dax
>> > case something needs to arrange for DMA to be idle before
>> > dax_delete_mapping_entry().
>>
>> Ok. How does that work today? My current mental model is that something
>> has to call dax_layout_busy_page() whilst holding the correct locks to
>> prevent a new mapping being established prior to calling
>> dax_delete_mapping_entry(). Is that correct?
>
> Correct. dax_delete_mapping_entry() is invoked by the filesystem with
> inode locks held. See xfs_file_fallocate() where it takes the lock,
> calls xfs_break_layouts() and if that succeeds performs
> xfs_file_free_space() with the lock held.
Thanks for confirming. I've broken it enough times during development of
this that I thought I was correct but the confirmation is nice.
> xfs_free_file_space() triggers dax_delete_mapping_entry() with knowledge
> that the mapping cannot be re-established until the lock is dropped.
>
>> > However, looking at XFS it indeed makes that guarantee. First it does
>> > xfs_break_dax_layouts() then it does truncate_inode_pages() =>
>> > dax_delete_mapping_entry().
>> >
>> > It follows that the DMA-idle condition still needs to look for the
>> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> > page-mapped-but-DMA-idle condition.
>>
>> Sorry, but I'm still not following this line of reasoning. If the
>> refcount == 1 the page is either mapped xor DMA-busy.
>
> No, my expectation is the refcount is 1 while the page has a mapping
> entry, analogous to an idle / allocated page cache page, and the
> refcount is 2 or more for DMA, get_user_pages(), or any page walker that
> takes a transient page pin.
Argh, I think we may have been talking past each other. By "mapped" I
was thinking of folio_mapped() == true. Ie. page->mapcount >= 1 due to
having page table entries. I suspect you're talking about DAX page-cache
entries here?
The way the series currently works the DAX page-cache does not hold a
reference on the page. Whether or not that is a good idea (or even
valid/functionally correct) is a reasonable question and where I think
this discussion is heading (see below).
>> is enough to conclude that the page cannot be reused because it is
>> either being accessed from userspace via a CPU mapping or from some
>> device DMA or some other in kernel user.
>
> Userspace access is not a problem, that access can always be safely
> revoked by unmapping the page, and that's what dax_layout_busy_page()
> does to force a fault and re-take the inode + mmap locks so that the
> truncate path knows it has temporary exclusive access to the page, pfn,
> and storage-block association.
Right.
>> The current proposal is that dax_busy_page() returns true if refcount >=
>> 1, and dax_wait_page_idle() will wait until the refcount ==
>> 0. dax_busy_page() will try and force the refcount == 0 by unmapping it,
>> but obviously can't force other pinners to release their reference hence
>> the need to wait. Callers should already be holding locks to ensure new
>> mappings can't be established and hence can't become DMA-busy after the
>> unmap.
>
> Am I missing a page_ref_dec() somewhere? Are you saying that
> dax_layout_busy_page() will find entries with ->mapping non-NULL and
> refcount == 0?
No, ->mapping gets set to NULL when the page is freed in
free_zone_device_folio() but I think the mapping->i_pages XArray will
still contain references to the page with a zero
refcount. I.e. truncate_inode_pages_range() will still find them and call
dax_delete_mapping_entry() on them.
> [..]
>> >> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
>> >> >> if (dax_fault_is_synchronous(iter, vmf->vma))
>> >> >> return dax_fault_synchronous_pfnp(pfnp, pfn);
>> >> >>
>> >> >> - /* insert PMD pfn */
>> >> >> + page = pfn_t_to_page(pfn);
>> >> >
>> >> > I think this is clearer if dax_insert_entry() returns folios with an
>> >> > elevated reference count that is dropped when the folio is invalidated
>> >> > out of the mapping.
>> >>
>> >> I presume this comment is for the next line:
>> >>
>> >> + page_ref_inc(page);
>> >>
>> >> I can move that into dax_insert_entry(), but we would still need to
>> >> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
>> >> transition when the page is unmapped and therefore
>> >> freed. Alternatively we can make it so vmf_insert_*() don't take
>> >> references on the page, and instead ownership of the reference is
>> >> transferred to the mapping. Personally I preferred having those
>> >> functions take their own reference but let me know what you think.
>> >
>> > Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
>> > then the page was never allocated because it was never mapped. What
>> > happens with the code as proposed is that put_page() triggers page-free
>> > semantics on vmf_insert_XXX() failures, right?
>>
>> Right. And actually that means I can't move the page_ref_inc(page) into
>> what will be called dax_create_folio(), because an entry may have been
>> created previously that had a failed vmf_insert_XXX() which will
>> therefore have a zero refcount folio associated with it.
>
> I would expect a full cleanup on vmf_insert_XXX() failure, not
> leaving a zero-referenced entry.
>
>> But I think that model is wrong. I think the model needs to be the page
>> gets allocated when the entry is first created (ie. when
>> dax_create_folio() is called). A subsequent free (either due to
>> vmf_insert_XXX() failing or the page being unmapped or becoming
>> DMA-idle) should then delete the entry.
>>
>> I think that makes the semantics around dax_busy_page() nicer as well -
>> no need for the truncate to have a special path to call
>> dax_delete_mapping_entry().
>
> I agree it would be lovely if the final put could clean up the mapping
> entry and not depend on truncate_inode_pages_range() to do that.
I think I'm understanding you better now, thanks for your patience. I
think the problem here is most filesystems tend to basically do the
following:
1. Call some fs-specific version of break_dax_layouts() which:
a) unmaps all the pages from the page-tables via
dax_layout_busy_page()
b) waits for DMA[1] to complete by looking at page refcounts
2. Removes DAX page-cache entries by calling
truncate_inode_pages_range() or some equivalent.
In this series this works because the DAX page-cache doesn't hold a page
reference nor does it call dax_delete_mapping_entry() on free - it
relies on the truncate code to do that. So I think I understand your
original comment now:
>> > It follows that the DMA-idle condition still needs to look for the
>> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> > page-mapped-but-DMA-idle condition.
Because if the DAX page-cache holds a reference the refcount won't go to
zero until dax_delete_mapping_entry() is called. However this interface
seems really strange to me - filesystems call
dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
and DMA[1] have finished with the page, but not the DAX code which still
has references in its page-cache.
Is there some reason for this? In other words, why can't the interface to
the filesystem be something like calling dax_break_layouts() which
ensures everything, including core FS DAX code, has finished with the
page(s) in question? I don't see why that wouldn't work for at least
EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
If we could do that dax_break_layouts() would essentially:
1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
down.
2. delete the DAX page-cache entry to remove its refcount.
3. wait for DMA to complete by waiting for the refcount to hit zero.
The problem with the filesystem truncate code at the moment is steps 2
and 3 are reversed so step 3 has to wait for a refcount of 1 as you
pointed out previously. But does that matter? Are there ever cases when
a filesystem needs to wait for the page to be idle but maintain its DAX
page-cache entry?
I may be missing something though, because I was having trouble getting
this scheme to actually work today.
[1] - Where "DMA" means any unknown page reference
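As a very rough per-entry sketch of the 1/2/3 ordering above (helper and
wait details assumed, locking and multi-order entries glossed over):

        static void dax_break_layouts_entry(struct address_space *mapping,
                                            struct folio *folio, pgoff_t index)
        {
                /* 1. remove userspace page table entries; each cleared
                 *    PTE/PMD drops a folio reference */
                unmap_mapping_pages(mapping, index, folio_nr_pages(folio), false);

                /* 2. remove the DAX page-cache entry and the reference it
                 *    would hold under this scheme */
                dax_delete_mapping_entry(mapping, index);

                /* 3. wait for DMA (any remaining references); assumes the
                 *    free path does wake_up_var(&folio->_refcount) */
                wait_var_event(&folio->_refcount, folio_ref_count(folio) == 0);
        }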
> ...but I do not immediately see how to get there when block, pfn, and
> page are so tightly coupled with dax. That's a whole new project to
> introduce that paradigm, no? The page cache case gets away with
> it by safely disconnecting the pfn+page from the block and then letting
> DMA final put_page() take its time.
Oh of course, thanks for pointing out the difference there.
>> > There is no need to invoke the page-free / final-put path on
>> > vmf_insert_XXX() error because the storage-block / pfn never actually
>> > transitioned into a page / folio.
>>
>> It's not mapping a page/folio that transitions a pfn into a page/folio
>> it is the allocation of the folio that happens in dax_create_folio()
>> (aka. dax_associate_new_entry()). So we need to delete the entry (as
>> noted above I don't do that currently) if the insertion fails.
>
> Yeah, deletion on insert failure makes sense.
>
> [..]
>> >> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
>> >> >> lock_page(page);
>> >> >> }
>> >> >> EXPORT_SYMBOL_GPL(zone_device_page_init);
>> >> >> -
>> >> >> -#ifdef CONFIG_FS_DAX
>> >> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
>> >> >> -{
>> >> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
>> >> >> - return false;
>> >> >> -
>> >> >> - /*
>> >> >> - * fsdax page refcounts are 1-based, rather than 0-based: if
>> >> >> - * refcount is 1, then the page is free and the refcount is
>> >> >> - * stable because nobody holds a reference on the page.
>> >> >> - */
>> >> >> - if (folio_ref_sub_return(folio, refs) == 1)
>> >> >> - wake_up_var(&folio->_refcount);
>> >> >> - return true;
>> >> >
>> >> > It follows from the refcount discussion above that I think there is an
>> >> > argument to still keep this wakeup based on the 2->1 transition.
>> >> > pagecache pages are refcount==1 when they are dma-idle but still
>> >> > allocated. To keep the same semantics for dax a dax_folio would have an
>> >> > elevated refcount whenever it is referenced by mapping entry.
>> >>
>> >> I'm not sold on keeping it as it doesn't seem to offer any benefit
>> >> IMHO. I know both Jason and Christoph were keen to see it go so it'd be
>> >> good to get their feedback too. Also one of the primary goals of this
>> >> series was to refcount the page normally so we could remove the whole
>> >> "page is free with a refcount of 1" semantics.
>> >
>> > The page is still free at refcount 0, no argument there. But, by
>> > introducing a new "page refcount is elevated while mapped" (as it
>> > should), it follows that "page is DMA idle at refcount == 1", right?
>>
>> No. The page is either mapped xor DMA-busy - ie. not free. If we want
>> (need?) to tell the difference we can use folio_maybe_dma_pinned(),
>> assuming the driver doing DMA has called pin_user_pages() as it should.
>>
>> That said I'm not sure why we care about the distinction between
>> DMA-idle and mapped? If the page is not free from the mm perspective the
>> block can't be reallocated by the filesystem.
>
> "can't be reallocated", what enforces that in your view? I am hoping it
> is something I am overlooking.
>
> In my view the filesystem has no idea of this page-to-block
> relationship. All it knows is that when it wants to destroy the
> page-to-block association, dax notices and says "uh, oh, this is my last
> chance to make sure the block can go back into the fs allocation pool so
> I need to wait for the mm to say that the page is exclusive to me (dax
> core) before dax_delete_mapping_entry() destroys the page-to-block
> association and the fs reclaims the allocation".
>
>> > Otherwise, the current assumption that filesystems can have
>> > dax_layout_busy_page_range() poll on the state of the pfn in the mapping
>> > is broken because page refcount == 0 also means no page to mapping
>> > association.
>>
>> And also means nothing from the mm (userspace mapping, DMA-busy, etc.)
>> is using the page so the page isn't busy and is free to be reallocated
>> right?
>
> Let's take the 'map => start dma => truncate => end dma' scenario.
>
> At the 'end dma' step, how does the filesystem learn that the block that
> it truncated, potentially hours ago, is now a free block? The filesystem
> thought it reclaimed the block when truncate completed. I.e. dax says,
> thou shalt 'end dma' => 'truncate' in all cases.
Agreed, but I don't think I was suggesting we change that. I agree DAX
has to ensure 'end dma' happens before truncate completes.
> Note "dma" can be replaced with "any non dax core page_ref".
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-28 4:24 ` Alistair Popple
@ 2024-10-29 2:03 ` Dan Williams
2024-10-30 5:57 ` Alistair Popple
0 siblings, 1 reply; 53+ messages in thread
From: Dan Williams @ 2024-10-29 2:03 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Alistair Popple wrote:
[..]
> >> > It follows that the DMA-idle condition still needs to look for the
> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
> >> > page-mapped-but-DMA-idle condition.
>
> Because if the DAX page-cache holds a reference the refcount won't go to
> zero until dax_delete_mapping_entry() is called. However this interface
> seems really strange to me - filesystems call
> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
> and DMA[1] have finished with the page, but not the DAX code which still
> has references in its page-cache.
First, I appreciate the clarification that I was mixing up "mapped"
(elevated map count) with, for lack of a better term, "tracked" (mapping
entry valid).
So, to repeat back to you what I understand now, the proposal is to
attempt to allow _count==0 as the DMA idle condition, but still have the
final return of the block to the allocator (fs allocator) occur after
dax_delete_mapping_entry().
> Is there some reason for this? In other words, why can't the interface to
> the filesystem be something like calling dax_break_layouts() which
> ensures everything, including core FS DAX code, has finished with the
> page(s) in question? I don't see why that wouldn't work for at least
> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
>
> If we could do that dax_break_layouts() would essentially:
> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
> down.
Am I missing where unmap_mapping_pages() drops the _count? I can see
where it drops _mapcount. I don't think that matters for the proposal,
but that's my last gap in tracking the proposed refcount model.
> 2. delete the DAX page-cache entry to remove its refcount.
> 3. wait for DMA to complete by waiting for the refcount to hit zero.
>
> The problem with the filesystem truncate code at the moment is steps 2
> and 3 are reversed so step 3 has to wait for a refcount of 1 as you
> pointed out previously. But does that matter? Are there ever cases when
> a filesystem needs to wait for the page to be idle but maintain its DAX
> page-cache entry?
No, not that I can think of. The filesystem just cares that the page was
seen as part of the file at some point and that it is holding locks to
keep the block associated with that page allocated to the file until it
can complete its operation.
I think what we are talking about is a pfn-state not a page state. If
the block-pfn-page lifecycle from allocation to free is deconstructed as:
block free
block allocated
pfn untracked
pfn tracked
page free
page busy
page free
pfn untracked
block free
...then I can indeed see cases where there is pfn metadata live even
though the page is free.
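Written out as a purely illustrative enum (not a real kernel type), the
distinct states above would be:

        enum dax_block_state {
                DAX_BLOCK_FREE,
                DAX_BLOCK_ALLOCATED,
                DAX_PFN_UNTRACKED,
                DAX_PFN_TRACKED,
                DAX_PAGE_FREE,
                DAX_PAGE_BUSY,
                /* teardown revisits PAGE_FREE, PFN_UNTRACKED, BLOCK_FREE */
        };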
So I think I was playing victim to the current implementation that
assumes that "pfn tracked" means the page is allocated and that
pfn_to_folio(pfn)->mapping is valid and not NULL.
All this to say I am at least on the same page as you that _count == 0
can be used as the page free state even if the pfn tracking goes through
delayed cleanup.
However, if vmf_insert_XXX is increasing _count then, per my
unmap_mapping_pages() question above, I think dax_wait_page_idle() needs
to call try_to_unmap() to drop that _count, right? Similar observation
for the memory_failure_dev_pagemap() path, I think that path only calls
unmap_mapping_range() not try_to_unmap() and leaves _count elevated.
Lastly walking through the code again I think this fix is valid today:
diff --git a/fs/dax.c b/fs/dax.c
index fcbe62bde685..48f2c85690e1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
pgoff_t end_idx;
XA_STATE(xas, &mapping->i_pages, start_idx);
- if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+ if (!dax_mapping(mapping))
return NULL;
/* If end == LLONG_MAX, all pages from start to till end of file */
...because unmap_mapping_pages() will mark the mapping as unmapped even
though there are "pfn tracked + page busy" entries to clean up.
Appreciate you grappling this with me!
* Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages
2024-10-29 2:03 ` Dan Williams
@ 2024-10-30 5:57 ` Alistair Popple
0 siblings, 0 replies; 53+ messages in thread
From: Alistair Popple @ 2024-10-30 5:57 UTC (permalink / raw)
To: Dan Williams
Cc: linux-mm, vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david
Dan Williams <dan.j.williams@intel.com> writes:
> Alistair Popple wrote:
> [..]
>
>> >> > It follows that the DMA-idle condition still needs to look for the
>> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the
>> >> > page-mapped-but-DMA-idle condition.
>>
>> Because if the DAX page-cache holds a reference the refcount won't go to
>> zero until dax_delete_mapping_entry() is called. However this interface
>> seems really strange to me - filesystems call
>> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space
>> and DMA[1] have finished with the page, but not the DAX code which still
>> has references in its page-cache.
>
> First, I appreciate the clarification that I was mixing up "mapped"
> (elevated map count) with, for lack of a better term, "tracked" (mapping
> entry valid).
>
> So, to repeat back to you what I understand now, the proposal is to
> attempt to allow _count==0 as the DMA idle condition, but still have the
> final return of the block to the allocator (fs allocator) occur after
> dax_delete_mapping_entry().
Right, that is what I would like to achieve if possible. The outstanding
question I think is "should the DAX page-cache have a reference on the
page?". Or to use your terminology below "if a pfn is tracked should
pfn_to_page(pfn)->_refcount == 0 or 1?"
This version implements it as being zero because altering that requires
re-ordering all the existing filesystem and mm users of
dax_layout_busy_page_range() and dax_delete_mapping_entry(). Based on this
discussion though I'm beginning to think it probably should be one, but
I haven't been able to make that work yet.
>> Is there some reason for this? In other words, why can't the interface to
>> the filesystem be something like calling dax_break_layouts() which
>> ensures everything, including core FS DAX code, has finished with the
>> page(s) in question? I don't see why that wouldn't work for at least
>> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply).
>>
>> If we could do that dax_break_layouts() would essentially:
>> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount
>> down.
>
> Am I missing where unmap_mapping_pages() drops the _count? I can see
> where it drops _mapcount. I don't think that matters for the proposal,
> but that's my last gap in tracking the proposed refcount model.
It is suitably obtuse due to MMU_GATHER. unmap_mapping_pages() drops the
folio/page reference after flushing the TLB. Ie:
=> tlb_finish_mmu
=> tlb_flush_mmu
=> __tlb_batch_free_encoded_pages
=> free_pages_and_swap_cache
=> folios_put_refs
>> 2. delete the DAX page-cache entry to remove its refcount.
>> 3. wait for DMA to complete by waiting for the refcount to hit zero.
>>
>> The problem with the filesystem truncate code at the moment is steps 2
>> and 3 are reversed so step 3 has to wait for a refcount of 1 as you
>> pointed out previously. But does that matter? Are there ever cases when
>> a filesystem needs to wait for the page to be idle but maintain its DAX
>> page-cache entry?
>
> No, not that I can think of. The filesystem just cares that the page was
> seen as part of the file at some point and that it is holding locks to
> keep the block associated with that page allocated to the file until it
> can complete its operation.
>
> I think what we are talking about is a pfn-state not a page state. If
> the block-pfn-page lifecycle from allocation to free is deconstructed as:
>
> block free
> block allocated
> pfn untracked
> pfn tracked
> page free
> page busy
> page free
> pfn untracked
> block free
>
> ...then I can indeed see cases where there is pfn metadata live even
> though the page is free.
>
> So I think I was playing victim to the current implementation that
> assumes that "pfn tracked" means the page is allocated and that
> pfn_to_folio(pfn)->mapping is valid and not NULL.
>
> All this to say I am at least on the same page as you that _count == 0
> can be used as the page free state even if the pfn tracking goes through
> delayed cleanup.
Great, and I like this terminology of pfn tracked, etc.
> However, if vmf_insert_XXX is increasing _count then, per my
> unmap_mapping_pages() question above, I think dax_wait_page_idle() needs
> to call try_to_unmap() to drop that _count, right?
At the moment filesystems open-code their own version of
XXXX_break_layouts() which typically calls dax_layout_busy_page()
followed by dax_wait_page_idle(). The former will call
unmap_mapping_range(), which for shared mappings I thought should be
sufficient to find and unmap all page table references (and therefore
folio/page _refcounts) based on the address space / index.
I think try_to_unmap() would only be necessary if we only had the folio
and not the address space / index and therefore needed to find them from
the mm (not fs!) rmap.
> Similar observation for the memory_failure_dev_pagemap() path, I think
> that path only calls unmap_mapping_range() not try_to_unmap() and
> leaves _count elevated.
As noted above unmap_mapping_range() will drop the refcount whenever it
clears a pte/pmd mapping the folio and I think it should find all the
PTEs mapping it.
> Lastly walking through the code again I think this fix is valid today:
>
> diff --git a/fs/dax.c b/fs/dax.c
> index fcbe62bde685..48f2c85690e1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
> pgoff_t end_idx;
> XA_STATE(xas, &mapping->i_pages, start_idx);
>
> - if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> + if (!dax_mapping(mapping))
> return NULL;
>
> /* If end == LLONG_MAX, all pages from start to till end of file */
>
>
> ...because unmap_mapping_pages() will mark the mapping as unmapped even
> though there are "pfn tracked + page busy" entries to clean up.
Yep, I noticed this today when I was trying to figure out why my
re-ordering of the unmap/wait/untrack pfn wasn't working as expected. It
still isn't for some other reason, and I'm still figuring out if the
above is correct/valid, but it is on my list of things to look more
closely at.
> Appreciate you grappling this with me!
Not at all! And thank you as well ... I feel like this has helped me a
lot in getting a slightly better understanding of the problems. Also
unless you react violently to anything I've said here I think I have
enough material to post (and perhaps even explain!) the next version of
this series.
- Alistair
end of thread
Thread overview: 53+ messages
2024-09-10 4:14 [PATCH 00/12] fs/dax: Fix FS DAX page reference counts Alistair Popple
2024-09-10 4:14 ` [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-09-10 4:14 ` [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one Alistair Popple
2024-09-10 13:47 ` Bjorn Helgaas
2024-09-11 1:07 ` Alistair Popple
2024-09-11 13:51 ` Bjorn Helgaas
2024-09-11 0:48 ` Logan Gunthorpe
2024-10-11 0:20 ` Alistair Popple
2024-09-22 1:00 ` Dan Williams
2024-10-11 0:17 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 03/12] fs/dax: Refactor wait for dax idle page Alistair Popple
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 04/12] mm: Allow compound zone device pages Alistair Popple
2024-09-10 4:47 ` Matthew Wilcox
2024-09-10 6:57 ` Alistair Popple
2024-09-10 13:41 ` Matthew Wilcox
2024-09-12 12:44 ` kernel test robot
2024-09-12 12:44 ` kernel test robot
2024-09-22 1:01 ` Dan Williams
2024-09-10 4:14 ` [PATCH 05/12] mm/memory: Add dax_insert_pfn Alistair Popple
2024-09-22 1:41 ` Dan Williams
2024-10-01 10:43 ` Gerald Schaefer
2024-09-10 4:14 ` [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages Alistair Popple
2024-09-22 2:07 ` Dan Williams
2024-10-14 6:33 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 07/12] huge_memory: Allow mappings of PMD " Alistair Popple
2024-09-27 2:48 ` Dan Williams
2024-10-14 6:53 ` Alistair Popple
2024-10-23 23:14 ` Alistair Popple
2024-10-23 23:38 ` Dan Williams
2024-09-10 4:14 ` [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2024-09-25 0:17 ` Dan Williams
2024-09-27 2:52 ` Dan Williams
2024-10-14 7:03 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 09/12] mm: Update vm_normal_page() callers to accept " Alistair Popple
2024-09-27 7:15 ` Dan Williams
2024-10-14 7:16 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 10/12] fs/dax: Properly refcount fs dax pages Alistair Popple
2024-09-27 7:59 ` Dan Williams
2024-10-24 7:52 ` Alistair Popple
2024-10-24 23:52 ` Dan Williams
2024-10-25 2:46 ` Alistair Popple
2024-10-25 4:35 ` Dan Williams
2024-10-28 4:24 ` Alistair Popple
2024-10-29 2:03 ` Dan Williams
2024-10-30 5:57 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 11/12] mm: Remove pXX_devmap callers Alistair Popple
2024-09-27 12:29 ` Alexander Gordeev
2024-10-14 7:14 ` Alistair Popple
2024-09-10 4:14 ` [PATCH 12/12] mm: Remove devmap related functions and page table bits Alistair Popple
2024-09-11 7:47 ` Chunyan Zhang
2024-09-12 12:55 ` kernel test robot