linux-mm.kvack.org archive mirror
* [PATCH v2 00/19] mm: Support huge pfnmaps
@ 2024-08-26 20:43 Peter Xu
  2024-08-26 20:43 ` [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
                   ` (20 more replies)
  0 siblings, 21 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

v2:
- Added tags
- Let folio_walk_start() scan special pmd/pud bits [DavidH]
- Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
- Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
  special bits for huge pfn mappings when inject" [JasonG]
- In gup-fast, reorder _special check v.s. _devmap check, so as to make
  pmd/pud path look the same as pte path [DavidH, JasonG]
- Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
  used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]

Overview
========

This series is based on mm-unstable, commit b659edec079c (the latest as
of Aug 26th), with the patch "vma remove the unneeded avc bound with
non-CoWed folio" reverted, as it was reported broken [0].

This series implements huge pfnmap support for mm in general.  Huge
pfnmap allows e.g. VM_PFNMAP vmas to map at either the PMD or PUD level,
similar to what we do so far with dax / thp / hugetlb, to benefit from
better TLB hit rates.  Now we extend that idea to PFN mappings, e.g. PCI
MMIO BARs, which can grow as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
patch (from Alex Williamson) will be the first user of huge pfnmap,
enabling the vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings, because of PFNMAP's specialty of having no
vmemmap backing it: most of the kernel routines on huge mappings should
simply already fail for them, like GUP or the old-school follow_page()
(which was recently rewritten into the folio_walk* APIs by David).

One trick here is that we're still immature on PUDs in generic paths
here and there, as DAX is so far the only user.  This patchset will add
the 2nd user of it.  Hugetlb can be a 3rd user if the hugetlb
unification work goes on smoothly, but that is to be discussed later.

The other trick is how to allow gup-fast to work for such huge mappings
even if there's no direct sign of knowing whether it's a normal page or
an MMIO mapping.  This series chose to keep the pte_special solution,
reusing the same idea of setting a special bit on pfnmap PMDs/PUDs so
that gup-fast will be able to identify them and fail properly.
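
For instance, the pmd side of that check ends up as small as the sketch
below (it mirrors the real hunk in patch 5; the pud path does the same):

  /* gup_fast_pmd_leaf(), patch 5: a special (pfnmap) entry has no
   * struct page behind it, so bail out and let the slow path handle it */
  if (pmd_special(orig))
          return 0;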

Along the way, we'll also notice that the major pgtable pfn walker, aka
follow_pte(), will need to retire soon, because it only works with ptes.
A new set of simple APIs is introduced (the follow_pfnmap* API) that can
do whatever follow_pte() already does, plus process huge pfnmaps.  Half
of this series is about that and about converting all existing pfnmap
walkers to use the new API properly.  Hopefully the new API also looks
better by not exposing e.g. pgtable lock details to the callers, so that
it can be used in an even more straightforward way.

Here, three more config options are introduced for huge pfnmap:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers will need to select this option in the arch's
    Kconfig when huge pfnmap is supported.  After this patchset is
    applied, both x86_64 and arm64 will enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options let driver developers identify whether the current
    arch / config supports huge pfnmaps, to decide whether the huge
    pfnmap APIs can be used to inject them (see the sketch below).  One
    can refer to the last vfio-pci patch from Alex for how to use them
    properly in a device driver.
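
For example, a driver could gate its fault handler on these options with
something like the sketch below.  pfnmap_max_order() is a hypothetical
helper, not part of this series; see the vfio-pci patch for the real
logic:

  /* Hypothetical helper: largest pfnmap order this arch/config allows */
  static unsigned int pfnmap_max_order(void)
  {
          if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP))
                  return PUD_ORDER;       /* e.g. 1G on x86_64 */
          if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP))
                  return PMD_ORDER;       /* e.g. 2M */
          return 0;                       /* PTE mappings only */
  }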

So after the whole set is applied, and if one enables some dynamic debug
lines in the vfio-pci core files, we should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it says that vfio-pci faults in PMDs properly
(order 9, i.e. 512 base pages = 2M) for a few BAR 0 offsets.

Patch Layout
============

Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
Patch 2:         A tiny cleanup
Patch 3-8:       Preparation patches for huge pfnmap (including the
                 introduction of the special bit for pmd/pud)
Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                 then drop follow_pte() API
Patch 17:        Add huge pfnmap support for x86_64
Patch 18:        Add huge pfnmap support for arm64
Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
to be a plan to support arm64 1G later on top of this series [2].

Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new
options, remap_pfn_range() can logically start to inject either PMDs or
PUDs when the alignment requirements on the VAs match.

When that support is there, it should silently benefit all drivers that
use remap_pfn_range() in their mmap() handlers, with better TLB hit
rates and overall faster MMIO accesses, similar to what processors gain
from hugepages.

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() for better VA
alignment at PMD/PUD sizes.  This may, IIUC, normally require userspace
changes, as the driver doesn't normally decide the VA at which a BAR is
mapped.  But I don't think I know all the drivers well enough to see the
full picture.

Tests Done
==========

- Cross-build tests

- run_vmtests.sh

- Hacked e1000e QEMU with 128MB BAR 0, with some prefault tests, plus
  mprotect() and fork() tests on the mapped BAR

- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment, to make
    sure all pages can be mapped with PUDs
  - Main BAR (8GB) starts to use PUD mappings
  - Sub BAR (??MBs?) starts to use PMD mappings
  - Performance wise, slight improvement compared to the old PTE mappings

- aarch64 + NIC
  - Detached NIC test to make sure the driver loads fine with PMD mappings

Credits all go to Alex for helping test the GPU/NIC use cases above.

Comments welcomed, thanks.

[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com

Alex Williamson (1):
  vfio/pci: Implement huge_fault support

Peter Xu (18):
  mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: Drop is_huge_zero_pud()
  mm: Mark special bits for huge pfn mappings when inject
  mm: Allow THP orders for PFNMAPs
  mm/gup: Detect huge pfnmap entries in gup-fast
  mm/pagewalk: Check pfnmap for folio_walk_start()
  mm/fork: Accept huge pfnmap entries
  mm: Always define pxx_pgprot()
  mm: New follow_pfnmap API
  KVM: Use follow_pfnmap API
  s390/pci_mmio: Use follow_pfnmap API
  mm/x86/pat: Use the new follow_pfnmap API
  vfio: Use the new follow_pfnmap API
  acrn: Use the new follow_pfnmap API
  mm/access_process_vm: Use the new follow_pfnmap API
  mm: Remove follow_pte()
  mm/x86: Support large pfn mappings
  mm/arm64: Support large pfn mappings

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  30 +++++
 arch/powerpc/include/asm/pgtable.h  |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/s390/pci/pci_mmio.c            |  22 ++--
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/include/asm/pgtable.h      |  80 +++++++-----
 arch/x86/mm/pat/memtype.c           |  17 ++-
 drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
 drivers/vfio/vfio_iommu_type1.c     |  16 +--
 drivers/virt/acrn/mm.c              |  16 +--
 include/linux/huge_mm.h             |  16 +--
 include/linux/mm.h                  |  57 ++++++++-
 include/linux/pgtable.h             |  12 ++
 mm/Kconfig                          |  13 ++
 mm/gup.c                            |   6 +
 mm/huge_memory.c                    |  50 +++++---
 mm/memory.c                         | 183 ++++++++++++++++++++--------
 mm/pagewalk.c                       |   4 +-
 virt/kvm/kvm_main.c                 |  19 ++-
 21 files changed, 425 insertions(+), 181 deletions(-)

-- 
2.45.0




* [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 02/19] mm: Drop is_huge_zero_pud() Peter Xu
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

This patch introduces the option to add the special pte bit to
pmds/puds.  Archs can start to define pmd_special / pud_special when
supported, by selecting the new option.  Per-arch support will be added
later.

Before that, create fallbacks for these helpers so that they are always
available.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 24 ++++++++++++++++++++++++
 mm/Kconfig         | 13 +++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b0ff06d18c71..d900f15b7650 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2643,6 +2643,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
 }
 #endif
 
+#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static inline bool pmd_special(pmd_t pmd)
+{
+	return false;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd;
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
+
+#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+static inline bool pud_special(pud_t pud)
+{
+	return false;
+}
+
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+	return pud;
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
+
 #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t pte)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 8078a4b3c509..b23913d4e47e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -875,6 +875,19 @@ endif # TRANSPARENT_HUGEPAGE
 config PGTABLE_HAS_HUGE_LEAVES
 	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
 
+# TODO: Allow to be enabled without THP
+config ARCH_SUPPORTS_HUGE_PFNMAP
+	def_bool n
+	depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_SUPPORTS_PMD_PFNMAP
+	def_bool y
+	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
+
+config ARCH_SUPPORTS_PUD_PFNMAP
+	def_bool y
+	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+
 #
 # UP and nommu archs use km based percpu allocator
 #
-- 
2.45.0




* [PATCH v2 02/19] mm: Drop is_huge_zero_pud()
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
  2024-08-26 20:43 ` [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson, Matthew Wilcox, Aneesh Kumar K . V

It has constantly returned false since 2017.  One assertion was added in
2019, but it should never have triggered; IOW, it means what is checked
should be asserted instead.

If it hasn't existed for 7 years, maybe it's a good idea to remove it
and only add it back when it comes.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 10 ----------
 mm/huge_memory.c        | 13 +------------
 2 files changed, 1 insertion(+), 22 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4902e2f7e896..b550b5a248bb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -433,11 +433,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
 }
 
-static inline bool is_huge_zero_pud(pud_t pud)
-{
-	return false;
-}
-
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
 void mm_put_huge_zero_folio(struct mm_struct *mm);
 
@@ -578,11 +573,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return false;
 }
 
-static inline bool is_huge_zero_pud(pud_t pud)
-{
-	return false;
-}
-
 static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
 {
 	return;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a81eab98d6b8..3f74b09ada38 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1429,10 +1429,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	ptl = pud_lock(mm, pud);
 	if (!pud_none(*pud)) {
 		if (write) {
-			if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
-				WARN_ON_ONCE(!is_huge_zero_pud(*pud));
+			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
 				goto out_unlock;
-			}
 			entry = pud_mkyoung(*pud);
 			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
 			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
@@ -1680,15 +1678,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
 		goto out_unlock;
 
-	/*
-	 * When page table lock is held, the huge zero pud should not be
-	 * under splitting since we don't split the page itself, only pud to
-	 * a page table.
-	 */
-	if (is_huge_zero_pud(pud)) {
-		/* No huge zero pud yet */
-	}
-
 	/*
 	 * TODO: once we support anonymous pages, use
 	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
-- 
2.45.0




* [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
  2024-08-26 20:43 ` [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
  2024-08-26 20:43 ` [PATCH v2 02/19] mm: Drop is_huge_zero_pud() Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-28 15:31   ` David Hildenbrand
  2024-08-26 20:43 ` [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

We need these special bits to be around on pfnmaps.  Mark them properly
for the !devmap case, reflecting that there's no page struct backing the
entry.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3f74b09ada38..dec17d09390f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,6 +1346,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
+	else
+		entry = pmd_mkspecial(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -1442,6 +1444,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pud_mkdevmap(entry);
+	else
+		entry = pud_mkspecial(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
-- 
2.45.0




* [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (2 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-28 15:31   ` David Hildenbrand
  2024-08-26 20:43 ` [PATCH v2 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson, Matthew Wilcox, Ryan Roberts

This enables PFNMAPs to be mapped at either the pmd or pud level.
Generalize the dax case into vma_is_special_huge() so as to cover both.
Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 6 +++---
 mm/huge_memory.c        | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b550b5a248bb..4da102b74a8c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -76,9 +76,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 /*
  * Mask of all large folio orders supported for file THP. Folios in a DAX
  * file is never split and the MAX_PAGECACHE_ORDER limit does not apply to
- * it.
+ * it.  Same to PFNMAPs where there's neither page* nor pagecache.
  */
-#define THP_ORDERS_ALL_FILE_DAX		\
+#define THP_ORDERS_ALL_SPECIAL		\
 	(BIT(PMD_ORDER) | BIT(PUD_ORDER))
 #define THP_ORDERS_ALL_FILE_DEFAULT	\
 	((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
@@ -87,7 +87,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
  * Mask of all large folio orders supported for THP.
  */
 #define THP_ORDERS_ALL	\
-	(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
+	(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
 
 #define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
 #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dec17d09390f..e2c314f631f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -96,8 +96,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	/* Check the intersection of requested and supported orders. */
 	if (vma_is_anonymous(vma))
 		supported_orders = THP_ORDERS_ALL_ANON;
-	else if (vma_is_dax(vma))
-		supported_orders = THP_ORDERS_ALL_FILE_DAX;
+	else if (vma_is_special_huge(vma))
+		supported_orders = THP_ORDERS_ALL_SPECIAL;
 	else
 		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
-- 
2.45.0




* [PATCH v2 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (3 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start() Peter Xu
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Since gup-fast doesn't have the vma reference, teach it to detect such huge
pfnmaps by checking the special bit for pmd/pud too, just like ptes.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d19884e097fd..a49f67a512ee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3038,6 +3038,9 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	if (pmd_special(orig))
+		return 0;
+
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
@@ -3082,6 +3085,9 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	if (pud_special(orig))
+		return 0;
+
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-- 
2.45.0




* [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (4 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-28  7:44   ` David Hildenbrand
  2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
them properly, as there is no folio backing them.

Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/pagewalk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cd79fb3b89e5..12be5222d70e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -753,7 +753,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		fw->pudp = pudp;
 		fw->pud = pud;
 
-		if (!pud_present(pud) || pud_devmap(pud)) {
+		if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
 			spin_unlock(ptl);
 			goto not_found;
 		} else if (!pud_leaf(pud)) {
@@ -783,7 +783,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		fw->pmdp = pmdp;
 		fw->pmd = pmd;
 
-		if (pmd_none(pmd)) {
+		if (pmd_none(pmd) || pmd_special(pmd)) {
 			spin_unlock(ptl);
 			goto not_found;
 		} else if (!pmd_leaf(pmd)) {
-- 
2.45.0




* [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (5 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start() Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-29 15:10   ` David Hildenbrand
  2024-09-02  7:58   ` Yan Zhao
  2024-08-26 20:43 ` [PATCH v2 08/19] mm: Always define pxx_pgprot() Peter Xu
                   ` (13 subsequent siblings)
  20 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Teach the fork code to properly copy pfnmaps for the pmd/pud levels.
Pud is much easier: the write bit needs to be persisted for writable and
shared pud mappings like PFNMAP ones, otherwise a follow-up write in
either the parent or the child process will trigger a write fault.

Do the same for the pmd level.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2c314f631f3..15418ffdd377 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;
 
+	pmd = pmdp_get_lockless(src_pmd);
+	if (unlikely(pmd_special(pmd))) {
+		dst_ptl = pmd_lock(dst_mm, dst_pmd);
+		src_ptl = pmd_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+		/*
+		 * No need to recheck the pmd, it can't change with write
+		 * mmap lock held here.
+		 *
+		 * Meanwhile, making sure it's not a CoW VMA with writable
+		 * mapping, otherwise it means either the anon page wrongly
+		 * applied special bit, or we made the PRIVATE mapping be
+		 * able to wrongly write to the backend MMIO.
+		 */
+		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
+		goto set_pmd;
+	}
+
 	/* Skip if can be re-fill on fault */
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
@@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	pmd = pmd_wrprotect(pmd);
+set_pmd:
+	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * TODO: once we support anonymous pages, use
 	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
 	 */
-	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
+		pudp_set_wrprotect(src_mm, addr, src_pud);
+		pud = pud_wrprotect(pud);
+	}
+	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;
-- 
2.45.0




* [PATCH v2 08/19] mm: Always define pxx_pgprot()
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (6 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 09/19] mm: New follow_pfnmap API Peter Xu
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

There are:

  - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
  support pte_pgprot().

  - 2 archs (x86, sparc) that support pmd_pgprot().

  - 1 arch (x86) that supports pud_pgprot().

Always define them to be used in generic code, and then we don't need to
fiddle with "#ifdef"s when doing so.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/include/asm/pgtable.h    |  1 +
 arch/powerpc/include/asm/pgtable.h  |  1 +
 arch/s390/include/asm/pgtable.h     |  1 +
 arch/sparc/include/asm/pgtable_64.h |  1 +
 include/linux/pgtable.h             | 12 ++++++++++++
 5 files changed, 16 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..b78cc4a6758b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -384,6 +384,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 /*
  * Select all bits except the pfn
  */
+#define pte_pgprot pte_pgprot
 static inline pgprot_t pte_pgprot(pte_t pte)
 {
 	unsigned long pfn = pte_pfn(pte);
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 264a6c09517a..2f72ad885332 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -65,6 +65,7 @@ static inline unsigned long pte_pfn(pte_t pte)
 /*
  * Select all bits except the pfn
  */
+#define pte_pgprot pte_pgprot
 static inline pgprot_t pte_pgprot(pte_t pte)
 {
 	unsigned long pte_flags;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3fa280d0672a..0ffbaf741955 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -955,6 +955,7 @@ static inline int pte_unused(pte_t pte)
  * young/old accounting is not supported, i.e _PAGE_PROTECT and _PAGE_INVALID
  * must not be set.
  */
+#define pte_pgprot pte_pgprot
 static inline pgprot_t pte_pgprot(pte_t pte)
 {
 	unsigned long pte_flags = pte_val(pte) & _PAGE_CHG_MASK;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3fe429d73a65..2b7f358762c1 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -783,6 +783,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
+#define pmd_pgprot pmd_pgprot
 static inline pgprot_t pmd_pgprot(pmd_t entry)
 {
 	unsigned long val = pmd_val(entry);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 780f3b439d98..e8b2ac6bd2ae 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1956,6 +1956,18 @@ typedef unsigned int pgtbl_mod_mask;
 #define MAX_PTRS_PER_P4D PTRS_PER_P4D
 #endif
 
+#ifndef pte_pgprot
+#define pte_pgprot(x) ((pgprot_t) {0})
+#endif
+
+#ifndef pmd_pgprot
+#define pmd_pgprot(x) ((pgprot_t) {0})
+#endif
+
+#ifndef pud_pgprot
+#define pud_pgprot(x) ((pgprot_t) {0})
+#endif
+
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
  * behavior is in parens:
-- 
2.45.0




* [PATCH v2 09/19] mm: New follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (7 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 08/19] mm: Always define pxx_pgprot() Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 10/19] KVM: Use " Peter Xu
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Introduce a pair of APIs to follow pfn mappings and get entry
information.  It's very similar to what follow_pte() already does, but
different in that it recognizes huge pfn mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  31 ++++++++++
 mm/memory.c        | 150 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 181 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d900f15b7650..161d496bfd18 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2373,6 +2373,37 @@ int follow_pte(struct vm_area_struct *vma, unsigned long address,
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
+struct follow_pfnmap_args {
+	/**
+	 * Inputs:
+	 * @vma: Pointer to @vm_area_struct struct
+	 * @address: the virtual address to walk
+	 */
+	struct vm_area_struct *vma;
+	unsigned long address;
+	/**
+	 * Internals:
+	 *
+	 * The caller shouldn't touch any of these.
+	 */
+	spinlock_t *lock;
+	pte_t *ptep;
+	/**
+	 * Outputs:
+	 *
+	 * @pfn: the PFN of the address
+	 * @pgprot: the pgprot_t of the mapping
+	 * @writable: whether the mapping is writable
+	 * @special: whether the mapping is a special mapping (real PFN maps)
+	 */
+	unsigned long pfn;
+	pgprot_t pgprot;
+	bool writable;
+	bool special;
+};
+int follow_pfnmap_start(struct follow_pfnmap_args *args);
+void follow_pfnmap_end(struct follow_pfnmap_args *args);
+
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
diff --git a/mm/memory.c b/mm/memory.c
index 93c0c25433d0..0b136c398257 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6173,6 +6173,156 @@ int follow_pte(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(follow_pte);
 
+static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
+				     spinlock_t *lock, pte_t *ptep,
+				     pgprot_t pgprot, unsigned long pfn_base,
+				     unsigned long addr_mask, bool writable,
+				     bool special)
+{
+	args->lock = lock;
+	args->ptep = ptep;
+	args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
+	args->pgprot = pgprot;
+	args->writable = writable;
+	args->special = special;
+}
+
+static inline void pfnmap_lockdep_assert(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_LOCKDEP
+	struct address_space *mapping = vma->vm_file->f_mapping;
+
+	if (mapping)
+		lockdep_assert(lockdep_is_held(&vma->vm_file->f_mapping->i_mmap_rwsem) ||
+			       lockdep_is_held(&vma->vm_mm->mmap_lock));
+	else
+		lockdep_assert(lockdep_is_held(&vma->vm_mm->mmap_lock));
+#endif
+}
+
+/**
+ * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
+ * @args: Pointer to struct @follow_pfnmap_args
+ *
+ * The caller needs to setup args->vma and args->address to point to the
+ * virtual address as the target of such lookup.  On a successful return,
+ * the results will be put into other output fields.
+ *
+ * After the caller finished using the fields, the caller must invoke
+ * another follow_pfnmap_end() to proper releases the locks and resources
+ * of such look up request.
+ *
+ * During the start() and end() calls, the results in @args will be valid
+ * as proper locks will be held.  After the end() is called, all the fields
+ * in @follow_pfnmap_args will be invalid to be further accessed.  Further
+ * use of such information after end() may require proper synchronizations
+ * by the caller with page table updates, otherwise it can create a
+ * security bug.
+ *
+ * If the PTE maps a refcounted page, callers are responsible to protect
+ * against invalidation with MMU notifiers; otherwise access to the PFN at
+ * a later point in time can trigger use-after-free.
+ *
+ * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
+ * should be taken for read, and the mmap semaphore cannot be released
+ * before the end() is invoked.
+ *
+ * This function must not be used to modify PTE content.
+ *
+ * Return: zero on success, negative otherwise.
+ */
+int follow_pfnmap_start(struct follow_pfnmap_args *args)
+{
+	struct vm_area_struct *vma = args->vma;
+	unsigned long address = args->address;
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *lock;
+	pgd_t *pgdp;
+	p4d_t *p4dp, p4d;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	pfnmap_lockdep_assert(vma);
+
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+		goto out;
+
+	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		goto out;
+retry:
+	pgdp = pgd_offset(mm, address);
+	if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp)))
+		goto out;
+
+	p4dp = p4d_offset(pgdp, address);
+	p4d = READ_ONCE(*p4dp);
+	if (p4d_none(p4d) || unlikely(p4d_bad(p4d)))
+		goto out;
+
+	pudp = pud_offset(p4dp, address);
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
+		goto out;
+	if (pud_leaf(pud)) {
+		lock = pud_lock(mm, pudp);
+		if (!unlikely(pud_leaf(pud))) {
+			spin_unlock(lock);
+			goto retry;
+		}
+		pfnmap_args_setup(args, lock, NULL, pud_pgprot(pud),
+				  pud_pfn(pud), PUD_MASK, pud_write(pud),
+				  pud_special(pud));
+		return 0;
+	}
+
+	pmdp = pmd_offset(pudp, address);
+	pmd = pmdp_get_lockless(pmdp);
+	if (pmd_leaf(pmd)) {
+		lock = pmd_lock(mm, pmdp);
+		if (!unlikely(pmd_leaf(pmd))) {
+			spin_unlock(lock);
+			goto retry;
+		}
+		pfnmap_args_setup(args, lock, NULL, pmd_pgprot(pmd),
+				  pmd_pfn(pmd), PMD_MASK, pmd_write(pmd),
+				  pmd_special(pmd));
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, address, &lock);
+	if (!ptep)
+		goto out;
+	pte = ptep_get(ptep);
+	if (!pte_present(pte))
+		goto unlock;
+	pfnmap_args_setup(args, lock, ptep, pte_pgprot(pte),
+			  pte_pfn(pte), PAGE_MASK, pte_write(pte),
+			  pte_special(pte));
+	return 0;
+unlock:
+	pte_unmap_unlock(ptep, lock);
+out:
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(follow_pfnmap_start);
+
+/**
+ * follow_pfnmap_end(): End a follow_pfnmap_start() process
+ * @args: Pointer to struct @follow_pfnmap_args
+ *
+ * Must be used in pair of follow_pfnmap_start().  See the start() function
+ * above for more information.
+ */
+void follow_pfnmap_end(struct follow_pfnmap_args *args)
+{
+	if (args->lock)
+		spin_unlock(args->lock);
+	if (args->ptep)
+		pte_unmap(args->ptep);
+}
+EXPORT_SYMBOL_GPL(follow_pfnmap_end);
+
 #ifdef CONFIG_HAVE_IOREMAP_PROT
 /**
  * generic_access_phys - generic implementation for iomem mmap access
-- 
2.45.0




* [PATCH v2 10/19] KVM: Use follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (8 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 09/19] mm: New follow_pfnmap API Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 11/19] s390/pci_mmio: " Peter Xu
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest of
the work is already handled properly on the other side
(host_pfn_mapping_level()).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 virt/kvm/kvm_main.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb2b78e92910..f416d5e3f9c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2860,13 +2860,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool write_fault,
 			       bool *writable, kvm_pfn_t *p_pfn)
 {
+	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
 	kvm_pfn_t pfn;
-	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
 	int r;
 
-	r = follow_pte(vma, addr, &ptep, &ptl);
+	r = follow_pfnmap_start(&args);
 	if (r) {
 		/*
 		 * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -2881,21 +2879,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		if (r)
 			return r;
 
-		r = follow_pte(vma, addr, &ptep, &ptl);
+		r = follow_pfnmap_start(&args);
 		if (r)
 			return r;
 	}
 
-	pte = ptep_get(ptep);
-
-	if (write_fault && !pte_write(pte)) {
+	if (write_fault && !args.writable) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
 	}
 
 	if (writable)
-		*writable = pte_write(pte);
-	pfn = pte_pfn(pte);
+		*writable = args.writable;
+	pfn = args.pfn;
 
 	/*
 	 * Get a reference here because callers of *hva_to_pfn* and
@@ -2916,9 +2912,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 */
 	if (!kvm_try_get_pfn(pfn))
 		r = -EFAULT;
-
 out:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 	*p_pfn = pfn;
 
 	return r;
-- 
2.45.0




* [PATCH v2 11/19] s390/pci_mmio: Use follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (9 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 10/19] KVM: Use " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 12/19] mm/x86/pat: Use the new " Peter Xu
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson, Niklas Schnelle, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, linux-s390

Use the new API that can understand huge pfn mappings.

Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/pci/pci_mmio.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 5398729bfe1b..de5c0b389a3e 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -118,12 +118,11 @@ static inline int __memcpy_toio_inuser(void __iomem *dst,
 SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 		const void __user *, user_buffer, size_t, length)
 {
+	struct follow_pfnmap_args args = { };
 	u8 local_buf[64];
 	void __iomem *io_addr;
 	void *buf;
 	struct vm_area_struct *vma;
-	pte_t *ptep;
-	spinlock_t *ptl;
 	long ret;
 
 	if (!zpci_is_enabled())
@@ -169,11 +168,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+	args.address = mmio_addr;
+	args.vma = vma;
+	ret = follow_pfnmap_start(&args);
 	if (ret)
 		goto out_unlock_mmap;
 
-	io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
+	io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) |
 			(mmio_addr & ~PAGE_MASK));
 
 	if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE)
@@ -181,7 +182,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 
 	ret = zpci_memcpy_toio(io_addr, buf, length);
 out_unlock_pt:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 out_unlock_mmap:
 	mmap_read_unlock(current->mm);
 out_free:
@@ -260,12 +261,11 @@ static inline int __memcpy_fromio_inuser(void __user *dst,
 SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 		void __user *, user_buffer, size_t, length)
 {
+	struct follow_pfnmap_args args = { };
 	u8 local_buf[64];
 	void __iomem *io_addr;
 	void *buf;
 	struct vm_area_struct *vma;
-	pte_t *ptep;
-	spinlock_t *ptl;
 	long ret;
 
 	if (!zpci_is_enabled())
@@ -308,11 +308,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	if (!(vma->vm_flags & VM_WRITE))
 		goto out_unlock_mmap;
 
-	ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+	args.vma = vma;
+	args.address = mmio_addr;
+	ret = follow_pfnmap_start(&args);
 	if (ret)
 		goto out_unlock_mmap;
 
-	io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
+	io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) |
 			(mmio_addr & ~PAGE_MASK));
 
 	if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) {
@@ -322,7 +324,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	ret = zpci_memcpy_fromio(buf, io_addr, length);
 
 out_unlock_pt:
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 out_unlock_mmap:
 	mmap_read_unlock(current->mm);
 
-- 
2.45.0




* [PATCH v2 12/19] mm/x86/pat: Use the new follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (10 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 11/19] s390/pci_mmio: " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 13/19] vfio: " Peter Xu
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Use the new API that can understand huge pfn mappings.

Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 1fa0bf6ed295..f73b5ce270b3 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -951,23 +951,20 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
 		resource_size_t *phys)
 {
-	pte_t *ptep, pte;
-	spinlock_t *ptl;
+	struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
 
-	if (follow_pte(vma, vma->vm_start, &ptep, &ptl))
+	if (follow_pfnmap_start(&args))
 		return -EINVAL;
 
-	pte = ptep_get(ptep);
-
 	/* Never return PFNs of anon folios in COW mappings. */
-	if (vm_normal_folio(vma, vma->vm_start, pte)) {
-		pte_unmap_unlock(ptep, ptl);
+	if (!args.special) {
+		follow_pfnmap_end(&args);
 		return -EINVAL;
 	}
 
-	*prot = pgprot_val(pte_pgprot(pte));
-	*phys = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
-	pte_unmap_unlock(ptep, ptl);
+	*prot = pgprot_val(args.pgprot);
+	*phys = (resource_size_t)args.pfn << PAGE_SHIFT;
+	follow_pfnmap_end(&args);
 	return 0;
 }
 
-- 
2.45.0




* [PATCH v2 13/19] vfio: Use the new follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (11 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 12/19] mm/x86/pat: Use the new " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 14/19] acrn: " Peter Xu
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Use the new API that can understand huge pfn mappings.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0960699e7554..bf391b40e576 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -513,12 +513,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 			    unsigned long vaddr, unsigned long *pfn,
 			    bool write_fault)
 {
-	pte_t *ptep;
-	pte_t pte;
-	spinlock_t *ptl;
+	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
 	int ret;
 
-	ret = follow_pte(vma, vaddr, &ptep, &ptl);
+	ret = follow_pfnmap_start(&args);
 	if (ret) {
 		bool unlocked = false;
 
@@ -532,19 +530,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 		if (ret)
 			return ret;
 
-		ret = follow_pte(vma, vaddr, &ptep, &ptl);
+		ret = follow_pfnmap_start(&args);
 		if (ret)
 			return ret;
 	}
 
-	pte = ptep_get(ptep);
-
-	if (write_fault && !pte_write(pte))
+	if (write_fault && !args.writable)
 		ret = -EFAULT;
 	else
-		*pfn = pte_pfn(pte);
+		*pfn = args.pfn;
 
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 	return ret;
 }
 
-- 
2.45.0




* [PATCH v2 14/19] acrn: Use the new follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (12 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 13/19] vfio: " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 15/19] mm/access_process_vm: " Peter Xu
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Use the new API that can understand huge pfn mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/virt/acrn/mm.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/virt/acrn/mm.c b/drivers/virt/acrn/mm.c
index db8ff1d0ac23..4c2f28715b70 100644
--- a/drivers/virt/acrn/mm.c
+++ b/drivers/virt/acrn/mm.c
@@ -177,9 +177,7 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap)
 	vma = vma_lookup(current->mm, memmap->vma_base);
 	if (vma && ((vma->vm_flags & VM_PFNMAP) != 0)) {
 		unsigned long start_pfn, cur_pfn;
-		spinlock_t *ptl;
 		bool writable;
-		pte_t *ptep;
 
 		if ((memmap->vma_base + memmap->len) > vma->vm_end) {
 			mmap_read_unlock(current->mm);
@@ -187,16 +185,20 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap)
 		}
 
 		for (i = 0; i < nr_pages; i++) {
-			ret = follow_pte(vma, memmap->vma_base + i * PAGE_SIZE,
-					 &ptep, &ptl);
+			struct follow_pfnmap_args args = {
+				.vma = vma,
+				.address = memmap->vma_base + i * PAGE_SIZE,
+			};
+
+			ret = follow_pfnmap_start(&args);
 			if (ret)
 				break;
 
-			cur_pfn = pte_pfn(ptep_get(ptep));
+			cur_pfn = args.pfn;
 			if (i == 0)
 				start_pfn = cur_pfn;
-			writable = !!pte_write(ptep_get(ptep));
-			pte_unmap_unlock(ptep, ptl);
+			writable = args.writable;
+			follow_pfnmap_end(&args);
 
 			/* Disallow write access if the PTE is not writable. */
 			if (!writable &&
-- 
2.45.0




* [PATCH v2 15/19] mm/access_process_vm: Use the new follow_pfnmap API
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (13 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 14/19] acrn: " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 16/19] mm: Remove follow_pte() Peter Xu
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Use the new API that can understand huge pfn mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0b136c398257..b5d07f493d5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6342,34 +6342,34 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 	resource_size_t phys_addr;
 	unsigned long prot = 0;
 	void __iomem *maddr;
-	pte_t *ptep, pte;
-	spinlock_t *ptl;
 	int offset = offset_in_page(addr);
 	int ret = -EINVAL;
+	bool writable;
+	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
 
 retry:
-	if (follow_pte(vma, addr, &ptep, &ptl))
+	if (follow_pfnmap_start(&args))
 		return -EINVAL;
-	pte = ptep_get(ptep);
-	pte_unmap_unlock(ptep, ptl);
+	prot = pgprot_val(args.pgprot);
+	phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
+	writable = args.writable;
+	follow_pfnmap_end(&args);
 
-	prot = pgprot_val(pte_pgprot(pte));
-	phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
-
-	if ((write & FOLL_WRITE) && !pte_write(pte))
+	if ((write & FOLL_WRITE) && !writable)
 		return -EINVAL;
 
 	maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
 	if (!maddr)
 		return -ENOMEM;
 
-	if (follow_pte(vma, addr, &ptep, &ptl))
+	if (follow_pfnmap_start(&args))
 		goto out_unmap;
 
-	if (!pte_same(pte, ptep_get(ptep))) {
-		pte_unmap_unlock(ptep, ptl);
+	if ((prot != pgprot_val(args.pgprot)) ||
+	    (phys_addr != (args.pfn << PAGE_SHIFT)) ||
+	    (writable != args.writable)) {
+		follow_pfnmap_end(&args);
 		iounmap(maddr);
-
 		goto retry;
 	}
 
@@ -6378,7 +6378,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 	else
 		memcpy_fromio(buf, maddr + offset, len);
 	ret = len;
-	pte_unmap_unlock(ptep, ptl);
+	follow_pfnmap_end(&args);
 out_unmap:
 	iounmap(maddr);
 
-- 
2.45.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 16/19] mm: Remove follow_pte()
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (14 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 15/19] mm/access_process_vm: " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-09-01  4:33   ` Yu Zhao
  2024-08-26 20:43 ` [PATCH v2 17/19] mm/x86: Support large pfn mappings Peter Xu
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

follow_pte() users have been converted to follow_pfnmap*().  Remove the
API.
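
For reference, a converted caller now follows this pattern (simplified from
the conversions earlier in the series; error handling trimmed):

	struct follow_pfnmap_args args = { .vma = vma, .address = addr };
	unsigned long pfn;
	bool writable;

	if (follow_pfnmap_start(&args))
		return -EINVAL;
	pfn = args.pfn;			/* only stable until _end() */
	writable = args.writable;
	follow_pfnmap_end(&args);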

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  2 --
 mm/memory.c        | 73 ----------------------------------------------
 2 files changed, 75 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 161d496bfd18..b31d4bdd65ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2368,8 +2368,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
-int follow_pte(struct vm_area_struct *vma, unsigned long address,
-	       pte_t **ptepp, spinlock_t **ptlp);
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write);
 
diff --git a/mm/memory.c b/mm/memory.c
index b5d07f493d5d..288f81a8698e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6100,79 +6100,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-/**
- * follow_pte - look up PTE at a user virtual address
- * @vma: the memory mapping
- * @address: user virtual address
- * @ptepp: location to store found PTE
- * @ptlp: location to store the lock for the PTE
- *
- * On a successful return, the pointer to the PTE is stored in @ptepp;
- * the corresponding lock is taken and its location is stored in @ptlp.
- *
- * The contents of the PTE are only stable until @ptlp is released using
- * pte_unmap_unlock(). This function will fail if the PTE is non-present.
- * Present PTEs may include PTEs that map refcounted pages, such as
- * anonymous folios in COW mappings.
- *
- * Callers must be careful when relying on PTE content after
- * pte_unmap_unlock(). Especially if the PTE maps a refcounted page,
- * callers must protect against invalidation with MMU notifiers; otherwise
- * access to the PFN at a later point in time can trigger use-after-free.
- *
- * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
- * should be taken for read.
- *
- * This function must not be used to modify PTE content.
- *
- * Return: zero on success, -ve otherwise.
- */
-int follow_pte(struct vm_area_struct *vma, unsigned long address,
-	       pte_t **ptepp, spinlock_t **ptlp)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *ptep;
-
-	mmap_assert_locked(mm);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		goto out;
-
-	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
-		goto out;
-
-	pgd = pgd_offset(mm, address);
-	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-		goto out;
-
-	p4d = p4d_offset(pgd, address);
-	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
-		goto out;
-
-	pud = pud_offset(p4d, address);
-	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
-		goto out;
-
-	pmd = pmd_offset(pud, address);
-	VM_BUG_ON(pmd_trans_huge(*pmd));
-
-	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
-	if (!ptep)
-		goto out;
-	if (!pte_present(ptep_get(ptep)))
-		goto unlock;
-	*ptepp = ptep;
-	return 0;
-unlock:
-	pte_unmap_unlock(ptep, *ptlp);
-out:
-	return -EINVAL;
-}
-EXPORT_SYMBOL_GPL(follow_pte);
-
 static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
 				     spinlock_t *lock, pte_t *ptep,
 				     pgprot_t pgprot, unsigned long pfn_base,
-- 
2.45.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 17/19] mm/x86: Support large pfn mappings
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (15 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 16/19] mm: Remove follow_pte() Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-26 20:43 ` [PATCH v2 18/19] mm/arm64: " Peter Xu
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Add helpers to install and detect special pmd/pud entries.  In short, bit 9
on x86 is not used for pmd/pud, so we can directly define them the same way
as at the pte level.  Note that bit 9 is also used by _PAGE_BIT_CPA_TEST,
but that is only used in a debug test, so it shouldn't conflict in this
case.

The pxx_set|clear_flags() helpers for pmd/pud also need to be moved earlier
in the file so that they can be referenced by the new special bit helpers.
There is no change to the code that was moved.
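
For illustration only (this is not part of this patch, but a simplified
version of the injection path added earlier in the series), the helpers end
up being used roughly like this when a pfnmap PMD is installed:

	pmd_t entry = pmd_mkhuge(pfn_pmd(pfn, prot));

	/*
	 * There is no struct page behind this PFN: mark the entry special
	 * so that gup-fast and the folio walkers refuse to touch it.
	 */
	entry = pmd_mkspecial(entry);
	set_pmd_at(mm, addr, pmdp, entry);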

Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/pgtable.h | 80 ++++++++++++++++++++++------------
 2 files changed, 53 insertions(+), 28 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b74b9ee484da..d4dbe9717e96 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86_64
 	select ARCH_HAS_GIGANTIC_PAGE
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_PER_VMA_LOCK
+	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_SOFT_DIRTY
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8d12bfad6a1d..4c2d080d26b4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -120,6 +120,34 @@ extern pmdval_t early_pmd_flags;
 #define arch_end_context_switch(prev)	do {} while(0)
 #endif	/* CONFIG_PARAVIRT_XXL */
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
+static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
+{
+	pudval_t v = native_pud_val(pud);
+
+	return native_make_pud(v | set);
+}
+
+static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
+{
+	pudval_t v = native_pud_val(pud);
+
+	return native_make_pud(v & ~clear);
+}
+
 /*
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
@@ -317,6 +345,30 @@ static inline int pud_devmap(pud_t pud)
 }
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static inline bool pmd_special(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_SPECIAL;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
+
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+static inline bool pud_special(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_SPECIAL;
+}
+
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SPECIAL);
+}
+#endif	/* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
+
 static inline int pgd_devmap(pgd_t pgd)
 {
 	return 0;
@@ -487,20 +539,6 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
 }
 
-static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v | set);
-}
-
-static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v & ~clear);
-}
-
 /* See comments above mksaveddirty_shift() */
 static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
 {
@@ -595,20 +633,6 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 #define pmd_mkwrite pmd_mkwrite
 
-static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
-{
-	pudval_t v = native_pud_val(pud);
-
-	return native_make_pud(v | set);
-}
-
-static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
-{
-	pudval_t v = native_pud_val(pud);
-
-	return native_make_pud(v & ~clear);
-}
-
 /* See comments above mksaveddirty_shift() */
 static inline pud_t pud_mksaveddirty(pud_t pud)
 {
-- 
2.45.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 18/19] mm/arm64: Support large pfn mappings
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (16 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 17/19] mm/x86: Support large pfn mappings Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2025-03-19 22:22   ` Keith Busch
  2024-08-26 20:43 ` [PATCH v2 19/19] vfio/pci: Implement huge_fault support Peter Xu
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
pmds/puds.  Provide the pmd/pud helpers to set/get special bit.

There's one more thing missing for arm64 which is the pxx_pgprot() for
pmd/pud.  Add them too, which is mostly the same as the pte version by
dropping the pfn field.  These helpers are essential to be used in the new
follow_pfnmap*() API to report valid pgprot_t results.

Note that arm64 doesn't support huge PUDs yet, but it's still
straightforward to provide the pud helpers that we need altogether.  Only
the PMD helpers will bring an immediate benefit until arm64 gains generic
huge PUD support (e.g. in THPs).
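
For context, these pgprot helpers are what the generic follow_pfnmap*()
path uses to report results for a PMD leaf; a simplified sketch (not the
exact mm/memory.c hunk) of how they get consumed:

	/* Inside the PMD-leaf case of follow_pfnmap_start(), simplified */
	args->pgprot   = pmd_pgprot(pmd);
	args->pfn      = pmd_pfn(pmd) +
			 ((args->address & ~PMD_MASK) >> PAGE_SHIFT);
	args->writable = pmd_write(pmd);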

Cc: linux-arm-kernel@lists.infradead.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6494848019a0..6607ed8fdbb4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -99,6 +99,7 @@ config ARM64
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
 	select ARCH_SUPPORTS_PER_VMA_LOCK
+	select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b78cc4a6758b..2faecc033a19 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -578,6 +578,14 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+#define pmd_special(pte)	(!!((pmd_val(pte) & PTE_SPECIAL)))
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+	return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL));
+}
+#endif
+
 #define __pmd_to_phys(pmd)	__pte_to_phys(pmd_pte(pmd))
 #define __phys_to_pmd_val(phys)	__phys_to_pte_val(phys)
 #define pmd_pfn(pmd)		((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT)
@@ -595,6 +603,27 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 #define pud_pfn(pud)		((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT)
 #define pfn_pud(pfn,prot)	__pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+#define pud_special(pud)	pte_special(pud_pte(pud))
+#define pud_mkspecial(pud)	pte_pud(pte_mkspecial(pud_pte(pud)))
+#endif
+
+#define pmd_pgprot pmd_pgprot
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+	unsigned long pfn = pmd_pfn(pmd);
+
+	return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd));
+}
+
+#define pud_pgprot pud_pgprot
+static inline pgprot_t pud_pgprot(pud_t pud)
+{
+	unsigned long pfn = pud_pfn(pud);
+
+	return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
+}
+
 static inline void __set_pte_at(struct mm_struct *mm,
 				unsigned long __always_unused addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
-- 
2.45.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 19/19] vfio/pci: Implement huge_fault support
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (17 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 18/19] mm/arm64: " Peter Xu
@ 2024-08-26 20:43 ` Peter Xu
  2024-08-27 22:36 ` [PATCH v2 00/19] mm: Support huge pfnmaps Jiaqi Yan
  2024-09-09  4:03 ` Ankit Agrawal
  20 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-26 20:43 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, peterx, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

From: Alex Williamson <alex.williamson@redhat.com>

With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we
can take advantage of PMD and PUD faults to PCI BAR mmaps and create
more efficient mappings.  PCI BARs are always a power of two and will
typically get at least PMD alignment without userspace even trying.
Userspace alignment for PUD mappings is also not too difficult.

Consolidate faults through a single handler with a new wrapper for
standard single page faults.  The pre-faulting behavior of commit
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is
removed in this refactoring since huge_fault will cover the bulk of
the faults and result in more efficient page table usage.  We also
want to avoid pre-faulted single page mappings preempting huge page
mappings.
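
As one hedged sketch of the userspace side (illustrative only, not QEMU's
or VFIO's actual code; map_bar_aligned() and its parameters are made up for
this example), PUD alignment can be arranged by over-reserving and then
mapping the BAR at an aligned address inside the reservation:

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

static void *map_bar_aligned(int device_fd, off_t bar_offset, size_t bar_size)
{
	size_t align = 1UL << 30;	/* PUD size on x86_64 */
	uintptr_t aligned;
	void *probe;

	/* Over-reserve so a 1G-aligned window is guaranteed to exist inside. */
	probe = mmap(NULL, bar_size + align, PROT_NONE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (probe == MAP_FAILED)
		return MAP_FAILED;

	aligned = ((uintptr_t)probe + align - 1) & ~(align - 1);

	/*
	 * Replace the aligned part of the reservation with the BAR mapping;
	 * the unused PROT_NONE edges are left in place for brevity.
	 */
	return mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, device_fd, bar_offset);
}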

Cc: kvm@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 60 +++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..2d7478e9a62d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -20,6 +20,7 @@
 #include <linux/mutex.h>
 #include <linux/notifier.h>
 #include <linux/pci.h>
+#include <linux/pfn_t.h>
 #include <linux/pm_runtime.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -1657,14 +1658,20 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
-static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
+					   unsigned int order)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
-	unsigned long addr = vma->vm_start;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
+	if (order && (vmf->address & ((PAGE_SIZE << order) - 1) ||
+		      vmf->address + (PAGE_SIZE << order) > vma->vm_end)) {
+		ret = VM_FAULT_FALLBACK;
+		goto out;
+	}
+
 	pfn = vma_to_pfn(vma);
 
 	down_read(&vdev->memory_lock);
@@ -1672,30 +1679,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
 		goto out_unlock;
 
-	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
-	if (ret & VM_FAULT_ERROR)
-		goto out_unlock;
-
-	/*
-	 * Pre-fault the remainder of the vma, abort further insertions and
-	 * supress error if fault is encountered during pre-fault.
-	 */
-	for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
-		if (addr == vmf->address)
-			continue;
-
-		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
-			break;
+	switch (order) {
+	case 0:
+		ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+		break;
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+	case PMD_ORDER:
+		ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn + pgoff,
+							     PFN_DEV), false);
+		break;
+#endif
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+	case PUD_ORDER:
+		ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn + pgoff,
+							     PFN_DEV), false);
+		break;
+#endif
+	default:
+		ret = VM_FAULT_FALLBACK;
 	}
 
 out_unlock:
 	up_read(&vdev->memory_lock);
+out:
+	dev_dbg_ratelimited(&vdev->pdev->dev,
+			   "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n",
+			    __func__, order,
+			    vma->vm_pgoff >>
+				(VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT),
+			    pgoff, (unsigned int)ret);
 
 	return ret;
 }
 
+static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
+{
+	return vfio_pci_mmap_huge_fault(vmf, 0);
+}
+
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
-	.fault = vfio_pci_mmap_fault,
+	.fault = vfio_pci_mmap_page_fault,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.huge_fault = vfio_pci_mmap_huge_fault,
+#endif
 };
 
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
-- 
2.45.0



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (18 preceding siblings ...)
  2024-08-26 20:43 ` [PATCH v2 19/19] vfio/pci: Implement huge_fault support Peter Xu
@ 2024-08-27 22:36 ` Jiaqi Yan
  2024-08-27 22:57   ` Peter Xu
  2024-09-09  4:03 ` Ankit Agrawal
  20 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-27 22:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
>   special bits for huge pfn mappings when inject" [JasonG]
> - In gup-fast, reorder _special check v.s. _devmap check, so as to make
>   pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
>   used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on mm-unstable, commit b659edec079c of Aug 26th
> latest, with patch "vma remove the unneeded avc bound with non-CoWed folio"
> reverted, as reported broken [0].
>
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).
>
> One trick here is that we're still unmature on PUDs in generic paths here
> and there, as DAX is so far the only user.  This patchset will add the 2nd
> user of it.  Hugetlb can be a 3rd user if the hugetlb unification work can
> go on smoothly, but to be discussed later.
>
> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping.  This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka,
> follow_pte(), will need to retire soon due to the fact that it only works
> with ptes.  A new set of simple API is introduced (follow_pfnmap* API) to
> be able to do whatever follow_pte() can already do, plus that it can also
> process huge pfnmaps now.  Half of this series is about that and converting
> all existing pfnmap walkers to use the new API properly.  Hopefully the new
> API also looks better to avoid exposing e.g. pgtable lock details into the
> callers, so that it can be used in an even more straightforward way.
>
> Here, three more options will be introduced and involved in huge pfnmap:
>
>   - ARCH_SUPPORTS_HUGE_PFNMAP
>
>     Arch developers will need to select this option when huge pfnmap is
>     supported in arch's Kconfig.  After this patchset applied, both x86_64
>     and arm64 will start to enable it by default.
>
>   - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
>     These options are for driver developers to identify whether current
>     arch / config supports huge pfnmaps, making decision on whether it can
>     use the huge pfnmap APIs to inject them.  One can refer to the last
>     vfio-pci patch from Alex on the use of them properly in a device
>     driver.
>
> So after the whole set applied, and if one would enable some dynamic debug
> lines in vfio-pci core files, we should observe things like:
>
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it says that vfio-pci faults in PMDs properly for a
> few BAR0 offsets.
>
> Patch Layout
> ============
>
> Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
> Patch 2:         A tiny cleanup
> Patch 3-8:       Preparation patches for huge pfnmap (include introduce
>                  special bit for pmd/pud)
> Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
>                  then drop follow_pte() API
> Patch 17:        Add huge pfnmap support for x86_64
> Patch 18:        Add huge pfnmap support for arm64
> Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> have plan to support arm64 1G later on top of this series [2].
>
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.
>

Hi Peter,

I am curious if there is any work needed for unmap_mapping_range? If a
driver hugely remap_pfn_range()ed at 1G granularity, can the driver
unmap at PAGE_SIZE granularity? For example, when handling a PFN that is
poisoned in the 1G mapping, it would be great if the mapping could be
split into 2M mappings + 4k mappings, so only the single poisoned PFN
is lost. (Pretty much like the past proposal* to use HGM** to improve
hugetlb's memory failure handling).

Probably these questions can be answered after reading your code,
which I plan to do, but just want to ask in case you have an easy
answer for me.

* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017

> More driver support
> -------------------
>
> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied.  Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar.  But I don't
> think I know all the drivers to know the full picture.
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e QEMU with 128MB BAR 0, with some prefault test, mprotect()
>   and fork() tests on the bar mapped
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs
>   - Main BAR (8GB) start to use PUD mappings
>   - Sub BAR (??MBs?) start to use PMD mappings
>   - Performance wise, slight improvement comparing to the old PTE mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure driver loads fine with PMD mappings
>
> Credits all go to Alex on help testing the GPU/NIC use cases above.
>
> Comments welcomed, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
>
> Alex Williamson (1):
>   vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
>   mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
>   mm: Drop is_huge_zero_pud()
>   mm: Mark special bits for huge pfn mappings when inject
>   mm: Allow THP orders for PFNMAPs
>   mm/gup: Detect huge pfnmap entries in gup-fast
>   mm/pagewalk: Check pfnmap for folio_walk_start()
>   mm/fork: Accept huge pfnmap entries
>   mm: Always define pxx_pgprot()
>   mm: New follow_pfnmap API
>   KVM: Use follow_pfnmap API
>   s390/pci_mmio: Use follow_pfnmap API
>   mm/x86/pat: Use the new follow_pfnmap API
>   vfio: Use the new follow_pfnmap API
>   acrn: Use the new follow_pfnmap API
>   mm/access_process_vm: Use the new follow_pfnmap API
>   mm: Remove follow_pte()
>   mm/x86: Support large pfn mappings
>   mm/arm64: Support large pfn mappings
>
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/include/asm/pgtable.h    |  30 +++++
>  arch/powerpc/include/asm/pgtable.h  |   1 +
>  arch/s390/include/asm/pgtable.h     |   1 +
>  arch/s390/pci/pci_mmio.c            |  22 ++--
>  arch/sparc/include/asm/pgtable_64.h |   1 +
>  arch/x86/Kconfig                    |   1 +
>  arch/x86/include/asm/pgtable.h      |  80 +++++++-----
>  arch/x86/mm/pat/memtype.c           |  17 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
>  drivers/vfio/vfio_iommu_type1.c     |  16 +--
>  drivers/virt/acrn/mm.c              |  16 +--
>  include/linux/huge_mm.h             |  16 +--
>  include/linux/mm.h                  |  57 ++++++++-
>  include/linux/pgtable.h             |  12 ++
>  mm/Kconfig                          |  13 ++
>  mm/gup.c                            |   6 +
>  mm/huge_memory.c                    |  50 +++++---
>  mm/memory.c                         | 183 ++++++++++++++++++++--------
>  mm/pagewalk.c                       |   4 +-
>  virt/kvm/kvm_main.c                 |  19 ++-
>  21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-27 22:36 ` [PATCH v2 00/19] mm: Support huge pfnmaps Jiaqi Yan
@ 2024-08-27 22:57   ` Peter Xu
  2024-08-28  0:42     ` Jiaqi Yan
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-27 22:57 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> Hi Peter,

Hi, Jiaqi,

> I am curious if there is any work needed for unmap_mapping_range? If a
> driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> unmap at PAGE_SIZE granularity? For example, when handling a PFN is

Yes it can, but it'll invoke the split_huge_pud() which default routes to
removal of the whole pud right now (currently only covers either DAX
mappings or huge pfnmaps; it won't for anonymous if it comes, for example).

In that case it'll rely on the driver providing proper fault() /
huge_fault() to refault things back with smaller sizes later when accessed
again.

> poisoned in the 1G mapping, it would be great if the mapping can be
> splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> is lost. (Pretty much like the past proposal* to use HGM** to improve
> hugetlb's memory failure handling).

Note that we're only talking about MMIO mappings here, in which case the
PFN doesn't even have a struct page, so the whole poison idea shouldn't
apply, afaiu.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-27 22:57   ` Peter Xu
@ 2024-08-28  0:42     ` Jiaqi Yan
  2024-08-28  0:46       ` Jiaqi Yan
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-28  0:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > Hi Peter,
>
> Hi, Jiaqi,
>
> > I am curious if there is any work needed for unmap_mapping_range? If a
> > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
>
> Yes it can, but it'll invoke the split_huge_pud() which default routes to
> removal of the whole pud right now (currently only covers either DAX
> mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
>
> In that case it'll rely on the driver providing proper fault() /
> huge_fault() to refault things back with smaller sizes later when accessed
> again.

I see, so the driver needs to drive the recovery process, and code
needs to be in the driver.

But it seems to me the recovery process will be more or less the same
for different drivers? In that case does it make sense for
memory_failure() to do the common things for all drivers?

Instead of removing the whole pud, can the driver or memory_failure() do
something similar to a non-struct-page version of split_huge_page? So the
driver doesn't need to re-fault good pages back?


>
> > poisoned in the 1G mapping, it would be great if the mapping can be
> > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > hugetlb's memory failure handling).
>
> Note that we're only talking about MMIO mappings here, in which case the
> PFN doesn't even have a struct page, so the whole poison idea shouldn't
> apply, afaiu.

Yes, there won't be any struct page. Ankit proposed this patchset* for
handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
driver adopts your change via new remap_pfn_range (install PMD/PUD
instead of PTE), and memory_failure_pfn still
unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
0), can it somehow just work and no re-fault needed?

* https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t



>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28  0:42     ` Jiaqi Yan
@ 2024-08-28  0:46       ` Jiaqi Yan
  2024-08-28 14:24       ` Jason Gunthorpe
  2024-08-28 14:41       ` Peter Xu
  2 siblings, 0 replies; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-28  0:46 UTC (permalink / raw)
  To: ankita
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

Adding Ankit in case he has opinions.

On Tue, Aug 27, 2024 at 5:42 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > Hi Peter,
> >
> > Hi, Jiaqi,
> >
> > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> >
> > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > removal of the whole pud right now (currently only covers either DAX
> > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> >
> > In that case it'll rely on the driver providing proper fault() /
> > huge_fault() to refault things back with smaller sizes later when accessed
> > again.
>
> I see, so the driver needs to drive the recovery process, and code
> needs to be in the driver.
>
> But it seems to me the recovery process will be more or less the same
> to different drivers? In that case does it make sense that
> memory_failure do the common things for all drivers?
>
> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?
>
>
> >
> > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > hugetlb's memory failure handling).
> >
> > Note that we're only talking about MMIO mappings here, in which case the
> > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > apply, afaiu.
>
> Yes, there won't be any struct page. Ankit proposed this patchset* for
> handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> driver adopts your change via new remap_pfn_range (install PMD/PUD
> instead of PTE), and memory_failure_pfn still
> unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> 0), can it somehow just work and no re-fault needed?
>
> * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
>
>
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-26 20:43 ` [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start() Peter Xu
@ 2024-08-28  7:44   ` David Hildenbrand
  2024-08-28 14:24     ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-28  7:44 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On 26.08.24 22:43, Peter Xu wrote:
> Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
> them properly as it means there's no folio backing them.
> 
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/pagewalk.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index cd79fb3b89e5..12be5222d70e 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -753,7 +753,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>   		fw->pudp = pudp;
>   		fw->pud = pud;
>   
> -		if (!pud_present(pud) || pud_devmap(pud)) {
> +		if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
>   			spin_unlock(ptl);
>   			goto not_found;
>   		} else if (!pud_leaf(pud)) {
> @@ -783,7 +783,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>   		fw->pmdp = pmdp;
>   		fw->pmd = pmd;
>   
> -		if (pmd_none(pmd)) {
> +		if (pmd_none(pmd) || pmd_special(pmd)) {
>   			spin_unlock(ptl);
>   			goto not_found;
>   		} else if (!pmd_leaf(pmd)) {

As raised, this is not the right way to do it. You should follow what
CONFIG_ARCH_HAS_PTE_SPECIAL and vm_normal_page() do.

It's even spelled out in vm_normal_page_pmd() that at the time it was
introduced there was no pmd_special(), so there was no way to handle that.



diff --git a/mm/memory.c b/mm/memory.c
index f0cf5d02b4740..272445e9db147 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -672,15 +672,29 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
  {
         unsigned long pfn = pmd_pfn(pmd);
  
-       /*
-        * There is no pmd_special() but there may be special pmds, e.g.
-        * in a direct-access (dax) mapping, so let's just replicate the
-        * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
-        */
+       if (IS_ENABLED(CONFIG_ARCH_HAS_PMD_SPECIAL)) {
+               if (likely(!pmd_special(pmd)))
+                       goto check_pfn;
+               if (vma->vm_ops && vma->vm_ops->find_special_page)
+                       return vma->vm_ops->find_special_page(vma, addr);
+               if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+                       return NULL;
+               if (is_huge_zero_pmd(pmd))
+                       return NULL;
+               if (pmd_devmap(pmd))
+                       /* See vm_normal_page() */
+                       return NULL;
+               return NULL;
+       }
+
+       /* !CONFIG_ARCH_HAS_PMD_SPECIAL case follows: */
+
         if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                 if (vma->vm_flags & VM_MIXEDMAP) {
                         if (!pfn_valid(pfn))
                                 return NULL;
+                       if (is_huge_zero_pmd(pmd))
+                               return NULL;
                         goto out;
                 } else {
                         unsigned long off;
@@ -692,6 +706,11 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
                 }
         }
  
+       /*
+        * For historical reasons, these might not have pmd_special() set,
+        * so we'll check them manually, in contrast to vm_normal_page().
+        */
+check_pfn:
         if (pmd_devmap(pmd))
                 return NULL;
         if (is_huge_zero_pmd(pmd))



We should then look into mapping huge zeropages also with pmd_special.
pmd_devmap we'll leave alone until removed. But that's independent of your series.

I wonder if CONFIG_ARCH_HAS_PTE_SPECIAL is sufficient and we don't need additional
CONFIG_ARCH_HAS_PMD_SPECIAL.

As I said, if you need someone to add vm_normal_page_pud(), I can handle that.

-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28  7:44   ` David Hildenbrand
@ 2024-08-28 14:24     ` Peter Xu
  2024-08-28 15:30       ` David Hildenbrand
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-28 14:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 09:44:04AM +0200, David Hildenbrand wrote:
> On 26.08.24 22:43, Peter Xu wrote:
> > Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
> > them properly as it means there's no folio backing them.
> > 
> > Cc: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/pagewalk.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index cd79fb3b89e5..12be5222d70e 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -753,7 +753,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> >   		fw->pudp = pudp;
> >   		fw->pud = pud;
> > -		if (!pud_present(pud) || pud_devmap(pud)) {
> > +		if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
> >   			spin_unlock(ptl);
> >   			goto not_found;
> >   		} else if (!pud_leaf(pud)) {
> > @@ -783,7 +783,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> >   		fw->pmdp = pmdp;
> >   		fw->pmd = pmd;
> > -		if (pmd_none(pmd)) {
> > +		if (pmd_none(pmd) || pmd_special(pmd)) {
> >   			spin_unlock(ptl);
> >   			goto not_found;
> >   		} else if (!pmd_leaf(pmd)) {
> 
> As raised, this is not the right way to do it. You should follow what
> CONFIG_ARCH_HAS_PTE_SPECIAL and vm_normal_page() do.
> 
> It's even spelled out in vm_normal_page_pmd() that at the time it was
> introduced there was no pmd_special(), so there was no way to handle that.

I can try to do something like that, but even so it'll be mostly cosmetic
changes, and AFAICT there's no real functional difference.

Meanwhile, see below comment.

> 
> 
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f0cf5d02b4740..272445e9db147 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -672,15 +672,29 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>         unsigned long pfn = pmd_pfn(pmd);
> -       /*
> -        * There is no pmd_special() but there may be special pmds, e.g.
> -        * in a direct-access (dax) mapping, so let's just replicate the
> -        * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> -        */

This one is correct; I overlooked this comment which can be obsolete.  I
can either refine this patch or add one patch on top to refine the comment
at least.

> +       if (IS_ENABLED(CONFIG_ARCH_HAS_PMD_SPECIAL)) {

We don't yet have CONFIG_ARCH_HAS_PMD_SPECIAL, but I get your point.

> +               if (likely(!pmd_special(pmd)))
> +                       goto check_pfn;
> +               if (vma->vm_ops && vma->vm_ops->find_special_page)
> +                       return vma->vm_ops->find_special_page(vma, addr);

Why do we ever need this?  This is so far destined to be totally a waste of
cycles.  I think it's better we leave that until either xen/gntdev.c or any
new driver start to use it, rather than keeping dead code around.

> +               if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +                       return NULL;
> +               if (is_huge_zero_pmd(pmd))
> +                       return NULL;

This is meaningless too until we first make the huge zero pmd carry the
special bit, which does sound like it's outside the scope of this series.

> +               if (pmd_devmap(pmd))
> +                       /* See vm_normal_page() */
> +                       return NULL;

When will it be pmd_devmap() if it's already pmd_special()?

> +               return NULL;

And see this one.. it's after:

  if (xxx)
      return NULL;
  if (yyy)
      return NULL;
  if (zzz)
      return NULL;
  return NULL;

Hmm??  If so, what's the difference if we simply check pmd_special and
return NULL..

> +       }
> +
> +       /* !CONFIG_ARCH_HAS_PMD_SPECIAL case follows: */
> +
>         if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>                 if (vma->vm_flags & VM_MIXEDMAP) {
>                         if (!pfn_valid(pfn))
>                                 return NULL;
> +                       if (is_huge_zero_pmd(pmd))
> +                               return NULL;

I'd rather not touch here as this series doesn't change anything for
MIXEDMAP yet..

>                         goto out;
>                 } else {
>                         unsigned long off;
> @@ -692,6 +706,11 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>                 }
>         }
> +       /*
> +        * For historical reasons, these might not have pmd_special() set,
> +        * so we'll check them manually, in contrast to vm_normal_page().
> +        */
> +check_pfn:
>         if (pmd_devmap(pmd))
>                 return NULL;
>         if (is_huge_zero_pmd(pmd))
> 
> 
> 
> We should then look into mapping huge zeropages also with pmd_special.
> pmd_devmap we'll leave alone until removed. But that's independent of your series.

This does look reasonable to match what we do with pte zeropage.  Could you
remind me what might be the benefit when we switch to using special bit for
pmd zero pages?

> 
> I wonder if CONFIG_ARCH_HAS_PTE_SPECIAL is sufficient and we don't need additional
> CONFIG_ARCH_HAS_PMD_SPECIAL.

The hope is we can always reuse the bit in the pte to work the same for
pmd/pud.

Now we require arch to select ARCH_SUPPORTS_HUGE_PFNMAP to say "pmd/pud has
the same special bit defined".

> 
> As I said, if you need someone to add vm_normal_page_pud(), I can handle that.

I'm pretty confused why we need that for this series alone.

If you prefer vm_normal_page_pud() to be defined and check pud_special()
there, I can do that.  But again, I don't yet see how that can make a
functional difference considering the so far very limited usage of the
special bit, and wonder whether we can do that on top when it becomes
necessary (and when we start to have a functional requirement for it).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28  0:42     ` Jiaqi Yan
  2024-08-28  0:46       ` Jiaqi Yan
@ 2024-08-28 14:24       ` Jason Gunthorpe
  2024-08-28 16:10         ` Jiaqi Yan
  2024-08-28 14:41       ` Peter Xu
  2 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 14:24 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:

> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?

It would be far nicer if we didn't have to poke a hole in a 1G mapping
just for memory failure reporting.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28  0:42     ` Jiaqi Yan
  2024-08-28  0:46       ` Jiaqi Yan
  2024-08-28 14:24       ` Jason Gunthorpe
@ 2024-08-28 14:41       ` Peter Xu
  2024-08-28 16:23         ` Jiaqi Yan
  2 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-28 14:41 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > Hi Peter,
> >
> > Hi, Jiaqi,
> >
> > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> >
> > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > removal of the whole pud right now (currently only covers either DAX
> > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> >
> > In that case it'll rely on the driver providing proper fault() /
> > huge_fault() to refault things back with smaller sizes later when accessed
> > again.
> 
> I see, so the driver needs to drive the recovery process, and code
> needs to be in the driver.
> 
> But it seems to me the recovery process will be more or less the same
> to different drivers? In that case does it make sense that
> memory_failure do the common things for all drivers?
> 
> Instead of removing the whole pud, can driver or memory_failure do
> something similar to non-struct-page-version of split_huge_page? So
> driver doesn't need to re-fault good pages back?

I think we can, it's just that we don't yet have a valid use case.

DAX is definitely fault-able.

While for the new huge pfnmap, currently vfio is the only user, and vfio
only requires either zapping everything or mapping everything.  In that
case there's no real need yet for what you described.  Meanwhile it's also
faultable, so if / when needed it should hopefully still do the work
properly.

I believe it's not a usual requirement for most of the other drivers
either, as most of them don't even support fault() afaiu.
remap_pfn_range() can start to use huge mappings, however I'd expect
they're mostly not ready for random teardown of arbitrary MMIO mappings.

It sounds doable to me, though, when there's a need for what you're
describing, but I don't think I know the use case well enough yet.

> 
> 
> >
> > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > hugetlb's memory failure handling).
> >
> > Note that we're only talking about MMIO mappings here, in which case the
> > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > apply, afaiu.
> 
> Yes, there won't be any struct page. Ankit proposed this patchset* for
> handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> driver adopts your change via new remap_pfn_range (install PMD/PUD
> instead of PTE), and memory_failure_pfn still
> unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> 0), can it somehow just work and no re-fault needed?
> 
> * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t

I see now, interesting.. Thanks for the link.  

In the case of nvgpu usage, one way is to do what you said; we can
enhance the pmd/pud split for pfnmaps, but maybe that's overkill.

I saw that the nvgpu driver will need a fault() anyway so as to detect
poisoned PFNs.  It's also feasible that once nvgrace_gpu_vfio_pci_fault()
supports huge pfnmaps, it'll need to detect whether the whole faulting
range contains any poisoned PFNs, and return FALLBACK if so (rather than
VM_FAULT_HWPOISON).

E.g., when 4K out of a 2M range is poisoned, we'll erase the 2M mapping
completely.  When an access happens, as long as the accessed 4K is not on
top of the poisoned 4K, huge_fault() should still detect that some 4K in
the range is poisoned, so it will not inject a pmd but return FALLBACK;
then in fault() it'll see the accessed 4K range is not poisoned and
install a pte.
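
To make that concrete, a hedged sketch of the idea (illustrative only;
nvgpu_huge_fault(), range_has_poisoned_pfn(), pfn_is_poisoned() and
do_insert_pfn() are made-up names for whatever the driver would implement,
the last one standing in for the vmf_insert_pfn*() dispatch shown in the
vfio patch):

	static vm_fault_t nvgpu_huge_fault(struct vm_fault *vmf,
					   unsigned int order)
	{
		unsigned long len = PAGE_SIZE << order;
		unsigned long start = vmf->address & ~(len - 1);

		/* Any poison in the would-be huge range?  Retry smaller. */
		if (order && range_has_poisoned_pfn(vmf->vma, start, len))
			return VM_FAULT_FALLBACK;

		/* Order 0: only fail if the accessed page itself is bad. */
		if (!order && pfn_is_poisoned(vmf->vma, vmf->address))
			return VM_FAULT_HWPOISON;

		return do_insert_pfn(vmf, order);
	}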

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28 14:24     ` Peter Xu
@ 2024-08-28 15:30       ` David Hildenbrand
  2024-08-28 19:45         ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-28 15:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

> This one is correct; I overlooked this comment which can be obsolete.  I
> can either refine this patch or add one patch on top to refine the comment
> at least.

Probably best if you use what you consider reasonable in your patch.

> 
>> +       if (IS_ENABLED(CONFIG_ARCH_HAS_PMD_SPECIAL)) {
> 
> We don't yet have CONFIG_ARCH_HAS_PMD_SPECIAL, but I get your point.
> 
>> +               if (likely(!pmd_special(pmd)))
>> +                       goto check_pfn;
>> +               if (vma->vm_ops && vma->vm_ops->find_special_page)
>> +                       return vma->vm_ops->find_special_page(vma, addr);
> 
> Why do we ever need this?  This is so far destined to be totally a waste of
> cycles.  I think it's better we leave that until either xen/gntdev.c or any
> new driver start to use it, rather than keeping dead code around.

I just copy-pasted what we had in vm_normal_page() to showcase. If not 
required, good, we can add a comment why this is not required.

> 
>> +               if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>> +                       return NULL;
>> +               if (is_huge_zero_pmd(pmd))
>> +                       return NULL;
> 
> This is meaningless too until we make huge zero pmd apply special bit
> first, which does sound like to be outside the scope of this series.

Again, copy-paste, but ...

> 
>> +               if (pmd_devmap(pmd))
>> +                       /* See vm_normal_page() */
>> +                       return NULL;
> 
> When will it be pmd_devmap() if it's already pmd_special()?
> 
>> +               return NULL;
> 
> And see this one.. it's after:
> 
>    if (xxx)
>        return NULL;
>    if (yyy)
>        return NULL;
>    if (zzz)
>        return NULL;
>    return NULL;
> 
> Hmm??  If so, what's the difference if we simply check pmd_special and
> return NULL..

Yes, they all return NULL. The compiler likely optimizes it all out. 
Maybe we have it like that for pure documentation purposes. But yeah, we 
should simply return NULL and think about cleaning up vm_normal_page() 
as well, it does look strange.

> 
>> +       }
>> +
>> +       /* !CONFIG_ARCH_HAS_PMD_SPECIAL case follows: */
>> +
>>          if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>                  if (vma->vm_flags & VM_MIXEDMAP) {
>>                          if (!pfn_valid(pfn))
>>                                  return NULL;
>> +                       if (is_huge_zero_pmd(pmd))
>> +                               return NULL;
> 
> I'd rather not touch here as this series doesn't change anything for
> MIXEDMAP yet..

Yes, that can be a separate change.

> 
>>                          goto out;
>>                  } else {
>>                          unsigned long off;
>> @@ -692,6 +706,11 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>                  }
>>          }
>> +       /*
>> +        * For historical reasons, these might not have pmd_special() set,
>> +        * so we'll check them manually, in contrast to vm_normal_page().
>> +        */
>> +check_pfn:
>>          if (pmd_devmap(pmd))
>>                  return NULL;
>>          if (is_huge_zero_pmd(pmd))
>>
>>
>>
>> We should then look into mapping huge zeropages also with pmd_special.
>> pmd_devmap we'll leave alone until removed. But that's independent of your series.
> 
> This does look reasonable to match what we do with pte zeropage.  Could you
> remind me what might be the benefit when we switch to using special bit for
> pmd zero pages?

See below. It's the way to tell the VM that a page is special, so you 
can avoid a separate check at relevant places, like GUP-fast or in 
vm_normal_*.

> 
>>
>> I wonder if CONFIG_ARCH_HAS_PTE_SPECIAL is sufficient and we don't need additional
>> CONFIG_ARCH_HAS_PMD_SPECIAL.
> 
> The hope is we can always reuse the bit in the pte to work the same for
> pmd/pud.
> 
> Now we require arch to select ARCH_SUPPORTS_HUGE_PFNMAP to say "pmd/pud has
> the same special bit defined".

Note that pte_special() is the way to signalize to the VM that a PTE 
does not reference a refcounted page, or is similarly special and shall 
mostly be ignored. It doesn't imply that it is a PFNMAP pte, not at all.

The shared zeropage is usually not refcounted (except during GUP 
FOLL_GET ... but not FOLL_PIN) and the huge zeropage is usually also not 
refcounted (but FOLL_PIN still does it). Both are special.


If you take a look at the history of pte_special(), it was introduced for 
VM_MIXEDMAP handling on s390x, because using pfn_valid() to identify "special" 
pages did not work:

commit 7e675137a8e1a4d45822746456dd389b65745bf6
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Mon Apr 28 02:13:00 2008 -0700

     mm: introduce pte_special pte bit


In the meantime, it's required for architectures that want to support 
GUP-fast I think, to make GUP-fast bail out and fall back to the slow 
path where we do a vm_normal_page() -- or fail right at the VMA check 
for now (VM_PFNMAP).

An architecture that doesn't implement pte_special() can support pfnmaps 
but not GUP-fast. Similarly, an architecture that doesn't implement 
pmd_special() can support huge pfnmaps, but not GUP-fast.
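
In code terms the rule is roughly the following -- a simplified sketch, not
the literal mm/gup.c code:

/*
 * Sketch: GUP-fast has no vma to consult, so the special bit in the entry
 * itself is the only hint that no refcounted struct page sits behind the
 * mapping.
 */
static bool gup_fast_pmd_allowed_sketch(pmd_t pmd)
{
	if (!pmd_present(pmd))
		return false;
	if (pmd_special(pmd))
		return false;	/* e.g. a huge pfnmap: bail out to GUP-slow */
	return true;		/* safe to try grabbing the folio refcount */
}

The same check exists for pte_special() at the pte level, and would apply to
pud_special() at the pud level.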

If you take a closer look, really the only two code paths that look at 
pte_special() are GUP-fast and vm_normal_page().

If we use pmd_special/pud_special in other code than that, we are 
diverging from the pte_special() model, and are likely doing something 
wrong.

I see how you arrived at the current approach, focusing exclusively on 
x86. But I think this just adds inconsistency.

So my point is that we use the same model, where we limit

* pmd_special() to GUP-fast and vm_normal_page_pmd()
* pud_special() to GUP-fast and vm_normal_page_pud()

And simply do the exact same thing as we do for pte_special().

If an arch supports pmd_special() and pud_special() we can support both 
types of hugepfn mappings. If not, an architecture *might* support it, 
depending on support for GUP-fast and maybe depending on MIXEDMAP 
support (again, just like pte_special()). Not your task to worry about, 
you will only "unlock" x86.

So maybe we do want CONFIG_ARCH_HAS_PMD_SPECIAL as well, maybe it can be 
glued to CONFIG_ARCH_HAS_PTE_SPECIAL (but I'm afraid it can't unless all 
archs support both). I'll leave that up to you.
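
For illustration, a vm_normal_page_pud() following that model could be as
small as below.  This is a sketch only, mirroring how the special bit is
treated at the pte/pmd level; highest_memmap_pfn is the existing sanity
check from mm/internal.h that vm_normal_page() already uses, and
pud_special() is what this series introduces.

struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
				pud_t pud)
{
	unsigned long pfn = pud_pfn(pud);

	if (unlikely(pud_special(pud)))
		return NULL;	/* huge pfnmap, or anything else special */
	if (unlikely(pfn > highest_memmap_pfn))
		return NULL;
	return pfn_to_page(pfn);
}

GUP-slow and other pud walkers would then reject huge pfnmaps the same way
they reject special ptes today.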

> 
>>
>> As I said, if you need someone to add vm_normal_page_pud(), I can handle that.
> 
> I'm pretty confused why we need that for this series alone.

See above.

> 
> If you prefer vm_normal_page_pud() to be defined and check pud_special()
> there, I can do that.  But again, I don't yet see how that can make a
> functional difference considering the so far very limited usage of the
> special bit, and wonder whether we can do that on top when it became
> necessary (and when we start to have functional requirement of such).

I hope my explanation why pte_special() even exists and how it is used 
makes it clearer.

It's not that much code to handle it like pte_special(), really. I don't 
expect you to teach GUP-slow about vm_normal_page() etc.

If you want me to just takeover some stuff, let me know.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject
  2024-08-26 20:43 ` [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
@ 2024-08-28 15:31   ` David Hildenbrand
  0 siblings, 0 replies; 69+ messages in thread
From: David Hildenbrand @ 2024-08-28 15:31 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On 26.08.24 22:43, Peter Xu wrote:
> We need these special bits to be around on pfnmaps.  Mark properly for
> !devmap case, reflecting that there's no page struct backing the entry.
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/huge_memory.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3f74b09ada38..dec17d09390f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1346,6 +1346,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>   	if (pfn_t_devmap(pfn))
>   		entry = pmd_mkdevmap(entry);
> +	else
> +		entry = pmd_mkspecial(entry);
>   	if (write) {
>   		entry = pmd_mkyoung(pmd_mkdirty(entry));
>   		entry = maybe_pmd_mkwrite(entry, vma);
> @@ -1442,6 +1444,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>   	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>   	if (pfn_t_devmap(pfn))
>   		entry = pud_mkdevmap(entry);
> +	else
> +		entry = pud_mkspecial(entry);
>   	if (write) {
>   		entry = pud_mkyoung(pud_mkdirty(entry));
>   		entry = maybe_pud_mkwrite(entry, vma);

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs
  2024-08-26 20:43 ` [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
@ 2024-08-28 15:31   ` David Hildenbrand
  0 siblings, 0 replies; 69+ messages in thread
From: David Hildenbrand @ 2024-08-28 15:31 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson,
	Matthew Wilcox, Ryan Roberts

On 26.08.24 22:43, Peter Xu wrote:
> This enables PFNMAPs to be mapped at either pmd/pud layers.  Generalize the
> dax case into vma_is_special_huge() so as to cover both.  Meanwhile, rename
> the macro to THP_ORDERS_ALL_SPECIAL.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28 14:24       ` Jason Gunthorpe
@ 2024-08-28 16:10         ` Jiaqi Yan
  2024-08-28 23:49           ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-28 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
>
> > Instead of removing the whole pud, can driver or memory_failure do
> > something similar to non-struct-page-version of split_huge_page? So
> > driver doesn't need to re-fault good pages back?
>
> It would be far nicer if we didn't have to poke a hole in a 1G mapping
> just for memory failure reporting.

If I follow this, which of the following sounds better? 1. remove pud
and rely on the driver to re-fault PFNs that it knows are not poisoned
(what Peter suggested), or 2. keep the pud and allow access to both
good and bad PFNs.

Or provide some knob (configured by ?) so that kernel + driver can
switch between the two?

>
> Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28 14:41       ` Peter Xu
@ 2024-08-28 16:23         ` Jiaqi Yan
  0 siblings, 0 replies; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-28 16:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 7:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > On Tue, Aug 27, 2024 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 03:36:07PM -0700, Jiaqi Yan wrote:
> > > > Hi Peter,
> > >
> > > Hi, Jiaqi,
> > >
> > > > I am curious if there is any work needed for unmap_mapping_range? If a
> > > > driver hugely remap_pfn_range()ed at 1G granularity, can the driver
> > > > unmap at PAGE_SIZE granularity? For example, when handling a PFN is
> > >
> > > Yes it can, but it'll invoke the split_huge_pud() which default routes to
> > > removal of the whole pud right now (currently only covers either DAX
> > > mappings or huge pfnmaps; it won't for anonymous if it comes, for example).
> > >
> > > In that case it'll rely on the driver providing proper fault() /
> > > huge_fault() to refault things back with smaller sizes later when accessed
> > > again.
> >
> > I see, so the driver needs to drive the recovery process, and code
> > needs to be in the driver.
> >
> > But it seems to me the recovery process will be more or less the same
> > to different drivers? In that case does it make sense that
> > memory_failure do the common things for all drivers?
> >
> > Instead of removing the whole pud, can driver or memory_failure do
> > something similar to non-struct-page-version of split_huge_page? So
> > driver doesn't need to re-fault good pages back?
>
> I think we can, it's just that we don't yet have a valid use case.
>
> DAX is definitely fault-able.
>
> While for the new huge pfnmap, currently vfio is the only user, and vfio
> only requires to either zap all or map all.  In that case there's no real
> need to ask for what you described yet.  Meanwhile it's also faultable, so
> if / when needed it should hopefully still do the work properly.
>
> I believe it's not usual requirement too for most of the rest drivers, as
> most of them don't even support fault() afaiu. remap_pfn_range() can start
> to use huge mappings, however I'd expect they're mostly not ready for
> random tearing down of any MMIO mappings.
>
> It sounds doable to me though when there's a need of what you're
> describing, but I don't think I know well on the use case yet.
>
> >
> >
> > >
> > > > poisoned in the 1G mapping, it would be great if the mapping can be
> > > > splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
> > > > is lost. (Pretty much like the past proposal* to use HGM** to improve
> > > > hugetlb's memory failure handling).
> > >
> > > Note that we're only talking about MMIO mappings here, in which case the
> > > PFN doesn't even have a struct page, so the whole poison idea shouldn't
> > > apply, afaiu.
> >
> > Yes, there won't be any struct page. Ankit proposed this patchset* for
> > handling poisoning. I wonder if someday the vfio-nvgrace-gpu-pci
> > driver adopts your change via new remap_pfn_range (install PMD/PUD
> > instead of PTE), and memory_failure_pfn still
> > unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT, PAGE_SIZE,
> > 0), can it somehow just work and no re-fault needed?
> >
> > * https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
>
> I see now, interesting.. Thanks for the link.
>
> In that case of nvgpu usage, one way is to do as what you said; we can
> enhance the pmd/pud split for pfnmap, but maybe that's an overkill.

Yeah, just want a poke to see if splitting pmd/pud is some low-hanging fruit.

>
> I saw that the nvgpu will need a fault() anyway so as to detect poisoned
> PFNs, then it's also feasible that in the new nvgrace_gpu_vfio_pci_fault()
> when it supports huge pfnmaps it'll need to try to detect whether the whole
> faulting range contains any poisoned PFNs, then provide FALLBACK if so
> (rather than VM_FAULT_HWPOISON).
>
> E.g., when 4K of 2M is poisoned, we'll erase the 2M completely.  When
> access happens, as long as the accessed 4K is not on top of the poisoned
> 4k, huge_fault() should still detect that there's a 4k range poisoned, then
> it'll not inject pmd but return FALLBACK, then in the fault() it'll see
> the accessed 4k range is not poisoned, then install a pte.

Thanks for illustrating the re-fault flow again. I think this should
work well for drivers (with large MMIO sizes) that care about memory
errors. We can put the pmd/pud split idea on the backlog and see if it is
needed in the future.

>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28 15:30       ` David Hildenbrand
@ 2024-08-28 19:45         ` Peter Xu
  2024-08-28 23:46           ` Jason Gunthorpe
  2024-08-29 15:10           ` David Hildenbrand
  0 siblings, 2 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-28 19:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 05:30:43PM +0200, David Hildenbrand wrote:
> > This one is correct; I overlooked this comment which can be obsolete.  I
> > can either refine this patch or add one patch on top to refine the comment
> > at least.
> 
> Probably best if you use what you consider reasonable in your patch.
> 
> > 
> > > +       if (IS_ENABLED(CONFIG_ARCH_HAS_PMD_SPECIAL)) {
> > 
> > We don't yet have CONFIG_ARCH_HAS_PMD_SPECIAL, but I get your point.
> > 
> > > +               if (likely(!pmd_special(pmd)))
> > > +                       goto check_pfn;
> > > +               if (vma->vm_ops && vma->vm_ops->find_special_page)
> > > +                       return vma->vm_ops->find_special_page(vma, addr);
> > 
> > Why do we ever need this?  This is so far destined to be totally a waste of
> > cycles.  I think it's better we leave that until either xen/gntdev.c or any
> > new driver start to use it, rather than keeping dead code around.
> 
> I just copy-pasted what we had in vm_normal_page() to showcase. If not
> required, good, we can add a comment why this is not required.
> 
> > 
> > > +               if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +                       return NULL;
> > > +               if (is_huge_zero_pmd(pmd))
> > > +                       return NULL;
> > 
> > This is meaningless too until we make huge zero pmd apply special bit
> > first, which does sound like to be outside the scope of this series.
> 
> Again, copy-paste, but ...
> 
> > 
> > > +               if (pmd_devmap(pmd))
> > > +                       /* See vm_normal_page() */
> > > +                       return NULL;
> > 
> > When will it be pmd_devmap() if it's already pmd_special()?
> > 
> > > +               return NULL;
> > 
> > And see this one.. it's after:
> > 
> >    if (xxx)
> >        return NULL;
> >    if (yyy)
> >        return NULL;
> >    if (zzz)
> >        return NULL;
> >    return NULL;
> > 
> > Hmm??  If so, what's the difference if we simply check pmd_special and
> > return NULL..
> 
> Yes, they all return NULL. The compiler likely optimizes it all out. Maybe
> we have it like that for pure documentation purposes. But yeah, we should
> simply return NULL and think about cleaning up vm_normal_page() as well, it
> does look strange.
> 
> > 
> > > +       }
> > > +
> > > +       /* !CONFIG_ARCH_HAS_PMD_SPECIAL case follows: */
> > > +
> > >          if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> > >                  if (vma->vm_flags & VM_MIXEDMAP) {
> > >                          if (!pfn_valid(pfn))
> > >                                  return NULL;
> > > +                       if (is_huge_zero_pmd(pmd))
> > > +                               return NULL;
> > 
> > I'd rather not touch here as this series doesn't change anything for
> > MIXEDMAP yet..
> 
> Yes, that can be a separate change.
> 
> > 
> > >                          goto out;
> > >                  } else {
> > >                          unsigned long off;
> > > @@ -692,6 +706,11 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >                  }
> > >          }
> > > +       /*
> > > +        * For historical reasons, these might not have pmd_special() set,
> > > +        * so we'll check them manually, in contrast to vm_normal_page().
> > > +        */
> > > +check_pfn:
> > >          if (pmd_devmap(pmd))
> > >                  return NULL;
> > >          if (is_huge_zero_pmd(pmd))
> > > 
> > > 
> > > 
> > > We should then look into mapping huge zeropages also with pmd_special.
> > > pmd_devmap we'll leave alone until removed. But that's independent of your series.
> > 
> > This does look reasonable to match what we do with pte zeropage.  Could you
> > remind me what might be the benefit when we switch to using special bit for
> > pmd zero pages?
> 
> See below. It's the way to tell the VM that a page is special, so you can
> avoid a separate check at relevant places, like GUP-fast or in vm_normal_*.
> 
> > 
> > > 
> > > I wonder if CONFIG_ARCH_HAS_PTE_SPECIAL is sufficient and we don't need additional
> > > CONFIG_ARCH_HAS_PMD_SPECIAL.
> > 
> > The hope is we can always reuse the bit in the pte to work the same for
> > pmd/pud.
> > 
> > Now we require arch to select ARCH_SUPPORTS_HUGE_PFNMAP to say "pmd/pud has
> > the same special bit defined".
> 
> Note that pte_special() is the way to signalize to the VM that a PTE does
> not reference a refcounted page, or is similarly special and shall mostly be
> ignored. It doesn't imply that it is a PFNMAP pte, not at all.

Right, it's just that this patch started with having the pmd/pud special bit
solely used for pfnmaps so far.  I'd agree, again, that it makes sense to
keep it consistent with pte's in the longer run, but that'll
need to be done step by step, and tested properly on each of the goals
(e.g. when extend that to zeropage pmd).

> 
> The shared zeropage is usually not refcounted (except during GUP FOLL_GET
> ... but not FOLL_PIN) and the huge zeropage is usually also not refcounted
> (but FOLL_PIN still does it). Both are special.
> 
> 
> If you take a look at the history pte_special(), it was introduced for
> VM_MIXEDMAP handling on s390x, because pfn_valid() to identify "special"
> pages did not work:
> 
> commit 7e675137a8e1a4d45822746456dd389b65745bf6
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Mon Apr 28 02:13:00 2008 -0700
> 
>     mm: introduce pte_special pte bit
> 
> 
> In the meantime, it's required for architectures that want to support
> GUP-fast I think, to make GUP-fast bail out and fall back to the slow path
> where we do a vm_normal_page() -- or fail right at the VMA check for now
> (VM_PFNMAP).

I wonder whether pfn_valid() would work for the archs that do not support
pte_special but want to enable gup-fast.

Meanwhile I'm actually not 100% sure pte_special is only needed in
gup-fast.  See vm_normal_page() and for VM_PFNMAP when pte_special bit is
not defined:

		} else {
			unsigned long off;
			off = (addr - vma->vm_start) >> PAGE_SHIFT;
			if (pfn == vma->vm_pgoff + off) <------------------ [1]
				return NULL;
			if (!is_cow_mapping(vma->vm_flags))
				return NULL;
		}

I suspect things can go wrong when there's an assumption on vm_pgoff [1].  At
least vfio-pci isn't storing the base PFN in vm_pgoff, so this check will go
wrong on any arch where pte_special is not supported but vfio-pci is present.
I suspect more drivers can break it.

So I wonder if it's really the case in real life that only gup-fast would
need the special bit.  It could be that we thought it like that, but nobody
has really seriously tried running it without the special bit yet to see what breaks.

This series so far limits huge pfnmap to the special bits; that makes me
feel safer as a starting point.

> 
> An architecture that doesn't implement pte_special() can support pfnmaps but
> not GUP-fast. Similarly, an architecture that doesn't implement
> pmd_special() can support huge pfnmaps, but not GUP-fast.
> 
> If you take a closer look, really the only two code paths that look at
> pte_special() are GUP-fast and vm_normal_page().
> 
> If we use pmd_special/pud_special in other code than that, we are diverging
> from the pte_special() model, and are likely doing something wrong.
> 
> I see how you arrived at the current approach, focusing exclusively on x86.
> But I think this just adds inconsistency.

Hmm, that's definitely not what I wanted to express..

IMHO it's that our current code base has very limited use of larger
mappings, especially pud, so even if I try to create the so-called
vm_normal_page_pud() to match the pte one, it'll mostly only contain the pud
special bit test.

We could add some pfn_valid() checks (even though I know of no arch to
support that has !special but relies on pfn_valid.. and nowhere I can test
it at all), process vm_ops->find_special_page() even if I know nobody is
using it, and so on (obviously there is no pud zeropage, so nothing to copy
over there).. just trying to match vm_normal_page().

But so far they're all redundant, and I prefer not adding redundant or dead
code; as simple as that..  It makes more sense to me to stick with what we
know will work and go from there; then we can add things by justifying them
properly step by step later.

We indeed already have vm_normal_page_pmd(), please see below.

> 
> So my point is that we use the same model, where we limit
> 
> * pmd_special() to GUP-fast and vm_normal_page_pmd()
> * pud_special() to GUP-fast and vm_normal_page_pud()
> 
> And simply do the exact same thing as we do for pte_special().
> 
> If an arch supports pmd_special() and pud_special() we can support both
> types of hugepfn mappings. If not, an architecture *might* support it,
> depending on support for GUP-fast and maybe depending on MIXEDMAP support
> (again, just like pte_special()). Not your task to worry about, you will
> only "unlock" x86.

And arm64 2M.  Yes I think I'd better leave the rest to others if I have
totally no idea how to even test them..  Even for the current arm64 bits it
was Alex helping to test, as I don't really have the hardware on hand.

> 
> So maybe we do want CONFIG_ARCH_HAS_PMD_SPECIAL as well, maybe it can be
> glued to CONFIG_ARCH_HAS_PTE_SPECIAL (but I'm afraid it can't unless all
> archs support both). I'll leave that up to you.
> 
> > 
> > > 
> > > As I said, if you need someone to add vm_normal_page_pud(), I can handle that.
> > 
> > I'm pretty confused why we need that for this series alone.
> 
> See above.
> 
> > 
> > If you prefer vm_normal_page_pud() to be defined and check pud_special()
> > there, I can do that.  But again, I don't yet see how that can make a
> > functional difference considering the so far very limited usage of the
> > special bit, and wonder whether we can do that on top when it became
> > necessary (and when we start to have functional requirement of such).
> 
> I hope my explanation why pte_special() even exists and how it is used makes
> it clearer.
> 
> It's not that much code to handle it like pte_special(), really. I don't
> expect you to teach GUP-slow about vm_normal_page() etc.

One thing I can do here is I move the pmd_special() check into the existing
vm_normal_page_pmd(), then it'll be a fixup on top of this patch:

===8<===
diff --git a/mm/memory.c b/mm/memory.c
index 288f81a8698e..42674c0748cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -672,11 +672,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	/*
-	 * There is no pmd_special() but there may be special pmds, e.g.
-	 * in a direct-access (dax) mapping, so let's just replicate the
-	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
-	 */
+	/* Currently it's only used for huge pfnmaps */
+	if (unlikely(pmd_special(pmd)))
+		return NULL;
+
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
 			if (!pfn_valid(pfn))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 12be5222d70e..461ea3bbd8d9 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -783,7 +783,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		fw->pmdp = pmdp;
 		fw->pmd = pmd;
 
-		if (pmd_none(pmd) || pmd_special(pmd)) {
+		if (pmd_none(pmd)) {
 			spin_unlock(ptl);
 			goto not_found;
 		} else if (!pmd_leaf(pmd)) {
-- 
2.45.0
===8<===

Would that look better to you?

> 
> If you want me to just takeover some stuff, let me know.

Do you mean sending something on top of this?  I suppose any of us is free
to do so, so please go ahead if it's the case.

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28 19:45         ` Peter Xu
@ 2024-08-28 23:46           ` Jason Gunthorpe
  2024-08-29  6:35             ` David Hildenbrand
  2024-08-29 15:10           ` David Hildenbrand
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 23:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Andrew Morton, Paolo Bonzini,
	Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 03:45:49PM -0400, Peter Xu wrote:

> Meanwhile I'm actually not 100% sure pte_special is only needed in
> gup-fast.  See vm_normal_page() and for VM_PFNMAP when pte_special bit is
> not defined:
> 
> 		} else {
> 			unsigned long off;
> 			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> 			if (pfn == vma->vm_pgoff + off) <------------------ [1]
> 				return NULL;
> 			if (!is_cow_mapping(vma->vm_flags))
> 				return NULL;
> 		}
> 
> I suspect things can go wrong when there's assumption on vm_pgoff [1].  At
> least vfio-pci isn't storing vm_pgoff for the base PFN, so this check will
> go wrong when pte_special is not supported on any arch but when vfio-pci is
> present.  I suspect more drivers can break it.

I think that is a very important point.

IIRC this was done magically in one of the ioremap pfns type calls,
and if VFIO is using fault instead it won't do it.

This probably needs more hand holding for the driver somehow..

> So I wonder if it's really the case in real life that only gup-fast would
> need the special bit.  It could be that we thought it like that, but nobody
> really seriously tried run it without special bit yet to see things broke.

Indeed.

What arches even use the whole 'special but not special' system?

Can we start banning some of this stuff on non-special arches?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28 16:10         ` Jiaqi Yan
@ 2024-08-28 23:49           ` Jason Gunthorpe
  2024-08-29 19:21             ` Jiaqi Yan
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 23:49 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> >
> > > Instead of removing the whole pud, can driver or memory_failure do
> > > something similar to non-struct-page-version of split_huge_page? So
> > > driver doesn't need to re-fault good pages back?
> >
> > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > just for memory failure reporting.
> 
> If I follow this, which of the following sounds better? 1. remove pud
> and rely on the driver to re-fault PFNs that it knows are not poisoned
> (what Peter suggested), or 2. keep the pud and allow access to both
> good and bad PFNs.

In practice I think people will need 2, as breaking up a 1G mapping
just because a few bits are bad will destroy the VM performance.

For this the expectation would be for the VM to co-operate and not
keep causing memory failures, or perhaps for the platform to spare in
good memory somehow.

> Or provide some knob (configured by ?) so that kernel + driver can
> switch between the two?

This is also sounding reasonable, especially if we need some
alternative protocol to signal userspace about the failed memory
besides fault and SIGBUS.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28 23:46           ` Jason Gunthorpe
@ 2024-08-29  6:35             ` David Hildenbrand
  2024-08-29 18:45               ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-29  6:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

On 29.08.24 01:46, Jason Gunthorpe wrote:
> On Wed, Aug 28, 2024 at 03:45:49PM -0400, Peter Xu wrote:
> 
>> Meanwhile I'm actually not 100% sure pte_special is only needed in
>> gup-fast.  See vm_normal_page() and for VM_PFNMAP when pte_special bit is
>> not defined:
>>
>> 		} else {
>> 			unsigned long off;
>> 			off = (addr - vma->vm_start) >> PAGE_SHIFT;
>> 			if (pfn == vma->vm_pgoff + off) <------------------ [1]
>> 				return NULL;
>> 			if (!is_cow_mapping(vma->vm_flags))
>> 				return NULL;
>> 		}
>>
>> I suspect things can go wrong when there's assumption on vm_pgoff [1].  At
>> least vfio-pci isn't storing vm_pgoff for the base PFN, so this check will
>> go wrong when pte_special is not supported on any arch but when vfio-pci is
>> present.  I suspect more drivers can break it.

Fortunately, we did an excellent job at documenting vm_normal_page():

  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
  * pte bit, in which case this function is trivial. Secondly, an architecture
  * may not have a spare pte bit, which requires a more complicated scheme,
  * described below.
  *
  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
  * special mapping (even if there are underlying and valid "struct pages").
  * COWed pages of a VM_PFNMAP are always normal.
  *
  * The way we recognize COWed pages within VM_PFNMAP mappings is through the
  * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
  * set, and the vm_pgoff will point to the first PFN mapped: thus every special
  * mapping will always honor the rule
  *
  *	pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
  *
  * And for normal mappings this is false.
  *

remap_pfn_range_notrack() will currently handle that for us:

if (is_cow_mapping(vma->vm_flags)) {
	if (addr != vma->vm_start || end != vma->vm_end)
		return -EINVAL;
}

Even if [1] were to succeed, the is_cow_mapping() check would make vm_normal_page()
return NULL, and it would all work as expected, even without pte_special().

Because VM_PFNMAP is easy: in a !COW mapping, everything is special.

> 
> I think that is a very important point.
> 
> IIRC this was done magically in one of the ioremap pfns type calls,
> and if VFIO is using fault instead it won't do it.
> 
> This probably needs more hand holding for the driver somehow..

As long as these drivers don't support COW-mappings. It's all good.

And IIUC, we cannot support COW mappings if we don't use remap_pfn_range().

For this reason, remap_pfn_range() also bails out in a COW mapping if the range
does not cover the whole VMA.

It would be great if we could detect and fail that. Likely when trying to insert
PFNs (*not* using remap_pfn_range) manually we would have to WARN if we stumble over
a COW mapping.

In the meantime, we should really avoid any new VM_PFNMAP COW users ...
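
Something along these lines, maybe -- just a sketch of the kind of guard
meant here, not existing code; is_cow_mapping() is the existing helper in
include/linux/mm.h, the function name is made up:

/*
 * Sketch of a guard for manual PFN-insertion paths (e.g. the vmf_insert_pfn*
 * style helpers): a driver faulting PFNs into a COW-able VM_PFNMAP mapping
 * cannot honor the vm_pgoff rule quoted above, so refuse loudly instead of
 * silently confusing vm_normal_page().
 */
static bool pfn_insert_allowed_sketch(struct vm_area_struct *vma)
{
	/* is_cow_mapping(): (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE */
	if (WARN_ON_ONCE((vma->vm_flags & VM_PFNMAP) &&
			 is_cow_mapping(vma->vm_flags)))
		return false;
	return true;
}

remap_pfn_range_notrack() already enforces the whole-VMA rule quoted above;
this would extend a similar refusal to fault-time insertion.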

> 
>> So I wonder if it's really the case in real life that only gup-fast would
>> need the special bit.  It could be that we thought it like that, but nobody
>> really seriously tried run it without special bit yet to see things broke.
> 
> Indeed.

VM_PFNMAP for sure works.

VM_MIXEDMAP, I am not so sure. The s390x introduction of pte_special() [again,
I posted the commit] explained why they needed it: because pfn_valid() could have
returned non-refcounted pages. One would have to dig into whether that is still
the case as of today, and whether other architectures have similar constraints.


> 
> What arches even use the whole 'special but not special' system?
> 
> Can we start banning some of this stuff on non-special arches?

Again, VM_PFNMAP is not a problem. Only VM_MIXEDMAP, and I would love to
see that go. There are some, but not that many users ... but I'm afraid it's
not that easy :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-28 19:45         ` Peter Xu
  2024-08-28 23:46           ` Jason Gunthorpe
@ 2024-08-29 15:10           ` David Hildenbrand
  2024-08-29 18:49             ` Peter Xu
  1 sibling, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-29 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

>>>
>>> If you prefer vm_normal_page_pud() to be defined and check pud_special()
>>> there, I can do that.  But again, I don't yet see how that can make a
>>> functional difference considering the so far very limited usage of the
>>> special bit, and wonder whether we can do that on top when it became
>>> necessary (and when we start to have functional requirement of such).
>>
>> I hope my explanation why pte_special() even exists and how it is used makes
>> it clearer.
>>
>> It's not that much code to handle it like pte_special(), really. I don't
>> expect you to teach GUP-slow about vm_normal_page() etc.
> 
> One thing I can do here is I move the pmd_special() check into the existing
> vm_normal_page_pmd(), then it'll be a fixup on top of this patch:
> 
> ===8<===
> diff --git a/mm/memory.c b/mm/memory.c
> index 288f81a8698e..42674c0748cb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -672,11 +672,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>   {
>   	unsigned long pfn = pmd_pfn(pmd);
>   
> -	/*
> -	 * There is no pmd_special() but there may be special pmds, e.g.
> -	 * in a direct-access (dax) mapping, so let's just replicate the
> -	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> -	 */
> +	/* Currently it's only used for huge pfnmaps */
> +	if (unlikely(pmd_special(pmd)))
> +		return NULL;


Better.

I'd appreciate a vm_normal_page_pud(), but I guess I have to be the one 
cleaning up the mess after you.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
@ 2024-08-29 15:10   ` David Hildenbrand
  2024-08-29 18:26     ` Peter Xu
  2024-09-02  7:58   ` Yan Zhao
  1 sibling, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-29 15:10 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Gavin Shan, Catalin Marinas, x86, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Jason Gunthorpe, Borislav Petkov, Zi Yan, Axel Rasmussen,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On 26.08.24 22:43, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
> 
> Do the same for pmd level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>   1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2c314f631f3..15418ffdd377 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   	pgtable_t pgtable = NULL;
>   	int ret = -ENOMEM;
>   
> +	pmd = pmdp_get_lockless(src_pmd);
> +	if (unlikely(pmd_special(pmd))) {

I assume I have to clean up your mess here as well?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-29 15:10   ` David Hildenbrand
@ 2024-08-29 18:26     ` Peter Xu
  2024-08-29 19:44       ` David Hildenbrand
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-08-29 18:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
> On 26.08.24 22:43, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> > 
> > Do the same for pmd level.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> >   1 file changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2c314f631f3..15418ffdd377 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >   	pgtable_t pgtable = NULL;
> >   	int ret = -ENOMEM;
> > +	pmd = pmdp_get_lockless(src_pmd);
> > +	if (unlikely(pmd_special(pmd))) {
> 
> I assume I have to clean up your mess here as well?

Can you leave a meaningful and explicit comment?  I'll try to address it.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-29  6:35             ` David Hildenbrand
@ 2024-08-29 18:45               ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-29 18:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Andrew Morton, Paolo Bonzini,
	Dave Hansen, Thomas Gleixner, Alistair Popple, kvm,
	linux-arm-kernel, Sean Christopherson, Oscar Salvador,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Thu, Aug 29, 2024 at 08:35:49AM +0200, David Hildenbrand wrote:
> Fortunately, we did an excellent job at documenting vm_normal_page():
> 
>  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>  * pte bit, in which case this function is trivial. Secondly, an architecture
>  * may not have a spare pte bit, which requires a more complicated scheme,
>  * described below.
>  *
>  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>  * special mapping (even if there are underlying and valid "struct pages").
>  * COWed pages of a VM_PFNMAP are always normal.
>  *
>  * The way we recognize COWed pages within VM_PFNMAP mappings is through the
>  * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
>  * set, and the vm_pgoff will point to the first PFN mapped: thus every special
>  * mapping will always honor the rule
>  *
>  *	pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
>  *
>  * And for normal mappings this is false.
>  *
> 
> remap_pfn_range_notrack() will currently handle that for us:
> 
> if (is_cow_mapping(vma->vm_flags)) {
> 	if (addr != vma->vm_start || end != vma->vm_end)
> 		return -EINVAL;
> }
> 
> Even if [1] would succeed, the is_cow_mapping() check will return NULL and it will
> all work as expected, even without pte_special().

IMHO referencing vm_pgoff is ambiguous, and could be wrong without a
clear contract.

For example, consider a driver that sets up a MAP_PRIVATE + VM_PFNMAP vma
with vm_pgoff being not the "base PFN" but some random value.  Then for a
COWed page it's possible the calculation accidentally satisfies "pfn ==
vma->vm_pgoff + off", and it could wrongly return NULL rather than the
COWed anonymous page here.  This is extremely unlikely, but it shows
why it's wrong to reference vm_pgoff at all.

> 
> Because VM_PFNMAP is easy: in a !COW mapping, everything is special.

Yes it's safe for vfio-pci, as vfio-pci doesn't have private mappings.  But
still, I don't think it's clear enough now how VM_PFNMAP should be
mapped.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start()
  2024-08-29 15:10           ` David Hildenbrand
@ 2024-08-29 18:49             ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-29 18:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Thu, Aug 29, 2024 at 05:10:15PM +0200, David Hildenbrand wrote:
> > > > 
> > > > If you prefer vm_normal_page_pud() to be defined and check pud_special()
> > > > there, I can do that.  But again, I don't yet see how that can make a
> > > > functional difference considering the so far very limited usage of the
> > > > special bit, and wonder whether we can do that on top when it became
> > > > necessary (and when we start to have functional requirement of such).
> > > 
> > > I hope my explanation why pte_special() even exists and how it is used makes
> > > it clearer.
> > > 
> > > It's not that much code to handle it like pte_special(), really. I don't
> > > expect you to teach GUP-slow about vm_normal_page() etc.
> > 
> > One thing I can do here is I move the pmd_special() check into the existing
> > vm_normal_page_pmd(), then it'll be a fixup on top of this patch:
> > 
> > ===8<===
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 288f81a8698e..42674c0748cb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -672,11 +672,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >   {
> >   	unsigned long pfn = pmd_pfn(pmd);
> > -	/*
> > -	 * There is no pmd_special() but there may be special pmds, e.g.
> > -	 * in a direct-access (dax) mapping, so let's just replicate the
> > -	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> > -	 */
> > +	/* Currently it's only used for huge pfnmaps */
> > +	if (unlikely(pmd_special(pmd)))
> > +		return NULL;
> 
> 
> Better.
> 
> I'd appreciate a vm_normal_page_pud(), but I guess I have to be the one
> cleaning up the mess after you.

I'll prepare either a fixup patch for above, or repost if there're more
changes.

Again, please leave an explicit comment.

As I mentioned, to me vm_normal_page_pud() currently should only contain
pud_special() check, as most of the things in pmd/pte don't seem to apply.

I don't feel strongly about adding that wrapper yet in this case, but if you
can elaborate on what you're suggesting otherwise, it may help me understand
what you're looking for, and then I can try to address it.

Or if you prefer some cleanup patch by yourself, please go ahead.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-28 23:49           ` Jason Gunthorpe
@ 2024-08-29 19:21             ` Jiaqi Yan
  2024-09-04 15:52               ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-08-29 19:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Aug 28, 2024 at 4:50 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> > On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > >
> > > > Instead of removing the whole pud, can driver or memory_failure do
> > > > something similar to non-struct-page-version of split_huge_page? So
> > > > driver doesn't need to re-fault good pages back?
> > >
> > > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > > just for memory failure reporting.
> >
> > If I follow this, which of the following sounds better? 1. remove pud
> > and rely on the driver to re-fault PFNs that it knows are not poisoned
> > (what Peter suggested), or 2. keep the pud and allow access to both
> > good and bad PFNs.
>
> In practice I think people will need 2, as breaking up a 1G mapping
> just because a few bits are bad will destroy the VM performance.
>

Totally agreed.

> For this the expectation would be for the VM to co-operate and not
> keep causing memory failures, or perhaps for the platform to spare in
> good memory somehow.

Yes, whether a VM gets into a memory-error-consumption loop
maliciously or accidentally, a reasonable VMM should have means to
detect and break it.

>
> > Or provide some knob (configured by ?) so that kernel + driver can
> > switch between the two?
>
> This is also sounding reasonable, especially if we need some
> alternative protocol to signal userspace about the failed memory
> besides fault and SIGBUS.

To clarify, what's on my mind is a knob, say one named
"sysctl_enable_hard_offline", configured by userspace.

To apply to Ankit's memory_failure_pfn patch[*]:

static int memory_failure_pfn(unsigned long pfn, int flags)
{
  struct interval_tree_node *node;
  int res = MF_FAILED;
  LIST_HEAD(tokill);

  mutex_lock(&pfn_space_lock);
  for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
       node = interval_tree_iter_next(node, pfn, pfn)) {
    struct pfn_address_space *pfn_space =
      container_of(node, struct pfn_address_space, node);

    if (pfn_space->ops)
      pfn_space->ops->failure(pfn_space, pfn);

    collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);

    /* the proposed knob: skip the unmap (hard offline) when disabled */
    if (sysctl_enable_hard_offline)
      unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
                          PAGE_SIZE, 0);

    res = MF_RECOVERED;
  }
  mutex_unlock(&pfn_space_lock);

  if (res == MF_FAILED)
    return action_result(pfn, MF_MSG_PFN_MAP, res);

  flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
  kill_procs(&tokill, true, false, pfn, flags);

  return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
}

I think we still want to attempt to SIGBUS userspace, regardless of
doing unmap_mapping_range or not.

[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
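
For completeness, wiring up such a knob could look roughly like below --
entirely hypothetical, just following the common sysctl registration
pattern; the name and default are placeholders matching the sketch above
(assumes <linux/sysctl.h>):

static int sysctl_enable_hard_offline = 1;

static struct ctl_table pfn_mf_sysctls[] = {
	{
		.procname	= "enable_hard_offline",
		.data		= &sysctl_enable_hard_offline,
		.maxlen		= sizeof(sysctl_enable_hard_offline),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
};

static int __init pfn_mf_sysctl_init(void)
{
	register_sysctl("vm", pfn_mf_sysctls);
	return 0;
}
late_initcall(pfn_mf_sysctl_init);

Whether this should be a global sysctl, a per-driver option, or something
else is exactly the open question above.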

>
> Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-29 18:26     ` Peter Xu
@ 2024-08-29 19:44       ` David Hildenbrand
  2024-08-29 20:01         ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: David Hildenbrand @ 2024-08-29 19:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On 29.08.24 20:26, Peter Xu wrote:
> On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
>> On 26.08.24 22:43, Peter Xu wrote:
>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
>>> much easier, the write bit needs to be persisted though for writable and
>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>> parent or child process will trigger a write fault.
>>>
>>> Do the same for pmd level.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>>    mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>>>    1 file changed, 26 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index e2c314f631f3..15418ffdd377 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>    	pgtable_t pgtable = NULL;
>>>    	int ret = -ENOMEM;
>>> +	pmd = pmdp_get_lockless(src_pmd);
>>> +	if (unlikely(pmd_special(pmd))) {
>>
>> I assume I have to clean up your mess here as well?
> 
> Can you leave a meaningful and explicit comment?  I'll try to address it.

Sorry Peter, but I raised all that as a reply to v1. For example, I 
stated that vm_normal_page_pmd() already *exists* and why these 
pmd_special() checks should be kept there.

I hear you, you're not interested in cleaning that up. So at this point 
it's easier for me to clean it up myself.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-29 19:44       ` David Hildenbrand
@ 2024-08-29 20:01         ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-08-29 20:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

On Thu, Aug 29, 2024 at 09:44:01PM +0200, David Hildenbrand wrote:
> On 29.08.24 20:26, Peter Xu wrote:
> > On Thu, Aug 29, 2024 at 05:10:42PM +0200, David Hildenbrand wrote:
> > > On 26.08.24 22:43, Peter Xu wrote:
> > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > > > much easier, the write bit needs to be persisted though for writable and
> > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > parent or child process will trigger a write fault.
> > > > 
> > > > Do the same for pmd level.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >    mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> > > >    1 file changed, 26 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index e2c314f631f3..15418ffdd377 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >    	pgtable_t pgtable = NULL;
> > > >    	int ret = -ENOMEM;
> > > > +	pmd = pmdp_get_lockless(src_pmd);
> > > > +	if (unlikely(pmd_special(pmd))) {
> > > 
> > > I assume I have to clean up your mess here as well?
> > 
> > Can you leave a meaningful and explicit comment?  I'll try to address it.
> 
> Sorry Peter, but I raised all that as a reply to v1. For example, I stated
> that vm_normal_page_pmd() already *exists* and why these pmd_special() checks
> should be kept there.

We discussed the usage of pmd_page(), but I don't think it was clear that
you suggested it be used there.  IOW, copy_huge_pmd() doesn't use
vm_normal_page_pmd() so far, and I'm not sure whether it's always safe.

E.g. at least one thing I spot is vm_normal_page_pmd() returns NULL for
huge zeropage pmd but here in fork() we need to take a ref with
mm_get_huge_zero_folio().

> 
> I hear you, you're not interested in cleaning that up. So at this point it's
> easier for me to clean it up myself.

It might indeed be easier if you provide a patch that you think is best.

Then I'll leave that to you, and I'll soon send the solo fixup patch to the
list to be squashed.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 16/19] mm: Remove follow_pte()
  2024-08-26 20:43 ` [PATCH v2 16/19] mm: Remove follow_pte() Peter Xu
@ 2024-09-01  4:33   ` Yu Zhao
  2024-09-01 13:39     ` David Hildenbrand
  0 siblings, 1 reply; 69+ messages in thread
From: Yu Zhao @ 2024-09-01  4:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Aug 26, 2024 at 2:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> follow_pte() users have been converted to follow_pfnmap*().  Remove the
> API.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h |  2 --
>  mm/memory.c        | 73 ----------------------------------------------
>  2 files changed, 75 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 161d496bfd18..b31d4bdd65ad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2368,8 +2368,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>                 unsigned long end, unsigned long floor, unsigned long ceiling);
>  int
>  copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
> -int follow_pte(struct vm_area_struct *vma, unsigned long address,
> -              pte_t **ptepp, spinlock_t **ptlp);
>  int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>                         void *buf, int len, int write);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b5d07f493d5d..288f81a8698e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6100,79 +6100,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
>  }
>  #endif /* __PAGETABLE_PMD_FOLDED */
>
> -/**
> - * follow_pte - look up PTE at a user virtual address
> - * @vma: the memory mapping
> - * @address: user virtual address
> - * @ptepp: location to store found PTE
> - * @ptlp: location to store the lock for the PTE
> - *
> - * On a successful return, the pointer to the PTE is stored in @ptepp;
> - * the corresponding lock is taken and its location is stored in @ptlp.
> - *
> - * The contents of the PTE are only stable until @ptlp is released using
> - * pte_unmap_unlock(). This function will fail if the PTE is non-present.
> - * Present PTEs may include PTEs that map refcounted pages, such as
> - * anonymous folios in COW mappings.
> - *
> - * Callers must be careful when relying on PTE content after
> - * pte_unmap_unlock(). Especially if the PTE maps a refcounted page,
> - * callers must protect against invalidation with MMU notifiers; otherwise
> - * access to the PFN at a later point in time can trigger use-after-free.
> - *
> - * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
> - * should be taken for read.
> - *
> - * This function must not be used to modify PTE content.
> - *
> - * Return: zero on success, -ve otherwise.
> - */
> -int follow_pte(struct vm_area_struct *vma, unsigned long address,
> -              pte_t **ptepp, spinlock_t **ptlp)
> -{
> -       struct mm_struct *mm = vma->vm_mm;
> -       pgd_t *pgd;
> -       p4d_t *p4d;
> -       pud_t *pud;
> -       pmd_t *pmd;
> -       pte_t *ptep;
> -
> -       mmap_assert_locked(mm);
> -       if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> -               goto out;
> -
> -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> -               goto out;
> -
> -       pgd = pgd_offset(mm, address);
> -       if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> -               goto out;
> -
> -       p4d = p4d_offset(pgd, address);
> -       if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> -               goto out;
> -
> -       pud = pud_offset(p4d, address);
> -       if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> -               goto out;
> -
> -       pmd = pmd_offset(pud, address);
> -       VM_BUG_ON(pmd_trans_huge(*pmd));
> -
> -       ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
> -       if (!ptep)
> -               goto out;
> -       if (!pte_present(ptep_get(ptep)))
> -               goto unlock;
> -       *ptepp = ptep;
> -       return 0;
> -unlock:
> -       pte_unmap_unlock(ptep, *ptlp);
> -out:
> -       return -EINVAL;
> -}
> -EXPORT_SYMBOL_GPL(follow_pte);

I ran into build errors with this -- removing exported symbols breaks the
ABI, so I think we should make follow_pte() a wrapper of its new
equivalent, if that's possible?
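
For reference, a hedged sketch of how a former follow_pte() caller looks
after conversion to the follow_pfnmap*() API introduced earlier in this
series (field and function names as used in that series, quoted from memory;
the wrapper name lookup_mmio_pfn is hypothetical):

	static int lookup_mmio_pfn(struct vm_area_struct *vma, unsigned long addr,
				   unsigned long *pfn)
	{
		struct follow_pfnmap_args args = { .vma = vma, .address = addr };
		int ret;

		ret = follow_pfnmap_start(&args);
		if (ret)
			return ret;
		/* The PFN is only stable until follow_pfnmap_end(); copy it out. */
		*pfn = args.pfn;
		follow_pfnmap_end(&args);
		return 0;
	}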


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 16/19] mm: Remove follow_pte()
  2024-09-01  4:33   ` Yu Zhao
@ 2024-09-01 13:39     ` David Hildenbrand
  0 siblings, 0 replies; 69+ messages in thread
From: David Hildenbrand @ 2024-09-01 13:39 UTC (permalink / raw)
  To: Yu Zhao, Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

Am 01.09.24 um 06:33 schrieb Yu Zhao:
> On Mon, Aug 26, 2024 at 2:44 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> follow_pte() users have been converted to follow_pfnmap*().  Remove the
>> API.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>>   include/linux/mm.h |  2 --
>>   mm/memory.c        | 73 ----------------------------------------------
>>   2 files changed, 75 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 161d496bfd18..b31d4bdd65ad 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2368,8 +2368,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>>                  unsigned long end, unsigned long floor, unsigned long ceiling);
>>   int
>>   copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
>> -int follow_pte(struct vm_area_struct *vma, unsigned long address,
>> -              pte_t **ptepp, spinlock_t **ptlp);
>>   int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>                          void *buf, int len, int write);
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5d07f493d5d..288f81a8698e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6100,79 +6100,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
>>   }
>>   #endif /* __PAGETABLE_PMD_FOLDED */
>>
>> -/**
>> - * follow_pte - look up PTE at a user virtual address
>> - * @vma: the memory mapping
>> - * @address: user virtual address
>> - * @ptepp: location to store found PTE
>> - * @ptlp: location to store the lock for the PTE
>> - *
>> - * On a successful return, the pointer to the PTE is stored in @ptepp;
>> - * the corresponding lock is taken and its location is stored in @ptlp.
>> - *
>> - * The contents of the PTE are only stable until @ptlp is released using
>> - * pte_unmap_unlock(). This function will fail if the PTE is non-present.
>> - * Present PTEs may include PTEs that map refcounted pages, such as
>> - * anonymous folios in COW mappings.
>> - *
>> - * Callers must be careful when relying on PTE content after
>> - * pte_unmap_unlock(). Especially if the PTE maps a refcounted page,
>> - * callers must protect against invalidation with MMU notifiers; otherwise
>> - * access to the PFN at a later point in time can trigger use-after-free.
>> - *
>> - * Only IO mappings and raw PFN mappings are allowed.  The mmap semaphore
>> - * should be taken for read.
>> - *
>> - * This function must not be used to modify PTE content.
>> - *
>> - * Return: zero on success, -ve otherwise.
>> - */
>> -int follow_pte(struct vm_area_struct *vma, unsigned long address,
>> -              pte_t **ptepp, spinlock_t **ptlp)
>> -{
>> -       struct mm_struct *mm = vma->vm_mm;
>> -       pgd_t *pgd;
>> -       p4d_t *p4d;
>> -       pud_t *pud;
>> -       pmd_t *pmd;
>> -       pte_t *ptep;
>> -
>> -       mmap_assert_locked(mm);
>> -       if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>> -               goto out;
>> -
>> -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
>> -               goto out;
>> -
>> -       pgd = pgd_offset(mm, address);
>> -       if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> -               goto out;
>> -
>> -       p4d = p4d_offset(pgd, address);
>> -       if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> -               goto out;
>> -
>> -       pud = pud_offset(p4d, address);
>> -       if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> -               goto out;
>> -
>> -       pmd = pmd_offset(pud, address);
>> -       VM_BUG_ON(pmd_trans_huge(*pmd));
>> -
>> -       ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
>> -       if (!ptep)
>> -               goto out;
>> -       if (!pte_present(ptep_get(ptep)))
>> -               goto unlock;
>> -       *ptepp = ptep;
>> -       return 0;
>> -unlock:
>> -       pte_unmap_unlock(ptep, *ptlp);
>> -out:
>> -       return -EINVAL;
>> -}
>> -EXPORT_SYMBOL_GPL(follow_pte);
> 
> I ran into build errors with this -- removing exported symbols breaks
> ABI, so I think we should make follow_pte() as a wrapper of its new
> equivalent, if that's possible?

Build error with OOT modules or in-tree modules?

If you are talking about OOT modules, it is their responsibility to fix this up
in their implementation. There are no real kABI stability guarantees provided by
the kernel.

If you are talking about in-tree modules, did Peter miss some (probably in -next?)?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
  2024-08-29 15:10   ` David Hildenbrand
@ 2024-09-02  7:58   ` Yan Zhao
  2024-09-03 21:23     ` Peter Xu
  1 sibling, 1 reply; 69+ messages in thread
From: Yan Zhao @ 2024-09-02  7:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Aug 26, 2024 at 04:43:41PM -0400, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
> 
> Do the same for pmd level.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2c314f631f3..15418ffdd377 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pgtable_t pgtable = NULL;
>  	int ret = -ENOMEM;
>  
> +	pmd = pmdp_get_lockless(src_pmd);
> +	if (unlikely(pmd_special(pmd))) {
> +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +		/*
> +		 * No need to recheck the pmd, it can't change with write
> +		 * mmap lock held here.
> +		 *
> +		 * Meanwhile, making sure it's not a CoW VMA with writable
> +		 * mapping, otherwise it means either the anon page wrongly
> +		 * applied special bit, or we made the PRIVATE mapping be
> +		 * able to wrongly write to the backend MMIO.
> +		 */
> +		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> +		goto set_pmd;
> +	}
> +
>  	/* Skip if can be re-fill on fault */
>  	if (!vma_is_anonymous(dst_vma))
>  		return 0;
> @@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pmdp_set_wrprotect(src_mm, addr, src_pmd);
>  	if (!userfaultfd_wp(dst_vma))
>  		pmd = pmd_clear_uffd_wp(pmd);
> -	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	pmd = pmd_wrprotect(pmd);
> +set_pmd:
> +	pmd = pmd_mkold(pmd);
>  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
>  
>  	ret = 0;
> @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	 * TODO: once we support anonymous pages, use
>  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
>  	 */
> -	pudp_set_wrprotect(src_mm, addr, src_pud);
> -	pud = pud_mkold(pud_wrprotect(pud));
> +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> +		pudp_set_wrprotect(src_mm, addr, src_pud);
> +		pud = pud_wrprotect(pud);
> +	}
Do we need the logic to clear dirty bit in the child as that in
__copy_present_ptes()?  (and also for the pmd's case).

e.g.
if (vma->vm_flags & VM_SHARED)
	pud = pud_mkclean(pud);

> +	pud = pud_mkold(pud);
>  	set_pud_at(dst_mm, addr, dst_pud, pud);
>  
>  	ret = 0;
> -- 
> 2.45.0
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-02  7:58   ` Yan Zhao
@ 2024-09-03 21:23     ` Peter Xu
  2024-09-09 22:25       ` Andrew Morton
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-09-03 21:23 UTC (permalink / raw)
  To: Yan Zhao
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Sep 02, 2024 at 03:58:38PM +0800, Yan Zhao wrote:
> On Mon, Aug 26, 2024 at 04:43:41PM -0400, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels.  Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> > 
> > Do the same for pmd level.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> >  1 file changed, 26 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2c314f631f3..15418ffdd377 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	pgtable_t pgtable = NULL;
> >  	int ret = -ENOMEM;
> >  
> > +	pmd = pmdp_get_lockless(src_pmd);
> > +	if (unlikely(pmd_special(pmd))) {
> > +		dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > +		src_ptl = pmd_lockptr(src_mm, src_pmd);
> > +		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +		/*
> > +		 * No need to recheck the pmd, it can't change with write
> > +		 * mmap lock held here.
> > +		 *
> > +		 * Meanwhile, making sure it's not a CoW VMA with writable
> > +		 * mapping, otherwise it means either the anon page wrongly
> > +		 * applied special bit, or we made the PRIVATE mapping be
> > +		 * able to wrongly write to the backend MMIO.
> > +		 */
> > +		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> > +		goto set_pmd;
> > +	}
> > +
> >  	/* Skip if can be re-fill on fault */
> >  	if (!vma_is_anonymous(dst_vma))
> >  		return 0;
> > @@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> >  	if (!userfaultfd_wp(dst_vma))
> >  		pmd = pmd_clear_uffd_wp(pmd);
> > -	pmd = pmd_mkold(pmd_wrprotect(pmd));
> > +	pmd = pmd_wrprotect(pmd);
> > +set_pmd:
> > +	pmd = pmd_mkold(pmd);
> >  	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> >  
> >  	ret = 0;
> > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	 * TODO: once we support anonymous pages, use
> >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> >  	 */
> > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > -	pud = pud_mkold(pud_wrprotect(pud));
> > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > +		pud = pud_wrprotect(pud);
> > +	}
> Do we need the logic to clear dirty bit in the child as that in
> __copy_present_ptes()?  (and also for the pmd's case).
> 
> e.g.
> if (vma->vm_flags & VM_SHARED)
> 	pud = pud_mkclean(pud);

Yeah, good question.  I remember I thought about that when initially
working on these lines, but I forgot the details, or maybe I simply tried
to stick with the current code base, as the dirty bit used to be kept even
in the child here.

I'd expect there's only performance differences, but still sounds like I'd
better leave that to whoever knows the best on the implications, then draft
it as a separate patch but only when needed.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-29 19:21             ` Jiaqi Yan
@ 2024-09-04 15:52               ` Jason Gunthorpe
  2024-09-04 16:38                 ` Jiaqi Yan
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-09-04 15:52 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:

> I think we still want to attempt to SIGBUS userspace, regardless of
> doing unmap_mapping_range or not.

IMHO we need to eliminate this path if we actually want to keep things
mapped.

There is no way to generate the SIGBUS without poking a 4k hole in the
1G page, as only that 4k should get SIGBUS, every other byte of the 1G
is clean.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 15:52               ` Jason Gunthorpe
@ 2024-09-04 16:38                 ` Jiaqi Yan
  2024-09-04 16:43                   ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-09-04 16:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
>
> > I think we still want to attempt to SIGBUS userspace, regardless of
> > doing unmap_mapping_range or not.
>
> IMHO we need to eliminate this path if we actually want to keep things
> mapped.
>
> There is no way to generate the SIGBUS without poking a 4k hole in the
> 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> is clean.

Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
which is the whole purpose of not unmapping.

>
> Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 16:38                 ` Jiaqi Yan
@ 2024-09-04 16:43                   ` Jason Gunthorpe
  2024-09-04 16:58                     ` Jiaqi Yan
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-09-04 16:43 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> >
> > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > doing unmap_mapping_range or not.
> >
> > IMHO we need to eliminate this path if we actually want to keep things
> > mapped.
> >
> > There is no way to generate the SIGBUS without poking a 4k hole in the
> > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > is clean.
> 
> Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> which is the whole purpose of not unmapping.

You can't get a SIGBUS if the things are still mapped. This is why the
SIGBUS flow requires poking a non-present hole around the poisoned
memory.

So keeping things mapped at 1G also means giving up on SIGBUS.
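
For illustration, the "poke a hole" being referred to boils down to a request
like the one below (a sketch, assuming the driver knows the poisoned file
offset, here called poison_off); note that on a huge pfnmap the whole PMD/PUD
entry would have to be zapped or split to honor it, which is exactly the
tension being discussed:

	/* Ask for only the 4k page around the poison to become non-present. */
	unmap_mapping_range(vma->vm_file->f_mapping,
			    round_down(poison_off, PAGE_SIZE), PAGE_SIZE, 1);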

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 16:43                   ` Jason Gunthorpe
@ 2024-09-04 16:58                     ` Jiaqi Yan
  2024-09-04 17:00                       ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-09-04 16:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > >
> > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > doing unmap_mapping_range or not.
> > >
> > > IMHO we need to eliminate this path if we actually want to keep things
> > > mapped.
> > >
> > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > is clean.
> >
> > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > which is the whole purpose of not unmapping.
>
> You can't get a SIGBUS if the things are still mapped. This is why the
> SIGBUS flow requires poking a non-present hole around the poisoned
> memory.
>
> So keeping things mapped at 1G also means giving up on SIGBUS.

SIGBUS during page fault is definitely impossible when memory is still
mapped, but the platform will still raise an MCE or SEA in case of poison
consumption, right? So I wanted to propose new code to send a SIGBUS (either
BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
kernel in the synchronous poison consumption context, e.g. MCE on x86
and SEA on arm64.
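
A minimal sketch of that proposal, assuming the synchronous MCE/SEA handler
can resolve the consuming task's user address (the surrounding wiring and the
poisoned_uaddr variable are hypothetical; force_sig_mceerr() is the existing
helper the memory-failure code already uses):

	/* In the synchronous poison-consumption path, current is the consumer. */
	force_sig_mceerr(BUS_MCEERR_AR, (void __user *)poisoned_uaddr, PAGE_SHIFT);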

>
> Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 16:58                     ` Jiaqi Yan
@ 2024-09-04 17:00                       ` Jason Gunthorpe
  2024-09-04 17:07                         ` Jiaqi Yan
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2024-09-04 17:00 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Sep 04, 2024 at 09:58:54AM -0700, Jiaqi Yan wrote:
> On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > > >
> > > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > > doing unmap_mapping_range or not.
> > > >
> > > > IMHO we need to eliminate this path if we actually want to keep things
> > > > mapped.
> > > >
> > > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > > is clean.
> > >
> > > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > > which is the whole purpose of not unmapping.
> >
> > You can't get a SIGBUS if the things are still mapped. This is why the
> > SIGBUS flow requires poking a non-present hole around the poisoned
> > memory.
> >
> > So keeping things mapped at 1G also means giving up on SIGBUS.
> 
> SIGBUS during page fault is definitely impossible when memory is still
> mapped, but the platform still MCE or SEA in case of poison
> consumption, right? So I wanted to propose new code to SIGBUS (either
> BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
> kernel in the synchronous poison consumption context, e.g. MCE on X86
> and SEA on ARM64.

So you want a SIGBUS that is delivered asynchronously instead of via
the page fault handler? Something like that is sort of what I meant by
"eliminate this path", though I didn't think keeping an async SIGBUS
was an option?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 17:00                       ` Jason Gunthorpe
@ 2024-09-04 17:07                         ` Jiaqi Yan
  2024-09-09  3:56                           ` Ankit Agrawal
  0 siblings, 1 reply; 69+ messages in thread
From: Jiaqi Yan @ 2024-09-04 17:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson, ankita

On Wed, Sep 4, 2024 at 10:00 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Sep 04, 2024 at 09:58:54AM -0700, Jiaqi Yan wrote:
> > On Wed, Sep 4, 2024 at 9:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Wed, Sep 04, 2024 at 09:38:22AM -0700, Jiaqi Yan wrote:
> > > > On Wed, Sep 4, 2024 at 8:52 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > > >
> > > > > On Thu, Aug 29, 2024 at 12:21:39PM -0700, Jiaqi Yan wrote:
> > > > >
> > > > > > I think we still want to attempt to SIGBUS userspace, regardless of
> > > > > > doing unmap_mapping_range or not.
> > > > >
> > > > > IMHO we need to eliminate this path if we actually want to keep things
> > > > > mapped.
> > > > >
> > > > > There is no way to generate the SIGBUS without poking a 4k hole in the
> > > > > 1G page, as only that 4k should get SIGBUS, every other byte of the 1G
> > > > > is clean.
> > > >
> > > > Ah, sorry I wasn't clear. The SIGBUS will be only for poisoned PFN;
> > > > clean PFNs under the same PUD/PMD for sure don't need any SIGBUS,
> > > > which is the whole purpose of not unmapping.
> > >
> > > You can't get a SIGBUS if the things are still mapped. This is why the
> > > SIGBUS flow requires poking a non-present hole around the poisoned
> > > memory.
> > >
> > > So keeping things mapped at 1G also means giving up on SIGBUS.
> >
> > SIGBUS during page fault is definitely impossible when memory is still
> > mapped, but the platform still MCE or SEA in case of poison
> > consumption, right? So I wanted to propose new code to SIGBUS (either
> > BUS_MCEERR_AR or BUS_OBJERR) as long as the platform notifies the
> > kernel in the synchronous poison consumption context, e.g. MCE on X86
> > and SEA on ARM64.
>
> So you want a SIGBUS that is delivered asynchronously instead of via
> the page fault handler? Something like that is sort of what I meant by
> "eliminate this path", though I didn't think keeping an async SIGBUS
> was an option?

Not really; I don't think a SIGBUS *async* to the poison-consuming
thread is critical, at least not as useful as a SIGBUS *sync* to the
poison-consuming thread.

>
> Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-04 17:07                         ` Jiaqi Yan
@ 2024-09-09  3:56                           ` Ankit Agrawal
  0 siblings, 0 replies; 69+ messages in thread
From: Ankit Agrawal @ 2024-09-09  3:56 UTC (permalink / raw)
  To: Jiaqi Yan, Jason Gunthorpe
  Cc: Peter Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Gavin Shan, Catalin Marinas, x86@kernel.org, Ingo Molnar,
	Andrew Morton, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Sean Christopherson,
	Oscar Salvador, Borislav Petkov, Zi Yan, Axel Rasmussen,
	David Hildenbrand, Yan Zhao, Will Deacon, Kefeng Wang,
	Alex Williamson

> Yes, whether a VM gets into a memory-error-consumption loop
> maliciously or accidentally, a reasonable VMM should have means to
> detect and break it.
Agreed, we need a way to handle it. I suppose it can easily happen if,
among other ways, a malicious app in the VM handles the SIGBUS and then
reads/writes again.

Regarding the following two ways discussed:
> 1. remove pud and rely on the driver to re-fault PFNs that it knows
> are not poisoned (what Peter suggested), or 2. keep the pud and
> allow access to both good and bad PFNs.
As mentioned, 2 has the advantage from a performance POV.
For my understanding, what are the pros of mechanism 1 vs 2?
Wondering whether it is a choice driven by some technical constraints.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
                   ` (19 preceding siblings ...)
  2024-08-27 22:36 ` [PATCH v2 00/19] mm: Support huge pfnmaps Jiaqi Yan
@ 2024-09-09  4:03 ` Ankit Agrawal
  2024-09-09 15:03   ` Peter Xu
  20 siblings, 1 reply; 69+ messages in thread
From: Ankit Agrawal @ 2024-09-09  4:03 UTC (permalink / raw)
  To: Peter Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
  Cc: Gavin Shan, Catalin Marinas, x86@kernel.org, Ingo Molnar,
	Andrew Morton, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Sean Christopherson,
	Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Yan Zhao, Will Deacon,
	Kefeng Wang, Alex Williamson

> More architectures / More page sizes
> ------------------------------------
> 
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> have plan to support arm64 1G later on top of this series [2].
> 
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.

Just to confirm, wouldn't this also support 512M for 64K base pages on
aarch64 with a special PMD? Or am I missing something?

> remap_pfn_range() support
> -------------------------
> 
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.

Does Peter or other folks know of an ongoing effort/patches to extend
remap_pfn_range() to use this?
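
(For context: no such support exists in this series, but the alignment gating
a huge-capable remap_pfn_range() would need before installing a PMD-level
entry might look roughly like the sketch below; the helper name is
hypothetical.)

	static bool can_remap_pmd(unsigned long vaddr, unsigned long pfn,
				  unsigned long size)
	{
		return IS_ALIGNED(vaddr, PMD_SIZE) &&
		       IS_ALIGNED(PFN_PHYS(pfn), PMD_SIZE) &&
		       size >= PMD_SIZE;
	}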


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 00/19] mm: Support huge pfnmaps
  2024-09-09  4:03 ` Ankit Agrawal
@ 2024-09-09 15:03   ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-09-09 15:03 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gavin Shan,
	Catalin Marinas, x86@kernel.org, Ingo Molnar, Andrew Morton,
	Paolo Bonzini, Dave Hansen, Thomas Gleixner, Alistair Popple,
	kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Sep 09, 2024 at 04:03:55AM +0000, Ankit Agrawal wrote:
> > More architectures / More page sizes
> > ------------------------------------
> > 
> > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> > have plan to support arm64 1G later on top of this series [2].
> > 
> > Any arch will need to first support THP / THP_1G, then provide a special
> > bit in pmds/puds to support huge pfnmaps.
> 
> Just to confirm, would this also not support 512M for 64K pages on aarch64
> with special PMD? Or am I missing something?

I don't think it's properly tested yet, but logically it should indeed be
supported, as what matters here is "pmd/pud", not the explicit size being
used.

> 
> > remap_pfn_range() support
> > -------------------------
> > 
> > Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> > remap_pfn_range() can logically start to inject either PMDs or PUDs when
> > the alignment requirements match on the VAs.
> >
> > When the support is there, it should be able to silently benefit all
> > drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> > hit rate and overall faster MMIO accesses similar to processor on hugepages.
> 
> Does Peter or other folks know of an ongoing effort/patches to extend
> remap_pfn_range() to use this?

Not aware of any from my side.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-03 21:23     ` Peter Xu
@ 2024-09-09 22:25       ` Andrew Morton
  2024-09-09 22:43         ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2024-09-09 22:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yan Zhao, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm, linux-arm-kernel, Sean Christopherson,
	Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Will Deacon, Kefeng Wang,
	Alex Williamson

On Tue, 3 Sep 2024 17:23:38 -0400 Peter Xu <peterx@redhat.com> wrote:

> > > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >  	 * TODO: once we support anonymous pages, use
> > >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> > >  	 */
> > > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > > -	pud = pud_mkold(pud_wrprotect(pud));
> > > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > > +		pud = pud_wrprotect(pud);
> > > +	}
> > Do we need the logic to clear dirty bit in the child as that in
> > __copy_present_ptes()?  (and also for the pmd's case).
> > 
> > e.g.
> > if (vma->vm_flags & VM_SHARED)
> > 	pud = pud_mkclean(pud);
> 
> Yeah, good question.  I remember I thought about that when initially
> working on these lines, but I forgot the details, or maybe I simply tried
> to stick with the current code base, as the dirty bit used to be kept even
> in the child here.
> 
> I'd expect there's only performance differences, but still sounds like I'd
> better leave that to whoever knows the best on the implications, then draft
> it as a separate patch but only when needed.

Sorry, but this vagueness simply leaves me with nowhere to go.

I'll drop the series - let's revisit after -rc1 please.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-09 22:25       ` Andrew Morton
@ 2024-09-09 22:43         ` Peter Xu
  2024-09-09 23:15           ` Andrew Morton
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-09-09 22:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yan Zhao, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm, linux-arm-kernel, Sean Christopherson,
	Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Will Deacon, Kefeng Wang,
	Alex Williamson

On Mon, Sep 09, 2024 at 03:25:46PM -0700, Andrew Morton wrote:
> On Tue, 3 Sep 2024 17:23:38 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > >  	 * TODO: once we support anonymous pages, use
> > > >  	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> > > >  	 */
> > > > -	pudp_set_wrprotect(src_mm, addr, src_pud);
> > > > -	pud = pud_mkold(pud_wrprotect(pud));
> > > > +	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > > > +		pudp_set_wrprotect(src_mm, addr, src_pud);
> > > > +		pud = pud_wrprotect(pud);
> > > > +	}
> > > Do we need the logic to clear dirty bit in the child as that in
> > > __copy_present_ptes()?  (and also for the pmd's case).
> > > 
> > > e.g.
> > > if (vma->vm_flags & VM_SHARED)
> > > 	pud = pud_mkclean(pud);
> > 
> > Yeah, good question.  I remember I thought about that when initially
> > working on these lines, but I forgot the details, or maybe I simply tried
> > to stick with the current code base, as the dirty bit used to be kept even
> > in the child here.
> > 
> > I'd expect there's only performance differences, but still sounds like I'd
> > better leave that to whoever knows the best on the implications, then draft
> > it as a separate patch but only when needed.
> 
> Sorry, but this vagueness simply leaves me with nowhere to go.
> 
> I'll drop the series - let's revisit after -rc1 please.

Andrew, would you please explain why it needs to be dropped?

I meant in the reply that I think we should leave that as is, and I think
so far nobody in real life should care much on this bit, so I think it's
fine to leave the dirty bit as-is.

I still think whoever has a better use of the dirty bit and would like to
change the behavior should find the use case and work on top, but only if
necessary.

At least this whole fork() path is not useful at all for the use case we're
working on.  Please still consider having this series as I think it's useful.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-09 22:43         ` Peter Xu
@ 2024-09-09 23:15           ` Andrew Morton
  2024-09-10  0:08             ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2024-09-09 23:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Yan Zhao, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm, linux-arm-kernel, Sean Christopherson,
	Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Will Deacon, Kefeng Wang,
	Alex Williamson

On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:

> > > > Do we need the logic to clear dirty bit in the child as that in
> > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > 
> > > > e.g.
> > > > if (vma->vm_flags & VM_SHARED)
> > > > 	pud = pud_mkclean(pud);
> > > 
> > > Yeah, good question.  I remember I thought about that when initially
> > > working on these lines, but I forgot the details, or maybe I simply tried
> > > to stick with the current code base, as the dirty bit used to be kept even
> > > in the child here.
> > > 
> > > I'd expect there's only performance differences, but still sounds like I'd
> > > better leave that to whoever knows the best on the implications, then draft
> > > it as a separate patch but only when needed.
> > 
> > Sorry, but this vagueness simply leaves me with nowhere to go.
> > 
> > I'll drop the series - let's revisit after -rc1 please.
> 
> Andrew, would you please explain why it needs to be dropped?
> 
> I meant in the reply that I think we should leave that as is, and I think
> so far nobody in real life should care much on this bit, so I think it's
> fine to leave the dirty bit as-is.
> 
> I still think whoever has a better use of the dirty bit and would like to
> change the behavior should find the use case and work on top, but only if
> necessary.

Well.  "I'd expect there's only performance differences" means to me
"there might be correctness issues, I don't know".  Is it or is it not
merely a performance thing?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-09 23:15           ` Andrew Morton
@ 2024-09-10  0:08             ` Peter Xu
  2024-09-10  2:52               ` Yan Zhao
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-09-10  0:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yan Zhao, linux-kernel, linux-mm, Gavin Shan, Catalin Marinas,
	x86, Ingo Molnar, Paolo Bonzini, Dave Hansen, Thomas Gleixner,
	Alistair Popple, kvm, linux-arm-kernel, Sean Christopherson,
	Oscar Salvador, Jason Gunthorpe, Borislav Petkov, Zi Yan,
	Axel Rasmussen, David Hildenbrand, Will Deacon, Kefeng Wang,
	Alex Williamson

On Mon, Sep 09, 2024 at 04:15:39PM -0700, Andrew Morton wrote:
> On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > > > > Do we need the logic to clear dirty bit in the child as that in
> > > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > > 
> > > > > e.g.
> > > > > if (vma->vm_flags & VM_SHARED)
> > > > > 	pud = pud_mkclean(pud);
> > > > 
> > > > Yeah, good question.  I remember I thought about that when initially
> > > > working on these lines, but I forgot the details, or maybe I simply tried
> > > > to stick with the current code base, as the dirty bit used to be kept even
> > > > in the child here.
> > > > 
> > > > I'd expect there's only performance differences, but still sounds like I'd
> > > > better leave that to whoever knows the best on the implications, then draft
> > > > it as a separate patch but only when needed.
> > > 
> > > Sorry, but this vagueness simply leaves me with nowhere to go.
> > > 
> > > I'll drop the series - let's revisit after -rc1 please.
> > 
> > Andrew, would you please explain why it needs to be dropped?
> > 
> > I meant in the reply that I think we should leave that as is, and I think
> > so far nobody in real life should care much on this bit, so I think it's
> > fine to leave the dirty bit as-is.
> > 
> > I still think whoever has a better use of the dirty bit and would like to
> > change the behavior should find the use case and work on top, but only if
> > necessary.
> 
> Well.  "I'd expect there's only performance differences" means to me
> "there might be correctness issues, I don't know".  Is it or is it not
> merely a performance thing?

There should be no correctness issue pending.  It can only be about
performance, and AFAIU this patch is done in exactly the way that
shouldn't change performance either, as it doesn't change how the dirty bit
is processed (just like before this patch), not to mention correctness (in
regards to dirty bits).

I can provide some more details.

Here the question we're discussing is "whether we should clear the dirty
bit in the child for a pgtable entry when it's VM_SHARED".  Yan observed
that we don't do the same thing for pte/pmd/pud, which is true.

Before this patch:

  - For pte:     we clear dirty bit if VM_SHARED in child when copy
  - For pmd/pud: we never clear dirty bit in the child when copy

The behavior of clearing the dirty bit for VM_SHARED in the child at pte
level originates from the first commit that git history starts with.  So we
have been doing it for 19 years.
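
For reference, roughly the pte-level logic being referred to (paraphrasing
__copy_present_ptes(); exact helper names vary by kernel version):

	/* CoW mapping with a writable pte: write-protect parent and child. */
	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
	/* Shared mapping: the child starts with a clean pte. */
	if (src_vma->vm_flags & VM_SHARED)
		pte = pte_mkclean(pte);
	pte = pte_mkold(pte);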

That makes sense to me, because clearing the dirty bit in a pte normally
requires a SetDirty on the folio, e.g. in the unmap path:

        if (pte_dirty(pteval))
                folio_mark_dirty(folio);

Hence a cleared dirty bit in the child should avoid some extra overhead when
the pte maps file cache: a clean pte at least helps us avoid calls into
e.g. the mapping's dirty_folio() functions (which should normally check
folio_test_set_dirty() again anyway, and the parent pte still has the dirty
bit set, so we won't miss setting the folio dirty):

folio_mark_dirty():
        if (folio_test_reclaim(folio))
                folio_clear_reclaim(folio);
        return mapping->a_ops->dirty_folio(mapping, folio);

However, there's the other side of things: when the dirty bit is missing, I
_think_ it also means that when the child writes to the cleaned pte, it'll
require (e.g. on archs with hardware-assisted dirty tracking) the MMU to set
the dirty bit, which is slower than if we don't clear the dirty bit... and
with software-emulated dirty bits it could even require a page fault, IIUC.

In short, personally I don't know what's best to do, keep or clear the
dirty bit, even if it's safe either way: there are pros and cons to either
decision.

That's why I said I'm not sure which is the best way.  I had a feeling that
most people didn't even notice this, and we have kept running this code
for the past 19 years just fine.

OTOH, we don't do the same for pmds/puds (where we always persist the dirty
bit in the child), and I didn't check whether that's intended, or why.
It would have similar reasoning as the pte discussion above, or even more
that I overlooked.

So again, the safest approach here, in terms of the dirty bit, is to keep
what we did before.  And that's what this patch does as of now.

IOW, if I need a repost, I'll repost exactly the same thing (with the
fixup I sent later, which is already in mm-unstable).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-10  0:08             ` Peter Xu
@ 2024-09-10  2:52               ` Yan Zhao
  2024-09-10 12:16                 ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Yan Zhao @ 2024-09-10  2:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Sep 09, 2024 at 08:08:16PM -0400, Peter Xu wrote:
> On Mon, Sep 09, 2024 at 04:15:39PM -0700, Andrew Morton wrote:
> > On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@redhat.com> wrote:
> > 
> > > > > > Do we need the logic to clear dirty bit in the child as that in
> > > > > > __copy_present_ptes()?  (and also for the pmd's case).
> > > > > > 
> > > > > > e.g.
> > > > > > if (vma->vm_flags & VM_SHARED)
> > > > > > 	pud = pud_mkclean(pud);
> > > > > 
> > > > > Yeah, good question.  I remember I thought about that when initially
> > > > > working on these lines, but I forgot the details, or maybe I simply tried
> > > > > to stick with the current code base, as the dirty bit used to be kept even
> > > > > in the child here.
> > > > > 
> > > > > I'd expect there's only performance differences, but still sounds like I'd
> > > > > better leave that to whoever knows the best on the implications, then draft
> > > > > it as a separate patch but only when needed.
> > > > 
> > > > Sorry, but this vagueness simply leaves me with nowhere to go.
> > > > 
> > > > I'll drop the series - let's revisit after -rc1 please.
> > > 
> > > Andrew, would you please explain why it needs to be dropped?
> > > 
> > > I meant in the reply that I think we should leave that as is, and I think
> > > so far nobody in real life should care much on this bit, so I think it's
> > > fine to leave the dirty bit as-is.
> > > 
> > > I still think whoever has a better use of the dirty bit and would like to
> > > change the behavior should find the use case and work on top, but only if
> > > necessary.
> > 
> > Well.  "I'd expect there's only performance differences" means to me
> > "there might be correctness issues, I don't know".  Is it or is it not
> > merely a performance thing?
> 
> There should have no correctness issue pending.  It can only be about
> performance, and AFAIU what this patch does is exactly the way where it
> shouldn't ever change performance either, as it didn't change how dirty bit
> was processed (just like before this patch), not to mention correctness (in
> regards to dirty bits).
> 
> I can provide some more details.
> 
> Here the question we're discussing is "whether we should clear the dirty
> bit in the child for a pgtable entry when it's VM_SHARED".  Yan observed
> that we don't do the same thing for pte/pmd/pud, which is true.
> 
> Before this patch:
> 
>   - For pte:     we clear dirty bit if VM_SHARED in child when copy
>   - For pmd/pud: we never clear dirty bit in the child when copy
Hi Peter,

Not sure if I missed anything.

It looks like before this patch, pmd/pud were always write protected without
checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
clears the dirty bit by moving the dirty value to the software bit.

And I have a question: why were pmd/pud previously always write protected?

Thanks
Yan

> 
> The behavior of clearing dirty bit for VM_SHARED in child for pte level
> originates to the 1st commit that git history starts.  So we always do so
> for 19 years.
> 
> That makes sense to me, because clearing dirty bit in pte normally requires
> a SetDirty on the folio, e.g. in unmap path:
> 
>         if (pte_dirty(pteval))
>                 folio_mark_dirty(folio);
> 
> Hence cleared dirty bit in the child should avoid some extra overheads when
> the pte maps a file cache, so clean pte can at least help us to avoid calls
> into e.g. mapping's dirty_folio() functions (in which it should normally
> check folio_test_set_dirty() again anyway, and parent pte still have the
> dirty bit set so we won't miss setting folio dirty):
> 
> folio_mark_dirty():
>         if (folio_test_reclaim(folio))
>                 folio_clear_reclaim(folio);
>         return mapping->a_ops->dirty_folio(mapping, folio);
> 
> However there's the other side of thing where when the dirty bit is missing
> I _think_ it also means when the child writes to the cleaned pte, it'll
> require (e.g. on hardware accelerated archs) MMU setting dirty bit which is
> slower than if we don't clear the dirty bit... and on software emulated
> dirty bits it could even require a page fault, IIUC.
> 
> In short, personally I don't know what's the best to do, on keep / remove
> the dirty bit even if it's safe either way: there are pros and cons on
> different decisions.
> 
> That's why I said I'm not sure which is the best way.  I had a feeling that
> most of the people didn't even notice this, and we kept running this code
> for the past 19 years just all fine..
> 
> OTOH, we don't do the same for pmds/puds (in which case we persist dirty
> bits always in child), and I didn't check whether it's intended, or why.
> It'll have similar reasoning as above discussion on pte, or even more I
> overlooked.
> 
> So again, the safest approach here is in terms of dirty bit we keep what we
> do as before.  And that's what this patch does as of now.
> 
> IOW, if I'll need a repost, I'll repost exactly the same thing (with the
> fixup I sent later, which is already in mm-unstable).
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-10  2:52               ` Yan Zhao
@ 2024-09-10 12:16                 ` Peter Xu
  2024-09-11  2:16                   ` Yan Zhao
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2024-09-10 12:16 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Andrew Morton, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> Hi Peter,

Hi, Yan,

> 
> Not sure if I missed anything.
> 
> It looks that before this patch, pmd/pud are always write protected without
> checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
> clears dirty bit by moving the dirty value to the software bit.
> 
> And I have a question that why previously pmd/pud are always write protected.

IIUC this is a separate question - the move of the dirty bit in
pud_wrprotect() is to avoid wrongly creating shadow stack mappings.  In our
discussion I think that's extra complexity that can be put aside; the dirty
bit will get recovered by pud_clear_saveddirty() later, so it's not the same
as pud_mkclean().
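
For context, the x86 wrprotect helpers do roughly the following when shadow
stacks are enabled (paraphrased from memory and simplified; see the comments
around pte/pud_wrprotect() for the authoritative version):

	pud = pud_clear_flags(pud, _PAGE_RW);
	/*
	 * Clearing Write while Dirty stays set would look like a shadow
	 * stack entry (Write=0, Dirty=1), so the hardware dirty value is
	 * stashed into the software "saved dirty" bit instead.
	 */
	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
		pud = pud_mksaveddirty(pud);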

AFAIU the pmd/pud paths don't consider is_cow_mapping() because normally we
will not duplicate pgtables in fork() for most shared file mappings (!CoW).
Please refer to vma_needs_copy() (sketched below), and the comment before it
finally returns false.  It's not strictly is_cow_mapping(), as we're checking
anon_vma there; however it's mostly that, just also covering MAP_PRIVATE file
mappings when no CoW has happened (since if CoW had happened, anon_vma would
already be there).
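
An abridged sketch of that function, for reference (see mm/memory.c for the
full comments; condition order as I recall it):

	static bool
	vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
	{
		if (userfaultfd_wp(dst_vma))	/* uffd-wp state must be copied */
			return true;
		if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return true;		/* may have no fault() to refill */
		if (src_vma->anon_vma)		/* CoWed/anon pages present */
			return true;
		/* Otherwise faults will refill the ptes; skip the copy at fork(). */
		return false;
	}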

There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
Userfault & mixedmap are not involved in this series at all, so let's
discuss pfnmaps.

It means fork() can still copy pgtables for pfnmap vmas, and that's relevant
to this series, because before this series pfnmaps only existed at pte level;
hence IMO the is_cow_mapping() check must exist at pte level as you
described, because it needs to properly take care of those.  Note that the
pte processing also checks pte_write() to make sure it's a COWed page, not a
RO page cache / pfnmap / ..., for example.

Meanwhile, since pfnmaps couldn't appear at pmd/pud level, I think it was
fair for pmd/pud to assume that a huge mapping they see must be MAP_PRIVATE,
otherwise the whole copy_page_range() would already have been skipped.  IOW I
think they only needed to process COWed pages there, and those pages require
the write bit removed in both parent and child at fork().

After this series, pfnmaps can appear in the form of pmd/pud, so the
previous assumption stops holding true, as we'll still always copy pfnmaps
during fork().  My guess at the reason is that most drivers map pfnmap vmas
only during mmap(), which means there can normally be no fault() handler at
all for those pfns.

In this case, we'll also need to identify whether the page is COWed, using
the "is_cow_mapping() && pxx_write()" check newly added in this series (added
to the pud path, while for the pmd path I used a WARN_ON_ONCE instead).

If we don't do that, then e.g. for a VM_SHARED pfnmap vma we'd wrongly
observe write-protected entries after fork().  The change here makes sure
VM_SHARED properly persists the write bits on pmds/puds.

Hope that explains.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-10 12:16                 ` Peter Xu
@ 2024-09-11  2:16                   ` Yan Zhao
  2024-09-11 14:34                     ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Yan Zhao @ 2024-09-11  2:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote:
> On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> > Hi Peter,
> 
> Hi, Yan,
> 
> > 
> > Not sure if I missed anything.
> > 
> > It looks that before this patch, pmd/pud are always write protected without
> > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
> > clears dirty bit by moving the dirty value to the software bit.
> > 
> > And I have a question that why previously pmd/pud are always write protected.
> 
> IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
> is to avoid wrongly creating shadow stack mappings.  In our discussion I
> think that's an extra complexity and can be put aside; the dirty bit will
> get recovered in pud_clear_saveddirty() later, so it's not the same as
> pud_mkclean().
But pud_clear_saveddirty() will only set the dirty bit when the write bit is 1.

> 
> AFAIU the pmd/pud paths don't consider is_cow_mapping() because normally we
> will not duplicate pgtables in fork() for most shared (!CoW) file mappings.
> Please refer to vma_needs_copy(), and the comment before the final "return
> false".  The check there isn't strictly is_cow_mapping(), as it looks at
> anon_vma instead, but it's mostly equivalent: it additionally covers
> MAP_PRIVATE file mappings where no CoW has happened yet (if CoW had
> happened, an anon_vma would already be present).
> 
> There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
> Userfault & mixedmap are not involved in this series at all, so let's
> discuss pfnmaps.
> 
> It means fork() can still copy pgtables for pfnmap vmas, and that is
> relevant to this series: before this series pfnmaps only existed at the pte
> level, hence the is_cow_mapping() check must exist at the pte level, as you
> described, so those entries are handled properly.  Note that the pte
> processing also checks pte_write() to make sure the entry is a CoWed page
> rather than, e.g., a read-only page cache page or pfnmap.
> 
> Meanwhile, since pfnmaps can't appear at the pmd/pud level (before this
> series), I think it's fair for the pmd/pud paths to assume that any huge
> mapping they see must be MAP_PRIVATE, because otherwise the whole
> copy_page_range() would already have been skipped.  IOW they only need to
> process CoWed pages there, and those pages need the write bit removed in
> both parent and child at fork().
Is it also based on the fact that there are no MAP_SHARED huge DEVMAP pages so far?

> 
> After this series, pfnmaps can appear in the form of pmds/puds, so the
> previous assumption stops holding true, as we still always copy pfnmaps
> during fork().  My guess is that this is because most drivers map pfnmap
> vmas only during mmap(), which means there is normally no fault() handler
> at all for those pfns.
> 
> In this case, we'll need to also identify whether the page is COWed, using
> the newly added "is_cow_mapping() && pxx_write()" in this series (added
> to pud path, while for pmd path I used a WARN_ON_ONCE instead).
> 
> If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork()
> we'll wrongly observe write protected entries.  Here the change will make
> sure VM_SHARED can properly persist the write bits on pmds/puds.
> 
> Hope that explains.
> 
Thanks a lot for such a detailed explanation!
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
  2024-09-11  2:16                   ` Yan Zhao
@ 2024-09-11 14:34                     ` Peter Xu
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Xu @ 2024-09-11 14:34 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Andrew Morton, linux-kernel, linux-mm, Gavin Shan,
	Catalin Marinas, x86, Ingo Molnar, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Will Deacon, Kefeng Wang, Alex Williamson

On Wed, Sep 11, 2024 at 10:16:55AM +0800, Yan Zhao wrote:
> On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote:
> > On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> > > Hi Peter,
> > 
> > Hi, Yan,
> > 
> > > 
> > > Not sure if I missed anything.
> > > 
> > > It looks like before this patch, pmds/puds are always write protected without
> > > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)".  pud_wrprotect()
> > > clears the dirty bit by moving the dirty value to the software bit.
> > > 
> > > And my question is: why were pmds/puds previously always write protected?
> > 
> > IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
> > is to avoid wrongly creating shadow stack mappings.  In our discussion I
> > think that's an extra complexity and can be put aside; the dirty bit will
> > get recovered in pud_clear_saveddirty() later, so it's not the same as
> > pud_mkclean().
> But pud_clear_saveddirty() will only set the dirty bit when the write bit is 1.

Yes, it's because x86 wants to avoid unexpected write=0 && dirty=1 entries,
as that combination can wrongly reflect a shadow stack mapping.  Here the
saved dirty bit can only be recovered once the write bit is 1 again, so the
forbidden combination never shows up.
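
Conceptually it works like this (only an illustration of the idea; the real
x86 helpers in arch/x86/include/asm/pgtable.h do this with branchless bit
shifts, using the software _PAGE_SAVED_DIRTY bit):

  /* wrprotect: Write=0 && Dirty=1 would look like a shadow stack entry,
   * so park the dirty bit in the software "saved dirty" bit for now. */
  pud = pud_clear_flags(pud, _PAGE_RW);
  if (pud_flags(pud) & _PAGE_DIRTY)
          pud = pud_set_flags(pud_clear_flags(pud, _PAGE_DIRTY),
                              _PAGE_SAVED_DIRTY);

  /* clear_saveddirty: move it back only once Write=1 again, so the
   * forbidden Write=0 && Dirty=1 combination never becomes visible. */
  if ((pud_flags(pud) & _PAGE_RW) && (pud_flags(pud) & _PAGE_SAVED_DIRTY))
          pud = pud_set_flags(pud_clear_flags(pud, _PAGE_SAVED_DIRTY),
                              _PAGE_DIRTY);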

> 
> > 
> > AFAIU the pmd/pud paths don't consider is_cow_mapping() because normally we
> > will not duplicate pgtables in fork() for most shared (!CoW) file mappings.
> > Please refer to vma_needs_copy(), and the comment before the final "return
> > false".  The check there isn't strictly is_cow_mapping(), as it looks at
> > anon_vma instead, but it's mostly equivalent: it additionally covers
> > MAP_PRIVATE file mappings where no CoW has happened yet (if CoW had
> > happened, an anon_vma would already be present).
> > 
> > There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
> > Userfault & mixedmap are not involved in this series at all, so let's
> > discuss pfnmaps.
> > 
> > It means fork() can still copy pgtables for pfnmap vmas, and that is
> > relevant to this series: before this series pfnmaps only existed at the pte
> > level, hence the is_cow_mapping() check must exist at the pte level, as you
> > described, so those entries are handled properly.  Note that the pte
> > processing also checks pte_write() to make sure the entry is a CoWed page
> > rather than, e.g., a read-only page cache page or pfnmap.
> > 
> > Meanwhile, since pfnmaps can't appear at the pmd/pud level (before this
> > series), I think it's fair for the pmd/pud paths to assume that any huge
> > mapping they see must be MAP_PRIVATE, because otherwise the whole
> > copy_page_range() would already have been skipped.  IOW they only need to
> > process CoWed pages there, and those pages need the write bit removed in
> > both parent and child at fork().
> Is it also based on the fact that there are no MAP_SHARED huge DEVMAP pages so far?

Correct.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 18/19] mm/arm64: Support large pfn mappings
  2024-08-26 20:43 ` [PATCH v2 18/19] mm/arm64: " Peter Xu
@ 2025-03-19 22:22   ` Keith Busch
  2025-03-19 22:46     ` Peter Xu
  0 siblings, 1 reply; 69+ messages in thread
From: Keith Busch @ 2025-03-19 22:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Mon, Aug 26, 2024 at 04:43:52PM -0400, Peter Xu wrote:
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +#define pud_special(pte)	pte_special(pud_pte(pud))
> +#define pud_mkspecial(pte)	pte_pud(pte_mkspecial(pud_pte(pud)))
> +#endif

Sorry for such a late reply, but this looked a bit weird as I'm doing
some backporting. Not that I'm actually interested in this arch, so I
can't readily test this, but I believe the intention was to name the
macro argument "pud", not "pte".
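
I.e., presumably the intended definitions would be:

  #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
  #define pud_special(pud)	pte_special(pud_pte(pud))
  #define pud_mkspecial(pud)	pte_pud(pte_mkspecial(pud_pte(pud)))
  #endif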


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 18/19] mm/arm64: Support large pfn mappings
  2025-03-19 22:22   ` Keith Busch
@ 2025-03-19 22:46     ` Peter Xu
  2025-03-19 22:53       ` Keith Busch
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Xu @ 2025-03-19 22:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Wed, Mar 19, 2025 at 04:22:04PM -0600, Keith Busch wrote:
> On Mon, Aug 26, 2024 at 04:43:52PM -0400, Peter Xu wrote:
> > +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> > +#define pud_special(pte)	pte_special(pud_pte(pud))
> > +#define pud_mkspecial(pte)	pte_pud(pte_mkspecial(pud_pte(pud)))
> > +#endif
> 
> Sorry for such a late reply, but this looked a bit weird as I'm doing
> some backporting. Not that I'm actually interested in this arch, so I
> can't readily test this, but I believe the intention was to name the
> macro argument "pud", not "pte".

Probably nobody can test it yet: I don't see aarch64 selecting
HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD, which means IIUC these two lines (before
PUD is enabled on aarch64) won't be compiled.  That also matches the test
results in the cover letter, where we only tried pmd on arm.

The patch will still be needed for pmd to work, though.

I can draft a patch to change this, but considering arm's PUD support isn't
there anyway, maybe I should instead draft a change to remove these lines as
dead code for now, and see if anyone would like to collect it.

Thanks for reporting this.  I'll prepare something soon and keep you
posted.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 18/19] mm/arm64: Support large pfn mappings
  2025-03-19 22:46     ` Peter Xu
@ 2025-03-19 22:53       ` Keith Busch
  0 siblings, 0 replies; 69+ messages in thread
From: Keith Busch @ 2025-03-19 22:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Gavin Shan, Catalin Marinas, x86,
	Ingo Molnar, Andrew Morton, Paolo Bonzini, Dave Hansen,
	Thomas Gleixner, Alistair Popple, kvm, linux-arm-kernel,
	Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
	Borislav Petkov, Zi Yan, Axel Rasmussen, David Hildenbrand,
	Yan Zhao, Will Deacon, Kefeng Wang, Alex Williamson

On Wed, Mar 19, 2025 at 06:46:35PM -0400, Peter Xu wrote:
> On Wed, Mar 19, 2025 at 04:22:04PM -0600, Keith Busch wrote:
> > On Mon, Aug 26, 2024 at 04:43:52PM -0400, Peter Xu wrote:
> > > +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> > > +#define pud_special(pte)	pte_special(pud_pte(pud))
> > > +#define pud_mkspecial(pte)	pte_pud(pte_mkspecial(pud_pte(pud)))
> > > +#endif
> > 
> > Sorry for such a late reply, but this looked a bit weird as I'm doing
> > some backporting. Not that I'm actually interested in this arch, so I
> > can't readily test this, but I believe the intention was to name the
> > macro argument "pud", not "pte".
> 
> Probably nobody can test it yet: I don't see aarch64 selecting
> HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD, which means IIUC these two lines (before
> PUD is enabled on aarch64) won't be compiled.  That also matches the test
> results in the cover letter, where we only tried pmd on arm.
> 
> The patch will still be needed for pmd to work, though.
> 
> I can draft a patch to change this, but considering arm's PUD support isn't
> there anyway, maybe I should instead draft a change to remove these lines as
> dead code for now, and see if anyone would like to collect it.
> 
> Thanks for reporting this.  I'll prepare something soon and keep you
> posted.

Thanks, good to know it wasn't reachable.

The reason I was concerned is that the only caller, insert_pfn_pud(),
just so happens to have a variable named "pud" in scope, but it's not
the pud passed into the macro.  So if this macro were used, it would
happen to compile by that coincidence, but it would be using the wrong pud.
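
A hypothetical illustration of the hazard (not the real insert_pfn_pud()
body, just the shape of the problem):

  pud_t pud = *pudp;      /* some unrelated 'pud' that happens to be in scope */
  pud_t entry = pud_mkhuge(pfn_t_pud(pfn, prot));

  /* The buggy macro ignores its 'entry' argument and expands against
   * the local 'pud' above, so this compiles but is wrong. */
  entry = pud_mkspecial(entry);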


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2025-03-19 22:53 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
2024-08-26 20:43 ` [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
2024-08-26 20:43 ` [PATCH v2 02/19] mm: Drop is_huge_zero_pud() Peter Xu
2024-08-26 20:43 ` [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
2024-08-28 15:31   ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
2024-08-28 15:31   ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
2024-08-26 20:43 ` [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start() Peter Xu
2024-08-28  7:44   ` David Hildenbrand
2024-08-28 14:24     ` Peter Xu
2024-08-28 15:30       ` David Hildenbrand
2024-08-28 19:45         ` Peter Xu
2024-08-28 23:46           ` Jason Gunthorpe
2024-08-29  6:35             ` David Hildenbrand
2024-08-29 18:45               ` Peter Xu
2024-08-29 15:10           ` David Hildenbrand
2024-08-29 18:49             ` Peter Xu
2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
2024-08-29 15:10   ` David Hildenbrand
2024-08-29 18:26     ` Peter Xu
2024-08-29 19:44       ` David Hildenbrand
2024-08-29 20:01         ` Peter Xu
2024-09-02  7:58   ` Yan Zhao
2024-09-03 21:23     ` Peter Xu
2024-09-09 22:25       ` Andrew Morton
2024-09-09 22:43         ` Peter Xu
2024-09-09 23:15           ` Andrew Morton
2024-09-10  0:08             ` Peter Xu
2024-09-10  2:52               ` Yan Zhao
2024-09-10 12:16                 ` Peter Xu
2024-09-11  2:16                   ` Yan Zhao
2024-09-11 14:34                     ` Peter Xu
2024-08-26 20:43 ` [PATCH v2 08/19] mm: Always define pxx_pgprot() Peter Xu
2024-08-26 20:43 ` [PATCH v2 09/19] mm: New follow_pfnmap API Peter Xu
2024-08-26 20:43 ` [PATCH v2 10/19] KVM: Use " Peter Xu
2024-08-26 20:43 ` [PATCH v2 11/19] s390/pci_mmio: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 12/19] mm/x86/pat: Use the new " Peter Xu
2024-08-26 20:43 ` [PATCH v2 13/19] vfio: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 14/19] acrn: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 15/19] mm/access_process_vm: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 16/19] mm: Remove follow_pte() Peter Xu
2024-09-01  4:33   ` Yu Zhao
2024-09-01 13:39     ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 17/19] mm/x86: Support large pfn mappings Peter Xu
2024-08-26 20:43 ` [PATCH v2 18/19] mm/arm64: " Peter Xu
2025-03-19 22:22   ` Keith Busch
2025-03-19 22:46     ` Peter Xu
2025-03-19 22:53       ` Keith Busch
2024-08-26 20:43 ` [PATCH v2 19/19] vfio/pci: Implement huge_fault support Peter Xu
2024-08-27 22:36 ` [PATCH v2 00/19] mm: Support huge pfnmaps Jiaqi Yan
2024-08-27 22:57   ` Peter Xu
2024-08-28  0:42     ` Jiaqi Yan
2024-08-28  0:46       ` Jiaqi Yan
2024-08-28 14:24       ` Jason Gunthorpe
2024-08-28 16:10         ` Jiaqi Yan
2024-08-28 23:49           ` Jason Gunthorpe
2024-08-29 19:21             ` Jiaqi Yan
2024-09-04 15:52               ` Jason Gunthorpe
2024-09-04 16:38                 ` Jiaqi Yan
2024-09-04 16:43                   ` Jason Gunthorpe
2024-09-04 16:58                     ` Jiaqi Yan
2024-09-04 17:00                       ` Jason Gunthorpe
2024-09-04 17:07                         ` Jiaqi Yan
2024-09-09  3:56                           ` Ankit Agrawal
2024-08-28 14:41       ` Peter Xu
2024-08-28 16:23         ` Jiaqi Yan
2024-09-09  4:03 ` Ankit Agrawal
2024-09-09 15:03   ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).