* [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-09 16:34 ` David Hildenbrand
2024-08-09 16:08 ` [PATCH 02/19] mm: Drop is_huge_zero_pud() Peter Xu
` (19 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
This patch introduces the option to bring the special bit, so far only used
in ptes, to pmds/puds. Archs can start to define pmd_special / pud_special
when supported, by selecting the new option. Per-arch support will be added
later.
Before that, create fallbacks for these helpers so that they are always
available.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/mm.h | 24 ++++++++++++++++++++++++
mm/Kconfig | 13 +++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 43b40334e9b2..90ca84200800 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2644,6 +2644,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
}
#endif
+#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static inline bool pmd_special(pmd_t pmd)
+{
+ return false;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pmd;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
+
+#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+static inline bool pud_special(pud_t pud)
+{
+ return false;
+}
+
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+ return pud;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
+
#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
static inline int pte_devmap(pte_t pte)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index 3936fe4d26d9..3db0eebb53e2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -881,6 +881,19 @@ endif # TRANSPARENT_HUGEPAGE
config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+# TODO: Allow to be enabled without THP
+config ARCH_SUPPORTS_HUGE_PFNMAP
+ def_bool n
+ depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_SUPPORTS_PMD_PFNMAP
+ def_bool y
+ depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
+
+config ARCH_SUPPORTS_PUD_PFNMAP
+ def_bool y
+ depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+
#
# UP and nommu archs use km based percpu allocator
#
--
2.45.0
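
For illustration only (this is not quoted from the later per-arch patches in
this series): an architecture that already keeps a software special bit, such
as x86 with _PAGE_SPECIAL, would select ARCH_SUPPORTS_HUGE_PFNMAP from its
Kconfig and then override the fallbacks above with helpers roughly like the
sketch below. Names and details here are assumptions; the per-arch patches
later in the series are the authoritative version.

static inline bool pmd_special(pmd_t pmd)
{
	/* Sketch: reuse the pte-level software special bit for huge entries. */
	return pmd_flags(pmd) & _PAGE_SPECIAL;
}

static inline pmd_t pmd_mkspecial(pmd_t pmd)
{
	return pmd_set_flags(pmd, _PAGE_SPECIAL);
}

static inline bool pud_special(pud_t pud)
{
	return pud_flags(pud) & _PAGE_SPECIAL;
}

static inline pud_t pud_mkspecial(pud_t pud)
{
	return pud_set_flags(pud, _PAGE_SPECIAL);
}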
* Re: [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
2024-08-09 16:08 ` [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
@ 2024-08-09 16:34 ` David Hildenbrand
2024-08-09 17:16 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:34 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:08, Peter Xu wrote:
> This patch introduces the option to introduce special pte bit into
> pmd/puds. Archs can start to define pmd_special / pud_special when
> supported by selecting the new option. Per-arch support will be added
> later.
>
> Before that, create fallbacks for these helpers so that they are always
> available.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/mm.h | 24 ++++++++++++++++++++++++
> mm/Kconfig | 13 +++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 43b40334e9b2..90ca84200800 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2644,6 +2644,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
> }
> #endif
>
> +#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> +static inline bool pmd_special(pmd_t pmd)
> +{
> + return false;
> +}
> +
> +static inline pmd_t pmd_mkspecial(pmd_t pmd)
> +{
> + return pmd;
> +}
> +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
> +
> +#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +static inline bool pud_special(pud_t pud)
> +{
> + return false;
> +}
> +
> +static inline pud_t pud_mkspecial(pud_t pud)
> +{
> + return pud;
> +}
> +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
> +
> #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
> static inline int pte_devmap(pte_t pte)
> {
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 3936fe4d26d9..3db0eebb53e2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -881,6 +881,19 @@ endif # TRANSPARENT_HUGEPAGE
> config PGTABLE_HAS_HUGE_LEAVES
> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>
> +# TODO: Allow to be enabled without THP
> +config ARCH_SUPPORTS_HUGE_PFNMAP
> + def_bool n
> + depends on TRANSPARENT_HUGEPAGE
> +
> +config ARCH_SUPPORTS_PMD_PFNMAP
> + def_bool y
> + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
> +
> +config ARCH_SUPPORTS_PUD_PFNMAP
> + def_bool y
> + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +
> #
> # UP and nommu archs use km based percpu allocator
> #
As noted in reply to other patches, I think you have to take care of
vm_normal_page_pmd() [if not done in another patch I am missing] and
likely you want to introduce vm_normal_page_pud().
--
Cheers,
David / dhildenb
* Re: [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
2024-08-09 16:34 ` David Hildenbrand
@ 2024-08-09 17:16 ` Peter Xu
2024-08-09 18:06 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 17:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 06:34:15PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > This patch introduces the option to introduce special pte bit into
> > pmd/puds. Archs can start to define pmd_special / pud_special when
> > supported by selecting the new option. Per-arch support will be added
> > later.
> >
> > Before that, create fallbacks for these helpers so that they are always
> > available.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/linux/mm.h | 24 ++++++++++++++++++++++++
> > mm/Kconfig | 13 +++++++++++++
> > 2 files changed, 37 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 43b40334e9b2..90ca84200800 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2644,6 +2644,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
> > }
> > #endif
> > +#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> > +static inline bool pmd_special(pmd_t pmd)
> > +{
> > + return false;
> > +}
> > +
> > +static inline pmd_t pmd_mkspecial(pmd_t pmd)
> > +{
> > + return pmd;
> > +}
> > +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
> > +
> > +#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> > +static inline bool pud_special(pud_t pud)
> > +{
> > + return false;
> > +}
> > +
> > +static inline pud_t pud_mkspecial(pud_t pud)
> > +{
> > + return pud;
> > +}
> > +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
> > +
> > #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
> > static inline int pte_devmap(pte_t pte)
> > {
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 3936fe4d26d9..3db0eebb53e2 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -881,6 +881,19 @@ endif # TRANSPARENT_HUGEPAGE
> > config PGTABLE_HAS_HUGE_LEAVES
> > def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
> > +# TODO: Allow to be enabled without THP
> > +config ARCH_SUPPORTS_HUGE_PFNMAP
> > + def_bool n
> > + depends on TRANSPARENT_HUGEPAGE
> > +
> > +config ARCH_SUPPORTS_PMD_PFNMAP
> > + def_bool y
> > + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
> > +
> > +config ARCH_SUPPORTS_PUD_PFNMAP
> > + def_bool y
> > + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> > +
> > #
> > # UP and nommu archs use km based percpu allocator
> > #
>
> As noted in reply to other patches, I think you have to take care of
> vm_normal_page_pmd() [if not done in another patch I am missing] and likely
> you want to introduce vm_normal_page_pud().
So far this patch may not have direct involvement with vm_normal_page_pud()
yet? Anyway, let's keep the discussion there, then we'll know how to move
on.
Thanks,
--
Peter Xu
* Re: [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
2024-08-09 17:16 ` Peter Xu
@ 2024-08-09 18:06 ` David Hildenbrand
0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 18:06 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 19:16, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 06:34:15PM +0200, David Hildenbrand wrote:
>> On 09.08.24 18:08, Peter Xu wrote:
>>> This patch introduces the option to introduce special pte bit into
>>> pmd/puds. Archs can start to define pmd_special / pud_special when
>>> supported by selecting the new option. Per-arch support will be added
>>> later.
>>>
>>> Before that, create fallbacks for these helpers so that they are always
>>> available.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> include/linux/mm.h | 24 ++++++++++++++++++++++++
>>> mm/Kconfig | 13 +++++++++++++
>>> 2 files changed, 37 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 43b40334e9b2..90ca84200800 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2644,6 +2644,30 @@ static inline pte_t pte_mkspecial(pte_t pte)
>>> }
>>> #endif
>>> +#ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
>>> +static inline bool pmd_special(pmd_t pmd)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static inline pmd_t pmd_mkspecial(pmd_t pmd)
>>> +{
>>> + return pmd;
>>> +}
>>> +#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
>>> +
>>> +#ifndef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
>>> +static inline bool pud_special(pud_t pud)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static inline pud_t pud_mkspecial(pud_t pud)
>>> +{
>>> + return pud;
>>> +}
>>> +#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
>>> +
>>> #ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
>>> static inline int pte_devmap(pte_t pte)
>>> {
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 3936fe4d26d9..3db0eebb53e2 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -881,6 +881,19 @@ endif # TRANSPARENT_HUGEPAGE
>>> config PGTABLE_HAS_HUGE_LEAVES
>>> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>>> +# TODO: Allow to be enabled without THP
>>> +config ARCH_SUPPORTS_HUGE_PFNMAP
>>> + def_bool n
>>> + depends on TRANSPARENT_HUGEPAGE
>>> +
>>> +config ARCH_SUPPORTS_PMD_PFNMAP
>>> + def_bool y
>>> + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE
>>> +
>>> +config ARCH_SUPPORTS_PUD_PFNMAP
>>> + def_bool y
>>> + depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +
>>> #
>>> # UP and nommu archs use km based percpu allocator
>>> #
>>
>> As noted in reply to other patches, I think you have to take care of
>> vm_normal_page_pmd() [if not done in another patch I am missing] and likely
>> you want to introduce vm_normal_page_pud().
>
> So far this patch may not have direct involvement with vm_normal_page_pud()
> yet? Anyway, let's keep the discussion there, then we'll know how to move
> on.
vm_normal_page_pud() might make sense as of today already, primarily to
wrap the pud_devmap() stuff (maybe that is gone soon, who knows).
Anyhow, I can send a patch to add that as well.
--
Cheers,
David / dhildenb
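
For illustration (not a patch from this thread): a minimal vm_normal_page_pud()
along the lines David offers could look roughly like the sketch below, modeled
on vm_normal_page_pmd() and assuming the pud_special() helper from patch 01.
A real helper would likely also mirror the VM_PFNMAP/VM_MIXEDMAP vma checks
that vm_normal_page_pmd() carries.

struct page *vm_normal_page_pud(struct vm_area_struct *vma,
				unsigned long addr, pud_t pud)
{
	unsigned long pfn = pud_pfn(pud);

	/* Special huge mappings (e.g. PFNMAP) have no struct page to return. */
	if (pud_special(pud))
		return NULL;
	/* Devmap entries are not "normal" pages either. */
	if (pud_devmap(pud))
		return NULL;
	if (unlikely(pfn > highest_memmap_pfn))
		return NULL;
	return pfn_to_page(pfn);
}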
* [PATCH 02/19] mm: Drop is_huge_zero_pud()
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
2024-08-09 16:08 ` [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-09 16:34 ` David Hildenbrand
2024-08-14 12:38 ` Jason Gunthorpe
2024-08-09 16:08 ` [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
` (18 subsequent siblings)
20 siblings, 2 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Matthew Wilcox, Aneesh Kumar K . V
It has constantly returned false since 2017. One assertion was added in 2019
but it should never have triggered; IOW, what is checked there should be
asserted instead.

If it hasn't been needed for 7 years, maybe it's a good idea to remove it and
only add it back when the need arises.
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 10 ----------
mm/huge_memory.c | 13 +------------
2 files changed, 1 insertion(+), 22 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6370026689e0..2121060232ce 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -421,11 +421,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
}
-static inline bool is_huge_zero_pud(pud_t pud)
-{
- return false;
-}
-
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
void mm_put_huge_zero_folio(struct mm_struct *mm);
@@ -566,11 +561,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
return false;
}
-static inline bool is_huge_zero_pud(pud_t pud)
-{
- return false;
-}
-
static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
{
return;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0aafd26d7a53..39c401a62e87 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1245,10 +1245,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
if (write) {
- if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
- WARN_ON_ONCE(!is_huge_zero_pud(*pud));
+ if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
goto out_unlock;
- }
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
@@ -1496,15 +1494,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
goto out_unlock;
- /*
- * When page table lock is held, the huge zero pud should not be
- * under splitting since we don't split the page itself, only pud to
- * a page table.
- */
- if (is_huge_zero_pud(pud)) {
- /* No huge zero pud yet */
- }
-
/*
* TODO: once we support anonymous pages, use
* folio_try_dup_anon_rmap_*() and split if duplicating fails.
--
2.45.0
* Re: [PATCH 02/19] mm: Drop is_huge_zero_pud()
2024-08-09 16:08 ` [PATCH 02/19] mm: Drop is_huge_zero_pud() Peter Xu
@ 2024-08-09 16:34 ` David Hildenbrand
2024-08-14 12:38 ` Jason Gunthorpe
1 sibling, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:34 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao, Matthew Wilcox,
Aneesh Kumar K . V
On 09.08.24 18:08, Peter Xu wrote:
> It constantly returns false since 2017. One assertion is added in 2019 but
> it should never have triggered, IOW it means what is checked should be
> asserted instead.
>
> If it didn't exist for 7 years maybe it's good idea to remove it and only
> add it when it comes.
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
Out with it
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH 02/19] mm: Drop is_huge_zero_pud()
2024-08-09 16:08 ` [PATCH 02/19] mm: Drop is_huge_zero_pud() Peter Xu
2024-08-09 16:34 ` David Hildenbrand
@ 2024-08-14 12:38 ` Jason Gunthorpe
1 sibling, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:38 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Matthew Wilcox, Aneesh Kumar K . V
On Fri, Aug 09, 2024 at 12:08:52PM -0400, Peter Xu wrote:
> It constantly returns false since 2017. One assertion is added in 2019 but
> it should never have triggered, IOW it means what is checked should be
> asserted instead.
>
> If it didn't exist for 7 years maybe it's good idea to remove it and only
> add it when it comes.
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/huge_mm.h | 10 ----------
> mm/huge_memory.c | 13 +------------
> 2 files changed, 1 insertion(+), 22 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
2024-08-09 16:08 ` [PATCH 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
2024-08-09 16:08 ` [PATCH 02/19] mm: Drop is_huge_zero_pud() Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-14 12:40 ` Jason Gunthorpe
2024-08-09 16:08 ` [PATCH 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
` (17 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
We need these special bits to be around to enable gup-fast on pfnmaps.
Mark properly for !devmap case, reflecting that there's no page struct
backing the entry.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/huge_memory.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 39c401a62e87..e95b3a468aee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1162,6 +1162,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
if (pfn_t_devmap(pfn))
entry = pmd_mkdevmap(entry);
+ else
+ entry = pmd_mkspecial(entry);
if (write) {
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
@@ -1258,6 +1260,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
if (pfn_t_devmap(pfn))
entry = pud_mkdevmap(entry);
+ else
+ entry = pud_mkspecial(entry);
if (write) {
entry = pud_mkyoung(pud_mkdirty(entry));
entry = maybe_pud_mkwrite(entry, vma);
--
2.45.0
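
The two hunks above mirror what the pte-level insert_pfn() in mm/memory.c has
been doing all along (the same snippet is quoted later in this thread): mark
the entry as devmap when there is a pgmap behind it, otherwise mark it
special.

	if (pfn_t_devmap(pfn))
		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
	else
		entry = pte_mkspecial(pfn_t_pte(pfn, prot));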
* Re: [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject
2024-08-09 16:08 ` [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
@ 2024-08-14 12:40 ` Jason Gunthorpe
2024-08-14 15:23 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:40 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:08:53PM -0400, Peter Xu wrote:
> We need these special bits to be around to enable gup-fast on pfnmaps.
It is not gup-fast you are after but follow_pfn/etc for KVM usage
right?
GUP family of functions should all fail on pfnmaps.
> Mark properly for !devmap case, reflecting that there's no page struct
> backing the entry.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/huge_memory.c | 4 ++++
> 1 file changed, 4 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject
2024-08-14 12:40 ` Jason Gunthorpe
@ 2024-08-14 15:23 ` Peter Xu
2024-08-14 15:53 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-14 15:23 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 09:40:00AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:53PM -0400, Peter Xu wrote:
> > We need these special bits to be around to enable gup-fast on pfnmaps.
>
> It is not gup-fast you are after but follow_pfn/etc for KVM usage
> right?
Gup-fast needs it to make sure we don't pmd_page() it and fail early. So
it's still needed in some form..
But yeah, this comment is ambiguous and not describing the whole picture,
as multiple places will so far rely on this bit, e.g. fork() to identify a
private page or pfnmap. Similarly we'll do that in folio_walk_start(), and
follow_pfnmap. I plan to simplify that to:
We need these special bits to be around on pfnmaps. Mark properly for
!devmap case, reflecting that there's no page struct backing the entry.
>
> GUP family of functions should all fail on pfnmaps.
>
> > Mark properly for !devmap case, reflecting that there's no page struct
> > backing the entry.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > mm/huge_memory.c | 4 ++++
> > 1 file changed, 4 insertions(+)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
So I'll tentatively take this with the amended commit message, unless
there's objection.
Thanks,
--
Peter Xu
* Re: [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject
2024-08-14 15:23 ` Peter Xu
@ 2024-08-14 15:53 ` Jason Gunthorpe
0 siblings, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 15:53 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 11:23:41AM -0400, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 09:40:00AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:53PM -0400, Peter Xu wrote:
> > > We need these special bits to be around to enable gup-fast on pfnmaps.
> >
> > It is not gup-fast you are after but follow_pfn/etc for KVM usage
> > right?
>
> Gup-fast needs it to make sure we don't pmd_page() it and fail early. So
> still needed in some sort..
Yes, but making gup-fast fail is not "enabling" it :)
> But yeah, this comment is ambiguous and not describing the whole picture,
> as multiple places will so far rely this bit, e.g. fork() to identify a
> private page or pfnmap. Similarly we'll do that in folio_walk_start(), and
> follow_pfnmap. I plan to simplify that to:
>
> We need these special bits to be around on pfnmaps. Mark properly for
> !devmap case, reflecting that there's no page struct backing the entry.
Yes
Jason
* [PATCH 04/19] mm: Allow THP orders for PFNMAPs
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (2 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-14 12:40 ` Jason Gunthorpe
2024-08-09 16:08 ` [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
` (16 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Matthew Wilcox, Ryan Roberts
This enables PFNMAPs to be mapped at either the pmd or the pud level.
Generalize the dax case into vma_is_special_huge() so as to cover both.
Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL.
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 6 +++---
mm/huge_memory.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2121060232ce..984cbc960d8b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -76,9 +76,9 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
/*
* Mask of all large folio orders supported for file THP. Folios in a DAX
* file is never split and the MAX_PAGECACHE_ORDER limit does not apply to
- * it.
+ * it. Same to PFNMAPs where there's neither page* nor pagecache.
*/
-#define THP_ORDERS_ALL_FILE_DAX \
+#define THP_ORDERS_ALL_SPECIAL \
(BIT(PMD_ORDER) | BIT(PUD_ORDER))
#define THP_ORDERS_ALL_FILE_DEFAULT \
((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
@@ -87,7 +87,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
* Mask of all large folio orders supported for THP.
*/
#define THP_ORDERS_ALL \
- (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
+ (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
#define TVA_SMAPS (1 << 0) /* Will be used for procfs */
#define TVA_IN_PF (1 << 1) /* Page fault handler */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e95b3a468aee..6568586b21ab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -95,8 +95,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
/* Check the intersection of requested and supported orders. */
if (vma_is_anonymous(vma))
supported_orders = THP_ORDERS_ALL_ANON;
- else if (vma_is_dax(vma))
- supported_orders = THP_ORDERS_ALL_FILE_DAX;
+ else if (vma_is_special_huge(vma))
+ supported_orders = THP_ORDERS_ALL_SPECIAL;
else
supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
--
2.45.0
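
For reference, vma_is_special_huge() is what already distinguishes these vmas;
at the time of this series it reads roughly as below, which is why switching
to it covers both the dax and the pfnmap/mixedmap cases mentioned in the
commit message.

static inline bool vma_is_special_huge(struct vm_area_struct *vma)
{
	/* DAX files, plus file mappings of raw pfns (PFNMAP/MIXEDMAP). */
	return vma_is_dax(vma) || (vma->vm_file &&
				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}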
* Re: [PATCH 04/19] mm: Allow THP orders for PFNMAPs
2024-08-09 16:08 ` [PATCH 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
@ 2024-08-14 12:40 ` Jason Gunthorpe
0 siblings, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:40 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Matthew Wilcox, Ryan Roberts
On Fri, Aug 09, 2024 at 12:08:54PM -0400, Peter Xu wrote:
> This enables PFNMAPs to be mapped at either pmd/pud layers. Generalize the
> dax case into vma_is_special_huge() so as to cover both. Meanwhile, rename
> the macro to THP_ORDERS_ALL_SPECIAL.
>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/huge_mm.h | 6 +++---
> mm/huge_memory.c | 4 ++--
> 2 files changed, 5 insertions(+), 5 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (3 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-09 16:23 ` David Hildenbrand
2024-08-14 12:41 ` Jason Gunthorpe
2024-08-09 16:08 ` [PATCH 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
` (15 subsequent siblings)
20 siblings, 2 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Since gup-fast doesn't have the vma reference, teach it to detect such huge
pfnmaps by checking the special bit for pmd/pud too, just like ptes.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/gup.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index d19884e097fd..a49f67a512ee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3038,6 +3038,9 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
+ if (pmd_special(orig))
+ return 0;
+
if (pmd_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
@@ -3082,6 +3085,9 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
if (!pud_access_permitted(orig, flags & FOLL_WRITE))
return 0;
+ if (pud_special(orig))
+ return 0;
+
if (pud_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
--
2.45.0
* Re: [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-09 16:08 ` [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
@ 2024-08-09 16:23 ` David Hildenbrand
2024-08-09 16:59 ` Peter Xu
2024-08-14 12:41 ` Jason Gunthorpe
1 sibling, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:23 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:08, Peter Xu wrote:
> Since gup-fast doesn't have the vma reference, teach it to detect such huge
> pfnmaps by checking the special bit for pmd/pud too, just like ptes.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/gup.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index d19884e097fd..a49f67a512ee 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3038,6 +3038,9 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
> return 0;
>
> + if (pmd_special(orig))
> + return 0;
> +
> if (pmd_devmap(orig)) {
> if (unlikely(flags & FOLL_LONGTERM))
> return 0;
> @@ -3082,6 +3085,9 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
> if (!pud_access_permitted(orig, flags & FOLL_WRITE))
> return 0;
>
> + if (pud_special(orig))
> + return 0;
> +
> if (pud_devmap(orig)) {
> if (unlikely(flags & FOLL_LONGTERM))
> return 0;
In gup_fast_pte_range() we check after checking pte_devmap(). Do we want
to do it in a similar fashion here, or is there a reason to do it
differently?
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
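
For reference, the pte-level ordering David refers to in gup_fast_pte_range()
is roughly the following (paraphrased and simplified from mm/gup.c, error
handling trimmed): the devmap case is handled first, and pte_special() only
rejects whatever is left.

	if (pte_devmap(pte)) {
		if (unlikely(flags & FOLL_LONGTERM))
			goto pte_unmap;
		pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
		if (unlikely(!pgmap))
			goto pte_unmap;		/* error path simplified here */
	} else if (pte_special(pte))
		goto pte_unmap;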
* Re: [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-09 16:23 ` David Hildenbrand
@ 2024-08-09 16:59 ` Peter Xu
2024-08-14 12:42 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 06:23:53PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Since gup-fast doesn't have the vma reference, teach it to detect such huge
> > pfnmaps by checking the special bit for pmd/pud too, just like ptes.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > mm/gup.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index d19884e097fd..a49f67a512ee 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -3038,6 +3038,9 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> > if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
> > return 0;
> > + if (pmd_special(orig))
> > + return 0;
> > +
> > if (pmd_devmap(orig)) {
> > if (unlikely(flags & FOLL_LONGTERM))
> > return 0;
> > @@ -3082,6 +3085,9 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
> > if (!pud_access_permitted(orig, flags & FOLL_WRITE))
> > return 0;
> > + if (pud_special(orig))
> > + return 0;
> > +
> > if (pud_devmap(orig)) {
> > if (unlikely(flags & FOLL_LONGTERM))
> > return 0;
>
> In gup_fast_pte_range() we check after checking pte_devmap(). Do we want to
> do it in a similar fashion here, or is there a reason to do it differently?
IIUC they should behave the same, as the two should be mutually exclusive so
far. E.g. see insert_pfn():
if (pfn_t_devmap(pfn))
entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
else
entry = pte_mkspecial(pfn_t_pte(pfn, prot));
It might change for sure if Alistair moves on with the devmap work, though..
these two are always processed together now, so I hope that won't add much
burden whichever series lands first; then we may need some care when merging
them. I don't expect anything too tricky in the merge if it's just about
removal of the devmap bits.
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks, I'll take this one first.
--
Peter Xu
* Re: [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-09 16:59 ` Peter Xu
@ 2024-08-14 12:42 ` Jason Gunthorpe
2024-08-14 15:34 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:42 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:59:40PM -0400, Peter Xu wrote:
> > In gup_fast_pte_range() we check after checking pte_devmap(). Do we want to
> > do it in a similar fashion here, or is there a reason to do it differently?
>
> IIUC they should behave the same, as the two should be mutual exclusive so
> far. E.g. see insert_pfn():
Yes, agree no functional difference, but David has a point to try to
keep the logic structurally the same in all pte/pmd/pud copies.
> if (pfn_t_devmap(pfn))
> entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> else
> entry = pte_mkspecial(pfn_t_pte(pfn, prot));
>
> It might change for sure if Alistair move on with the devmap work, though..
> these two always are processed together now, so I hope that won't add much
> burden which series will land first, then we may need some care on merging
> them. I don't expect anything too tricky in merge if that was about
> removal of the devmap bits.
Removing pte_mkdevmap can only make things simpler :)
Jason
* Re: [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-14 12:42 ` Jason Gunthorpe
@ 2024-08-14 15:34 ` Peter Xu
0 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-14 15:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 09:42:28AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:59:40PM -0400, Peter Xu wrote:
> > > In gup_fast_pte_range() we check after checking pte_devmap(). Do we want to
> > > do it in a similar fashion here, or is there a reason to do it differently?
> >
> > IIUC they should behave the same, as the two should be mutual exclusive so
> > far. E.g. see insert_pfn():
>
> Yes, agree no functional difference, but David has a point to try to
> keep the logic structurally the same in all pte/pmd/pud copies.
OK, let me reorder them if that helps.
>
> > if (pfn_t_devmap(pfn))
> > entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
> > else
> > entry = pte_mkspecial(pfn_t_pte(pfn, prot));
> >
> > It might change for sure if Alistair move on with the devmap work, though..
> > these two always are processed together now, so I hope that won't add much
> > burden which series will land first, then we may need some care on merging
> > them. I don't expect anything too tricky in merge if that was about
> > removal of the devmap bits.
>
> Removing pte_mkdevmap can only make things simpler :)
Yep. :)
Thanks,
--
Peter Xu
* Re: [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast
2024-08-09 16:08 ` [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
2024-08-09 16:23 ` David Hildenbrand
@ 2024-08-14 12:41 ` Jason Gunthorpe
1 sibling, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:41 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:08:55PM -0400, Peter Xu wrote:
> Since gup-fast doesn't have the vma reference, teach it to detect such huge
> pfnmaps by checking the special bit for pmd/pud too, just like ptes.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/gup.c | 6 ++++++
> 1 file changed, 6 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (4 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-09 16:32 ` David Hildenbrand
2024-08-09 16:08 ` [PATCH 08/19] mm: Always define pxx_pgprot() Peter Xu
` (14 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Teach the fork code to properly copy pfnmaps at the pmd/pud levels. Pud is
much easier, though the write bit needs to be persisted for writable and
shared pud mappings like PFNMAP ones; otherwise a follow-up write in either
the parent or the child process will trigger a write fault.

Do the same for the pmd level.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/huge_memory.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6568586b21ab..015c9468eed5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pgtable_t pgtable = NULL;
int ret = -ENOMEM;
+ pmd = pmdp_get_lockless(src_pmd);
+ if (unlikely(pmd_special(pmd))) {
+ dst_ptl = pmd_lock(dst_mm, dst_pmd);
+ src_ptl = pmd_lockptr(src_mm, src_pmd);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+ /*
+ * No need to recheck the pmd, it can't change with write
+ * mmap lock held here.
+ */
+ if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ pmd = pmd_wrprotect(pmd);
+ }
+ goto set_pmd;
+ }
+
/* Skip if can be re-fill on fault */
if (!vma_is_anonymous(dst_vma))
return 0;
@@ -1456,7 +1472,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmdp_set_wrprotect(src_mm, addr, src_pmd);
if (!userfaultfd_wp(dst_vma))
pmd = pmd_clear_uffd_wp(pmd);
- pmd = pmd_mkold(pmd_wrprotect(pmd));
+ pmd = pmd_wrprotect(pmd);
+set_pmd:
+ pmd = pmd_mkold(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
ret = 0;
@@ -1502,8 +1520,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* TODO: once we support anonymous pages, use
* folio_try_dup_anon_rmap_*() and split if duplicating fails.
*/
- pudp_set_wrprotect(src_mm, addr, src_pud);
- pud = pud_mkold(pud_wrprotect(pud));
+ if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
+ pudp_set_wrprotect(src_mm, addr, src_pud);
+ pud = pud_wrprotect(pud);
+ }
+ pud = pud_mkold(pud);
set_pud_at(dst_mm, addr, dst_pud, pud);
ret = 0;
--
2.45.0
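
The is_cow_mapping() checks above are what keep the write bit intact for
shared PFNMAP mappings: only private-writable (CoW) mappings get wrprotected
at fork. For reference, the helper is simply:

static inline bool is_cow_mapping(vm_flags_t flags)
{
	/* Writable but not shared: a CoW candidate. */
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}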
* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-09 16:08 ` [PATCH 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
@ 2024-08-09 16:32 ` David Hildenbrand
2024-08-09 17:15 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:32 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:08, Peter Xu wrote:
> Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
> much easier, the write bit needs to be persisted though for writable and
> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> parent or child process will trigger a write fault.
>
> Do the same for pmd level.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6568586b21ab..015c9468eed5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pgtable_t pgtable = NULL;
> int ret = -ENOMEM;
>
> + pmd = pmdp_get_lockless(src_pmd);
> + if (unlikely(pmd_special(pmd))) {
> + dst_ptl = pmd_lock(dst_mm, dst_pmd);
> + src_ptl = pmd_lockptr(src_mm, src_pmd);
> + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> + /*
> + * No need to recheck the pmd, it can't change with write
> + * mmap lock held here.
> + */
> + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> + pmd = pmd_wrprotect(pmd);
> + }
> + goto set_pmd;
> + }
> +
I strongly assume we should be using vm_normal_page_pmd() instead
of pmd_page() further below. pmd_special() should be mostly limited to
GUP-fast and vm_normal_page_pmd().
Again, we should be doing this similar to how we handle PTEs.
I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check,
below: what else should we have here if it's not a migration entry but a
present entry?
Likely this function needs a bit of rework.
--
Cheers,
David / dhildenb
* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-09 16:32 ` David Hildenbrand
@ 2024-08-09 17:15 ` Peter Xu
2024-08-09 17:59 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 17:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
> > much easier, the write bit needs to be persisted though for writable and
> > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > parent or child process will trigger a write fault.
> >
> > Do the same for pmd level.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> > 1 file changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 6568586b21ab..015c9468eed5 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > pgtable_t pgtable = NULL;
> > int ret = -ENOMEM;
> > + pmd = pmdp_get_lockless(src_pmd);
> > + if (unlikely(pmd_special(pmd))) {
> > + dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > + src_ptl = pmd_lockptr(src_mm, src_pmd);
> > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > + /*
> > + * No need to recheck the pmd, it can't change with write
> > + * mmap lock held here.
> > + */
> > + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > + pmd = pmd_wrprotect(pmd);
> > + }
> > + goto set_pmd;
> > + }
> > +
>
> I strongly assume we should be using using vm_normal_page_pmd() instead of
> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> and vm_normal_page_pmd().
One thing to mention is that it has this:
if (!vma_is_anonymous(dst_vma))
return 0;
So it's only about anonymous below that. In that case I feel like the
pmd_page() is benign, and actually good.
Though what you're saying here made me notice my above check doesn't seem
to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
if it's writable for CoW it means it's already an anon.
I think I can probably drop that line there, perhaps with a
VM_WARN_ON_ONCE() making sure it won't happen.
>
> Again, we should be doing this similar to how we handle PTEs.
>
> I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check, below:
> what else should we have here if it's not a migration entry but a present
> entry?
I had a feeling that it was just a safety belt since the 1st day of thp
when Andrea worked that out, so that it'll work with e.g. file truncation
races.
But with current code it looks like it's only anonymous indeed, so looks
not possible at least from that pov.
Thanks,
>
> Likely this function needs a bit of rework.
>
> --
> Cheers,
>
> David / dhildenb
>
--
Peter Xu
* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-09 17:15 ` Peter Xu
@ 2024-08-09 17:59 ` David Hildenbrand
2024-08-12 18:29 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 17:59 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 19:15, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
>> On 09.08.24 18:08, Peter Xu wrote:
>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
>>> much easier, the write bit needs to be persisted though for writable and
>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>> parent or child process will trigger a write fault.
>>>
>>> Do the same for pmd level.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> mm/huge_memory.c | 27 ++++++++++++++++++++++++---
>>> 1 file changed, 24 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 6568586b21ab..015c9468eed5 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>> pgtable_t pgtable = NULL;
>>> int ret = -ENOMEM;
>>> + pmd = pmdp_get_lockless(src_pmd);
>>> + if (unlikely(pmd_special(pmd))) {
>>> + dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>> + src_ptl = pmd_lockptr(src_mm, src_pmd);
>>> + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>> + /*
>>> + * No need to recheck the pmd, it can't change with write
>>> + * mmap lock held here.
>>> + */
>>> + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
>>> + pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>> + pmd = pmd_wrprotect(pmd);
>>> + }
>>> + goto set_pmd;
>>> + }
>>> +
>>
>> I strongly assume we should be using using vm_normal_page_pmd() instead of
>> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
>> and vm_normal_page_pmd().
>
> One thing to mention that it has this:
>
> if (!vma_is_anonymous(dst_vma))
> return 0;
Another obscure thing in this function. It's not the job of
copy_huge_pmd() to make the decision whether to copy, it's the job of
vma_needs_copy() in copy_page_range().
And now I have to suspect that uffd-wp is broken with this function,
because as vma_needs_copy() clearly states, we must copy, and we don't
do that for PMDs. Ugh.
What a mess, we should just do what we do for PTEs and we will be fine ;)
Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
would return "true" for special pfnmap, which is rather "surprising",
but fortunate for us.
Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup
for another day.
>
> So it's only about anonymous below that. In that case I feel like the
> pmd_page() is benign, and actually good.
Yes, it would likely currently work.
>
> Though what you're saying here made me notice my above check doesn't seem
> to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
> pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
> if it's writable for CoW it means it's already an anon.
>
> I think I can probably drop that line there, perhaps with a
> VM_WARN_ON_ONCE() making sure it won't happen.
>
>>
>> Again, we should be doing this similar to how we handle PTEs.
>>
>> I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check, below:
>> what else should we have here if it's not a migration entry but a present
>> entry?
>
> I had a feeling that it was just a safety belt since the 1st day of thp
> when Andrea worked that out, so that it'll work with e.g. file truncation
> races.
>
> But with current code it looks like it's only anonymous indeed, so looks
> not possible at least from that pov.
Yes, as stated above, likely broken with UFFD-WP ...
I really think we should make this code just behave like it would with
PTEs, instead of throwing in more "different" handling.
--
Cheers,
David / dhildenb
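
For reference, the decision David points to lives in vma_needs_copy(), called
from copy_page_range(); paraphrased, it forces a page-table copy for uffd-wp,
pfnmap/mixedmap and anonymous vmas, and lets everything else be refilled by
page faults:

static bool vma_needs_copy(struct vm_area_struct *dst_vma,
			   struct vm_area_struct *src_vma)
{
	/* uffd-wp protection bits live only in the page tables: must copy. */
	if (userfaultfd_wp(dst_vma))
		return true;

	/* pfnmap/mixedmap entries cannot be re-created from page cache. */
	if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return true;

	/* Anonymous memory must be copied (CoW). */
	if (src_vma->anon_vma)
		return true;

	/* Otherwise, let page faults refill the child lazily. */
	return false;
}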
* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-09 17:59 ` David Hildenbrand
@ 2024-08-12 18:29 ` Peter Xu
2024-08-12 18:50 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-12 18:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
> On 09.08.24 19:15, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 18:08, Peter Xu wrote:
> > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
> > > > much easier, the write bit needs to be persisted though for writable and
> > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > parent or child process will trigger a write fault.
> > > >
> > > > Do the same for pmd level.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> > > > 1 file changed, 24 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 6568586b21ab..015c9468eed5 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > pgtable_t pgtable = NULL;
> > > > int ret = -ENOMEM;
> > > > + pmd = pmdp_get_lockless(src_pmd);
> > > > + if (unlikely(pmd_special(pmd))) {
> > > > + dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > > > + src_ptl = pmd_lockptr(src_mm, src_pmd);
> > > > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > > > + /*
> > > > + * No need to recheck the pmd, it can't change with write
> > > > + * mmap lock held here.
> > > > + */
> > > > + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > > > + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > > + pmd = pmd_wrprotect(pmd);
> > > > + }
> > > > + goto set_pmd;
> > > > + }
> > > > +
> > >
> > > I strongly assume we should be using using vm_normal_page_pmd() instead of
> > > pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> > > and vm_normal_page_pmd().
> >
> > One thing to mention that it has this:
> >
> > if (!vma_is_anonymous(dst_vma))
> > return 0;
>
> Another obscure thing in this function. It's not the job of copy_huge_pmd()
> to make the decision whether to copy, it's the job of vma_needs_copy() in
> copy_page_range().
>
> And now I have to suspect that uffd-wp is broken with this function, because
> as vma_needs_copy() clearly states, we must copy, and we don't do that for
> PMDs. Ugh.
>
> What a mess, we should just do what we do for PTEs and we will be fine ;)
IIUC it's not a problem: file uffd-wp is different from anonymous, in that
it pushes everything down to ptes.
It means if we skipped one huge pmd here for file, then it's destined to
have nothing to do with uffd-wp, otherwise it should have already been
split at the first attempt to wr-protect.
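For reference, roughly the relevant bits on the wr-protect side, paraphrased
from mm/mprotect.c's change_pmd_range() path (a condensed sketch from memory,
not a verbatim quote of the current tree):

	/* pte markers only live at the pte level: file vmas must split */
	static inline bool pgtable_split_needed(struct vm_area_struct *vma,
						unsigned long cp_flags)
	{
		return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
	}

	/* ... and in change_pmd_range(): */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
		if ((next - addr != HPAGE_PMD_SIZE) ||
		    pgtable_split_needed(vma, cp_flags)) {
			/* any uffd-wp on a file vma splits the huge pmd */
			__split_huge_pmd(vma, pmd, addr, false, NULL);
			...
		}
	}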
>
> Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
> pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
>
> Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
> would return "true" for special pfnmap, which is rather "surprising", but
> fortunate for us.
It's definitely not surprising to me as that's the plan.. and I thought it
shouldn't be surprising to you - if you remember before I sent this one, I
tried to decouple that here with the "thp agnostic" series:
https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
in which you reviewed it (which I appreciated).
So yes, pfnmap on pmd so far will report pmd_trans_huge==true.
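For reference, x86's check is roughly the following (paraphrased from
arch/x86/include/asm/pgtable.h, so take the exact form with a grain of salt);
it only looks at the PSE/devmap bits, so a special pfnmap pmd qualifies just
the same:

	static inline int pmd_trans_huge(pmd_t pmd)
	{
		return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
	}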
>
> Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
> another day.
Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
much feedback there, and that can further postpone this series from being
posted I'm afraid, then I decided to just move on with "taking pfnmap as
THPs". The corresponding change on this path is here in that series:
https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/
@@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
>
> >
> > So it's only about anonymous below that. In that case I feel like the
> > pmd_page() is benign, and actually good.
>
> Yes, it would likely currently work.
>
> >
> > Though what you're saying here made me notice my above check doesn't seem
> > to be necessary, I mean, "(is_cow_mapping(src_vma->vm_flags) &&
> > pmd_write(pmd))" can't be true when special bit is set, aka, pfnmaps.. and
> > if it's writable for CoW it means it's already an anon.
> >
> > I think I can probably drop that line there, perhaps with a
> > VM_WARN_ON_ONCE() making sure it won't happen.
> >
> > >
> > > Again, we should be doing this similar to how we handle PTEs.
> > >
> > > I'm a bit confused about the "unlikely(!pmd_trans_huge(pmd)" check, below:
> > > what else should we have here if it's not a migration entry but a present
> > > entry?
> >
> > I had a feeling that it was just a safety belt since the 1st day of thp
> > when Andrea worked that out, so that it'll work with e.g. file truncation
> > races.
> >
> > But with current code it looks like it's only anonymous indeed, so looks
> > not possible at least from that pov.
>
> Yes, as stated above, likely broken with UFFD-WP ...
>
> I really think we should make this code just behave like it would with PTEs,
> instead of throwing in more "different" handling.
So it could simply be because file / anon uffd-wp work very differently.
Let me know if you still spot something that is suspicious; in all cases I
guess we can move on with this series, and if you do find something I can
tackle it together with the mremap() issues in the other thread when I get
back to those.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-12 18:29 ` Peter Xu
@ 2024-08-12 18:50 ` David Hildenbrand
2024-08-12 19:05 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-12 18:50 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 12.08.24 20:29, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
>> On 09.08.24 19:15, Peter Xu wrote:
>>> On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
>>>> On 09.08.24 18:08, Peter Xu wrote:
>>>>> Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
>>>>> much easier, the write bit needs to be persisted though for writable and
>>>>> shared pud mappings like PFNMAP ones, otherwise a follow up write in either
>>>>> parent or child process will trigger a write fault.
>>>>>
>>>>> Do the same for pmd level.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>> mm/huge_memory.c | 27 ++++++++++++++++++++++++---
>>>>> 1 file changed, 24 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 6568586b21ab..015c9468eed5 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>>>> pgtable_t pgtable = NULL;
>>>>> int ret = -ENOMEM;
>>>>> + pmd = pmdp_get_lockless(src_pmd);
>>>>> + if (unlikely(pmd_special(pmd))) {
>>>>> + dst_ptl = pmd_lock(dst_mm, dst_pmd);
>>>>> + src_ptl = pmd_lockptr(src_mm, src_pmd);
>>>>> + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>>>> + /*
>>>>> + * No need to recheck the pmd, it can't change with write
>>>>> + * mmap lock held here.
>>>>> + */
>>>>> + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
>>>>> + pmdp_set_wrprotect(src_mm, addr, src_pmd);
>>>>> + pmd = pmd_wrprotect(pmd);
>>>>> + }
>>>>> + goto set_pmd;
>>>>> + }
>>>>> +
>>>>
>>>> I strongly assume we should be using using vm_normal_page_pmd() instead of
>>>> pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
>>>> and vm_normal_page_pmd().
>>>
>>> One thing to mention that it has this:
>>>
>>> if (!vma_is_anonymous(dst_vma))
>>> return 0;
>>
>> Another obscure thing in this function. It's not the job of copy_huge_pmd()
>> to make the decision whether to copy, it's the job of vma_needs_copy() in
>> copy_page_range().
>>
>> And now I have to suspect that uffd-wp is broken with this function, because
>> as vma_needs_copy() clearly states, we must copy, and we don't do that for
>> PMDs. Ugh.
>>
>> What a mess, we should just do what we do for PTEs and we will be fine ;)
>
> IIUC it's not a problem: file uffd-wp is different from anonymous, in that
> it pushes everything down to ptes.
>
> It means if we skipped one huge pmd here for file, then it's destined to
> have nothing to do with uffd-wp, otherwise it should have already been
> split at the first attempt to wr-protect.
Is that also true for UFFD_FEATURE_WP_ASYNC, when we call
pagemap_scan_thp_entry()->make_uffd_wp_pmd() ?
I'm not immediately finding the code that does the "pushes everything
down to ptes", so I might miss that part.
>
>>
>> Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
>> pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
>>
>> Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
>> would return "true" for special pfnmap, which is rather "surprising", but
>> fortunate for us.
>
> It's definitely not surprising to me as that's the plan.. and I thought it
> shouldn't be surprising to you - if you remember before I sent this one, I
> tried to decouple that here with the "thp agnostic" series:
>
> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>
> in which you reviewed it (which I appreciated).
>
> So yes, pfnmap on pmd so far will report pmd_trans_huge==true.
I review way too much stuff to remember everything :) That certainly
screams for a cleanup ...
>
>>
>> Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
>> another day.
>
> Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
> much feedback there, and that can further postpone this series from being
> posted I'm afraid, then I decided to just move on with "taking pfnmap as
> THPs". The corresponding change on this path is here in that series:
>
> https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/
>
> @@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> - if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> - || pmd_devmap(*src_pmd)) {
> + if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
> int err;
> VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
> err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
>
Ah, good.
[...]
>> Yes, as stated above, likely broken with UFFD-WP ...
>>
>> I really think we should make this code just behave like it would with PTEs,
>> instead of throwing in more "different" handling.
>
> So it could simply be because file / anon uffd-wp work very differently.
Or because nobody wants to clean up that code ;)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 07/19] mm/fork: Accept huge pfnmap entries
2024-08-12 18:50 ` David Hildenbrand
@ 2024-08-12 19:05 ` Peter Xu
0 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-12 19:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Mon, Aug 12, 2024 at 08:50:12PM +0200, David Hildenbrand wrote:
> On 12.08.24 20:29, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 07:59:58PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 19:15, Peter Xu wrote:
> > > > On Fri, Aug 09, 2024 at 06:32:44PM +0200, David Hildenbrand wrote:
> > > > > On 09.08.24 18:08, Peter Xu wrote:
> > > > > > Teach the fork code to properly copy pfnmaps for pmd/pud levels. Pud is
> > > > > > much easier, the write bit needs to be persisted though for writable and
> > > > > > shared pud mappings like PFNMAP ones, otherwise a follow up write in either
> > > > > > parent or child process will trigger a write fault.
> > > > > >
> > > > > > Do the same for pmd level.
> > > > > >
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > > mm/huge_memory.c | 27 ++++++++++++++++++++++++---
> > > > > > 1 file changed, 24 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > index 6568586b21ab..015c9468eed5 100644
> > > > > > --- a/mm/huge_memory.c
> > > > > > +++ b/mm/huge_memory.c
> > > > > > @@ -1375,6 +1375,22 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > > > pgtable_t pgtable = NULL;
> > > > > > int ret = -ENOMEM;
> > > > > > + pmd = pmdp_get_lockless(src_pmd);
> > > > > > + if (unlikely(pmd_special(pmd))) {
> > > > > > + dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > > > > > + src_ptl = pmd_lockptr(src_mm, src_pmd);
> > > > > > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > > > > > + /*
> > > > > > + * No need to recheck the pmd, it can't change with write
> > > > > > + * mmap lock held here.
> > > > > > + */
> > > > > > + if (is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)) {
> > > > > > + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > > > > > + pmd = pmd_wrprotect(pmd);
> > > > > > + }
> > > > > > + goto set_pmd;
> > > > > > + }
> > > > > > +
> > > > >
> > > > > I strongly assume we should be using using vm_normal_page_pmd() instead of
> > > > > pmd_page() further below. pmd_special() should be mostly limited to GUP-fast
> > > > > and vm_normal_page_pmd().
> > > >
> > > > One thing to mention that it has this:
> > > >
> > > > if (!vma_is_anonymous(dst_vma))
> > > > return 0;
> > >
> > > Another obscure thing in this function. It's not the job of copy_huge_pmd()
> > > to make the decision whether to copy, it's the job of vma_needs_copy() in
> > > copy_page_range().
> > >
> > > And now I have to suspect that uffd-wp is broken with this function, because
> > > as vma_needs_copy() clearly states, we must copy, and we don't do that for
> > > PMDs. Ugh.
> > >
> > > What a mess, we should just do what we do for PTEs and we will be fine ;)
> >
> > IIUC it's not a problem: file uffd-wp is different from anonymous, in that
> > it pushes everything down to ptes.
> >
> > It means if we skipped one huge pmd here for file, then it's destined to
> > have nothing to do with uffd-wp, otherwise it should have already been
> > split at the first attempt to wr-protect.
>
> Is that also true for UFFD_FEATURE_WP_ASYNC, when we call
> pagemap_scan_thp_entry()->make_uffd_wp_pmd() ?
>
> I'm not immediately finding the code that does the "pushes everything down
> to ptes", so I might miss that part.
UFFDIO_WRITEPROTECT should have all those covered, but I guess you're
right: it looks like the pagemap ioctl was overlooked..
>
> >
> > >
> > > Also, we call copy_huge_pmd() only if "is_swap_pmd(*src_pmd) ||
> > > pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)"
> > >
> > > Would that even be the case with PFNMAP? I suspect that pmd_trans_huge()
> > > would return "true" for special pfnmap, which is rather "surprising", but
> > > fortunate for us.
> >
> > It's definitely not surprising to me as that's the plan.. and I thought it
> > shouldn't be surprising to you - if you remember before I sent this one, I
> > tried to decouple that here with the "thp agnostic" series:
> >
> > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> >
> > in which you reviewed it (which I appreciated).
> >
> > So yes, pfnmap on pmd so far will report pmd_trans_huge==true.
>
> I review way too much stuff to remember everything :) That certainly screams
> for a cleanup ...
Definitely.
>
> >
> > >
> > > Likely we should be calling copy_huge_pmd() if pmd_leaf() ... cleanup for
> > > another day.
> >
> > Yes, ultimately it should really be a pmd_leaf(), but since I didn't get
> > much feedback there, and that can further postpone this series from being
> > posted I'm afraid, then I decided to just move on with "taking pfnmap as
> > THPs". The corresponding change on this path is here in that series:
> >
> > https://lore.kernel.org/all/20240717220219.3743374-7-peterx@redhat.com/
> >
> > @@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > src_pmd = pmd_offset(src_pud, addr);
> > do {
> > next = pmd_addr_end(addr, end);
> > - if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
> > - || pmd_devmap(*src_pmd)) {
> > + if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
> > int err;
> > VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
> > err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
> >
>
> Ah, good.
>
> [...]
>
> > > Yes, as stated above, likely broken with UFFD-WP ...
> > >
> > > I really think we should make this code just behave like it would with PTEs,
> > > instead of throwing in more "different" handling.
> >
> > So it could simply be because file / anon uffd-wp work very differently.
>
> Or because nobody wants to clean up that code ;)
I think in this case maybe the fork() part is all fine, as long as we can
switch the pagemap ioctl to do proper break-downs when necessary, or even
try to reuse what UFFDIO_WRITEPROTECT does if that's still possible in some
way.
In all cases, definitely sounds like another separate effort.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH 08/19] mm: Always define pxx_pgprot()
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (5 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-14 13:09 ` Jason Gunthorpe
2024-08-09 16:08 ` [PATCH 09/19] mm: New follow_pfnmap API Peter Xu
` (13 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
There're:
- 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
support pte_pgprot().
- 2 archs (x86, sparc) that support pmd_pgprot().
- 1 arch (x86) that support pud_pgprot().
Always define them to be used in generic code, and then we don't need to
fiddle with "#ifdef"s when doing so.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/arm64/include/asm/pgtable.h | 1 +
arch/powerpc/include/asm/pgtable.h | 1 +
arch/s390/include/asm/pgtable.h | 1 +
arch/sparc/include/asm/pgtable_64.h | 1 +
include/linux/pgtable.h | 12 ++++++++++++
5 files changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..b78cc4a6758b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -384,6 +384,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
/*
* Select all bits except the pfn
*/
+#define pte_pgprot pte_pgprot
static inline pgprot_t pte_pgprot(pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 264a6c09517a..2f72ad885332 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -65,6 +65,7 @@ static inline unsigned long pte_pfn(pte_t pte)
/*
* Select all bits except the pfn
*/
+#define pte_pgprot pte_pgprot
static inline pgprot_t pte_pgprot(pte_t pte)
{
unsigned long pte_flags;
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3fa280d0672a..0ffbaf741955 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -955,6 +955,7 @@ static inline int pte_unused(pte_t pte)
* young/old accounting is not supported, i.e _PAGE_PROTECT and _PAGE_INVALID
* must not be set.
*/
+#define pte_pgprot pte_pgprot
static inline pgprot_t pte_pgprot(pte_t pte)
{
unsigned long pte_flags = pte_val(pte) & _PAGE_CHG_MASK;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3fe429d73a65..2b7f358762c1 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -783,6 +783,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
return __pmd(pte_val(pte));
}
+#define pmd_pgprot pmd_pgprot
static inline pgprot_t pmd_pgprot(pmd_t entry)
{
unsigned long val = pmd_val(entry);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 780f3b439d98..e8b2ac6bd2ae 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1956,6 +1956,18 @@ typedef unsigned int pgtbl_mod_mask;
#define MAX_PTRS_PER_P4D PTRS_PER_P4D
#endif
+#ifndef pte_pgprot
+#define pte_pgprot(x) ((pgprot_t) {0})
+#endif
+
+#ifndef pmd_pgprot
+#define pmd_pgprot(x) ((pgprot_t) {0})
+#endif
+
+#ifndef pud_pgprot
+#define pud_pgprot(x) ((pgprot_t) {0})
+#endif
+
/* description of effects of mapping type and prot in current implementation.
* this is due to the limited x86 page protection hardware. The expected
* behavior is in parens:
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* Re: [PATCH 08/19] mm: Always define pxx_pgprot()
2024-08-09 16:08 ` [PATCH 08/19] mm: Always define pxx_pgprot() Peter Xu
@ 2024-08-14 13:09 ` Jason Gunthorpe
2024-08-14 15:43 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:09 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:08:58PM -0400, Peter Xu wrote:
> There're:
>
> - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
> support pte_pgprot().
>
> - 2 archs (x86, sparc) that support pmd_pgprot().
>
> - 1 arch (x86) that support pud_pgprot().
>
> Always define them to be used in generic code, and then we don't need to
> fiddle with "#ifdef"s when doing so.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/arm64/include/asm/pgtable.h | 1 +
> arch/powerpc/include/asm/pgtable.h | 1 +
> arch/s390/include/asm/pgtable.h | 1 +
> arch/sparc/include/asm/pgtable_64.h | 1 +
> include/linux/pgtable.h | 12 ++++++++++++
> 5 files changed, 16 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7a4f5604be3f..b78cc4a6758b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -384,6 +384,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> /*
> * Select all bits except the pfn
> */
> +#define pte_pgprot pte_pgprot
> static inline pgprot_t pte_pgprot(pte_t pte)
> {
> unsigned long pfn = pte_pfn(pte);
Stylistically I've been putting the #defines after the function body,
I wonder if there is a common pattern..
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 08/19] mm: Always define pxx_pgprot()
2024-08-14 13:09 ` Jason Gunthorpe
@ 2024-08-14 15:43 ` Peter Xu
0 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-14 15:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 10:09:15AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:58PM -0400, Peter Xu wrote:
> > There're:
> >
> > - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
> > support pte_pgprot().
> >
> > - 2 archs (x86, sparc) that support pmd_pgprot().
> >
> > - 1 arch (x86) that support pud_pgprot().
> >
> > Always define them to be used in generic code, and then we don't need to
> > fiddle with "#ifdef"s when doing so.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > arch/arm64/include/asm/pgtable.h | 1 +
> > arch/powerpc/include/asm/pgtable.h | 1 +
> > arch/s390/include/asm/pgtable.h | 1 +
> > arch/sparc/include/asm/pgtable_64.h | 1 +
> > include/linux/pgtable.h | 12 ++++++++++++
> > 5 files changed, 16 insertions(+)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 7a4f5604be3f..b78cc4a6758b 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -384,6 +384,7 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> > /*
> > * Select all bits except the pfn
> > */
> > +#define pte_pgprot pte_pgprot
> > static inline pgprot_t pte_pgprot(pte_t pte)
> > {
> > unsigned long pfn = pte_pfn(pte);
>
> Stylistically I've been putting the #defines after the function body,
> I wonder if there is a common pattern..
Right, I see both happening in the tree right now and I don't know which is
better. Personally I prefer "before function", as it makes it easy for
spell checks to match macro/func names, and cscope indexes both the macro
and the func, so a jump to either of them lands me at the function entry.
I'll keep it as-is for now just to make it easy for me.. but please comment
if we do have a preferred pattern the other way round, and then I'll follow.
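For the record, the two placements being discussed look like this
(illustrative only, arch-specific body elided):

	/* (a) define before the function body - what this series does */
	#define pte_pgprot pte_pgprot
	static inline pgprot_t pte_pgprot(pte_t pte)
	{
		/* arch-specific bits */
	}

	/* (b) define after the function body */
	static inline pgprot_t pte_pgprot(pte_t pte)
	{
		/* arch-specific bits */
	}
	#define pte_pgprot pte_pgprot

Either way the #ifndef fallback in linux/pgtable.h only kicks in when the
arch didn't define the macro.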
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH 09/19] mm: New follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (6 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 08/19] mm: Always define pxx_pgprot() Peter Xu
@ 2024-08-09 16:08 ` Peter Xu
2024-08-14 13:19 ` Jason Gunthorpe
2024-08-16 23:12 ` Sean Christopherson
2024-08-09 16:09 ` [PATCH 10/19] KVM: Use " Peter Xu
` (12 subsequent siblings)
20 siblings, 2 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:08 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Introduce a pair of APIs to follow pfn mappings to get entry information.
It's very similar to what follow_pte() does before, but different in that
it recognizes huge pfn mappings.
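For illustration only (not part of the diff; "report_pfn" is a made-up
caller), the pair is expected to be used like below, with the mmap read
lock (or i_mmap_rwsem) held across the whole start/end section:

	static int report_pfn(struct vm_area_struct *vma, unsigned long addr)
	{
		struct follow_pfnmap_args args = { .vma = vma, .address = addr };
		int ret;

		ret = follow_pfnmap_start(&args);
		if (ret)
			return ret;

		/* args.pfn/pgprot/writable/special only valid until ..._end() */
		pr_info("pfn=0x%lx writable=%d special=%d\n",
			args.pfn, args.writable, args.special);

		follow_pfnmap_end(&args);
		return 0;
	}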
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/mm.h | 31 ++++++++++
mm/memory.c | 147 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 178 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 90ca84200800..7471302658af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2374,6 +2374,37 @@ int follow_pte(struct vm_area_struct *vma, unsigned long address,
int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);
+struct follow_pfnmap_args {
+ /**
+ * Inputs:
+ * @vma: Pointer to @vm_area_struct struct
+ * @address: the virtual address to walk
+ */
+ struct vm_area_struct *vma;
+ unsigned long address;
+ /**
+ * Internals:
+ *
+ * The caller shouldn't touch any of these.
+ */
+ spinlock_t *lock;
+ pte_t *ptep;
+ /**
+ * Outputs:
+ *
+ * @pfn: the PFN of the address
+ * @pgprot: the pgprot_t of the mapping
+ * @writable: whether the mapping is writable
+ * @special: whether the mapping is a special mapping (real PFN maps)
+ */
+ unsigned long pfn;
+ pgprot_t pgprot;
+ bool writable;
+ bool special;
+};
+int follow_pfnmap_start(struct follow_pfnmap_args *args);
+void follow_pfnmap_end(struct follow_pfnmap_args *args);
+
extern void truncate_pagecache(struct inode *inode, loff_t new);
extern void truncate_setsize(struct inode *inode, loff_t newsize);
void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
diff --git a/mm/memory.c b/mm/memory.c
index 67496dc5064f..2194e0f9f541 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6338,6 +6338,153 @@ int follow_pte(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(follow_pte);
+static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
+ spinlock_t *lock, pte_t *ptep,
+ pgprot_t pgprot, unsigned long pfn_base,
+ unsigned long addr_mask, bool writable,
+ bool special)
+{
+ args->lock = lock;
+ args->ptep = ptep;
+ args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
+ args->pgprot = pgprot;
+ args->writable = writable;
+ args->special = special;
+}
+
+static inline void pfnmap_lockdep_assert(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_LOCKDEP
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ if (mapping)
+ lockdep_assert(lockdep_is_held(&vma->vm_file->f_mapping->i_mmap_rwsem) ||
+ lockdep_is_held(&vma->vm_mm->mmap_lock));
+ else
+ lockdep_assert(lockdep_is_held(&vma->vm_mm->mmap_lock));
+#endif
+}
+
+/**
+ * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
+ * @args: Pointer to struct @follow_pfnmap_args
+ *
+ * The caller needs to setup args->vma and args->address to point to the
+ * virtual address as the target of such lookup. On a successful return,
+ * the results will be put into other output fields.
+ *
+ * After the caller finished using the fields, the caller must invoke
+ * another follow_pfnmap_end() to proper releases the locks and resources
+ * of such look up request.
+ *
+ * During the start() and end() calls, the results in @args will be valid
+ * as proper locks will be held. After the end() is called, all the fields
+ * in @follow_pfnmap_args will be invalid to be further accessed.
+ *
+ * If the PTE maps a refcounted page, callers are responsible to protect
+ * against invalidation with MMU notifiers; otherwise access to the PFN at
+ * a later point in time can trigger use-after-free.
+ *
+ * Only IO mappings and raw PFN mappings are allowed. The mmap semaphore
+ * should be taken for read, and the mmap semaphore cannot be released
+ * before the end() is invoked.
+ *
+ * This function must not be used to modify PTE content.
+ *
+ * Return: zero on success, -ve otherwise.
+ */
+int follow_pfnmap_start(struct follow_pfnmap_args *args)
+{
+ struct vm_area_struct *vma = args->vma;
+ unsigned long address = args->address;
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *lock;
+ pgd_t *pgdp;
+ p4d_t *p4dp, p4d;
+ pud_t *pudp, pud;
+ pmd_t *pmdp, pmd;
+ pte_t *ptep, pte;
+
+ pfnmap_lockdep_assert(vma);
+
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end))
+ goto out;
+
+ if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+ goto out;
+retry:
+ pgdp = pgd_offset(mm, address);
+ if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp)))
+ goto out;
+
+ p4dp = p4d_offset(pgdp, address);
+ p4d = READ_ONCE(*p4dp);
+ if (p4d_none(p4d) || unlikely(p4d_bad(p4d)))
+ goto out;
+
+ pudp = pud_offset(p4dp, address);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ goto out;
+ if (pud_leaf(pud)) {
+ lock = pud_lock(mm, pudp);
+ if (!unlikely(pud_leaf(pud))) {
+ spin_unlock(lock);
+ goto retry;
+ }
+ pfnmap_args_setup(args, lock, NULL, pud_pgprot(pud),
+ pud_pfn(pud), PUD_MASK, pud_write(pud),
+ pud_special(pud));
+ return 0;
+ }
+
+ pmdp = pmd_offset(pudp, address);
+ pmd = pmdp_get_lockless(pmdp);
+ if (pmd_leaf(pmd)) {
+ lock = pmd_lock(mm, pmdp);
+ if (!unlikely(pmd_leaf(pmd))) {
+ spin_unlock(lock);
+ goto retry;
+ }
+ pfnmap_args_setup(args, lock, NULL, pmd_pgprot(pmd),
+ pmd_pfn(pmd), PMD_MASK, pmd_write(pmd),
+ pmd_special(pmd));
+ return 0;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmdp, address, &lock);
+ if (!ptep)
+ goto out;
+ pte = ptep_get(ptep);
+ if (!pte_present(pte))
+ goto unlock;
+ pfnmap_args_setup(args, lock, ptep, pte_pgprot(pte),
+ pte_pfn(pte), PAGE_MASK, pte_write(pte),
+ pte_special(pte));
+ return 0;
+unlock:
+ pte_unmap_unlock(ptep, lock);
+out:
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(follow_pfnmap_start);
+
+/**
+ * follow_pfnmap_end(): End a follow_pfnmap_start() process
+ * @args: Pointer to struct @follow_pfnmap_args
+ *
+ * Must be used in pair of follow_pfnmap_start(). See the start() function
+ * above for more information.
+ */
+void follow_pfnmap_end(struct follow_pfnmap_args *args)
+{
+ if (args->lock)
+ spin_unlock(args->lock);
+ if (args->ptep)
+ pte_unmap(args->ptep);
+}
+EXPORT_SYMBOL_GPL(follow_pfnmap_end);
+
#ifdef CONFIG_HAVE_IOREMAP_PROT
/**
* generic_access_phys - generic implementation for iomem mmap access
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-09 16:08 ` [PATCH 09/19] mm: New follow_pfnmap API Peter Xu
@ 2024-08-14 13:19 ` Jason Gunthorpe
2024-08-14 18:24 ` Peter Xu
2024-08-16 23:12 ` Sean Christopherson
1 sibling, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:19 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> +/**
> + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> + * @args: Pointer to struct @follow_pfnmap_args
> + *
> + * The caller needs to setup args->vma and args->address to point to the
> + * virtual address as the target of such lookup. On a successful return,
> + * the results will be put into other output fields.
> + *
> + * After the caller finished using the fields, the caller must invoke
> + * another follow_pfnmap_end() to proper releases the locks and resources
> + * of such look up request.
> + *
> + * During the start() and end() calls, the results in @args will be valid
> + * as proper locks will be held. After the end() is called, all the fields
> + * in @follow_pfnmap_args will be invalid to be further accessed.
> + *
> + * If the PTE maps a refcounted page, callers are responsible to protect
> + * against invalidation with MMU notifiers; otherwise access to the PFN at
> + * a later point in time can trigger use-after-free.
> + *
> + * Only IO mappings and raw PFN mappings are allowed.
What does this mean? The paragraph before said this can return a
refcounted page?
> + * The mmap semaphore
> + * should be taken for read, and the mmap semaphore cannot be released
> + * before the end() is invoked.
This function is not safe for IO mappings and PFNs either, VFIO has a
known security issue to call it. That should be emphasised in the
comment.
The caller must be protected by mmu notifiers or other locking that
guarantees the PTE cannot be removed while the caller is using it. In
all cases.
Since this holds the PTL until end(), is it always safe to use the
returned address before calling end()?
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-14 13:19 ` Jason Gunthorpe
@ 2024-08-14 18:24 ` Peter Xu
2024-08-14 22:14 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-14 18:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
>
> > +/**
> > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > + * @args: Pointer to struct @follow_pfnmap_args
> > + *
> > + * The caller needs to setup args->vma and args->address to point to the
> > + * virtual address as the target of such lookup. On a successful return,
> > + * the results will be put into other output fields.
> > + *
> > + * After the caller finished using the fields, the caller must invoke
> > + * another follow_pfnmap_end() to proper releases the locks and resources
> > + * of such look up request.
> > + *
> > + * During the start() and end() calls, the results in @args will be valid
> > + * as proper locks will be held. After the end() is called, all the fields
> > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > + *
> > + * If the PTE maps a refcounted page, callers are responsible to protect
> > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > + * a later point in time can trigger use-after-free.
> > + *
> > + * Only IO mappings and raw PFN mappings are allowed.
>
> What does this mean? The paragraph before said this can return a
> refcounted page?
This came from the old follow_pte(), I kept that as I suppose we should
allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
only the pfnmap matters where huge mappings can start to appear.
>
> > + * The mmap semaphore
> > + * should be taken for read, and the mmap semaphore cannot be released
> > + * before the end() is invoked.
>
> This function is not safe for IO mappings and PFNs either, VFIO has a
> known security issue to call it. That should be emphasised in the
> comment.
Any elaboration on this? I could have missed that..
>
> The caller must be protected by mmu notifiers or other locking that
> guarantees the PTE cannot be removed while the caller is using it. In
> all cases.
>
> Since this holds the PTL until end(), is it always safe to use the
> returned address before calling end()?
I suppose so? As the pgtable is stable, I thought it means it's safe, but
I'm not sure now that you've mentioned there's a known VFIO issue, so I could
have overlooked something. There's no address returned, but pfn, pgprot,
write, etc.
The user needs to do proper mapping if they need an usable address,
e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
change.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-14 18:24 ` Peter Xu
@ 2024-08-14 22:14 ` Jason Gunthorpe
2024-08-15 15:41 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 22:14 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> >
> > > +/**
> > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > + * @args: Pointer to struct @follow_pfnmap_args
> > > + *
> > > + * The caller needs to setup args->vma and args->address to point to the
> > > + * virtual address as the target of such lookup. On a successful return,
> > > + * the results will be put into other output fields.
> > > + *
> > > + * After the caller finished using the fields, the caller must invoke
> > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > + * of such look up request.
> > > + *
> > > + * During the start() and end() calls, the results in @args will be valid
> > > + * as proper locks will be held. After the end() is called, all the fields
> > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > + *
> > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > + * a later point in time can trigger use-after-free.
> > > + *
> > > + * Only IO mappings and raw PFN mappings are allowed.
> >
> > What does this mean? The paragraph before said this can return a
> > refcounted page?
>
> This came from the old follow_pte(), I kept that as I suppose we should
> allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> only the pfnmap matters where huge mappings can start to appear.
If that is the intention it should actively block returning anything
that is vm_normal_page() not check the VM flags, see the other
discussion..
It makes sense as a restriction if you call the API follow pfnmap.
> > > + * The mmap semaphore
> > > + * should be taken for read, and the mmap semaphore cannot be released
> > > + * before the end() is invoked.
> >
> > This function is not safe for IO mappings and PFNs either, VFIO has a
> > known security issue to call it. That should be emphasised in the
> > comment.
>
> Any elaboration on this? I could have missed that..
Just because the memory is a PFN or IO doesn't mean it is safe to
access it without a refcount. There are many driver scenarios where
revoking a PFN from mmap needs to be a hard fence that nothing else
has access to that PFN. Otherwise it is a security problem for that
driver.
> I suppose so? As the pgtable is stable, I thought it means it's safe, but
> I'm not sure now that you've mentioned there's a known VFIO issue, so I could
> have overlooked something. There's no address returned, but pfn, pgprot,
> write, etc.
zap/etc will wait on the PTL, I think, so it should be safe for at
least the issues I am thinking of.
> The user needs to do proper mapping if they need an usable address,
> e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> change.
No, you can't take the phys_addr_t outside the start/end region that
explicitly holds the lock protecting it. This is what the comment must
warn against doing.
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-14 22:14 ` Jason Gunthorpe
@ 2024-08-15 15:41 ` Peter Xu
2024-08-15 16:16 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-15 15:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > >
> > > > +/**
> > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > + *
> > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > + * virtual address as the target of such lookup. On a successful return,
> > > > + * the results will be put into other output fields.
> > > > + *
> > > > + * After the caller finished using the fields, the caller must invoke
> > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > + * of such look up request.
> > > > + *
> > > > + * During the start() and end() calls, the results in @args will be valid
> > > > + * as proper locks will be held. After the end() is called, all the fields
> > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > + *
> > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > + * a later point in time can trigger use-after-free.
> > > > + *
> > > > + * Only IO mappings and raw PFN mappings are allowed.
> > >
> > > What does this mean? The paragraph before said this can return a
> > > refcounted page?
> >
> > This came from the old follow_pte(), I kept that as I suppose we should
> > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > only the pfnmap matters where huge mappings can start to appear.
>
> If that is the intention it should actively block returning anything
> that is vm_normal_page() not check the VM flags, see the other
> discussion..
The restriction should only be applied to the vma attributes, not a
specific pte mapping, IMHO.
I mean, the comment was describing "which VMA is allowed to use this
function", reflecting that we'll fail at anything !PFNMAP && !IO.
It seems legal to have private mappings of them, where vm_normal_page() can
return true here for some of the mappings under PFNMAP|IO. IIUC neither the
old follow_pte() nor the new follow_pfnmap*() API has cared much about this
part so far.
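A userspace illustration of that case, assuming a driver that uses
remap_pfn_range() for mmap() and permits MAP_PRIVATE (/dev/mem below is
only an example, error handling omitted):

	#include <fcntl.h>
	#include <sys/mman.h>

	int main(void)
	{
		int fd = open("/dev/mem", O_RDWR);
		/* MAP_PRIVATE of a pfnmap: VM_PFNMAP vma, is_cow_mapping() true */
		volatile char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
					MAP_PRIVATE, fd, 0);
		/*
		 * The write fault CoWs: this pte now maps a normal anonymous
		 * page inside the VM_PFNMAP vma, which vm_normal_page() will
		 * happily return.
		 */
		p[0] = 1;
		return 0;
	}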
>
> It makes sense as a restriction if you call the API follow pfnmap.
I'm open to any better suggestions on the name. Again, I think here it's more
about the vma attribute, not "every mapping under the memory range".
>
> > > > + * The mmap semaphore
> > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > + * before the end() is invoked.
> > >
> > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > known security issue to call it. That should be emphasised in the
> > > comment.
> >
> > Any elaboration on this? I could have missed that..
>
> Just because the memory is a PFN or IO doesn't mean it is safe to
> access it without a refcount. There are many driver scenarios where
> revoking a PFN from mmap needs to be a hard fence that nothing else
> has access to that PFN. Otherwise it is a security problem for that
> driver.
Oh ok, I suppose you meant the whole VFIO thing of "zapping mappings when
MMIO is disabled"? If so I get it. More below.
>
> > I suppose so? As the pgtable is stable, I thought it means it's safe, but
> > I'm not sure now that you've mentioned there's a known VFIO issue, so I could
> > have overlooked something. There's no address returned, but pfn, pgprot,
> > write, etc.
>
> zap/etc will wait on the PTL, I think, so it should be safe for at
> least the issues I am thinking of.
>
> > The user needs to do proper mapping if they need an usable address,
> > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > change.
>
> No, you can't take the phys_addr_t outside the start/end region that
> explicitly holds the lock protecting it. This is what the comment must
> warn against doing.
I think the comment has that part covered more or less:
* During the start() and end() calls, the results in @args will be valid
* as proper locks will be held. After the end() is called, all the fields
* in @follow_pfnmap_args will be invalid to be further accessed.
Feel free to suggest anything that will make it better.
For generic_access_phys() as a specific example: I think it is safe to map
the pfn even after end(). I meant here the "map" operation is benign with
ioremap_prot(), afaiu: it doesn't include an access on top of the mapping
yet.
After the map, it rewalks the pgtable, making sure PFN is still there and
valid, and it'll only access it this time before end():
if (write)
memcpy_toio(maddr + offset, buf, len);
else
memcpy_fromio(buf, maddr + offset, len);
ret = len;
follow_pfnmap_end(&args);
If PFN changed, it properly releases the mapping:
if ((prot != pgprot_val(args.pgprot)) ||
(phys_addr != (args.pfn << PAGE_SHIFT)) ||
(writable != args.writable)) {
follow_pfnmap_end(&args);
iounmap(maddr);
goto retry;
}
Then taking the example of VFIO: there's no risk of racing with a
concurrent zapping as far as I can see, because otherwise it'll see pfn
changed.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-15 15:41 ` Peter Xu
@ 2024-08-15 16:16 ` Jason Gunthorpe
2024-08-15 17:21 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-15 16:16 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Thu, Aug 15, 2024 at 11:41:39AM -0400, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > >
> > > > > +/**
> > > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > > + *
> > > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > > + * virtual address as the target of such lookup. On a successful return,
> > > > > + * the results will be put into other output fields.
> > > > > + *
> > > > > + * After the caller finished using the fields, the caller must invoke
> > > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > > + * of such look up request.
> > > > > + *
> > > > > + * During the start() and end() calls, the results in @args will be valid
> > > > > + * as proper locks will be held. After the end() is called, all the fields
> > > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > > + *
> > > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > > + * a later point in time can trigger use-after-free.
> > > > > + *
> > > > > + * Only IO mappings and raw PFN mappings are allowed.
> > > >
> > > > What does this mean? The paragraph before said this can return a
> > > > refcounted page?
> > >
> > > This came from the old follow_pte(), I kept that as I suppose we should
> > > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > > only the pfnmap matters where huge mappings can start to appear.
> >
> > If that is the intention it should actively block returning anything
> > that is vm_normal_page() not check the VM flags, see the other
> > discussion..
>
> The restriction should only be applied to the vma attributes, not a
> specific pte mapping, IMHO.
>
> I mean, the comment was describing "which VMA is allowed to use this
> function", reflecting that we'll fail at anything !PFNMAP && !IO.
>
> It seems legal to have private mappings of them, where vm_normal_page() can
> return true here for some of the mappings under PFNMAP|IO. IIUC either the
> old follow_pte() or follow_pfnmap*() API cared much on this part yet so
> far.
Why? Either the function only returns PFN map no-struct page things or
it returns struct page stuff too, in which case why bother to check
the VMA flags if the caller already has to be correct for struct page
backed results?
This function is only safe to use under the proper locking, and under
those rules it doesn't matter at all what the result is..
> > > > > + * The mmap semaphore
> > > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > > + * before the end() is invoked.
> > > >
> > > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > > known security issue to call it. That should be emphasised in the
> > > > comment.
> > >
> > > Any elaboration on this? I could have missed that..
> >
> > Just because the memory is a PFN or IO doesn't mean it is safe to
> > access it without a refcount. There are many driver scenarios where
> > revoking a PFN from mmap needs to be a hard fence that nothing else
> > has access to that PFN. Otherwise it is a security problem for that
> > driver.
>
> > Oh ok, I suppose you meant the whole VFIO thing of "zapping mappings when
> > MMIO is disabled"? If so I get it. More below.
And more..
> > > The user needs to do proper mapping if they need an usable address,
> > > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > > change.
> >
> > No, you can't take the phys_addr_t outside the start/end region that
> > explicitly holds the lock protecting it. This is what the comment must
> > warn against doing.
>
> I think the comment has that part covered more or less:
>
> * During the start() and end() calls, the results in @args will be valid
> * as proper locks will be held. After the end() is called, all the fields
> * in @follow_pfnmap_args will be invalid to be further accessed.
>
> Feel free to suggest anything that will make it better.
Be much more specific and scary:
Any physical address obtained through this API is only valid while
the @follow_pfnmap_args. Continuing to use the address after end(),
without some other means to synchronize with page table updates
will create a security bug.
> For generic_access_phys() as a specific example: I think it is safe to map
> the pfn even after end().
The map could be safe, but also the memory could be hot unplugged as a
race. I don't know either way if all arch code is safe for that.
> After the map, it rewalks the pgtable, making sure PFN is still there and
> valid, and it'll only access it this time before end():
>
> if (write)
> memcpy_toio(maddr + offset, buf, len);
> else
> memcpy_fromio(buf, maddr + offset, len);
> ret = len;
> follow_pfnmap_end(&args);
Yes
> If PFN changed, it properly releases the mapping:
>
> if ((prot != pgprot_val(args.pgprot)) ||
> (phys_addr != (args.pfn << PAGE_SHIFT)) ||
> (writable != args.writable)) {
> follow_pfnmap_end(&args);
> iounmap(maddr);
> goto retry;
> }
>
> Then taking the example of VFIO: there's no risk of racing with a
> concurrent zapping as far as I can see, because otherwise it'll see pfn
> changed.
VFIO dumps the physical address into the IOMMU and ignores
zap. Concurrent zap results in a UAF through the IOMMU mapping.
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-15 16:16 ` Jason Gunthorpe
@ 2024-08-15 17:21 ` Peter Xu
2024-08-15 17:24 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-15 17:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Thu, Aug 15, 2024 at 01:16:03PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 11:41:39AM -0400, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > > > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > > >
> > > > > > +/**
> > > > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > > > + *
> > > > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > > > + * virtual address as the target of such lookup. On a successful return,
> > > > > > + * the results will be put into other output fields.
> > > > > > + *
> > > > > > + * After the caller finished using the fields, the caller must invoke
> > > > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > > > + * of such look up request.
> > > > > > + *
> > > > > > + * During the start() and end() calls, the results in @args will be valid
> > > > > > + * as proper locks will be held. After the end() is called, all the fields
> > > > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > > > + *
> > > > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > > > + * a later point in time can trigger use-after-free.
> > > > > > + *
> > > > > > + * Only IO mappings and raw PFN mappings are allowed.
> > > > >
> > > > > What does this mean? The paragraph before said this can return a
> > > > > refcounted page?
> > > >
> > > > This came from the old follow_pte(), I kept that as I suppose we should
> > > > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > > > only the pfnmap matters where huge mappings can start to appear.
> > >
> > > If that is the intention it should actively block returning anything
> > > that is vm_normal_page() not check the VM flags, see the other
> > > discussion..
> >
> > The restriction should only be applied to the vma attributes, not a
> > specific pte mapping, IMHO.
> >
> > I mean, the comment was describing "which VMA is allowed to use this
> > function", reflecting that we'll fail at anything !PFNMAP && !IO.
> >
> > It seems legal to have private mappings of them, where vm_normal_page() can
> > return true here for some of the mappings under PFNMAP|IO. IIUC neither the
> > old follow_pte() nor the new follow_pfnmap*() API has cared much about this
> > part so far.
>
> Why? Either the function only returns PFN map no-struct page things or
> it returns struct page stuff too, in which case why bother to check
> the VMA flags if the caller already has to be correct for struct page
> backed results?
>
> This function is only safe to use under the proper locking, and under
> those rules it doesn't matter at all what the result is..
Do you mean we should drop the PFNMAP|IO check? I didn't check all the
callers, so I can't say that they won't rely on !PFNMAP&&!IO vmas failing
properly. So I assume we should definitely keep the checks around.
Or I could have totally missed what you're suggesting here..
> > > > > > + * The mmap semaphore
> > > > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > > > + * before the end() is invoked.
> > > > >
> > > > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > > > known security issue to call it. That should be emphasised in the
> > > > > comment.
> > > >
> > > > Any elaboration on this? I could have missed that..
> > >
> > > Just because the memory is a PFN or IO doesn't mean it is safe to
> > > access it without a refcount. There are many driver scenarios where
> > > revoking a PFN from mmap needs to be a hard fence that nothing else
> > > has access to that PFN. Otherwise it is a security problem for that
> > > driver.
> >
> > Oh ok, I suppose you meant the VFIO whole thing on "zapping mapping when
> > MMIO disabled"? If so I get it. More below.
>
> And more..
>
> > > > The user needs to do proper mapping if they need an usable address,
> > > > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > > > change.
> > >
> > > No, you can't take the phys_addr_t outside the start/end region that
> > > explicitly holds the lock protecting it. This is what the comment must
> > > warn against doing.
> >
> > I think the comment has that part covered more or less:
> >
> > * During the start() and end() calls, the results in @args will be valid
> > * as proper locks will be held. After the end() is called, all the fields
> > * in @follow_pfnmap_args will be invalid to be further accessed.
> >
> > Feel free to suggest anything that will make it better.
>
> Be much more specific and scary:
>
> Any physical address obtained through this API is only valid while
> the @follow_pfnmap_args. Continuing to use the address after end(),
> without some other means to synchronize with page table updates
> will create a security bug.
Some misuse of wording here (e.g. we don't return a PA but a PFN), and one
sentence doesn't seem to be complete.. but I think I get the "scary" part
of it. How about this, appending the scary part to the end?
* During the start() and end() calls, the results in @args will be valid
* as proper locks will be held. After the end() is called, all the fields
* in @follow_pfnmap_args will be invalid to be further accessed. Further
* use of such information after end() may require proper synchronizations
* by the caller with page table updates, otherwise it can create a
* security bug.
>
> > For generic_access_phys() as a specific example: I think it is safe to map
> > the pfn even after end().
>
> The map could be safe, but also the memory could be hot unplugged as a
> race. I don't know either way if all arch code is safe for that.
I hope it's ok, or we have had a similar problem with follow_pte() for all
these years.. in any case, this sounds like another thing to be checked
outside the scope of this patch..
>
> > After the map, it rewalks the pgtable, making sure PFN is still there and
> > valid, and it'll only access it this time before end():
> >
> > if (write)
> > memcpy_toio(maddr + offset, buf, len);
> > else
> > memcpy_fromio(buf, maddr + offset, len);
> > ret = len;
> > follow_pfnmap_end(&args);
>
> Yes
>
> > If PFN changed, it properly releases the mapping:
> >
> > if ((prot != pgprot_val(args.pgprot)) ||
> > (phys_addr != (args.pfn << PAGE_SHIFT)) ||
> > (writable != args.writable)) {
> > follow_pfnmap_end(&args);
> > iounmap(maddr);
> > goto retry;
> > }
> >
> > Then taking the example of VFIO: there's no risk of racing with a
> > concurrent zapping as far as I can see, because otherwise it'll see pfn
> > changed.
>
> VFIO dumps the physical address into the IOMMU and ignores
> zap. Concurrent zap results in a UAF through the IOMMU mapping.
Ohhh, so this is what I'm missing..
It worked for generic mem only because VFIO pins those pages upfront, so any
form of zapping won't throw things away, but we can't pin pfnmaps so far.
It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
as long as there's MMIO-region / P2P involved. It'll make sure when
tearing down the BAR mappings, the devices will at least see the same view
as the processors.
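Just to illustrate the idea (this is not part of the series; the tracker
struct and its hva/iova bookkeeping below are hypothetical, while the
mmu_notifier and iommu_unmap() calls are existing kernel APIs), the naive
form could look roughly like:

#include <linux/mmu_notifier.h>
#include <linux/iommu.h>
#include <linux/container_of.h>

/* Hypothetical bookkeeping for one pfnmap-backed DMA mapping. */
struct pfnmap_dma_tracker {
        struct mmu_notifier     mn;
        struct iommu_domain     *domain;
        unsigned long           hva;    /* CPU virtual base of the pfnmap */
        dma_addr_t              iova;   /* where it's mapped in the IOMMU */
        size_t                  size;
};

static int pfnmap_dma_invalidate(struct mmu_notifier *mn,
                                 const struct mmu_notifier_range *range)
{
        struct pfnmap_dma_tracker *t =
                container_of(mn, struct pfnmap_dma_tracker, mn);

        /* Ignore invalidations that don't touch the tracked pfnmap. */
        if (range->end <= t->hva || range->start >= t->hva + t->size)
                return 0;

        /*
         * Tear down the device mapping too, so the device sees the same
         * view as the processors once the BAR mapping is zapped.  A later
         * re-map (after re-walking the pgtable) would re-establish it.
         * (Blockable-range handling omitted for brevity.)
         */
        iommu_unmap(t->domain, t->iova, t->size);
        return 0;
}

static const struct mmu_notifier_ops pfnmap_dma_mn_ops = {
        .invalidate_range_start = pfnmap_dma_invalidate,
};

/* Registered against the owning mm, e.g. when the DMA mapping is set up. */
static int pfnmap_dma_track(struct pfnmap_dma_tracker *t, struct mm_struct *mm)
{
        t->mn.ops = &pfnmap_dma_mn_ops;
        return mmu_notifier_register(&t->mn, mm);
}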
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-15 17:21 ` Peter Xu
@ 2024-08-15 17:24 ` Jason Gunthorpe
2024-08-15 18:52 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-15 17:24 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Thu, Aug 15, 2024 at 01:21:01PM -0400, Peter Xu wrote:
> > Why? Either the function only returns PFN map no-struct page things or
> > it returns struct page stuff too, in which case why bother to check
> > the VMA flags if the caller already has to be correct for struct page
> > backed results?
> >
> > This function is only safe to use under the proper locking, and under
> > those rules it doesn't matter at all what the result is..
>
> Do you mean we should drop the PFNMAP|IO check?
Yeah
> I haven't checked all the
> callers to confirm that none of them rely on !PFNMAP && !IO vmas failing
> properly, so I assume we should definitely keep the check around.
But as before, if we care about this we should be using vm_normal_page
as that is sort of abusing the PFNMAP flags.
> > Any physical address obtained through this API is only valid while
> > the @follow_pfnmap_args. Continuing to use the address after end(),
> > without some other means to synchronize with page table updates
> > will create a security bug.
>
> Some misuse of wording here (e.g. we don't return a PA but a PFN), and one
> sentence doesn't seem to be complete.. but I think I get the "scary" part
> of it. How about this, appending the scary part to the end?
>
> * During the start() and end() calls, the results in @args will be valid
> * as proper locks will be held. After the end() is called, all the fields
> * in @follow_pfnmap_args will be invalid to be further accessed. Further
> * use of such information after end() may require proper synchronizations
> * by the caller with page table updates, otherwise it can create a
> * security bug.
I would specifically emphasize that the pfn may not be used after
end. That is the primary mistake people have made.
They think it is a PFN so it is safe.
> It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
> as long as there's MMIO-region / P2P involved. It'll make sure when
> tearing down the BAR mappings, the devices will at least see the same view
> as the processors.
I think the mmu notifiers can trigger too often for this to be
practical for DMA :(
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-15 17:24 ` Jason Gunthorpe
@ 2024-08-15 18:52 ` Peter Xu
0 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-15 18:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Thu, Aug 15, 2024 at 02:24:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 01:21:01PM -0400, Peter Xu wrote:
> > > Why? Either the function only returns PFN map no-struct page things or
> > > it returns struct page stuff too, in which case why bother to check
> > > the VMA flags if the caller already has to be correct for struct page
> > > backed results?
> > >
> > > This function is only safe to use under the proper locking, and under
> > > those rules it doesn't matter at all what the result is..
> >
> > Do you mean we should drop the PFNMAP|IO check?
>
> Yeah
>
> > I haven't checked all the
> > callers to confirm that none of them rely on !PFNMAP && !IO vmas failing
> > properly, so I assume we should definitely keep the check around.
>
> But as before, if we care about this we should be using vm_normal_page
> as that is sort of abusing the PFNMAP flags.
I can't say it's abusing.. Taking access_remote_vm() as an example again, it
goes back as far as 2008 with Rik's commit here:
commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
Author: Rik van Riel <riel@redhat.com>
Date: Wed Jul 23 21:27:05 2008 -0700
access_process_vm device memory infrastructure
So it started with GUP failing on pfnmaps for remote vm access, which is
what we still do right now with check_vma_flags(); this whole walker is then
a remedy for that.
It isn't used at all for normal VMAs, unless it's a private pfnmap mapping
which should be extremely rare, or if it's IO+!PFNMAP, which is a world I
am not familiar with..
In any case, I hope we can still leave this alone in the huge pfnmap
effort, as the two are not yet closely related. From that POV, this patch
is as simple as "teach follow_pte() to know huge mappings"; it's just that
we can't simply build on top of the old interface, since it won't work
while sticking with pte_t. Most of the rest was inherited from follow_pte();
there are still some trivial changes elsewhere, but on the vma flag check
here we stick with the same behavior as the old code.
>
> > > Any physical address obtained through this API is only valid while
> > > the @follow_pfnmap_args. Continuing to use the address after end(),
> > > without some other means to synchronize with page table updates
> > > will create a security bug.
> >
> > Some misuse of wording here (e.g. we don't return a PA but a PFN), and one
> > sentence doesn't seem to be complete.. but I think I get the "scary" part
> > of it. How about this, appending the scary part to the end?
> >
> > * During the start() and end() calls, the results in @args will be valid
> > * as proper locks will be held. After the end() is called, all the fields
> > * in @follow_pfnmap_args will be invalid to be further accessed. Further
> > * use of such information after end() may require proper synchronizations
> > * by the caller with page table updates, otherwise it can create a
> > * security bug.
>
> I would specifically emphasize that the pfn may not be used after
> end. That is the primary mistake people have made.
>
> They think it is a PFN so it is safe.
I understand your concern. It's just that it still seems legal to me to use
it as long as proper action is taken.
I hope "require proper synchronizations" would be the best way to phrase
this matter, but maybe you have an even better suggestion on how to put it;
I'm definitely open to that too.
>
> > It sounds like we need some mmu notifiers when mapping the IOMMU pgtables,
> > as long as there's MMIO-region / P2P involved. It'll make sure when
> > tearing down the BAR mappings, the devices will at least see the same view
> > as the processors.
>
> I think the mmu notifiers can trigger too often for this to be
> practical for DMA :(
I guess the DMAs are fine as normally the notifier will be a no-op, as long
as the BAR enable/disable happens rarely. But yeah, I see your point, and
it is probably a concern if those notifiers need to be kicked off to walk a
bunch of MMIO regions, even if in 99% of the cases they'll do nothing.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-09 16:08 ` [PATCH 09/19] mm: New follow_pfnmap API Peter Xu
2024-08-14 13:19 ` Jason Gunthorpe
@ 2024-08-16 23:12 ` Sean Christopherson
2024-08-17 11:05 ` David Hildenbrand
2024-08-21 19:10 ` Peter Xu
1 sibling, 2 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-16 23:12 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024, Peter Xu wrote:
> Introduce a pair of APIs to follow pfn mappings to get entry information.
> It's very similar to what follow_pte() does before, but different in that
> it recognizes huge pfn mappings.
...
> +int follow_pfnmap_start(struct follow_pfnmap_args *args);
> +void follow_pfnmap_end(struct follow_pfnmap_args *args);
I find the start+end() terminology to be unintuitive. E.g. I had to look at the
implementation to understand why KVM invokes fixup_user_fault() if follow_pfnmap_start()
failed.
What about follow_pfnmap_and_lock()? And then maybe follow_pfnmap_unlock()?
Though that second one reads a little weird.
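For reference, the caller shape in question looks roughly like below,
condensed from the KVM conversion in patch 10 (the wrapper name is made up
here and the error handling is simplified; illustrative only):

#include <linux/mm.h>

static int pfnmap_lookup_example(struct vm_area_struct *vma, unsigned long addr,
                                 bool write, unsigned long *pfn)
{
        struct follow_pfnmap_args args = { .vma = vma, .address = addr };
        bool unlocked = false;

        if (follow_pfnmap_start(&args)) {
                /* Nothing mapped yet: fault the page in, then walk again. */
                if (fixup_user_fault(vma->vm_mm, addr,
                                     write ? FAULT_FLAG_WRITE : 0, &unlocked))
                        return -EFAULT;
                if (follow_pfnmap_start(&args))
                        return -EFAULT;
        }

        if (write && !args.writable) {
                follow_pfnmap_end(&args);
                return -EFAULT;
        }

        /* args.pfn is only stable while the lock taken by start() is held. */
        *pfn = args.pfn;
        follow_pfnmap_end(&args);
        return 0;
}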
> + * Return: zero on success, -ve otherwise.
ve?
> +int follow_pfnmap_start(struct follow_pfnmap_args *args)
> +{
> + struct vm_area_struct *vma = args->vma;
> + unsigned long address = args->address;
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *lock;
> + pgd_t *pgdp;
> + p4d_t *p4dp, p4d;
> + pud_t *pudp, pud;
> + pmd_t *pmdp, pmd;
> + pte_t *ptep, pte;
> +
> + pfnmap_lockdep_assert(vma);
> +
> + if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> + goto out;
> +
> + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> + goto out;
Why use goto instead of simply?
return -EINVAL;
That's relevant because I think the cases where no PxE is found should return
-ENOENT, not -EINVAL. E.g. if the caller doesn't precheck, then it can bail
immediately on EINVAL, but know that it's worth trying to fault-in the pfn on
ENOENT.
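Something along these lines, i.e. (a sketch of the suggested split only;
not what the series currently does):

        if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
-               goto out;
+               return -EINVAL;         /* wrong kind of VMA, no point retrying */
...
        pgdp = pgd_offset(mm, address);
        if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp)))
-               goto out;
+               return -ENOENT;         /* nothing mapped (yet), caller may fault it in */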
> +retry:
> + pgdp = pgd_offset(mm, address);
> + if (pgd_none(*pgdp) || unlikely(pgd_bad(*pgdp)))
> + goto out;
> +
> + p4dp = p4d_offset(pgdp, address);
> + p4d = READ_ONCE(*p4dp);
> + if (p4d_none(p4d) || unlikely(p4d_bad(p4d)))
> + goto out;
> +
> + pudp = pud_offset(p4dp, address);
> + pud = READ_ONCE(*pudp);
> + if (pud_none(pud))
> + goto out;
> + if (pud_leaf(pud)) {
> + lock = pud_lock(mm, pudp);
> + if (!unlikely(pud_leaf(pud))) {
> + spin_unlock(lock);
> + goto retry;
> + }
> + pfnmap_args_setup(args, lock, NULL, pud_pgprot(pud),
> + pud_pfn(pud), PUD_MASK, pud_write(pud),
> + pud_special(pud));
> + return 0;
> + }
> +
> + pmdp = pmd_offset(pudp, address);
> + pmd = pmdp_get_lockless(pmdp);
> + if (pmd_leaf(pmd)) {
> + lock = pmd_lock(mm, pmdp);
> + if (!unlikely(pmd_leaf(pmd))) {
> + spin_unlock(lock);
> + goto retry;
> + }
> + pfnmap_args_setup(args, lock, NULL, pmd_pgprot(pmd),
> + pmd_pfn(pmd), PMD_MASK, pmd_write(pmd),
> + pmd_special(pmd));
> + return 0;
> + }
> +
> + ptep = pte_offset_map_lock(mm, pmdp, address, &lock);
> + if (!ptep)
> + goto out;
> + pte = ptep_get(ptep);
> + if (!pte_present(pte))
> + goto unlock;
> + pfnmap_args_setup(args, lock, ptep, pte_pgprot(pte),
> + pte_pfn(pte), PAGE_MASK, pte_write(pte),
> + pte_special(pte));
> + return 0;
> +unlock:
> + pte_unmap_unlock(ptep, lock);
> +out:
> + return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(follow_pfnmap_start);
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-16 23:12 ` Sean Christopherson
@ 2024-08-17 11:05 ` David Hildenbrand
2024-08-21 19:10 ` Peter Xu
1 sibling, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2024-08-17 11:05 UTC (permalink / raw)
To: Sean Christopherson, Peter Xu
Cc: linux-mm, linux-kernel, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 17.08.24 01:12, Sean Christopherson wrote:
> On Fri, Aug 09, 2024, Peter Xu wrote:
>> Introduce a pair of APIs to follow pfn mappings to get entry information.
>> It's very similar to what follow_pte() does before, but different in that
>> it recognizes huge pfn mappings.
>
> ...
>
>> +int follow_pfnmap_start(struct follow_pfnmap_args *args);
>> +void follow_pfnmap_end(struct follow_pfnmap_args *args);
>
> I find the start+end() terminology to be unintuitive. E.g. I had to look at the
> implementation to understand why KVM invokes fixup_user_fault() if follow_pfnmap_start()
> failed.
It roughly matches folio_walk_start() / folio_walk_end(), which I
recently introduced.
Maybe we should call it pfnmap_walk_start() / pfnmap_walk_end() here, to
remove the old "follow" semantics for good.
>
> What about follow_pfnmap_and_lock()? And then maybe follow_pfnmap_unlock()?
> Though that second one reads a little weird.
Yes, I prefer start/end (lock/unlock reads like an implementation
detail). But whatever we do, let's try doing something that is
consistent with existing stuff.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 09/19] mm: New follow_pfnmap API
2024-08-16 23:12 ` Sean Christopherson
2024-08-17 11:05 ` David Hildenbrand
@ 2024-08-21 19:10 ` Peter Xu
1 sibling, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-21 19:10 UTC (permalink / raw)
To: Sean Christopherson
Cc: linux-mm, linux-kernel, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 04:12:24PM -0700, Sean Christopherson wrote:
> On Fri, Aug 09, 2024, Peter Xu wrote:
> > Introduce a pair of APIs to follow pfn mappings to get entry information.
> > It's very similar to what follow_pte() does before, but different in that
> > it recognizes huge pfn mappings.
>
> ...
>
> > +int follow_pfnmap_start(struct follow_pfnmap_args *args);
> > +void follow_pfnmap_end(struct follow_pfnmap_args *args);
>
> I find the start+end() terminology to be unintuitive. E.g. I had to look at the
> implementation to understand why KVM invokes fixup_user_fault() if follow_pfnmap_start()
> failed.
>
> What about follow_pfnmap_and_lock()? And then maybe follow_pfnmap_unlock()?
> Though that second one reads a little weird.
If we go with _lock(), I'd tend to drop the "and", i.e. follow_pfnmap_[un]lock().
However, it looks like David preferred keeping the current name, so we haven't
reached a quorum yet. I'm happy to change the name as long as we have enough votes..
>
> > + * Return: zero on success, -ve otherwise.
>
> ve?
This one came from the old follow_pte() and I kept it. I only learned what it
means after searching: it's a short way to write "negative" (while positive is
"+ve"). It doesn't look like something productive.. I'll spell it out in the
next version.
>
> > +int follow_pfnmap_start(struct follow_pfnmap_args *args)
> > +{
> > + struct vm_area_struct *vma = args->vma;
> > + unsigned long address = args->address;
> > + struct mm_struct *mm = vma->vm_mm;
> > + spinlock_t *lock;
> > + pgd_t *pgdp;
> > + p4d_t *p4dp, p4d;
> > + pud_t *pudp, pud;
> > + pmd_t *pmdp, pmd;
> > + pte_t *ptep, pte;
> > +
> > + pfnmap_lockdep_assert(vma);
> > +
> > + if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > + goto out;
> > +
> > + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> > + goto out;
>
> Why use goto instead of simply?
>
> return -EINVAL;
>
> That's relevant because I think the cases where no PxE is found should return
> -ENOENT, not -EINVAL. E.g. if the caller doesn't precheck, then it can bail
> immediately on EINVAL, but know that it's worth trying to fault-in the pfn on
> ENOENT.
I tend to avoid changing the retval in this series to make the goal of this
patchset simple.
One issue is I _think_ there's one ioctl() that will rely on this retval:
acrn_dev_ioctl ->
acrn_vm_memseg_map ->
acrn_vm_ram_map ->
follow_pfnmap_start
So we may want to check with people first to make sure we don't break it..
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (7 preceding siblings ...)
2024-08-09 16:08 ` [PATCH 09/19] mm: New follow_pfnmap API Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 17:23 ` Axel Rasmussen
2024-08-09 16:09 ` [PATCH 11/19] s390/pci_mmio: " Peter Xu
` (11 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of the
work is done perfectly on the other side (host_pfn_mapping_level()).
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
virt/kvm/kvm_main.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9fb1c527a8e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2862,13 +2862,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool write_fault,
bool *writable, kvm_pfn_t *p_pfn)
{
+ struct follow_pfnmap_args args = { .vma = vma, .address = addr };
kvm_pfn_t pfn;
- pte_t *ptep;
- pte_t pte;
- spinlock_t *ptl;
int r;
- r = follow_pte(vma, addr, &ptep, &ptl);
+ r = follow_pfnmap_start(&args);
if (r) {
/*
* get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
@@ -2883,21 +2881,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
if (r)
return r;
- r = follow_pte(vma, addr, &ptep, &ptl);
+ r = follow_pfnmap_start(&args);
if (r)
return r;
}
- pte = ptep_get(ptep);
-
- if (write_fault && !pte_write(pte)) {
+ if (write_fault && !args.writable) {
pfn = KVM_PFN_ERR_RO_FAULT;
goto out;
}
if (writable)
- *writable = pte_write(pte);
- pfn = pte_pfn(pte);
+ *writable = args.writable;
+ pfn = args.pfn;
/*
* Get a reference here because callers of *hva_to_pfn* and
@@ -2918,9 +2914,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
*/
if (!kvm_try_get_pfn(pfn))
r = -EFAULT;
-
out:
- pte_unmap_unlock(ptep, ptl);
+ follow_pfnmap_end(&args);
*p_pfn = pfn;
return r;
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-09 16:09 ` [PATCH 10/19] KVM: Use " Peter Xu
@ 2024-08-09 17:23 ` Axel Rasmussen
2024-08-12 18:58 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Axel Rasmussen @ 2024-08-09 17:23 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of the
> work is done perfectly on the other side (host_pfn_mapping_level()).
I don't think it has to be done in this series, but a future
optimization to consider is having follow_pfnmap just tell the caller
about the mapping level directly. It already found this information as
part of its walk. I think there's a possibility to simplify KVM /
avoid it having to do its own walk again later.
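For instance (purely illustrative; the addr_mask field below is
hypothetical and not in this series), the walk could export the mask it
already passes to pfnmap_args_setup(), and KVM could translate that into a
mapping level instead of re-walking:

/*
 * Hypothetical: suppose pfnmap_args_setup() also stored its mask argument
 * (PUD_MASK / PMD_MASK / PAGE_MASK) into args->addr_mask.  PG_LEVEL_* are
 * the x86 values that KVM's host_pfn_mapping_level() already returns.
 */
static int pfnmap_mapping_level(const struct follow_pfnmap_args *args)
{
        if (args->addr_mask == PUD_MASK)
                return PG_LEVEL_1G;
        if (args->addr_mask == PMD_MASK)
                return PG_LEVEL_2M;
        return PG_LEVEL_4K;
}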
>
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> virt/kvm/kvm_main.c | 19 +++++++------------
> 1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..9fb1c527a8e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2862,13 +2862,11 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> unsigned long addr, bool write_fault,
> bool *writable, kvm_pfn_t *p_pfn)
> {
> + struct follow_pfnmap_args args = { .vma = vma, .address = addr };
> kvm_pfn_t pfn;
> - pte_t *ptep;
> - pte_t pte;
> - spinlock_t *ptl;
> int r;
>
> - r = follow_pte(vma, addr, &ptep, &ptl);
> + r = follow_pfnmap_start(&args);
> if (r) {
> /*
> * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> @@ -2883,21 +2881,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> if (r)
> return r;
>
> - r = follow_pte(vma, addr, &ptep, &ptl);
> + r = follow_pfnmap_start(&args);
> if (r)
> return r;
> }
>
> - pte = ptep_get(ptep);
> -
> - if (write_fault && !pte_write(pte)) {
> + if (write_fault && !args.writable) {
> pfn = KVM_PFN_ERR_RO_FAULT;
> goto out;
> }
>
> if (writable)
> - *writable = pte_write(pte);
> - pfn = pte_pfn(pte);
> + *writable = args.writable;
> + pfn = args.pfn;
>
> /*
> * Get a reference here because callers of *hva_to_pfn* and
> @@ -2918,9 +2914,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> */
> if (!kvm_try_get_pfn(pfn))
> r = -EFAULT;
> -
> out:
> - pte_unmap_unlock(ptep, ptl);
> + follow_pfnmap_end(&args);
> *p_pfn = pfn;
>
> return r;
> --
> 2.45.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-09 17:23 ` Axel Rasmussen
@ 2024-08-12 18:58 ` Peter Xu
2024-08-12 22:47 ` Axel Rasmussen
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-12 18:58 UTC (permalink / raw)
To: Axel Rasmussen
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of the
> > work is done perfectly on the other side (host_pfn_mapping_level()).
>
> I don't think it has to be done in this series, but a future
> optimization to consider is having follow_pfnmap just tell the caller
> about the mapping level directly. It already found this information as
> part of its walk. I think there's a possibility to simplify KVM /
> avoid it having to do its own walk again later.
AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
idea also to a generic page here, so probably not directly relevant to this
patch alone.
But I agree with you, sounds like something we can consider trying. I
would be curious on whether the perf difference would be measurable in this
specific case, though. I mean, this first walk will heat up all the
things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
normally.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-12 18:58 ` Peter Xu
@ 2024-08-12 22:47 ` Axel Rasmussen
2024-08-12 23:44 ` Sean Christopherson
0 siblings, 1 reply; 90+ messages in thread
From: Axel Rasmussen @ 2024-08-12 22:47 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of the
> > > work is done perfectly on the other side (host_pfn_mapping_level()).
> >
> > I don't think it has to be done in this series, but a future
> > optimization to consider is having follow_pfnmap just tell the caller
> > about the mapping level directly. It already found this information as
> > part of its walk. I think there's a possibility to simplify KVM /
> > avoid it having to do its own walk again later.
>
> AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> idea also to a generic page here, so probably not directly relevant to this
> patch alone.
>
> But I agree with you, sounds like something we can consider trying. I
> would be curious on whether the perf difference would be measurable in this
> specific case, though. I mean, this first walk will heat up all the
> things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> normally.
Agreed, the main benefit is probably just code simplification.
>
> Thanks,
>
> --
> Peter Xu
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-12 22:47 ` Axel Rasmussen
@ 2024-08-12 23:44 ` Sean Christopherson
2024-08-14 13:15 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Sean Christopherson @ 2024-08-12 23:44 UTC (permalink / raw)
To: Axel Rasmussen
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Jason Gunthorpe,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Mon, Aug 12, 2024, Axel Rasmussen wrote:
> On Mon, Aug 12, 2024 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Aug 09, 2024 at 10:23:20AM -0700, Axel Rasmussen wrote:
> > > On Fri, Aug 9, 2024 at 9:09 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest of the
> > > > work is done perfectly on the other side (host_pfn_mapping_level()).
> > >
> > > I don't think it has to be done in this series, but a future
> > > optimization to consider is having follow_pfnmap just tell the caller
> > > about the mapping level directly. It already found this information as
> > > part of its walk. I think there's a possibility to simplify KVM /
> > > avoid it having to do its own walk again later.
> >
> > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > idea also to a generic page here, so probably not directly relevant to this
> > patch alone.
Ya. My original hope was that KVM could simply walk the host page tables and get
whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
regular pages. That might be feasible after dropping all of KVM's refcounting
shenanigans[*]? Not sure, haven't thought too much about it, precisely because
I too think it won't provide any meaningful performance boost.
> > But I agree with you, sounds like something we can consider trying. I
> > would be curious on whether the perf difference would be measurable in this
> > specific case, though. I mean, this first walk will heat up all the
> > things, so I'd expect the 2nd walk (which is lockless) later be pretty fast
> > normally.
>
> Agreed, the main benefit is probably just code simplification.
+1. I wouldn't spend much time, if any, trying to plumb the size back out.
Unless we can convert regular pages as well, it'd probably be more confusing to
have separate ways of getting the mapping size.
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-12 23:44 ` Sean Christopherson
@ 2024-08-14 13:15 ` Jason Gunthorpe
2024-08-14 14:23 ` Sean Christopherson
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:15 UTC (permalink / raw)
To: Sean Christopherson
Cc: Axel Rasmussen, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
> > > > I don't think it has to be done in this series, but a future
> > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > about the mapping level directly. It already found this information as
> > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > avoid it having to do its own walk again later.
> > >
> > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > idea also to a generic page here, so probably not directly relevant to this
> > > patch alone.
>
> Ya. My original hope was that KVM could simply walk the host page tables and get
> whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> regular pages. That might be feasible after dropping all of KVM's refcounting
> shenanigans[*]? Not sure, haven't thought too much about it, precisely because
> I too think it won't provide any meaningful performance boost.
The main thing, from my perspective, is that KVM reliably creates 1G
mappings in its table if the VMA has 1G mappings, across all arches
and scenarios. For normal memory and PFNMAP equally.
Not returning the size here makes me wonder if that actually happens?
Does KVM have another way to know what size entry to create?
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 10/19] KVM: Use follow_pfnmap API
2024-08-14 13:15 ` Jason Gunthorpe
@ 2024-08-14 14:23 ` Sean Christopherson
0 siblings, 0 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-14 14:23 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Axel Rasmussen, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 04:44:40PM -0700, Sean Christopherson wrote:
>
> > > > > I don't think it has to be done in this series, but a future
> > > > > optimization to consider is having follow_pfnmap just tell the caller
> > > > > about the mapping level directly. It already found this information as
> > > > > part of its walk. I think there's a possibility to simplify KVM /
> > > > > avoid it having to do its own walk again later.
> > > >
> > > > AFAIU pfnmap isn't special in this case, as we do the "walk pgtable twice"
> > > > idea also to a generic page here, so probably not directly relevant to this
> > > > patch alone.
> >
> > Ya. My original hope was that KVM could simply walk the host page tables and get
> > whatever PFN+size it found, i.e. that KVM wouldn't care about pfn-mapped versus
> > regular pages. That might be feasible after dropping all of KVM's refcounting
> > shenanigans[*]? Not sure, haven't thought too much about it, precisely because
> > I too think it won't provide any meaningful performance boost.
>
> The main thing, from my perspective, is that KVM reliably creates 1G
> mappings in its table if the VMA has 1G mappings, across all arches
> and scenarios. For normal memory and PFNMAP equally.
Yes, KVM walks the host page tables for the user virtual address and uses whatever
page size it finds, regardless of the mapping type.
> Not returning the size here makes me wonder if that actually happens?
It does happen, the idea here was purely to avoid the second page table walk.
> Does KVM have another way to know what size entry to create?
>
> Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH 11/19] s390/pci_mmio: Use follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (8 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 10/19] KVM: Use " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 12/19] mm/x86/pat: Use the new " Peter Xu
` (10 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Niklas Schnelle, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
linux-s390
Use the new API that can understand huge pfn mappings.
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/s390/pci/pci_mmio.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 5398729bfe1b..de5c0b389a3e 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -118,12 +118,11 @@ static inline int __memcpy_toio_inuser(void __iomem *dst,
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
const void __user *, user_buffer, size_t, length)
{
+ struct follow_pfnmap_args args = { };
u8 local_buf[64];
void __iomem *io_addr;
void *buf;
struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
long ret;
if (!zpci_is_enabled())
@@ -169,11 +168,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
- ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+ args.address = mmio_addr;
+ args.vma = vma;
+ ret = follow_pfnmap_start(&args);
if (ret)
goto out_unlock_mmap;
- io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
+ io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) |
(mmio_addr & ~PAGE_MASK));
if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE)
@@ -181,7 +182,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
ret = zpci_memcpy_toio(io_addr, buf, length);
out_unlock_pt:
- pte_unmap_unlock(ptep, ptl);
+ follow_pfnmap_end(&args);
out_unlock_mmap:
mmap_read_unlock(current->mm);
out_free:
@@ -260,12 +261,11 @@ static inline int __memcpy_fromio_inuser(void __user *dst,
SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
void __user *, user_buffer, size_t, length)
{
+ struct follow_pfnmap_args args = { };
u8 local_buf[64];
void __iomem *io_addr;
void *buf;
struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
long ret;
if (!zpci_is_enabled())
@@ -308,11 +308,13 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
- ret = follow_pte(vma, mmio_addr, &ptep, &ptl);
+ args.vma = vma;
+ args.address = mmio_addr;
+ ret = follow_pfnmap_start(&args);
if (ret)
goto out_unlock_mmap;
- io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
+ io_addr = (void __iomem *)((args.pfn << PAGE_SHIFT) |
(mmio_addr & ~PAGE_MASK));
if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) {
@@ -322,7 +324,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
ret = zpci_memcpy_fromio(buf, io_addr, length);
out_unlock_pt:
- pte_unmap_unlock(ptep, ptl);
+ follow_pfnmap_end(&args);
out_unlock_mmap:
mmap_read_unlock(current->mm);
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 12/19] mm/x86/pat: Use the new follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (9 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 11/19] s390/pci_mmio: " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 13/19] vfio: " Peter Xu
` (9 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Use the new API that can understand huge pfn mappings.
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/x86/mm/pat/memtype.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index bdc2a240c2aa..fd210b362a04 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -951,23 +951,20 @@ static void free_pfn_range(u64 paddr, unsigned long size)
static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
resource_size_t *phys)
{
- pte_t *ptep, pte;
- spinlock_t *ptl;
+ struct follow_pfnmap_args args = { .vma = vma, .address = vma->vm_start };
- if (follow_pte(vma, vma->vm_start, &ptep, &ptl))
+ if (follow_pfnmap_start(&args))
return -EINVAL;
- pte = ptep_get(ptep);
-
/* Never return PFNs of anon folios in COW mappings. */
- if (vm_normal_folio(vma, vma->vm_start, pte)) {
- pte_unmap_unlock(ptep, ptl);
+ if (!args.special) {
+ follow_pfnmap_end(&args);
return -EINVAL;
}
- *prot = pgprot_val(pte_pgprot(pte));
- *phys = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
- pte_unmap_unlock(ptep, ptl);
+ *prot = pgprot_val(args.pgprot);
+ *phys = (resource_size_t)args.pfn << PAGE_SHIFT;
+ follow_pfnmap_end(&args);
return 0;
}
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 13/19] vfio: Use the new follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (10 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 12/19] mm/x86/pat: Use the new " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-14 13:20 ` Jason Gunthorpe
2024-08-09 16:09 ` [PATCH 14/19] acrn: " Peter Xu
` (8 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Use the new API that can understand huge pfn mappings.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
drivers/vfio/vfio_iommu_type1.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0960699e7554..bf391b40e576 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -513,12 +513,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
unsigned long vaddr, unsigned long *pfn,
bool write_fault)
{
- pte_t *ptep;
- pte_t pte;
- spinlock_t *ptl;
+ struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
int ret;
- ret = follow_pte(vma, vaddr, &ptep, &ptl);
+ ret = follow_pfnmap_start(&args);
if (ret) {
bool unlocked = false;
@@ -532,19 +530,17 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
if (ret)
return ret;
- ret = follow_pte(vma, vaddr, &ptep, &ptl);
+ ret = follow_pfnmap_start(&args);
if (ret)
return ret;
}
- pte = ptep_get(ptep);
-
- if (write_fault && !pte_write(pte))
+ if (write_fault && !args.writable)
ret = -EFAULT;
else
- *pfn = pte_pfn(pte);
+ *pfn = args.pfn;
- pte_unmap_unlock(ptep, ptl);
+ follow_pfnmap_end(&args);
return ret;
}
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* Re: [PATCH 13/19] vfio: Use the new follow_pfnmap API
2024-08-09 16:09 ` [PATCH 13/19] vfio: " Peter Xu
@ 2024-08-14 13:20 ` Jason Gunthorpe
0 siblings, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:20 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:09:03PM -0400, Peter Xu wrote:
> Use the new API that can understand huge pfn mappings.
>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0960699e7554..bf391b40e576 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -513,12 +513,10 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> unsigned long vaddr, unsigned long *pfn,
> bool write_fault)
> {
> - pte_t *ptep;
> - pte_t pte;
> - spinlock_t *ptl;
> + struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
> int ret;
>
> - ret = follow_pte(vma, vaddr, &ptep, &ptl);
> + ret = follow_pfnmap_start(&args);
Let's add a comment here that this is not locked properly to
discourage anyone from copying it.
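e.g. something like (wording illustrative only):

        /*
         * NOTE: this is not serialized against concurrent zapping of the
         * pfnmap, so the pfn looked up here can become stale by the time
         * it is installed into the IOMMU.  Do not copy this pattern.
         */
        ret = follow_pfnmap_start(&args);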
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* [PATCH 14/19] acrn: Use the new follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (11 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 13/19] vfio: " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 15/19] mm/access_process_vm: " Peter Xu
` (7 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Use the new API that can understand huge pfn mappings.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
drivers/virt/acrn/mm.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/virt/acrn/mm.c b/drivers/virt/acrn/mm.c
index db8ff1d0ac23..4c2f28715b70 100644
--- a/drivers/virt/acrn/mm.c
+++ b/drivers/virt/acrn/mm.c
@@ -177,9 +177,7 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap)
vma = vma_lookup(current->mm, memmap->vma_base);
if (vma && ((vma->vm_flags & VM_PFNMAP) != 0)) {
unsigned long start_pfn, cur_pfn;
- spinlock_t *ptl;
bool writable;
- pte_t *ptep;
if ((memmap->vma_base + memmap->len) > vma->vm_end) {
mmap_read_unlock(current->mm);
@@ -187,16 +185,20 @@ int acrn_vm_ram_map(struct acrn_vm *vm, struct acrn_vm_memmap *memmap)
}
for (i = 0; i < nr_pages; i++) {
- ret = follow_pte(vma, memmap->vma_base + i * PAGE_SIZE,
- &ptep, &ptl);
+ struct follow_pfnmap_args args = {
+ .vma = vma,
+ .address = memmap->vma_base + i * PAGE_SIZE,
+ };
+
+ ret = follow_pfnmap_start(&args);
if (ret)
break;
- cur_pfn = pte_pfn(ptep_get(ptep));
+ cur_pfn = args.pfn;
if (i == 0)
start_pfn = cur_pfn;
- writable = !!pte_write(ptep_get(ptep));
- pte_unmap_unlock(ptep, ptl);
+ writable = args.writable;
+ follow_pfnmap_end(&args);
/* Disallow write access if the PTE is not writable. */
if (!writable &&
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 15/19] mm/access_process_vm: Use the new follow_pfnmap API
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (12 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 14/19] acrn: " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 16/19] mm: Remove follow_pte() Peter Xu
` (6 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Use the new API that can understand huge pfn mappings.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/memory.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2194e0f9f541..313c17eedf56 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6504,34 +6504,34 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
resource_size_t phys_addr;
unsigned long prot = 0;
void __iomem *maddr;
- pte_t *ptep, pte;
- spinlock_t *ptl;
int offset = offset_in_page(addr);
int ret = -EINVAL;
+ bool writable;
+ struct follow_pfnmap_args args = { .vma = vma, .address = addr };
retry:
- if (follow_pte(vma, addr, &ptep, &ptl))
+ if (follow_pfnmap_start(&args))
return -EINVAL;
- pte = ptep_get(ptep);
- pte_unmap_unlock(ptep, ptl);
+ prot = pgprot_val(args.pgprot);
+ phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
+ writable = args.writable;
+ follow_pfnmap_end(&args);
- prot = pgprot_val(pte_pgprot(pte));
- phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
-
- if ((write & FOLL_WRITE) && !pte_write(pte))
+ if ((write & FOLL_WRITE) && !writable)
return -EINVAL;
maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
if (!maddr)
return -ENOMEM;
- if (follow_pte(vma, addr, &ptep, &ptl))
+ if (follow_pfnmap_start(&args))
goto out_unmap;
- if (!pte_same(pte, ptep_get(ptep))) {
- pte_unmap_unlock(ptep, ptl);
+ if ((prot != pgprot_val(args.pgprot)) ||
+ (phys_addr != (args.pfn << PAGE_SHIFT)) ||
+ (writable != args.writable)) {
+ follow_pfnmap_end(&args);
iounmap(maddr);
-
goto retry;
}
@@ -6540,7 +6540,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
else
memcpy_fromio(buf, maddr + offset, len);
ret = len;
- pte_unmap_unlock(ptep, ptl);
+ follow_pfnmap_end(&args);
out_unmap:
iounmap(maddr);
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 16/19] mm: Remove follow_pte()
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (13 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 15/19] mm/access_process_vm: " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 17/19] mm/x86: Support large pfn mappings Peter Xu
` (5 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
follow_pte() users have been converted to follow_pfnmap*(). Remove the
API.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/mm.h | 2 --
mm/memory.c | 73 ----------------------------------------------
2 files changed, 75 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7471302658af..c5949b8052c6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2369,8 +2369,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
int
copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
-int follow_pte(struct vm_area_struct *vma, unsigned long address,
- pte_t **ptepp, spinlock_t **ptlp);
int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write);
diff --git a/mm/memory.c b/mm/memory.c
index 313c17eedf56..72f61fffdda2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6265,79 +6265,6 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
}
#endif /* __PAGETABLE_PMD_FOLDED */
-/**
- * follow_pte - look up PTE at a user virtual address
- * @vma: the memory mapping
- * @address: user virtual address
- * @ptepp: location to store found PTE
- * @ptlp: location to store the lock for the PTE
- *
- * On a successful return, the pointer to the PTE is stored in @ptepp;
- * the corresponding lock is taken and its location is stored in @ptlp.
- *
- * The contents of the PTE are only stable until @ptlp is released using
- * pte_unmap_unlock(). This function will fail if the PTE is non-present.
- * Present PTEs may include PTEs that map refcounted pages, such as
- * anonymous folios in COW mappings.
- *
- * Callers must be careful when relying on PTE content after
- * pte_unmap_unlock(). Especially if the PTE maps a refcounted page,
- * callers must protect against invalidation with MMU notifiers; otherwise
- * access to the PFN at a later point in time can trigger use-after-free.
- *
- * Only IO mappings and raw PFN mappings are allowed. The mmap semaphore
- * should be taken for read.
- *
- * This function must not be used to modify PTE content.
- *
- * Return: zero on success, -ve otherwise.
- */
-int follow_pte(struct vm_area_struct *vma, unsigned long address,
- pte_t **ptepp, spinlock_t **ptlp)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
-
- mmap_assert_locked(mm);
- if (unlikely(address < vma->vm_start || address >= vma->vm_end))
- goto out;
-
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
- goto out;
-
- pgd = pgd_offset(mm, address);
- if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
- goto out;
-
- p4d = p4d_offset(pgd, address);
- if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
- goto out;
-
- pud = pud_offset(p4d, address);
- if (pud_none(*pud) || unlikely(pud_bad(*pud)))
- goto out;
-
- pmd = pmd_offset(pud, address);
- VM_BUG_ON(pmd_trans_huge(*pmd));
-
- ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
- if (!ptep)
- goto out;
- if (!pte_present(ptep_get(ptep)))
- goto unlock;
- *ptepp = ptep;
- return 0;
-unlock:
- pte_unmap_unlock(ptep, *ptlp);
-out:
- return -EINVAL;
-}
-EXPORT_SYMBOL_GPL(follow_pte);
-
static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
spinlock_t *lock, pte_t *ptep,
pgprot_t pgprot, unsigned long pfn_base,
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 17/19] mm/x86: Support large pfn mappings
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (14 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 16/19] mm: Remove follow_pte() Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 18/19] mm/arm64: " Peter Xu
` (4 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Helpers to install and detect special pmd/pud entries. In short, bit 9 on
x86 is not used for pmd/pud, so we can directly define them the same as the
pte level. One note is that it's also used in _PAGE_BIT_CPA_TEST, but that
is only used in the debug test, and shouldn't conflict in this case.
Another note is that pxx_set|clear_flags() for pmd/pud will need to be moved
up in the file so that they can be referenced by the new special bit helpers.
There's no change in the code that was moved.
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 80 ++++++++++++++++++++++------------
2 files changed, 53 insertions(+), 28 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index acd9745bf2ae..7a3fb2ff3e72 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86_64
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_PER_VMA_LOCK
+ select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a7c1e9cfea41..1e463c9a650f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -120,6 +120,34 @@ extern pmdval_t early_pmd_flags;
#define arch_end_context_switch(prev) do {} while(0)
#endif /* CONFIG_PARAVIRT_XXL */
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+ pmdval_t v = native_pmd_val(pmd);
+
+ return native_make_pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+ pmdval_t v = native_pmd_val(pmd);
+
+ return native_make_pmd(v & ~clear);
+}
+
+static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
+{
+ pudval_t v = native_pud_val(pud);
+
+ return native_make_pud(v | set);
+}
+
+static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
+{
+ pudval_t v = native_pud_val(pud);
+
+ return native_make_pud(v & ~clear);
+}
+
/*
* The following only work if pte_present() is true.
* Undefined behaviour if not..
@@ -317,6 +345,30 @@ static inline int pud_devmap(pud_t pud)
}
#endif
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+static inline bool pmd_special(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SPECIAL;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+#endif /* CONFIG_ARCH_SUPPORTS_PMD_PFNMAP */
+
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+static inline bool pud_special(pud_t pud)
+{
+ return pud_flags(pud) & _PAGE_SPECIAL;
+}
+
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+ return pud_set_flags(pud, _PAGE_SPECIAL);
+}
+#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
+
static inline int pgd_devmap(pgd_t pgd)
{
return 0;
@@ -487,20 +539,6 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
}
-static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
-{
- pmdval_t v = native_pmd_val(pmd);
-
- return native_make_pmd(v | set);
-}
-
-static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
-{
- pmdval_t v = native_pmd_val(pmd);
-
- return native_make_pmd(v & ~clear);
-}
-
/* See comments above mksaveddirty_shift() */
static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
{
@@ -595,20 +633,6 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
#define pmd_mkwrite pmd_mkwrite
-static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
-{
- pudval_t v = native_pud_val(pud);
-
- return native_make_pud(v | set);
-}
-
-static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
-{
- pudval_t v = native_pud_val(pud);
-
- return native_make_pud(v & ~clear);
-}
-
/* See comments above mksaveddirty_shift() */
static inline pud_t pud_mksaveddirty(pud_t pud)
{
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 18/19] mm/arm64: Support large pfn mappings
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (15 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 17/19] mm/x86: Support large pfn mappings Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-09 16:09 ` [PATCH 19/19] vfio/pci: Implement huge_fault support Peter Xu
` (3 subsequent siblings)
20 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
pmds/puds. Provide the pmd/pud helpers to set/get the special bit.
There's one more thing missing for arm64, which is the pxx_pgprot()
helpers for pmd/pud. Add them too; they are mostly the same as the pte
version, just with the pfn field dropped. These helpers are essential for
the new follow_pfnmap*() API to report valid pgprot_t results.
Note that arm64 doesn't support huge PUDs yet, but it's still
straightforward to provide the pud helpers that we need altogether. Only
the PMD helpers bring an immediate benefit until arm64 gains huge PUD
support in general (e.g. in THPs).
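To spell out the trick used by the new pxx_pgprot() helpers below (an
informal derivation, for the reader's benefit only): rebuilding an entry
from just its pfn with empty attributes yields only the address bits, so
XORing that with the original value leaves exactly the attribute bits:
  pmd_val(pmd)                                = address bits | attribute bits
  pmd_val(pfn_pmd(pmd_pfn(pmd), __pgprot(0))) = address bits
  XOR of the two                              = attribute bits == the pgprot_t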
Cc: linux-arm-kernel@lists.infradead.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 29 +++++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b3fc891f1544..5f026b95f309 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -99,6 +99,7 @@ config ARM64
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
+ select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b78cc4a6758b..2faecc033a19 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -578,6 +578,14 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
}
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+#define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL)))
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL));
+}
+#endif
+
#define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
#define __phys_to_pmd_val(phys) __phys_to_pte_val(phys)
#define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT)
@@ -595,6 +603,27 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
#define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT)
#define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+#define pud_special(pud) pte_special(pud_pte(pud))
+#define pud_mkspecial(pud) pte_pud(pte_mkspecial(pud_pte(pud)))
+#endif
+
+#define pmd_pgprot pmd_pgprot
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+ unsigned long pfn = pmd_pfn(pmd);
+
+ return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd));
+}
+
+#define pud_pgprot pud_pgprot
+static inline pgprot_t pud_pgprot(pud_t pud)
+{
+ unsigned long pfn = pud_pfn(pud);
+
+ return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
+}
+
static inline void __set_pte_at(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* [PATCH 19/19] vfio/pci: Implement huge_fault support
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (16 preceding siblings ...)
2024-08-09 16:09 ` [PATCH 18/19] mm/arm64: " Peter Xu
@ 2024-08-09 16:09 ` Peter Xu
2024-08-14 13:25 ` Jason Gunthorpe
[not found] ` <20240809160909.1023470-7-peterx@redhat.com>
` (2 subsequent siblings)
20 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:09 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, peterx, Will Deacon,
Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
From: Alex Williamson <alex.williamson@redhat.com>
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we
can take advantage of PMD and PUD faults to PCI BAR mmaps and create
more efficient mappings. PCI BARs are always a power of two and will
typically get at least PMD alignment without userspace even trying.
Userspace alignment for PUD mappings is also not too difficult.
Consolidate faults through a single handler with a new wrapper for
standard single page faults. The pre-faulting behavior of commit
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is
removed in this refactoring since huge_fault will cover the bulk of
the faults and results in more efficient page table usage. We also
want to avoid having pre-faulted single-page mappings preempt huge page
mappings.
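For reference, the "not too difficult" userspace alignment for PUD mappings
could look roughly like this (an illustrative sketch only, error handling
elided; not part of this patch):
#include <stdint.h>
#include <sys/mman.h>
/* Reserve an oversized PROT_NONE region, then map the BAR at the first
 * 1G-aligned address inside it with MAP_FIXED. */
static void *map_bar_pud_aligned(int device_fd, off_t bar_offset, size_t bar_size)
{
	const uintptr_t align = 1UL << 30;	/* PUD size on x86-64 */
	uint8_t *base = mmap(NULL, bar_size + align, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t addr = ((uintptr_t)base + align - 1) & ~(align - 1);
	return mmap((void *)addr, bar_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, device_fd, bar_offset);
}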
Cc: kvm@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
drivers/vfio/pci/vfio_pci_core.c | 60 +++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..2d7478e9a62d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -20,6 +20,7 @@
#include <linux/mutex.h>
#include <linux/notifier.h>
#include <linux/pci.h>
+#include <linux/pfn_t.h>
#include <linux/pm_runtime.h>
#include <linux/slab.h>
#include <linux/types.h>
@@ -1657,14 +1658,20 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
}
-static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
+ unsigned int order)
{
struct vm_area_struct *vma = vmf->vma;
struct vfio_pci_core_device *vdev = vma->vm_private_data;
unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
- unsigned long addr = vma->vm_start;
vm_fault_t ret = VM_FAULT_SIGBUS;
+ if (order && (vmf->address & ((PAGE_SIZE << order) - 1) ||
+ vmf->address + (PAGE_SIZE << order) > vma->vm_end)) {
+ ret = VM_FAULT_FALLBACK;
+ goto out;
+ }
+
pfn = vma_to_pfn(vma);
down_read(&vdev->memory_lock);
@@ -1672,30 +1679,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
goto out_unlock;
- ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
- if (ret & VM_FAULT_ERROR)
- goto out_unlock;
-
- /*
- * Pre-fault the remainder of the vma, abort further insertions and
- * supress error if fault is encountered during pre-fault.
- */
- for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
- if (addr == vmf->address)
- continue;
-
- if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
- break;
+ switch (order) {
+ case 0:
+ ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+ break;
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+ case PMD_ORDER:
+ ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn + pgoff,
+ PFN_DEV), false);
+ break;
+#endif
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+ case PUD_ORDER:
+ ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn + pgoff,
+ PFN_DEV), false);
+ break;
+#endif
+ default:
+ ret = VM_FAULT_FALLBACK;
}
out_unlock:
up_read(&vdev->memory_lock);
+out:
+ dev_dbg_ratelimited(&vdev->pdev->dev,
+ "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n",
+ __func__, order,
+ vma->vm_pgoff >>
+ (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT),
+ pgoff, (unsigned int)ret);
return ret;
}
+static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
+{
+ return vfio_pci_mmap_huge_fault(vmf, 0);
+}
+
static const struct vm_operations_struct vfio_pci_mmap_ops = {
- .fault = vfio_pci_mmap_fault,
+ .fault = vfio_pci_mmap_page_fault,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+ .huge_fault = vfio_pci_mmap_huge_fault,
+#endif
};
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
--
2.45.0
^ permalink raw reply related [flat|nested] 90+ messages in thread* Re: [PATCH 19/19] vfio/pci: Implement huge_fault support
2024-08-09 16:09 ` [PATCH 19/19] vfio/pci: Implement huge_fault support Peter Xu
@ 2024-08-14 13:25 ` Jason Gunthorpe
2024-08-14 16:08 ` Alex Williamson
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:25 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:09:09PM -0400, Peter Xu wrote:
> @@ -1672,30 +1679,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
> goto out_unlock;
>
> - ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> - if (ret & VM_FAULT_ERROR)
> - goto out_unlock;
> -
> - /*
> - * Pre-fault the remainder of the vma, abort further insertions and
> - * supress error if fault is encountered during pre-fault.
> - */
> - for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
> - if (addr == vmf->address)
> - continue;
> -
> - if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
> - break;
> + switch (order) {
> + case 0:
> + ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> + break;
> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> + case PMD_ORDER:
> + ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn + pgoff,
> + PFN_DEV), false);
> + break;
> +#endif
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> + case PUD_ORDER:
> + ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn + pgoff,
> + PFN_DEV), false);
> + break;
> +#endif
I feel like this switch should be in some general function?
vmf_insert_pfn_order(vmf, order, __pfn_to_pfn_t(pfn + pgoff, PFN_DEV), false);
No reason to expose every driver to this when you've already got a
nice contract to have the driver work on the passed in order.
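Something like this, perhaps (just a sketch; "vmf_insert_pfn_order" is a
made-up name for the proposed helper, not an existing API):
static vm_fault_t vmf_insert_pfn_order(struct vm_fault *vmf, unsigned int order,
				       pfn_t pfn, bool write)
{
	switch (order) {
	case 0:
		return vmf_insert_pfn(vmf->vma, vmf->address, pfn_t_to_pfn(pfn));
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case PMD_ORDER:
		return vmf_insert_pfn_pmd(vmf, pfn, write);
#endif
#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
	case PUD_ORDER:
		return vmf_insert_pfn_pud(vmf, pfn, write);
#endif
	default:
		return VM_FAULT_FALLBACK;
	}
}
That would simply fold the switch above into mm, so every driver gets the
same fallback behaviour for unsupported orders.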
What happens if the driver can't get a PFN that matches the requested
order?
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 19/19] vfio/pci: Implement huge_fault support
2024-08-14 13:25 ` Jason Gunthorpe
@ 2024-08-14 16:08 ` Alex Williamson
2024-08-14 16:24 ` Jason Gunthorpe
0 siblings, 1 reply; 90+ messages in thread
From: Alex Williamson @ 2024-08-14 16:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
David Hildenbrand, Thomas Gleixner, kvm, Dave Hansen, Yan Zhao
On Wed, 14 Aug 2024 10:25:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Aug 09, 2024 at 12:09:09PM -0400, Peter Xu wrote:
> > @@ -1672,30 +1679,49 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> > if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
> > goto out_unlock;
> >
> > - ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> > - if (ret & VM_FAULT_ERROR)
> > - goto out_unlock;
> > -
> > - /*
> > - * Pre-fault the remainder of the vma, abort further insertions and
> > - * supress error if fault is encountered during pre-fault.
> > - */
> > - for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
> > - if (addr == vmf->address)
> > - continue;
> > -
> > - if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
> > - break;
> > + switch (order) {
> > + case 0:
> > + ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> > + break;
> > +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> > + case PMD_ORDER:
> > + ret = vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn + pgoff,
> > + PFN_DEV), false);
> > + break;
> > +#endif
> > +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> > + case PUD_ORDER:
> > + ret = vmf_insert_pfn_pud(vmf, __pfn_to_pfn_t(pfn + pgoff,
> > + PFN_DEV), false);
> > + break;
> > +#endif
>
> I feel like this switch should be in some general function?
>
> vmf_insert_pfn_order(vmf, order, __pfn_to_pfn_t(pfn + pgoff, PFN_DEV), false);
>
> No reason to expose every driver to this when you've already got a
> nice contract to have the driver work on the passed in order.
>
> What happens if the driver can't get a PFN that matches the requested
> order?
There was some alignment and size checking chopped from the previous
reply that triggered a fallback, but in general PCI BARs are a power of
two and naturally aligned, so there should always be an order-aligned
pfn. Thanks,
Alex
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 19/19] vfio/pci: Implement huge_fault support
2024-08-14 16:08 ` Alex Williamson
@ 2024-08-14 16:24 ` Jason Gunthorpe
0 siblings, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 16:24 UTC (permalink / raw)
To: Alex Williamson
Cc: Peter Xu, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
David Hildenbrand, Thomas Gleixner, kvm, Dave Hansen, Yan Zhao
On Wed, Aug 14, 2024 at 10:08:49AM -0600, Alex Williamson wrote:
> There was some alignment and size checking chopped from the previous
> reply that triggered a fallback, but in general PCI BARs are a power of
> two and naturally aligned, so there should always be an order aligned
> pfn.
Sure, though I was mostly thinking about how to use this API in other
drivers.
Maybe the device has only 2M page alignment but the VMA was aligned to
1G? It will be called with an order higher than it can support, but
that is not an error that should fail the fault.
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
[parent not found: <20240809160909.1023470-7-peterx@redhat.com>]
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
[not found] ` <20240809160909.1023470-7-peterx@redhat.com>
@ 2024-08-09 16:20 ` David Hildenbrand
2024-08-09 16:54 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 16:20 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:08, Peter Xu wrote:
> Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> However that's unnecessary if the vma is stable, and when it's mapped under
> VM_PFNMAP | VM_IO.
>
> Instead of adding similar checks in all the levels for huge pfnmaps, let
> folio_walk_start() fail even earlier for these mappings. It's also
> something gup-slow already does, so make them match.
>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> mm/pagewalk.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index cd79fb3b89e5..fd3965efe773 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> p4d_t *p4dp;
>
> mmap_assert_locked(vma->vm_mm);
> +
> + /* It has no folio backing the mappings at all.. */
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> + return NULL;
> +
That is in general not what we want, and we still have some places that
wrongly hard-code that behavior.
In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
vm_normal_page_pud()] should be able to identify PFN maps and reject
them, no?
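I.e., roughly this at the PMD level (a sketch of the idea only, function
name made up, not a tested patch):
static struct folio *pfnmap_aware_pmd_to_folio(struct vm_area_struct *vma,
					       unsigned long addr, pmd_t pmd)
{
	/* vm_normal_page_pmd() returns NULL for special / raw-PFN mappings,
	 * so anon folios COWed into a MAP_PRIVATE pfnmap stay walkable. */
	struct page *page = vm_normal_page_pmd(vma, addr, pmd);
	return page ? page_folio(page) : NULL;
}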
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-09 16:20 ` [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start() David Hildenbrand
@ 2024-08-09 16:54 ` Peter Xu
2024-08-09 17:25 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-09 16:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:08, Peter Xu wrote:
> > Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> > However that's unnecessary if the vma is stable, and when it's mapped under
> > VM_PFNMAP | VM_IO.
> >
> > Instead of adding similar checks in all the levels for huge pfnmaps, let
> > folio_walk_start() fail even earlier for these mappings. It's also
> > something gup-slow already does, so make them match.
> >
> > Cc: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > mm/pagewalk.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index cd79fb3b89e5..fd3965efe773 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> > p4d_t *p4dp;
> > mmap_assert_locked(vma->vm_mm);
> > +
> > + /* It has no folio backing the mappings at all.. */
> > + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > + return NULL;
> > +
>
> That is in general not what we want, and we still have some places that
> wrongly hard-code that behavior.
>
> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>
> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> no?
Yep, I think we can also rely on the special bit.
When I was working on this whole series I must confess I was already
confused about the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't
need PFNMAP support in either mprotect or fork, at least for our use case,
and MAP_PRIVATE pfnmaps are even one step further.
Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
If so, would it make sense to keep them aligned for now, and change them
together later? Or do you think we should just rely on the special bits?
And, just curious: is there any use case you're aware of that can benefit
from caring about PRIVATE pfnmaps so far, especially in this path?
As far as I can tell, none of the folio_walk_start() users so far should even
stumble on top of a pfnmap, shared or private. But that's from a fairly quick
glimpse only. IOW, I was wondering whether I'm just being over-cautious here.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-09 16:54 ` Peter Xu
@ 2024-08-09 17:25 ` David Hildenbrand
2024-08-09 21:37 ` Peter Xu
2024-08-14 13:05 ` Jason Gunthorpe
0 siblings, 2 replies; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 17:25 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:54, Peter Xu wrote:
> On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
>> On 09.08.24 18:08, Peter Xu wrote:
>>> Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
>>> However that's unnecessary if the vma is stable, and when it's mapped under
>>> VM_PFNMAP | VM_IO.
>>>
>>> Instead of adding similar checks in all the levels for huge pfnmaps, let
>>> folio_walk_start() fail even earlier for these mappings. It's also
>>> something gup-slow already does, so make them match.
>>>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> mm/pagewalk.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index cd79fb3b89e5..fd3965efe773 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>>> p4d_t *p4dp;
>>> mmap_assert_locked(vma->vm_mm);
>>> +
>>> + /* It has no folio backing the mappings at all.. */
>>> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
>>> + return NULL;
>>> +
>>
>> That is in general not what we want, and we still have some places that
>> wrongly hard-code that behavior.
>>
>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>
>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>> no?
>
> Yep, I think we can also rely on special bit.
>
> When I was working on this whole series I must confess I am already
> confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't
> need either PFNMAP for either mprotect/fork/... at least for our use case,
> then VM_PRIVATE is even one step further.
Yes, it's rather a corner case indeed.
>
> Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
I assume just nobody really noticed, just like nobody noticed that
walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
Your process memory stats will likely miss anon folios on COW PFNMAP
mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem).
> If so, would it make sense we keep them aligned as of now, and change them
> altogether? Or do you think we should just rely on the special bits?
GUP already refuses to work on a lot of other stuff, so likely not a
good use of time unless somebody complains.
But yes, long-term we should make all code either respect that it could
happen (and bury less awkward checks in page table walkers) or rip
support for MAP_PRIVATE PFNMAP out completely.
>
> And, just curious: is there any use case you're aware of that can benefit
> from caring PRIVATE pfnmaps yet so far, especially in this path?
In general MAP_PRIVATE pfnmaps are not really useful for things like MMIO.
There was a discussion (in VM_PAT) some time ago whether we could remove
MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
mappings on /dev/mem, although not many (and they might not actually
write to these areas).
I'm happy if someone wants to try ripping that out, I'm not brave enough :)
[1]
https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com
>
> As far as I read, none of folio_walk_start() users so far should even
> stumble on top of a pfnmap, share or private. But that's a fairly quick
> glimps only.
do_pages_stat()->do_pages_stat_array() should be able to trigger it, if
you pass "nodes=NULL" to move_pages().
Maybe s390x could be tricked into it, but likely as you say, most code
shouldn't trigger it. The function itself should be handling it
correctly as of today, though.
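For example, something like this from userspace should reach the walk
through that path (an untested sketch, error handling elided; needs root
and a mappable /dev/mem range, link with -lnuma):
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main(void)
{
	int fd = open("/dev/mem", O_RDWR);
	/* MAP_PRIVATE pfnmap; the write COWs an anon page into it. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		       fd, 0xa0000);
	void *pages[1] = { p };
	int status = -1;
	memset(p, 0, 4096);
	/* nodes == NULL: only report the node of each page in status[]. */
	move_pages(0, 1, pages, NULL, &status, 0);
	printf("status = %d\n", status);
	return 0;
}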
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-09 17:25 ` David Hildenbrand
@ 2024-08-09 21:37 ` Peter Xu
2024-08-14 13:05 ` Jason Gunthorpe
1 sibling, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-09 21:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Jason Gunthorpe, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> On 09.08.24 18:54, Peter Xu wrote:
> > On Fri, Aug 09, 2024 at 06:20:06PM +0200, David Hildenbrand wrote:
> > > On 09.08.24 18:08, Peter Xu wrote:
> > > > Pfnmaps can always be identified with special bits in the ptes/pmds/puds.
> > > > However that's unnecessary if the vma is stable, and when it's mapped under
> > > > VM_PFNMAP | VM_IO.
> > > >
> > > > Instead of adding similar checks in all the levels for huge pfnmaps, let
> > > > folio_walk_start() fail even earlier for these mappings. It's also
> > > > something gup-slow already does, so make them match.
> > > >
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > mm/pagewalk.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > > index cd79fb3b89e5..fd3965efe773 100644
> > > > --- a/mm/pagewalk.c
> > > > +++ b/mm/pagewalk.c
> > > > @@ -727,6 +727,11 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> > > > p4d_t *p4dp;
> > > > mmap_assert_locked(vma->vm_mm);
> > > > +
> > > > + /* It has no folio backing the mappings at all.. */
> > > > + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> > > > + return NULL;
> > > > +
> > >
> > > That is in general not what we want, and we still have some places that
> > > wrongly hard-code that behavior.
> > >
> > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > >
> > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > no?
> >
> > Yep, I think we can also rely on special bit.
> >
> > When I was working on this whole series I must confess I am already
> > confused on the real users of MAP_PRIVATE pfnmaps. E.g. we probably don't
> > need either PFNMAP for either mprotect/fork/... at least for our use case,
> > then VM_PRIVATE is even one step further.
>
> Yes, it's rather a corner case indeed.
> >
> > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
>
> I assume just nobody really noticed, just like nobody noticed that
> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
I noticed it, and that's one of the reasons why this series can be small,
as the walk_page callers are left intact.
>
> Your process memory stats will likely miss anon folios on COW PFNMAP
> mappings ... in the rare cases where they exist (e.g., mmap() of /dev/mem).
Do you mean /proc/$PID/status? I thought that (aka, mm counters) should be
fine with anon pages CoWed on top of private pfnmaps, but possibly I
misunderstood what you meant.
>
> > If so, would it make sense we keep them aligned as of now, and change them
> > altogether? Or do you think we should just rely on the special bits?
>
> GUP already refuses to work on a lot of other stuff, so likely not a good
> use of time unless somebody complains.
>
> But yes, long-term we should make all code either respect that it could
> happen (and bury less awkward checks in page table walkers) or rip support
> for MAP_PRIVATE PFNMAP out completely.
>
> >
> > And, just curious: is there any use case you're aware of that can benefit
> > from caring PRIVATE pfnmaps yet so far, especially in this path?
>
> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
>
> There was a discussion (in VM_PAT) some time ago whether we could remove
> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> mappings on /dev/mem, although not many (and they might not actually write
> to these areas).
Ah, looks like the private mapping of /dev/mem is the only use case we know of.
>
> I'm happy if someone wants to try ripping that out, I'm not brave enough :)
>
> [1]
> https://lkml.kernel.org/r/1f2a8ed4-aaff-4be7-b3b6-63d2841a2908@redhat.com
>
> >
> > As far as I read, none of folio_walk_start() users so far should even
> > stumble on top of a pfnmap, share or private. But that's a fairly quick
> > glimps only.
>
> do_pages_stat()->do_pages_stat_array() should be able to trigger it, if you
> pass "nodes=NULL" to move_pages().
.. so assuming this is also about a private mapping over /dev/mem, then:
someone writes some pages there over some MMIO regions, then tries to
use move_pages() to fetch which node those pages are located on? Hmm.. OK :)
>
> Maybe s390x could be tricked into it, but likely as you say, most code
> shouldn't trigger it. The function itself should be handling it correctly as
> of today, though.
So indeed I cannot guarantee it won't be used, and it's not a huge deal
if we stick with the special bits. Let me go with that for
folio_walk_start() in the next version.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-09 17:25 ` David Hildenbrand
2024-08-09 21:37 ` Peter Xu
@ 2024-08-14 13:05 ` Jason Gunthorpe
2024-08-16 9:30 ` David Hildenbrand
1 sibling, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 13:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > > That is in general not what we want, and we still have some places that
> > > wrongly hard-code that behavior.
> > >
> > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > >
> > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > no?
> >
> > Yep, I think we can also rely on special bit.
It is more than just relying on the special bit..
VM_PFNMAP/VM_MIXEDMAP should really only be used inside
vm_normal_page() because they are, effectively, support for a limited
emulation of the special bit on arches that don't have them. There are
a bunch of weird rules that are used to try and make that work
properly that have to be followed.
On arches with the special bit they should possibly never be checked
since the special bit does everything you need.
Arguably any place reading those flags outside of vm_normal_page/etc
is suspect.
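Condensed, the logic is roughly this (paraphrasing mm/memory.c's
vm_normal_page(), not a verbatim copy, several corner cases omitted):
static struct page *normal_page_sketch(struct vm_area_struct *vma,
				       unsigned long addr, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);
	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		/* The PTE itself says whether there is a normal page. */
		if (pte_special(pte))
			return NULL;
		return pfn_to_page(pfn);
	}
	/* No special bit: emulate it from the VMA flags. */
	if (vma->vm_flags & VM_MIXEDMAP) {
		if (!pfn_valid(pfn))
			return NULL;
	} else if (vma->vm_flags & VM_PFNMAP) {
		/* Only COWed pages (pfn off the linear remap) are normal. */
		unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
		if (pfn == vma->vm_pgoff + off)
			return NULL;
	}
	return pfn_to_page(pfn);
}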
> > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
>
> I assume just nobody really noticed, just like nobody noticed that
> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
Like here..
> > And, just curious: is there any use case you're aware of that can benefit
> > from caring PRIVATE pfnmaps yet so far, especially in this path?
>
> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
>
> There was a discussion (in VM_PAT) some time ago whether we could remove
> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> mappings on /dev/mem, although not many (and they might not actually write
> to these areas).
I've squashed many bugs where kernel drivers don't demand userspace
use MAP_SHARED when asking for a PFNMAP, and of course userspace has
gained the wrong flags too. I don't know if anyone needs this, but it
has crept wrongly into the API.
Maybe an interesting place to start is a warning printk about using an
obsolete feature and see where things go from there??
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-14 13:05 ` Jason Gunthorpe
@ 2024-08-16 9:30 ` David Hildenbrand
2024-08-16 14:21 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-16 9:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 14.08.24 15:05, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
>
>>>> That is in general not what we want, and we still have some places that
>>>> wrongly hard-code that behavior.
>>>>
>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>>>
>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>>>> no?
>>>
>>> Yep, I think we can also rely on special bit.
>
> It is more than just relying on the special bit..
>
> VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> vm_normal_page() because thay are, effectively, support for a limited
> emulation of the special bit on arches that don't have them. There are
> a bunch of weird rules that are used to try and make that work
> properly that have to be followed.
>
> On arches with the sepcial bit they should possibly never be checked
> since the special bit does everything you need.
>
> Arguably any place reading those flags out side of vm_normal_page/etc
> is suspect.
IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and
pte_special()/... usage should be limited to
vm_normal_page/vm_normal_page_pmd/ ... of course, GUP-fast is special
(one of the reasons for "pte_special()" and friends after all).
>
>>> Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
>>
>> I assume just nobody really noticed, just like nobody noticed that
>> walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
>
> Like here..
>
>>> And, just curious: is there any use case you're aware of that can benefit
>>> from caring PRIVATE pfnmaps yet so far, especially in this path?
>>
>> In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
>>
>> There was a discussion (in VM_PAT) some time ago whether we could remove
>> MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
>> mappings on /dev/mem, although not many (and they might not actually write
>> to these areas).
>
> I've squashed many bugs where kernel drivers don't demand userspace
> use MAP_SHARED when asking for a PFNMAP, and of course userspace has
> gained the wrong flags too. I don't know if anyone needs this, but it
> has crept wrongly into the API.
>
> Maybe an interesting place to start is a warning printk about using an
> obsolete feature and see where things go from there??
Maybe we should start with some way to pr_warn_ONCE() whenever we get a
COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially
populate the fresh anon folio.
Then we don't only know who mmaps() something like that, but who
actually relies on getting anon folios in there.
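Something as simple as this would probably do (a sketch only; the helper
name is made up, and the exact call site, e.g. the wp-fault path right
before installing the anon folio, still needs to be figured out):
static inline void report_private_pfnmap_cow(struct vm_area_struct *vma)
{
	/* Only reached on COW/unshare faults, i.e. MAP_PRIVATE mappings. */
	if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_IO)))
		pr_warn_once("%s[%d]: COW fault in MAP_PRIVATE pfnmap\n",
			     current->comm, task_pid_nr(current));
}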
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-16 9:30 ` David Hildenbrand
@ 2024-08-16 14:21 ` Peter Xu
2024-08-16 17:38 ` Jason Gunthorpe
2024-08-16 17:56 ` David Hildenbrand
0 siblings, 2 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-16 14:21 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> On 14.08.24 15:05, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> >
> > > > > That is in general not what we want, and we still have some places that
> > > > > wrongly hard-code that behavior.
> > > > >
> > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > >
> > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > no?
> > > >
> > > > Yep, I think we can also rely on special bit.
> >
> > It is more than just relying on the special bit..
> >
> > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > vm_normal_page() because thay are, effectively, support for a limited
> > emulation of the special bit on arches that don't have them. There are
> > a bunch of weird rules that are used to try and make that work
> > properly that have to be followed.
> >
> > On arches with the sepcial bit they should possibly never be checked
> > since the special bit does everything you need.
> >
> > Arguably any place reading those flags out side of vm_normal_page/etc
> > is suspect.
>
> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> GUP-fast is special (one of the reason for "pte_special()" and friends after
> all).
The issue is that at least GUP currently doesn't work with pfnmaps, while
there are potentially users who want to be able to work on both page +
!page use cases. Besides access_process_vm(), KVM also does a similar thing,
and maybe more; these all seem to be valid use cases of referencing the vma
flags for PFNMAP and such, so they can identify "it's a pfnmap" or more
generic issues like "permission check error on the pgtable".
The whole private mapping thing definitely made it complicated.
>
> >
> > > > Here I chose to follow gup-slow, and I suppose you meant that's also wrong?
> > >
> > > I assume just nobody really noticed, just like nobody noticed that
> > > walk_page_test() skips VM_PFNMAP (but not VM_IO :) ).
> >
> > Like here..
> >
> > > > And, just curious: is there any use case you're aware of that can benefit
> > > > from caring PRIVATE pfnmaps yet so far, especially in this path?
> > >
> > > In general MAP_PRIVATE pfnmaps is not really useful on things like MMIO.
> > >
> > > There was a discussion (in VM_PAT) some time ago whether we could remove
> > > MAP_PRIVATE PFNMAPs completely [1]. At least some users still use COW
> > > mappings on /dev/mem, although not many (and they might not actually write
> > > to these areas).
> >
> > I've squashed many bugs where kernel drivers don't demand userspace
> > use MAP_SHARED when asking for a PFNMAP, and of course userspace has
> > gained the wrong flags too. I don't know if anyone needs this, but it
> > has crept wrongly into the API.
> >
> > Maybe an interesting place to start is a warning printk about using an
> > obsolete feature and see where things go from there??
>
> Maybe we should start with some way to pr_warn_ONCE() whenever we get a
> COW/unshare-fault in such a MAP_PRIVATE mapping, and essentially populate
> the fresh anon folio.
>
> Then we don't only know who mmaps() something like that, but who actually
> relies on getting anon folios in there.
Sounds useful to me, since nobody yet has a solid understanding of those private
mappings and we'd want to collect some info. My gut feeling is we'll see
some valid use of them, but I hope I'm wrong.. I hope we can still leave
that as a separate thing so we can focus on large mappings in this series. And
yes, I'll stick with the special bits here so as not to add one more flag reference.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-16 14:21 ` Peter Xu
@ 2024-08-16 17:38 ` Jason Gunthorpe
2024-08-21 18:42 ` Peter Xu
2024-08-16 17:56 ` David Hildenbrand
1 sibling, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-16 17:38 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> > On 14.08.24 15:05, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > >
> > > > > > That is in general not what we want, and we still have some places that
> > > > > > wrongly hard-code that behavior.
> > > > > >
> > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > > >
> > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > > no?
> > > > >
> > > > > Yep, I think we can also rely on special bit.
> > >
> > > It is more than just relying on the special bit..
> > >
> > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > > vm_normal_page() because thay are, effectively, support for a limited
> > > emulation of the special bit on arches that don't have them. There are
> > > a bunch of weird rules that are used to try and make that work
> > > properly that have to be followed.
> > >
> > > On arches with the sepcial bit they should possibly never be checked
> > > since the special bit does everything you need.
> > >
> > > Arguably any place reading those flags out side of vm_normal_page/etc
> > > is suspect.
> >
> > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> > GUP-fast is special (one of the reason for "pte_special()" and friends after
> > all).
>
> The issue is at least GUP currently doesn't work with pfnmaps, while
> there're potentially users who wants to be able to work on both page +
> !page use cases. Besides access_process_vm(), KVM also uses similar thing,
> and maybe more; these all seem to be valid use case of reference the vma
> flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> generic issues like "permission check error on pgtable".
Why are those valid compared with calling vm_normal_page() per-page
instead?
What reason is there to not do something based only on the PFNMAP
flag?
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-16 17:38 ` Jason Gunthorpe
@ 2024-08-21 18:42 ` Peter Xu
0 siblings, 0 replies; 90+ messages in thread
From: Peter Xu @ 2024-08-21 18:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 02:38:36PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 10:21:17AM -0400, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
> > > On 14.08.24 15:05, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
> > > >
> > > > > > > That is in general not what we want, and we still have some places that
> > > > > > > wrongly hard-code that behavior.
> > > > > > >
> > > > > > > In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
> > > > > > >
> > > > > > > vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
> > > > > > > vm_normal_page_pud()] should be able to identify PFN maps and reject them,
> > > > > > > no?
> > > > > >
> > > > > > Yep, I think we can also rely on special bit.
> > > >
> > > > It is more than just relying on the special bit..
> > > >
> > > > VM_PFNMAP/VM_MIXEDMAP should really only be used inside
> > > > vm_normal_page() because thay are, effectively, support for a limited
> > > > emulation of the special bit on arches that don't have them. There are
> > > > a bunch of weird rules that are used to try and make that work
> > > > properly that have to be followed.
> > > >
> > > > On arches with the sepcial bit they should possibly never be checked
> > > > since the special bit does everything you need.
> > > >
> > > > Arguably any place reading those flags out side of vm_normal_page/etc
> > > > is suspect.
> > >
> > > IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
> > > usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
> > > GUP-fast is special (one of the reason for "pte_special()" and friends after
> > > all).
> >
> > The issue is at least GUP currently doesn't work with pfnmaps, while
> > there're potentially users who wants to be able to work on both page +
> > !page use cases. Besides access_process_vm(), KVM also uses similar thing,
> > and maybe more; these all seem to be valid use case of reference the vma
> > flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> > generic issues like "permission check error on pgtable".
>
> Why are those valid compared with calling vm_normal_page() per-page
> instead?
>
> What reason is there to not do something based only on the PFNMAP
> flag?
My comment was answering "why the VM_PFNMAP flag is needed outside
vm_normal_page()": because GUP lacks support for it.
Are you suggesting we should support VM_PFNMAP in GUP, perhaps?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-16 14:21 ` Peter Xu
2024-08-16 17:38 ` Jason Gunthorpe
@ 2024-08-16 17:56 ` David Hildenbrand
2024-08-19 12:19 ` Jason Gunthorpe
1 sibling, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2024-08-16 17:56 UTC (permalink / raw)
To: Peter Xu
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 16.08.24 16:21, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:30:31AM +0200, David Hildenbrand wrote:
>> On 14.08.24 15:05, Jason Gunthorpe wrote:
>>> On Fri, Aug 09, 2024 at 07:25:36PM +0200, David Hildenbrand wrote:
>>>
>>>>>> That is in general not what we want, and we still have some places that
>>>>>> wrongly hard-code that behavior.
>>>>>>
>>>>>> In a MAP_PRIVATE mapping you might have anon pages that we can happily walk.
>>>>>>
>>>>>> vm_normal_page() / vm_normal_page_pmd() [and as commented as a TODO,
>>>>>> vm_normal_page_pud()] should be able to identify PFN maps and reject them,
>>>>>> no?
>>>>>
>>>>> Yep, I think we can also rely on special bit.
>>>
>>> It is more than just relying on the special bit..
>>>
>>> VM_PFNMAP/VM_MIXEDMAP should really only be used inside
>>> vm_normal_page() because thay are, effectively, support for a limited
>>> emulation of the special bit on arches that don't have them. There are
>>> a bunch of weird rules that are used to try and make that work
>>> properly that have to be followed.
>>>
>>> On arches with the sepcial bit they should possibly never be checked
>>> since the special bit does everything you need.
>>>
>>> Arguably any place reading those flags out side of vm_normal_page/etc
>>> is suspect.
>>
>> IIUC, your opinion matches mine: VM_PFNMAP/VM_MIXEDMAP and pte_special()/...
>> usage should be limited to vm_normal_page/vm_normal_page_pmd/ ... of course,
>> GUP-fast is special (one of the reason for "pte_special()" and friends after
>> all).
>
> The issue is at least GUP currently doesn't work with pfnmaps, while
> there're potentially users who wants to be able to work on both page +
> !page use cases. Besides access_process_vm(), KVM also uses similar thing,
> and maybe more; these all seem to be valid use case of reference the vma
> flags for PFNMAP and such, so they can identify "it's pfnmap" or more
> generic issues like "permission check error on pgtable".
What at least VFIO does is first try GUP, and if that fails, try
follow_fault_pfn()->follow_pte(). There is a VM_PFNMAP check in there, yes.
Ideally, follow_pte() would never return refcounted/normal pages; then
the PFNMAP check might only be a performance improvement (maybe).
>
> The whole private mapping thing definitely made it complicated.
Yes, and follow_pte() for now could even return ordinary anon pages. I
spotted that when I was working on that VM_PAT stuff, but I was too
unsure what to do (see below: KVM with MAP_PRIVATE /dev/mem might
just work, no idea if there are use cases?).
Fortunately, vfio calls is_invalid_reserved_pfn() and refuses anything
that has a struct page.
I think KVM does something nasty: if it gets something with a "struct page",
and it's not PageReserved, it would take a reference (if I read
kvm_pfn_to_refcounted_page() correctly), independent of whether it's a "normal"
or "not normal" page -- it essentially ignores the vm_normal_page() information
in the page tables ...
So anon pages in private mappings from follow_pte() might currently work
with KVM ... because of the way KVM uses follow_pte().
I did not play with it, so I'm not sure if I am missing some detail.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-16 17:56 ` David Hildenbrand
@ 2024-08-19 12:19 ` Jason Gunthorpe
2024-08-19 14:19 ` Sean Christopherson
0 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-19 12:19 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote:
> I think KVM does something nasty: if it something with a "struct page", and
> it's not PageReserved, it would take a reference (if I get
> kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal"
> page -- it essentially ignores the vm_normal_page() information in the page
> tables ...
Oh that's nasty. Nothing should be upgrading the output of the follow
functions to refcounted. That's what GUP is for.
And PFNMAP pages, even if they have struct pages for some reason,
should *NEVER* be refcounted because they are in a PFNMAP VMA. That is
completely against the whole point :\ If they could be safely
refcounted then it would be a MIXEDMAP.
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 06/19] mm/pagewalk: Check pfnmap early for folio_walk_start()
2024-08-19 12:19 ` Jason Gunthorpe
@ 2024-08-19 14:19 ` Sean Christopherson
0 siblings, 0 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-19 14:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, Peter Xu, linux-mm, linux-kernel,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Mon, Aug 19, 2024, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2024 at 07:56:30PM +0200, David Hildenbrand wrote:
>
> > I think KVM does something nasty: if it something with a "struct page", and
> > it's not PageReserved, it would take a reference (if I get
> > kvm_pfn_to_refcounted_page()) independent if it's a "normal" or "not normal"
> > page -- it essentially ignores the vm_normal_page() information in the page
> > tables ...
>
> Oh that's nasty. Nothing should be upgrading the output of the follow
> functions to refcounted. That's what GUP is for.
>
> And PFNMAP pages, even if they have struct pages for some reason,
> should *NEVER* be refcounted because they are in a PFNMAP VMA. That is
> completely against the whole point :\ If they could be safely
> refcounted then it would be a MIXEDMAP.
Yeah yeah, I'm working on it.
https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (18 preceding siblings ...)
[not found] ` <20240809160909.1023470-7-peterx@redhat.com>
@ 2024-08-09 18:12 ` David Hildenbrand
2024-08-14 12:37 ` Jason Gunthorpe
20 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2024-08-09 18:12 UTC (permalink / raw)
To: Peter Xu, linux-mm, linux-kernel
Cc: Sean Christopherson, Oscar Salvador, Jason Gunthorpe,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, Thomas Gleixner,
kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 09.08.24 18:08, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1]. Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).
Indeed, skimming most patches, there is very limited core-mm impact. I
expected much more :)
I suspect that's primarily because DAX already paved the way. And DAX likely
supports fault-after-fork, which is why the fork() case wasn't relevant
before.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-09 16:08 [PATCH 00/19] mm: Support huge pfnmaps Peter Xu
` (19 preceding siblings ...)
2024-08-09 18:12 ` [PATCH 00/19] mm: Support huge pfnmaps David Hildenbrand
@ 2024-08-14 12:37 ` Jason Gunthorpe
2024-08-14 14:35 ` Sean Christopherson
2024-08-15 19:20 ` Peter Xu
20 siblings, 2 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 12:37 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> Overview
> ========
>
> This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> plus dax 1g fix [1]. Note that this series should also apply if without
> the dax 1g fix series, but when without it, mprotect() will trigger similar
> errors otherwise on PUD mappings.
>
> This series implements huge pfnmaps support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
FWIW, I've started to hear people talk about needing this in the VFIO
context with VMs.
vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
setup the IOMMU, but KVM is not able to do it so reliably. There is a
notable performance gap with two dimensional paging between 4k and 1G
entries in the KVM table. The platforms are being architected with the
assumption that 1G TLB entries will be used throughout the hypervisor
environment.
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
There is definitely interest here in extending ARM to support the 1G
size too, what is missing?
> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping. This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.
Make sense
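To illustrate the idea (a simplified, hedged sketch, not the exact mm/gup.c
change from this series): gup-fast has no VMA to consult, so the special bit
carried by the huge leaf itself is what tells it to bail out to the slow path.

#include <linux/mm.h>

/*
 * Sketch: a huge leaf with the special bit set is a pfnmap with no vmemmap
 * behind it, so there is no folio to pin and gup-fast must fail the walk.
 */
static bool gup_fast_can_pin_pmd_leaf(pmd_t pmd)
{
	if (pmd_special(pmd))
		return false;	/* fall back to GUP-slow, which will return an error */

	return true;
}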
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>
> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> on 1G will be automatically enabled.
Oh that sounds like a bigger step..
> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied. Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar. But I don't
> think I know all the drivers to know the full picture.
How does alignment work? In most cases I'm aware of, userspace does
not use MAP_FIXED so the expectation would be for the kernel to
automatically select a high alignment. I suppose your cases are
working because qemu uses MAP_FIXED and naturally aligns the BAR
addresses?
> - x86_64 + AMD GPU
> - Needs Alex's modified QEMU to guarantee proper VA alignment to make
> sure all pages to be mapped with PUDs
Oh :(
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 12:37 ` Jason Gunthorpe
@ 2024-08-14 14:35 ` Sean Christopherson
2024-08-14 14:42 ` Paolo Bonzini
2024-08-14 14:43 ` Jason Gunthorpe
2024-08-15 19:20 ` Peter Xu
1 sibling, 2 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-14 14:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Axel Rasmussen,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > Overview
> > ========
> >
> > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > plus dax 1g fix [1]. Note that this series should also apply if without
> > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > errors otherwise on PUD mappings.
> >
> > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > as large as 8GB or even bigger.
>
> FWIW, I've started to hear people talk about needing this in the VFIO
> context with VMs.
>
> vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> setup the IOMMU, but KVM is not able to do it so reliably.
Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
a huge page unless the mapping is huge in the primary MMU. And that's very much
by design, as KVM has no knowledge of what actually resides at a given PFN, and
thus can't determine whether or not it's safe to create a huge page if KVM happens
to realize the VM has access to a contiguous range of memory.
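A hedged sketch of that design (the helpers are hypothetical, this is not the
actual x86 KVM code): the secondary MMU only inherits a mapping level that the
primary MMU has already established, it never invents a larger one.

/* Sketch: cap the stage-2/EPT level at whatever the host page tables use. */
static int secondary_mmu_mapping_level(struct mm_struct *mm, unsigned long hva,
				       int max_slot_level)
{
	/* Hypothetical helper returning the primary-MMU level for this hva. */
	int host_level = host_mapping_level(mm, hva);

	return min(host_level, max_slot_level);
}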
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 14:35 ` Sean Christopherson
@ 2024-08-14 14:42 ` Paolo Bonzini
2024-08-14 14:43 ` Jason Gunthorpe
1 sibling, 0 replies; 90+ messages in thread
From: Paolo Bonzini @ 2024-08-14 14:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jason Gunthorpe, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 4:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU. And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not its safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.
Indeed: the EPT is managed as a secondary MMU. It replays the contents
of the primary MMU, apart from A/D bits (which are independent) and
permissions possibly being more restrictive, and that includes the
page size.
Which in turn explains why the VA has to be aligned for KVM to pick up
the hint: aligning the VA allows the primary MMU to use a hugepage,
which is a prerequisite for using it in EPT.
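To put the alignment requirement in concrete terms, a hedged, illustrative
sketch (not actual mm or KVM code): a PUD leaf is only possible when the
virtual side and the physical side line up on PUD_SIZE over the same span.

static bool can_use_pud_leaf(unsigned long va, unsigned long pfn,
			     unsigned long nr_pages_left)
{
	/* Both VA and PFN must be PUD-aligned, and the range must cover a full PUD. */
	return IS_ALIGNED(va, PUD_SIZE) &&
	       IS_ALIGNED(pfn, PUD_SIZE >> PAGE_SHIFT) &&
	       nr_pages_left >= (PUD_SIZE >> PAGE_SHIFT);
}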
Paolo
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 14:35 ` Sean Christopherson
2024-08-14 14:42 ` Paolo Bonzini
@ 2024-08-14 14:43 ` Jason Gunthorpe
2024-08-14 20:54 ` Sean Christopherson
1 sibling, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 14:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Axel Rasmussen,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > Overview
> > > ========
> > >
> > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > plus dax 1g fix [1]. Note that this series should also apply if without
> > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > errors otherwise on PUD mappings.
> > >
> > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > as large as 8GB or even bigger.
> >
> > FWIW, I've started to hear people talk about needing this in the VFIO
> > context with VMs.
> >
> > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > setup the IOMMU, but KVM is not able to do it so reliably.
>
> Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> a huge page unless the mapping is huge in the primary MMU. And that's very much
> by design, as KVM has no knowledge of what actually resides at a given PFN, and
> thus can't determine whether or not its safe to create a huge page if KVM happens
> to realize the VM has access to a contiguous range of memory.
Oh? Someone told me recently x86 kvm had code to reassemble contiguous
ranges?
I don't quite understand your safety argument; if the VMA has 1G of
contiguous physical memory described with 4K it is definitely safe for
KVM to reassemble that same memory and represent it as 1G.
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 14:43 ` Jason Gunthorpe
@ 2024-08-14 20:54 ` Sean Christopherson
2024-08-14 22:00 ` Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-14 20:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Axel Rasmussen,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Oliver Upton, Marc Zyngier
+Marc and Oliver
On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > Overview
> > > > ========
> > > >
> > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > plus dax 1g fix [1]. Note that this series should also apply if without
> > > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > > errors otherwise on PUD mappings.
> > > >
> > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > as large as 8GB or even bigger.
> > >
> > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > context with VMs.
> > >
> > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > setup the IOMMU, but KVM is not able to do it so reliably.
> >
> > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > a huge page unless the mapping is huge in the primary MMU. And that's very much
> > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > thus can't determine whether or not its safe to create a huge page if KVM happens
> > to realize the VM has access to a contiguous range of memory.
>
> Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> ranges?
Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
a win in very select use cases, and is overall a non-trivial loss.
> I don't quite understand your safety argument, if the VMA has 1G of
> contiguous physical memory described with 4K it is definitely safe for
> KVM to reassemble that same memory and represent it as 1G.
That would require taking mmap_lock to get the VMA, which would be a net negative,
especially for workloads that are latency sensitive. E.g. if userspace is doing
mprotect(), madvise(), etc. on VMA that is NOT mapped into the guest, taking
mmap_lock in the guest page fault path means vCPUs block waiting for the unrelated
host operation to complete. And vice versa, subsequent host operations can be
blocked waiting on vCPUs.
Which reminds me...
Marc/Oliver,
TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
on arm64, specifically the mprotect() testcase[1], as performance is significantly
worse compared to x86, and there might be bugs lurking the mmu_notifier flows.
When running mmu_stress_test the mprotect() phase that makes guest memory read-only
takes more than three times as long on arm64 versus x86. The time to initially
populate memory (run1) is also notably higher on arm64, as is the time to
mprotect() back to RW protections.
The test doesn't go super far out of its way to control the environment, but it
should be a fairly reasonable apples-to-apples comparison.
Ouch. I take that back, it's not apples-to-apples, because the test does more
work for x86. On x86, during mprotect(PROT_READ), the userspace side skips the
faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
Other architectures stop running the vCPU after the first write -EFAULT and wait
for the mprotect() to complete. If I comment out the x86-only logic and have
vCPUs stop on the first -EFAULT, the mprotect() goes way down.
/me fiddles with arm64
And if I have arm64 vCPUs keep faulting, the time goes up, as expected.
With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:
x86:
run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s
arm64:
run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786
and skipping on -EFAULT and thus writing throughout mprotect():
x86:
run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s
arm64:
run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s
I originally suspected that the main source of difference is that user_mem_abort()
takes mmap_lock for read. But I doubt that's the case now that I realize arm64
vCPUs stop after the first -EFAULT, i.e. won't contend mmap_lock.
And it shouldn't be the lack of support for mmu_invalidate_retry_gfn() (on arm64
vs. x86), because the mprotect() is relevant to guest memory (though range-based
retry is something that KVM ARM likely should support). And again, arm64 is still
much slower when vCPUs stop on the first -EFAULT.
However, before I realized mmap_lock likely wasn't the main problem, I tried to
prove that taking mmap_lock is problematic, and that didn't end well. When I
hacked user_mem_abort() to not take mmap_lock, the host reboots when the mprotect()
read-only phase kicks in. AFAICT, there is no crash, e.g. no kdump and nothing
printed to the console, the host suddenly just starts executing firmware code.
To try and rule out a hidden dependency I'm missing, I tried trimming out pretty
much everything that runs under mmap_lock (and only running the selftest, which
doesn't do anything "odd", e.g. doesn't passthrough device memory). I also forced
small pages, e.g. in case transparent_hugepage_adjust() is somehow reliant on
mmap_lock being taken, to no avail. Just in case someone can spot an obvious
flaw in my hack, the final diff I tried is below.
Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
fault path for valid VMAs does introduce a performance regression, but only ~30%,
not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
is the main problem, though it's still good justification for avoiding mmap_lock in
the page fault path.
[1] https://lore.kernel.org/all/20240809194335.1726916-9-seanjc@google.com
---
arch/arm64/kvm/mmu.c | 167 +++----------------------------------------
1 file changed, 9 insertions(+), 158 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..df551c19f626 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1424,14 +1424,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
bool fault_is_perm)
{
int ret = 0;
- bool write_fault, writable, force_pte = false;
- bool exec_fault, mte_allowed;
- bool device = false, vfio_allow_any_uc = false;
+ bool write_fault, writable;
+ bool exec_fault;
unsigned long mmu_seq;
phys_addr_t ipa = fault_ipa;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
- struct vm_area_struct *vma;
short vma_shift;
gfn_t gfn;
kvm_pfn_t pfn;
@@ -1451,6 +1449,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
+ if (WARN_ON_ONCE(nested) || WARN_ON_ONCE(kvm_has_mte(kvm)))
+ return -EIO;
+
/*
* Permission faults just need to update the existing leaf entry,
* and so normally don't require allocations from the memcache. The
@@ -1464,92 +1465,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return ret;
}
- /*
- * Let's check if we will get back a huge page backed by hugetlbfs, or
- * get block mapping for device MMIO region.
- */
- mmap_read_lock(current->mm);
- vma = vma_lookup(current->mm, hva);
- if (unlikely(!vma)) {
- kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
- mmap_read_unlock(current->mm);
- return -EFAULT;
- }
-
- /*
- * logging_active is guaranteed to never be true for VM_PFNMAP
- * memslots.
- */
- if (logging_active) {
- force_pte = true;
- vma_shift = PAGE_SHIFT;
- } else {
- vma_shift = get_vma_page_shift(vma, hva);
- }
-
- switch (vma_shift) {
-#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
- break;
- fallthrough;
-#endif
- case CONT_PMD_SHIFT:
- vma_shift = PMD_SHIFT;
- fallthrough;
- case PMD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
- break;
- fallthrough;
- case CONT_PTE_SHIFT:
- vma_shift = PAGE_SHIFT;
- force_pte = true;
- fallthrough;
- case PAGE_SHIFT:
- break;
- default:
- WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
- }
-
+ vma_shift = PAGE_SHIFT;
vma_pagesize = 1UL << vma_shift;
-
- if (nested) {
- unsigned long max_map_size;
-
- max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
-
- ipa = kvm_s2_trans_output(nested);
-
- /*
- * If we're about to create a shadow stage 2 entry, then we
- * can only create a block mapping if the guest stage 2 page
- * table uses at least as big a mapping.
- */
- max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
-
- /*
- * Be careful that if the mapping size falls between
- * two host sizes, take the smallest of the two.
- */
- if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
- max_map_size = PMD_SIZE;
- else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
- max_map_size = PAGE_SIZE;
-
- force_pte = (max_map_size == PAGE_SIZE);
- vma_pagesize = min(vma_pagesize, (long)max_map_size);
- }
-
- if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
- fault_ipa &= ~(vma_pagesize - 1);
-
gfn = ipa >> PAGE_SHIFT;
- mte_allowed = kvm_vma_mte_allowed(vma);
-
- vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
-
- /* Don't use the VMA after the unlock -- it may have vanished */
- vma = NULL;
/*
* Read mmu_invalidate_seq so that KVM can detect if the results of
@@ -1560,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* with the smp_wmb() in kvm_mmu_invalidate_end().
*/
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
- mmap_read_unlock(current->mm);
+ smp_rmb();
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
write_fault, &writable, NULL);
@@ -1571,19 +1489,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (is_error_noslot_pfn(pfn))
return -EFAULT;
- if (kvm_is_device_pfn(pfn)) {
- /*
- * If the page was identified as device early by looking at
- * the VMA flags, vma_pagesize is already representing the
- * largest quantity we can map. If instead it was mapped
- * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
- * and must not be upgraded.
- *
- * In both cases, we don't let transparent_hugepage_adjust()
- * change things at the last minute.
- */
- device = true;
- } else if (logging_active && !write_fault) {
+ if (logging_active && !write_fault) {
/*
* Only actually map the page as writable if this was a write
* fault.
@@ -1591,28 +1497,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
writable = false;
}
- if (exec_fault && device)
- return -ENOEXEC;
-
- /*
- * Potentially reduce shadow S2 permissions to match the guest's own
- * S2. For exec faults, we'd only reach this point if the guest
- * actually allowed it (see kvm_s2_handle_perm_fault).
- *
- * Also encode the level of the original translation in the SW bits
- * of the leaf entry as a proxy for the span of that translation.
- * This will be retrieved on TLB invalidation from the guest and
- * used to limit the invalidation scope if a TTL hint or a range
- * isn't provided.
- */
- if (nested) {
- writable &= kvm_s2_trans_writable(nested);
- if (!kvm_s2_trans_readable(nested))
- prot &= ~KVM_PGTABLE_PROT_R;
-
- prot |= kvm_encode_nested_level(nested);
- }
-
read_lock(&kvm->mmu_lock);
pgt = vcpu->arch.hw_mmu->pgt;
if (mmu_invalidate_retry(kvm, mmu_seq)) {
@@ -1620,46 +1504,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
goto out_unlock;
}
- /*
- * If we are not forced to use page mapping, check if we are
- * backed by a THP and thus use block mapping if possible.
- */
- if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
- if (fault_is_perm && fault_granule > PAGE_SIZE)
- vma_pagesize = fault_granule;
- else
- vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
- hva, &pfn,
- &fault_ipa);
-
- if (vma_pagesize < 0) {
- ret = vma_pagesize;
- goto out_unlock;
- }
- }
-
- if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
- /* Check the VMM hasn't introduced a new disallowed VMA */
- if (mte_allowed) {
- sanitise_mte_tags(kvm, pfn, vma_pagesize);
- } else {
- ret = -EFAULT;
- goto out_unlock;
- }
- }
-
if (writable)
prot |= KVM_PGTABLE_PROT_W;
if (exec_fault)
prot |= KVM_PGTABLE_PROT_X;
- if (device) {
- if (vfio_allow_any_uc)
- prot |= KVM_PGTABLE_PROT_NORMAL_NC;
- else
- prot |= KVM_PGTABLE_PROT_DEVICE;
- } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
+ if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
(!nested || kvm_s2_trans_executable(nested))) {
prot |= KVM_PGTABLE_PROT_X;
}
base-commit: 15e1c3d65975524c5c792fcd59f7d89f00402261
--
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 20:54 ` Sean Christopherson
@ 2024-08-14 22:00 ` Sean Christopherson
2024-08-14 22:10 ` Jason Gunthorpe
2024-08-14 23:27 ` Oliver Upton
2 siblings, 0 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-14 22:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Axel Rasmussen,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Oliver Upton, Marc Zyngier
On Wed, Aug 14, 2024, Sean Christopherson wrote:
> TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> on arm64, specifically the mprotect() testcase[1], as performance is significantly
> worse compared to x86, and there might be bugs lurking the mmu_notifier flows.
>
> When running mmu_stress_test the mprotect() phase that makes guest memory read-only
> takes more than three times as long on arm64 versus x86. The time to initially
> popuplate memory (run1) is also notably higher on arm64, as is the time to
> mprotect() back to RW protections.
>
> The test doesn't go super far out of its way to control the environment, but it
> should be a fairly reasonable apples-to-apples comparison.
>
> Ouch. I take that back, it's not apples-to-apples, because the test does more
> work for x86. On x86, during mprotect(PROT_READ), the userspace side skips the
> faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
> Other architectures stop running the vCPU after the first write -EFAULT and wait
> for the mproptect() to complete. If I comment out the x86-only logic and have
> vCPUs stop on the first -EFAULT, the mprotect() goes way down.
>
> /me fiddles with arm64
>
> And if I have arm64 vCPUs keep faulting, the time goes up, as exptected.
>
> With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
> and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:
>
> x86:
> run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s
>
> arm64:
> run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786
>
> and skipping on -EFAULT and thus writing throughout mprotect():
>
> x86:
> run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s
>
> arm64:
> run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s
Oliver pointed out off-list that the hardware I was using doesn't have forced
write-back, and so the overhead on arm64 is likely due to cache maintenance.
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 20:54 ` Sean Christopherson
2024-08-14 22:00 ` Sean Christopherson
@ 2024-08-14 22:10 ` Jason Gunthorpe
2024-08-14 23:36 ` Oliver Upton
2024-08-14 23:27 ` Oliver Upton
2 siblings, 1 reply; 90+ messages in thread
From: Jason Gunthorpe @ 2024-08-14 22:10 UTC (permalink / raw)
To: Sean Christopherson
Cc: Peter Xu, linux-mm, linux-kernel, Oscar Salvador, Axel Rasmussen,
linux-arm-kernel, x86, Will Deacon, Gavin Shan, Paolo Bonzini,
Zi Yan, Andrew Morton, Catalin Marinas, Ingo Molnar,
Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Oliver Upton, Marc Zyngier
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> +Marc and Oliver
>
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > > Overview
> > > > > ========
> > > > >
> > > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > > plus dax 1g fix [1]. Note that this series should also apply if without
> > > > > the dax 1g fix series, but when without it, mprotect() will trigger similar
> > > > > errors otherwise on PUD mappings.
> > > > >
> > > > > This series implements huge pfnmaps support for mm in general. Huge pfnmap
> > > > > allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> > > > > what we do with dax / thp / hugetlb so far to benefit from TLB hits. Now
> > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> > > > > as large as 8GB or even bigger.
> > > >
> > > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > > context with VMs.
> > > >
> > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > > setup the IOMMU, but KVM is not able to do it so reliably.
> > >
> > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > > a huge page unless the mapping is huge in the primary MMU. And that's very much
> > > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > > thus can't determine whether or not its safe to create a huge page if KVM happens
> > > to realize the VM has access to a contiguous range of memory.
> >
> > Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> > ranges?
>
> Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
> a win in very select use cases, and is overall a non-trivial loss.
Ah that ARM behavior was probably what was being mentioned then! So
take my original remark as applying to this :)
> > I don't quite understand your safety argument, if the VMA has 1G of
> > contiguous physical memory described with 4K it is definitely safe for
> > KVM to reassemble that same memory and represent it as 1G.
>
> That would require taking mmap_lock to get the VMA, which would be a net negative,
> especially for workloads that are latency sensitive.
You can aggregate if the read and aggregating logic are protected by
mmu notifiers, I think. An invalidation would still have enough
information to clear the aggregate shadow entry. If you get a sequence
number collision then you'd throw away the aggregation.
But yes, I also think it would be slow to have aggregation logic in
KVM. Doing it in the main mmu is much better.
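For reference, a hedged sketch of that retry pattern, modeled on the
mmu_invalidate_seq usage visible in the diff earlier in this thread; the
aggregation helpers here are hypothetical.

#include <linux/kvm_host.h>

static int try_install_aggregate(struct kvm *kvm, gfn_t gfn)
{
	unsigned long mmu_seq;
	int ret = 0;

	/* Snapshot before walking the primary MMU (pairs with invalidate_end). */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Hypothetical: walk the host page tables and reassemble a contiguous range. */
	aggregate_host_range(kvm, gfn);

	read_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		/* An invalidation raced with the walk: throw the aggregate away. */
		ret = -EAGAIN;
		goto out_unlock;
	}

	/* Hypothetical: install the reassembled block mapping in the secondary MMU. */
	install_block_mapping(kvm, gfn);
out_unlock:
	read_unlock(&kvm->mmu_lock);
	return ret;
}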
Jason
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 22:10 ` Jason Gunthorpe
@ 2024-08-14 23:36 ` Oliver Upton
0 siblings, 0 replies; 90+ messages in thread
From: Oliver Upton @ 2024-08-14 23:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Sean Christopherson, Peter Xu, linux-mm, linux-kernel,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
David Hildenbrand, Thomas Gleixner, kvm, Dave Hansen,
Alex Williamson, Yan Zhao, Marc Zyngier
On Wed, Aug 14, 2024 at 07:10:31PM -0300, Jason Gunthorpe wrote:
[...]
> > Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
> > a win in very select use cases, and is overall a non-trivial loss.
>
> Ah that ARM behavior was probably what was being mentioned then! So
> take my original remark as applying to this :)
>
> > > I don't quite understand your safety argument, if the VMA has 1G of
> > > contiguous physical memory described with 4K it is definitely safe for
> > > KVM to reassemble that same memory and represent it as 1G.
> >
> > That would require taking mmap_lock to get the VMA, which would be a net negative,
> > especially for workloads that are latency sensitive.
>
> You can aggregate if the read and aggregating logic are protected by
> mmu notifiers, I think. A invalidation would still have enough
> information to clear the aggregate shadow entry. If you get a sequence
> number collision then you'd throw away the aggregation.
>
> But yes, I also think it would be slow to have aggregation logic in
> KVM. Doing in the main mmu is much better.
+1.
For KVM/arm64 I'm quite hesitant to change the behavior to PTE mappings
in this situation (i.e. dump get_vma_page_shift()), as I'm quite certain
that'll have a performance regression on someone's workload. But once we
can derive huge PFNMAP from the primary MMU then we should just normalize
on that.
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 20:54 ` Sean Christopherson
2024-08-14 22:00 ` Sean Christopherson
2024-08-14 22:10 ` Jason Gunthorpe
@ 2024-08-14 23:27 ` Oliver Upton
2024-08-14 23:38 ` Oliver Upton
2 siblings, 1 reply; 90+ messages in thread
From: Oliver Upton @ 2024-08-14 23:27 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jason Gunthorpe, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Marc Zyngier
On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> on arm64, specifically the mprotect() testcase[1], as performance is significantly
> worse compared to x86,
Sharing what we discussed offline:
Sean was using a machine w/o FEAT_FWB for this test, so the increased
runtime on arm64 is likely explained by the CMOs we're doing when
creating or invalidating a stage-2 PTE.
Using a machine w/ FEAT_FWB would be better for making these sorts of
cross-architecture comparisons. Beyond CMOs, we do have some
> and there might be bugs lurking the mmu_notifier flows.
Impossible! :)
> Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> fault path for valid VMAs does introduce a performance regression, but only ~30%,
> not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
> is the main problem, though it's still good justification for avoid mmap_lock in
> the page fault path.
I'm curious how much of that 30% in a microbenchmark would translate to
real world performance, since it isn't *that* egregious. We also have
other uses for getting at the VMA beyond mapping granularity (MTE and
the VFIO Normal-NC hint) that'd require some attention too.
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 23:27 ` Oliver Upton
@ 2024-08-14 23:38 ` Oliver Upton
2024-08-15 0:23 ` Sean Christopherson
0 siblings, 1 reply; 90+ messages in thread
From: Oliver Upton @ 2024-08-14 23:38 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jason Gunthorpe, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Marc Zyngier
On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > worse compared to x86,
>
> Sharing what we discussed offline:
>
> Sean was using a machine w/o FEAT_FWB for this test, so the increased
> runtime on arm64 is likely explained by the CMOs we're doing when
> creating or invalidating a stage-2 PTE.
>
> Using a machine w/ FEAT_FWB would be better for making these sort of
> cross-architecture comparisons. Beyond CMOs, we do have some
... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table
updates are visible to the system. So there could still be some
arch-specific quirks that'll show up in the test.
> > and there might be bugs lurking the mmu_notifier flows.
>
> Impossible! :)
>
> > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
> > is the main problem, though it's still good justification for avoid mmap_lock in
> > the page fault path.
>
> I'm curious how much of that 30% in a microbenchmark would translate to
> real world performance, since it isn't *that* egregious. We also have
> other uses for getting at the VMA beyond mapping granularity (MTE and
> the VFIO Normal-NC hint) that'd require some attention too.
>
> --
> Thanks,
> Oliver
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 23:38 ` Oliver Upton
@ 2024-08-15 0:23 ` Sean Christopherson
0 siblings, 0 replies; 90+ messages in thread
From: Sean Christopherson @ 2024-08-15 0:23 UTC (permalink / raw)
To: Oliver Upton
Cc: Jason Gunthorpe, Peter Xu, linux-mm, linux-kernel, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao,
Marc Zyngier
On Wed, Aug 14, 2024, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > > worse compared to x86,
> >
> > Sharing what we discussed offline:
> >
> > Sean was using a machine w/o FEAT_FWB for this test, so the increased
> > runtime on arm64 is likely explained by the CMOs we're doing when
> > creating or invalidating a stage-2 PTE.
> >
> > Using a machine w/ FEAT_FWB would be better for making these sort of
> > cross-architecture comparisons. Beyond CMOs, we do have some
>
> ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table
> updates are visible to the system. So there could still be some
> arch-specific quirks that'll show up in the test.
Nope, 'twas FWB. On a system with FWB, ARM nicely outperforms x86 on mprotect()
when vCPUs stop on the first -EFAULT. I suspect because ARM can do broadcast TLB
invalidations and doesn't need to interrupt and wait for every vCPU to respond.
run1 = 10.723194154s, reset = 0.000014732s, run2 = 0.013790876s, ro = 2.151261587s, rw = 10.624272116s
However, having vCPUs continue faulting while mprotect() is running turns the
tables, I suspect due to mmap_lock
run1 = 10.768003815s, reset = 0.000012051s, run2 = 0.013781921s, ro = 23.277624455s, rw = 10.649136889s
The x86 numbers since they're out of sight now:
-EFAULT once
run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s
-EFAULT forever
run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s
> > > and there might be bugs lurking the mmu_notifier flows.
> >
> > Impossible! :)
> >
> > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
> > > is the main problem, though it's still good justification for avoid mmap_lock in
> > > the page fault path.
> >
> > I'm curious how much of that 30% in a microbenchmark would translate to
> > real world performance, since it isn't *that* egregious.
vCPU jitter is the big problem, especially if userspace is doing something odd,
and/or if the kernel is preemptible (which also triggers yield-on-contention logic
for spinlocks, ew). E.g. the range-based retry to avoid spinning and waiting on
an unrelated MM operation was added by the ChromeOS folks[1] to resolve issues
where an MM operation got preempted and so blocked vCPU faults.
But even for cloud setups with a non-preemptible kernel, contending with unrelated
userspace VMM modification can be problematic, e.g. it turns out even the
gfn_to_pfn_cache logic needs range-based retry[2] (though that's a rather
pathological case where userspace is spamming madvise() to the point where vCPUs
can't even make forward progress).
> > We also have other uses for getting at the VMA beyond mapping granularity
> > (MTE and the VFIO Normal-NC hint) that'd require some attention too.
Yeah, though it seems like it'd be easy enough to take mmap_lock if and only if
it's necessary, e.g. similar to how common KVM takes it only if it encounters
VM_PFNMAP'd memory.
E.g. take mmap_lock if and only if MTE is active (I assume that's uncommon?), or
if the fault is to device memory.
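A hedged sketch of that shape, reusing names that appear in the diff earlier
in the thread (this is not a real patch): only pay for the VMA lookup when the
fault actually needs VMA-derived state.

static void fault_read_vma_state(struct kvm *kvm, unsigned long hva,
				 kvm_pfn_t pfn, bool *mte_allowed,
				 bool *allow_any_uc)
{
	struct vm_area_struct *vma;

	/* Fast path: most faults need nothing from the VMA. */
	if (!kvm_has_mte(kvm) && !kvm_is_device_pfn(pfn))
		return;

	mmap_read_lock(current->mm);
	vma = vma_lookup(current->mm, hva);
	if (vma) {
		*mte_allowed = kvm_vma_mte_allowed(vma);
		*allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
	}
	mmap_read_unlock(current->mm);
}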
[1] https://lore.kernel.org/all/20210222024522.1751719-1-stevensd@google.com
[2] https://lore.kernel.org/all/f862cefff2ed3f4211b69d785670f41667703cf3.camel@infradead.org
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-14 12:37 ` Jason Gunthorpe
2024-08-14 14:35 ` Sean Christopherson
@ 2024-08-15 19:20 ` Peter Xu
2024-08-16 3:05 ` Kefeng Wang
1 sibling, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-15 19:20 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
> > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>
> There is definitely interest here in extending ARM to support the 1G
> size too, what is missing?
Currently PUD pfnmap relies on THP_PUD config option:
config ARCH_SUPPORTS_PUD_PFNMAP
def_bool y
depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
Ideally, pfnmap is simple enough compared to real THPs that it shouldn't
need to depend on THP at all, but we'll need things like the series below
to land first:
https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
I sent that first a while ago, but I didn't collect enough inputs, and I
decided to unblock this series from that, so x86_64 shouldn't be affected,
and arm64 will at least start to have 2M.
>
> > The other trick is how to allow gup-fast working for such huge mappings
> > even if there's no direct sign of knowing whether it's a normal page or
> > MMIO mapping. This series chose to keep the pte_special solution, so that
> > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> > gup-fast will be able to identify them and fail properly.
>
> Make sense
>
> > More architectures / More page sizes
> > ------------------------------------
> >
> > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> >
> > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> > on 1G will be automatically enabled.
>
> Oh that sounds like a bigger step..
Just to mention, no real THP 1G needed here for pfnmaps. The real gap here
is only about the pud helpers that only exist so far with CONFIG_THP_PUD
in huge_memory.c.
>
> > VFIO is so far the only consumer for the huge pfnmaps after this series
> > applied. Besides above remap_pfn_range() generic optimization, device
> > driver can also try to optimize its mmap() on a better VA alignment for
> > either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
> > as the driver doesn't normally decide the VA to map a bar. But I don't
> > think I know all the drivers to know the full picture.
>
> How does alignment work? In most caes I'm aware of the userspace does
> not use MAP_FIXED so the expectation would be for the kernel to
> automatically select a high alignment. I suppose your cases are
> working because qemu uses MAP_FIXED and naturally aligns the BAR
> addresses?
>
> > - x86_64 + AMD GPU
> > - Needs Alex's modified QEMU to guarantee proper VA alignment to make
> > sure all pages to be mapped with PUDs
>
> Oh :(
So I suppose this answers above. :) Yes, alignment needed.
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-15 19:20 ` Peter Xu
@ 2024-08-16 3:05 ` Kefeng Wang
2024-08-16 14:33 ` Peter Xu
0 siblings, 1 reply; 90+ messages in thread
From: Kefeng Wang @ 2024-08-16 3:05 UTC (permalink / raw)
To: Peter Xu, Jason Gunthorpe
Cc: linux-mm, linux-kernel, Sean Christopherson, Oscar Salvador,
Axel Rasmussen, linux-arm-kernel, x86, Will Deacon, Gavin Shan,
Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, kvm, Dave Hansen, Alex Williamson, Yan Zhao
On 2024/8/16 3:20, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>
>> There is definitely interest here in extending ARM to support the 1G
>> size too, what is missing?
>
> Currently PUD pfnmap relies on THP_PUD config option:
>
> config ARCH_SUPPORTS_PUD_PFNMAP
> def_bool y
> depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>
> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
>
> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
> require to depend on THP at all, but we'll need things like below to land
> first:
>
> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>
> I sent that first a while ago, but I didn't collect enough inputs, and I
> decided to unblock this series from that, so x86_64 shouldn't be affected,
> and arm64 will at least start to have 2M.
>
>>
>>> The other trick is how to allow gup-fast working for such huge mappings
>>> even if there's no direct sign of knowing whether it's a normal page or
>>> MMIO mapping. This series chose to keep the pte_special solution, so that
>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
>>> gup-fast will be able to identify them and fail properly.
>>
>> Make sense
>>
>>> More architectures / More page sizes
>>> ------------------------------------
>>>
>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>
>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
>>> on 1G will be automatically enabled.
Here is a draft patch to enable THP_PUD on arm64; it has only passed
DEBUG_VM_PGTABLE so far, but with it we may test pud pfnmaps on arm64.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..ff0d27c72020 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -184,6 +184,7 @@ config ARM64
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
select HAVE_ARCH_VMAP_STACK
select HAVE_ARM_SMCCC
select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..e013fe458476 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#define pud_valid(pud) pte_valid(pud_pte(pud))
#define pud_user(pud) pte_user(pud_pte(pud))
#define pud_user_exec(pud) pte_user_exec(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_devmap(pud) pte_devmap(pud_pte(pud))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud)))
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_trans_huge(pud_t pud)
+{
+ return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+ return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
+}
+#endif
static inline bool pgtable_l4_enabled(void);
@@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
pmd_pte(entry), dirty);
}
+static inline int pudp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pud_t *pudp,
+ pud_t entry, int dirty)
+{
+ return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
+ pud_pte(entry), dirty);
+}
+
+#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_devmap(pud_t pud)
{
return 0;
}
+#endif
static inline int pgd_devmap(pgd_t pgd)
{
@@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
{
return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
}
+
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pud_t *pudp)
+{
+ return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
@@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define update_mmu_cache(vma, addr, ptep) \
update_mmu_cache_range(NULL, vma, addr, ptep, 1)
#define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
+#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
#ifdef CONFIG_ARM64_PA_BITS_52
#define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
--
2.27.0
>>
>> Oh that sounds like a bigger step..
>
> Just to mention, no real THP 1G needed here for pfnmaps. The real gap here
> is only about the pud helpers that only exists so far with CONFIG_THP_PUD
> in huge_memory.c.
>
>>
>>> VFIO is so far the only consumer for the huge pfnmaps after this series
>>> applied. Besides above remap_pfn_range() generic optimization, device
>>> driver can also try to optimize its mmap() on a better VA alignment for
>>> either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
>>> as the driver doesn't normally decide the VA to map a bar. But I don't
>>> think I know all the drivers to know the full picture.
>>
>> How does alignment work? In most caes I'm aware of the userspace does
>> not use MAP_FIXED so the expectation would be for the kernel to
>> automatically select a high alignment. I suppose your cases are
>> working because qemu uses MAP_FIXED and naturally aligns the BAR
>> addresses?
>>
>>> - x86_64 + AMD GPU
>>> - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>>> sure all pages to be mapped with PUDs
>>
>> Oh :(
>
> So I suppose this answers above. :) Yes, alignment needed.
>
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-16 3:05 ` Kefeng Wang
@ 2024-08-16 14:33 ` Peter Xu
2024-08-19 13:14 ` Kefeng Wang
0 siblings, 1 reply; 90+ messages in thread
From: Peter Xu @ 2024-08-16 14:33 UTC (permalink / raw)
To: Kefeng Wang
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
David Hildenbrand, Thomas Gleixner, kvm, Dave Hansen,
Alex Williamson, Yan Zhao
On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote:
>
>
> On 2024/8/16 3:20, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
> > > > Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
> > >
> > > There is definitely interest here in extending ARM to support the 1G
> > > size too, what is missing?
> >
> > Currently PUD pfnmap relies on THP_PUD config option:
> >
> > config ARCH_SUPPORTS_PUD_PFNMAP
> > def_bool y
> > depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
> >
> > Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
> > require to depend on THP at all, but we'll need things like below to land
> > first:
> >
> > https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
> >
> > I sent that first a while ago, but I didn't collect enough inputs, and I
> > decided to unblock this series from that, so x86_64 shouldn't be affected,
> > and arm64 will at least start to have 2M.
> >
> > >
> > > > The other trick is how to allow gup-fast working for such huge mappings
> > > > even if there's no direct sign of knowing whether it's a normal page or
> > > > MMIO mapping. This series chose to keep the pte_special solution, so that
> > > > it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> > > > gup-fast will be able to identify them and fail properly.
> > >
> > > Make sense
> > >
> > > > More architectures / More page sizes
> > > > ------------------------------------
> > > >
> > > > Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
> > > >
> > > > For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
> > > > on 1G will be automatically enabled.
>
> A draft patch to enable THP_PUD on arm64, only passed with DEBUG_VM_PGTABLE,
> we may test pud pfnmaps on arm64.
Thanks, Kefeng. It'll be great if it already works, as simple as this.
It might be interesting to know whether it already works if you have some
GPU with a few-GB BAR around on the systems.
Logically as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected
below, 1g pfnmap will be automatically enabled when you rebuild the kernel.
You can double check that by looking for this:
CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y
And you can try to observe the mappings by enabling dynamic debug for
vfio_pci_mmap_huge_fault(), then mapping the BAR with vfio-pci and reading
something from it.
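For the last step, a hedged userspace sketch (it assumes the VFIO device fd
and the BAR's region offset/size were already obtained via the usual
VFIO_DEVICE_GET_REGION_INFO dance): mmap() the BAR and touch it, so the fault
path runs and the dynamic debug output shows which mapping size it used.

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>

static int touch_bar(int device_fd, off_t region_offset, size_t region_size)
{
	volatile uint32_t *bar = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
				      MAP_SHARED, device_fd, region_offset);

	if (bar == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	/* The first access faults the mapping in (PUD/PMD-sized if everything is aligned). */
	printf("first dword: 0x%x\n", bar[0]);

	munmap((void *)bar, region_size);
	return 0;
}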
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..ff0d27c72020 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -184,6 +184,7 @@ config ARM64
> select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> select HAVE_ARCH_TRACEHOOK
> select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> + select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
> select HAVE_ARCH_VMAP_STACK
> select HAVE_ARM_SMCCC
> select HAVE_ASM_MODVERSIONS
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7a4f5604be3f..e013fe458476 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> #define pud_valid(pud) pte_valid(pud_pte(pud))
> #define pud_user(pud) pte_user(pud_pte(pud))
> #define pud_user_exec(pud) pte_user_exec(pud_pte(pud))
> +#define pud_dirty(pud) pte_dirty(pud_pte(pud))
> +#define pud_devmap(pud) pte_devmap(pud_pte(pud))
> +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
> +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
> +#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud)))
> +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud)))
> +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud)))
> +
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline int pud_trans_huge(pud_t pud)
> +{
> + return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
> +}
> +
> +static inline pud_t pud_mkdevmap(pud_t pud)
> +{
> + return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
> +}
> +#endif
>
> static inline bool pgtable_l4_enabled(void);
>
> @@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
> pmd_pte(entry), dirty);
> }
>
> +static inline int pudp_set_access_flags(struct vm_area_struct *vma,
> + unsigned long address, pud_t *pudp,
> + pud_t entry, int dirty)
> +{
> + return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
> + pud_pte(entry), dirty);
> +}
> +
> +#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> static inline int pud_devmap(pud_t pud)
> {
> return 0;
> }
> +#endif
>
> static inline int pgd_devmap(pgd_t pgd)
> {
> @@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> {
> return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
> }
> +
> +static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long address,
> + pud_t *pudp)
> +{
> + return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
> +}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
> @@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> #define update_mmu_cache(vma, addr, ptep) \
> update_mmu_cache_range(NULL, vma, addr, ptep, 1)
> #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
> +#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
>
> #ifdef CONFIG_ARM64_PA_BITS_52
> #define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
> --
> 2.27.0
--
Peter Xu
^ permalink raw reply [flat|nested] 90+ messages in thread

* Re: [PATCH 00/19] mm: Support huge pfnmaps
2024-08-16 14:33 ` Peter Xu
@ 2024-08-19 13:14 ` Kefeng Wang
0 siblings, 0 replies; 90+ messages in thread
From: Kefeng Wang @ 2024-08-19 13:14 UTC (permalink / raw)
To: Peter Xu
Cc: Jason Gunthorpe, linux-mm, linux-kernel, Sean Christopherson,
Oscar Salvador, Axel Rasmussen, linux-arm-kernel, x86,
Will Deacon, Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton,
Catalin Marinas, Ingo Molnar, Alistair Popple, Borislav Petkov,
David Hildenbrand, Thomas Gleixner, kvm, Dave Hansen,
Alex Williamson, Yan Zhao
On 2024/8/16 22:33, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 11:05:33AM +0800, Kefeng Wang wrote:
>>
>>
>> On 2024/8/16 3:20, Peter Xu wrote:
>>> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>>>
>>>> There is definitely interest here in extending ARM to support the 1G
>>>> size too; what is missing?
>>>
>>> Currently PUD pfnmap relies on THP_PUD config option:
>>>
>>> config ARCH_SUPPORTS_PUD_PFNMAP
>>> def_bool y
>>> depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
>>>
>>> Ideally, pfnmap is so much simpler than real THPs that it shouldn't need to
>>> depend on THP at all, but we'll need things like the below to land first:
>>>
>>> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>>>
>>> I sent that a while ago, but I didn't collect enough input on it, so I
>>> decided to unblock this series from it; x86_64 shouldn't be affected, and
>>> arm64 will at least start to have 2M.
>>>
>>>>
>>>>> The other trick is how to allow gup-fast to work for such huge mappings
>>>>> even if there's no direct way of knowing whether it's a normal page or an
>>>>> MMIO mapping. This series chose to keep the pte_special solution, reusing
>>>>> a similar idea of setting a special bit on pfnmap PMDs/PUDs so that
>>>>> gup-fast will be able to identify them and fail properly.
>>>>
>>>> Makes sense
>>>>
>>>>> More architectures / More page sizes
>>>>> ------------------------------------
>>>>>
>>>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>>>
>>>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
>>>>> on 1G will be automatically enabled.
>>
>> Here is a draft patch to enable THP_PUD on arm64; it has only passed
>> DEBUG_VM_PGTABLE so far, but with it we may test pud pfnmaps on arm64.
>
> Thanks, Kefeng. It'll be great if this already works, as simple as that.
>
> It might be interesting to know whether it already works if you have a GPU
> with a few GBs of BAR around on your systems.
>
> Logically, as long as you have HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD selected
> below, 1G pfnmap will be automatically enabled when you rebuild the kernel.
> You can double check that by looking for this:
>
> CONFIG_ARCH_SUPPORTS_PUD_PFNMAP=y
>
> And you can try to observe the mappings by enabling dynamic debug for
> vfio_pci_mmap_huge_fault(), then mapping the BAR with vfio-pci and reading
> something from it.
I don't have such a device, but we wrote a driver which uses
vmf_insert_pfn_pmd/pud in its huge_fault handler:
static const struct vm_operations_struct test_vm_ops = {
	.huge_fault = test_huge_fault,
	...
};
and read/write it after mmap(, 2M/1G, test_fd, ...); it works as expected.
Since the patch could also be used by dax, let's send it separately.
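
For reference, a minimal sketch of what such a test_huge_fault() could look
like; TEST_PHYS_BASE and the test_* names are made up for illustration, the
vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() calls use the current pfn_t-based
signatures, and the sketch assumes the VMA maps the test range from file
offset 0 with suitable alignment:

#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/pfn_t.h>
#include <linux/pgtable.h>

#define TEST_PHYS_BASE	0x100000000UL	/* placeholder: 1G-aligned reserved range */

static vm_fault_t test_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	/* Align the faulting address down to the requested mapping size */
	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
	unsigned long pgoff = (addr - vmf->vma->vm_start) >> PAGE_SHIFT;
	pfn_t pfn = phys_to_pfn_t(TEST_PHYS_BASE + (pgoff << PAGE_SHIFT), PFN_DEV);
	bool write = vmf->flags & FAULT_FLAG_WRITE;

	switch (order) {
	case PMD_ORDER:
		return vmf_insert_pfn_pmd(vmf, pfn, write);
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
	case PUD_ORDER:
		return vmf_insert_pfn_pud(vmf, pfn, write);
#endif
	default:
		/* Let the core retry with a smaller mapping size */
		return VM_FAULT_FALLBACK;
	}
}

A PTE-sized .fault handler (e.g. via vmf_insert_pfn()) is still needed for the
VM_FAULT_FALLBACK case, and the driver's ->mmap() is expected to set VM_PFNMAP
on the VMA; an mmap() of 2M or 1G at a suitably aligned offset then exercises
the PMD or PUD path respectively.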
^ permalink raw reply [flat|nested] 90+ messages in thread