linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jiaqi Yan <jiaqiyan@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 Gavin Shan <gshan@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	x86@kernel.org,  Ingo Molnar <mingo@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Paolo Bonzini <pbonzini@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	 Thomas Gleixner <tglx@linutronix.de>,
	Alistair Popple <apopple@nvidia.com>,
	kvm@vger.kernel.org,  linux-arm-kernel@lists.infradead.org,
	Sean Christopherson <seanjc@google.com>,
	 Oscar Salvador <osalvador@suse.de>,
	Jason Gunthorpe <jgg@nvidia.com>, Borislav Petkov <bp@alien8.de>,
	 Zi Yan <ziy@nvidia.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 David Hildenbrand <david@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>, Will Deacon <will@kernel.org>,
	 Kefeng Wang <wangkefeng.wang@huawei.com>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
Date: Tue, 27 Aug 2024 15:36:07 -0700	[thread overview]
Message-ID: <CACw3F50Zi7CQsSOcCutRUy1h5p=7UBw7ZRGm4WayvsnuuEnKow@mail.gmail.com> (raw)
In-Reply-To: <20240826204353.2228736-1-peterx@redhat.com>

On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
>   special bits for huge pfn mappings when inject" [JasonG]
> - In gup-fast, reorder _special check v.s. _devmap check, so as to make
>   pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
>   used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on mm-unstable, commit b659edec079c of Aug 26th
> latest, with patch "vma remove the unneeded avc bound with non-CoWed folio"
> reverted, as reported broken [0].
>
> This series implements huge pfnmaps support for mm in general.  Huge pfnmap
> allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels, similar to
> what we do with dax / thp / hugetlb so far to benefit from TLB hits.  Now
> we extend that idea to PFN mappings, e.g. PCI MMIO bars where it can grow
> as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
> patch (from Alex Williamson) will be the first user of huge pfnmap, so as
> to enable vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support comparing to many
> other types of mappings, because of PFNMAP's specialties when there's no
> vmemmap backing it, so that most of the kernel routines on huge mappings
> should simply already fail for them, like GUPs or old-school follow_page()
> (which is recently rewritten to be folio_walk* APIs by David).
>
> One trick here is that we're still unmature on PUDs in generic paths here
> and there, as DAX is so far the only user.  This patchset will add the 2nd
> user of it.  Hugetlb can be a 3rd user if the hugetlb unification work can
> go on smoothly, but to be discussed later.
>
> The other trick is how to allow gup-fast working for such huge mappings
> even if there's no direct sign of knowing whether it's a normal page or
> MMIO mapping.  This series chose to keep the pte_special solution, so that
> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
> gup-fast will be able to identify them and fail properly.
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka,
> follow_pte(), will need to retire soon due to the fact that it only works
> with ptes.  A new set of simple API is introduced (follow_pfnmap* API) to
> be able to do whatever follow_pte() can already do, plus that it can also
> process huge pfnmaps now.  Half of this series is about that and converting
> all existing pfnmap walkers to use the new API properly.  Hopefully the new
> API also looks better to avoid exposing e.g. pgtable lock details into the
> callers, so that it can be used in an even more straightforward way.
>
> Here, three more options will be introduced and involved in huge pfnmap:
>
>   - ARCH_SUPPORTS_HUGE_PFNMAP
>
>     Arch developers will need to select this option when huge pfnmap is
>     supported in arch's Kconfig.  After this patchset applied, both x86_64
>     and arm64 will start to enable it by default.
>
>   - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
>     These options are for driver developers to identify whether current
>     arch / config supports huge pfnmaps, making decision on whether it can
>     use the huge pfnmap APIs to inject them.  One can refer to the last
>     vfio-pci patch from Alex on the use of them properly in a device
>     driver.
>
> So after the whole set applied, and if one would enable some dynamic debug
> lines in vfio-pci core files, we should observe things like:
>
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
>   vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it says that vfio-pci faults in PMDs properly for a
> few BAR0 offsets.
>
> Patch Layout
> ============
>
> Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
> Patch 2:         A tiny cleanup
> Patch 3-8:       Preparation patches for huge pfnmap (include introduce
>                  special bit for pmd/pud)
> Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
>                  then drop follow_pte() API
> Patch 17:        Add huge pfnmap support for x86_64
> Patch 18:        Add huge pfnmap support for arm64
> Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
> have plan to support arm64 1G later on top of this series [2].
>
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs.  With the new option,
> remap_pfn_range() can logically start to inject either PMDs or PUDs when
> the alignment requirements match on the VAs.
>
> When the support is there, it should be able to silently benefit all
> drivers that is using remap_pfn_range() in its mmap() handler on better TLB
> hit rate and overall faster MMIO accesses similar to processor on hugepages.
>

Hi Peter,

I am curious if there is any work needed for unmap_mapping_range? If a
driver hugely remap_pfn_range()ed at 1G granularity, can the driver
unmap at PAGE_SIZE granularity? For example, when handling a PFN is
poisoned in the 1G mapping, it would be great if the mapping can be
splitted to 2M mappings + 4k mappings, so only the single poisoned PFN
is lost. (Pretty much like the past proposal* to use HGM** to improve
hugetlb's memory failure handling).

Probably these questions can be answered after reading your code,
which I plan to do, but just want to ask in case you have an easy
answer for me.

* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017

> More driver support
> -------------------
>
> VFIO is so far the only consumer for the huge pfnmaps after this series
> applied.  Besides above remap_pfn_range() generic optimization, device
> driver can also try to optimize its mmap() on a better VA alignment for
> either PMD/PUD sizes.  This may, iiuc, normally require userspace changes,
> as the driver doesn't normally decide the VA to map a bar.  But I don't
> think I know all the drivers to know the full picture.
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e QEMU with 128MB BAR 0, with some prefault test, mprotect()
>   and fork() tests on the bar mapped
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>     sure all pages to be mapped with PUDs
>   - Main BAR (8GB) start to use PUD mappings
>   - Sub BAR (??MBs?) start to use PMD mappings
>   - Performance wise, slight improvement comparing to the old PTE mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure driver loads fine with PMD mappings
>
> Credits all go to Alex on help testing the GPU/NIC use cases above.
>
> Comments welcomed, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
>
> Alex Williamson (1):
>   vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
>   mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
>   mm: Drop is_huge_zero_pud()
>   mm: Mark special bits for huge pfn mappings when inject
>   mm: Allow THP orders for PFNMAPs
>   mm/gup: Detect huge pfnmap entries in gup-fast
>   mm/pagewalk: Check pfnmap for folio_walk_start()
>   mm/fork: Accept huge pfnmap entries
>   mm: Always define pxx_pgprot()
>   mm: New follow_pfnmap API
>   KVM: Use follow_pfnmap API
>   s390/pci_mmio: Use follow_pfnmap API
>   mm/x86/pat: Use the new follow_pfnmap API
>   vfio: Use the new follow_pfnmap API
>   acrn: Use the new follow_pfnmap API
>   mm/access_process_vm: Use the new follow_pfnmap API
>   mm: Remove follow_pte()
>   mm/x86: Support large pfn mappings
>   mm/arm64: Support large pfn mappings
>
>  arch/arm64/Kconfig                  |   1 +
>  arch/arm64/include/asm/pgtable.h    |  30 +++++
>  arch/powerpc/include/asm/pgtable.h  |   1 +
>  arch/s390/include/asm/pgtable.h     |   1 +
>  arch/s390/pci/pci_mmio.c            |  22 ++--
>  arch/sparc/include/asm/pgtable_64.h |   1 +
>  arch/x86/Kconfig                    |   1 +
>  arch/x86/include/asm/pgtable.h      |  80 +++++++-----
>  arch/x86/mm/pat/memtype.c           |  17 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
>  drivers/vfio/vfio_iommu_type1.c     |  16 +--
>  drivers/virt/acrn/mm.c              |  16 +--
>  include/linux/huge_mm.h             |  16 +--
>  include/linux/mm.h                  |  57 ++++++++-
>  include/linux/pgtable.h             |  12 ++
>  mm/Kconfig                          |  13 ++
>  mm/gup.c                            |   6 +
>  mm/huge_memory.c                    |  50 +++++---
>  mm/memory.c                         | 183 ++++++++++++++++++++--------
>  mm/pagewalk.c                       |   4 +-
>  virt/kvm/kvm_main.c                 |  19 ++-
>  21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>
>


  parent reply	other threads:[~2024-08-27 22:36 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-26 20:43 [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
2024-08-26 20:43 ` [PATCH v2 01/19] mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud Peter Xu
2024-08-26 20:43 ` [PATCH v2 02/19] mm: Drop is_huge_zero_pud() Peter Xu
2024-08-26 20:43 ` [PATCH v2 03/19] mm: Mark special bits for huge pfn mappings when inject Peter Xu
2024-08-28 15:31   ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 04/19] mm: Allow THP orders for PFNMAPs Peter Xu
2024-08-28 15:31   ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 05/19] mm/gup: Detect huge pfnmap entries in gup-fast Peter Xu
2024-08-26 20:43 ` [PATCH v2 06/19] mm/pagewalk: Check pfnmap for folio_walk_start() Peter Xu
2024-08-28  7:44   ` David Hildenbrand
2024-08-28 14:24     ` Peter Xu
2024-08-28 15:30       ` David Hildenbrand
2024-08-28 19:45         ` Peter Xu
2024-08-28 23:46           ` Jason Gunthorpe
2024-08-29  6:35             ` David Hildenbrand
2024-08-29 18:45               ` Peter Xu
2024-08-29 15:10           ` David Hildenbrand
2024-08-29 18:49             ` Peter Xu
2024-08-26 20:43 ` [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries Peter Xu
2024-08-29 15:10   ` David Hildenbrand
2024-08-29 18:26     ` Peter Xu
2024-08-29 19:44       ` David Hildenbrand
2024-08-29 20:01         ` Peter Xu
2024-09-02  7:58   ` Yan Zhao
2024-09-03 21:23     ` Peter Xu
2024-09-09 22:25       ` Andrew Morton
2024-09-09 22:43         ` Peter Xu
2024-09-09 23:15           ` Andrew Morton
2024-09-10  0:08             ` Peter Xu
2024-09-10  2:52               ` Yan Zhao
2024-09-10 12:16                 ` Peter Xu
2024-09-11  2:16                   ` Yan Zhao
2024-09-11 14:34                     ` Peter Xu
2024-08-26 20:43 ` [PATCH v2 08/19] mm: Always define pxx_pgprot() Peter Xu
2024-08-26 20:43 ` [PATCH v2 09/19] mm: New follow_pfnmap API Peter Xu
2024-08-26 20:43 ` [PATCH v2 10/19] KVM: Use " Peter Xu
2024-08-26 20:43 ` [PATCH v2 11/19] s390/pci_mmio: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 12/19] mm/x86/pat: Use the new " Peter Xu
2024-08-26 20:43 ` [PATCH v2 13/19] vfio: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 14/19] acrn: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 15/19] mm/access_process_vm: " Peter Xu
2024-08-26 20:43 ` [PATCH v2 16/19] mm: Remove follow_pte() Peter Xu
2024-09-01  4:33   ` Yu Zhao
2024-09-01 13:39     ` David Hildenbrand
2024-08-26 20:43 ` [PATCH v2 17/19] mm/x86: Support large pfn mappings Peter Xu
2024-08-26 20:43 ` [PATCH v2 18/19] mm/arm64: " Peter Xu
2025-03-19 22:22   ` Keith Busch
2025-03-19 22:46     ` Peter Xu
2025-03-19 22:53       ` Keith Busch
2024-08-26 20:43 ` [PATCH v2 19/19] vfio/pci: Implement huge_fault support Peter Xu
2024-08-27 22:36 ` Jiaqi Yan [this message]
2024-08-27 22:57   ` [PATCH v2 00/19] mm: Support huge pfnmaps Peter Xu
2024-08-28  0:42     ` Jiaqi Yan
2024-08-28  0:46       ` Jiaqi Yan
2024-08-28 14:24       ` Jason Gunthorpe
2024-08-28 16:10         ` Jiaqi Yan
2024-08-28 23:49           ` Jason Gunthorpe
2024-08-29 19:21             ` Jiaqi Yan
2024-09-04 15:52               ` Jason Gunthorpe
2024-09-04 16:38                 ` Jiaqi Yan
2024-09-04 16:43                   ` Jason Gunthorpe
2024-09-04 16:58                     ` Jiaqi Yan
2024-09-04 17:00                       ` Jason Gunthorpe
2024-09-04 17:07                         ` Jiaqi Yan
2024-09-09  3:56                           ` Ankit Agrawal
2024-08-28 14:41       ` Peter Xu
2024-08-28 16:23         ` Jiaqi Yan
2024-09-09  4:03 ` Ankit Agrawal
2024-09-09 15:03   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACw3F50Zi7CQsSOcCutRUy1h5p=7UBw7ZRGm4WayvsnuuEnKow@mail.gmail.com' \
    --to=jiaqiyan@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=bp@alien8.de \
    --cc=catalin.marinas@arm.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=gshan@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=osalvador@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=wangkefeng.wang@huawei.com \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    --cc=yan.y.zhao@intel.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).