From: Jiaqi Yan <jiaqiyan@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Gavin Shan <gshan@redhat.com>,
Catalin Marinas <catalin.marinas@arm.com>,
x86@kernel.org, Ingo Molnar <mingo@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Paolo Bonzini <pbonzini@redhat.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Alistair Popple <apopple@nvidia.com>,
kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
Sean Christopherson <seanjc@google.com>,
Oscar Salvador <osalvador@suse.de>,
Jason Gunthorpe <jgg@nvidia.com>, Borislav Petkov <bp@alien8.de>,
Zi Yan <ziy@nvidia.com>,
Axel Rasmussen <axelrasmussen@google.com>,
David Hildenbrand <david@redhat.com>,
Yan Zhao <yan.y.zhao@intel.com>, Will Deacon <will@kernel.org>,
Kefeng Wang <wangkefeng.wang@huawei.com>,
Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
Date: Tue, 27 Aug 2024 15:36:07 -0700 [thread overview]
Message-ID: <CACw3F50Zi7CQsSOcCutRUy1h5p=7UBw7ZRGm4WayvsnuuEnKow@mail.gmail.com> (raw)
In-Reply-To: <20240826204353.2228736-1-peterx@redhat.com>
On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
> special bits for huge pfn mappings when inject" [JasonG]
> - In gup-fast, reorder _special check v.s. _devmap check, so as to make
> pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
> used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on mm-unstable, commit b659edec079c (latest as of
> Aug 26th), with the patch "vma remove the unneeded avc bound with
> non-CoWed folio" reverted, as it was reported broken [0].
>
> This series implements huge pfnmap support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to be mapped at PMD or PUD level, similar to
> what we already do with dax / thp / hugetlb, so as to benefit from TLB
> hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO BARs, which
> can grow as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
> patch (from Alex Williamson) will be the first user of huge pfnmap,
> enabling the vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, it's relatively simple to add such support compared to many
> other types of mappings, because PFNMAP is special in that there is no
> vmemmap backing it, so most of the kernel routines on huge mappings
> should simply already fail for them, like GUP or the old-school
> follow_page() (which was recently rewritten into the folio_walk* APIs by
> David).
>
> One trick here is that the generic paths are still immature on PUDs here
> and there, as DAX is so far the only user. This patchset will add the
> 2nd user. Hugetlb could become a 3rd user if the hugetlb unification
> work goes smoothly, but that is to be discussed later.
>
> The other trick is how to allow gup-fast to work for such huge mappings
> even if there's no direct way of knowing whether an entry is a normal
> page or an MMIO mapping. This series chose to keep the pte_special
> solution, reusing the same idea by setting a special bit on pfnmap
> PMDs/PUDs so that gup-fast will be able to identify them and fail
> properly.
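
For my own understanding, a minimal sketch of what that gup-fast check
would look like (simplified on my side, not the exact mm/gup.c code):

    static int gup_fast_pmd_leaf_sketch(pmd_t orig)
    {
            /* pfnmap/MMIO leaf: no struct page backing, cannot be pinned */
            if (pmd_special(orig))
                    return 0;  /* bail out; slow path refuses it properly */

            /* ... normal THP refcount/pin handling would continue here ... */
            return 1;
    }
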
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka
> follow_pte(), will need to retire soon, because it only works with ptes.
> A new set of simple APIs is introduced (the follow_pfnmap* API) that can
> do whatever follow_pte() can already do, plus process huge pfnmaps. Half
> of this series is about that and about converting all existing pfnmap
> walkers to use the new API properly. Hopefully the new API also looks
> better by not exposing e.g. pgtable lock details to the callers, so that
> it can be used in an even more straightforward way.
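
Just to check my reading of the new API, a rough usage sketch (the struct
and field names below are my assumption from skimming the series, not
verified against the patches):

    struct follow_pfnmap_args args = {
            .vma = vma,
            .address = addr,
    };

    if (follow_pfnmap_start(&args))
            return -EFAULT;         /* no valid pfnmap at this address */

    pfn = args.pfn;                 /* only valid before ..._end() below */
    writable = args.writable;

    follow_pfnmap_end(&args);       /* drops the lock taken by start() */
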
>
> Here, three more Kconfig options are introduced for huge pfnmap:
>
> - ARCH_SUPPORTS_HUGE_PFNMAP
>
> Arch developers will need to select this option in the arch's Kconfig
> when huge pfnmap is supported. After this patchset is applied, both
> x86_64 and arm64 will enable it by default.
>
> - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
> These options are for driver developers to identify whether the current
> arch / config supports huge pfnmaps, so they can decide whether to use
> the huge pfnmap APIs to inject them. One can refer to the last vfio-pci
> patch from Alex for how to use them properly in a device driver.
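
To make sure I understand how a driver would consume these options, here
is my rough sketch of a huge_fault handler, loosely modeled on what I
expect the vfio-pci patch to do (bar_pfn_for() is a hypothetical helper,
and the insertion-helper signatures are from my memory of the current
APIs, so please correct me if they differ):

    static vm_fault_t my_mmap_huge_fault(struct vm_fault *vmf,
                                         unsigned int order)
    {
            unsigned long pfn = bar_pfn_for(vmf);  /* hypothetical helper */

            switch (order) {
            case 0:
                    return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
    #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
            case PMD_ORDER:
                    return vmf_insert_pfn_pmd(vmf,
                                    __pfn_to_pfn_t(pfn, PFN_DEV), false);
    #endif
    #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
            case PUD_ORDER:
                    return vmf_insert_pfn_pud(vmf,
                                    __pfn_to_pfn_t(pfn, PFN_DEV), false);
    #endif
            default:
                    return VM_FAULT_FALLBACK;  /* core retries at order 0 */
            }
    }
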
>
> So after the whole set is applied, and if one enables some dynamic debug
> lines in the vfio-pci core files, we should observe things like:
>
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it shows that vfio-pci faults in PMDs properly for
> a few BAR 0 offsets.
>
> Patch Layout
> ============
>
> Patch 1: Introduce the new options mentioned above for huge PFNMAPs
> Patch 2: A tiny cleanup
> Patch 3-8: Preparation patches for huge pfnmap (include introduce
> special bit for pmd/pud)
> Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and
> then drop follow_pte() API
> Patch 17: Add huge pfnmap support for x86_64
> Patch 18: Add huge pfnmap support for arm64
> Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems
> to be a plan to support 1G on arm64 later, on top of this series [2].
>
> Any arch will first need to support THP / THP_1G, then provide a special
> bit in its pmds/puds to support huge pfnmaps.
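
(If I read the arch patches right, the per-arch piece is roughly the
following; this is my own simplified, arm64-flavoured sketch with an
assumed bit name, not the actual patch:)

    /* Let generic code (gup-fast, folio_walk, ...) detect pfnmap leaves. */
    static inline bool pmd_special(pmd_t pmd)
    {
            return !!(pmd_val(pmd) & PTE_SPECIAL);
    }

    static inline pmd_t pmd_mkspecial(pmd_t pmd)
    {
            return __pmd(pmd_val(pmd) | PTE_SPECIAL);
    }
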
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs. With the new options,
> remap_pfn_range() could logically start to inject either PMDs or PUDs
> when the alignment requirements are met on the VAs.
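
(Presumably the upgrade condition would be something like the following;
purely illustrative on my side, since this part is still a TODO:)

    /* A chunk can be mapped by one special PMD instead of 512 PTEs iff: */
    if (IS_ALIGNED(addr, PMD_SIZE) &&
        IS_ALIGNED((unsigned long)pfn << PAGE_SHIFT, PMD_SIZE) &&
        size >= PMD_SIZE)
            /* ... install a pfnmap PMD for this chunk ... */
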
>
> When that support is there, it should silently benefit all drivers that
> use remap_pfn_range() in their mmap() handlers, with better TLB hit rates
> and overall faster MMIO accesses, similar to what the processor already
> gets from hugepages.
>
Hi Peter,
I am curious whether there is any work needed for unmap_mapping_range. If
a driver remap_pfn_range()ed hugely at 1G granularity, can the driver
unmap at PAGE_SIZE granularity? For example, when handling a poisoned PFN
inside the 1G mapping, it would be great if the mapping could be split
into 2M mappings + 4k mappings, so that only the single poisoned PFN is
lost (pretty much like the past proposal* to use HGM** to improve
hugetlb's memory failure handling).

Probably these questions can be answered after reading your code, which I
plan to do, but I just want to ask in case you have an easy answer for me.
* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017
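
Concretely, what I'd hope could eventually work on poison is roughly the
following (a sketch; it assumes the core mm could split the huge pfnmap
rather than having to zap the whole 1G range):

    /* Drop only the single poisoned page from every mapping of the BAR. */
    unmap_mapping_range(mapping,
                        (loff_t)poisoned_pgoff << PAGE_SHIFT, /* hole start */
                        PAGE_SIZE,                            /* hole length */
                        1);                                   /* even_cows */
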
> More driver support
> -------------------
>
> VFIO is so far the only consumer of huge pfnmaps after this series is
> applied. Besides the generic remap_pfn_range() optimization above, a
> device driver can also try to optimize its mmap() for better VA alignment
> at PMD/PUD sizes. This may, iiuc, normally require userspace changes, as
> the driver doesn't normally decide the VA at which a BAR is mapped. But
> I don't know all the drivers well enough to see the full picture.
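
For the userspace-change part, I assume you mean something along these
lines (a hypothetical plain-POSIX example; error handling and trimming of
the slack mapping are omitted):

    #include <stdint.h>
    #include <sys/mman.h>

    static void *map_bar_pud_aligned(int fd, size_t bar_size)
    {
            size_t align = 1UL << 30;       /* PUD size on x86_64 */

            /* Over-reserve, then place the BAR at a 1G-aligned address so
             * the kernel's fault handler has a chance to use PUD mappings. */
            uint8_t *res = mmap(NULL, bar_size + align, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            uintptr_t aligned = ((uintptr_t)res + align - 1) & ~(align - 1);

            return mmap((void *)aligned, bar_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    }
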
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e QEMU with 128MB BAR 0, with some prefault, mprotect(),
>   and fork() tests on the mapped BAR
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment so that
>     all pages are mapped with PUDs
>   - Main BAR (8GB) starts to use PUD mappings
>   - Sub BAR (??MBs?) starts to use PMD mappings
>   - Performance-wise, slight improvement compared to the old PTE mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure the driver loads fine with PMD mappings
>
> Credits all go to Alex for helping test the GPU/NIC use cases above.
>
> Comments welcome, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
>
> Alex Williamson (1):
> vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
> mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
> mm: Drop is_huge_zero_pud()
> mm: Mark special bits for huge pfn mappings when inject
> mm: Allow THP orders for PFNMAPs
> mm/gup: Detect huge pfnmap entries in gup-fast
> mm/pagewalk: Check pfnmap for folio_walk_start()
> mm/fork: Accept huge pfnmap entries
> mm: Always define pxx_pgprot()
> mm: New follow_pfnmap API
> KVM: Use follow_pfnmap API
> s390/pci_mmio: Use follow_pfnmap API
> mm/x86/pat: Use the new follow_pfnmap API
> vfio: Use the new follow_pfnmap API
> acrn: Use the new follow_pfnmap API
> mm/access_process_vm: Use the new follow_pfnmap API
> mm: Remove follow_pte()
> mm/x86: Support large pfn mappings
> mm/arm64: Support large pfn mappings
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 30 +++++
> arch/powerpc/include/asm/pgtable.h | 1 +
> arch/s390/include/asm/pgtable.h | 1 +
> arch/s390/pci/pci_mmio.c | 22 ++--
> arch/sparc/include/asm/pgtable_64.h | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 80 +++++++-----
> arch/x86/mm/pat/memtype.c | 17 ++-
> drivers/vfio/pci/vfio_pci_core.c | 60 ++++++---
> drivers/vfio/vfio_iommu_type1.c | 16 +--
> drivers/virt/acrn/mm.c | 16 +--
> include/linux/huge_mm.h | 16 +--
> include/linux/mm.h | 57 ++++++++-
> include/linux/pgtable.h | 12 ++
> mm/Kconfig | 13 ++
> mm/gup.c | 6 +
> mm/huge_memory.c | 50 +++++---
> mm/memory.c | 183 ++++++++++++++++++++--------
> mm/pagewalk.c | 4 +-
> virt/kvm/kvm_main.c | 19 ++-
> 21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>
>