* [GIT PULL] VFIO updates for v6.17-rc1
@ 2025-08-04 22:22 Alex Williamson
  2025-08-04 23:55 ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread

From: Alex Williamson @ 2025-08-04 22:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Linus,

The following changes since commit d7b8f8e20813f0179d8ef519541a3527e7661d3a:

  Linux 6.16-rc5 (2025-07-06 14:10:26 -0700)

are available in the Git repository at:

  https://github.com/awilliam/linux-vfio.git tags/vfio-v6.17-rc1

for you to fetch changes up to d2272d898186fa96c6fa4ad067219b45b8c2ef5f:

  vfio/type1: correct logic of vfio_find_vpfn() (2025-07-27 08:20:31 -0600)

----------------------------------------------------------------
VFIO updates for v6.17-rc1

 - Fix imbalance where the no-iommu/cdev device path skips too much on
   open, failing to increment a reference, but still decrements the
   reference on close. Add bounds checking to prevent such underflows.
   (Jacob Pan)

 - Fill missing detach_ioas op for pds_vfio_pci, fixing probe failure
   when used with IOMMUFD. (Brett Creeley)

 - Split SR-IOV VFs to separate dev_set, avoiding unnecessary
   serialization between VFs that appear on the same bus.
   (Alex Williamson)

 - Fix a theoretical integer overflow in the mlx5-vfio-pci variant
   driver. (Artem Sadovnikov)

 - Use batching on contiguous pages to significantly improve DMA map
   and unmap performance when using hugetlbfs or THP backed memory
   with the legacy vfio type1 IOMMU backend. (Li Zhe)

 - Implement missing VF token checking support via vfio cdev/IOMMUFD
   interface. (Jason Gunthorpe)

 - Update QAT vfio-pci variant driver to claim latest VF devices.
   (Małgorzata Mielnik)

 - Add a cond_resched() call to avoid holding the CPU too long during
   DMA mapping operations.
   (Keith Busch)

----------------------------------------------------------------
Alex Williamson (1):
      vfio/pci: Separate SR-IOV VF dev_set

Artem Sadovnikov (1):
      vfio/mlx5: fix possible overflow in tracking max message size

Brett Creeley (1):
      vfio/pds: Fix missing detach_ioas op

Jacob Pan (2):
      vfio: Fix unbalanced vfio_df_close call in no-iommu mode
      vfio: Prevent open_count decrement to negative

Jason Gunthorpe (1):
      vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD

Keith Busch (1):
      vfio/type1: conditional rescheduling while pinning

Li Zhe (6):
      mm: introduce num_pages_contiguous()
      vfio/type1: optimize vfio_pin_pages_remote()
      vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()
      vfio/type1: introduce a new member has_rsvd for struct vfio_dma
      vfio/type1: optimize vfio_unpin_pages_remote()
      vfio/type1: correct logic of vfio_find_vpfn()

Małgorzata Mielnik (1):
      vfio/qat: add support for intel QAT 6xxx virtual functions

Xin Zeng (1):
      vfio/qat: Remove myself from VFIO QAT PCI driver maintainers

 MAINTAINERS                                    |   1 -
 drivers/vfio/device_cdev.c                     |  38 +++++++-
 drivers/vfio/group.c                           |   7 +-
 drivers/vfio/iommufd.c                         |   4 +
 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c |   1 +
 drivers/vfio/pci/mlx5/cmd.c                    |   4 +-
 drivers/vfio/pci/mlx5/main.c                   |   1 +
 drivers/vfio/pci/nvgrace-gpu/main.c            |   2 +
 drivers/vfio/pci/pds/vfio_dev.c                |   2 +
 drivers/vfio/pci/qat/main.c                    |   5 +-
 drivers/vfio/pci/vfio_pci.c                    |   1 +
 drivers/vfio/pci/vfio_pci_core.c               |  24 +++--
 drivers/vfio/pci/virtio/main.c                 |   3 +
 drivers/vfio/vfio_iommu_type1.c                | 118 ++++++++++++++++++++-----
 drivers/vfio/vfio_main.c                       |   3 +-
 include/linux/mm.h                             |  23 +++++
 include/linux/vfio.h                           |   4 +
 include/linux/vfio_pci_core.h                  |   2 +
 include/uapi/linux/vfio.h                      |  12 ++-
 19 files changed, 212 insertions(+), 43 deletions(-)

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-04 22:22 [GIT PULL] VFIO updates for v6.17-rc1 Alex Williamson
@ 2025-08-04 23:55 ` Linus Torvalds
  2025-08-05  0:53   ` Alex Williamson
  0 siblings, 1 reply; 31+ messages in thread

From: Linus Torvalds @ 2025-08-04 23:55 UTC (permalink / raw)
To: Alex Williamson; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> Li Zhe (6):
>       mm: introduce num_pages_contiguous()

WHY?

There is exactly *ONE* user, why the heck do we introduce this
completely pointless "helper" function, and put it in a core header
file like it was worth it?

And it's not like that code is some kind of work of art that we want
to expose everybody to *anyway*. It's written in a particularly stupid
way that means that it's *way* more expensive than it needs to be.

And then it's made "inline" despite the code generation being
horrible, which makes it all entirely pointless.

Yes, I'm grumpy. This pull request came in late, I'm already
traveling, and then I look at it and it just makes me *angry* at how
bad that code is, and how annoying it is.

My builds are already slower than usual because they happen on my
laptop while traveling, I do *not* need to see this kind of absolutely
disgusting code that does stupid things that make the build even
slower.

So I refuse to pull this kind of crap.

If you insist on making my build slower and exposing these kinds of
helper functions, they had better be *good* helper functions.

Hint: absolutely nobody cares about "the pages crossed a sparsemem
border". If your driver cares about the number of contiguous pages, it
might as well say "yeah, they are contiguous, but they are in
different sparsemem chunks, so we'll break here too".

And at that point all you care about is 'struct page' being
contiguous, instead of doing that disgusting 'nth_page'.
And then - since there is only *one* single user - you don't put it in
the most central header file that EVERYBODY ELSE cares about.

And you absolutely don't do it if it generates garbage code for no good reason!

           Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread
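[Editorial note: Linus's point above, that with a flat (vmemmap-style) memmap "physically contiguous" reduces to "consecutive struct page pointers", can be sketched outside the kernel. The following is an illustrative userspace model, not kernel code; `struct page` here is a stand-in and `num_pages_contiguous()` is written the simple way, as a plain pointer comparison with no `nth_page()` and no pfn round-trips.]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

/*
 * Count the leading run of consecutive struct pages in the array.
 * With a flat memmap, "the next physical page" is simply "the next
 * struct page", so a pointer comparison is the whole test.
 */
static size_t num_pages_contiguous(struct page **pages, size_t nr_pages)
{
	size_t i;

	for (i = 1; i < nr_pages; i++)
		if (pages[i] != pages[i - 1] + 1)
			break;
	return i;
}
```

A single-entry array trivially yields 1, since the loop body never runs.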
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-04 23:55 ` Linus Torvalds @ 2025-08-05 0:53 ` Alex Williamson 2025-08-05 7:47 ` David Hildenbrand 0 siblings, 1 reply; 31+ messages in thread From: Alex Williamson @ 2025-08-05 0:53 UTC (permalink / raw) To: Linus Torvalds Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com, Jason Gunthorpe, David Hildenbrand On Mon, 4 Aug 2025 16:55:09 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote: > > > > Li Zhe (6): > > mm: introduce num_pages_contiguous() > > WHY? > > There is exactly *ONE* user, why the heck do we introduce this > completely pointless "helper" function, and put it in a core header > file like it was worth it? There was discussion here[1] where David Hildenbrand and Jason Gunthorpe suggested this should be in common code and I believe there was some intent that this would get reused. I took this as endorsement from mm folks. This can certainly be pulled back into subsystem code. > And it's not like that code is some kind of work of art that we want > to expose everybody to *anyway*. It's written in a particularly stupid > way that means that it's *way* more expensive than it needs to be. > > And then it's made "inline" despite the code generation being > horrible, which makes it all entirely pointless. > > Yes, I'm grumpy. This pull request came in late, I'm already > traveling, and then I look at it and it just makes me *angry* at how > bad that code is, and how annoying it is. Sorry, I usually try to get in later during the first week to let the dust settle a bit from the bigger subsystems, I guess I'm running a little behind this cycle. We'll get it fixed and I'll resend. 
Thanks, Alex > My builds are already slower than usual because they happen on my > laptop while traveling, I do *not* need to see this kind of absolutely > disgusting code that does stupid things that make the build even > slower. > > So I refuse to pull this kind of crap. > > If you insist on making my build slower and exposing these kinds of > helper functions, they had better be *good* helper functions. > > Hint: absolutely nobody cares about "the pages crossed a sparsemem > border. If your driver cares about the number of contiguous pages, it > might as well say "yeah, they are contiguous, but they are in > different sparsemem chunks, so we'll break here too". > > And at that point all you care about is 'struct page' being > contiguous, instead of doing that disgusting 'nth_page'. > > And then - since there is only *one* single user - you don't put it in > the most central header file that EVERYBODY ELSE cares about. > > And you absolutely don't do it if it generates garbage code for no good reason! > > Linus > [1]https://lore.kernel.org/all/20250703111216.GG904431@ziepe.ca/ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05  0:53 ` Alex Williamson
@ 2025-08-05  7:47   ` David Hildenbrand
  2025-08-05 11:49     ` Jason Gunthorpe
  2025-08-05 13:00     ` Linus Torvalds
  0 siblings, 2 replies; 31+ messages in thread

From: David Hildenbrand @ 2025-08-05  7:47 UTC (permalink / raw)
To: Alex Williamson, Linus Torvalds
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On 05.08.25 02:53, Alex Williamson wrote:
> On Mon, 4 Aug 2025 16:55:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>> Li Zhe (6):
>>>       mm: introduce num_pages_contiguous()
>>
>> WHY?
>>
>> There is exactly *ONE* user, why the heck do we introduce this
>> completely pointless "helper" function, and put it in a core header
>> file like it was worth it?
> 
> There was discussion here[1] where David Hildenbrand and Jason
> Gunthorpe suggested this should be in common code and I believe there
> was some intent that this would get reused. I took this as
> endorsement from mm folks. This can certainly be pulled back into
> subsystem code.

Yeah, we ended up here after trying to go the folio-way first, but then
realizing that code that called GUP shouldn't have to worry about
folios, just to detect consecutive pages+PFNs.

I think this helper can come in handy even in folio context.
I recall pointing Joanne at it in a different fuse context.

>>
>> And it's not like that code is some kind of work of art that we want
>> to expose everybody to *anyway*. It's written in a particularly stupid
>> way that means that it's *way* more expensive than it needs to be.
>>
>> And then it's made "inline" despite the code generation being
>> horrible, which makes it all entirely pointless.
>>
>> Yes, I'm grumpy.
>> This pull request came in late, I'm already
>> traveling, and then I look at it and it just makes me *angry* at how
>> bad that code is, and how annoying it is.
> 
> Sorry, I usually try to get in later during the first week to let the
> dust settle a bit from the bigger subsystems, I guess I'm running a
> little behind this cycle. We'll get it fixed and I'll resend.
> 
> Thanks,
> 
> Alex
> 
>> My builds are already slower than usual because they happen on my
>> laptop while traveling, I do *not* need to see this kind of absolutely
>> disgusting code that does stupid things that make the build even
>> slower.
>>
>> So I refuse to pull this kind of crap.
>>
>> If you insist on making my build slower and exposing these kinds of
>> helper functions, they had better be *good* helper functions.
>>
>> Hint: absolutely nobody cares about "the pages crossed a sparsemem
>> border". If your driver cares about the number of contiguous pages, it
>> might as well say "yeah, they are contiguous, but they are in
>> different sparsemem chunks, so we'll break here too".

The concern is rather false positives, meaning, you want consecutive
PFNs (just like within a folio), but -- because the stars aligned --
you get consecutive "struct page" that do not translate to consecutive
PFNs.

So that's why the nth page stuff is not optional in the current
implementation as far as I can tell.

And because that nth_page stuff is so tricky and everyone gets it wrong
all the time, I am actually in favor of having such a helper around,
not buried in some subsystem.

>>
>> And at that point all you care about is 'struct page' being
>> contiguous, instead of doing that disgusting 'nth_page'.

I think stopping when we hit the end of a memory section in case of
!CONFIG_SPARSEMEM_VMEMMAP could be done and documented. It's just that
code gets more complicated: we end up really only optimizing for the
unloved child, sparsemem without CONFIG_SPARSEMEM_VMEMMAP:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d4ee569aa6b6..f080fa5a68d4a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,9 @@ static inline unsigned long page_to_section(const struct page *page)
  * the memory model, this can mean that the addresses of the "struct page"s
  * are not contiguous.
  *
+ * On sparsemem configs without CONFIG_SPARSEMEM_VMEMMAP, we will stop once we
+ * hit a memory section boundary.
+ *
  * @pages: an array of page pointers
  * @nr_pages: length of the array
  */
@@ -1781,8 +1784,16 @@ static inline unsigned long num_pages_contiguous(struct page **pages,
 {
 	size_t i;
 
+	if (IS_ENABLED(CONFIG_SPARSEMEM) &&
+	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
+		const unsigned long pfn = page_to_pfn(pages[0]);
+
+		nr_pages = min_t(size_t, nr_pages,
+				 PAGES_PER_SECTION - (pfn % PAGES_PER_SECTION));
+	}
+
 	for (i = 1; i < nr_pages; i++)
-		if (pages[i] != nth_page(pages[0], i))
+		if (pages[i] != pages[i - 1] + 1)
 			break;

Whether that helper should live in mm/util.c is a valid point, the
bigger it gets.

>>
>> And then - since there is only *one* single user - you don't put it in
>> the most central header file that EVERYBODY ELSE cares about.
>>
>> And you absolutely don't do it if it generates garbage code for no good reason!

I understand the hate for nth_page in this context. But I rather hate
it *completely* because people get it wrong all of the time.

In any case, enjoy your travels, Linus.

-- 
Cheers,

David / dhildenb

^ permalink raw reply related	[flat|nested] 31+ messages in thread
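[Editorial note: the clamp arithmetic in the hunk above is easy to get wrong; with a power-of-two section size, the number of pages remaining before the next section boundary is `PAGES_PER_SECTION - (pfn % PAGES_PER_SECTION)`, equivalently `PAGES_PER_SECTION - (pfn & (PAGES_PER_SECTION - 1))`, while a bare `pfn & PAGES_PER_SECTION` would just mask with a power of two. A userspace sketch, with a made-up toy section size (the real value is per-architecture):]

```c
#include <assert.h>
#include <stddef.h>

/* Assumed toy value for illustration only; chosen as a power of two
 * like the kernel's real per-arch PAGES_PER_SECTION. */
#define PAGES_PER_SECTION 256UL

/*
 * Number of pages from pfn up to (and excluding) the next section
 * boundary, i.e. the most a section-clamped scan may cover.
 */
static unsigned long pages_left_in_section(unsigned long pfn)
{
	return PAGES_PER_SECTION - (pfn % PAGES_PER_SECTION);
}
```

A pfn sitting exactly on a section boundary has the whole section ahead of it; the last pfn of a section has exactly one page left.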
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 7:47 ` David Hildenbrand @ 2025-08-05 11:49 ` Jason Gunthorpe 2025-08-05 12:07 ` David Hildenbrand 2025-08-05 13:00 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2025-08-05 11:49 UTC (permalink / raw) To: David Hildenbrand Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, Aug 05, 2025 at 09:47:18AM +0200, David Hildenbrand wrote: > > There was discussion here[1] where David Hildenbrand and Jason > > Gunthorpe suggested this should be in common code and I believe there > > was some intent that this would get reused. I took this as > > endorsement from mm folks. This can certainly be pulled back into > > subsystem code. > > Yeah, we ended up here after trying to go the folio-way first, but then > realizing that code that called GUP shouldn't have to worry about > folios, just to detect consecutive pages+PFNs. > > I think this helper will can come in handy even in folio context. > I recall pointing Joanne at it in different fuse context. The scatterlist code should use it also, it is doing the same logic. > The concern is rather false positives, meaning, you want consecutive > PFNs (just like within a folio), but -- because the stars aligned -- > you get consecutive "struct page" that do not translate to consecutive PFNs. I wonder if we can address that from the other side and prevent the memory code from creating a bogus contiguous struct page in the first place so that struct page contiguity directly reflects physical contiguity? Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 11:49 ` Jason Gunthorpe @ 2025-08-05 12:07 ` David Hildenbrand 2025-08-05 12:38 ` Jason Gunthorpe 0 siblings, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 12:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 13:49, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 09:47:18AM +0200, David Hildenbrand wrote: >>> There was discussion here[1] where David Hildenbrand and Jason >>> Gunthorpe suggested this should be in common code and I believe there >>> was some intent that this would get reused. I took this as >>> endorsement from mm folks. This can certainly be pulled back into >>> subsystem code. >> >> Yeah, we ended up here after trying to go the folio-way first, but then >> realizing that code that called GUP shouldn't have to worry about >> folios, just to detect consecutive pages+PFNs. >> >> I think this helper will can come in handy even in folio context. >> I recall pointing Joanne at it in different fuse context. > > The scatterlist code should use it also, it is doing the same logic. > >> The concern is rather false positives, meaning, you want consecutive >> PFNs (just like within a folio), but -- because the stars aligned -- >> you get consecutive "struct page" that do not translate to consecutive PFNs. > > I wonder if we can address that from the other side and prevent the > memory code from creating a bogus contiguous struct page in the first > place so that struct page contiguity directly reflects physical > contiguity? Well, if we could make CONFIG_SPARSEMEM_VMEMMAP the only sparsemem option ... :) But I recall it's not that easy (e.g., 32bit). I don't see an easy way to guarantee that. E.g., populate_section_memmap really just does a kvmalloc_node() and __populate_section_memmap()->memmap_alloc() a memblock_alloc(). 
So if the stars align, the "struct page"s backing two memory sections
are contiguous, although the memory sections themselves are not
contiguous. Just imagine memory holes.

Also, I am not sure if that is really a problem worth solving right
now. If a helper on !CONFIG_SPARSEMEM_VMEMMAP is a little slower on
some operations, like nth_page(), I really don't particularly care.

Eventually, !CONFIG_SPARSEMEM_VMEMMAP will go away in some distant
future and we will all be happy.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
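[Editorial note: the false positive David describes can be modeled in userspace. Two separately allocated per-section memmap chunks may land adjacent in memory even though the PFN ranges they describe are not adjacent. All names and sizes below are invented for illustration; each toy page records its own pfn so the page-to-pfn mapping is explicit.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of sparsemem without VMEMMAP. */
#define TOY_PAGES_PER_SECTION 4

struct toy_page { unsigned long pfn; };

/* One backing buffer: the two sections' memmap chunks happen to be
 * adjacent in memory, as if two memblock allocations landed next to
 * each other. */
static struct toy_page memmap[2 * TOY_PAGES_PER_SECTION];

/* Section 0 covers pfns 0..3; section 1 covers pfns 1024..1027, with
 * a large hole (e.g. a PCI window) in between. */
static void toy_init(void)
{
	for (int i = 0; i < TOY_PAGES_PER_SECTION; i++) {
		memmap[i].pfn = i;
		memmap[TOY_PAGES_PER_SECTION + i].pfn = 1024 + i;
	}
}
```

Across the chunk boundary the struct pointers are consecutive while the pfns jump by more than one, which is exactly the case a pure pointer-comparison contiguity check would get wrong.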
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:07 ` David Hildenbrand
@ 2025-08-05 12:38   ` Jason Gunthorpe
  2025-08-05 12:41     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread

From: Jason Gunthorpe @ 2025-08-05 12:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
> I don't see an easy way to guarantee that. E.g., populate_section_memmap
> really just does a kvmalloc_node() and
> __populate_section_memmap()->memmap_alloc() a memblock_alloc().

Well, it is really easy, if you do the kvmalloc_node and you get the
single unwanted struct page value, then call it again and free the
first one. The second call is guaranteed to not return the unwanted
value because the first call has it allocated.

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 12:38 ` Jason Gunthorpe @ 2025-08-05 12:41 ` David Hildenbrand 2025-08-05 12:56 ` Jason Gunthorpe 0 siblings, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 12:41 UTC (permalink / raw) To: Jason Gunthorpe Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 14:38, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote: >> I don't see an easy way to guarantee that. E.g., populate_section_memmap >> really just does a kvmalloc_node() and >> __populate_section_memmap()->memmap_alloc() a memblock_alloc(). > > Well, it is really easy, if you do the kvmalloc_node and you get the > single unwanted struct page value, then call it again and free the > first one. The second call is guarenteed to not return the unwanted > value because the first call has it allocated. So you want to walk all existing sections to check that? :) That's the kind of approach I would describe with the words Linus used. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:41 ` David Hildenbrand
@ 2025-08-05 12:56   ` Jason Gunthorpe
  2025-08-05 13:05     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread

From: Jason Gunthorpe @ 2025-08-05 12:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 02:41:38PM +0200, David Hildenbrand wrote:
> On 05.08.25 14:38, Jason Gunthorpe wrote:
> > On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
> > > I don't see an easy way to guarantee that. E.g., populate_section_memmap
> > > really just does a kvmalloc_node() and
> > > __populate_section_memmap()->memmap_alloc() a memblock_alloc().
> >
> > Well, it is really easy, if you do the kvmalloc_node and you get the
> > single unwanted struct page value, then call it again and free the
> > first one. The second call is guaranteed to not return the unwanted
> > value because the first call has it allocated.
> 
> So you want to walk all existing sections to check that? :)

We don't need to walk; compute the page-1 and carefully run that
through the page_to_pfn algorithm.

> That's the kind of approach I would describe with the words Linus used.

It's some small boot-time nastiness; we do this all the time, messing
up the slow path so the fast paths can be simple.

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:56 ` Jason Gunthorpe
@ 2025-08-05 13:05   ` David Hildenbrand
  2025-08-05 13:15     ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread

From: David Hildenbrand @ 2025-08-05 13:05 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 14:56, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 02:41:38PM +0200, David Hildenbrand wrote:
>> On 05.08.25 14:38, Jason Gunthorpe wrote:
>>> On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
>>>> I don't see an easy way to guarantee that. E.g., populate_section_memmap
>>>> really just does a kvmalloc_node() and
>>>> __populate_section_memmap()->memmap_alloc() a memblock_alloc().
>>>
>>> Well, it is really easy, if you do the kvmalloc_node and you get the
>>> single unwanted struct page value, then call it again and free the
>>> first one. The second call is guaranteed to not return the unwanted
>>> value because the first call has it allocated.
>>
>> So you want to walk all existing sections to check that? :)
> 
> We don't need to walk, compute the page-1 and carefully run that
> through page_to_pfn algorithm.

You have to test if first_page-1 and last_page+1 are used as the memmap
for another section. So yeah, it could be done by some page_to_pfn +
pfn_to_page trickery I think.

Not that I would like that "allocate until you find something that
works" approach.

> 
>> That's the kind of approach I would describe with the words Linus used.
> 
> Its some small boot time nastyness, we do this all the time messing up
> the slow path so the fast paths can be simple

nth_page won't go away. If it were to go away, I would be sold on
almost any idea. nth_page is the real problem.

So I don't like the idea of micro-optimizing num_pages_contiguous() by
adding weird tweaks to the core for that.
-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:05 ` David Hildenbrand @ 2025-08-05 13:15 ` Linus Torvalds 2025-08-05 13:19 ` David Hildenbrand 2025-08-05 13:22 ` David Hildenbrand 0 siblings, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2025-08-05 13:15 UTC (permalink / raw) To: David Hildenbrand Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote: > > So I don't like the idea of micro-optimizing num_pages_contiguous() by > adding weird tweaks to the core for that. Seriously - take a look at that suggested sequence I posted, and tell me that it isn't *MORE* obvious than the horror that is nth_page(). Honestly, if anybody thinks nth_page() is obvious and good, I think they have some bad case of Stockholm syndrome. This isn't about micro-optimizing. This is about not writing complete garbage code that makes no sense. nth_page() is a disgusting thing that is designed to look up known-contiguous pages. That code mis-used it for *testing* for being contiguous. It may have _worked_, but it was the wrong thing to do. nth_page() in general should just not exist. I don't actually believe there is any valid reason for it. I do not believe we should actually have valid consecutive allocations of pages across sections. So please work on removing that eldritch horror, not adding new worse versions of it. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:15 ` Linus Torvalds @ 2025-08-05 13:19 ` David Hildenbrand 2025-08-05 13:22 ` David Hildenbrand 1 sibling, 0 replies; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 13:19 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:15, Linus Torvalds wrote: > On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote: >> >> So I don't like the idea of micro-optimizing num_pages_contiguous() by >> adding weird tweaks to the core for that. > > Seriously - take a look at that suggested sequence I posted, and tell > me that it isn't *MORE* obvious than the horror that is nth_page(). It is, that's not what I am talking about. I was talking about proposals to make it work in a different way as Jason proposed. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:15 ` Linus Torvalds 2025-08-05 13:19 ` David Hildenbrand @ 2025-08-05 13:22 ` David Hildenbrand 1 sibling, 0 replies; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 13:22 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:15, Linus Torvalds wrote: > On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote: >> >> So I don't like the idea of micro-optimizing num_pages_contiguous() by >> adding weird tweaks to the core for that. > > Seriously - take a look at that suggested sequence I posted, and tell > me that it isn't *MORE* obvious than the horror that is nth_page(). > > Honestly, if anybody thinks nth_page() is obvious and good, I think > they have some bad case of Stockholm syndrome. > > This isn't about micro-optimizing. This is about not writing complete > garbage code that makes no sense. > > nth_page() is a disgusting thing that is designed to look up > known-contiguous pages. That code mis-used it for *testing* for being > contiguous. It may have _worked_, but it was the wrong thing to do. > > nth_page() in general should just not exist. I don't actually believe > there is any valid reason for it. I do not believe we should actually > have valid consecutive allocations of pages across sections. Oh, just to add to that, 1 GiB folios (hugetlb, dax) are the main reason why we use it in things like folio_page(), and also why folio_page_idx() is so horrible. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05  7:47 ` David Hildenbrand
  2025-08-05 11:49   ` Jason Gunthorpe
@ 2025-08-05 13:00   ` Linus Torvalds
  2025-08-05 13:20     ` David Hildenbrand
  2025-08-05 13:25     ` Jason Gunthorpe
  1 sibling, 2 replies; 31+ messages in thread

From: Linus Torvalds @ 2025-08-05 13:00 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>
> The concern is rather false positives, meaning, you want consecutive
> PFNs (just like within a folio), but -- because the stars aligned --
> you get consecutive "struct page" that do not translate to consecutive PFNs.

So I don't think that can happen with a valid 'struct page', because
if the 'struct page's are in different sections, they will have been
allocated separately too.

So you can't have two consecutive 'struct page' things without them
being consecutive pages.

But by all means, if you want to make sure, just compare the page
sections. But converting them to a PFN and then converting back is
just crazy.

IOW, the logic would literally be something like (this assumes there
is always at least *one* page):

        struct page *page = *pages++;
        int section = page_to_section(page);

        for (size_t nr = 1; nr < nr_pages; nr++) {
                if (*pages++ != ++page)
                        break;
                if (page_to_section(page) != section)
                        break;
        }
        return nr;

and yes, I think we only define page_to_section() for
SECTION_IN_PAGE_FLAGS, but we should fix that and just have a

        #define page_to_section(pg) 0

for the other cases, and the compiler will happily optimize away the
"oh, it's always zero" case.

So something like that should actually generate reasonable code. It
*really* shouldn't try to generate a pfn (or, like that horror that I
didn't pull did, then go *back* from pfn to page)

That 'nth_page()' thing is just crazy garbage.

And even when fixed to not be garbage, I'm not convinced this needs to
be in <linux/mm.h>.

           Linus

PS. No - I didn't test the above trivial loop. It may be trivial, but
it may be buggy. Think of it as "something like this" rather than
"this is tested and does the right thing"

^ permalink raw reply	[flat|nested] 31+ messages in thread
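[Editorial note: as the PS anticipates, the sketch has one compile-level wrinkle: `nr` is declared inside the `for`, so the trailing `return nr;` is out of scope. A userspace adaptation with that fixed, and with `struct page` and `page_to_section()` mocked for illustration, behaves as intended:]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock, not the kernel's struct page: the section is stored
 * directly so page_to_section() is trivial. */
struct page { int section; };

static int page_to_section(const struct page *page)
{
	return page->section;
}

/* Count contiguous struct pages that also share a section; assumes at
 * least one page, as in the sketch above. nr is hoisted out of the for
 * loop so the final return compiles. */
static size_t num_pages_contiguous(struct page **pages, size_t nr_pages)
{
	struct page *page = *pages++;
	int section = page_to_section(page);
	size_t nr;

	for (nr = 1; nr < nr_pages; nr++) {
		if (*pages++ != ++page)
			break;
		if (page_to_section(page) != section)
			break;
	}
	return nr;
}
```

Note that a pointer-contiguous pair straddling a section boundary counts as a run of one, which is the "break here too" behavior the earlier mail argued for.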
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:00 ` Linus Torvalds
@ 2025-08-05 13:20   ` David Hildenbrand
  2025-08-05 13:24     ` David Hildenbrand
  2025-08-05 13:28     ` Linus Torvalds
  0 siblings, 2 replies; 31+ messages in thread

From: David Hildenbrand @ 2025-08-05 13:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On 05.08.25 15:00, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>>
>> The concern is rather false positives, meaning, you want consecutive
>> PFNs (just like within a folio), but -- because the stars aligned --
>> you get consecutive "struct page" that do not translate to consecutive PFNs.
> 
> So I don't think that can happen with a valid 'struct page', because
> if the 'struct page's are in different sections, they will have been
> allocated separately too.

I think you can end up with two memory sections not being consecutive,
but the struct pages being consecutive.

I assume the easiest way to achieve that is having a large-enough memory
hole (PCI hole?) that spans more than a section, and memblock just
giving you the next PFN range to use as "struct page".

It's one allocation per memory section, see

sparse_init_nid()->__populate_section_memmap(prn, PAGES_PER_SECTION) ->
memmap_alloc()->memblock_alloc().

With memory hotplug, there might be other weird ways to achieve it I
suspect.

> 
> So you can't have two consecutive 'struct page' things without them
> being consecutive pages.
> 
> But by all means, if you want to make sure, just compare the page
> sections. But converting them to a PFN and then converting back is
> just crazy.
> 
> IOW, the logic would literally be something like (this assumes there
> is always at least *one* page):
> 
>     struct page *page = *pages++;
>     int section = page_to_section(page);
> 
>     for (size_t nr = 1; nr < nr_pages; nr++) {
>         if (*pages++ != ++page)
>             break;
>         if (page_to_section(page) != section)
>             break;
>     }
>     return nr;

I think that would work, and we could limit the section check to the
problematic case only (sparsemem without VMEMMAP).

> 
> and yes, I think we only define page_to_section() for
> SECTION_IN_PAGE_FLAGS, but we should fix that and just have a
> 
>     #define page_to_section(pg) 0

Probably yes, have to think about that.

> 
> for the other cases, and the compiler will happily optimize away the
> "oh, it's always zero" case.
> 
> So something like that should actually generate reasonable code. It
> *really* shouldn't try to generate a pfn (or, like that horror that I
> didn't pull did, then go *back* from pfn to page)
> 
> That 'nth_page()' thing is just crazy garbage.

Yes, it's unfortunate garbage we have to use even in folio_page().

> 
> And even when fixed to not be garbage, I'm not convinced this needs to
> be in <linux/mm.h>.

Yeah, let's move it to mm/util.c if you agree.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
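[Editor's note: for readers who want to poke at the logic, here is a self-contained userspace rendering of the loop quoted above. It is a hedged sketch, not the kernel implementation: `mock_page` and `mock_section_of()` stand in for `struct page` and `page_to_section()`, and the `nr` declaration is hoisted out of the `for` so the trailing `return nr` actually compiles (as posted, `nr` falls out of scope — Linus's PS already warns the loop is untested).]

```c
#include <stddef.h>
#include <assert.h>

/* Stand-in for the kernel's struct page; only the section id matters here. */
struct mock_page { int section; };

/* Stand-in for page_to_section(). */
static int mock_section_of(const struct mock_page *page)
{
	return page->section;
}

/*
 * Count how many of the nr_pages entries in pages[] are consecutive
 * struct pages within a single section, starting from the first entry.
 * Mirrors the loop quoted above; assumes nr_pages >= 1.
 */
static size_t count_contiguous(struct mock_page **pages, size_t nr_pages)
{
	struct mock_page *page = *pages++;
	int section = mock_section_of(page);
	size_t nr;

	for (nr = 1; nr < nr_pages; nr++) {
		if (*pages++ != ++page)
			break;
		if (mock_section_of(page) != section)
			break;
	}
	return nr;
}
```

With consecutive entries from one array it counts the full run; a pointer gap or a section change stops the count early, mirroring the two `break` conditions in the mail.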
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:20                   ` David Hildenbrand
@ 2025-08-05 13:24                     ` David Hildenbrand
  2025-08-05 13:28                     ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On 05.08.25 15:20, David Hildenbrand wrote:
> On 05.08.25 15:00, Linus Torvalds wrote:
>> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> The concern is rather false positives, meaning, you want consecutive
>>> PFNs (just like within a folio), but -- because the stars aligned --
>>> you get consecutive "struct page" that do not translate to consecutive PFNs.
>>
>> So I don't think that can happen with a valid 'struct page', because
>> if the 'struct page's are in different sections, they will have been
>> allocated separately too.
> 
> I think you can end up with two memory sections not being consecutive,
> but the struct pages being consecutive.
> 
> I assume the easiest way to achieve that is having a large-enough memory
> hole (PCI hole?) that spans more than one section, and memblock just
> giving you the next PFN range to use as "struct page".
> 
> It's one allocation per memory section, see
> 
> sparse_init_nid()->__populate_section_memmap(pfn, PAGES_PER_SECTION) ->
> memmap_alloc()->memblock_alloc().
> 
> With memory hotplug, there might be other weird ways to achieve it I
> suspect.
> 
>>
>> So you can't have two consecutive 'struct page' things without them
>> being consecutive pages.
>>
>> But by all means, if you want to make sure, just compare the page
>> sections. But converting them to a PFN and then converting back is
>> just crazy.
>>
>> IOW, the logic would literally be something like (this assumes there
>> is always at least *one* page):
>>
>>     struct page *page = *pages++;
>>     int section = page_to_section(page);
>>
>>     for (size_t nr = 1; nr < nr_pages; nr++) {
>>         if (*pages++ != ++page)
>>             break;
>>         if (page_to_section(page) != section)
>>             break;
>>     }
>>     return nr;
> 
> I think that would work, and we could limit the section check to the
> problematic case only (sparsemem without VMEMMAP).
> 
>>
>> and yes, I think we only define page_to_section() for
>> SECTION_IN_PAGE_FLAGS, but we should fix that and just have a
>>
>>     #define page_to_section(pg) 0
> 
> Probably yes, have to think about that.

Ah,

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif

so it will already be optimized out with your proposal, great!

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:20 ` David Hildenbrand 2025-08-05 13:24 ` David Hildenbrand @ 2025-08-05 13:28 ` Linus Torvalds 2025-08-05 13:37 ` David Hildenbrand 1 sibling, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2025-08-05 13:28 UTC (permalink / raw) To: David Hildenbrand Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com, Jason Gunthorpe On Tue, 5 Aug 2025 at 16:20, David Hildenbrand <david@redhat.com> wrote: > > I think that would work, and we could limit the section check to the > problematic case only (sparsemem without VMEMMAP). We really don't need to, because unlike the nth_page() thing, the compiler can see the logic and see "it's always zero". And in the complex case (ie actual sparsemem without VMEMMAP), the page_section() test is at least trivial, unlike the whole "turn it into a pfn and back". Because that "turn it into a pfn and back" is actually a really quite complicated operation (and the compiler won't be able to optimize that one much, so I'm pretty sure it generates horrific code). I wish we didn't have nth_page() at all. I really don't think it's a valid operation. It's been around forever, but I think it was broken as introduced, exactly because I don't think you can validly even have allocations that cross section boundaries. So the only possible reason for nth_page() is that you tried to combine two such allocations into one, and you simply shouldn't do that in the first place. The VM layer can't free them as one allocation anyway, so the only possible use-case is for some broken "let's optimize this IO into one bigger chunk". So people should actively get rid of that, not add to the brokenness. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:28                     ` Linus Torvalds
@ 2025-08-05 13:37                       ` David Hildenbrand
  2025-08-05 13:49                         ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On 05.08.25 15:28, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:20, David Hildenbrand <david@redhat.com> wrote:
>>
>> I think that would work, and we could limit the section check to the
>> problematic case only (sparsemem without VMEMMAP).
> 
> We really don't need to, because unlike the nth_page() thing, the
> compiler can see the logic and see "it's always zero".

Yeah, realized that later.

> 
> And in the complex case (ie actual sparsemem without VMEMMAP), the
> page_section() test is at least trivial, unlike the whole "turn it
> into a pfn and back".
> 
> Because that "turn it into a pfn and back" is actually a really quite
> complicated operation (and the compiler won't be able to optimize that
> one much, so I'm pretty sure it generates horrific code).

Yes, that's why I hate folio_page_idx() so much on !VMEMMAP

#define folio_page_idx(folio, p)	(page_to_pfn(p) - folio_pfn(folio))

> 
> I wish we didn't have nth_page() at all. I really don't think it's a
> valid operation. It's been around forever, but I think it was broken
> as introduced, exactly because I don't think you can validly even have
> allocations that cross section boundaries.

Ordinary buddy allocations cannot exceed a memory section, but hugetlb
and dax can with gigantic folios ... :(

We had some weird bugs with that, because people keep forgetting that
you cannot just use page++ unconditionally with such folios.

Anyhow, thanks Linus!

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:37 ` David Hildenbrand @ 2025-08-05 13:49 ` Linus Torvalds 0 siblings, 0 replies; 31+ messages in thread From: Linus Torvalds @ 2025-08-05 13:49 UTC (permalink / raw) To: David Hildenbrand Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com, Jason Gunthorpe On Tue, 5 Aug 2025 at 16:37, David Hildenbrand <david@redhat.com> wrote: > > Ordinary buddy allocations cannot exceed a memory section, but hugetlb and > dax can with gigantic folios ... :( Just turn that code off. Nobody sane cares. It sounds like people have bent over backwards to fix the insane case instead of saying "that's insane, let's not support it". And yes, "that's insane" is actually fairly recent. It's not that long ago that we made SPARSEMEM_VMEMMAP the mandatory option on x86-64. So it was all sane in a historical context, but it's not sane any more. But now it *is* the mandatory option both on x86 and arm64, so I really think it's time to get rid of pointless pain points. (I think powerpc still makes it an option to do sparsemem without vmemmap, but it *is* an option there too) Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:00 ` Linus Torvalds 2025-08-05 13:20 ` David Hildenbrand @ 2025-08-05 13:25 ` Jason Gunthorpe 2025-08-05 13:33 ` David Hildenbrand 2025-08-05 13:36 ` Linus Torvalds 1 sibling, 2 replies; 31+ messages in thread From: Jason Gunthorpe @ 2025-08-05 13:25 UTC (permalink / raw) To: Linus Torvalds Cc: David Hildenbrand, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, Aug 05, 2025 at 04:00:53PM +0300, Linus Torvalds wrote: > On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote: > > > > The concern is rather false positives, meaning, you want consecutive > > PFNs (just like within a folio), but -- because the stars aligned -- > > you get consecutive "struct page" that do not translate to consecutive PFNs. > > So I don't think that can happen with a valid 'struct page', because > if the 'struct page's are in different sections, they will have been > allocated separately too. This is certainly true for the CONFIG_SPARSEMEM_VMEMMAP case, but in the other cases I thought we end up with normal allocations for struct page? This is what David was talking about. So then we are afraid of this: a = kvmalloc_array(nelms_a); b = kvmalloc_array(nelms_b); assert(a + nelms_a != b) I thought this was possible with our allocator, especially vmemmap? David, there is another alternative to prevent this, simple though a bit wasteful, just allocate a bit bigger to ensure the allocation doesn't end on an exact PAGE_SIZE boundary? Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
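[Editor's note: the failure mode Jason sketches — two independent allocations that happen to abut, so `a + nelms_a == b` — can be made concrete by carving two "allocations" out of a single arena. This is a deliberately constructed illustration of the false-positive shape, not a claim about what kvmalloc or memblock returns on any given run.]

```c
#include <stdbool.h>
#include <stddef.h>

static int arena[8];
static int *const alloc_a = &arena[0];	/* pretend: allocation #1, 4 elements */
static int *const alloc_b = &arena[4];	/* pretend: allocation #2, 4 elements */

/*
 * The naive contiguity check: true when b starts exactly where a ends.
 * For the two "allocations" above it reports true even though they are
 * logically unrelated -- the false positive the mail's assert() is
 * about, transplanted to consecutive struct pages from separate
 * memmap allocations.
 */
static bool looks_contiguous(const int *a, size_t nelms_a, const int *b)
{
	return a + nelms_a == b;
}
```

This is why pointer adjacency alone is not enough, and why the section comparison (or padding the allocations so they can never abut exactly) is being discussed.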
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:25 ` Jason Gunthorpe @ 2025-08-05 13:33 ` David Hildenbrand 2025-08-05 13:55 ` Jason Gunthorpe 2025-08-05 13:36 ` Linus Torvalds 1 sibling, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 13:33 UTC (permalink / raw) To: Jason Gunthorpe, Linus Torvalds Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:25, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 04:00:53PM +0300, Linus Torvalds wrote: >> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote: >>> >>> The concern is rather false positives, meaning, you want consecutive >>> PFNs (just like within a folio), but -- because the stars aligned -- >>> you get consecutive "struct page" that do not translate to consecutive PFNs. >> >> So I don't think that can happen with a valid 'struct page', because >> if the 'struct page's are in different sections, they will have been >> allocated separately too. > > This is certainly true for the CONFIG_SPARSEMEM_VMEMMAP case, but in > the other cases I thought we end up with normal allocations for struct > page? This is what David was talking about. > > So then we are afraid of this: > > a = kvmalloc_array(nelms_a); > b = kvmalloc_array(nelms_b); > > assert(a + nelms_a != b) > > I thought this was possible with our allocator, especially vmemmap? > > David, there is another alternative to prevent this, simple though a > bit wasteful, just allocate a bit bigger to ensure the allocation > doesn't end on an exact PAGE_SIZE boundary? :/ in particular doing that through the memblock in sparse_init_nid(), I am not so sure that's a good idea. I prefer Linus' proposal and avoids the one nth_page(), unless any other approach can help us get rid of more nth_page() usage -- and I don't think your proposal could, right? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:33 ` David Hildenbrand @ 2025-08-05 13:55 ` Jason Gunthorpe 2025-08-05 14:10 ` David Hildenbrand 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2025-08-05 13:55 UTC (permalink / raw) To: David Hildenbrand Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote: > > David, there is another alternative to prevent this, simple though a > > bit wasteful, just allocate a bit bigger to ensure the allocation > > doesn't end on an exact PAGE_SIZE boundary? > > :/ in particular doing that through the memblock in sparse_init_nid(), I am > not so sure that's a good idea. It would probably be some work to make larger allocations to avoid padding :\ > I prefer Linus' proposal and avoids the one nth_page(), unless any other > approach can help us get rid of more nth_page() usage -- and I don't think > your proposal could, right? If the above were solved - so the struct page allocations could be larger than a section, arguably just the entire range being plugged, then I think you also solve the nth_page() problem too. Effectively the nth_page() problem is that we allocate the struct page arrays on an arbitary section-by-section basis, and then the arch sets MAX_ORDER so that a folio can cross sections, effectively guaranteeing to virtually fragment the page *'s inside folios. Doing a giant vmalloc at the start so you could also cheaply add some padding would effectively also prevent the nth_page problem as we can reasonably say that no folio should extend past an entire memory region. Maybe there is some reason we can't do a giant vmalloc on these systems that also can't do SPARSE_VMMEMAP :\ But perhaps we could get up to MAX_ORDER at least? Or perhaps we could have those systems reduce MAX_ORDER? So, I think they are lightly linked problems. 
I suppose this is also a limitation with Linus's suggestion. It
doesn't give the optimal answer for 1G pages on these older systems:

    for (size_t nr = 1; nr < nr_pages; nr++) {
        if (*pages++ != ++page)
            break;

Since that will exit every section.

At least for scatterlist like cases the point of this function is just
to speed things up. If it returns short the calling code should still
be directly checking phys_addr contiguity anyhow. Something for the
kdoc I suppose.

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:55 ` Jason Gunthorpe @ 2025-08-05 14:10 ` David Hildenbrand 2025-08-05 14:20 ` Jason Gunthorpe 0 siblings, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 14:10 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:55, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote: > >>> David, there is another alternative to prevent this, simple though a >>> bit wasteful, just allocate a bit bigger to ensure the allocation >>> doesn't end on an exact PAGE_SIZE boundary? >> >> :/ in particular doing that through the memblock in sparse_init_nid(), I am >> not so sure that's a good idea. > > It would probably be some work to make larger allocations to avoid > padding :\ > >> I prefer Linus' proposal and avoids the one nth_page(), unless any other >> approach can help us get rid of more nth_page() usage -- and I don't think >> your proposal could, right? > > If the above were solved - so the struct page allocations could be > larger than a section, arguably just the entire range being plugged, > then I think you also solve the nth_page() problem too. > Effectively the nth_page() problem is that we allocate the struct page > arrays on an arbitary section-by-section basis, and then the arch sets > MAX_ORDER so that a folio can cross sections, effectively guaranteeing > to virtually fragment the page *'s inside folios. > > Doing a giant vmalloc at the start so you could also cheaply add some > padding would effectively also prevent the nth_page problem as we can > reasonably say that no folio should extend past an entire memory > region. > > Maybe there is some reason we can't do a giant vmalloc on these > systems that also can't do SPARSE_VMMEMAP :\ But perhaps we could get > up to MAX_ORDER at least? 
> Or perhaps we could have those systems
> reduce MAX_ORDER?
> 
> So, I think they are lightly linked problems.

There are some weird scenarios where you hotplug memory after boot
memory, and suddenly you can runtime-allocate a gigantic folio that
spans both ranges etc.

So while related, the corner cases are all a bit nasty, and just
forbidding folios to span a memory section on these problematic
configs (sparse !vmemmap) sounds interesting.

As Linus said, x86-64 and arm64 are already VMEMMAP-only. s390x allows
for gigantic folios, and VMEMMAP might still be configurable. Same for
ppc at least. Not sure about riscv and others, will have to dig.

That way we could just naturally make folio_page() and
folio_page_idx() simpler. (and some GUP code IIRC as well where we
still have to use nth_page)

> 
> I suppose this is also a limitation with Linus's suggestion. It
> doesn't give the optimal answer for 1G pages on these older systems:
> 
>     for (size_t nr = 1; nr < nr_pages; nr++) {
>         if (*pages++ != ++page)
>             break;
> 
> Since that will exit every section.

Yes. If folios can no longer span a section in these configs, then
we'd be good if we stop when we cross a section. We'd still always
cover the full folio.

> 
> At least for scatterlist like cases the point of this function is just
> to speed things up. If it returns short the calling code should still
> be directly checking phys_addr contiguity anyhow.

Same for vfio I think.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
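[Editor's note: the "gigantic folio spans sections" case is simple PFN arithmetic. A hedged sketch with hardcoded x86-style constants (4 KiB pages, 128 MiB sections — other architectures differ), showing that a 1 GiB folio always straddles section boundaries while an aligned 2 MiB one never does:]

```c
#include <stdbool.h>

#define PAGE_SHIFT		12	/* 4 KiB pages */
#define SECTION_SIZE_BITS	27	/* 128 MiB sections, x86-style */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)

/* Does a folio of 2^order pages starting at pfn cross a section boundary? */
static bool folio_crosses_section(unsigned long pfn, unsigned int order)
{
	unsigned long last_pfn = pfn + (1UL << order) - 1;

	return (pfn >> PFN_SECTION_SHIFT) != (last_pfn >> PFN_SECTION_SHIFT);
}
```

On sparsemem without vmemmap this is exactly why the quoted loop stops partway through a 1G page: the section id changes even though the PFNs stay contiguous.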
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 14:10 ` David Hildenbrand @ 2025-08-05 14:20 ` Jason Gunthorpe 2025-08-05 14:22 ` David Hildenbrand 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2025-08-05 14:20 UTC (permalink / raw) To: David Hildenbrand Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote: > There are some weird scenarios where you hotplug memory after boot memory, > and suddenly you can runtime-allocate a gigantic folio that spans both > ranges etc. I was thinking we'd forbid this directly, but yes it is a another new check. > So while related, the corner cases are all a bit nasty, and just forbidding > folios to span a memory section on these problematic configs (sparse > !vmemmap) sounds interesting. Indeed, this just sounds like forcing MAX_ORDER to be no larger than the section size for this old mode? Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 14:20 ` Jason Gunthorpe @ 2025-08-05 14:22 ` David Hildenbrand 2025-08-05 14:24 ` Jason Gunthorpe 0 siblings, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 14:22 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 16:20, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote: >> There are some weird scenarios where you hotplug memory after boot memory, >> and suddenly you can runtime-allocate a gigantic folio that spans both >> ranges etc. > > I was thinking we'd forbid this directly, but yes it is a another new > check. > >> So while related, the corner cases are all a bit nasty, and just forbidding >> folios to span a memory section on these problematic configs (sparse >> !vmemmap) sounds interesting. > > Indeed, this just sounds like forcing MAX_ORDER to be no larger than > the section size for this old mode? MAX_ORDER is always limited to the section size already. MAX_ORDER is only about buddy allocations. What hugetlb and dax do is independent of MAX_ORDER. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 14:22 ` David Hildenbrand @ 2025-08-05 14:24 ` Jason Gunthorpe 2025-08-05 14:26 ` David Hildenbrand 0 siblings, 1 reply; 31+ messages in thread From: Jason Gunthorpe @ 2025-08-05 14:24 UTC (permalink / raw) To: David Hildenbrand Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, Aug 05, 2025 at 04:22:32PM +0200, David Hildenbrand wrote: > On 05.08.25 16:20, Jason Gunthorpe wrote: > > On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote: > > > There are some weird scenarios where you hotplug memory after boot memory, > > > and suddenly you can runtime-allocate a gigantic folio that spans both > > > ranges etc. > > > > I was thinking we'd forbid this directly, but yes it is a another new > > check. > > > > > So while related, the corner cases are all a bit nasty, and just forbidding > > > folios to span a memory section on these problematic configs (sparse > > > !vmemmap) sounds interesting. > > > > Indeed, this just sounds like forcing MAX_ORDER to be no larger than > > the section size for this old mode? > > MAX_ORDER is always limited to the section size already. > > MAX_ORDER is only about buddy allocations. What hugetlb and dax do is > independent of MAX_ORDER. Oh I thought it limited folios too. Still same idea is to have a MAX_FOLIO_ORDER for that case. Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 14:24 ` Jason Gunthorpe @ 2025-08-05 14:26 ` David Hildenbrand 0 siblings, 0 replies; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 14:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 16:24, Jason Gunthorpe wrote: > On Tue, Aug 05, 2025 at 04:22:32PM +0200, David Hildenbrand wrote: >> On 05.08.25 16:20, Jason Gunthorpe wrote: >>> On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote: >>>> There are some weird scenarios where you hotplug memory after boot memory, >>>> and suddenly you can runtime-allocate a gigantic folio that spans both >>>> ranges etc. >>> >>> I was thinking we'd forbid this directly, but yes it is a another new >>> check. >>> >>>> So while related, the corner cases are all a bit nasty, and just forbidding >>>> folios to span a memory section on these problematic configs (sparse >>>> !vmemmap) sounds interesting. >>> >>> Indeed, this just sounds like forcing MAX_ORDER to be no larger than >>> the section size for this old mode? >> >> MAX_ORDER is always limited to the section size already. >> >> MAX_ORDER is only about buddy allocations. What hugetlb and dax do is >> independent of MAX_ORDER. > > Oh I thought it limited folios too. > > Still same idea is to have a MAX_FOLIO_ORDER for that case. Yes. Usually unlimited, except we are in this weird config state. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:25                   ` Jason Gunthorpe
  2025-08-05 13:33                     ` David Hildenbrand
@ 2025-08-05 13:36                     ` Linus Torvalds
  2025-08-05 13:47                       ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, 5 Aug 2025 at 16:26, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> David, there is another alternative to prevent this, simple though a
> bit wasteful, just allocate a bit bigger to ensure the allocation
> doesn't end on an exact PAGE_SIZE boundary?

So I don't mind adding a check for "page_section()", because at least
that makes sense.

But yes, it would also probably be a good idea to try to minimize
SPARSEMEM without VMEMMAP. I'd love to get rid of it entirely, of
course, but even if that isn't possible, I'd *really* just like people
to try to make sure that it's never a valid thing to try to combine
memory across different sections.

David mentioned the 1GB hugepage folios, and I really thought that
even *those* were all in one section. They *should* be.

Do we have any relevant architectures that still do SPARSEMEM without
VMEMMAP? Because if it's purely some "legacy architecture" thing (ie
x86-32), how about just saying "no 1GB hugepages for you".

Because that whole SPARSEMEM without VMEMMAP is certainly painful even
outside of nth_page, and minimizing the pain sounds sane.

              Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:36 ` Linus Torvalds @ 2025-08-05 13:47 ` David Hildenbrand 2025-08-05 13:51 ` Linus Torvalds 0 siblings, 1 reply; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 13:47 UTC (permalink / raw) To: Linus Torvalds, Jason Gunthorpe Cc: Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:36, Linus Torvalds wrote: > On Tue, 5 Aug 2025 at 16:26, Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> David, there is another alternative to prevent this, simple though a >> bit wasteful, just allocate a bit bigger to ensure the allocation >> doesn't end on an exact PAGE_SIZE boundary? > > So I don't mind adding a check for "page_section()", because at least > that makes sense. > > But yes, it would also probably be a good idea to try to minimize > SPARSEMEM without VMEMMAP. I'd love to get rid of it entirely, of > course, but even if that isn't possible, I'd *really* just like people > to try to make sure that it's neve ra valid thing to try to combine > memory across different sections. > > David mentioned the 1GB hugepage folios, and I really thought that > even *those* were all in one section. They *should* be. The memory section size on x86 is always 128 MiB. Even with SPARSEMEM. There are weird interactions between memory section size and memory hotplug / DAX, so we try to keep it small'ish. It's more that we don't care that much about memory section size with SPARSEMEM because nth_page() and everything around that is just plain simple. > > Do we have any relevant architectures that still do SPARSEMEM without > VMEMMAP? Because if it's purely some "legacy architecture" thing (ie > x86-32), how about just saying "no 1GB hugepages for you". 
arch/arm64/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE
arch/loongarch/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE
arch/powerpc/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE
arch/riscv/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE if 64BIT
arch/s390/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE
arch/sparc/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE
arch/x86/Kconfig:	select SPARSEMEM_VMEMMAP_ENABLE if X86_64

But SPARSEMEM_VMEMMAP is still user-selectable.

I would assume SPARSEMEM_VMEMMAP_ENABLE support would cover most
hugetlb + dax users indeed, at least when it comes to gigantic folios.

Would have to figure out why someone would want to disable it (limited
vspace? but that should also not really be a problem on 64bit I think).

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:47 ` David Hildenbrand @ 2025-08-05 13:51 ` Linus Torvalds 2025-08-05 13:55 ` David Hildenbrand 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2025-08-05 13:51 UTC (permalink / raw) To: David Hildenbrand Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On Tue, 5 Aug 2025 at 16:47, David Hildenbrand <david@redhat.com> wrote: > > arch/x86/Kconfig: select SPARSEMEM_VMEMMAP_ENABLE if X86_64 > > But SPARSEMEM_VMEMMAP is still user-selectable. I think you missed this confusion on x86: select SPARSEMEM_VMEMMAP if X86_64 IOW, that SPARSEMEM_VMEMMAP_ENABLE is entirely historical, I think, and it's unconditional these days. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [GIT PULL] VFIO updates for v6.17-rc1 2025-08-05 13:51 ` Linus Torvalds @ 2025-08-05 13:55 ` David Hildenbrand 0 siblings, 0 replies; 31+ messages in thread From: David Hildenbrand @ 2025-08-05 13:55 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, lizhe.67@bytedance.com On 05.08.25 15:51, Linus Torvalds wrote: > On Tue, 5 Aug 2025 at 16:47, David Hildenbrand <david@redhat.com> wrote: >> >> arch/x86/Kconfig: select SPARSEMEM_VMEMMAP_ENABLE if X86_64 >> >> But SPARSEMEM_VMEMMAP is still user-selectable. > > I think you missed this confusion on x86: Yeah ... > > select SPARSEMEM_VMEMMAP if X86_64 > > IOW, that SPARSEMEM_VMEMMAP_ENABLE is entirely historical, I think, > and it's unconditional these days. Same for arm64. The other seem to still allow for configuring it. Probably we could really just force-enable it for the others that support it. Then fence of gigantic folios and remove most of the nth_page hackery when we're working within a folio. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2025-08-05 14:26 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-04 22:22 [GIT PULL] VFIO updates for v6.17-rc1 Alex Williamson 2025-08-04 23:55 ` Linus Torvalds 2025-08-05 0:53 ` Alex Williamson 2025-08-05 7:47 ` David Hildenbrand 2025-08-05 11:49 ` Jason Gunthorpe 2025-08-05 12:07 ` David Hildenbrand 2025-08-05 12:38 ` Jason Gunthorpe 2025-08-05 12:41 ` David Hildenbrand 2025-08-05 12:56 ` Jason Gunthorpe 2025-08-05 13:05 ` David Hildenbrand 2025-08-05 13:15 ` Linus Torvalds 2025-08-05 13:19 ` David Hildenbrand 2025-08-05 13:22 ` David Hildenbrand 2025-08-05 13:00 ` Linus Torvalds 2025-08-05 13:20 ` David Hildenbrand 2025-08-05 13:24 ` David Hildenbrand 2025-08-05 13:28 ` Linus Torvalds 2025-08-05 13:37 ` David Hildenbrand 2025-08-05 13:49 ` Linus Torvalds 2025-08-05 13:25 ` Jason Gunthorpe 2025-08-05 13:33 ` David Hildenbrand 2025-08-05 13:55 ` Jason Gunthorpe 2025-08-05 14:10 ` David Hildenbrand 2025-08-05 14:20 ` Jason Gunthorpe 2025-08-05 14:22 ` David Hildenbrand 2025-08-05 14:24 ` Jason Gunthorpe 2025-08-05 14:26 ` David Hildenbrand 2025-08-05 13:36 ` Linus Torvalds 2025-08-05 13:47 ` David Hildenbrand 2025-08-05 13:51 ` Linus Torvalds 2025-08-05 13:55 ` David Hildenbrand