linux-kernel.vger.kernel.org archive mirror
* [GIT PULL] VFIO updates for v6.17-rc1
@ 2025-08-04 22:22 Alex Williamson
  2025-08-04 23:55 ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2025-08-04 22:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Linus,

The following changes since commit d7b8f8e20813f0179d8ef519541a3527e7661d3a:

  Linux 6.16-rc5 (2025-07-06 14:10:26 -0700)

are available in the Git repository at:

  https://github.com/awilliam/linux-vfio.git tags/vfio-v6.17-rc1

for you to fetch changes up to d2272d898186fa96c6fa4ad067219b45b8c2ef5f:

  vfio/type1: correct logic of vfio_find_vpfn() (2025-07-27 08:20:31 -0600)

----------------------------------------------------------------
VFIO updates for v6.17-rc1

 - Fix imbalance where the no-iommu/cdev device path skips too much
   on open, failing to increment a reference, but still decrements
   the reference on close.  Add bounds checking to prevent such
   underflows. (Jacob Pan)

 - Fill missing detach_ioas op for pds_vfio_pci, fixing probe failure
   when used with IOMMUFD. (Brett Creeley)

 - Split SR-IOV VFs to separate dev_set, avoiding unnecessary
   serialization between VFs that appear on the same bus.
   (Alex Williamson)

 - Fix a theoretical integer overflow in the mlx5-vfio-pci variant
   driver. (Artem Sadovnikov)

 - Use batching on contiguous pages to significantly improve DMA map
   and unmap performance when using hugetlbfs or THP backed memory
   with the legacy vfio type1 IOMMU backend. (Li Zhe)

 - Implement missing VF token checking support via vfio cdev/IOMMUFD
   interface. (Jason Gunthorpe)

 - Update QAT vfio-pci variant driver to claim latest VF devices.
   (Małgorzata Mielnik)

 - Add a cond_resched() call to avoid holding the CPU too long during
   DMA mapping operations. (Keith Busch)

----------------------------------------------------------------
Alex Williamson (1):
      vfio/pci: Separate SR-IOV VF dev_set

Artem Sadovnikov (1):
      vfio/mlx5: fix possible overflow in tracking max message size

Brett Creeley (1):
      vfio/pds: Fix missing detach_ioas op

Jacob Pan (2):
      vfio: Fix unbalanced vfio_df_close call in no-iommu mode
      vfio: Prevent open_count decrement to negative

Jason Gunthorpe (1):
      vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD

Keith Busch (1):
      vfio/type1: conditional rescheduling while pinning

Li Zhe (6):
      mm: introduce num_pages_contiguous()
      vfio/type1: optimize vfio_pin_pages_remote()
      vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()
      vfio/type1: introduce a new member has_rsvd for struct vfio_dma
      vfio/type1: optimize vfio_unpin_pages_remote()
      vfio/type1: correct logic of vfio_find_vpfn()

Małgorzata Mielnik (1):
      vfio/qat: add support for intel QAT 6xxx virtual functions

Xin Zeng (1):
      vfio/qat: Remove myself from VFIO QAT PCI driver maintainers

 MAINTAINERS                                    |   1 -
 drivers/vfio/device_cdev.c                     |  38 +++++++-
 drivers/vfio/group.c                           |   7 +-
 drivers/vfio/iommufd.c                         |   4 +
 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c |   1 +
 drivers/vfio/pci/mlx5/cmd.c                    |   4 +-
 drivers/vfio/pci/mlx5/main.c                   |   1 +
 drivers/vfio/pci/nvgrace-gpu/main.c            |   2 +
 drivers/vfio/pci/pds/vfio_dev.c                |   2 +
 drivers/vfio/pci/qat/main.c                    |   5 +-
 drivers/vfio/pci/vfio_pci.c                    |   1 +
 drivers/vfio/pci/vfio_pci_core.c               |  24 +++--
 drivers/vfio/pci/virtio/main.c                 |   3 +
 drivers/vfio/vfio_iommu_type1.c                | 118 ++++++++++++++++++++-----
 drivers/vfio/vfio_main.c                       |   3 +-
 include/linux/mm.h                             |  23 +++++
 include/linux/vfio.h                           |   4 +
 include/linux/vfio_pci_core.h                  |   2 +
 include/uapi/linux/vfio.h                      |  12 ++-
 19 files changed, 212 insertions(+), 43 deletions(-)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-04 22:22 [GIT PULL] VFIO updates for v6.17-rc1 Alex Williamson
@ 2025-08-04 23:55 ` Linus Torvalds
  2025-08-05  0:53   ` Alex Williamson
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2025-08-04 23:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> Li Zhe (6):
>       mm: introduce num_pages_contiguous()

WHY?

There is exactly *ONE* user, why the heck do we introduce this
completely pointless "helper" function, and put it in a core header
file like it was worth it?

And it's not like that code is some kind of work of art that we want
to expose everybody to *anyway*. It's written in a particularly stupid
way that means that it's *way* more expensive than it needs to be.

And then it's made "inline" despite the code generation being
horrible, which makes it all entirely pointless.

Yes, I'm grumpy. This pull request came in late, I'm already
traveling, and then I look at it and it just makes me *angry* at how
bad that code is, and how annoying it is.

My builds are already slower than usual because they happen on my
laptop while traveling, I do *not* need to see this kind of absolutely
disgusting code that does stupid things that make the build even
slower.

So I refuse to pull this kind of crap.

If you insist on making my build slower and exposing these kinds of
helper functions, they had better be *good* helper functions.

Hint: absolutely nobody cares about "the pages crossed a sparsemem
border". If your driver cares about the number of contiguous pages, it
might as well say "yeah, they are contiguous, but they are in
different sparsemem chunks, so we'll break here too".

And at that point all you care about is 'struct page' being
contiguous, instead of doing that disgusting 'nth_page'.

And then - since there is only *one* single user - you don't put it in
the most central header file that EVERYBODY ELSE cares about.

And you absolutely don't do it if it generates garbage code for no good reason!

            Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-04 23:55 ` Linus Torvalds
@ 2025-08-05  0:53   ` Alex Williamson
  2025-08-05  7:47     ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2025-08-05  0:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe, David Hildenbrand

On Mon, 4 Aug 2025 16:55:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > Li Zhe (6):
> >       mm: introduce num_pages_contiguous()  
> 
> WHY?
> 
> There is exactly *ONE* user, why the heck do we introduce this
> completely pointless "helper" function, and put it in a core header
> file like it was worth it?

There was discussion here[1] where David Hildenbrand and Jason
Gunthorpe suggested this should be in common code and I believe there
was some intent that this would get reused.  I took this as
endorsement from mm folks.  This can certainly be pulled back into
subsystem code.

> And it's not like that code is some kind of work of art that we want
> to expose everybody to *anyway*. It's written in a particularly stupid
> way that means that it's *way* more expensive than it needs to be.
> 
> And then it's made "inline" despite the code generation being
> horrible, which makes it all entirely pointless.
> 
> Yes, I'm grumpy. This pull request came in late, I'm already
> traveling, and then I look at it and it just makes me *angry* at how
> bad that code is, and how annoying it is.

Sorry, I usually try to get in later during the first week to let the
dust settle a bit from the bigger subsystems, I guess I'm running a
little behind this cycle.  We'll get it fixed and I'll resend.  Thanks,

Alex

> My builds are already slower than usual because they happen on my
> laptop while traveling, I do *not* need to see this kind of absolutely
> disgusting code that does stupid things that make the build even
> slower.
> 
> So I refuse to pull this kind of crap.
> 
> If you insist on making my build slower and exposing these kinds of
> helper functions, they had better be *good* helper functions.
> 
> Hint: absolutely nobody cares about "the pages crossed a sparsemem
> border". If your driver cares about the number of contiguous pages, it
> might as well say "yeah, they are contiguous, but they are in
> different sparsemem chunks, so we'll break here too".
> 
> And at that point all you care about is 'struct page' being
> contiguous, instead of doing that disgusting 'nth_page'.
> 
> And then - since there is only *one* single user - you don't put it in
> the most central header file that EVERYBODY ELSE cares about.
> 
> And you absolutely don't do it if it generates garbage code for no good reason!
> 
>             Linus
> 

[1]https://lore.kernel.org/all/20250703111216.GG904431@ziepe.ca/



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05  0:53   ` Alex Williamson
@ 2025-08-05  7:47     ` David Hildenbrand
  2025-08-05 11:49       ` Jason Gunthorpe
  2025-08-05 13:00       ` Linus Torvalds
  0 siblings, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05  7:47 UTC (permalink / raw)
  To: Alex Williamson, Linus Torvalds
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	lizhe.67@bytedance.com, Jason Gunthorpe

On 05.08.25 02:53, Alex Williamson wrote:
> On Mon, 4 Aug 2025 16:55:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Mon, 4 Aug 2025 at 15:22, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>> Li Zhe (6):
>>>        mm: introduce num_pages_contiguous()
>>
>> WHY?
>>
>> There is exactly *ONE* user, why the heck do we introduce this
>> completely pointless "helper" function, and put it in a core header
>> file like it was worth it?
> 
> There was discussion here[1] where David Hildenbrand and Jason
> Gunthorpe suggested this should be in common code and I believe there
> was some intent that this would get reused.  I took this as
> endorsement from mm folks.  This can certainly be pulled back into
> subsystem code.

Yeah, we ended up here after trying to go the folio-way first, but then
realizing that code that called GUP shouldn't have to worry about
folios, just to detect consecutive pages+PFNs.

I think this helper can come in handy even in folio context.
I recall pointing Joanne at it in a different fuse context.

> 
>> And it's not like that code is some kind of work of art that we want
>> to expose everybody to *anyway*. It's written in a particularly stupid
>> way that means that it's *way* more expensive than it needs to be.
>>
>> And then it's made "inline" despite the code generation being
>> horrible, which makes it all entirely pointless.
>>
>> Yes, I'm grumpy. This pull request came in late, I'm already
>> traveling, and then I look at it and it just makes me *angry* at how
>> bad that code is, and how annoying it is.
> 
> Sorry, I usually try to get in later during the first week to let the
> dust settle a bit from the bigger subsystems, I guess I'm running a
> little behind this cycle.  We'll get it fixed and I'll resend.  Thanks,
> 
> Alex
> 
>> My builds are already slower than usual because they happen on my
>> laptop while traveling, I do *not* need to see this kind of absolutely
>> disgusting code that does stupid things that make the build even
>> slower.
>>
>> So I refuse to pull this kind of crap.
>>
>> If you insist on making my build slower and exposing these kinds of
>> helper functions, they had better be *good* helper functions.
>>
>> Hint: absolutely nobody cares about "the pages crossed a sparsemem
>> border". If your driver cares about the number of contiguous pages, it
>> might as well say "yeah, they are contiguous, but they are in
>> different sparsemem chunks, so we'll break here too".

The concern is rather false positives, meaning, you want consecutive
PFNs (just like within a folio), but -- because the stars aligned --
you get consecutive "struct page" that do not translate to consecutive PFNs.

So that's why the nth page stuff is not optional in the current
implementation as far as I can tell.

And because that nth_page stuff is so tricky and everyone gets it wrong
all the time, I am actually in favor of having such a helper around. not
buried in some subsystem.

>>
>> And at that point all you care about is 'struct page' being
>> contiguous, instead of doing that disgusting 'nth_page'.

I think stopping when we hit the end of a memory section in case of
!CONFIG_SPARSEMEM_VMEMMAP could be done and documented.

It's just that ... code gets more complicated: we end up really only
optimizing for the unloved child, sparsemem without CONFIG_SPARSEMEM_VMEMMAP:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d4ee569aa6b6..f080fa5a68d4a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,9 @@ static inline unsigned long page_to_section(const struct page *page)
   * the memory model, this can mean that the addresses of the "struct page"s
   * are not contiguous.
   *
+ * On sparsemem configs without CONFIG_SPARSEMEM_VMEMMAP, we will stop once we
+ * hit a memory section boundary.
+ *
   * @pages: an array of page pointers
   * @nr_pages: length of the array
   */
@@ -1781,8 +1784,16 @@ static inline unsigned long num_pages_contiguous(struct page **pages,
  {
         size_t i;
  
+       if (IS_ENABLED(CONFIG_SPARSEMEM) &&
+           !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
+               const unsigned long pfn = page_to_pfn(pages[0]);
+
+               nr_pages = min_t(size_t, nr_pages,
+                                PAGES_PER_SECTION - (pfn & (PAGES_PER_SECTION - 1)));
+       }
+
         for (i = 1; i < nr_pages; i++)
-               if (pages[i] != nth_page(pages[0], i))
+               if (pages[i] != pages[i - 1] + 1)
                         break;
  

Whether that helper should live in mm/utils.c is a valid point the bigger it gets.

>>
>> And then - since there is only *one* single user - you don't put it in
>> the most central header file that EVERYBODY ELSE cares about.
>>
>> And you absolutely don't do it if it generates garbage code for no good reason!

I understand the hate for nth_page in this context.

But I rather hate it *completely* because people get it wrong all of the time.

In any case, enjoy your travels, Linus.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05  7:47     ` David Hildenbrand
@ 2025-08-05 11:49       ` Jason Gunthorpe
  2025-08-05 12:07         ` David Hildenbrand
  2025-08-05 13:00       ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 11:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 09:47:18AM +0200, David Hildenbrand wrote:
> > There was discussion here[1] where David Hildenbrand and Jason
> > Gunthorpe suggested this should be in common code and I believe there
> > was some intent that this would get reused.  I took this as
> > endorsement from mm folks.  This can certainly be pulled back into
> > subsystem code.
> 
> Yeah, we ended up here after trying to go the folio-way first, but then
> realizing that code that called GUP shouldn't have to worry about
> folios, just to detect consecutive pages+PFNs.
> 
> I think this helper can come in handy even in folio context.
> I recall pointing Joanne at it in a different fuse context.

The scatterlist code should use it also, it is doing the same logic.

> The concern is rather false positives, meaning, you want consecutive
> PFNs (just like within a folio), but -- because the stars aligned --
> you get consecutive "struct page" that do not translate to consecutive PFNs.

I wonder if we can address that from the other side and prevent the
memory code from creating a bogus contiguous struct page in the first
place so that struct page contiguity directly reflects physical
contiguity?

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 11:49       ` Jason Gunthorpe
@ 2025-08-05 12:07         ` David Hildenbrand
  2025-08-05 12:38           ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 12:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 13:49, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 09:47:18AM +0200, David Hildenbrand wrote:
>>> There was discussion here[1] where David Hildenbrand and Jason
>>> Gunthorpe suggested this should be in common code and I believe there
>>> was some intent that this would get reused.  I took this as
>>> endorsement from mm folks.  This can certainly be pulled back into
>>> subsystem code.
>>
>> Yeah, we ended up here after trying to go the folio-way first, but then
>> realizing that code that called GUP shouldn't have to worry about
>> folios, just to detect consecutive pages+PFNs.
>>
>> I think this helper can come in handy even in folio context.
>> I recall pointing Joanne at it in a different fuse context.
> 
> The scatterlist code should use it also, it is doing the same logic.
> 
>> The concern is rather false positives, meaning, you want consecutive
>> PFNs (just like within a folio), but -- because the stars aligned --
>> you get consecutive "struct page" that do not translate to consecutive PFNs.
> 
> I wonder if we can address that from the other side and prevent the
> memory code from creating a bogus contiguous struct page in the first
> place so that struct page contiguity directly reflects physical
> contiguity?

Well, if we could make CONFIG_SPARSEMEM_VMEMMAP the only sparsemem 
option ... :) But I recall it's not that easy (e.g., 32bit).

I don't see an easy way to guarantee that. E.g., populate_section_memmap 
really just does a kvmalloc_node() and 
__populate_section_memmap()->memmap_alloc() a memblock_alloc().

So if the stars align, the "struct page"s of the memory of two memory 
sections are contiguous, although the memory sections are not contiguous.

Just imagine memory holes.

Also, I am not sure if that is really a problem worth solving right now. 
If a helper on !CONFIG_SPARSEMEM_VMEMMAP is a little slower on some 
operations, like nth_page(), I really don't particularly care.

Eventually, !CONFIG_SPARSEMEM_VMEMMAP will go away in some distant 
future and we will all be happy.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:07         ` David Hildenbrand
@ 2025-08-05 12:38           ` Jason Gunthorpe
  2025-08-05 12:41             ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 12:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
> I don't see an easy way to guarantee that. E.g., populate_section_memmap
> really just does a kvmalloc_node() and
> __populate_section_memmap()->memmap_alloc() a memblock_alloc().

Well, it is really easy, if you do the kvmalloc_node and you get the
single unwanted struct page value, then call it again and free the
first one. The second call is guaranteed to not return the unwanted
value because the first call has it allocated.

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:38           ` Jason Gunthorpe
@ 2025-08-05 12:41             ` David Hildenbrand
  2025-08-05 12:56               ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 12:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 14:38, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
>> I don't see an easy way to guarantee that. E.g., populate_section_memmap
>> really just does a kvmalloc_node() and
>> __populate_section_memmap()->memmap_alloc() a memblock_alloc().
> 
> Well, it is really easy, if you do the kvmalloc_node and you get the
> single unwanted struct page value, then call it again and free the
> first one. The second call is guaranteed to not return the unwanted
> value because the first call has it allocated.

So you want to walk all existing sections to check that? :)

That's the kind of approach I would describe with the words Linus used.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:41             ` David Hildenbrand
@ 2025-08-05 12:56               ` Jason Gunthorpe
  2025-08-05 13:05                 ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 12:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 02:41:38PM +0200, David Hildenbrand wrote:
> On 05.08.25 14:38, Jason Gunthorpe wrote:
> > On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
> > > I don't see an easy way to guarantee that. E.g., populate_section_memmap
> > > really just does a kvmalloc_node() and
> > > __populate_section_memmap()->memmap_alloc() a memblock_alloc().
> > 
> > Well, it is really easy, if you do the kvmalloc_node and you get the
> > single unwanted struct page value, then call it again and free the
> > first one. The second call is guaranteed to not return the unwanted
> > value because the first call has it allocated.
> 
> So you want to walk all existing sections to check that? :)

We don't need to walk, compute the page-1 and carefully run that
through page_to_pfn algorithm.

> That's the kind of approach I would describe with the words Linus used.

It's some small boot-time nastiness; we do this all the time, messing up
the slow path so the fast paths can be simple.
Jason
 


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05  7:47     ` David Hildenbrand
  2025-08-05 11:49       ` Jason Gunthorpe
@ 2025-08-05 13:00       ` Linus Torvalds
  2025-08-05 13:20         ` David Hildenbrand
  2025-08-05 13:25         ` Jason Gunthorpe
  1 sibling, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>
> The concern is rather false positives, meaning, you want consecutive
> PFNs (just like within a folio), but -- because the stars aligned --
> you get consecutive "struct page" that do not translate to consecutive PFNs.

So I don't think that can happen with a valid 'struct page', because
if the 'struct page's are in different sections, they will have been
allocated separately too.

So you can't have two consecutive 'struct page' things without them
being consecutive pages.

But by all means, if you want to make sure, just compare the page
sections. But converting them to a PFN and then converting back is
just crazy.

IOW, the logic would literally be something like (this assumes there
is always at least *one* page):

        struct page *page = *pages++;
        int section = page_to_section(page);
        size_t nr;

        for (nr = 1; nr < nr_pages; nr++) {
                if (*pages++ != ++page)
                        break;
                if (page_to_section(page) != section)
                        break;
        }
        return nr;

and yes, I think we only define page_to_section() for
SECTION_IN_PAGE_FLAGS, but we should fix that and just have a

  #define page_to_section(pg) 0

for the other cases, and the compiler will happily optimize away the
"oh, it's always zero" case.

So something like that should actually generate reasonable code. It
*really* shouldn't try to generate a pfn (or, like that horror that I
didn't pull did, then go *back* from pfn to page)

That 'nth_page()' thing is just crazy garbage.

And even when fixed to not be garbage, I'm not convinced this needs to
be in <linux/mm.h>.

              Linus

PS. No - I didn't test the above trivial loop. It may be trivial, but
it may be buggy. Think of it as "something like this" rather than
"this is tested and does the right thing"


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 12:56               ` Jason Gunthorpe
@ 2025-08-05 13:05                 ` David Hildenbrand
  2025-08-05 13:15                   ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Linus Torvalds, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 14:56, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 02:41:38PM +0200, David Hildenbrand wrote:
>> On 05.08.25 14:38, Jason Gunthorpe wrote:
>>> On Tue, Aug 05, 2025 at 02:07:49PM +0200, David Hildenbrand wrote:
>>>> I don't see an easy way to guarantee that. E.g., populate_section_memmap
>>>> really just does a kvmalloc_node() and
>>>> __populate_section_memmap()->memmap_alloc() a memblock_alloc().
>>>
>>> Well, it is really easy, if you do the kvmalloc_node and you get the
>>> single unwanted struct page value, then call it again and free the
>>> first one. The second call is guaranteed to not return the unwanted
>>> value because the first call has it allocated.
>>
>> So you want to walk all existing sections to check that? :)
> 
> We don't need to walk, compute the page-1 and carefully run that
> through page_to_pfn algorithm.

You have to test if first_page-1 and last_page+1 are used as the memmap 
for another section.

So yeah, it could be done by some page_to_pfn + pfn_to_page trickery I 
think.

Not that I would like that "allocate until you find something that 
works" approach.

> 
>> That's the kind of approach I would describe with the words Linus used.
> 
> Its some small boot time nastyness, we do this all the time messing up
> the slow path so the fast paths can be simple

nth_page won't go away. If it went away, I would be sold on almost 
any idea. nth_page is the real problem.

So I don't like the idea of micro-optimizing num_pages_contiguous() by 
adding weird tweaks to the core for that.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:05                 ` David Hildenbrand
@ 2025-08-05 13:15                   ` Linus Torvalds
  2025-08-05 13:19                     ` David Hildenbrand
  2025-08-05 13:22                     ` David Hildenbrand
  0 siblings, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote:
>
> So I don't like the idea of micro-optimizing num_pages_contiguous() by
> adding weird tweaks to the core for that.

Seriously - take a look at that suggested sequence I posted, and tell
me that it isn't *MORE* obvious than the horror that is nth_page().

Honestly, if anybody thinks nth_page() is obvious and good, I think
they have some bad case of Stockholm syndrome.

This isn't about micro-optimizing. This is about not writing complete
garbage code that makes no sense.

nth_page() is a disgusting thing that is designed to look up
known-contiguous pages. That code mis-used it for *testing* for being
contiguous. It may have _worked_, but it was the wrong thing to do.

nth_page() in general should just not exist. I don't actually believe
there is any valid reason for it. I do not believe we should actually
have valid consecutive allocations of pages across sections.

So please work on removing that eldritch horror, not adding new worse
versions of it.

           Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:15                   ` Linus Torvalds
@ 2025-08-05 13:19                     ` David Hildenbrand
  2025-08-05 13:22                     ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:15, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote:
>>
>> So I don't like the idea of micro-optimizing num_pages_contiguous() by
>> adding weird tweaks to the core for that.
> 
> Seriously - take a look at that suggested sequence I posted, and tell
> me that it isn't *MORE* obvious than the horror that is nth_page().
It is, but that's not what I am talking about. I was talking about proposals 
to make it work in a different way, as Jason proposed.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:00       ` Linus Torvalds
@ 2025-08-05 13:20         ` David Hildenbrand
  2025-08-05 13:24           ` David Hildenbrand
  2025-08-05 13:28           ` Linus Torvalds
  2025-08-05 13:25         ` Jason Gunthorpe
  1 sibling, 2 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On 05.08.25 15:00, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>>
>> The concern is rather false positives, meaning, you want consecutive
>> PFNs (just like within a folio), but -- because the stars aligned --
>> you get consecutive "struct page" that do not translate to consecutive PFNs.
> 
> So I don't think that can happen with a valid 'struct page', because
> if the 'struct page's are in different sections, they will have been
> allocated separately too.

I think you can end up with two memory sections not being consecutive, 
but the struct pages being consecutive.

I assume the easiest way to achieve that is having a large-enough memory 
hole (PCI hole?) that spans more than one section, and memblock just giving 
you the next PFN range to use as "struct page".

It's one allocation per memory section, see

sparse_init_nid()->__populate_section_memmap(pfn, PAGES_PER_SECTION) -> 
memmap_alloc()->memblock_alloc().

With memory hotplug, there might be other weird ways to achieve it I 
suspect.

> 
> So you can't have two consecutive 'struct page' things without them
> being consecutive pages.
> 
> But by all means, if you want to make sure, just compare the page
> sections. But converting them to a PFN and then converting back is
> just crazy.
> 
> IOW, the logic would literally be something like (this assumes there
> is always at least *one* page):
> 
>          struct page *page = *pages++;
>          int section = page_to_section(page);
> 
>          size_t nr;
> 
>          for (nr = 1; nr < nr_pages; nr++) {
>                  if (*pages++ != ++page)
>                          break;
>                  if (page_to_section(page) != section)
>                          break;
>          }
>          return nr;

I think that would work, and we could limit the section check to the 
problematic case only (sparsemem without VMEMMAP).

> 
> and yes, I think we only define page_to_section() for
> SECTION_IN_PAGE_FLAGS, but we should fix that and just have a
> 
>    #define page_to_section(pg) 0

Probably yes, have to think about that.

> 
> for the other cases, and the compiler will happily optimize away the
> "oh, it's always zero" case.
> 
> So something like that should actually generate reasonable code. It
> *really* shouldn't try to generate a pfn (or, like that horror that I
> didn't pull did, then go *back* from pfn to page)
> 
> That 'nth_page()' thing is just crazy garbage.

Yes, it's unfortunate garbage we have to use even in folio_page().

> 
> And even when fixed to not be garbage, I'm not convinced this needs to
> be in <linux/mm.h>.

Yeah, let's move it to mm/util.c if you agree.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:15                   ` Linus Torvalds
  2025-08-05 13:19                     ` David Hildenbrand
@ 2025-08-05 13:22                     ` David Hildenbrand
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:15, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:05, David Hildenbrand <david@redhat.com> wrote:
>>
>> So I don't like the idea of micro-optimizing num_pages_contiguous() by
>> adding weird tweaks to the core for that.
> 
> Seriously - take a look at that suggested sequence I posted, and tell
> me that it isn't *MORE* obvious than the horror that is nth_page().
> 
> Honestly, if anybody thinks nth_page() is obvious and good, I think
> they have some bad case of Stockholm syndrome.
> 
> This isn't about micro-optimizing. This is about not writing complete
> garbage code that makes no sense.
> 
> nth_page() is a disgusting thing that is designed to look up
> known-contiguous pages. That code mis-used it for *testing* for being
> contiguous. It may have _worked_, but it was the wrong thing to do.
> 
> nth_page() in general should just not exist. I don't actually believe
> there is any valid reason for it. I do not believe we should actually
> have valid consecutive allocations of pages across sections.

Oh, just to add to that, 1 GiB folios (hugetlb, dax) are the main reason 
why we use it in things like folio_page(), and also why folio_page_idx() 
is so horrible.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:20         ` David Hildenbrand
@ 2025-08-05 13:24           ` David Hildenbrand
  2025-08-05 13:28           ` Linus Torvalds
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On 05.08.25 15:20, David Hildenbrand wrote:
> On 05.08.25 15:00, Linus Torvalds wrote:
>> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> The concern is rather false positives, meaning, you want consecutive
>>> PFNs (just like within a folio), but -- because the stars aligned --
>>> you get consecutive "struct page" that do not translate to consecutive PFNs.
>>
>> So I don't think that can happen with a valid 'struct page', because
>> if the 'struct page's are in different sections, they will have been
>> allocated separately too.
> 
> I think you can end up with two memory sections not being consecutive,
> but the struct pages being consecutive.
> 
> I assume the easiest way to achieve that is having a large-enough memory
> hole (PCI hole?) that spans more than section, and memblock just giving
> you the next PFN range to use as "struct page".
> 
> It's one allocation per memory section, see
> 
> sparse_init_nid()->__populate_section_memmap(prn, PAGES_PER_SECTION) ->
> memmap_alloc()->memblock_alloc().
> 
> With memory hotplug, there might be other weird ways to achieve it I
> suspect.
> 
>>
>> So you can't have two consecutive 'struct page' things without them
>> being consecutive pages.
>>
>> But by all means, if you want to make sure, just compare the page
>> sections. But converting them to a PFN and then converting back is
>> just crazy.
>>
>> IOW, the logic would literally be something like (this assumes there
>> is always at least *one* page):
>>
>>           struct page *page = *pages++;
>>           int section = page_to_section(page);
>>
>>           size_t nr;
>>
>>           for (nr = 1; nr < nr_pages; nr++) {
>>                   if (*pages++ != ++page)
>>                           break;
>>                   if (page_to_section(page) != section)
>>                           break;
>>           }
>>           return nr;
> 
> I think that would work, and we could limit the section check to the
> problematic case only (sparsemem without VMEMMAP).
> 
>>
>> and yes, I think we only define page_to_section() for
>> SECTION_IN_PAGE_FLAGS, but we should fix that and just have a
>>
>>     #define page_to_section(pg) 0
> 
> Probably yes, have to think about that.

Ah,

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif

so it will already be optimized out with your proposal, great!

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:00       ` Linus Torvalds
  2025-08-05 13:20         ` David Hildenbrand
@ 2025-08-05 13:25         ` Jason Gunthorpe
  2025-08-05 13:33           ` David Hildenbrand
  2025-08-05 13:36           ` Linus Torvalds
  1 sibling, 2 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 13:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 04:00:53PM +0300, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
> >
> > The concern is rather false positives, meaning, you want consecutive
> > PFNs (just like within a folio), but -- because the stars aligned --
> > you get consecutive "struct page" that do not translate to consecutive PFNs.
> 
> So I don't think that can happen with a valid 'struct page', because
> if the 'struct page's are in different sections, they will have been
> allocated separately too.

This is certainly true for the CONFIG_SPARSEMEM_VMEMMAP case, but in
the other cases I thought we end up with normal allocations for struct
page? This is what David was talking about.

So then we are afraid of this:

  a = kvmalloc_array(nelms_a);
  b = kvmalloc_array(nelms_b);

  assert(a + nelms_a != b)

I thought this was possible with our allocator, especially vmemmap?

David, there is another alternative to prevent this, simple though a
bit wasteful, just allocate a bit bigger to ensure the allocation
doesn't end on an exact PAGE_SIZE boundary?

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:20         ` David Hildenbrand
  2025-08-05 13:24           ` David Hildenbrand
@ 2025-08-05 13:28           ` Linus Torvalds
  2025-08-05 13:37             ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On Tue, 5 Aug 2025 at 16:20, David Hildenbrand <david@redhat.com> wrote:
>
> I think that would work, and we could limit the section check to the
> problematic case only (sparsemem without VMEMMAP).

We really don't need to, because unlike the nth_page() thing, the
compiler can see the logic and see "it's always zero".

And in the complex case (ie actual sparsemem without VMEMMAP), the
page_section() test is at least trivial, unlike the whole "turn it
into a pfn and back".

Because that "turn it into a pfn and back" is actually a really quite
complicated operation (and the compiler won't be able to optimize that
one much, so I'm pretty sure it generates horrific code).

I wish we didn't have nth_page() at all. I really don't think it's a
valid operation. It's been around forever, but I think it was broken
as introduced, exactly because I don't think you can validly even have
allocations that cross section boundaries.

So the only possible reason for nth_page() is that you tried to
combine two such allocations into one, and you simply shouldn't do
that in the first place. The VM layer can't free them as one
allocation anyway, so the only possible use-case is for some broken
"let's optimize this IO into one bigger chunk".

So people should actively get rid of that, not add to the brokenness.

           Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:25         ` Jason Gunthorpe
@ 2025-08-05 13:33           ` David Hildenbrand
  2025-08-05 13:55             ` Jason Gunthorpe
  2025-08-05 13:36           ` Linus Torvalds
  1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:33 UTC (permalink / raw)
  To: Jason Gunthorpe, Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:25, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 04:00:53PM +0300, Linus Torvalds wrote:
>> On Tue, 5 Aug 2025 at 10:47, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> The concern is rather false positives, meaning, you want consecutive
>>> PFNs (just like within a folio), but -- because the stars aligned --
>>> you get consecutive "struct page" that do not translate to consecutive PFNs.
>>
>> So I don't think that can happen with a valid 'struct page', because
>> if the 'struct page's are in different sections, they will have been
>> allocated separately too.
> 
> This is certainly true for the CONFIG_SPARSEMEM_VMEMMAP case, but in
> the other cases I thought we end up with normal allocations for struct
> page? This is what David was talking about.
> 
> So then we are afraid of this:
> 
>    a = kvmalloc_array(nelms_a);
>    b = kvmalloc_array(nelms_b);
> 
>    assert(a + nelms_a != b)
> 
> I thought this was possible with our allocator, especially vmemmap?
> 
> David, there is another alternative to prevent this, simple though a
> bit wasteful, just allocate a bit bigger to ensure the allocation
> doesn't end on an exact PAGE_SIZE boundary?

:/ in particular doing that through the memblock in sparse_init_nid(), I 
am not so sure that's a good idea.

I prefer Linus' proposal, which avoids the one nth_page(), unless some 
other approach can help us get rid of more nth_page() usage -- and I 
don't think your proposal could, right?

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:25         ` Jason Gunthorpe
  2025-08-05 13:33           ` David Hildenbrand
@ 2025-08-05 13:36           ` Linus Torvalds
  2025-08-05 13:47             ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, 5 Aug 2025 at 16:26, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> David, there is another alternative to prevent this, simple though a
> bit wasteful, just allocate a bit bigger to ensure the allocation
> doesn't end on an exact PAGE_SIZE boundary?

So I don't mind adding a check for "page_section()", because at least
that makes sense.

But yes, it would also probably be a good idea to try to minimize
SPARSEMEM without VMEMMAP. I'd love to get rid of it entirely, of
course, but even if that isn't possible, I'd *really* just like people
to try to make sure that it's never a valid thing to try to combine
memory across different sections.

David mentioned the 1GB hugepage folios, and I really thought that
even *those* were all in one section. They *should* be.

Do we have any relevant architectures that still do SPARSEMEM without
VMEMMAP? Because if it's purely some "legacy architecture" thing (ie
x86-32), how about just saying "no 1GB hugepages for you".

Because that whole SPARSEMEM without VMEMMAP is certainly painful even
outside of nth_page, and minimizing the pain sounds sane.

                Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:28           ` Linus Torvalds
@ 2025-08-05 13:37             ` David Hildenbrand
  2025-08-05 13:49               ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On 05.08.25 15:28, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:20, David Hildenbrand <david@redhat.com> wrote:
>>
>> I think that would work, and we could limit the section check to the
>> problematic case only (sparsemem without VMEMMAP).
> 
> We really don't need to, because unlike the nth_page() thing, the
> compiler can see the logic and see "it's always zero".

Yeah, realized that later.

> 
> And in the complex case (ie actual sparsemem without VMEMMAP), the
> page_section() test is at least trivial, unlike the whole "turn it
> into a pfn and back".
> 
> Because that "turn it into a pfn and back" is actually a really quite
> complicated operation (and the compiler won't be able to optimize that
> one much, so I'm pretty sure it generates horrific code).

Yes, that's why I hate folio_page_idx() so much on !VMEMMAP

#define folio_page_idx(folio, p)	(page_to_pfn(p) - folio_pfn(folio))

> 
> I wish we didn't have nth_page() at all. I really don't think it's a
> valid operation. It's been around forever, but I think it was broken
> as introduced, exactly because I don't think you can validly even have
> allocations that cross section boundaries.

Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
dax can with gigantic folios ... :(

We had some weird bugs with that, because people keep forgetting that you
cannot just use page++ unconditionally with such folios.

Anyhow, thanks Linus!

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:36           ` Linus Torvalds
@ 2025-08-05 13:47             ` David Hildenbrand
  2025-08-05 13:51               ` Linus Torvalds
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:47 UTC (permalink / raw)
  To: Linus Torvalds, Jason Gunthorpe
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:36, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:26, Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> David, there is another alternative to prevent this, simple though a
>> bit wasteful, just allocate a bit bigger to ensure the allocation
>> doesn't end on an exact PAGE_SIZE boundary?
> 
> So I don't mind adding a check for "page_section()", because at least
> that makes sense.
> 
> But yes, it would also probably be a good idea to try to minimize
> SPARSEMEM without VMEMMAP. I'd love to get rid of it entirely, of
> course, but even if that isn't possible, I'd *really* just like people
> to try to make sure that it's never a valid thing to try to combine
> memory across different sections.
> 
> David mentioned the 1GB hugepage folios, and I really thought that
> even *those* were all in one section. They *should* be.

The memory section size on x86 is always 128 MiB. Even with SPARSEMEM.

There are weird interactions between memory section size and memory 
hotplug / DAX, so we try to keep it small'ish.

It's more that we don't care that much about the memory section size with 
SPARSEMEM_VMEMMAP, because nth_page() and everything around it is just 
plain simple there.

> 
> Do we have any relevant architectures that still do SPARSEMEM without
> VMEMMAP? Because if it's purely some "legacy architecture" thing (ie
> x86-32), how about just saying "no 1GB hugepages for you".

arch/arm64/Kconfig:     select SPARSEMEM_VMEMMAP_ENABLE
arch/loongarch/Kconfig: select SPARSEMEM_VMEMMAP_ENABLE
arch/powerpc/Kconfig:   select SPARSEMEM_VMEMMAP_ENABLE
arch/riscv/Kconfig:     select SPARSEMEM_VMEMMAP_ENABLE if 64BIT
arch/s390/Kconfig:      select SPARSEMEM_VMEMMAP_ENABLE
arch/sparc/Kconfig:     select SPARSEMEM_VMEMMAP_ENABLE
arch/x86/Kconfig:       select SPARSEMEM_VMEMMAP_ENABLE if X86_64

But SPARSEMEM_VMEMMAP is still user-selectable.

I would assume SPARSEMEM_VMEMMAP_ENABLE support would cover most hugetlb 
+ dax users indeed, at least when it comes to gigantic folios.

Would have to figure out why someone would want to disable it (limited 
virtual address space? but that should not really be a problem on 64-bit, 
I think).

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:37             ` David Hildenbrand
@ 2025-08-05 13:49               ` Linus Torvalds
  0 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com,
	Jason Gunthorpe

On Tue, 5 Aug 2025 at 16:37, David Hildenbrand <david@redhat.com> wrote:
>
> Ordinary buddy allocations cannot exceed a memory section, but hugetlb and
> dax can with gigantic folios ... :(

Just turn that code off. Nobody sane cares.

It sounds like people have bent over backwards to fix the insane case
instead of saying "that's insane, let's not support it".

And yes, "that's insane" is actually fairly recent. It's not that long
ago that we made SPARSEMEM_VMEMMAP the mandatory option on x86-64. So
it was all sane in a historical context, but it's not sane any more.

But now it *is* the mandatory option both on x86 and arm64, so I
really think it's time to get rid of pointless pain points.

(I think powerpc still makes it an option to do sparsemem without
vmemmap, but it *is* an option there too)

             Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:47             ` David Hildenbrand
@ 2025-08-05 13:51               ` Linus Torvalds
  2025-08-05 13:55                 ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2025-08-05 13:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, 5 Aug 2025 at 16:47, David Hildenbrand <david@redhat.com> wrote:
>
> arch/x86/Kconfig:       select SPARSEMEM_VMEMMAP_ENABLE if X86_64
>
> But SPARSEMEM_VMEMMAP is still user-selectable.

I think you missed this confusion on x86:

        select SPARSEMEM_VMEMMAP if X86_64

IOW, that SPARSEMEM_VMEMMAP_ENABLE is entirely historical, I think,
and it's unconditional these days.

             Linus


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:33           ` David Hildenbrand
@ 2025-08-05 13:55             ` Jason Gunthorpe
  2025-08-05 14:10               ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 13:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote:

> > David, there is another alternative to prevent this, simple though a
> > bit wasteful, just allocate a bit bigger to ensure the allocation
> > doesn't end on an exact PAGE_SIZE boundary?
> 
> :/ in particular doing that through the memblock in sparse_init_nid(), I am
> not so sure that's a good idea.

It would probably be some work to make larger allocations to avoid
padding :\

> I prefer Linus' proposal and avoids the one nth_page(), unless any other
> approach can help us get rid of more nth_page() usage -- and I don't think
> your proposal could, right?

If the above were solved - so the struct page allocations could be
larger than a section, arguably just the entire range being plugged,
then I think you also solve the nth_page() problem too.

Effectively the nth_page() problem is that we allocate the struct page
arrays on an arbitrary section-by-section basis, and then the arch sets
MAX_ORDER so that a folio can cross sections, effectively guaranteeing
to virtually fragment the page *'s inside folios.

Doing a giant vmalloc at the start so you could also cheaply add some
padding would effectively also prevent the nth_page problem as we can
reasonably say that no folio should extend past an entire memory
region.

Maybe there is some reason we can't do a giant vmalloc on these
systems that also can't do SPARSEMEM_VMEMMAP :\ But perhaps we could get
up to MAX_ORDER at least? Or perhaps we could have those systems
reduce MAX_ORDER?

So, I think they are lightly linked problems.

I suppose this is also a limitation with Linus's suggestion. It
doesn't give the optimal answer for 1G pages on these older systems:

        for (size_t nr = 1; nr < nr_pages; nr++) {
                if (*pages++ != ++page)
                        break;

Since that will exit every section.

At least for scatterlist like cases the point of this function is just
to speed things up. If it returns short the calling code should still
be directly checking phys_addr contiguity anyhow. Something for the
kdoc I suppose.

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:51               ` Linus Torvalds
@ 2025-08-05 13:55                 ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 13:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:51, Linus Torvalds wrote:
> On Tue, 5 Aug 2025 at 16:47, David Hildenbrand <david@redhat.com> wrote:
>>
>> arch/x86/Kconfig:       select SPARSEMEM_VMEMMAP_ENABLE if X86_64
>>
>> But SPARSEMEM_VMEMMAP is still user-selectable.
> 
> I think you missed this confusion on x86:

Yeah ...

> 
>          select SPARSEMEM_VMEMMAP if X86_64
> 
> IOW, that SPARSEMEM_VMEMMAP_ENABLE is entirely historical, I think,
> and it's unconditional these days.

Same for arm64. The others seem to still allow configuring it. 
Probably we could really just force-enable it for the others that 
support it.

Then fence off gigantic folios and remove most of the nth_page hackery 
when we're working within a folio.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 13:55             ` Jason Gunthorpe
@ 2025-08-05 14:10               ` David Hildenbrand
  2025-08-05 14:20                 ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 14:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 15:55, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 03:33:49PM +0200, David Hildenbrand wrote:
> 
>>> David, there is another alternative to prevent this, simple though a
>>> bit wasteful, just allocate a bit bigger to ensure the allocation
>>> doesn't end on an exact PAGE_SIZE boundary?
>>
>> :/ in particular doing that through the memblock in sparse_init_nid(), I am
>> not so sure that's a good idea.
> 
> It would probably be some work to make larger allocations to avoid
> padding :\
> 
>> I prefer Linus' proposal and avoids the one nth_page(), unless any other
>> approach can help us get rid of more nth_page() usage -- and I don't think
>> your proposal could, right?
> 
> If the above were solved - so the struct page allocations could be
> larger than a section, arguably just the entire range being plugged,
> then I think you also solve the nth_page() problem too.
> 
> Effectively the nth_page() problem is that we allocate the struct page
> arrays on an arbitrary section-by-section basis, and then the arch sets
> MAX_ORDER so that a folio can cross sections, effectively guaranteeing
> to virtually fragment the page *'s inside folios.
> 
> Doing a giant vmalloc at the start so you could also cheaply add some
> padding would effectively also prevent the nth_page problem as we can
> reasonably say that no folio should extend past an entire memory
> region.
> 
> Maybe there is some reason we can't do a giant vmalloc on these
> systems that also can't do SPARSEMEM_VMEMMAP :\ But perhaps we could get
> up to MAX_ORDER at least? Or perhaps we could have those systems
> reduce MAX_ORDER?
> 
> So, I think they are lightly linked problems.

There are some weird scenarios where you hotplug memory after boot 
memory, and suddenly you can runtime-allocate a gigantic folio that 
spans both ranges etc.

So while related, the corner cases are all a bit nasty, and just 
forbidding folios to span a memory section on these problematic configs 
(sparse !vmemmap) sounds interesting.

As Linus said, x86-64 and arm64 are already VMEMMAP-only. s390x allows 
for gigantic folios, and VMEMMAP might still be configurable there. Same 
for ppc at least. Not sure about riscv and others, will have to dig.

That way we could just naturally make folio_page() and folio_page_idx() 
simpler. (and some GUP code IIRC as well where we still have to use 
nth_page)

> 
> I suppose this is also a limitation with Linus's suggestion. It
> doesn't give the optimal answer for 1G pages on these older systems:
> 
>          for (size_t nr = 1; nr < nr_pages; nr++) {
>                  if (*pages++ != ++page)
>                          break;
> 
> Since that will exit every section.

Yes. If folios can no longer span a section in these configs, then we'd 
be good if we stop when we cross a section. We'd still always cover the 
full folio.

> 
> At least for scatterlist like cases the point of this function is just
> to speed things up. If it returns short the calling code should still
> be directly checking phys_addr contiguity anyhow.

Same for vfio I think.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 14:10               ` David Hildenbrand
@ 2025-08-05 14:20                 ` Jason Gunthorpe
  2025-08-05 14:22                   ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 14:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote:
> There are some weird scenarios where you hotplug memory after boot memory,
> and suddenly you can runtime-allocate a gigantic folio that spans both
> ranges etc.

I was thinking we'd forbid this directly, but yes, it is another new
check.

> So while related, the corner cases are all a bit nasty, and just forbidding
> folios to span a memory section on these problematic configs (sparse
> !vmemmap) sounds interesting.

Indeed, this just sounds like forcing MAX_ORDER to be no larger than
the section size for this old mode?

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 14:20                 ` Jason Gunthorpe
@ 2025-08-05 14:22                   ` David Hildenbrand
  2025-08-05 14:24                     ` Jason Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 14:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 16:20, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote:
>> There are some weird scenarios where you hotplug memory after boot memory,
>> and suddenly you can runtime-allocate a gigantic folio that spans both
>> ranges etc.
> 
> I was thinking we'd forbid this directly, but yes, it is another new
> check.
> 
>> So while related, the corner cases are all a bit nasty, and just forbidding
>> folios to span a memory section on these problematic configs (sparse
>> !vmemmap) sounds interesting.
> 
> Indeed, this just sounds like forcing MAX_ORDER to be no larger than
> the section size for this old mode?

MAX_ORDER is always limited to the section size already.

MAX_ORDER is only about buddy allocations. What hugetlb and dax do is 
independent of MAX_ORDER.

-- 
Cheers,

David / dhildenb



* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 14:22                   ` David Hildenbrand
@ 2025-08-05 14:24                     ` Jason Gunthorpe
  2025-08-05 14:26                       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 14:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On Tue, Aug 05, 2025 at 04:22:32PM +0200, David Hildenbrand wrote:
> On 05.08.25 16:20, Jason Gunthorpe wrote:
> > On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote:
> > > There are some weird scenarios where you hotplug memory after boot memory,
> > > and suddenly you can runtime-allocate a gigantic folio that spans both
> > > ranges etc.
> > 
> > I was thinking we'd forbid this directly, but yes it is a another new
> > check.
> > 
> > > So while related, the corner cases are all a bit nasty, and just forbidding
> > > folios to span a memory section on these problematic configs (sparse
> > > !vmemmap) sounds interesting.
> > 
> > Indeed, this just sounds like forcing MAX_ORDER to be no larger than
> > the section size for this old mode?
> 
> MAX_ORDER is always limited to the section size already.
> 
> MAX_ORDER is only about buddy allocations. What hugetlb and dax do is
> independent of MAX_ORDER.

Oh I thought it limited folios too.

Still same idea is to have a MAX_FOLIO_ORDER for that case.

Jason


* Re: [GIT PULL] VFIO updates for v6.17-rc1
  2025-08-05 14:24                     ` Jason Gunthorpe
@ 2025-08-05 14:26                       ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-08-05 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, Alex Williamson, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, lizhe.67@bytedance.com

On 05.08.25 16:24, Jason Gunthorpe wrote:
> On Tue, Aug 05, 2025 at 04:22:32PM +0200, David Hildenbrand wrote:
>> On 05.08.25 16:20, Jason Gunthorpe wrote:
>>> On Tue, Aug 05, 2025 at 04:10:45PM +0200, David Hildenbrand wrote:
>>>> There are some weird scenarios where you hotplug memory after boot memory,
>>>> and suddenly you can runtime-allocate a gigantic folio that spans both
>>>> ranges etc.
>>>
>>> I was thinking we'd forbid this directly, but yes, it is another new
>>> check.
>>>
>>>> So while related, the corner cases are all a bit nasty, and just forbidding
>>>> folios to span a memory section on these problematic configs (sparse
>>>> !vmemmap) sounds interesting.
>>>
>>> Indeed, this just sounds like forcing MAX_ORDER to be no larger than
>>> the section size for this old mode?
>>
>> MAX_ORDER is always limited to the section size already.
>>
>> MAX_ORDER is only about buddy allocations. What hugetlb and dax do is
>> independent of MAX_ORDER.
> 
> Oh I thought it limited folios too.
> 
> Still, the same idea applies: have a MAX_FOLIO_ORDER for that case.

Yes. Usually unlimited, except when we are in this weird config state.

-- 
Cheers,

David / dhildenb



end of thread, other threads:[~2025-08-05 14:26 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-04 22:22 [GIT PULL] VFIO updates for v6.17-rc1 Alex Williamson
2025-08-04 23:55 ` Linus Torvalds
2025-08-05  0:53   ` Alex Williamson
2025-08-05  7:47     ` David Hildenbrand
2025-08-05 11:49       ` Jason Gunthorpe
2025-08-05 12:07         ` David Hildenbrand
2025-08-05 12:38           ` Jason Gunthorpe
2025-08-05 12:41             ` David Hildenbrand
2025-08-05 12:56               ` Jason Gunthorpe
2025-08-05 13:05                 ` David Hildenbrand
2025-08-05 13:15                   ` Linus Torvalds
2025-08-05 13:19                     ` David Hildenbrand
2025-08-05 13:22                     ` David Hildenbrand
2025-08-05 13:00       ` Linus Torvalds
2025-08-05 13:20         ` David Hildenbrand
2025-08-05 13:24           ` David Hildenbrand
2025-08-05 13:28           ` Linus Torvalds
2025-08-05 13:37             ` David Hildenbrand
2025-08-05 13:49               ` Linus Torvalds
2025-08-05 13:25         ` Jason Gunthorpe
2025-08-05 13:33           ` David Hildenbrand
2025-08-05 13:55             ` Jason Gunthorpe
2025-08-05 14:10               ` David Hildenbrand
2025-08-05 14:20                 ` Jason Gunthorpe
2025-08-05 14:22                   ` David Hildenbrand
2025-08-05 14:24                     ` Jason Gunthorpe
2025-08-05 14:26                       ` David Hildenbrand
2025-08-05 13:36           ` Linus Torvalds
2025-08-05 13:47             ` David Hildenbrand
2025-08-05 13:51               ` Linus Torvalds
2025-08-05 13:55                 ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).