* Re: [PATCH v6 00/13] Remove device private pages from physical address space
[not found] ` <4b5b222a-18e8-4d48-9acb-39e5bfe4e5f7@kernel.org>
@ 2026-03-17 1:47 ` Alistair Popple
2026-03-18 8:44 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 6+ messages in thread
From: Alistair Popple @ 2026-03-17 1:47 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
On 2026-03-07 at 03:16 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
> On 2/2/26 12:36, Jordan Niethe wrote:
> > Introduction
> > ------------
> >
> > The existing design of device private memory imposes limitations which
> > render it non-functional for certain systems and configurations where
> > the physical address space is limited.
> >
> > Limited available address space
> > -------------------------------
> >
> > Device private memory is implemented by first reserving a region of the
> > physical address space. This is a problem. The physical address space is
> > not a resource that is directly under the kernel's control. Availability
> > of suitable physical address space is constrained by the underlying
> > hardware and firmware, so a suitable region may not always exist.
> >
> > Device private memory assumes that it will be able to reserve a device
> > memory sized chunk of physical address space. However, there is nothing
> > guaranteeing that this will succeed, and there are a number of factors that
> > increase the likelihood of failure. We need to consider what else may
> > exist in the physical address space. It is observed that certain VM
> > configurations place very large PCI windows immediately after RAM. Large
> > enough that there is no physical address space available at all for
> > device private memory. This is more likely to occur on 43 bit physical
> > width systems which have less physical address space.
> >
> > The fundamental issue is the physical address space is not a resource
> > the kernel can rely on being able to allocate from at will.
> >
> > New implementation
> > ------------------
> >
> > This series changes device private memory so that it does not require
> > allocation of physical address space and these problems are avoided.
> > Instead of using the physical address space, we introduce a "device
> > private address space" and allocate from there.
> >
> > A consequence of placing the device private pages outside of the
> > physical address space is that they no longer have a PFN. However, it is
> > still necessary to be able to look up a corresponding device private
> > page from a device private PTE entry, which means that we still require
> > some way to index into this device private address space. Instead of a
> > PFN, device private pages use an offset into this device private address
> > space to look up device private struct pages.
> >
> > The problem that then needs to be addressed is how to avoid confusing
> > these device private offsets with PFNs. It is the limited usage
> > of the device private pages themselves which makes this possible. A
> > device private page is only used for userspace mappings, so we do not need
> > to be concerned with them being used within the mm more broadly. This
> > means that the only way that the core kernel looks up these pages is via
> > the page table, where their PTE already indicates if they refer to a
> > device private page via their swap type, e.g. SWP_DEVICE_WRITE. We can
> > use this information to determine if the PTE contains a PFN which should
> > be looked up in the page map, or a device private offset which should be
> > looked up elsewhere.
> >
> > This applies when we are creating PTE entries for device private pages -
> > because they have their own type they already must be handled
> > separately, so it is a small step to convert them to a device private
> > PFN now too.
> >
> > The first part of the series updates callers where device private
> > offsets might now be encountered to track this extra state.
> >
> > The last patch contains the bulk of the work where we change how we
> > convert between device private pages and device private offsets and then
> > use a new interface for allocating device private pages without the need
> > for reserving physical address space.
> >
> > By removing the device private pages from the physical address space,
> > this series also opens up the possibility of moving away from tracking
> > device private memory using struct pages in the future. This is
> > desirable as on systems with large amounts of memory these device
> > private struct pages use a significant amount of memory and take a
> > significant amount of time to initialize.
>
> I now went through all of the patches (skimming a bit over some parts
> that need splitting or rework).
Thanks David for taking the time to do a thorough review. I will let Jordan
respond to most of the comments but wanted to add some of my own as I helped
with the initial idea.
> In general, a noble goal and a reasonable approach.
>
> But I get the sense that we are just hacking in yet another zone-device
> thing. This series certainly makes core-mm more complicated. I provided
> some inputs on how to make some things less hacky, and will provide
> further input as you move forward.
I disagree - this isn't hacking in another/new zone-device thing; it is cleaning
up/reworking a pre-existing zone-device thing (DEVICE_PRIVATE pages). My initial
hope was it wouldn't actually involve too much churn on the core-mm side.
It seems that didn't work quite as well as hoped as there are a few places in
core-mm where we use raw pfns (without actually accessing them) rather than using
the page/folio. Notably page_vma_mapped in patch 5.
But overall this is about replacing pfn_to_page()/page_to_pfn() with
device-private specific variants, as callers *must* already know when they
are dealing with a device-private pfn and treat it specially today (whether
explicitly or implicitly). Callers/callees already can't just treat a
device-private pfn normally as accessing the pfn will cause machine checks and
the associated page is a zone-device page so doesn't behave like a normal struct
page.
> We really have to minimize the impact, otherwise we'll just keep
> breaking stuff all the time when we forget a single test for
> device-private pages in one magical path.
As noted above this is already the case - all paths whether explicitly or
implicitly (or just forgotten ... hard to tell) need to consider device-private
pages and possibly treat them differently. Even today some magical path that
somehow gets a device-private pfn/page and tries to use it as a normal page/pfn
will probably break as they don't actually correspond to physical addresses that
actually exist and the struct pages are special.
So any core-mm churn is really just making this more explicit, but this series
doesn't add any new requirements.
My bigger aim here is to use this as a stepping stone to removing device-private
pages as they just contain a bunch of redundant information from a device driver
perspective that introduces a lot of metadata management overhead.
> I am not 100% sure how much the additional tests for device-private
> pages all over the place will cost us. At least it can get compiled out,
> but most distros will just always have it compiled in.
I didn't notice too many extra checks outside of the migration entry path. But
if perf is a concern there I think we could move those checks to device-private
specific paths. From memory Jordan did this more as a convenience. Will go look
a bit deeper for any other checks we might have added.
- Alistair
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v6 00/13] Remove device private pages from physical address space
2026-03-17 1:47 ` [PATCH v6 00/13] Remove device private pages from physical address space Alistair Popple
@ 2026-03-18 8:44 ` David Hildenbrand (Arm)
2026-03-20 5:52 ` Alistair Popple
0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 8:44 UTC (permalink / raw)
To: Alistair Popple
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
On 3/17/26 02:47, Alistair Popple wrote:
> On 2026-03-07 at 03:16 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
>> On 2/2/26 12:36, Jordan Niethe wrote:
>>> Introduction
>>> ------------
>>>
>>> The existing design of device private memory imposes limitations which
>>> render it non-functional for certain systems and configurations where
>>> the physical address space is limited.
>>>
>>> Limited available address space
>>> -------------------------------
>>>
>>> Device private memory is implemented by first reserving a region of the
>>> physical address space. This is a problem. The physical address space is
>>> not a resource that is directly under the kernel's control. Availability
>>> of suitable physical address space is constrained by the underlying
>>> hardware and firmware, so a suitable region may not always exist.
>>>
>>> Device private memory assumes that it will be able to reserve a device
>>> memory sized chunk of physical address space. However, there is nothing
>>> guaranteeing that this will succeed, and there are a number of factors that
>>> increase the likelihood of failure. We need to consider what else may
>>> exist in the physical address space. It is observed that certain VM
>>> configurations place very large PCI windows immediately after RAM. Large
>>> enough that there is no physical address space available at all for
>>> device private memory. This is more likely to occur on 43 bit physical
>>> width systems which have less physical address space.
>>>
>>> The fundamental issue is the physical address space is not a resource
>>> the kernel can rely on being able to allocate from at will.
>>>
>>> New implementation
>>> ------------------
>>>
>>> This series changes device private memory so that it does not require
>>> allocation of physical address space and these problems are avoided.
>>> Instead of using the physical address space, we introduce a "device
>>> private address space" and allocate from there.
>>>
>>> A consequence of placing the device private pages outside of the
>>> physical address space is that they no longer have a PFN. However, it is
>>> still necessary to be able to look up a corresponding device private
>>> page from a device private PTE entry, which means that we still require
>>> some way to index into this device private address space. Instead of a
>>> PFN, device private pages use an offset into this device private address
>>> space to look up device private struct pages.
>>>
>>> The problem that then needs to be addressed is how to avoid confusing
>>> these device private offsets with PFNs. It is the limited usage
>>> of the device private pages themselves which makes this possible. A
>>> device private page is only used for userspace mappings, so we do not need
>>> to be concerned with them being used within the mm more broadly. This
>>> means that the only way that the core kernel looks up these pages is via
>>> the page table, where their PTE already indicates if they refer to a
>>> device private page via their swap type, e.g. SWP_DEVICE_WRITE. We can
>>> use this information to determine if the PTE contains a PFN which should
>>> be looked up in the page map, or a device private offset which should be
>>> looked up elsewhere.
>>>
>>> This applies when we are creating PTE entries for device private pages -
>>> because they have their own type they already must be handled
>>> separately, so it is a small step to convert them to a device private
>>> PFN now too.
>>>
>>> The first part of the series updates callers where device private
>>> offsets might now be encountered to track this extra state.
>>>
>>> The last patch contains the bulk of the work where we change how we
>>> convert between device private pages and device private offsets and then
>>> use a new interface for allocating device private pages without the need
>>> for reserving physical address space.
>>>
>>> By removing the device private pages from the physical address space,
>>> this series also opens up the possibility of moving away from tracking
>>> device private memory using struct pages in the future. This is
>>> desirable as on systems with large amounts of memory these device
>>> private struct pages use a significant amount of memory and take a
>>> significant amount of time to initialize.
>>
>> I now went through all of the patches (skimming a bit over some parts
>> that need splitting or rework).
>
> Thanks David for taking the time to do a thorough review. I will let Jordan
> respond to most of the comments but wanted to add some of my own as I helped
> with the initial idea.
>
>> In general, a noble goal and a reasonable approach.
>>
>> But I get the sense that we are just hacking in yet another zone-device
>> thing. This series certainly makes core-mm more complicated. I provided
>> some inputs on how to make some things less hacky, and will provide
>> further input as you move forward.
>
> I disagree - this isn't hacking in another/new zone-device thing; it is cleaning
> up/reworking a pre-existing zone-device thing (DEVICE_PRIVATE pages). My initial
> hope was it wouldn't actually involve too much churn on the core-mm side.
... and there is quite some.
stuff like make_readable_exclusive_migration_entry_from_page() must be
reworked.
Maybe after some reworks it will no longer look like a hack.
Right now it does.
>
> It seems that didn't work quite as well as hoped as there are a few places in
> core-mm where we use raw pfns (without actually accessing them) rather than using
> the page/folio. Notably page_vma_mapped in patch 5.
Yes. I provided ideas on how to minimize the impact.
Again, maybe if done right it will be okay-ish.
It will likely still be error prone, but I have no idea how on earth we
could possibly check reliably, for an "unsigned long" pfn, whether it is a
PFN (it's right there in the name ...) or something completely different.
We don't want another pfn_t, it would be too much churn to convert most
of MM.
>
> But overall this is about replacing pfn_to_page()/page_to_pfn() with
> device-private specific variants, as callers *must* already know when they
> are dealing with a device-private pfn and treat it specially today (whether
> explicitly or implicitly). Callers/callees already can't just treat a
> device-private pfn normally as accessing the pfn will cause machine checks and
> the associated page is a zone-device page so doesn't behave like a normal struct
> page.
>
>> We really have to minimize the impact, otherwise we'll just keep
>> breaking stuff all the time when we forget a single test for
>> device-private pages in one magical path.
>
> As noted above this is already the case - all paths whether explicitly or
> implicitly (or just forgotten ... hard to tell) need to consider device-private
> pages and possibly treat them differently. Even today some magical path that
> somehow gets a device-private pfn/page and tries to use it as a normal page/pfn
> will probably break as they don't actually correspond to physical addresses that
> actually exist and the struct pages are special.
Well, so far a PFN is a PFN, and when you actually have a *page* (after
pfn_to_page() etc) you can just test for these cases.
The page is actually sufficient to make a decision.
With a PFN you have to carry auxiliary information.
>
> So any core-mm churn is really just making this more explicit, but this series
> doesn't add any new requirements.
Again, maybe it can be done in a better way. I did not enjoy some of the
code changes I was reading.
>
> My bigger aim here is to use this as a stepping stone to removing device-private
> pages as they just contain a bunch of redundant information from a device driver
> perspective that introduces a lot of metadata management overhead.
>
>> I am not 100% sure how much the additional tests for device-private
>> pages all over the place will cost us. At least it can get compiled out,
>> but most distros will just always have it compiled in.
>
> I didn't notice too many extra checks outside of the migration entry path. But
> if perf is a concern there I think we could move those checks to device-private
> specific paths. From memory Jordan did this more as a convenience. Will go look
> a bit deeper for any other checks we might have added.
I meant in stuff like page_vma_mapped. Probably not the hottest path,
and maybe the impact can be reduced by reworking it.
--
Cheers,
David
* Re: [PATCH v6 05/13] mm/page_vma_mapped: Add flag to page_vma_mapped_walk::flags to track device private pages
[not found] ` <81e7bd1c-54d5-4f88-969a-685177447c51@kernel.org>
@ 2026-03-20 4:57 ` Alistair Popple
2026-03-23 20:03 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 6+ messages in thread
From: Alistair Popple @ 2026-03-20 4:57 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
On 2026-03-07 at 02:44 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
> On 2/2/26 12:36, Jordan Niethe wrote:
> > A future change will remove device private pages from the physical
> > address space. This will mean that device private pages no longer have
> > a normal PFN and must be handled separately.
> >
> > Prepare for this by adding a PVMW_DEVICE_PRIVATE flag to
> > page_vma_mapped_walk::flags. This indicates that
> > page_vma_mapped_walk::pfn contains a device private offset rather than a
> > normal pfn.
> >
> > Once the device private pages are removed from the physical address
> > space this flag will be used to ensure a device private offset is
> > returned.
> >
> > Reviewed-by: Zi Yan <ziy@nvidia.com>
> > Signed-off-by: Jordan Niethe <jniethe@nvidia.com>
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> > v1:
> > - Update for HMM huge page support
> > v2:
> > - Move adding device_private param to check_pmd() until final patch
> > v3:
> > - Track device private offset in pvmw::flags instead of pvmw::pfn
> > v4:
> > - No change
> > ---
> > include/linux/rmap.h | 24 ++++++++++++++++++++++--
> > mm/page_vma_mapped.c | 4 ++--
> > mm/rmap.c | 4 ++--
> > mm/vmscan.c | 2 +-
> > 4 files changed, 27 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index daa92a58585d..1b03297f13dc 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -921,6 +921,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
> > #define PVMW_SYNC (1 << 0)
> > /* Look for migration entries rather than present PTEs */
> > #define PVMW_MIGRATION (1 << 1)
> > +/* pvmw::pfn is a device private offset */
> > +#define PVMW_DEVICE_PRIVATE (1 << 2)
> >
> > /* Result flags */
> >
> > @@ -939,14 +941,32 @@ struct page_vma_mapped_walk {
> > unsigned int flags;
> > };
> >
> > +static inline unsigned long page_vma_walk_flags(const struct folio *folio,
> > + unsigned long flags)
> > +{
> > + if (folio_is_device_private(folio))
> > + return flags | PVMW_DEVICE_PRIVATE;
> > + return flags;
> > +}
> > +
> > +static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
> > +{
> > + return folio_pfn(folio);
> > +}
> > +
> > +static inline struct folio *page_vma_walk_pfn_to_folio(struct page_vma_mapped_walk *pvmw)
> > +{
> > + return pfn_folio(pvmw->pfn);
> > +}
> > +
> > #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \
> > struct page_vma_mapped_walk name = { \
> > - .pfn = folio_pfn(_folio), \
> > + .pfn = folio_page_vma_walk_pfn(_folio), \
> > .nr_pages = folio_nr_pages(_folio), \
> > .pgoff = folio_pgoff(_folio), \
> > .vma = _vma, \
> > .address = _address, \
> > - .flags = _flags, \
> > + .flags = page_vma_walk_flags(_folio, _flags), \
> > }
>
> That's all rather horrible ...
>
>
> I was asking myself recently, why something that is called
> "page_vma_mapped_walk" consumes a pfn. It's just a horrible interface.
I don't disagree, and in fact it used to consume a page until 2aff7a4755be ("mm:
Convert page_vma_mapped_walk to work on PFNs"). If this was a page it would
basically resolve all the hackiness of this patch because we would no longer
have to pass PFN context around. So I wonder if there would be any opposition to
changing this back to taking a page?
>
>
> * DEFINE_FOLIO_VMA_WALK() users obviously receive a folio.
> * mm/migrate_device.c just abuses page_vma_mapped_walk() to make
> set_pmd_migration_entry() work. But we have a folio.
> * page_mapped_in_vma() has a page/folio.
>
> mapping_wrprotect_range_one() and pfn_mkclean_range() are the real
> issues. They all end up calling page_vma_mkclean_one(), which does not
> operate on pages/folios.
>
> Ideally, the odd pfn case would use its own simplified infrastructure.
>
>
> So, could we simply add a folio+page pointer in case we have one, and
> use that one if set, leaving the pfn unset?
>
> Then, the pfn would only be set for the
> mapping_wrprotect_range_one/pfn_mkclean_range case. I don't think
> device-private folios would ever have to mess with that.
>
>
> Then, you just always have a folio+page and don't even have to worry
> about the pfn?
That sounds reasonable to me. We were hesitant to add a page back to the
interface given it had been removed previously, but let's try implementing this to
see what it looks like.
- Alistair
> --
> Cheers,
>
> David
* Re: [PATCH v6 00/13] Remove device private pages from physical address space
2026-03-18 8:44 ` David Hildenbrand (Arm)
@ 2026-03-20 5:52 ` Alistair Popple
2026-03-23 20:10 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 6+ messages in thread
From: Alistair Popple @ 2026-03-20 5:52 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
On 2026-03-18 at 19:44 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
> On 3/17/26 02:47, Alistair Popple wrote:
> > On 2026-03-07 at 03:16 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
> >> On 2/2/26 12:36, Jordan Niethe wrote:
> >>> Introduction
> >>> ------------
> >>>
> >>> The existing design of device private memory imposes limitations which
> >>> render it non-functional for certain systems and configurations where
> >>> the physical address space is limited.
> >>>
> >>> Limited available address space
> >>> -------------------------------
> >>>
> >>> Device private memory is implemented by first reserving a region of the
> >>> physical address space. This is a problem. The physical address space is
> >>> not a resource that is directly under the kernel's control. Availability
> >>> of suitable physical address space is constrained by the underlying
> >>> hardware and firmware, so a suitable region may not always exist.
> >>>
> >>> Device private memory assumes that it will be able to reserve a device
> >>> memory sized chunk of physical address space. However, there is nothing
> >>> guaranteeing that this will succeed, and there are a number of factors that
> >>> increase the likelihood of failure. We need to consider what else may
> >>> exist in the physical address space. It is observed that certain VM
> >>> configurations place very large PCI windows immediately after RAM. Large
> >>> enough that there is no physical address space available at all for
> >>> device private memory. This is more likely to occur on 43 bit physical
> >>> width systems which have less physical address space.
> >>>
> >>> The fundamental issue is the physical address space is not a resource
> >>> the kernel can rely on being able to allocate from at will.
> >>>
> >>> New implementation
> >>> ------------------
> >>>
> >>> This series changes device private memory so that it does not require
> >>> allocation of physical address space and these problems are avoided.
> >>> Instead of using the physical address space, we introduce a "device
> >>> private address space" and allocate from there.
> >>>
> >>> A consequence of placing the device private pages outside of the
> >>> physical address space is that they no longer have a PFN. However, it is
> >>> still necessary to be able to look up a corresponding device private
> >>> page from a device private PTE entry, which means that we still require
> >>> some way to index into this device private address space. Instead of a
> >>> PFN, device private pages use an offset into this device private address
> >>> space to look up device private struct pages.
> >>>
> >>> The problem that then needs to be addressed is how to avoid confusing
> >>> these device private offsets with PFNs. It is the limited usage
> >>> of the device private pages themselves which makes this possible. A
> >>> device private page is only used for userspace mappings, so we do not need
> >>> to be concerned with them being used within the mm more broadly. This
> >>> means that the only way that the core kernel looks up these pages is via
> >>> the page table, where their PTE already indicates if they refer to a
> >>> device private page via their swap type, e.g. SWP_DEVICE_WRITE. We can
> >>> use this information to determine if the PTE contains a PFN which should
> >>> be looked up in the page map, or a device private offset which should be
> >>> looked up elsewhere.
> >>>
> >>> This applies when we are creating PTE entries for device private pages -
> >>> because they have their own type they already must be handled
> >>> separately, so it is a small step to convert them to a device private
> >>> PFN now too.
> >>>
> >>> The first part of the series updates callers where device private
> >>> offsets might now be encountered to track this extra state.
> >>>
> >>> The last patch contains the bulk of the work where we change how we
> >>> convert between device private pages and device private offsets and then
> >>> use a new interface for allocating device private pages without the need
> >>> for reserving physical address space.
> >>>
> >>> By removing the device private pages from the physical address space,
> >>> this series also opens up the possibility of moving away from tracking
> >>> device private memory using struct pages in the future. This is
> >>> desirable as on systems with large amounts of memory these device
> >>> private struct pages use a significant amount of memory and take a
> >>> significant amount of time to initialize.
> >>
> >> I now went through all of the patches (skimming a bit over some parts
> >> that need splitting or rework).
> >
> > Thanks David for taking the time to do a thorough review. I will let Jordan
> > respond to most of the comments but wanted to add some of my own as I helped
> > with the initial idea.
> >
> >> In general, a noble goal and a reasonable approach.
> >>
> >> But I get the sense that we are just hacking in yet another zone-device
> >> thing. This series certainly makes core-mm more complicated. I provided
> >> some inputs on how to make some things less hacky, and will provide
> >> further input as you move forward.
> >
> > I disagree - this isn't hacking in another/new zone-device thing; it is cleaning
> > up/reworking a pre-existing zone-device thing (DEVICE_PRIVATE pages). My initial
> > hope was it wouldn't actually involve too much churn on the core-mm side.
>
> ... and there is quite some.
>
> stuff like make_readable_exclusive_migration_entry_from_page() must be
> reworked.
Yeah, I was displeased to (re)discover the migration entry business when we
fleshed this series out. The idea was basically that raw device-private pfns
can't be used sensibly by anything in the core-mm anyway, so presumably nothing
was using them.
That turned out to be only somewhat true. The exceptions are:
1. page_vma_mapped which I think we have a solution for based on the comments to
patch 5.
2. migration entries which obviously we will have to see if we can rework.
3. hmm_range_fault()
4. page snapshots, although that's actually only used to test zero_pfn so we
could probably drop that if we just guarantee device private offsets are
always invalid pfns.
> Maybe after some reworks it will no longer look like a hack.
>
> Right now it does.
Fair enough.
> >
> > It seems that didn't work quite as well as hoped as there are a few places in
> > core-mm where we use raw pfns (without actually accessing them) rather than using
> > the page/folio. Notably page_vma_mapped in patch 5.
>
> Yes. I provided ideas on how to minimize the impact.
>
> Again, maybe if done right it will be okay-ish.
>
> It will likely still be error prone, but I have no idea how on earth we
> could possibly check reliably, for an "unsigned long" pfn, whether it is a
> PFN (it's right there in the name ...) or something completely different.
The idea was (at least for device-private) that you never needed the PFN,
only the page. I.e. that calling page_to_pfn() on a device-private page could,
conceptually at least, just crash the kernel because it should never happen.
Obviously we identified some exceptions to that rule, the biggest being
migration entries, hence the helpers for those.
> We don't want another pfn_t, it would be too much churn to convert most
> of MM.
Given I removed pfn_t I don't need convincing of that :-)
> >
> > But overall this is about replacing pfn_to_page()/page_to_pfn() with
> > device-private specific variants, as callers *must* already know when they
> > are dealing with a device-private pfn and treat it specially today (whether
> > explicitly or implicitly). Callers/callees already can't just treat a
> > device-private pfn normally as accessing the pfn will cause machine checks and
> > the associated page is a zone-device page so doesn't behave like a normal struct
> > page.
> >
> >> We really have to minimize the impact, otherwise we'll just keep
> >> breaking stuff all the time when we forget a single test for
> >> device-private pages in one magical path.
> >
> > As noted above this is already the case - all paths whether explicitly or
> > implicitly (or just forgotten ... hard to tell) need to consider device-private
> > pages and possibly treat them differently. Even today some magical path that
> > somehow gets a device-private pfn/page and tries to use it as a normal page/pfn
> > will probably break as they don't actually correspond to physical addresses that
> > actually exist and the struct pages are special.
>
> Well, so far a PFN is a PFN, and when you actually have a *page* (after
> pfn_to_page() etc) you can just test for these cases.
>
> The page is actually sufficient to make a decision.
> With a PFN you have to carry auxiliary information.
>
> >
> > So any core-mm churn is really just making this more explicit, but this series
> > doesn't add any new requirements.
>
> Again, maybe it can be done in a better way. I did not enjoy some of the
> code changes I was reading.
Ok. Was there anything outside the exceptions above that you did not enjoy?
One idea we did have was to make the PFNs "obviously" invalid PFNs, for example
by setting the MSB which exceeds the physical addressing capabilities of
every arch/platform. That would allow dropping the hmm and page-snapshot flags
although it is still a bit of a hack.
Ultimately one of the issues we are trying to resolve is that to get a PFN range
we use get_free_mem_region(), which essentially just returns a random unused PFN
range from the platform/arch perspective. An architecture may not recognise
those as valid pfns and hence may not have allocated enough vmemmap space for
them, which results in pfn_to_page() overflowing into something else (usually
user space VAs, at least in the case of RISC-V).
> >
> > My bigger aim here is to use this as a stepping stone to removing device-private
> > pages as they just contain a bunch of redundant information from a device driver
> > perspective that introduces a lot of metadata management overhead.
> >
> >> I am not 100% sure how much the additional tests for device-private
> >> pages all over the place will cost us. At least it can get compiled out,
> >> but most distros will just always have it compiled in.
> >
> > I didn't notice too many extra checks outside of the migration entry path. But
> > if perf is a concern there I think we could move those checks to device-private
> > specific paths. From memory Jordan did this more as a convenience. Will go look
> > a bit deeper for any other checks we might have added.
> I meant in stuff like page_vma_mapped. Probably not the hottest path,
> and maybe the impact can be reduced by reworking it.
Ok, let's see what the rework looks like. Thanks again for looking.
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v6 05/13] mm/page_vma_mapped: Add flag to page_vma_mapped_walk::flags to track device private pages
2026-03-20 4:57 ` [PATCH v6 05/13] mm/page_vma_mapped: Add flag to page_vma_mapped_walk::flags to track device private pages Alistair Popple
@ 2026-03-23 20:03 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-23 20:03 UTC (permalink / raw)
To: Alistair Popple
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
>>> #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \
>>> struct page_vma_mapped_walk name = { \
>>> - .pfn = folio_pfn(_folio), \
>>> + .pfn = folio_page_vma_walk_pfn(_folio), \
>>> .nr_pages = folio_nr_pages(_folio), \
>>> .pgoff = folio_pgoff(_folio), \
>>> .vma = _vma, \
>>> .address = _address, \
>>> - .flags = _flags, \
>>> + .flags = page_vma_walk_flags(_folio, _flags), \
>>> }
>>
>> That's all rather horrible ...
>>
>>
>> I was asking myself recently, why something that is called
>> "page_vma_mapped_walk" consume a pfn. It's just a horrible interface.
>
> I don't disagree, and in fact it used to consume a page until 2aff7a4755be ("mm:
> Convert page_vma_mapped_walk to work on PFNs"). If this was a page it would
> basically resolve all the hackiness of this patch because we would no longer
> have to pass PFN context around. So I wonder if there would be any opposition to
> changing this back to taking a page?
Are there cases where we pass a struct-page-less PFN?
The offenders are pfn_mkclean_range() and mapping_wrprotect_range().
pfn_mkclean_range() is only used by fs/dax.c
At least the caller immediately does a
dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
right afterwards, where, apparently, a page for the PFN exists? :)
mapping_wrprotect_range() is only used by drivers/video/fbdev/core/fb_defio.c
fb_deferred_io_work() just calls page_to_pfn(page). Huh?
Best to explore the history to see what's going on there.
My best guess is that they don't want to modify page state. But that can
be expressed by the helper functions and doesn't require us to have such
a weird implementation.
>
>>
>>
>> * DEFINE_FOLIO_VMA_WALK() users obviously receive a folio.
>> * mm/migrate_device.c just abuses page_vma_mapped_walk() to make
>> set_pmd_migration_entry() work. But we have a folio.
>> * page_mapped_in_vma() has a page/folio.
>>
>> mapping_wrprotect_range_one() and pfn_mkclean_range() are the real
>> issues. They all end up calling page_vma_mkclean_one(), which does not
>> operate on pages/folios.
>>
>> Ideally, the odd pfn case would use its own simplified infrastructure.
>>
>>
>> So, could we simply add a folio+page pointer in case we have one, and
>> use that one if set, leaving the pfn unset?
>>
>> Then, the pfn would only be set for the
>> mapping_wrprotect_range_one/pfn_mkclean_range case. I don't think
>> device-private folios would ever have to mess with that.
>>
>>
>> Then, you just always have a folio+page and don't even have to worry
>> about the pfn?
>
> That sounds reasonable to me. We were hesitant to add a page back to the
> interface given it had been removed previously, but let's try implementing this
> to see what it looks like.
Thanks!
--
Cheers,
David
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v6 00/13] Remove device private pages from physical address space
2026-03-20 5:52 ` Alistair Popple
@ 2026-03-23 20:10 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-23 20:10 UTC (permalink / raw)
To: Alistair Popple
Cc: Jordan Niethe, linux-mm, balbirs, matthew.brost, akpm,
linux-kernel, dri-devel, ziy, lorenzo.stoakes, lyude, dakr,
airlied, simona, rcampbell, mpenttil, jgg, willy, linuxppc-dev,
intel-xe, jgg, Felix.Kuehling, jhubbard, maddy, mpe, ying.huang
On 3/20/26 06:52, Alistair Popple wrote:
> On 2026-03-18 at 19:44 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
>> On 3/17/26 02:47, Alistair Popple wrote:
>>> On 2026-03-07 at 03:16 +1100, "David Hildenbrand (Arm)" <david@kernel.org> wrote...
>>>
>>> Thanks David for taking the time to do a thorough review. I will let Jordan
>>> respond to most of the comments but wanted to add some of my own as I helped
>>> with the initial idea.
>>>
>>>
>>> I disagree - this isn't hacking in another/new zone-device thing it is cleaning
>>> up/reworking a pre-existing zone-device thing (DEVICE_PRIVATE pages). My initial
>>> hope was it wouldn't actually involve too much churn on the core-mm side.
>>
>> ... and there is quite some.
>>
>> stuff like make_readable_exclusive_migration_entry_from_page() must be
>> reworked.
>
> Yeah, I was displeased to (re)discover the migration entry business when we
> fleshed this series out. The idea was basically that raw device-private pfns
> can't be used sensibly by anything in the core-mm anyway, so presumably nothing
> was using them.
>
> That turned out to be only somewhat true. The exceptions are:
>
> 1. page_vma_mapped which I think we have a solution for based on the comments to
> patch 5.
Yes, if we just have the page/folio we are in a better position. I
*suspect* that we want to pass a page range, as the other two weird
cases might pass a page that, in the future, might not be a folio anymore.
>
> 2. migration entries which obviously we will have to see if we can rework.
Please look into encoding this internally, using one of the highest PFN
bits or something like that. We don't have to support this on all weird
architectures.
>
> 3. hmm_range_fault()
Yes.
>
> 4. page snapshots, although that's actually only used to test zero_pfn, so we
> could probably drop it if we just guarantee device-private offsets are
> always invalid pfns.
Right, I think that can be more reasonably cleaned up.
[...]
>>
>> It will likely still be error prone, but I have no idea how on earth we
>> could possibly tell reliably, for an "unsigned long" pfn, whether it is a
>> PFN (it's right there in the name ...) or something completely different.
>
> The idea was (at least for device-private) that you never needed the PFN,
> only the page, i.e. that calling page_to_pfn() on a device-private page could,
> conceptually at least, just crash the kernel because it should never happen.
>
> Obviously we identified some exceptions to that rule, the biggest being
> migration entries, hence the helpers for those.
>
>> We don't want another pfn_t, it would be too much churn to convert most
>> of MM.
>
> Given I removed pfn_t I don't need convincing of that :-)
:)
>>>
>>> So any core-mm churn is really just making this more explicit, but this series
>>> doesn't add any new requirements.
>>
>> Again, maybe it can be done in a better way. I did not enjoy some of the
>> code changes I was reading.
>
> Ok. Was there anything outside the exceptions above that you did not enjoy?
The last patch was hard to review and I am not sure what else is hiding
in there. As said, breaking the patch into logical pieces will make this
a lot easier to review.
>
> One idea we did have was to make the PFNs "obviously" invalid PFNs, for example
> by setting the MSB which exceeds the physical addressing capabilities of
> every arch/platform. That would allow dropping the hmm and page-snapshot flags,
> although it is still a bit of a hack.
I mean, that might be cleaner, because *maybe* one could just teach
pfn_valid() about that? Or have another, more lightweight helper that
really just checks for "ordinary" vs. "special" pfns. Needs some thought.
Using the highest bit as "this is not an ordinary pfn" might just do.
Maybe there are some highmem considerations (making sure we don't run into
weird stuff).
>
> Ultimately one of the issues we are trying to resolve is that to get a PFN range
> we use get_free_mem_region(), which essentially just returns a random unused PFN
> range from the platform/arch perspective. An architecture may not recognise
> those as valid pfns and hence may not have allocated enough vmemmap space for
> them, which results in pfn_to_page() overflowing into something else (usually
> user space VAs, at least in the case of RISC-V).
Yes, I think it's a noble goal :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-03-23 20:10 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260202113642.59295-1-jniethe@nvidia.com>
[not found] ` <4b5b222a-18e8-4d48-9acb-39e5bfe4e5f7@kernel.org>
2026-03-17 1:47 ` [PATCH v6 00/13] Remove device private pages from physical address space Alistair Popple
2026-03-18 8:44 ` David Hildenbrand (Arm)
2026-03-20 5:52 ` Alistair Popple
2026-03-23 20:10 ` David Hildenbrand (Arm)
[not found] ` <20260202113642.59295-6-jniethe@nvidia.com>
[not found] ` <81e7bd1c-54d5-4f88-969a-685177447c51@kernel.org>
2026-03-20 4:57 ` [PATCH v6 05/13] mm/page_vma_mapped: Add flag to page_vma_mapped_walk::flags to track device private pages Alistair Popple
2026-03-23 20:03 ` David Hildenbrand (Arm)
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox