Where to put page->memdesc initially

linux-mm.kvack.org archive mirror

* Where to put page->memdesc initially
@ 2025-09-02 19:03 Matthew Wilcox
  2025-09-02 20:08 ` Jason Gunthorpe
  2025-09-02 20:09 ` David Hildenbrand
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2025-09-02 19:03 UTC
  To: linux-mm; +Cc: David Hildenbrand, Jason Gunthorpe

With the recent patches to slab, I'm just about ready to allocate struct
slab separately from struct page.  This will not be an immediate win,
of course.  Indeed, it will likely be a slowdown (overhead of a second
allocation per slab).  So there's no urgency to do this until we're
ready to shrink struct page, when we can at least point to that win
as justification.

Still, we should understand how we're going to get to Page2025 [1] one
step at a time.  I had been thinking about coopting compound_head to point
to struct slab.  But looking at the places which call folio_test_slab()
[an oxymoron in the New York Interpretation], it becomes apparent that
we need to keep compound_head() and page_folio() working for all pages
for a while.

As a reminder, compound_head() will _eventually_ return NULL for
slabs & folios.  It will only be defined to work for page allocations.
Likewise page_folio() will return NULL for any pages not part of a folio
and page_slab() will return NULL for any pages not part of a slab.

My best offer right now is to use page->lru.prev.  At least one of the
bottom two bits will be set to indicate that it's a memdesc (we're only
going to use thirteen of the memdesc types initially).

There are a few overlapping uses of these bits in struct page, so if we do
nothing we may get confused.  We can deal with mlock_count and order (for
pcp_llist).  But the biggest problem is the first tail page of a folio.
Depending on word size and endianness, there are four different atomic_t
fields that overlap with page->lru.prev.  That can't be solved by using
a different field in struct page; the first tail page is jam-packed.

So, page_slab() will first load page->memdesc (the same bits as
page->lru.prev), check the bottom four bits match the slab memdesc, and
also check page->page_type matches PGTY_slab.  I don't like this a lot,
because it's two loads rather than one atomic load, but it should only
be present for one commit.

In the next commit, we can separately allocate struct folio, make
page->memdesc point to struct folio and drop the PGTY_slab check (as
there will be no more uses of the first tail page for the mapcount stuff).
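As a rough illustration of the interim page_slab() described above, here
is a userspace sketch in C.  Every name with a _sketch or _SKETCH suffix,
the slab type value 0x3 and the page type stand-in 0xf5 are assumptions
made up for this example, not kernel definitions.  In the kernel both
loads would use READ_ONCE() and the memdesc word would overlap
page->lru.prev.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encodings, chosen only for this sketch. */
#define MEMDESC_TYPE_MASK_SKETCH ((uintptr_t)0xf)  /* bottom four bits hold the type */
#define MEMDESC_TYPE_SLAB_SKETCH ((uintptr_t)0x3)  /* assumed slab type value */
#define PGTY_SLAB_SKETCH         0xf5u             /* assumed stand-in for PGTY_slab */

/* Separately allocated slab descriptor; 16-byte aligned so the low bits are free. */
struct slab_sketch {
	_Alignas(16) int dummy;
};

struct page_sketch {
	uintptr_t memdesc;       /* same bits as page->lru.prev in the proposal */
	unsigned int page_type;  /* stand-in for page->page_type */
};

/* Pack a slab pointer and the slab type tag into one word. */
static inline uintptr_t make_slab_memdesc_sketch(struct slab_sketch *slab)
{
	assert(((uintptr_t)slab & MEMDESC_TYPE_MASK_SKETCH) == 0);
	return (uintptr_t)slab | MEMDESC_TYPE_SLAB_SKETCH;
}

/* Interim lookup: two loads, and both must agree that this is a slab. */
static inline struct slab_sketch *page_slab_sketch(const struct page_sketch *page)
{
	uintptr_t memdesc = page->memdesc;  /* first load */

	if ((memdesc & MEMDESC_TYPE_MASK_SKETCH) != MEMDESC_TYPE_SLAB_SKETCH)
		return NULL;
	if (page->page_type != PGTY_SLAB_SKETCH)  /* second load */
		return NULL;
	return (struct slab_sketch *)(memdesc & ~MEMDESC_TYPE_MASK_SKETCH);
}
```

Once struct folio is allocated separately, the second check would go
away and the function would reduce to the single load plus the tag test.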

[1] https://kernelnewbies.org/MatthewWilcox/Memdescs/Path


* Re: Where to put page->memdesc initially
  2025-09-02 19:03 Where to put page->memdesc initially Matthew Wilcox
@ 2025-09-02 20:08 ` Jason Gunthorpe
  2025-09-02 20:09 ` David Hildenbrand
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-02 20:08 UTC
  To: Matthew Wilcox; +Cc: linux-mm, David Hildenbrand

On Tue, Sep 02, 2025 at 08:03:57PM +0100, Matthew Wilcox wrote:
> So, page_slab() will first load page->memdesc (the same bits as
> page->lru.prev), check the bottom four bits match the slab memdesc, and
> also check page->page_type matches PGTY_slab.  I don't like this a lot,
> because it's two loads rather than one atomic load, but it should only
> be present for one commit.
> 
> In the next commit, we can separately allocate struct folio, make
> page->memdesc point to struct folio and drop the PGTY_slab check (as
> there will be no more uses of the first tail page for the mapcount stuff).

So to rephrase, there is no great free space for memdesc in struct
page right now, but after the folio is split then it is fine?

Thus you have a few commits within a single series where it is less
efficient?

Seems OK to me..

Jason



* Re: Where to put page->memdesc initially
  2025-09-02 19:03 Where to put page->memdesc initially Matthew Wilcox
  2025-09-02 20:08 ` Jason Gunthorpe
@ 2025-09-02 20:09 ` David Hildenbrand
  2025-09-02 21:06   ` Matthew Wilcox
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-09-02 20:09 UTC
  To: Matthew Wilcox, linux-mm; +Cc: Jason Gunthorpe

On 02.09.25 21:03, Matthew Wilcox wrote:
> With the recent patches to slab, I'm just about ready to allocate struct
> slab separately from struct page.  This will not be an immediate win,
> of course.  Indeed, it will likely be a slowdown (overhead of a second
> allocation per slab).  So there's no urgency to do this until we're
> ready to shrink struct page, when we can at least point to that win
> as justification.
> 
> Still, we should understand how we're going to get to Page2025 [1] one
> step at a time.  I had been thinking about coopting compound_head to point
> to struct slab.  But looking at the places which call folio_test_slab()
> [an oxymoron in the New York Interpretation], it becomes apparent that
> we need to keep compound_head() and page_folio() working for all pages
> for a while.
> 
> As a reminder, compound_head() will _eventually_ return NULL for
> slabs & folios.  It will only be defined to work for page allocations.
> Likewise page_folio() will return NULL for any pages not part of a folio
> and page_slab() will return NULL for any pages not part of a slab.
> 
> My best offer right now is to use page->lru.prev.  At least one of the
> bottom two bits will be set to indicate that it's a memdesc (we're only
> going to use thirteen of the memdesc types initially).
> 

Just so I understand it correctly:

Would you want to move the page type already from the mapcount into the
memdesc? That sounds challenging, because for any typed folios we would
not be allowed to reuse a field we want to use for the memdesc. IIRC,
hugetlb pretty much uses all of it.

The easy way out for now would be making this page type specific: Only
selected typed pages will store the memdesc (here: slab pointer) e.g.,
in the old page->mapping place.

So PageSlab() still checks the existing page type, but page_slab() would
simply look up the pointer in the old page->mapping place.


> There are a few overlapping uses of these bits in struct page, so if we do
> nothing we may get confused.  We can deal with mlock_count and order (for
> pcp_llist).  But the biggest problem is the first tail page of a folio.
> Depending on word size and endianness, there are four different atomic_t
> fields that overlap with page->lru.prev.  That can't be solved by using
> a different field in struct page; the first tail page is jam-packed.
> 
> So, page_slab() will first load page->memdesc (the same bits as
> page->lru.prev), check the bottom four bits match the slab memdesc, and
> also check page->page_type matches PGTY_slab.  I don't like this a lot,
> because it's two loads rather than one atomic load, but it should only
> be present for one commit.

As a first step, I would really not use the bottom four bits. Why
perform two type checks initially?

-- 
Cheers

David / dhildenb




* Re: Where to put page->memdesc initially
  2025-09-02 20:09 ` David Hildenbrand
@ 2025-09-02 21:06   ` Matthew Wilcox
  2025-09-02 21:15     ` Jason Gunthorpe
  2025-09-03  9:33     ` David Hildenbrand
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2025-09-02 21:06 UTC
  To: David Hildenbrand; +Cc: linux-mm, Jason Gunthorpe

On Tue, Sep 02, 2025 at 10:09:49PM +0200, David Hildenbrand wrote:
> Would you want to move the page type already from the mapcount into the
> memdesc? That sounds challenging, because for any typed folios we would
> not be allowed to reuse a field we want to use for the memdesc. IIRC,
> hugetlb pretty much uses all of it.

That would definitely be part of the same series.  But possibly not the
same patch.  I think the series has to include separate allocations for
slab, folio and whichever other memdescs won't fit into 32 bytes.

> The easy way out for now would be making this page type specific: Only
> selected typed pages will store the memdesc (here: slab pointer) e.g.,
> in the old page->mapping place.
> 
> So PageSlab() still checks the existing page type, but page_slab() would
> simply look up the pointer in the old page->mapping place.

I *think* that's roughly the same as what I'm proposing, except
that we already have a meaning for "the bottom two bits of
folio->mapping are set", so there's potential confusion for
folio_test_anon() & friends.
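To make the potential confusion concrete, here is a small userspace
sketch in C.  It assumes, as I understand the current code, that bit 0
of the mapping word marks an anonymous mapping; the constant names, the
slab tag value 0x3 and both helpers are hypothetical and exist only for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 0 of the mapping word means "anon". */
#define MAPPING_ANON_SKETCH ((uintptr_t)0x1)
/* Hypothetical slab type tag with both of the bottom two bits set. */
#define MEMDESC_SLAB_SKETCH ((uintptr_t)0x3)

/* Roughly what an anon test on the raw mapping word boils down to. */
static inline bool mapping_test_anon_sketch(uintptr_t mapping)
{
	return (mapping & MAPPING_ANON_SKETCH) != 0;
}

/* Store a tagged slab pointer in the word that used to be ->mapping. */
static inline uintptr_t tag_slab_sketch(const void *slab)
{
	assert(((uintptr_t)slab & (uintptr_t)0xf) == 0);
	return (uintptr_t)slab | MEMDESC_SLAB_SKETCH;
}
```

With these definitions, mapping_test_anon_sketch(tag_slab_sketch(ptr))
returns true for a suitably aligned ptr, which is the kind of false
positive that would have to be ruled out by never running the anon test
on a slab page, or by picking tag bits that do not collide.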
> 
> > There are a few overlapping uses of these bits in struct page, so if we do
> > nothing we may get confused.  We can deal with mlock_count and order (for
> > pcp_llist).  But the biggest problem is the first tail page of a folio.
> > Depending on word size and endianness, there are four different atomic_t
> > fields that overlap with page->lru.prev.  That can't be solved by using
> > a different field in struct page; the first tail page is jam-packed.
> > 
> > So, page_slab() will first load page->memdesc (the same bits as
> > page->lru.prev), check the bottom four bits match the slab memdesc, and
> > also check page->page_type matches PGTY_slab.  I don't like this a lot,
> > because it's two loads rather than one atomic load, but it should only
> > be present for one commit.
> 
> As a first step, I would really not use the bottom four bits. Why
> perform two type checks initially?

I'm concerned by things like compaction that are executing
asynchronously and might see a page mid-transition.  Or something like
GUP or lockless pagecache lookup that might get a stale page pointer.
It's a lot easier to reason about if we can do a single load and treat
that as a source of truth (with the appropriate reloads to make sure
nothing changed after we got a refcount).  Doing two loads makes
my brain hurt a bit because it introduces more possibilities for
inconsistency.  I'll need to write it up pretty carefully (which
annoys me because we're going to need it for a single or very
few commits ...)



* Re: Where to put page->memdesc initially
  2025-09-02 21:06   ` Matthew Wilcox
@ 2025-09-02 21:15     ` Jason Gunthorpe
  2025-09-02 23:24       ` Matthew Wilcox
  2025-09-03  9:33     ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-02 21:15 UTC
  To: Matthew Wilcox; +Cc: David Hildenbrand, linux-mm

On Tue, Sep 02, 2025 at 10:06:05PM +0100, Matthew Wilcox wrote:

> I'm concerned by things like compaction that are executing
> asynchronously and might see a page mid-transition.  Or something like
> GUP or lockless pagecache lookup that might get a stale page
> pointer.

At least GUP fast obtains a page refcount before touching the rest of
struct page, so I think it can't see those kinds of races since the
page shouldn't be transitioning with a non-zero refcount?

Jason



* Re: Where to put page->memdesc initially
  2025-09-02 21:15     ` Jason Gunthorpe
@ 2025-09-02 23:24       ` Matthew Wilcox
  2025-09-02 23:57         ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2025-09-02 23:24 UTC
  To: Jason Gunthorpe; +Cc: David Hildenbrand, linux-mm

On Tue, Sep 02, 2025 at 06:15:14PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 02, 2025 at 10:06:05PM +0100, Matthew Wilcox wrote:
> 
> > I'm concerned by things like compaction that are executing
> > asynchronously and might see a page mid-transition.  Or something like
> > GUP or lockless pagecache lookup that might get a stale page
> > pointer.
> 
> At least GUP fast obtains a page refcount before touching the rest of
> struct page, so I think it can't see those kinds of races since the
> page shouldn't be transitioning with a non-zero refcount?

OK, so ...

 - For folios, there's already no such thing as a page refcount (you may
   already know this and are just being slightly sloppy while
   speaking).  If you attempt to access the refcount on a tail page,
   you're silently redirected to the folio refcount.

 - That's not going to change with memdescs; for pages which are part of
   a memdesc, attempting to access the page's refcount will redirect to
   the folio's refcount.

What GUP-fast will do once we get to Page2025 is:

 - READ_ONCE(page->memdesc)
 - Check that the bottom bits match a folio.  If not, fall back to
   GUP-slow (or retry; I forget the details).
 - tryget the refcount, if fail fall back/retry
 - if (READ_ONCE(page->memdesc) != memdesc) { folio_put(); retry/fallback }
 - yay, we succeeded.

So that's all a little more complicated with two places to check as
an intermediate state, but I think it's doable.  It's just fiddly.
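The final-state sequence listed above can be sketched in userspace C11
roughly as follows.  This is only an illustration under stated
assumptions: the _sketch names, the folio type value 0x1 and the helper
functions are invented for the example, C11 atomics stand in for
READ_ONCE() and the kernel's refcount helpers, and the PTE recheck that
real GUP-fast also performs is left out.  Returning NULL stands for
"fall back or retry".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encodings, chosen only for this sketch. */
#define MEMDESC_TYPE_MASK_SKETCH  ((uintptr_t)0xf)
#define MEMDESC_TYPE_FOLIO_SKETCH ((uintptr_t)0x1)

struct folio_sketch {
	_Alignas(16) atomic_int refcount;  /* aligned so the low four bits are free */
};

struct page_sketch {
	_Atomic uintptr_t memdesc;
};

/* Point a page at a folio by storing the tagged pointer. */
static inline void page_set_folio_sketch(struct page_sketch *page,
					 struct folio_sketch *folio)
{
	assert(((uintptr_t)folio & MEMDESC_TYPE_MASK_SKETCH) == 0);
	atomic_store(&page->memdesc, (uintptr_t)folio | MEMDESC_TYPE_FOLIO_SKETCH);
}

/* Take a reference unless the count is already zero. */
static inline bool folio_try_get_sketch(struct folio_sketch *folio)
{
	int old = atomic_load(&folio->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&folio->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static inline void folio_put_sketch(struct folio_sketch *folio)
{
	atomic_fetch_sub(&folio->refcount, 1);
}

/* Load memdesc, check the type, tryget, then reload and compare. */
static inline struct folio_sketch *gup_fast_get_folio_sketch(struct page_sketch *page)
{
	uintptr_t memdesc = atomic_load(&page->memdesc);
	struct folio_sketch *folio;

	if ((memdesc & MEMDESC_TYPE_MASK_SKETCH) != MEMDESC_TYPE_FOLIO_SKETCH)
		return NULL;  /* not a folio */

	folio = (struct folio_sketch *)(memdesc & ~MEMDESC_TYPE_MASK_SKETCH);
	if (!folio_try_get_sketch(folio))
		return NULL;  /* refcount was zero */

	if (atomic_load(&page->memdesc) != memdesc) {
		folio_put_sketch(folio);  /* page changed under us */
		return NULL;
	}
	return folio;
}
```

The sketch dereferences the folio pointer before the recheck, so it
quietly assumes the folio memory stays valid for the duration of the
call; it says nothing about how that would be guaranteed.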



* Re: Where to put page->memdesc initially
  2025-09-02 23:24       ` Matthew Wilcox
@ 2025-09-02 23:57         ` Jason Gunthorpe
  2025-09-03  4:46           ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-02 23:57 UTC
  To: Matthew Wilcox; +Cc: David Hildenbrand, linux-mm

On Wed, Sep 03, 2025 at 12:24:07AM +0100, Matthew Wilcox wrote:
> On Tue, Sep 02, 2025 at 06:15:14PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 02, 2025 at 10:06:05PM +0100, Matthew Wilcox wrote:
> > 
> > > I'm concerned by things like compaction that are executing
> > > asynchronously and might see a page mid-transition.  Or something like
> > > GUP or lockless pagecache lookup that might get a stale page
> > > pointer.
> > 
> > At least GUP fast obtains a page refcount before touching the rest of
> > struct page, so I think it can't see those kinds of races since the
> > page shouldn't be transitioning with a non-zero refcount?
> 
> OK, so ...
> 
>  - For folios, there's already no such thing as a page refcount (you may
>    already know this and are just being slightly sloppy while
>    speaking).  

I was thinking broadly about the impossible-in-page-tables things like
slab and ptdesc must continue to have a refcount field, it is just
fixed to 0, right? But yes, the code all goes through struct folio to
get there.

>    you're silently redirected to the folio refcount.
> 
>  - That's not going to change with memdescs; for pages which are part of
>    a memdesc, attempting to access the page's refcount will redirect to
>    the folio's refcount.

My point is that until the refcount memory is moved from struct folio
to a memdesc allocated struct, you should be able to continue to rely
on checking a non-zero refcount in the struct folio to stabilize
reading the memdesc/type.

That seems like it may address some of your concern for this in-between
patch if a memdesc pointer and type are guaranteed to be stable when a
positive refcount is being held.

Then you'd change things like you describe:

>  - READ_ONCE(page->memdesc)
>  - Check that the bottom bits match a folio.  If not, fall back to
>    GUP-slow (or retry; I forget the details).

gup-slow sounds right to resolve any races to me.

>  - tryget the refcount, if fail fall back/retry
>  - if (READ_ONCE(page->memdesc) != memdesc) { folio_put(); retry/fallback }
>  - yay, we succeeded.

It is the same as GUP fast does for the PTE today. So this would now
recheck the PTE and the memdesc.

This recheck is because GUP fast effectively runs under a
SLAB_TYPESAFE_BY_RCU type of behavior for the struct folio. I think
the memdesc would also need to follow a SLAB_TYPESAFE_BY_RCU design as
well.

Jason


* Re: Where to put page->memdesc initially
  2025-09-02 23:57         ` Jason Gunthorpe
@ 2025-09-03  4:46           ` Matthew Wilcox
  2025-09-03  9:38             ` David Hildenbrand
                               ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Matthew Wilcox @ 2025-09-03  4:46 UTC
  To: Jason Gunthorpe; +Cc: David Hildenbrand, linux-mm

On Tue, Sep 02, 2025 at 08:57:40PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 03, 2025 at 12:24:07AM +0100, Matthew Wilcox wrote:
> > On Tue, Sep 02, 2025 at 06:15:14PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 02, 2025 at 10:06:05PM +0100, Matthew Wilcox wrote:
> > > 
> > > > I'm concerned by things like compaction that are executing
> > > > asynchronously and might see a page mid-transition.  Or something like
> > > > GUP or lockless pagecache lookup that might get a stale page
> > > > pointer.
> > > 
> > > At least GUP fast obtains a page refcount before touching the rest of
> > > struct page, so I think it can't see those kinds of races since the
> > > page shouldn't be transitioning with a non-zero refcount?
> > 
> > OK, so ...
> > 
> >  - For folios, there's already no such thing as a page refcount (you may
> >    already know this and are just being slightly sloppy while
> >    speaking).  
> 
> I was thinking broadly about the impossible-in-page-tables things like
> slab and ptdesc must continue to have a refcount field, it is just
> fixed to 0, right? But yes, the code all goes through struct folio to
> get there.

Once we switch to memdescs for these things, they no longer need a
refcount field.   By the end of Page2025, plain pages have a refcount,
but folios/slabs/ptdesc/etc set the page->_refcount to 0.  put_page()
moves out of line because it's really complicated; it looks something
like:

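void put_page(struct page *page)
{
	memdesc_t memdesc = READ_ONCE(page->memdesc);

	if (memdesc_is_folio(memdesc)) {
		/* Folio: the refcount lives in the folio, not the page. */
		struct folio *folio = memdesc_folio(memdesc);
		folio_put(folio);
	} else if (memdesc_is_slab(memdesc) || memdesc_is_ptdesc(memdesc)) {
		/* Slabs and ptdescs have no refcount to drop. */
		BUG();
	} else {
		/* Plain page allocation: the refcount is on the head page. */
		page = compound_head(page);
		if (page_put_testzero(page))
			__free_page(page);
	}
}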

... there's probably a bit more to it ...

get_page() probably looks similar.  GUP-fast obviously wouldn't use
get_page() because it needs to be very careful about what it's doing
(and it needs to fail properly if it sees a non-folio page).

> >    you're silently redirected to the folio refcount.
> > 
> >  - That's not going to change with memdescs; for pages which are part of
> >    a memdesc, attempting to access the page's refcount will redirect to
> >    the folio's refcount.
> 
> My point is that until the refcount memory is moved from struct folio
> to a memdesc allocated struct, you should be able to continue to rely
> on checking a non-zero refcount in the struct folio to stabilize
> reading the memdesc/type.

Definitely once you have a refcount on a folio, the page->folio
relationship is stable.  page->slab is stabilised if you've allocated
an object from the slab.  page->ptdesc is stabilised if you hold the
PTE lock or the mmap_lock ... we need to write all these things down.

> That seems like it may address some of your concern for this in-between
> patch if a memdesc pointer and type are guaranteed to be stable when a
> positive refcount is being held.
> 
> Then you'd change things like you describe:
> 
> >  - READ_ONCE(page->memdesc)
> >  - Check that the bottom bits match a folio.  If not, fall back to
> >    GUP-slow (or retry; I forget the details).
> 
> gup-slow sounds right to resolve any races to me.
> 
> >  - tryget the refcount, if fail fall back/retry
> >  - if (READ_ONCE(page->memdesc) != memdesc) { folio_put(); retry/fallback }
> >  - yay, we succeeded.
> 
> It is the same as GUP fast does for the PTE today. So this would now
> recheck the PTE and the memdesc.

Ah, yes, I missed the step where we recheck the PTE.  Thanks.

> This recheck is because GUP fast effectively runs under a
> SLAB_TYPESAFE_BY_RCU type of behavior for the struct folio. I think
> the memdesc would also need to follow a SLAB_TYPESAFE_BY_RCU design as
> well.

I haven't quite figured out if _all_ memdescs need to be TYPESAFE_BY_RCU
or only the ones which either have refcounts or are otherwise
migratable.  Slab should be safe to be not TYPESAFE because if we ever
see a PageSlab, we won't try to dereference the pointer in GUP,
pagecache lookup or migration.  I need to look through David's recent
patches again to understand how migration is going to work (obviously
we won't try to migrate slab pages).



* Re: Where to put page->memdesc initially
  2025-09-03  4:46           ` Matthew Wilcox
@ 2025-09-03  9:38             ` David Hildenbrand
  2025-09-03 12:28             ` Jason Gunthorpe
  2025-09-03 12:43             ` Jason Gunthorpe
  2 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-09-03  9:38 UTC
  To: Matthew Wilcox, Jason Gunthorpe; +Cc: linux-mm

>> It is the same as GUP fast does for the PTE today. So this would now
>> recheck the PTE and the memdesc.
> 
> Ah, yes, I missed the step where we recheck the PTE.  Thanks.
> 
>> This recheck is because GUP fast effectively runs under a
>> SLAB_TYPESAFE_BY_RCU type of behavior for the struct folio. I think
>> the memdesc would also need to follow a SLAB_TYPESAFE_BY_RCU design as
>> well.
> 
> I haven't quite figured out if _all_ memdescs need to be TYPESAFE_BY_RCU
> or only the ones which either have refcounts or are otherwise
> migratable.  Slab should be safe to be not TYPESAFE because if we ever
> see a PageSlab, we won't try to dereference the pointer in GUP,
> pagecache lookup or migration.  I need to look through David's recent
> patches again to understand how migration is going to work (obviously
> we won't try to migrate slab pages).

The long term plan is to work on frozen pages (with balloon pages that's 
easy, with zsmalloc I am not sure yet).

Migration core will be responsible for freeing these frozen pages after 
migration succeeded etc.

PageOffline pages won't allocate any memdesc. Zsmalloc will have to 
allocate one.

I would expect that the ->migrate_page() callback will just get a frozen
page and the callback will figure out what to do with regard to the memdesc.

It's going to be a bunch of work and I am getting interrupted working on 
it ...

-- 
Cheers

David / dhildenb




* Re: Where to put page->memdesc initially
  2025-09-03  4:46           ` Matthew Wilcox
  2025-09-03  9:38             ` David Hildenbrand
@ 2025-09-03 12:28             ` Jason Gunthorpe
  2025-09-03 12:43             ` Jason Gunthorpe
  2 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-03 12:28 UTC
  To: Matthew Wilcox; +Cc: David Hildenbrand, linux-mm

On Wed, Sep 03, 2025 at 05:46:08AM +0100, Matthew Wilcox wrote:

> > This recheck is because GUP fast effectively runs under a
> > SLAB_TYPESAFE_BY_RCU type of behavior for the struct folio. I think
> > the memdesc would also need to follow a SLAB_TYPESAFE_BY_RCU design as
> > well.
> 
> I haven't quite figured out if _all_ memdescs need to be TYPESAFE_BY_RCU
> or only the ones which either have refcounts or are otherwise
> migratable.  

Anything that de-references page->memdesc under RCU using the re-check
flow you outlined has to also use TYPESAFE_BY_RCU for the
page->memdesc allocation to avoid UAF under a RCU read side critical
section.

Likely it doesn't make sense to RCU de-reference page->memdesc for
anything other than incrementing a refcount.

Jason



* Re: Where to put page->memdesc initially
  2025-09-03  4:46           ` Matthew Wilcox
  2025-09-03  9:38             ` David Hildenbrand
  2025-09-03 12:28             ` Jason Gunthorpe
@ 2025-09-03 12:43             ` Jason Gunthorpe
  2 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-09-03 12:43 UTC
  To: Matthew Wilcox; +Cc: David Hildenbrand, linux-mm

On Wed, Sep 03, 2025 at 05:46:08AM +0100, Matthew Wilcox wrote:

> Once we switch to memdescs for these things, they no longer need a
> refcount field.   By the end of Page2025, plain pages have a refcount,
> but folios/slabs/ptdesc/etc set the page->_refcount to 0.  

Reading this again, I didn't quite get this till now. Maybe it is
worth adding this detail to the wiki.

In this case, what are "plain pages"? 

One I can think of is naked calls to alloc_page*(), which I see often
used in place of kmalloc(PAGE_SIZE), do you imagine a project to
favour kmalloc instead?

Jason



* Re: Where to put page->memdesc initially
  2025-09-02 21:06   ` Matthew Wilcox
  2025-09-02 21:15     ` Jason Gunthorpe
@ 2025-09-03  9:33     ` David Hildenbrand
  1 sibling, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-09-03  9:33 UTC
  To: Matthew Wilcox; +Cc: linux-mm, Jason Gunthorpe

On 02.09.25 23:06, Matthew Wilcox wrote:
> On Tue, Sep 02, 2025 at 10:09:49PM +0200, David Hildenbrand wrote:
>> Would you want to move the page type already from the mapcount into the
>> memdesc? That sounds challenging, because for any typed folios we would
>> not be allowed to reuse a field we want to use for the memdesc. IIRC,
>> hugetlb pretty much uses all of it.
> 
> That would definitely be part of the same series.  But possibly not the
> same patch.  I think the series has to include separate allocations for
> slab, folio and whichever other memdescs won't fit into 32 bytes.

I was wondering whether there could be a single patch where we do this 
change (separate allocations), and just prepare the code in previous 
patches for that accordingly, such that the resulting patch is still
reasonably small.

I feel like this way of splitting patches might cause unnecessary 
headaches :)

> 
>> The easy way out for now would be making this page type specific: Only
>> selected typed pages will store the memdesc (here: slab pointer) e.g.,
>> in the old page->mapping place.
>>
>> So PageSlab() still checks the existing page type, but page_slab() would
>> simply look up the pointer in the old page->mapping place.
> 
> I *think* that's roughly the same as what I'm proposing, except
> that we already have a meaning for "the bottom two bits of
> folio->mapping are set", so there's potential confusion for
> folio_test_anon() & friends.

IIRC, we must always make sure to never call folio_test_anon() on 
something that is a slab already.

But if in doubt, we could use bit[2] in ->mapping, which should still be
unused, IIRC.

>>
>>> There are a few overlapping uses of these bits in struct page, so if we do
>>> nothing we may get confused.  We can deal with mlock_count and order (for
>>> pcp_llist).  But the biggest problem is the first tail page of a folio.
>>> Depending on word size and endianness, there are four different atomic_t
>>> fields that overlap with page->lru.prev.  That can't be solved by using
>>> a different field in struct page; the first tail page is jam-packed.
>>>
>>> So, page_slab() will first load page->memdesc (the same bits as
>>> page->lru.prev), check the bottom four bits match the slab memdesc, and
>>> also check page->page_type matches PGTY_slab.  I don't like this a lot,
>>> because it's two loads rather than one atomic load, but it should only
>>> be present for one commit.
>>
>> As a first step, I would really not use the bottom four bits. Why
>> perform two type checks initially?
> 
> I'm concerned by things like compaction that are executing
> asynchronously and might see a page mid-transition.  Or something like
> GUP or lockless pagecache lookup that might get a stale page pointer.
> It's a lot easier to reason about if we can do a single load and treat
> that as a source of truth (with the appropriate reloads to make sure
> nothing changed after we got a refcount).  Doing two loads makes
> my brain hurt a bit because it introduces more possibilities for
> inconsistency.  I'll need to write it up pretty carefully (which
> annoys me because we're going to need it for a single or very
> few commits ...)

Makes sense, but maybe we can avoid all that by just structuring the 
patches differently :)

-- 
Cheers

David / dhildenb


