Linux filesystem development
* [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
@ 2026-04-30 11:33 Ritesh Harjani (IBM)
  2026-04-30 13:15 ` Matthew Wilcox
  2026-04-30 17:32 ` Gregory Price
  0 siblings, 2 replies; 10+ messages in thread
From: Ritesh Harjani (IBM) @ 2026-04-30 11:33 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Amir Goldstein, Christian Brauner, Jan Kara, lsf-pc,
	Gregory Price, Bharata B Rao, Donet Tom, Matthew Wilcox,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo


Hi All,

Amir insisted on this :)
> IOW, bring the Hallway session into the room, so that other people
> can participate and we can use the hallway time for gossip and
> stuffing our faces.

So, since we might have a few slots available in the FS breakout sessions,
here is something that I was hoping to discuss with you all in the hallway.
However, I thought it might be a good idea to start this thread here first,
to see what you all think about it.


Linux already supports memory tiers and there are ongoing discussions around
promotion of unmapped page cache pages, which lets kernel do the right thing
for userspace page cache pages on a tiered system.

v6.17 added support for per-node global reclaim via
/sys/devices/system/node/nodeX/reclaim, which lets users perform per-node
reclaim of page cache pages. We also already have interfaces that let userspace
define the lifetime of page cache pages, such as RWF_DONTCACHE and
FADV_DONTNEED. These are increasingly useful because locally-attached DRAM is
a costly resource and we don't want unwanted page cache pollution there.
Userspace is sometimes in a better position than the kernel to know the
workload's access pattern and whether it makes sense to drop page cache pages
once the I/O is done.
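
For reference, a minimal sketch of how userspace can use these today
(assuming a kernel new enough to have RWF_DONTCACHE;
posix_fadvise(POSIX_FADV_DONTNEED) is the long-standing fallback):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

static ssize_t read_uncached(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

#ifdef RWF_DONTCACHE
	/* Buffered read, but the folios are dropped once the read completes. */
	return preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
#else
	ssize_t ret = preadv2(fd, &iov, 1, off, 0);

	/* Tell the kernel we won't need this cached range again. */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
	return ret;
#endif
}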

So the question is:
Do we need a userspace interface for the placement policy of page cache pages on a per file basis?

Note that we do have per-task placement policies like set_mempolicy(), but those
are too coarse and don't help if userspace wants per-fd control. mmap()+mbind()
doesn't reach unmapped page cache. The per-inode shared_policy works for
shmem/guest_memfd but not for other filesystems (I think so, but I may be
wrong).
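
Just to make the coarseness concrete, a minimal sketch of the existing
knobs (link with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>

static void *existing_placement_knobs(int fd, size_t len, int nid)
{
	unsigned long nodemask = 1UL << nid;
	void *buf;

	/* Per-task: steers all of this task's future allocations; there is
	 * no way to scope it to a single file or fd. */
	set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask));

	/* Per-mapping: only governs folios faulted through this mapping;
	 * folios already sitting unmapped in the page cache are untouched. */
	buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (buf != MAP_FAILED)
		mbind(buf, len, MPOL_PREFERRED, &nodemask,
		      8 * sizeof(nodemask), 0);

	return buf;
}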

So what I would like to discuss with others is:

1. Is there a need for an interface that allows userspace to do per-fd page
   placement and maybe per-fd page migration?
2. Are there applications that need such an interface or would they benefit
   from it?
3. Even if applications may not need this today, should kernel developers start
   thinking about it now, before users start abusing some not-well-defined
   existing interface. e.g. the story of echo 1 > /proc/sys/vm/drop_caches,
   which became a production workload tool despite never being intended as
   one?

Let me know if people think this discussion qualifies for a BoF at LSFMM.
Or do you think it's a bad idea altogether? If that is the case, then
please help me understand why.
Before jumping into the implementation of any of this, I would
like to gather feedback on what others think.

-ritesh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-04-30 11:33 [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages Ritesh Harjani (IBM)
@ 2026-04-30 13:15 ` Matthew Wilcox
  2026-04-30 14:43   ` Ritesh Harjani
  2026-05-02 14:57   ` Gregory Price
  2026-04-30 17:32 ` Gregory Price
  1 sibling, 2 replies; 10+ messages in thread
From: Matthew Wilcox @ 2026-04-30 13:15 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jan Kara,
	lsf-pc, Gregory Price, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Thu, Apr 30, 2026 at 05:03:37PM +0530, Ritesh Harjani (IBM) wrote:
> Linux already supports memory tiers and there are ongoing discussions around
> promotion of unmapped page cache pages, which lets kernel do the right thing
> for userspace page cache pages on a tiered system.

Well, you know my opinion of that idea ...

> So the question is:
> Do we need a userspace interface for the placement policy of page cache pages on a per file basis?

What do we do if two tasks both "know" the right NUMA placement for the
inode's data, and they disagree?

> 1. Is there a need for an interface that allows userspace to do per-fd page
>    placement and maybe per-fd page migration?

Ideally, no, the kernel should observe the task and get it right.

By the way, you're familiar with how filemap_alloc_folio_noprof()
works today, right?  I forget whether cpuset_do_page_mem_spread
is on or off by default.

> Let me know if people think this discussion qualifies for a BoF at LSFMM.
> Or do you think it's a bad idea altogether? If that is the case, then
> please help me understand why.
> Before jumping into the implementation of any of this, I would
> like to gather feedback on what others think.

I'm just concerned about what other session I'll have to miss to attend
this instead ;-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-04-30 13:15 ` Matthew Wilcox
@ 2026-04-30 14:43   ` Ritesh Harjani
  2026-05-02 14:57   ` Gregory Price
  1 sibling, 0 replies; 10+ messages in thread
From: Ritesh Harjani @ 2026-04-30 14:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jan Kara,
	lsf-pc, Gregory Price, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

Matthew Wilcox <willy@infradead.org> writes:

> On Thu, Apr 30, 2026 at 05:03:37PM +0530, Ritesh Harjani (IBM) wrote:
>> Linux already supports memory tiers and there are ongoing discussions around
>> promotion of unmapped page cache pages, which lets kernel do the right thing
>> for userspace page cache pages on a tiered system.
>
> Well, you know my opinion of that idea ...
>

:)

>> So the question is:
>> Do we need a userspace interface for the placement policy of page cache pages on a per file basis?
>
> What do we do if two tasks both "know" the right NUMA placement for the
> inode's data, and they disagree?
>

Yes, that's a fair concern that I too had.

So, the placement policy only takes effect at the first allocation, i.e.
when a folio is first brought into the page cache. So in the common case
where two tasks read disjoint ranges of the same file, a per-fd policy
might work cleanly - each task's policy governs the folios it reads and
there shouldn't be any conflict.

However, for the same range, whoever instantiates the folio first wins.
But that problem exists today too, even with set_mempolicy().

>> 1. Is there a need for an interface that allows userspace to do per-fd page
>>    placement and maybe per-fd page migration?
>
> Ideally, no, the kernel should observe the task and get it right.
>
> By the way, you're familiar with how filemap_alloc_folio_noprof()
> works today, right?

Are you pointing towards this recent work of yours?

16a542e22339 Matthew Wilcox  mm/filemap: Extend __filemap_get_folio() to support NUMA memory poli..  8 months ago
7f3779a3ac3e Matthew Wilcox  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()         8 months ago
    mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()

    Add a mempolicy parameter to filemap_alloc_folio() to enable NUMA-aware
    page cache allocations. This will be used by upcoming changes to
    support NUMA policies in guest-memfd, where guest_memory need to be
    allocated NUMA policy specified by VMM.

    All existing users pass NULL maintaining current behavior.


Yup, that sort of lays the foundation for this discussion :)
Although I understand it was done specifically for guest_memfd only.

Is that what you meant?


> I forget whether cpuset_do_page_mem_spread
> is on or off by default.
>

Should be off then I guess...

cpuset_write_u64()
...
	case FILE_SPREAD_PAGE:
		pr_info_once("cpuset.%s is deprecated\n", cft->name);
		retval = cpuset_update_flag(CS_SPREAD_PAGE, cs, val);
		break;

Is this what you were referring to?


>> Let me know if people think this discussion qualifies for a BoF at LSFMM.
>> Or do you think it's a bad idea altogether? If that is the case, then
>> please help me understand why.
>> Before jumping into the implementation of any of this, I would
>> like to gather feedback on what others think.
>
> I'm just concerned about what other session I'll have to miss to attend
> this instead ;-)

It's good to know that there is an interest then ;)

-ritesh

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-04-30 11:33 [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages Ritesh Harjani (IBM)
  2026-04-30 13:15 ` Matthew Wilcox
@ 2026-04-30 17:32 ` Gregory Price
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Price @ 2026-04-30 17:32 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jan Kara,
	lsf-pc, Bharata B Rao, Donet Tom, Matthew Wilcox,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Thu, Apr 30, 2026 at 05:03:37PM +0530, Ritesh Harjani (IBM) wrote:
> 
> Linux already supports memory tiers

Allegedly. (TM)

In practice, and having worked with such support, it is incredibly
nascent: it causes LRU inversions by design, is missing unmapped
page cache support (as you note here), and overall just does not work
well out of the box for any reasonably complicated system.

> and there are ongoing discussions around
> promotion of unmapped page cache pages, which lets kernel do the right thing
> for userspace page cache pages on a tiered system.
> 

I like to think of this more accurately as:

"Lets the kernel nudge the trajectory of the distribution in the right
direction".

There is no objectively "right thing" here, and chasing that is a dead
end.

> Userspace is sometimes in a better position than the kernel to know the
> workload's access pattern and whether it makes sense to drop page cache pages
> once the I/O is done.
> 

At the expense of an increasingly complex maintenance burden on the kernel.

> So the question is:
> Do we need a userspace interface for the placement policy of page cache pages on a per file basis?
>

To the extent that you get something like:

MADV/FADV_HOT (promote and read-ahead)

as an extension that mirrors MADV_WILLNEED (read-ahead)

... maybe.
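
In userspace terms, something like the below (entirely hypothetical -
POSIX_FADV_HOT does not exist today; POSIX_FADV_WILLNEED is the existing
readahead-only hint):

posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED); /* today: readahead only */
posix_fadvise(fd, off, len, POSIX_FADV_HOT);      /* imagined: readahead + promote */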

> 1. Is there a need for an interface that allows userspace to do per-fd page
>    placement and maybe per-fd page migration?

Maybe as MADV/FADV hints, but beyond this - no.  I agree with Willy that
the kernel should simply get placement right.

Building the assumption that userland will do X and *then* the kernel
will get it right is just a road to building a bunch of random
interfaces that eventually get deprecated when the kernel does it
correctly.  We should just do it correctly or not ship it.

> 3. Even if applications may not need this today, should kernel developers start
>    thinking about it now, before users start abusing some not-well-defined
>    existing interface. e.g. the story of echo 1 > /proc/sys/vm/drop_caches,
>    which became a production workload tool despite never being intended as
>    one?

We have a public meeting every 2 weeks on tiering topics:

https://lore.kernel.org/all/8a622c4f-0774-96a5-2d2a-2151e0bc2367@google.com/

> 
> Let me know if people think this discussion qualifies for a BoF at LSFMM.
> Or do you think it's a bad idea altogether? If that is the case, then
> please help me understand why.
> Before jumping into the implementation of any of this, I would
> like to gather feedback on what others think.
> 

Always happy to discuss.  Just need to figure out timing.

~Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-04-30 13:15 ` Matthew Wilcox
  2026-04-30 14:43   ` Ritesh Harjani
@ 2026-05-02 14:57   ` Gregory Price
  2026-05-02 15:49     ` Gregory Price
  2026-05-02 23:00     ` Matthew Wilcox
  1 sibling, 2 replies; 10+ messages in thread
From: Gregory Price @ 2026-05-02 14:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ritesh Harjani (IBM), linux-fsdevel, Amir Goldstein,
	Christian Brauner, Jan Kara, lsf-pc, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 30, 2026 at 05:03:37PM +0530, Ritesh Harjani (IBM) wrote:
> > Linux already supports memory tiers and there are ongoing discussions around
> > promotion of unmapped page cache pages, which lets kernel do the right thing
> > for userspace page cache pages on a tiered system.
> 
> Well, you know my opinion of that idea ...
> 
> > So the question is:
> > Do we need a userspace interface for the placement policy of page cache pages on a per file basis?
> 
> What do we do if two tasks both "know" the right NUMA placement for the
> inode's data, and they disagree?
> 
> > 1. Is there a need for an interface that allows userspace to do per-fd page
> >    placement and maybe per-fd page migration?
> 
> Ideally, no, the kernel should observe the task and get it right.
> 

Out of curiosity, a use case I've been exploring is something like

fd = open()
buf = mmap(fd, ...)
mbind(buf, device_node)
/* fault file pages directly onto device memory */

this obviously breaks if there are concurrent accessors of said file
with read() (filemap will just fault onto the local node - clear race).

Do you think there's a world where we can hang a mempolicy off the
address_space via an fcntl() call with CAP_SYS_NICE?

I haven't quite worked through the full lifetime, since there's a
possibility the mempolicy ends up with stale nodes (hotplug, etc)
without plumbing for that.   But it did seem like a somewhat clean
abstraction that isn't specifically a tiering use case.

(not interested in this for anything other than single node placement
policy or tiering, so no interleave or migration support or anything)

~Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-05-02 14:57   ` Gregory Price
@ 2026-05-02 15:49     ` Gregory Price
  2026-05-03 16:18       ` Ritesh Harjani
  2026-05-02 23:00     ` Matthew Wilcox
  1 sibling, 1 reply; 10+ messages in thread
From: Gregory Price @ 2026-05-02 15:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ritesh Harjani (IBM), linux-fsdevel, Amir Goldstein,
	Christian Brauner, Jan Kara, lsf-pc, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Sat, May 02, 2026 at 03:57:19PM +0100, Gregory Price wrote:
> On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
> 
> fd = open()
> buf = mmap(fd, ...)
> mbind(buf, device_node)
> /* fault file pages directly onto device memory */
>

Maybe more explicitly, something like this:

fcntl(fd, F_SET_FILE_NUMA_NODE, gpu_nid); /* pref node */
mbind(addr, size, MPOL_BIND, ..., MOVE_ALL); /* move existing */
madvise(addr, size, MADV_POPULATE_READ); /* fault the rest */
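
Fleshed out a little, as a sketch only (F_SET_FILE_NUMA_NODE is the
hypothetical fcntl being floated here, so it is guarded out; the
mbind()+madvise() half uses interfaces that exist today; link with -lnuma):

#define _GNU_SOURCE
#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>

static int place_file_on_node(int fd, size_t len, int nid)
{
	unsigned long nodemask = 1UL << nid;
	void *addr;

#ifdef F_SET_FILE_NUMA_NODE
	/* Hypothetical: record a preferred node on the inode/address_space
	 * so that any future page cache allocation for this file lands on
	 * nid, no matter which task faults it in. */
	if (fcntl(fd, F_SET_FILE_NUMA_NODE, nid) < 0)
		return -1;
#endif

	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* Existing: bind the mapping to nid and migrate whatever is already
	 * mapped (MPOL_MF_MOVE_ALL needs CAP_SYS_NICE) ... */
	mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
	      MPOL_MF_MOVE_ALL);

	/* ... then fault in the rest so it gets allocated on nid. */
	madvise(addr, len, MADV_POPULATE_READ);

	return 0;
}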

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-05-02 14:57   ` Gregory Price
  2026-05-02 15:49     ` Gregory Price
@ 2026-05-02 23:00     ` Matthew Wilcox
  2026-05-03 14:15       ` Gregory Price
  1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2026-05-02 23:00 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ritesh Harjani (IBM), linux-fsdevel, Amir Goldstein,
	Christian Brauner, Jan Kara, lsf-pc, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Sat, May 02, 2026 at 03:57:19PM +0100, Gregory Price wrote:
> On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
> > Ideally, no, the kernel should observe the task and get it right.
> 
> Out of curiosity, a use case I've been exploring is something like
> 
> fd = open()
> buf = mmap(fd, ...)
> mbind(buf, device_node)
> /* fault file pages directly onto device memory */

Could you be explicit what _kind_ of node you're talking about here?
My initial thought went to something like DAX where the node actually
contains persistent memory and the fd is a reference to some chunk
of that storage.  But I don't think that's what you're referring to.
You might be thinking about a presentation of DRAM over the CXL fabric.
Or you might be thinking about memory presented by a graphics card
(perhaps over CXL, perhaps some other way).  The meaning of "device
memory" has become thoroughly confused (thanks Jerome!) so I sincerely
don't know what you're talking about.

> this obviously breaks if there are concurrent accessors of said file
> with read() (filemap will just fault onto the local node - clear race).
> 
> Do you think there's a world where we can hang a mempolicy off the
> address_space via an fcntl() call with CAP_SYS_NICE?

I don't hate the idea.  Presumably we'd document that it overrides any
mempolicy applied to the process?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-05-02 23:00     ` Matthew Wilcox
@ 2026-05-03 14:15       ` Gregory Price
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2026-05-03 14:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ritesh Harjani (IBM), linux-fsdevel, Amir Goldstein,
	Christian Brauner, Jan Kara, lsf-pc, Bharata B Rao, Donet Tom,
	Aboorva Devarajan, linux-mm, Ojaswin Mujoo

On Sun, May 03, 2026 at 12:00:48AM +0100, Matthew Wilcox wrote:
> On Sat, May 02, 2026 at 03:57:19PM +0100, Gregory Price wrote:
> > On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
> > > Ideally, no, the kernel should observe the task and get it right.
> > 
> > Out of curiosity, a use case I've been exploring is something like
> > 
> > fd = open()
> > buf = mmap(fd, ...)
> > mbind(buf, device_node)
> > /* fault file pages directly onto device memory */
> 
> Could you be explicit what _kind_ of node you're talking about here?
> My initial thought went to something like DAX where the node actually
> contains persistent memory and the fd is a reference to some chunk
> of that storage.  But I don't think that's what you're referring to.
> You might be thinking about a presentation of DRAM over the CXL fabric.
> Or you might be thinking about memory presented by a graphics card
> (perhaps over CXL, perhaps some other way).  The meaning of "device
> memory" has become thoroughly confused (thanks Jerome!) so I sincerely
> don't know what you're talking about.
>

Ah, I can see how the language has moved over time.

But yes, a GPU with onboard memory presented as a NUMA node, or some
other accelerator where the kernel is responsible for managing its
memory (rather than a driver).

Probably the use-case needs a bit more consideration before we go
dangling new policies off the page cache - just chewing on ideas.

~Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-05-02 15:49     ` Gregory Price
@ 2026-05-03 16:18       ` Ritesh Harjani
  2026-05-03 23:48         ` Gregory Price
  0 siblings, 1 reply; 10+ messages in thread
From: Ritesh Harjani @ 2026-05-03 16:18 UTC (permalink / raw)
  To: Gregory Price, Matthew Wilcox
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jan Kara,
	lsf-pc, Bharata B Rao, Donet Tom, Aboorva Devarajan, linux-mm,
	Ojaswin Mujoo

Gregory Price <gourry@gourry.net> writes:

> On Sat, May 02, 2026 at 03:57:19PM +0100, Gregory Price wrote:
>> On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
>> 
>> fd = open()
>> buf = mmap(fd, ...)
>> mbind(buf, device_node)
>> /* fault file pages directly onto device memory */
>>
>
> maybe more explicitly, something like this
>
> fcntl(fd, F_SET_FILE_NUMA_NODE, gpu_nid); /* pref node */
> mbind(addr, size, MPOL_BIND, ..., MOVE_ALL); /* move existing */


One existing problem with this approach of using MPOL_MF_MOVE[_ALL] (not
specifically about the gpu nid usecase here) is that it only considers
pages which are mapped into the process address space. So, unmapped page
cache pages are invisible to it.

So, the problem which I am trying to highlight here is: today there is
no mechanism for a user to move/migrate file-cache pages w/o incurring
extra I/Os. The only way that works today is to first populate all the
folios belonging to the file into the process address space by issuing
MADV_POPULATE_READ and then issue mbind(..., MPOL_MF_MOVE). But there is
no primitive to populate the process address space with only the folios
which are already present in the page cache (which may be residing on a
different NUMA node). Note that MADV_POPULATE_READ will read the missing
pages in from disk too.

So, maybe something like this... Would this still be useful? i.e.

madvise(addr, size, MADV_POPULATE_READ_NOIO);
mbind(addr, size, MPOL_BIND, ..., MPOL_MF_MOVE);

MADV_POPULATE_READ_NOIO should ensure that only the cached folios
belonging to that file are mapped into the process address space, w/o
doing any extra disk I/O. The subsequent mbind() call with MPOL_MF_MOVE
will then ensure that all the existing mapped folios are migrated to the
chosen NUMA node, and that any new pages which get faulted in are
allocated on the chosen NUMA node because of the MPOL_BIND policy.
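
A rough sketch of how that flow would look from userspace
(MADV_POPULATE_READ_NOIO is the hypothetical new advice proposed above,
so it is guarded out; everything else exists today; link with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>

/* Migrate whatever part of the file is already cached onto 'nid' without
 * pulling the rest of the file in from disk. */
static int migrate_cached_folios(int fd, size_t len, int nid)
{
	unsigned long nodemask = 1UL << nid;
	void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (addr == MAP_FAILED)
		return -1;

#ifdef MADV_POPULATE_READ_NOIO
	/* Hypothetical: map only the folios already resident in the page
	 * cache; do not start I/O for the missing ranges. */
	madvise(addr, len, MADV_POPULATE_READ_NOIO);
#else
	/* Today's only option: populate everything, which also reads the
	 * missing folios in from disk. */
	madvise(addr, len, MADV_POPULATE_READ);
#endif

	/* Migrate the now-mapped folios to nid; MPOL_BIND also makes later
	 * faults through this mapping allocate on nid. */
	return mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
		     MPOL_MF_MOVE);
}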

I believe there might be existing applications facing this problem
today. This can happen, for instance, when a workload runs multiple
times and may run across different NUMA nodes. Our internal test team
once reported a similar performance regression with llama-bench on
subsequent runs when running it across different NUMA nodes. The reason
was that the existing page cache folios of the model weight file (from
the previous run on a different NUMA node) were not getting migrated
(and the benchmark was not calling MADV_POPULATE_READ, since that can
pull a large model weight file into the page cache all at once).

With that in mind, do we think having something like
MADV_POPULATE_READ_NOIO makes sense to address such problems? Do we
have any other use cases for it too?
Or do we see any problems with this, due to which it never existed?

(Note that I haven't yet given any thought to how it should behave for
anon memory.)

-ritesh


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
  2026-05-03 16:18       ` Ritesh Harjani
@ 2026-05-03 23:48         ` Gregory Price
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Price @ 2026-05-03 23:48 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox, linux-fsdevel, Amir Goldstein, Christian Brauner,
	Jan Kara, lsf-pc, Bharata B Rao, Donet Tom, Aboorva Devarajan,
	linux-mm, Ojaswin Mujoo

On Sun, May 03, 2026 at 09:48:01PM +0530, Ritesh Harjani wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> MADV_POPULATE_READ_NOIO should ensure that only the cached folios
> belonging to that file are mapped into the process address space, w/o
> doing any extra disk I/O. The subsequent mbind() call with MPOL_MF_MOVE
> will then ensure that all the existing mapped folios are migrated to the
> chosen NUMA node, and that any new pages which get faulted in are
> allocated on the chosen NUMA node because of the MPOL_BIND policy.
> 

This all gets rather racy with buffered I/O; I'm not sure we can make
this work the way either one of us wants.  I need to chew on this.

> I believe there might be existing applications facing this problem
> today. This can happen, for instance, when a workload runs multiple
> times and may run across different NUMA nodes. Our internal test team
> once reported a similar performance regression with llama-bench on
> subsequent runs when running it across different NUMA nodes. The reason
> was that the existing page cache folios of the model weight file (from
> the previous run on a different NUMA node) were not getting migrated
> (and the benchmark was not calling MADV_POPULATE_READ, since that can
> pull a large model weight file into the page cache all at once).
> 

There's a long-standing issue of unmapped page cache pages getting
trapped on lower-tier memory; I think that's an orthogonal issue to the
discussion here.

IIRC DAMON can migrate them, I think, and vmscan.c will happily demote
these folios.  So at a minimum we know they're "migratable".

> With that in mind, do we think having something like
> MADV_POPULATE_READ_NOIO makes sense to address such problems? Do we
> have any other use cases for it too?
> Or do we see any problems with this, due to which it never existed?
> 
> (Note that I haven't yet given any thought to how it should behave for
> anon memory.)
>

I'm not sure why this would apply to anon?  Unless the issue is
specifically anon MAP_SHARED.

~Gregory

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-05-03 23:48 UTC | newest]

Thread overview: 10+ messages
2026-04-30 11:33 [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages Ritesh Harjani (IBM)
2026-04-30 13:15 ` Matthew Wilcox
2026-04-30 14:43   ` Ritesh Harjani
2026-05-02 14:57   ` Gregory Price
2026-05-02 15:49     ` Gregory Price
2026-05-03 16:18       ` Ritesh Harjani
2026-05-03 23:48         ` Gregory Price
2026-05-02 23:00     ` Matthew Wilcox
2026-05-03 14:15       ` Gregory Price
2026-04-30 17:32 ` Gregory Price
