From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Gregory Price <gourry@gourry.net>, Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Amir Goldstein <amir73il@gmail.com>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
lsf-pc <lsf-pc@lists.linux-foundation.org>,
Bharata B Rao <bharata@amd.com>,
Donet Tom <donettom@linux.ibm.com>,
Aboorva Devarajan <aboorvad@linux.ibm.com>,
linux-mm@kvack.org, Ojaswin Mujoo <ojaswin@linux.ibm.com>
Subject: Re: [LSF/MM/BPF BoF Session] Numa-Aware Placement for Page Cache Pages
Date: Sun, 03 May 2026 21:48:01 +0530 [thread overview]
Message-ID: <8qa0sc06.ritesh.list@gmail.com> (raw)
In-Reply-To: <afYdH4Alu9QA18CO@gourry-fedora-PF4VCD3F>
Gregory Price <gourry@gourry.net> writes:
> On Sat, May 02, 2026 at 03:57:19PM +0100, Gregory Price wrote:
>> On Thu, Apr 30, 2026 at 02:15:19PM +0100, Matthew Wilcox wrote:
>>
>> fd = open()
>> buf = mmap(fd, ...)
>> mbind(buf, device_node)
>> /* fault file pages directly onto device memory */
>>
>
> maybe more explicitly, something like this
>
> fcntl(fd, F_SET_FILE_NUMA_NODE, gpu_nid); /* pref node */
> mbind(addr, size, MPOL_BIND, ..., MOVE_ALL); /* move existing */
One existing problem with this approach of using MPOL_MF_MOVE[_ALL] (not
specific to the gpu nid usecase here) is that it only considers pages
which are mapped into the process address space. So, unmapped page cache
pages are invisible to it.
So, the problem which I am trying to highlight here is: today there is
no mechanism for a user to move/migrate file-cache pages w/o incurring
extra I/O. The only sequence that works today is to first populate all
the folios belonging to the file into the process address space by
issuing MADV_POPULATE_READ and then issue mbind(..., MPOL_MF_MOVE). But
there is no primitive to populate the process address space with only
those folios which are already present in the page cache (and which may
be residing on a different NUMA node). Note that MADV_POPULATE_READ will
read the missing pages from disk too.
So, maybe something like this. Would this be useful? i.e.
madvise(addr, size, MADV_POPULATE_READ_NOIO);
mbind(addr, size, MPOL_BIND, ..., MPOL_MF_MOVE);
MADV_POPULATE_READ_NOIO should ensure that only the cached folios
belonging to that file are mapped into the process address space, w/o
doing any extra disk I/O. The subsequent mbind() call with MPOL_MF_MOVE
will then ensure that all the existing mapped folios are migrated to the
chosen NUMA node. And any new pages which get faulted in will be
allocated on the chosen NUMA node because of the MPOL_BIND policy.
I believe there may be existing applications facing this problem today.
This can happen, for instance, with a workload which runs multiple times
and may run across different NUMA nodes. Our internal test team once
reported a similar performance regression with llama-bench on subsequent
runs when running it across different NUMA nodes. The reason was that
the existing page cache folios of the model weight file (from the
previous run on a different NUMA node) were not getting migrated,
because the benchmark was not calling MADV_POPULATE_READ (since that can
cause a read of a large model weight file into the page cache all at
once).
With that in mind, do we think something like MADV_POPULATE_READ_NOIO
makes sense to address such problems? Do we have any other use cases for
this too?
Or do we see any problems with it, due to which it never existed?
(Note that I haven't yet given any thought to how it should behave for
anon memory.)
-ritesh