From: Chris Li <chrisl@kernel.org>
To: Barry Song <21cnbao@gmail.com>
Cc: Baoquan He <bhe@redhat.com>,
linux-mm@kvack.org, akpm@linux-foundation.org,
kasong@tencent.com, youngjun.park@lge.com, aaron.lu@intel.com,
shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
Date: Tue, 14 Oct 2025 23:23:52 -0700 [thread overview]
Message-ID: <CACePvbVDB9-r3dTTAJ8e++1swAt9=fPRK9ex_30L=FgXBe5BpQ@mail.gmail.com> (raw)
In-Reply-To: <CAGsJ_4yv66kW32Lr--O8qWq3gbrwF110cT7MzqMWumRabFNj1g@mail.gmail.com>
On Tue, Oct 14, 2025 at 10:02 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 10/13/25 at 02:09pm, Barry Song wrote:
> > > > -static int swap_node(struct swap_info_struct *si)
> > > > -{
> > > > - struct block_device *bdev;
> > > > -
> > > > - if (si->bdev)
> > > > - bdev = si->bdev;
> > > > - else
> > > > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > > -
> > > > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > > -}
> > > > -
> > >
> > > Looking at the code, it seems to have some hardware affinity awareness,
> > > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > > each node has a closer block device?
> >
> > I had talked about this with Chris before I posted v1. We don't need to
> > worry about this because:
> >
> > 1) Kernel code rarely sets disk->node_id; all disks just assign
> > NUMA_NO_NODE to it, except these:
> >
> > drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> > drivers/md/dm.c <<alloc_dev>>
> >
> > The Intel SSD for which Aaron introduced the node-based si selection
> > should be Optane, which has since been discontinued. I could be wrong;
> > if so, I hope Intel can help test so that we can see what impact this brings.
> >
> > 2) The gap between disk I/O and memory access:
> > Memory access is usually at the nanosecond level, while disk I/O is at
> > the microsecond level; an HDD can even take milliseconds. The
> > nanoseconds saved by node affinity are negligible compared to the
> > disk's own access latency. This includes pmem, whose I/O takes ten
> > times as long as a memory access, or more.
>
> I agree that it’s fine to remove the code if the related hardware is obsolete.
> I found a paper [1] showing that accessing local Optane PMEM is much faster
> than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
> they started the project to make swapfile NUMA-aware.
Are you suggesting the swapfile is used for PMEM devices? It sounds
very strange to back a swapfile with PMEM. I am under the impression
that the original commit a2468cc9bfdf was introduced with an Intel SSD
as the swap test device. I just looked it up. Here is what I found
in the commit log:
======= quote ========
To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.
On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:
======= end quote =====
> My point is that we should at least mention this in the changelog to
> honor their past contributions. But since the hardware is no longer used,
> we can remove the code to reduce complexity and stop maintaining it.
Optane was not even supported on Skylake. Commit a2468cc9bfdf has
nothing to do with Optane. The Optane talk around a2468cc9bfdf is just a
red herring. I fail to see why reverting a2468cc9bfdf needs to mention
that Optane is obsolete.
> I see Aaron's email is no longer reachable, which is probably why we haven’t
> received any feedback from them.
>
> [1] https://www.usenix.org/system/files/osdi21_slides_wang-qing.pdf
>
> >
> > If there's a real system whose disks belong to specific NUMA nodes, we
> > can test to see if the new round robin way is better or worse than the
> > node based way.
>
> Yep. If there might be a real user in the future, we can revisit this.
> For now, I agree that we can drop the complexity.
Thank you for the alignment.
Chris
Thread overview: 22+ messages
2025-10-11 8:16 [PATCH v4 mm-new 0/2] mm/swapfile.c: select the swap device with default priority round robin Baoquan He
2025-10-11 8:16 ` [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node Baoquan He
2025-10-11 20:45 ` kernel test robot
2025-10-11 22:04 ` Andrew Morton
2025-10-12 2:08 ` Baoquan He
2025-10-14 11:56 ` Baoquan He
2025-10-13 6:09 ` Barry Song
2025-10-14 21:50 ` Chris Li
2025-10-15 3:06 ` Baoquan He
2025-10-15 5:02 ` Barry Song
2025-10-15 6:23 ` Chris Li [this message]
2025-10-15 8:09 ` Barry Song
2025-10-15 13:27 ` Chris Li
2025-10-11 8:16 ` [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin Baoquan He
2025-10-12 20:40 ` Barry Song
2025-10-13 3:58 ` Baoquan He
2025-10-13 6:17 ` Barry Song
2025-10-13 23:07 ` Baoquan He
2025-10-14 22:11 ` Chris Li
2025-10-15 4:29 ` Barry Song
2025-10-15 6:24 ` Chris Li
2025-10-14 22:01 ` Chris Li