From: Barry Song <21cnbao@gmail.com>
To: chrisl@kernel.org
Cc: 21cnbao@gmail.com, aaron.lu@intel.com, akpm@linux-foundation.org,
bhe@redhat.com, kasong@tencent.com, linux-mm@kvack.org,
nphamcs@gmail.com, shikemeng@huaweicloud.com,
youngjun.park@lge.com
Subject: Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
Date: Wed, 15 Oct 2025 16:09:25 +0800 [thread overview]
Message-ID: <20251015080925.4008-1-21cnbao@gmail.com> (raw)
In-Reply-To: <CACePvbVDB9-r3dTTAJ8e++1swAt9=fPRK9ex_30L=FgXBe5BpQ@mail.gmail.com>
>
> >
> > On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 10/13/25 at 02:09pm, Barry Song wrote:
> > > > > -static int swap_node(struct swap_info_struct *si)
> > > > > -{
> > > > > - struct block_device *bdev;
> > > > > -
> > > > > - if (si->bdev)
> > > > > - bdev = si->bdev;
> > > > > - else
> > > > > - bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > > > -
> > > > > - return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > > > -}
> > > > > -
> > > >
> > > > Looking at the code, it seems to have some hardware affinity awareness,
> > > > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > > > each node has a closer block device?
> > >
> > > I had talked about this with Chris before I posted v1. We don't need to
> > > worry about this because:
> > >
> > > 1) Kernel code rarely sets disk->node_id; all disks just assign
> > > NUMA_NO_NODE to it, except these:
> > >
> > > drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> > > drivers/md/dm.c <<alloc_dev>>
> > >
> > > As for the Intel SSD that Aaron introduced the node-based si choosing
> > > for, it should be Optane, which has been discontinued. I could be
> > > wrong; if so, I hope Intel can help test so that we can see what
> > > impact this brings.
> > >
> > > 2) The gap between disk I/O and memory access
> > > Memory access is usually at the nanosecond level, while disk I/O is
> > > at the microsecond level; an HDD can even be at the millisecond
> > > level. The nanoseconds saved by node affinity are negligible compared
> > > to the disk's own access speed. This includes pmem, whose I/O is more
> > > than ten times slower than memory access, or even more.
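
To put rough, illustrative numbers on point 2 above (my own
back-of-envelope, not measurements): a cross-node memory access penalty
is on the order of ~100 ns, an NVMe SSD read ~100 us, and an HDD seek
~10 ms. The NUMA penalty is then roughly 100 ns / 100 us = 0.1% of the
SSD latency, and about 0.001% of the HDD latency, so it is indeed
negligible next to the device's own access time.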
> >
> > I agree that it’s fine to remove the code if the related hardware is obsolete.
> > I found a paper [1] showing that accessing local Optane PMEM is much faster
> > than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
> > they started the project to make swapfile NUMA-aware.
>
> Are you suggesting the swapfile is used on PMEM devices? It sounds
> very strange to back a swapfile with PMEM. I am under the impression
> that the original commit a2468cc9bfdf was introduced with an Intel SSD
> as the testing swap device. I just looked it up. Here is what I found
> in the commit log:
>
> ======= quote ========
> To see the effect of the patch, a test that starts N process, each mmap
> a region of anonymous memory and then continually write to it at random
> position to trigger both swap in and out is used.
>
> On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
> are used as swap devices with each attached to a different node, the
> result is:
> ======= end quote =====
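
(As an aside, for anyone who wants to reproduce that workload: a
minimal single-process sketch of the kind of test described in the
commit log, my own illustrative reconstruction rather than the actual
test code used for the commit, could look like this:)

/* mmap an anonymous region larger than this process's share of RAM,
 * then keep writing to random offsets so pages are continually
 * swapped in and out. */
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = (size_t)4 << 30;	/* pick > RAM/N to force swap */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	for (;;) {
		/* touching a cold page triggers swap-in; memory
		 * pressure evicts other pages, triggering swap-out */
		size_t pos = (((size_t)rand() << 16) ^ (size_t)rand()) % len;
		buf[pos] = (char)pos;
	}
}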
>
> > My point is that we should at least mention this in the changelog to
> > honor their past contributions. But since the hardware is no longer used,
> > we can remove the code to reduce complexity and stop maintaining it.
>
> Optane was not even supported on Skylake. Commit a2468cc9bfdf has
> nothing to do with Optane. The Optane talk around a2468cc9bfdf is just
> a red herring. I fail to see why reverting a2468cc9bfdf needs to
> mention that Optane is obsolete.

Thanks for the clarification. The Optane discussion turned out to be a goof :-)

Just for the record, the paper [1] also mentions that accessing remote
SSDs can significantly decrease performance. However, it is rare to find
a NUMA machine using SSDs directly as swap devices without a RAM
compression frontend, so I don't think the performance penalty of remote
access would be a problem when choosing a swap device directly.

[1] https://shbakram.github.io/assets/papers/akram-caos12.pdf

Thanks,
Barry