linux-mm.kvack.org archive mirror
From: Chris Li <chrisl@kernel.org>
To: Barry Song <21cnbao@gmail.com>
Cc: Baoquan He <bhe@redhat.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	 kasong@tencent.com, youngjun.park@lge.com, aaron.lu@intel.com,
	 shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v4 mm-new 1/2] mm/swap: do not choose swap device according to numa node
Date: Tue, 14 Oct 2025 23:23:52 -0700
Message-ID: <CACePvbVDB9-r3dTTAJ8e++1swAt9=fPRK9ex_30L=FgXBe5BpQ@mail.gmail.com>
In-Reply-To: <CAGsJ_4yv66kW32Lr--O8qWq3gbrwF110cT7MzqMWumRabFNj1g@mail.gmail.com>

On Tue, Oct 14, 2025 at 10:02 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 15, 2025 at 11:06 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 10/13/25 at 02:09pm, Barry Song wrote:
> > > > -static int swap_node(struct swap_info_struct *si)
> > > > -{
> > > > -       struct block_device *bdev;
> > > > -
> > > > -       if (si->bdev)
> > > > -               bdev = si->bdev;
> > > > -       else
> > > > -               bdev = si->swap_file->f_inode->i_sb->s_bdev;
> > > > -
> > > > -       return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
> > > > -}
> > > > -
> > >
> > > Looking at the code, it seems to have some hardware affinity awareness,
> > > as it uses the swapfile’s bdev’s node_id. Are we regressing cases where
> > > each node has a closer block device?
> >
> > I had talked about this with Chris before I posted v1. We don't need to
> > worry about this because:
> >
> > 1) Kernel code rarely sets disk->node_id; all disks just assign
> > NUMA_NO_NODE to it, except for these:
> >
> > drivers/nvdimm/pmem.c <<pmem_attach_disk>>
> > drivers/md/dm.c <<alloc_dev>>
> >
> > The Intel SSD for which Aaron introduced node-based si selection was
> > probably Optane, which has been discontinued. I could be wrong; if so, I
> > hope Intel can help test so that we can see what the impact is.
> >
> > 2) The gap between disk I/O and memory access
> > Memory access is usually at the nanosecond level, while disk I/O is at
> > the microsecond level, and HDDs can even be at the millisecond level. The
> > nanoseconds saved by node affinity are negligible compared to the disk's
> > own access latency. This includes pmem, whose I/O latency is ten times
> > that of memory access or more.
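
To put rough numbers on point 2 above, here is a minimal sketch that
computes what share of one swap I/O a remote-node penalty would account
for. The function name numa_penalty_share() and the figures in the note
below (a 100 ns penalty, a 10 us SSD access, a 5 ms HDD access) are
illustrative assumptions picked to match the orders of magnitude quoted
above, not measurements.

```c
#include <assert.h>

/*
 * Share of one swap I/O's total latency that is due to a remote-node
 * access penalty. Both arguments are in nanoseconds. This is a
 * back-of-the-envelope model with assumed inputs, not kernel code.
 */
static double numa_penalty_share(double penalty_ns, double device_latency_ns)
{
	assert(penalty_ns >= 0.0 && device_latency_ns > 0.0);
	return penalty_ns / (penalty_ns + device_latency_ns);
}
```

With the assumed 100 ns penalty, numa_penalty_share(100.0, 10000.0) is
100 / 10100, or about 1%, for a 10 us device, and
numa_penalty_share(100.0, 5e6) is about 0.002% for a 5 ms device.
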
>
> I agree that it’s fine to remove the code if the related hardware is obsolete.
> I found a paper [1] showing that accessing local Optane PMEM is much faster
> than accessing remote Optane PMEM (see slides 4 and 5). That might explain why
> they started the project to make swapfile NUMA-aware.

Are you suggesting the swapfile is used for PMEM devices? It sounds
very strange to back a swapfile with PMEM. I am under the impression
that the original commit a2468cc9bfdf was introduced with an Intel SSD
as the swap device under test. I just looked it up. Here is what I found
in the commit log:

======= quote ========
    To see the effect of the patch, a test that starts N process, each mmap
    a region of anonymous memory and then continually write to it at random
    position to trigger both swap in and out is used.

    On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
    are used as swap devices with each attached to a different node, the
    result is:
======= end quote =====

> My point is that we should at least mention this in the changelog to
> honor their past contributions. But since the hardware is no longer used,
> we can remove the code to reduce complexity and stop maintaining it.

Optane was not even supported in Skylake. Commit a2468cc9bfdf has
nothing to do with Optane. The Optane talk in a2468cc9bfdf is just a
red herring. I fail to see why reverting a2468cc9bfdf needs to mention
that Optane is obsolete.

> I see Aaron's email is no longer reachable, which is probably why we haven’t
> received any feedback from them.
>
> [1] https://www.usenix.org/system/files/osdi21_slides_wang-qing.pdf
>
> >
> > If there's a real system whose disks belong to NUMA nodes, we
> > can test to see if the new round-robin way is better or worse than the
> > node-based way.
>
> Yep. If there might be a real user in the future, we can revisit this.
> For now, I agree that we can drop the complexity.

Thank you for the alignment.

Chris
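
For readers comparing the "round robin way" with the "node based way"
discussed above, here is a minimal userspace sketch of priority-based
round-robin selection. It is a simplified model under stated
assumptions, not the code in mm/swapfile.c: struct swap_dev,
pick_swap_dev() and the array scan are hypothetical names and
structures chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a swap device (not struct swap_info_struct). */
struct swap_dev {
	const char *name;
	int prio;        /* larger value means higher priority */
	long free_slots; /* slots still available on this device */
};

/*
 * Return the index of the device to allocate from next, or -1 if every
 * device is full. Among devices with free slots, the highest priority
 * wins; equal-priority devices are rotated by starting the scan just
 * after the previously chosen index (pass last = -1 on the first call).
 * No NUMA node is consulted anywhere in the choice.
 */
static int pick_swap_dev(const struct swap_dev *devs, size_t n, int last)
{
	int best = -1;

	assert(devs != NULL && n > 0 && last >= -1 && (size_t)(last + 1) <= n);

	for (size_t i = 0; i < n; i++) {
		size_t idx = ((size_t)(last + 1) + i) % n;

		if (devs[idx].free_slots <= 0)
			continue;
		/* Strict '>' keeps the first equal-priority device in scan order. */
		if (best < 0 || devs[idx].prio > devs[best].prio)
			best = (int)idx;
	}
	return best;
}
```

In this model, two devices of equal priority are used alternately until
both are full, and only then does a lower-priority device get picked.
The choice depends on priority and rotation alone, which is the
simplification the series argues for.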

