From: Baoquan He <bhe@redhat.com>
To: Barry Song <21cnbao@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, chrisl@kernel.org,
	kasong@tencent.com, youngjun.park@lge.com, aaron.lu@intel.com,
	shikemeng@huaweicloud.com, nphamcs@gmail.com
Subject: Re: [PATCH v4 mm-new 2/2] mm/swap: select swap device with default priority round robin
Date: Tue, 14 Oct 2025 07:07:59 +0800
Message-ID: <aO2GT6qqOu5Qsy4X@MiWiFi-R3L-srv>
In-Reply-To: <CAGsJ_4zSeMfiy=9Pa3A3UtdNigOc=w4eWc1KQpkBbD4AdvmPTA@mail.gmail.com>

On 10/13/25 at 02:17pm, Barry Song wrote:
> On Mon, Oct 13, 2025 at 11:58 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 10/13/25 at 04:40am, Barry Song wrote:
> > > On Sun, Oct 12, 2025 at 5:14 AM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > Swap devices are assumed to have similar accessing speed if no priority
> > > > is specified when swapon. It's unfair and doesn't make sense just because
> > > > one swap device is swapped on firstly, its priority will be higher than
> > > > the one swapped on later.
> > > >
> > > > Here, set all swap devices to have priority '-1' by default. With this
> > > > change, swap device with default priority will be selected round robin
> > > > when swapping out. This can improve the swapping efficiency a lot among
> > > > multiple swap devices with default priority.
> > > >
> > > > Below is the swapon output while a high-pressure vm-scalability test
> > > > is running:
> > > >
> > > > 1) This is pre-commit a2468cc9bfdf: swap devices are selected one by one,
> > > >    by priority from high to low, as each swap device is exhausted:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 ~]# swapon
> > > > NAME       TYPE      SIZE   USED PRIO
> > > > /dev/zram0 partition  16G    16G   -1
> > > > /dev/zram1 partition  16G 966.2M   -2
> > > > /dev/zram2 partition  16G     0B   -3
> > > > /dev/zram3 partition  16G     0B   -4
> > > >
> > > > 2) This is the behaviour with commit a2468cc9bfdf: on a node, the swap
> > > >    device sharing the same node id is selected first until exhausted;
> > > >    on a node with no swap device sharing its node id, the one with the
> > > >    highest priority is selected until exhausted:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 ~]# swapon
> > > > NAME       TYPE      SIZE  USED PRIO
> > > > /dev/zram0 partition  16G 15.7G   -2
> > > > /dev/zram1 partition  16G  3.4G   -3
> > > > /dev/zram2 partition  16G  3.4G   -4
> > > > /dev/zram3 partition  16G  2.6G   -5
> > > >
> > > > 3) After this patch is applied, swap devices with default priority are
> > > >    selected round robin:
> > > > ------------------------------------
> > > > [root@hp-dl385g10-03 block]# swapon
> > > > NAME       TYPE      SIZE USED PRIO
> > > > /dev/zram0 partition  16G 6.6G   -1
> > > > /dev/zram1 partition  16G 6.6G   -1
> > > > /dev/zram2 partition  16G 6.6G   -1
> > > > /dev/zram3 partition  16G 6.6G   -1
> > > >
> > > > With the change, we can see about an 18% efficiency improvement relative
> > > > to the node-based way, as below. (The pre-commit a2468cc9bfdf way is
> > > > certainly the worst.)
> > > >
> >
> > Thanks a lot for reviewing, Barry.
> >
> > >
> > > I’m not against the behavior change; but the swapon man page says:
> > > "
> > >        Each swap area has a priority, either high or low.  The default
> > >        priority is low.  Within the low-priority areas, newer areas are
> > >        even lower priority than older areas.
> >
> > I didn't see this in the swapon(8) man page, but I do see it in swapon(2).
> > That means people may notice the change when they call the swapon()
> > syscall directly, but may not care about it in scripts and the like?
> >
> > > "
> > > So my question is whether users still assume that newly added swap areas
> > > get a lower priority than the older ones?
> > >
> > > I assume the priority decrement isn’t a stable ABI, so this change won’t
> > > break userspace?
> >
> > Hmm, I would say that this will change the assumption, BUT I didn't start
> > it. That assumption has been broken since NUMA-based swap device
> > selection was introduced by the commit below:
> >
> > commit a2468cc9bfdf ("swap: choose swap device according to numa node").
> >
> > Before commit a2468cc9bfdf, swapon behaved strictly as the man page
> > states. The earlier a swap device was added, the higher its default
> > priority. The highest-priority device was used up first, then the
> > second-highest, and so on in sequence. The swapon output below
> > demonstrates this.
> > ===============================
> > [root@hp-dl385g10-03 ~]# swapon
> > NAME       TYPE      SIZE   USED PRIO
> > /dev/zram0 partition  16G    16G   -1
> > /dev/zram1 partition  16G 966.2M   -2
> > /dev/zram2 partition  16G     0B   -3
> > /dev/zram3 partition  16G     0B   -4
> >
> > However, after commit a2468cc9bfdf was applied, the above behaviour
> > changed. To give an extreme example, imagine a system with one NUMA
> > node whose node_id is 0. I swapon several swap devices without a
> > node_id value (namely, node_id is -1), and finally swapon one device
> > with node_id 0. The last one will then have the highest priority to be
> > chosen, ahead of the other swap devices.
> 
> I assume this adds logic to prefer swapping to the closer swapfile first,
> while still maintaining the old behavior for non-NUMA cases.

But it still changes the traditional behaviour, right?
The old swapon(2) man page says nothing about preferring the closer
swapfile on NUMA while maintaining the old behaviour for non-NUMA
cases.

===
Each swap area has a priority, either high or low.  The default
priority is low.  Within the low-priority areas, newer areas are
even lower priority than older areas.
===
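
As a rough illustration of the rule quoted above, here is a toy Python
model (not kernel code; the function and device names are made up for
this example) of how default priorities were assigned and consumed
before commit a2468cc9bfdf: each newly added area gets a priority one
lower than the previous one, and the highest-priority area is filled
before the next one is touched.

```python
def legacy_default_priorities(devices):
    """Assign -1, -2, ... in swapon order (hypothetical helper)."""
    return {name: -(i + 1) for i, name in enumerate(devices)}


def fill_by_priority(priorities, capacity, pages):
    """Fill the highest-priority device first, then the next one, and so on."""
    used = {name: 0 for name in priorities}
    for name in sorted(priorities, key=priorities.get, reverse=True):
        take = min(capacity, pages)
        used[name] = take
        pages -= take
        if pages == 0:
            break
    return used


prios = legacy_default_priorities(["zram0", "zram1", "zram2", "zram3"])
usage = fill_by_priority(prios, capacity=16, pages=17)
print(prios)
print(usage)
```

With 17 units of swap-out and 16 units per device, only zram0 and zram1
are touched in this model; zram2 and zram3 stay empty.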

> 
> >
> > So I would argue that if people really cared about the default priority,
> > it has been broken since 2017, when commit a2468cc9bfdf was introduced,
> > and complaints would have been heard long ago. Since we haven't heard any
> > complaints, does that mean the default priority doesn't really matter?
> > >
> > > Or if someone sets up Linux assuming that a newer swap file will only be
> > > used after the older one is full, then this change would break those cases?
> >
> > Hmm, it could happen, but I doubt people really count on that. I would use
> > 'swapon -p xx' to specify explicit priority to make sure it. In the case you
> > said, swapped out pages will be swapped in, it's either not guaranteed.
> 
> Personally, I also dislike the behavior where a newer swap file
> automatically gets a lower priority than an older one. However, since
> we have a rule to never break userspace, is this considered such a
> case? Or at least, do we need to update the man page as well?

As discussed above, that rule for swapon has already been broken. Of
course, I can update the swapon(2) man page. The swapon(8) man page
needs no change, because it doesn't mention the default priority.

> 
> BTW, we can achieve all the benefits of the round-robin “18%
> efficiency boost” once users set an explicit priority in userspace for
> the four zRAMs you’re using?

Not sure if I understood you correctly. The 18% boost applies only to
devices with default priority. If the user sets an explicit priority via
'swapon -p xx', nothing changes.
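
To make the distinction concrete, here is a toy Python model (not kernel
code; `pick_round_robin`, the device names and the explicit priority
values are invented for illustration) of the selection after this patch:
devices sharing the highest priority are rotated, so four
default-priority devices fill evenly, while devices given distinct
explicit priorities are still used strictly in order.

```python
from collections import Counter
from itertools import cycle


def pick_round_robin(priorities, pages):
    """Spread `pages` across the devices that share the highest priority.

    Toy model: capacity is ignored, so lower-priority devices are never
    reached.
    """
    top = max(priorities.values())
    candidates = [name for name, prio in priorities.items() if prio == top]
    rotation = cycle(candidates)
    return Counter(next(rotation) for _ in range(pages))


# All devices at the new default priority: selected round robin.
default_prio = {"zram0": -1, "zram1": -1, "zram2": -1, "zram3": -1}
# Distinct explicit priorities (as with 'swapon -p'): highest one first.
explicit_prio = {"zram0": 40, "zram1": 30, "zram2": 20, "zram3": 10}

even = pick_round_robin(default_prio, 8)
ordered = pick_round_robin(explicit_prio, 8)
print(dict(even))
print(dict(ordered))
```

In this model the default-priority case spreads the 8 pages as 2 per
device, and the explicit-priority case puts all 8 on zram0.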



