From: "Michal Koutný" <mkoutny@suse.com>
To: YoungJun Park <youngjun.park@lge.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
 roman.gushchin@linux.dev, shakeel.butt@linux.dev,
 muchun.song@linux.dev, shikemeng@huaweicloud.com,
 kasong@tencent.com, nphamcs@gmail.com, bhe@redhat.com,
 baohua@kernel.org, chrisl@kernel.org, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
Date: Thu, 14 Aug 2025 16:03:36 +0200
Message-ID: <uyxkdmnmvjipxuf7gagu2okw7afvzlclomfmc6wb6tygc3mhv6@736m7xs6gn5q>
In-Reply-To: <aH/baxIgrBI3Z1Hl@yjaykim-PowerEdge-T330>
On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:
> This leaves us with a few design options:
>
> 1. Treat negative values as valid priorities. Once any device is
> assigned via `memory.swap.priority`, the NUMA autobind logic is
> entirely disabled.
> - Pros: Simplifies implementation; avoids exposing NUMA autobind via
> cgroup interface.
> - Cons: Overrides autobind for all devices even if only one is set.
>
> 2. Continue to treat negative values as NUMA autobind weights, without
> implicit shifting. If a user assigns `-3`, it is stored and used
> exactly as `-3`, and does not affect other devices.
> - Pros: Simple and intuitive; matches current implementation
> semantics.
> - Cons: Autobind semantics still need to be reasoned about when
> using the interface.
>
> 3. Disallow setting negative values via `memory.swap.priority`.
> Existing NUMA autobind config is preserved, but no new autobind
> configuration is possible from cgroup interface.
> - Pros: Keeps cgroup interface simple; no autobind manipulation.
> - Cons: Autobind infra remains partially active, increasing code
> complexity.
>
> 4. Keep the current design: allow setting negative values to express
> NUMA autobind weights explicitly. Devices without overridden values
> continue to follow NUMA-based dynamic selection.
> - Pros: Preserves current flexibility; gives users control per device.
> - Cons: Slightly more complex semantics; NUMA autobind remains a
> visible part of the interface.
>
> After thinking through these tradeoffs, I'm inclined to think that
> preserving the NUMA autobind option might be the better path forward.
> What are your thoughts on this?
>
> Thank you again for your helpful feedback.

Let me share my mental model to help shape the design.

I find these per-cgroup swap priorities similar to cpuset -- instead of
having a configured cpumask (bitmask) for each cgroup, you have a
weight-mask over the individual swap devices (or a distribution over the
devices; I hope that's not too big a deviation from priority ranking).

Then you have the hierarchy, so you need a method to combine
child+parent masks (or global/root) to obtain the effective weight-mask
(and effective ranking) for each cgroup.

Furthermore, there's the NUMA autobinding, which adds another weight-mask
to the game, but this one is not configured; it depends on "who is
asking". (Tasks running on node N would have autobind shifted towards
devices associated with node N. Is that how autobinding works?)

From the hierarchy point of view, you have to compound the weight-masks
in top-down preference (so that higher cgroups can override lower ones),
plus an autobind weight-mask that is only conceivable at the very bottom
(not a cgroup, but depending on the task's NUMA placement).
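
To make the compounding concrete, here is a minimal sketch in Python.
Everything in it is an assumption made for illustration, not something
taken from the patch series or the kernel: the device names, the weights,
the elementwise-product rule in `compound()`, and the `autobind_mask()`
helper with its `boost` factor. It only shows configured masks being
combined top-down, with the task-dependent autobind mask applied last.

```python
from typing import Dict, List

# Hypothetical representation: swap device name -> non-negative weight.
Mask = Dict[str, float]


def normalize(mask: Mask) -> Mask:
    """Scale weights so they sum to 1 (an all-zero mask is returned as-is)."""
    total = sum(mask.values())
    if total == 0:
        return dict(mask)
    return {dev: w / total for dev, w in mask.items()}


def compound(upper: Mask, lower: Mask) -> Mask:
    """Assumed rule: elementwise product, renormalized.

    A zero weight in `upper` stays zero, so a higher level can veto a
    device regardless of what the lower level asks for.
    """
    return normalize({dev: w * lower.get(dev, 0.0) for dev, w in upper.items()})


def autobind_mask(device_node: Dict[str, int], task_node: int,
                  boost: float = 2.0) -> Mask:
    """Hypothetical autobind mask: favor devices on the task's NUMA node."""
    return {dev: boost if node == task_node else 1.0
            for dev, node in device_node.items()}


def ranking(mask: Mask) -> List[str]:
    """Devices ordered from most to least preferred."""
    return sorted(mask, key=lambda dev: mask[dev], reverse=True)


# Configured masks, top-down: root first, then the child cgroup.
root = {"swapA": 1.0, "swapB": 1.0, "swapC": 0.0}
child = {"swapA": 3.0, "swapB": 1.0, "swapC": 5.0}
effective = compound(root, child)

# The autobind mask is applied last and depends on where the task runs.
device_node = {"swapA": 0, "swapB": 1, "swapC": 1}
for_task_on_node1 = compound(effective, autobind_mask(device_node, task_node=1))
```

In this sketch the child's preference for `swapC` is dropped because the
root gives that device zero weight, and a task on node 1 sees `swapB`
pulled up relative to `swapA` without the configured order being reversed.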

There I see a bit of a conflict between the two ends. I think the
attempted reconciliation was to allow a single slot in the weight-mask
to be empty, but that may not be practical for the compounding (that's
why you came up with the four variants). So another option would be to
allow the whole weight-mask to be empty (or uniform) so that it'd be the
identity in the compounding operation.
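
A small sketch of that identity property, again in Python and again
under assumptions of my own: the mask representation, the device names,
and the elementwise-product rule are hypothetical and only serve to show
the difference between an empty whole mask and a single empty slot.

```python
from typing import Dict, Optional

# Hypothetical representation: swap device name -> non-negative weight.
Mask = Dict[str, float]


def normalize(mask: Mask) -> Mask:
    """Scale weights so they sum to 1 (an all-zero mask is returned as-is)."""
    total = sum(mask.values())
    if total == 0:
        return dict(mask)
    return {dev: w / total for dev, w in mask.items()}


def compound(upper: Mask, lower: Optional[Mask]) -> Mask:
    """Assumed rule: elementwise product, renormalized.

    A wholly empty `lower` (None or {}) means "nothing configured" and
    leaves `upper` unchanged.
    """
    if not lower:
        return normalize(upper)
    return normalize({dev: w * lower.get(dev, 0.0) for dev, w in upper.items()})


parent = {"swapA": 3.0, "swapB": 1.0}

unset = compound(parent, None)                            # whole mask empty
uniform = compound(parent, {"swapA": 5.0, "swapB": 5.0})  # whole mask uniform
one_slot_empty = compound(parent, {"swapA": 2.0})         # swapB slot missing
```

Under this assumed rule, `unset` and `uniform` both reproduce the parent's
distribution, whereas the single missing slot is not neutral: it removes
`swapB` altogether, so a per-slot "empty" would need extra semantics.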

The conflict also exists in the current non-per-cgroup priorities: there
are the global priorities and the autobind priorities. IIUC, the global
level either defines a weight (user prio) or is empty (deferring to NUMA
autobinding).

[I treated rankings and weight-masks of devices as interchangeable, but I
left a loophole: how the empty slots in the latter would be converted to
(and from) rankings. This e-mail is already too long.]

A very different alternative comes to mind, building on autobinding and
leveraging it for your use case:
- define virtual NUMA nodes [1],
- associate separate swap devices with those nodes,
- set task (or actual (mem)cpuset) affinity to those virtual NUMA
  nodes based on each process's swap requirements,
- NUMA autobinding would then yield the device constraints you sought.

HTH,
Michal

[1] Not sure how close this is to the linked series [2], which is AFAIU
    a different kind of virtualization that isn't supposed to be exposed
    to userspace(?).
[2] https://lore.kernel.org/linux-mm/20250429233848.3093350-1-nphamcs@gmail.com/