Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority

cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: YoungJun Park <youngjun.park@lge.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, shikemeng@huaweicloud.com,
	kasong@tencent.com, nphamcs@gmail.com, bhe@redhat.com,
	baohua@kernel.org, chrisl@kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
Date: Sun, 17 Aug 2025 01:41:34 +0900	[thread overview]
Message-ID: <aKC0vrr0vIdRV/Ob@yjaykim-PowerEdge-T330> (raw)
In-Reply-To: <uyxkdmnmvjipxuf7gagu2okw7afvzlclomfmc6wb6tygc3mhv6@736m7xs6gn5q>

On Thu, Aug 14, 2025 at 04:03:36PM +0200, Michal Koutný wrote:
> On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park <youngjun.park@lge.com> wrote:

> Let me share my mental model in order to help forming the design.

First of all, thank you very much for your detailed reply. As Friday was a
public holiday in Korea and I had some personal commitments over the weekend,
I only got to check your email late — I hope you can kindly excuse the delayed
response.

For the points that require deeper consideration, I will provide detailed
answers later. For now, let me share some quick feedback on the parts I can
respond to right away.

> I find these per-cgroup swap priorities similar to cpuset -- instead of
> having a configured cpumask (bitmask) for each cgroup, you have
> weight-mask for individual swap devices (or distribution over the
> devices, I hope it's not too big deviation from priority ranking).
> Then you have the hierarchy, so you need a method how to combine
> child+parent masks (or global/root) to obtain effective weight-mask (and
> effective ranking) for each cgroup.
> 
> Furthermore, there's the NUMA autobinding which adds another weight-mask
> to the game but this time it's not configured but it depends on "who is
> asking". (Tasks running on node N would have autobind shifted towards
> devices associated to node N. Is that how autobinding works?)

Yes, your description indeed captures the core concept of how autobinding
works.
 
> From the hierarchy point of view, you have to compound weight-masks in
> top-down preference (so that higher cgroups can override lower) and
> autobind weight-mask that is only conceivable at the very bottom
> (not a cgroup but depending on the task's NUMA placement).
> 
> There I see conflict between the ends a tad. I think the attempted
> reconciliation was to allow emptiness of a single slot in the
> weight-mask but it may not be practical for the compounding (that's why
> you came up with the four variants). So another option would be to allow
> whole weight-mask being empty (or uniform) so that it'd be identity in
> the compounding operation.
> The conflict exists also in the current non-percg priorities -- there
> are the global priorities and autobind priorities. IIUC, the global
> level either defines a weight (user prio) or it is empty (defer to NUMA
> autobinding).
> 
> [I leveled rankings and weight-masks of devices but I left a loophole of
> how the empty slots in the latter would be converted to (and from)
> rankings. This e-mail is already too long.]

Yes. A single slot emptiness is enemy..
The problem arises from two aspects: (1) allowing per-device priorities
inherently leads to the possibility of single-slot emptiness, and (2)
depending on swapon configuration, empty slots may be inevitable. That’s
why the compounding rules ended up allowing this complexity. I’ll review
your suggestions carefully and share soon how we might simplify this
direction.

> 
> An very different alternative that comes to my mind together with
> autobinding and leveraging that to your use case:
> - define virtual NUMA nodes [1],
> - associate separate swap devices to those nodes,
> - utilize task (or actual (mem)cpuset) affinity to those virtual NUMA
>   nodes based on each process's swap requirements,
> - NUMA autobinding would then yield the device constraints you sought.

Creative. I have understood the overall concept for now.

Thank you as always for your valuable insights.

Best regards,  
YoungJun

next prev parent reply	other threads:[~2025-08-16 16:56 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-16 20:20 [PATCH 0/4] mm/swap, memcg: Support per-cgroup swap device priorities Youngjun Park
2025-07-16 20:20 ` [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority Youngjun Park
2025-07-17 11:20   ` kernel test robot
2025-07-22 14:09     ` YoungJun Park
2025-07-18 17:08   ` kernel test robot
2025-07-22 14:11     ` YoungJun Park
2025-07-21 15:13   ` kernel test robot
2025-07-22 14:14     ` YoungJun Park
2025-07-22  8:41   ` Michal Koutný
2025-07-22 14:05     ` YoungJun Park
2025-07-22 18:41       ` YoungJun Park
2025-08-14 14:03         ` Michal Koutný
2025-08-15 15:10           ` Chris Li
2025-08-16 17:21             ` YoungJun Park
2025-08-16 19:15               ` Chris Li
2025-08-19 10:12                 ` YoungJun Park
2025-08-20  0:52                   ` Chris Li
2025-08-20 14:39                     ` YoungJun Park
2025-08-21 20:39                       ` Chris Li
2025-08-22  5:45                         ` YoungJun Park
2025-08-22 16:48                           ` Chris Li
2025-08-24 12:05                             ` YoungJun Park
2025-08-26  8:19                               ` Chris Li
2025-08-26 12:57                                 ` YoungJun Park
2025-08-26 14:30                                   ` Chris Li
2025-08-30  4:05                                     ` YoungJun Park
2025-08-30  7:13                                       ` Chris Li
2025-08-31 13:53                                         ` YoungJun Park
2025-08-31 16:45                                           ` Chris Li
2025-09-01 16:03                                             ` YoungJun Park
2025-09-01 16:06                                             ` YoungJun Park
2025-09-01 22:40                                               ` Chris Li
2025-08-24 14:19                             ` YoungJun Park
2025-08-16 16:41           ` YoungJun Park [this message]
2025-07-16 20:20 ` [PATCH 2/4] mm: swap: Apply per-cgroup swap priority mechanism to swap layer Youngjun Park
2025-07-16 20:20 ` [PATCH 3/4] mm: memcg: Add swap cgroup priority inheritance mechanism Youngjun Park
2025-07-16 20:20 ` [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters Youngjun Park
2025-07-22 17:44   ` Kairui Song
2025-07-22 18:30     ` YoungJun Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aKC0vrr0vIdRV/Ob@yjaykim-PowerEdge-T330 \
    --to=youngjun.park@lge.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=bhe@redhat.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chrisl@kernel.org \
    --cc=gunho.lee@lge.com \
    --cc=hannes@cmpxchg.org \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mkoutny@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nphamcs@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=shikemeng@huaweicloud.com \
    --cc=taejoon.song@lge.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).