From: YoungJun Park <youngjun.park@lge.com>
To: Chris Li <chrisl@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>,
akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
muchun.song@linux.dev, shikemeng@huaweicloud.com,
kasong@tencent.com, nphamcs@gmail.com, bhe@redhat.com,
baohua@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, gunho.lee@lge.com,
iamjoonsoo.kim@lge.com, taejoon.song@lge.com,
"Matthew Wilcox" <willy@infradead.org>,
"David Hildenbrand" <david@redhat.com>,
"Kairui Song" <ryncsn@gmail.com>
Subject: Re: [PATCH 1/4] mm/swap, memcg: Introduce infrastructure for cgroup-based swap priority
Date: Sat, 30 Aug 2025 13:05:16 +0900
Message-ID: <aLJ4fEWo7V9Xsz15@yjaykim-PowerEdge-T330>
In-Reply-To: <CACePvbUJSk23sH01msPcNiiiYw7JqWq_7xP1C7iBUN81nxJ36Q@mail.gmail.com>
Hi Chris,
Thanks for the detailed feedback, and sorry for the late reply.
> I think you touch on a very important question that might trigger a
> big design change. Do we want to have a per tier swap.max? It will
> specify not only whether this cgroup will enroll into this tier or
> not. It also controls how much swap it allows to do in this cgroup.
> The swap.max will follow the straight contain relationship. I would
> need to think more about the relationship between swap.max and
> swap.tiers. Initial intuition is that, we might end up with both per
> tier swap.max, which control resource limit, it has subset contain
> relationship. At the same time the swap.tiers which control QoS, it
> does not follow the subset contained.
>
> Need more sleep on that.
When I first worked on this idea, I also considered per-device max values,
with 0 meaning exclusion, to cover cases like a cgroup using only
network swap. The idea was to give each device its own counter, so that
setting its max to 0 would imply exclusion. But this approach would
effectively require per-device page counters similar to the existing
swap.max implementation, and the relationship between these per-device
counters and the global swap.max would need to be carefully defined.
That made the design significantly heavier than the functionality I was
aiming for, so I decided to drop it. I read your point more as a QoS
extension, and I see it as complementary rather than a counterargument.
> First of all, sorry about the pedantic, it should be "swap.tiers" just
> to be consistent with the rest of the discussion.
> Secondly, I just view names as an alias of the number. 1-3 is hard to
> read what you want.
> If we allow name as the alias, we can also do:
> echo zram-hdd > memory.swap.tieres
>
> It is exactly the same thing but much more readable.
>
> > cg1/cg2: 2-4,6 > memory.swap.tie (ssd,hdd,network device, somedevice 2, assuming non-subset is allowed)
>
> echo ssd-network_device,some_device2 > memory.swap.tiers
>
> See, same thing but much more readable what is your intention.
>
> BTW, we should disallow space in tier names.
Ack. Those spaces were only in my example; the implementation will reject
spaces in tier names.
I like the interface format you proposed, and I’ll move forward with an
initial implementation using the name-based tier approach, dropping
the numeric format.
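
To make the name-based format concrete, here is a minimal user-space
sketch in Python of how a string such as
"ssd-network_device,some_device2" could be turned into a tier bitmask.
It is only a model of the semantics discussed above, not the kernel
implementation. The tier ordering in TIERS, the placeholder name
"some_device1", the function name parse_tier_spec, and the rule that
"-" is reserved for ranges are all assumptions made for illustration.

```python
# Hypothetical global tier order, fastest first (assumption for this sketch).
TIERS = ["zram", "ssd", "hdd", "network_device", "some_device1", "some_device2"]


def parse_tier_spec(spec, tier_order):
    """Return a bitmask with bit i set when tier_order[i] is selected.

    Items are separated by ",". An item is either a single tier name or an
    inclusive range "first-last" in tier_order. Whitespace is rejected.
    """
    index = {name: i for i, name in enumerate(tier_order)}
    mask = 0
    for item in spec.split(","):
        if not item or any(ch.isspace() for ch in item):
            raise ValueError(f"invalid tier item: {item!r}")
        first, sep, last = item.partition("-")
        if not sep:
            last = first  # single name, treated as a one-element range
        if first not in index or last not in index:
            raise ValueError(f"unknown tier in {item!r}")
        lo, hi = index[first], index[last]
        if lo > hi:
            raise ValueError(f"reversed range: {item!r}")
        for i in range(lo, hi + 1):
            mask |= 1 << i
    return mask


# "zram-hdd" selects the first three tiers (the numeric "1-3").
mask_a = parse_tier_spec("zram-hdd", TIERS)                         # 0b000111
# Non-contiguous selection (the numeric "2-4,6").
mask_b = parse_tier_spec("ssd-network_device,some_device2", TIERS)  # 0b101110
```

Because "-" marks a range here, this sketch would also need tier names
to exclude "-", in addition to spaces.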
> We do want to think about swap.tiers vs per tier swap.max. One idea
> just brainstorming is that we can have an array of
> "swap.<tiername>.max".
> It is likely we need to have both kinds of interface. Because
> "swap.<tiername>.max" specifies the inclusive child limit.
> "swap.tiers" specifies this C group swap usage QoS. I might not use
> hdd in this cgroup A, but the child cgroup B does. So A's hdd max
> can't be zero.
>
> The other idea is to specify a percentage for each tier of the
> swap.max in "swap.tiers.max": zram:30 sdd:70
> That means zram max is "swap.max * 30%" and ssd max is "swap.max *
> 70%". The number does not need to add up to 100, but can't be bigger
> than 100.
> The sum can be bigger than 100.
>
> Need more sleep on it.
I don’t have additional ideas beyond what you suggested for now. Since swap.max
is defined in terms of quantity, my intuition is that tier.max should
probably also be quantity-based, not percentage-based. As I mentioned earlier,
I had also considered per-device max in the early RFC stage. The design
was to introduce per-device counters, but that added substantial overhead
and complexity, especially in reconciling them with the global swap.max
semantics. For that reason I abandoned the idea, though I agree your
suggestion makes sense in the context of QoS extension.
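
As a rough illustration of why per-tier limits add bookkeeping, the
following Python toy model keeps one counter per tier next to the global
limit and only accepts a charge when both have room. It is a sketch
under assumptions, not a proposed kernel design: the class SwapCounters,
its method names, and the helper percent_to_pages are hypothetical, and
cgroup hierarchy is ignored entirely.

```python
class SwapCounters:
    """Toy model: one global limit plus one usage counter per tier."""

    def __init__(self, swap_max, tier_max):
        self.swap_max = swap_max        # global limit, in pages
        self.tier_max = dict(tier_max)  # per-tier limits in pages; 0 excludes a tier
        self.tier_usage = {name: 0 for name in tier_max}

    def total_usage(self):
        return sum(self.tier_usage.values())

    def try_charge(self, tier, pages):
        """Charge pages to a tier only if the tier and global limits both allow it."""
        if self.tier_usage[tier] + pages > self.tier_max[tier]:
            return False
        if self.total_usage() + pages > self.swap_max:
            return False
        self.tier_usage[tier] += pages
        return True


def percent_to_pages(swap_max, percent):
    """Convert the percentage form from the quoted proposal into a page count."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return swap_max * percent // 100


# Quantity-based and percentage-based settings end up as the same kind of limit.
counters = SwapCounters(
    swap_max=100,
    tier_max={"zram": percent_to_pages(100, 30), "ssd": 70, "network": 0},
)
```

Even in this flat model every charge has to consult two counters; with
parent and child cgroups each level would need the same check.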
At this point I feel the main directions are aligned, so I’ll proceed
with an initial patch version. My current summary is:
1. Global interface to group swap priority ranges into tiers by name
(/sys/kernel/mm/swap/swaptier).
2. Slow path allocation uses bitmask skipping; fast path uses per-cpu
tier cluster caches.
3. Cgroup interface format modeled after cpuset.
4. No inheritance between parent and child cgroups, from a QoS perspective.
5. Runtime modification of tier settings allowed.
6. Keep extensibility and broader use cases in mind.
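
Points 1 and 2 can be modelled in user space as follows. This Python
sketch only illustrates the idea of grouping priority ranges into named
tiers and skipping devices by bitmask on the slow path. The priority
ranges in TIER_RANGES, the device names, and the names SwapDevice,
tier_bit and pick_device are assumptions for illustration, and the
per-cpu fast path is not modelled.

```python
# Hypothetical tier table: (tier name, lowest priority, highest priority).
# The position in the list is the tier's bit number.
TIER_RANGES = [("zram", 100, 199), ("ssd", 50, 99), ("hdd", 0, 49)]


class SwapDevice:
    def __init__(self, name, priority, free_slots):
        self.name = name
        self.priority = priority
        self.free_slots = free_slots


def tier_bit(priority):
    """Return the bit of the tier whose priority range contains priority."""
    for bit, (_name, lo, hi) in enumerate(TIER_RANGES):
        if lo <= priority <= hi:
            return 1 << bit
    return 0  # device does not belong to any tier


def pick_device(devices, cgroup_tier_mask):
    """Walk devices from highest to lowest priority, skipping masked-out tiers."""
    for dev in sorted(devices, key=lambda d: d.priority, reverse=True):
        if not (tier_bit(dev.priority) & cgroup_tier_mask):
            continue  # tier not enabled for this cgroup: skip the device
        if dev.free_slots > 0:
            return dev
    return None


devices = [
    SwapDevice("zram0", 150, 10),
    SwapDevice("nvme0", 60, 10),
    SwapDevice("sda2", 10, 10),
]
```

With this setup a cgroup whose mask is 0b110 (ssd and hdd) never lands
on zram0, while a cgroup with all bits set sees the usual priority order.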
And some open points for further thought:
1. NUMA autobind
- Forbid tier if NUMA priorities exist, and vice versa?
- Should we create a dedicated NUMA tier?
- Other options?
2. swap.tier.max
- percentage vs quantity, and clear use cases.
- sketch concrete real-world scenarios to clarify usage
3. Possible future extensions to VMA-based tier usage.
4. Arbitrary ordering
- Do we really need it?
- If so, maybe provide a separate cgroup interface to reorder tiers.
Best Regards
Youngjun Park