From: SeongJae Park
To: YoungJun Park
Cc: SeongJae Park, akpm@linux-foundation.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, chrisl@kernel.org,
 kasong@tencent.com, hannes@cmpxchg.org, mhocko@kernel.org,
 roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev,
 shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
 baohua@kernel.org, gunho.lee@lge.com, taejoon.song@lge.com
Subject: Re: [RFC] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Date: Sat, 15 Nov 2025 08:56:35 -0800
Message-ID: <20251115165637.82966-1-sj@kernel.org>

On Sat, 15 Nov 2025 18:44:56 +0900 YoungJun Park wrote:

> On Fri, Nov 14, 2025 at 05:22:45PM -0800, SeongJae Park wrote:
> > On Sun, 9 Nov 2025 21:49:44 +0900 Youngjun Park wrote:
> >
> > > Hi all,
> > >
> > > In constrained environments, there is a need to improve workload
> > > performance by controlling swap device usage on a per-process or
> > > per-cgroup basis. For example, one might want to direct critical
> > > processes to faster swap devices (like SSDs) while relegating
> > > less critical ones to slower devices (like HDDs or Network Swap).
> > >
> > > The initial approach was to introduce a per-cgroup swap priority
> > > mechanism [1]. However, through review and discussion, several
> > > drawbacks were identified:
> > >
> > > a. There is a lack of concrete use cases for assigning a
> > >    fine-grained, unique swap priority to each cgroup.
> > > b. The implementation complexity was high relative to the desired
> > >    level of control.
> > > c. Differing swap priorities between cgroups could lead to LRU
> > >    inversion problems.
> > >
> > > To address these concerns, I propose the "swap tiers" concept,
> > > originally suggested by Chris Li [2] and further developed through
> > > collaborative discussions. I would like to thank Chris Li and
> > > He Baoquan for their invaluable contributions in refining this
> > > approach, and Kairui Song, Nhat Pham, and Michal Koutný for their
> > > insightful reviews of earlier RFC versions.
> >
> > I think the tiers concept is a nice abstraction. I'm also interested
> > in how the in-kernel control mechanism will deal with tiers
> > management, which is not always simple. I'll try to take time to
> > read this series thoroughly. Thank you for sharing this nice work!
>
> Hi SeongJae,
>
> Thank you for your feedback and interest in the swap tiers concept.
> I appreciate your willingness to review this series.
>
> > Nevertheless, I'm curious if there are simpler and more flexible
> > ways to achieve the goal (control of the swap device to use). For
> > example, extending existing proactive pageout features, such as
> > memory.reclaim, MADV_PAGEOUT or DAMOS_PAGEOUT, to let users specify
> > the swap device to use. Doing such an extension for MADV_PAGEOUT may
> > be challenging, but it might be doable for memory.reclaim and
> > DAMOS_PAGEOUT. Have you considered this kind of option?
>
> Regarding your question about simpler approaches using memory.reclaim,
> MADV_PAGEOUT, or DAMOS_PAGEOUT with swap device specification - I've
> looked into this perspective after reading your comments. This
> approach would indeed be one way to enable per-process swap device
> selection from a broader standpoint.
>
> However, for our use case, per-process granularity feels too
> fine-grained, which is why we've been focusing more on the
> cgroup-based approach.

Thank you for kindly sharing your opinion. That all makes sense.

Nonetheless, I think the limitation applies only to MADV_PAGEOUT, which
would indeed be difficult to apply at the cgroup level. In the case of
memory.reclaim and DAMOS_PAGEOUT, however, I think it can work at the
cgroup level, since memory.reclaim exists per cgroup, and DAMOS_PAGEOUT
has knobs for cgroup-level controls, including cgroup-based DAMOS
filters and the per-node, per-cgroup memory usage based DAMOS quota
goal. Also, if needed for swap tiers, extending DAMOS seems doable,
from my perspective.
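
To be a bit more concrete about what I meant by letting users specify
the swap device (or tier) for memory.reclaim, below is a rough sketch
of the kind of interface extension I was imagining. Note that the
"swap_tier=" argument and the "workload" cgroup are made up purely for
illustration; today's memory.reclaim only takes the amount to reclaim
plus optional key=value arguments such as "swappiness=".

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /*
         * Hypothetical request: reclaim 512 MiB from this cgroup,
         * swapping anon pages out only to devices in a (made-up)
         * "fast" tier. This is expected to fail on current kernels,
         * since "swap_tier=" is not a real memory.reclaim argument.
         */
        const char *path = "/sys/fs/cgroup/workload/memory.reclaim";
        const char *request = "512M swap_tier=fast";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, request, strlen(request)) < 0)
                perror("write");
        close(fd);
        return 0;
}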
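
Also, just to make sure I'm picturing the tiers semantics the way you
intend, here is a tiny userspace model of how I currently read the
per-cgroup selection: each swap device belongs to a tier, each cgroup
is allowed a set of tiers, and swap-out picks the highest-priority
allowed device. Every name below is made up for illustration only;
please correct me if the intended semantics differ.

#include <stdio.h>

enum tier { TIER_SSD = 1 << 0, TIER_HDD = 1 << 1, TIER_NET = 1 << 2 };

struct swap_dev {
        const char *name;
        enum tier tier;
        int prio;       /* larger is preferred, as with swap priorities */
};

/* hypothetical per-cgroup state: the tiers this cgroup may swap to */
struct cgroup_swap {
        const char *cgroup;
        unsigned int allowed_tiers;
};

static const struct swap_dev *pick_dev(const struct swap_dev *devs,
                                       int nr_devs,
                                       const struct cgroup_swap *cg)
{
        const struct swap_dev *best = NULL;

        for (int i = 0; i < nr_devs; i++) {
                if (!(devs[i].tier & cg->allowed_tiers))
                        continue;       /* tier not allowed for this cgroup */
                if (!best || devs[i].prio > best->prio)
                        best = &devs[i];
        }
        return best;
}

int main(void)
{
        const struct swap_dev devs[] = {
                { "/dev/nvme0n1p3", TIER_SSD, 100 },
                { "/dev/sdb2",      TIER_HDD,  50 },
                { "nbd-swap",       TIER_NET,  10 },
        };
        const struct cgroup_swap critical = { "critical", TIER_SSD };
        const struct cgroup_swap batch = { "batch", TIER_HDD | TIER_NET };
        const struct swap_dev *dev;

        dev = pick_dev(devs, 3, &critical);
        printf("%s -> %s\n", critical.cgroup, dev ? dev->name : "none");
        dev = pick_dev(devs, 3, &batch);
        printf("%s -> %s\n", batch.cgroup, dev ? dev->name : "none");
        return 0;
}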

> That said, if we were to aggressively consider the per-process
> approach as well in the future, I'm thinking about how we might
> integrate it with the tier concept (not just individual swap devices).
> During discussions with Chris Li, we also talked about potentially
> tying this to per-VMA control (see the discussion at
> https://lore.kernel.org/linux-mm/CACePvbW_Q6O2ppMG35gwj7OHCdbjja3qUCF1T7GFsm9VDr2e_g@mail.gmail.com/).
> This concept could go beyond just selection at the cgroup layer.

Sounds interesting. In the past, I thought extending DAMOS for
VMA-level control (e.g., asking some DAMOS actions to target only VMAs
of specific names) could be useful. I have no real plan to do that at
the moment, due to the absence of expected usage, but if that could be
used for swap tiers, I would be happy to help.

Thanks,
SJ

[...]