Date: Fri, 13 Jun 2025 19:45:16 +0900
From: YoungJun Park
To: Kairui Song
Cc: Nhat Pham, linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, shikemeng@huaweicloud.com,
 bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev,
 iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
Message-ID:
References: <20250612103743.3385842-1-youngjun.park@lge.com>
 <20250612103743.3385842-3-youngjun.park@lge.com>
In-Reply-To:

On Fri, Jun 13, 2025 at 03:38:37PM +0800, Kairui Song wrote:
> On Fri, Jun 13, 2025 at 3:36 PM Kairui Song wrote:
> >
> > On Fri, Jun 13, 2025 at 3:11 PM YoungJun Park wrote:
> > >
> > > On Thu, Jun 12, 2025 at 01:08:08PM -0700, Nhat Pham wrote:
> > > > On Thu, Jun 12, 2025 at 11:20 AM Kairui Song wrote:
> > > > >
> > > > > On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham wrote:
> > > > > >
> > > > > > On Thu, Jun 12, 2025 at 4:14 AM Kairui Song wrote:
> > > > > > >
> > > > > > > On Thu, Jun 12, 2025 at 6:43 PM wrote:
> > > > > > > >
> > > > > > > > From: "youngjun.park"
> > > > > > > >
> > > > > > >
> > > > > > > Hi, Youngjun,
> > > > > > >
> > > > > > > Thanks for sharing this series.
> > > > > > >
> > > > > > > > This patch implements swap device selection and swap on/off propagation
> > > > > > > > when a cgroup-specific swap priority is set.
> > > > > > > >
> > > > > > > > There is one workaround to this implementation as follows.
> > > > > > > > Current per-cpu swap cluster enforces swap device selection based solely
> > > > > > > > on CPU locality, overriding the swap cgroup's configured priorities.
> > > > > > > >
> > > > > > > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > > > > > > next cluster selector, the problem with current code is that swap
> > > > > > >
> > > > > > What about per-cpu-per-order-per-swap-device :-? Number of swap
> > > > > > devices is gonna be smaller than number of cgroups, right?
> > > > > >
> > > > > Hi Nhat,
> > > > >
> > > > > The problem is per cgroup makes more sense (I was suggested to use
> > > > > cgroup level locality at the very beginning of the implementation of
> > > > > the allocator in the mail list, but it was hard to do so at that
> > > > > time), for container environments, a cgroup is a container that runs
> > > > > one type of workload, so it has its own locality. Things like systemd
> > > > > also organize different desktop workloads into cgroups. The whole
> > > > > point is about cgroup.
> > > > >
> > > > Yeah I know what cgroup represents. Which is why I mentioned in the
> > > > next paragraph that we are still making decisions per-cgroup - we
> > > > just organize the per-cpu cache based on swap devices. This way, two
> > > > cgroups with similar/same priority list can share the clusters, for
> > > > each swapfile, in each CPU. There will be a lot less duplication and
> > > > overhead. And two cgroups with different priority lists won't
> > > > interfere with each other, since they'll target different swapfiles.
> > > >
> > > > Unless we want to nudge the swapfiles/clusters to be self-partitioned
> > > > among the cgroups? :) IOW, each cluster contains pages mostly from a
> > > > single cgroup (with some stragglers mixed in). I suppose that will be
> > > > very useful for swap on rotational drives where read contiguity is
> > > > imperative, but not sure about other backends :-?
> > > > Anyway, no strong opinions to be completely honest :) Was just
> > > > throwing out some ideas. Per-cgroup-per-cpu-per-order sounds good to
> > > > me too, if it's easy to do.
> > > >
> > > Good point!
> > > I agree with the points about self-partitioned clusters and duplicated priority.
> > > One concern is the cost of synchronization,
> > > specifically the one incurred when accessing the prioritized swap device.
> > > From a simple performance perspective, a per-cgroup-per-CPU implementation
> > > seems favorable - in line with the current swap allocation fastpath.
> > >
> > > It seems most reasonable to carefully compare the pros and cons of the
> > > two approaches.
> > >
> > > To summarize,
> > >
> > > Option 1. per-cgroup-per-cpu
> > > Pros: upstream fit, performance.
> > > Cons: duplicated priority (some memory/structure consumption cost),
> > > self-partitioned clusters
> > >
> > > Option 2. per-cpu-per-order(per-device)
> > > Pros: Cons of Option 1
> > > Cons: Pros of Option 1
> > >
> > > It's not easy to draw a definitive conclusion right away,
> > > and I should also evaluate other pros and cons that may arise during actual
> > > implementation, so I'd like to take some time to review things in more detail
> > > and share my thoughts and conclusions in the next patch series.
> > >
> > > What do you think, Nhat and Kairui?
> >
> > Ah, I think what might be best fits here is, each cgroup have a pcp
> > device list, and each device have a pcp cluster list:
> >
> > folio -> mem_cgroup -> swap_priority (maybe a more generic name is better?)
> > -> swap_device_pcp (recording only the *si per order)
> > swap_device_info -> swap_cluster_pcp (cluster offset per order)
>
> Sorry the truncate made this hard to read, let me try again:
>
> folio ->
> mem_cgroup ->
> swap_priority (maybe a more generic name is better?) ->
> swap_device_pcp (recording only the *si per order)
>
> And:
> swap_device_info ->
> swap_cluster_pcp (cluster offset per order)
>
> And if mem_cgroup -> swap_priority is NULL,
> fallback to a global swap_device_pcp.

Thank you for the quick and kind feedback. This is a really good idea :)
On top of my workaround proposal, I just need to add the swap_device_pcp
part along with some refactoring. (A rough sketch of how I currently
picture these structures is at the end of this mail.)

As for the naming: I adopted the term "swap_cgroup_priority" based on the
perspective of the functionality I'm aiming to implement. Some alternative
names that immediately come to mind (just off the top of my head):

* swap_tier, swap_order, swap_selection,
  swap_cgroup_tier, swap_cgroup_order, swap_cgroup_selection

I'll try to come up with a more suitable conceptual name as I continue
working on the patch. In the meantime, I'd appreciate any suggestions or
feedback you may have.

Thanks again for your feedback and suggestions.
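
To make sure I read the layering above correctly, here is a very rough
sketch of how I currently picture it. This is only an illustration for
discussion: the struct layouts, the mem_cgroup->swap_priority field, the
global_swap_device_pcp fallback and pick_swap_device() are placeholders I
made up for this mail (only the names swap_priority / swap_device_pcp /
swap_cluster_pcp come from your description), SWAP_NR_ORDERS is used the
same way the existing per-cpu cluster code uses it, and locking and the
slow path are omitted.

#include <linux/swap.h>
#include <linux/memcontrol.h>
#include <linux/percpu.h>

/* Per-cgroup side: the priority list plus a per-cpu device hint per order. */
struct swap_device_pcp {
	/* only the chosen *si is cached here, one entry per order */
	struct swap_info_struct *si[SWAP_NR_ORDERS];
};

struct swap_priority {
	struct list_head dev_list;		/* devices in cgroup priority order */
	struct swap_device_pcp __percpu *pcp;	/* allocated with alloc_percpu() */
};

/* Per-device side: the cluster offset cache stays with the device itself. */
struct swap_cluster_pcp {
	unsigned int offset[SWAP_NR_ORDERS];	/* next cluster to try, per order */
};

/* Global fallback used when a cgroup has no per-cgroup priority set. */
static DEFINE_PER_CPU(struct swap_device_pcp, global_swap_device_pcp);

static struct swap_info_struct *pick_swap_device(struct folio *folio, int order)
{
	struct mem_cgroup *memcg = folio_memcg(folio);
	struct swap_priority *prio;
	struct swap_device_pcp *dev_pcp;

	/* hypothetical field; NULL means no per-cgroup priority is configured */
	prio = memcg ? READ_ONCE(memcg->swap_priority) : NULL;

	if (prio)
		dev_pcp = this_cpu_ptr(prio->pcp);
	else
		dev_pcp = this_cpu_ptr(&global_swap_device_pcp);

	/* may be NULL or exhausted; the caller falls back to a full device scan */
	return dev_pcp->si[order];
}

If I understand it right, the nice property of this split is that the
cluster offsets (swap_cluster_pcp) stay with the device, so two cgroups
that end up on the same device still share cluster locality, and the
per-cgroup duplication is limited to the small per-cpu *si hints - which
I think also helps with Nhat's concern about duplication and overhead.
Does this match what you had in mind?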