From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 12 Jun 2025 19:16:56 +0800
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
To: youngjun.park@lge.com
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev,
	iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
References: <20250612103743.3385842-1-youngjun.park@lge.com>
	<20250612103743.3385842-3-youngjun.park@lge.com>

On Thu, Jun 12, 2025 at 7:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> >
> > From: "youngjun.park" <youngjun.park@lge.com>
> >
>
> Hi, Youngjun,
>
> Thanks for sharing this series.
>
> > This patch implements swap device selection and swap on/off propagation
> > when a cgroup-specific swap priority is set.
> >
> > There is one workaround to this implementation as follows.
> > Current per-cpu swap cluster enforces swap device selection based solely
> > on CPU locality, overriding the swap cgroup's configured priorities.
>
> I've been thinking about this: we can switch to a per-cgroup-per-cpu
> next cluster selector. The problem with the current code is that the swap
> allocator is not designed with folio / cgroup in mind at all, so it's
> really ugly to implement, which is why I have the following two patches in
> the swap table series:
>
> https://lore.kernel.org/linux-mm/20250514201729.48420-18-ryncsn@gmail.com/
> https://lore.kernel.org/linux-mm/20250514201729.48420-22-ryncsn@gmail.com/

And BTW this is not the only reason: these two are also quite critical
for getting rid of swap_cgroup_ctrl later, and maybe for switching to the
folio lock for more swap operations, etc. (A rough sketch of the
folio-aware fallback idea is at the end of this mail.)

> The first one makes all swap allocation start with a folio, the
> second one makes the allocator always folio aware. So you can know
> which cgroup is doing the allocation at any time inside the allocator
> (and it reduced the number of arguments, also improving performance :) )
>
> So the allocator can just use the cgroup's swap info if available (plist,
> percpu cluster) and fall back to global locality in a very natural way.
>
> > Therefore, when a swap cgroup priority is assigned, we fall back to
> > using per-CPU clusters per swap device, similar to the previous behavior.
> >
> > A proper fix for this workaround will be evaluated in the next patch.
>
> Hmm, but this is already the last patch in the series?
>
> >
> > Signed-off-by: Youngjun park <youngjun.park@lge.com>
> > ---
> >  include/linux/swap.h      |   8 +++
> >  mm/swap.h                 |   8 +++
> >  mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
> >  4 files changed, 238 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 49b73911c1bd..d158b0d5c997 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -283,6 +283,13 @@ enum swap_cluster_flags {
> >  #define SWAP_NR_ORDERS 1
> >  #endif
> >
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +struct percpu_cluster {
> > +        local_lock_t lock; /* Protect the percpu_cluster above */
> > +        unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > +};
> > +#endif
> > +
> >  /*
> >   * We keep using same cluster for rotational device so IO will be sequential.
> >   * The purpose is to optimize SWAP throughput on these device.
> > @@ -341,6 +348,7 @@ struct swap_info_struct {
> >          struct list_head discard_clusters; /* discard clusters list */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> >          int unique_id;
> > +        struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> >  #endif
> >          struct plist_node avail_lists[]; /*
> >                                            * entries in swap_avail_heads, one
> > diff --git a/mm/swap.h b/mm/swap.h
> > index cd2649c632ed..cb6d653fe3f1 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
> >  void show_swap_device_unique_id(struct seq_file *m);
> >  #else
> >  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
> > +static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
> > +static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff){}
> >  static inline void get_swap_unique_id(struct swap_info_struct *si) {}
> > +static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                                        swp_entry_t *entry, int order)
> > +{
> > +        return false;
> > +}
> > +
> >  #endif
> >
> >  #else /* CONFIG_SWAP */
> > diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> > index b3e20b676680..bb18cb251f60 100644
> > --- a/mm/swap_cgroup_priority.c
> > +++ b/mm/swap_cgroup_priority.c
> > @@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
> >          si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
> >  }
> >
> > +static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                                swp_entry_t *entry, int order)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        struct swap_cgroup_priority_pnode *pnode, *next;
> > +        unsigned long offset;
> > +        int node;
> > +
> > +        if (!memcg)
> > +                return false;
> > +
> > +        spin_lock(&swap_avail_lock);
> > +priority_check:
> > +        swap_priority = memcg->swap_priority;
> > +        if (!swap_priority) {
> > +                spin_unlock(&swap_avail_lock);
> > +                return false;
> > +        }
> > +
> > +        node = numa_node_id();
> > +start_over:
> > +        plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
> > +                                  avail_lists[node]) {
> > +                struct swap_info_struct *si = pnode->swap;
> > +                plist_requeue(&pnode->avail_lists[node],
> > +                              &swap_priority->plist[node]);
> > +                spin_unlock(&swap_avail_lock);
> > +
> > +                if (get_swap_device_info(si)) {
> > +                        offset = cluster_alloc_swap_entry(si,
> > +                                        order, SWAP_HAS_CACHE, true);
> > +                        put_swap_device(si);
> > +                        if (offset) {
> > +                                *entry = swp_entry(si->type, offset);
> > +                                return true;
> > +                        }
> > +                        if (order)
> > +                                return false;
> > +                }
> > +
> > +                spin_lock(&swap_avail_lock);
> > +
> > +                /* swap_priority is remove or changed under us. */
> > +                if (swap_priority != memcg->swap_priority)
> > +                        goto priority_check;
> > +
> > +                if (plist_node_empty(&next->avail_lists[node]))
> > +                        goto start_over;
> > +        }
> > +        spin_unlock(&swap_avail_lock);
> > +
> > +        return false;
> > +}
> > +
> > +/* add_to_avail_list (swapon / swapusage > 0) */
> > +static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +                                                bool swapon)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        int i;
> > +
> > +        list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +                struct swap_cgroup_priority_pnode *pnode
> > +                        = swap_priority->pnode[swp->type];
> > +
> > +                if (swapon) {
> > +                        pnode->swap = swp;
> > +                        pnode->prio = swp->prio;
> > +                }
> > +
> > +                /* NUMA priority handling */
> > +                for_each_node(i) {
> > +                        if (swapon) {
> > +                                if (swap_node(swp) == i) {
> > +                                        plist_node_init(
> > +                                                &pnode->avail_lists[i],
> > +                                                1);
> > +                                } else {
> > +                                        plist_node_init(
> > +                                                &pnode->avail_lists[i],
> > +                                                -pnode->prio);
> > +                                }
> > +                        }
> > +
> > +                        plist_add(&pnode->avail_lists[i],
> > +                                  &swap_priority->plist[i]);
> > +                }
> > +        }
> > +}
> > +
> > +/* del_from_avail_list (swapoff / swap usage <= 0) */
> > +static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +                                                  bool swapoff)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        int nid, i;
> > +
> > +        list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +                struct swap_cgroup_priority_pnode *pnode;
> > +
> > +                if (swapoff && swp->prio < 0) {
> > +                        /*
> > +                         * NUMA priority handling
> > +                         * mimic swapoff prio adjustment without plist
> > +                         */
> > +                        for (int i = 0; i < MAX_SWAPFILES; i++) {
> > +                                pnode = swap_priority->pnode[i];
> > +                                if (pnode->prio > swp->prio ||
> > +                                    pnode->swap == swp)
> > +                                        continue;
> > +
> > +                                pnode->prio++;
> > +                                for_each_node(nid) {
> > +                                        if (pnode->avail_lists[nid].prio != 1)
> > +                                                pnode->avail_lists[nid].prio--;
> > +                                }
> > +                        }
> > +                }
> > +
> > +                pnode = swap_priority->pnode[swp->type];
> > +                for_each_node(i)
> > +                        plist_del(&pnode->avail_lists[i],
> > +                                  &swap_priority->plist[i]);
> > +        }
> > +}
> > +
> >  int create_swap_cgroup_priority(struct mem_cgroup *memcg,
> >                                  int unique[], int prio[], int nr)
> >  {
> > @@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >  {
> >          struct swap_cgroup_priority *swap_priority;
> >
> > +        /*
> > +         * XXX: Possible RCU wait? No. Cannot protect priority list addition.
> > +         * swap_avail_lock gives protection.
> > +         * Think about other object protection mechanism
> > +         * might be solve it and better. (e.g object reference)
> > +         */
> >          spin_lock(&swap_avail_lock);
> >          swap_priority = memcg->swap_priority;
> >          if (!swap_priority) {
> > @@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >
> >          for (int i = 0; i < MAX_SWAPFILES; i++)
> >                  kvfree(swap_priority->pnode[i]);
> > +
> >          kvfree(swap_priority);
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f8e48dd2381e..28afe4ec0504 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> >          .offset = { SWAP_ENTRY_INVALID },
> >          .lock = INIT_LOCAL_LOCK(),
> >  };
> > -/* TODO: better choice? */
> > +/* TODO: better arrangement */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +static bool get_swap_device_info(struct swap_info_struct *si);
> > +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > +                                        unsigned char usage, bool is_cgroup_priority);
> > +static int swap_node(struct swap_info_struct *si);
> >  #include "swap_cgroup_priority.c"
> >  #endif
> >
> > @@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >                                              struct swap_cluster_info *ci,
> >                                              unsigned long offset,
> >                                              unsigned int order,
> > -                                            unsigned char usage)
> > +                                            unsigned char usage,
> > +                                            bool is_cgroup_priority)
> >  {
> >          unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> >          unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> > @@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >  out:
> >          relocate_cluster(si, ci);
> >          unlock_cluster(ci);
> > +
> >          if (si->flags & SWP_SOLIDSTATE) {
> > -                this_cpu_write(percpu_swap_cluster.offset[order], next);
> > -                this_cpu_write(percpu_swap_cluster.si[order], si);
> > -        } else {
> > +                if (!is_cgroup_priority) {
> > +                        this_cpu_write(percpu_swap_cluster.offset[order], next);
> > +                        this_cpu_write(percpu_swap_cluster.si[order], si);
> > +                } else {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                        __this_cpu_write(si->percpu_cluster->next[order], next);
> > +#endif
> > +                }
> > +        } else
> >                  si->global_cluster->next[order] = next;
> > -        }
> > +
> >          return found;
> >  }
> >
> > @@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
> >   * cluster for current CPU too.
> >   */
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > -                                        unsigned char usage)
> > +                                        unsigned char usage, bool is_cgroup_priority)
> >  {
> >          struct swap_cluster_info *ci;
> >          unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> > @@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >          if (order && !(si->flags & SWP_BLKDEV))
> >                  return 0;
> >
> > -        if (!(si->flags & SWP_SOLIDSTATE)) {
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                local_lock(&si->percpu_cluster->lock);
> > +                offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > +#endif
> > +        } else {
> >                  /* Serialize HDD SWAP allocation for each device. */
> >                  spin_lock(&si->global_cluster_lock);
> >                  offset = si->global_cluster->next[order];
> > -                if (offset == SWAP_ENTRY_INVALID)
> > -                        goto new_cluster;
> > +        }
> > -                ci = lock_cluster(si, offset);
> > -                /* Cluster could have been used by another order */
> > -                if (cluster_is_usable(ci, order)) {
> > -                        if (cluster_is_empty(ci))
> > -                                offset = cluster_offset(si, ci);
> > -                        found = alloc_swap_scan_cluster(si, ci, offset,
> > -                                                        order, usage);
> > -                } else {
> > -                        unlock_cluster(ci);
> > -                }
> > -                if (found)
> > -                        goto done;
> > +        if (offset == SWAP_ENTRY_INVALID)
> > +                goto new_cluster;
> > +
> > +        ci = lock_cluster(si, offset);
> > +        /* Cluster could have been used by another order */
> > +        if (cluster_is_usable(ci, order)) {
> > +                if (cluster_is_empty(ci))
> > +                        offset = cluster_offset(si, ci);
> > +                found = alloc_swap_scan_cluster(si, ci, offset,
> > +                                                order, usage, is_cgroup_priority);
> > +        } else {
> > +                unlock_cluster(ci);
> >          }
> > +        if (found)
> > +                goto done;
> >
> >  new_cluster:
> >          ci = isolate_lock_cluster(si, &si->free_clusters);
> >          if (ci) {
> >                  found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                order, usage);
> > +                                                order, usage, is_cgroup_priority);
> >                  if (found)
> >                          goto done;
> >          }
> > @@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >
> >                  while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        order, usage);
> > +                                                        order, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                          /* Clusters failed to allocate are moved to frag_clusters */
> > @@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                           * reclaimable (eg. lazy-freed swap cache) slots.
> >                           */
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        order, usage);
> > +                                                        order, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                          frags++;
> > @@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                  while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
> >                          atomic_long_dec(&si->frag_cluster_nr[o]);
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        0, usage);
> > +                                                        0, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                  }
> >
> >                  while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        0, usage);
> > +                                                        0, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                  }
> >          }
> >  done:
> > -        if (!(si->flags & SWP_SOLIDSTATE))
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                local_unlock(&si->percpu_cluster->lock);
> > +#endif
> > +        } else {
> >                  spin_unlock(&si->global_cluster_lock);
> > +        }
> > +
> >          return found;
> >  }
> >
> > @@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >          for_each_node(nid)
> >                  plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +        deactivate_swap_cgroup_priority_pnode(si, swapoff);
> >  skip:
> >          spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
> >          for_each_node(nid)
> >                  plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +        activate_swap_cgroup_priority_pnode(si, swapon);
> >  skip:
> >          spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> >          if (cluster_is_usable(ci, order)) {
> >                  if (cluster_is_empty(ci))
> >                          offset = cluster_offset(si, ci);
> > -                found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > +                found = alloc_swap_scan_cluster(si, ci, offset, order,
> > +                                                SWAP_HAS_CACHE, false);
> >                  if (found)
> >                          *entry = swp_entry(si->type, found);
> >          } else {
> > @@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> >                  plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> >                  spin_unlock(&swap_avail_lock);
> >                  if (get_swap_device_info(si)) {
> > -                        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> > +                        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
> >                          put_swap_device(si);
> >                          if (offset) {
> >                                  *entry = swp_entry(si->type, offset);
> > @@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> >                  }
> >          }
> >
> > -        local_lock(&percpu_swap_cluster.lock);
> > -        if (!swap_alloc_fast(&entry, order))
> > -                swap_alloc_slow(&entry, order);
> > -        local_unlock(&percpu_swap_cluster.lock);
> > +        if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
> > +                local_lock(&percpu_swap_cluster.lock);
> > +                if (!swap_alloc_fast(&entry, order))
> > +                        swap_alloc_slow(&entry, order);
> > +                local_unlock(&percpu_swap_cluster.lock);
> > +        }
> >
> >          /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
> >          if (mem_cgroup_try_charge_swap(folio, entry))
> > @@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
> >          /* This is called for allocating swap entry, not cache */
> >          if (get_swap_device_info(si)) {
> >                  if (si->flags & SWP_WRITEOK) {
> > -                        offset = cluster_alloc_swap_entry(si, 0, 1);
> > +                        offset = cluster_alloc_swap_entry(si, 0, 1, false);
> >                          if (offset) {
> >                                  entry = swp_entry(si->type, offset);
> >                                  atomic_long_dec(&nr_swap_pages);
> > @@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >          arch_swap_invalidate_area(p->type);
> >          zswap_swapoff(p->type);
> >          mutex_unlock(&swapon_mutex);
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +        free_percpu(p->percpu_cluster);
> > +        p->percpu_cluster = NULL;
> > +#endif
> >          kfree(p->global_cluster);
> >          p->global_cluster = NULL;
> >          vfree(swap_map);
> > @@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >          for (i = 0; i < nr_clusters; i++)
> >                  spin_lock_init(&cluster_info[i].lock);
> >
> > -        if (!(si->flags & SWP_SOLIDSTATE)) {
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> > +                if (!si->percpu_cluster)
> > +                        goto err_free;
> > +
> > +                int cpu;
> > +                for_each_possible_cpu(cpu) {
> > +                        struct percpu_cluster *cluster;
> > +
> > +                        cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> > +                        for (i = 0; i < SWAP_NR_ORDERS; i++)
> > +                                cluster->next[i] = SWAP_ENTRY_INVALID;
> > +                        local_lock_init(&cluster->lock);
> > +                }
> > +#endif
> > +        } else {
> >                  si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> >                                               GFP_KERNEL);
> >                  if (!si->global_cluster)
> > @@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >  bad_swap_unlock_inode:
> >          inode_unlock(inode);
> >  bad_swap:
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +        free_percpu(si->percpu_cluster);
> > +        si->percpu_cluster = NULL;
> > +#endif
> >          kfree(si->global_cluster);
> >          si->global_cluster = NULL;
> >          inode = NULL;
> > --
> > 2.34.1
> >
> >
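
To make that concrete, here is a rough sketch only: swap_alloc_folio() is a
made-up name for illustration and is not part of this series or of the swap
table series; the other identifiers are the ones used in the patch quoted
above. The idea is just that once the entry point takes the folio, the
cgroup fallback can be expressed in one place:

static bool swap_alloc_folio(struct folio *folio, swp_entry_t *entry,
                             int order)
{
        struct mem_cgroup *memcg = folio_memcg(folio);

        *entry = (swp_entry_t){};

        /*
         * Illustrative sketch, not from the series: if the folio's cgroup
         * has a priority list, allocate from its plist first.
         * swap_alloc_cgroup_priority() re-checks memcg->swap_priority
         * under swap_avail_lock, as in the patch above.
         */
        if (memcg && swap_alloc_cgroup_priority(memcg, entry, order))
                return true;

        /* Otherwise keep the current CPU-local fast/slow path. */
        local_lock(&percpu_swap_cluster.lock);
        if (!swap_alloc_fast(entry, order))
                swap_alloc_slow(entry, order);
        local_unlock(&percpu_swap_cluster.lock);

        return entry->val != 0;
}

The per-cgroup-per-cpu "next cluster" part would then live behind
cluster_alloc_swap_entry(), which already knows which cgroup is allocating
once the folio is passed down, so no extra is_cgroup_priority plumbing
should be needed.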