From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 12 Jun 2025 19:16:56 +0800
Subject: Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
To: youngjun.park@lge.com
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com,
	baohua@kernel.org, chrisl@kernel.org, muchun.song@linux.dev,
	iamjoonsoo.kim@lge.com, taejoon.song@lge.com, gunho.lee@lge.com
References: <20250612103743.3385842-1-youngjun.park@lge.com>
	<20250612103743.3385842-3-youngjun.park@lge.com>

On Thu, Jun 12, 2025 at 7:14 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@lge.com> wrote:
> >
> > From: "youngjun.park" <youngjun.park@lge.com>
> >
>
> Hi, Youngjun,
>
> Thanks for sharing this series.
>
> > This patch implements swap device selection and swap on/off propagation
> > when a cgroup-specific swap priority is set.
> >
> > There is one workaround to this implementation as follows.
> > Current per-cpu swap cluster enforces swap device selection based solely
> > on CPU locality, overriding the swap cgroup's configured priorities.
>
> I've been thinking about this: we can switch to a per-cgroup-per-cpu
> next cluster selector. The problem with the current code is that the swap
> allocator is not designed with folio / cgroup in mind at all, so it's
> really ugly to implement, which is why I have the following two patches in
> the swap table series:
>
> https://lore.kernel.org/linux-mm/20250514201729.48420-18-ryncsn@gmail.com/
> https://lore.kernel.org/linux-mm/20250514201729.48420-22-ryncsn@gmail.com/

And BTW this is not the only reason: these two are also quite critical
for getting rid of swap_cgroup_ctrl later, and maybe for switching to the
folio lock for more swap operations, etc. (A rough sketch of the
folio-aware fallback idea is at the end of this mail.)

> The first one makes all swap allocation start with a folio, the
> second one makes the allocator always folio aware. So you can know
> which cgroup is doing the allocation at any time inside the allocator
> (and it reduced the number of arguments, also improving performance :) )
>
> So the allocator can just use the cgroup's swap info if available (plist,
> percpu cluster) and fall back to global locality in a very natural way.
>
> > Therefore, when a swap cgroup priority is assigned, we fall back to
> > using per-CPU clusters per swap device, similar to the previous behavior.
> >
> > A proper fix for this workaround will be evaluated in the next patch.
>
> Hmm, but this is already the last patch in the series?
>
> >
> > Signed-off-by: Youngjun park <youngjun.park@lge.com>
> > ---
> >  include/linux/swap.h      |   8 +++
> >  mm/swap.h                 |   8 +++
> >  mm/swap_cgroup_priority.c | 133 ++++++++++++++++++++++++++++++++++++++
> >  mm/swapfile.c             | 125 ++++++++++++++++++++++++-----------
> >  4 files changed, 238 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 49b73911c1bd..d158b0d5c997 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -283,6 +283,13 @@ enum swap_cluster_flags {
> >  #define SWAP_NR_ORDERS 1
> >  #endif
> >
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +struct percpu_cluster {
> > +        local_lock_t lock; /* Protect the percpu_cluster above */
> > +        unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > +};
> > +#endif
> > +
> >  /*
> >   * We keep using same cluster for rotational device so IO will be sequential.
> >   * The purpose is to optimize SWAP throughput on these device.
> > @@ -341,6 +348,7 @@ struct swap_info_struct {
> >          struct list_head discard_clusters; /* discard clusters list */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> >          int unique_id;
> > +        struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> >  #endif
> >          struct plist_node avail_lists[]; /*
> >                                            * entries in swap_avail_heads, one
> > diff --git a/mm/swap.h b/mm/swap.h
> > index cd2649c632ed..cb6d653fe3f1 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -113,7 +113,15 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
> >  void show_swap_device_unique_id(struct seq_file *m);
> >  #else
> >  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg) {}
> > +static inline void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapon) {}
> > +static inline void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp, bool swapoff){}
> >  static inline void get_swap_unique_id(struct swap_info_struct *si) {}
> > +static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                                        swp_entry_t *entry, int order)
> > +{
> > +        return false;
> > +}
> > +
> >  #endif
> >
> >  #else /* CONFIG_SWAP */
> > diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> > index b3e20b676680..bb18cb251f60 100644
> > --- a/mm/swap_cgroup_priority.c
> > +++ b/mm/swap_cgroup_priority.c
> > @@ -54,6 +54,132 @@ static void get_swap_unique_id(struct swap_info_struct *si)
> >          si->unique_id = atomic_add_return(1, &swap_unique_id_counter);
> >  }
> >
> > +static bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
> > +                                swp_entry_t *entry, int order)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        struct swap_cgroup_priority_pnode *pnode, *next;
> > +        unsigned long offset;
> > +        int node;
> > +
> > +        if (!memcg)
> > +                return false;
> > +
> > +        spin_lock(&swap_avail_lock);
> > +priority_check:
> > +        swap_priority = memcg->swap_priority;
> > +        if (!swap_priority) {
> > +                spin_unlock(&swap_avail_lock);
> > +                return false;
> > +        }
> > +
> > +        node = numa_node_id();
> > +start_over:
> > +        plist_for_each_entry_safe(pnode, next, &swap_priority->plist[node],
> > +                                  avail_lists[node]) {
> > +                struct swap_info_struct *si = pnode->swap;
> > +                plist_requeue(&pnode->avail_lists[node],
> > +                              &swap_priority->plist[node]);
> > +                spin_unlock(&swap_avail_lock);
> > +
> > +                if (get_swap_device_info(si)) {
> > +                        offset = cluster_alloc_swap_entry(si,
> > +                                        order, SWAP_HAS_CACHE, true);
> > +                        put_swap_device(si);
> > +                        if (offset) {
> > +                                *entry = swp_entry(si->type, offset);
> > +                                return true;
> > +                        }
> > +                        if (order)
> > +                                return false;
> > +                }
> > +
> > +                spin_lock(&swap_avail_lock);
> > +
> > +                /* swap_priority is remove or changed under us. */
> > +                if (swap_priority != memcg->swap_priority)
> > +                        goto priority_check;
> > +
> > +                if (plist_node_empty(&next->avail_lists[node]))
> > +                        goto start_over;
> > +        }
> > +        spin_unlock(&swap_avail_lock);
> > +
> > +        return false;
> > +}
> > +
> > +/* add_to_avail_list (swapon / swapusage > 0) */
> > +static void activate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +                                                bool swapon)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        int i;
> > +
> > +        list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +                struct swap_cgroup_priority_pnode *pnode
> > +                        = swap_priority->pnode[swp->type];
> > +
> > +                if (swapon) {
> > +                        pnode->swap = swp;
> > +                        pnode->prio = swp->prio;
> > +                }
> > +
> > +                /* NUMA priority handling */
> > +                for_each_node(i) {
> > +                        if (swapon) {
> > +                                if (swap_node(swp) == i) {
> > +                                        plist_node_init(
> > +                                                &pnode->avail_lists[i],
> > +                                                1);
> > +                                } else {
> > +                                        plist_node_init(
> > +                                                &pnode->avail_lists[i],
> > +                                                -pnode->prio);
> > +                                }
> > +                        }
> > +
> > +                        plist_add(&pnode->avail_lists[i],
> > +                                  &swap_priority->plist[i]);
> > +                }
> > +        }
> > +}
> > +
> > +/* del_from_avail_list (swapoff / swap usage <= 0) */
> > +static void deactivate_swap_cgroup_priority_pnode(struct swap_info_struct *swp,
> > +                                                  bool swapoff)
> > +{
> > +        struct swap_cgroup_priority *swap_priority;
> > +        int nid, i;
> > +
> > +        list_for_each_entry(swap_priority, &swap_cgroup_priority_list, link) {
> > +                struct swap_cgroup_priority_pnode *pnode;
> > +
> > +                if (swapoff && swp->prio < 0) {
> > +                        /*
> > +                         * NUMA priority handling
> > +                         * mimic swapoff prio adjustment without plist
> > +                         */
> > +                        for (int i = 0; i < MAX_SWAPFILES; i++) {
> > +                                pnode = swap_priority->pnode[i];
> > +                                if (pnode->prio > swp->prio ||
> > +                                    pnode->swap == swp)
> > +                                        continue;
> > +
> > +                                pnode->prio++;
> > +                                for_each_node(nid) {
> > +                                        if (pnode->avail_lists[nid].prio != 1)
> > +                                                pnode->avail_lists[nid].prio--;
> > +                                }
> > +                        }
> > +                }
> > +
> > +                pnode = swap_priority->pnode[swp->type];
> > +                for_each_node(i)
> > +                        plist_del(&pnode->avail_lists[i],
> > +                                  &swap_priority->plist[i]);
> > +        }
> > +}
> > +
> >  int create_swap_cgroup_priority(struct mem_cgroup *memcg,
> >                                  int unique[], int prio[], int nr)
> >  {
> > @@ -183,6 +309,12 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >  {
> >          struct swap_cgroup_priority *swap_priority;
> >
> > +        /*
> > +         * XXX: Possible RCU wait? No. Cannot protect priority list addition.
> > +         * swap_avail_lock gives protection.
> > +         * Think about other object protection mechanism
> > +         * might be solve it and better. (e.g object reference)
> > +         */
> >          spin_lock(&swap_avail_lock);
> >          swap_priority = memcg->swap_priority;
> >          if (!swap_priority) {
> > @@ -198,5 +330,6 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
> >
> >          for (int i = 0; i < MAX_SWAPFILES; i++)
> >                  kvfree(swap_priority->pnode[i]);
> > +
> >          kvfree(swap_priority);
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f8e48dd2381e..28afe4ec0504 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -126,8 +126,12 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
> >          .offset = { SWAP_ENTRY_INVALID },
> >          .lock = INIT_LOCAL_LOCK(),
> >  };
> > -/* TODO: better choice? */
> > +/* TODO: better arrangement */
> >  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +static bool get_swap_device_info(struct swap_info_struct *si);
> > +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > +                                        unsigned char usage, bool is_cgroup_priority);
> > +static int swap_node(struct swap_info_struct *si);
> >  #include "swap_cgroup_priority.c"
> >  #endif
> >
> > @@ -776,7 +780,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >                                              struct swap_cluster_info *ci,
> >                                              unsigned long offset,
> >                                              unsigned int order,
> > -                                            unsigned char usage)
> > +                                            unsigned char usage,
> > +                                            bool is_cgroup_priority)
> >  {
> >          unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> >          unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> > @@ -820,12 +825,19 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> >  out:
> >          relocate_cluster(si, ci);
> >          unlock_cluster(ci);
> > +
> >          if (si->flags & SWP_SOLIDSTATE) {
> > -                this_cpu_write(percpu_swap_cluster.offset[order], next);
> > -                this_cpu_write(percpu_swap_cluster.si[order], si);
> > -        } else {
> > +                if (!is_cgroup_priority) {
> > +                        this_cpu_write(percpu_swap_cluster.offset[order], next);
> > +                        this_cpu_write(percpu_swap_cluster.si[order], si);
> > +                } else {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                        __this_cpu_write(si->percpu_cluster->next[order], next);
> > +#endif
> > +                }
> > +        } else
> >                  si->global_cluster->next[order] = next;
> > -        }
> > +
> >          return found;
> >  }
> >
> > @@ -883,7 +895,7 @@ static void swap_reclaim_work(struct work_struct *work)
> >   * cluster for current CPU too.
> >   */
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > -                                        unsigned char usage)
> > +                                        unsigned char usage, bool is_cgroup_priority)
> >  {
> >          struct swap_cluster_info *ci;
> >          unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> > @@ -895,32 +907,38 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >          if (order && !(si->flags & SWP_BLKDEV))
> >                  return 0;
> >
> > -        if (!(si->flags & SWP_SOLIDSTATE)) {
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                local_lock(&si->percpu_cluster->lock);
> > +                offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > +#endif
> > +        } else {
> >                  /* Serialize HDD SWAP allocation for each device. */
> >                  spin_lock(&si->global_cluster_lock);
> >                  offset = si->global_cluster->next[order];
> > -                if (offset == SWAP_ENTRY_INVALID)
> > -                        goto new_cluster;
> > +        }
> > -                ci = lock_cluster(si, offset);
> > -                /* Cluster could have been used by another order */
> > -                if (cluster_is_usable(ci, order)) {
> > -                        if (cluster_is_empty(ci))
> > -                                offset = cluster_offset(si, ci);
> > -                        found = alloc_swap_scan_cluster(si, ci, offset,
> > -                                                        order, usage);
> > -                } else {
> > -                        unlock_cluster(ci);
> > -                }
> > -                if (found)
> > -                        goto done;
> > +        if (offset == SWAP_ENTRY_INVALID)
> > +                goto new_cluster;
> > +
> > +        ci = lock_cluster(si, offset);
> > +        /* Cluster could have been used by another order */
> > +        if (cluster_is_usable(ci, order)) {
> > +                if (cluster_is_empty(ci))
> > +                        offset = cluster_offset(si, ci);
> > +                found = alloc_swap_scan_cluster(si, ci, offset,
> > +                                                order, usage, is_cgroup_priority);
> > +        } else {
> > +                unlock_cluster(ci);
> >          }
> > +        if (found)
> > +                goto done;
> >
> >  new_cluster:
> >          ci = isolate_lock_cluster(si, &si->free_clusters);
> >          if (ci) {
> >                  found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                order, usage);
> > +                                                order, usage, is_cgroup_priority);
> >                  if (found)
> >                          goto done;
> >          }
> > @@ -934,7 +952,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >
> >                  while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        order, usage);
> > +                                                        order, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                          /* Clusters failed to allocate are moved to frag_clusters */
> > @@ -952,7 +970,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                           * reclaimable (eg. lazy-freed swap cache) slots.
> >                           */
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        order, usage);
> > +                                                        order, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                          frags++;
> > @@ -979,21 +997,27 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                  while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
> >                          atomic_long_dec(&si->frag_cluster_nr[o]);
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        0, usage);
> > +                                                        0, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                  }
> >
> >                  while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
> >                          found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                                        0, usage);
> > +                                                        0, usage, is_cgroup_priority);
> >                          if (found)
> >                                  goto done;
> >                  }
> >          }
> >  done:
> > -        if (!(si->flags & SWP_SOLIDSTATE))
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                local_unlock(&si->percpu_cluster->lock);
> > +#endif
> > +        } else {
> >                  spin_unlock(&si->global_cluster_lock);
> > +        }
> > +
> >          return found;
> >  }
> >
> > @@ -1032,6 +1056,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >          for_each_node(nid)
> >                  plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +        deactivate_swap_cgroup_priority_pnode(si, swapoff);
> >  skip:
> >          spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1075,6 +1100,7 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
> >          for_each_node(nid)
> >                  plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> >
> > +        activate_swap_cgroup_priority_pnode(si, swapon);
> >  skip:
> >          spin_unlock(&swap_avail_lock);
> >  }
> > @@ -1200,7 +1226,8 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> >          if (cluster_is_usable(ci, order)) {
> >                  if (cluster_is_empty(ci))
> >                          offset = cluster_offset(si, ci);
> > -                found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > +                found = alloc_swap_scan_cluster(si, ci, offset, order,
> > +                                                SWAP_HAS_CACHE, false);
> >                  if (found)
> >                          *entry = swp_entry(si->type, found);
> >          } else {
> > @@ -1227,7 +1254,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> >                  plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> >                  spin_unlock(&swap_avail_lock);
> >                  if (get_swap_device_info(si)) {
> > -                        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> > +                        offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE, false);
> >                          put_swap_device(si);
> >                          if (offset) {
> >                                  *entry = swp_entry(si->type, offset);
> > @@ -1294,10 +1321,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> >                  }
> >          }
> >
> > -        local_lock(&percpu_swap_cluster.lock);
> > -        if (!swap_alloc_fast(&entry, order))
> > -                swap_alloc_slow(&entry, order);
> > -        local_unlock(&percpu_swap_cluster.lock);
> > +        if (!swap_alloc_cgroup_priority(folio_memcg(folio), &entry, order)) {
> > +                local_lock(&percpu_swap_cluster.lock);
> > +                if (!swap_alloc_fast(&entry, order))
> > +                        swap_alloc_slow(&entry, order);
> > +                local_unlock(&percpu_swap_cluster.lock);
> > +        }
> >
> >          /* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
> >          if (mem_cgroup_try_charge_swap(folio, entry))
> > @@ -1870,7 +1899,7 @@ swp_entry_t get_swap_page_of_type(int type)
> >          /* This is called for allocating swap entry, not cache */
> >          if (get_swap_device_info(si)) {
> >                  if (si->flags & SWP_WRITEOK) {
> > -                        offset = cluster_alloc_swap_entry(si, 0, 1);
> > +                        offset = cluster_alloc_swap_entry(si, 0, 1, false);
> >                          if (offset) {
> >                                  entry = swp_entry(si->type, offset);
> >                                  atomic_long_dec(&nr_swap_pages);
> > @@ -2800,6 +2829,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >          arch_swap_invalidate_area(p->type);
> >          zswap_swapoff(p->type);
> >          mutex_unlock(&swapon_mutex);
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +        free_percpu(p->percpu_cluster);
> > +        p->percpu_cluster = NULL;
> > +#endif
> >          kfree(p->global_cluster);
> >          p->global_cluster = NULL;
> >          vfree(swap_map);
> > @@ -3207,7 +3240,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> >          for (i = 0; i < nr_clusters; i++)
> >                  spin_lock_init(&cluster_info[i].lock);
> >
> > -        if (!(si->flags & SWP_SOLIDSTATE)) {
> > +        if (si->flags & SWP_SOLIDSTATE) {
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +                si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> > +                if (!si->percpu_cluster)
> > +                        goto err_free;
> > +
> > +                int cpu;
> > +                for_each_possible_cpu(cpu) {
> > +                        struct percpu_cluster *cluster;
> > +
> > +                        cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> > +                        for (i = 0; i < SWAP_NR_ORDERS; i++)
> > +                                cluster->next[i] = SWAP_ENTRY_INVALID;
> > +                        local_lock_init(&cluster->lock);
> > +                }
> > +#endif
> > +        } else {
> >                  si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> >                                               GFP_KERNEL);
> >                  if (!si->global_cluster)
> > @@ -3495,6 +3544,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >  bad_swap_unlock_inode:
> >          inode_unlock(inode);
> >  bad_swap:
> > +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> > +        free_percpu(si->percpu_cluster);
> > +        si->percpu_cluster = NULL;
> > +#endif
> >          kfree(si->global_cluster);
> >          si->global_cluster = NULL;
> >          inode = NULL;
> > --
> > 2.34.1
> >
> >
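
To make that concrete, here is a rough sketch only: swap_alloc_folio() is a
made-up name for illustration and is not part of this series or of the swap
table series; the other identifiers are the ones used in the patch quoted
above. The idea is just that once the entry point takes the folio, the
cgroup fallback can be expressed in one place:

static bool swap_alloc_folio(struct folio *folio, swp_entry_t *entry,
                             int order)
{
        struct mem_cgroup *memcg = folio_memcg(folio);

        *entry = (swp_entry_t){};

        /*
         * Illustrative sketch, not from the series: if the folio's cgroup
         * has a priority list, allocate from its plist first.
         * swap_alloc_cgroup_priority() re-checks memcg->swap_priority
         * under swap_avail_lock, as in the patch above.
         */
        if (memcg && swap_alloc_cgroup_priority(memcg, entry, order))
                return true;

        /* Otherwise keep the current CPU-local fast/slow path. */
        local_lock(&percpu_swap_cluster.lock);
        if (!swap_alloc_fast(entry, order))
                swap_alloc_slow(entry, order);
        local_unlock(&percpu_swap_cluster.lock);

        return entry->val != 0;
}

The per-cgroup-per-cpu "next cluster" part would then live behind
cluster_alloc_swap_entry(), which already knows which cgroup is allocating
once the folio is passed down, so no extra is_cgroup_priority plumbing
should be needed.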