From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f178.google.com (mail-pl1-f178.google.com [209.85.214.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9AE2E4A3405 for ; Wed, 21 Jan 2026 14:58:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.178 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769007518; cv=none; b=k0UBS4mcv+P1+iuJzhgtGLVBFoEP22QcyrbI6vJvXc4V0yKf5BiLAk9YAd2K80j84wSvEKPKqClCGDhE6aR43Q2i60nSj2YIPkOatAteVRQwToUE/MFsIg8B0rDhAWXHY9jR/P0ndPNLu4ZXMLFXhbods/crvsv18VUpLLShtn8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769007518; c=relaxed/simple; bh=+OwBnIlWcgkstsBQ2nMNm7ynJcQbUy4My6NYE9UynHE=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=PFHTNpHXr9q/UuiHFZgMrm7nRjsA/9TFQzGO05orPcIGZMwwQSvifU8O5FH7cwt8yHY9iuNv86XC74Rb5WVqaFLw9GAAvYcUYhhVN8kEziF8TBSfPPBqQqOIvoI3gwg7GBU/Rfkvea5gBah4xooG1yrZqQ0ZyrRPuClg9JR9hfs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=WQ3Rldgf; arc=none smtp.client-ip=209.85.214.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="WQ3Rldgf" Received: by mail-pl1-f178.google.com with SMTP id d9443c01a7336-2a7afb2cf09so4614435ad.3 for ; Wed, 21 Jan 2026 06:58:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1769007516; x=1769612316; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=UpnJM6r29OmlECz6ZZJlv11hzQUIyaYlzl/s4DXsy2Q=; b=WQ3RldgfTDTYXmznnbH713BQ3GkU5GIkUehJNgZpdrZQxs0LsdrYjQn/maUr/UoakB 8mOWZ4XNAKYRXUFEHnLXQxjSouixWmXQmkIufNY8U5jdIY3K4ObLO1lJzXeYJaaF6Y00 pQbPiD9Y5lFJ1OGqWl6qsvBf1vq70mi2tEJwV58/2F252AqSJDoa2herbjR5bvM/SrDj pnNY9cmGjtOLr1N3mC9+obWMILEYF/ZPFcfJIgxCWRHkGeSMFA+3dUSkAz12fAe77oMm tv8Wk+7U6ZNq6Gf3tHvu/uAbEg0L6sgnDOZJqLhH7EQYY73F8esqEZrNYe8Y/AxVSJs/ F2CQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1769007516; x=1769612316; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UpnJM6r29OmlECz6ZZJlv11hzQUIyaYlzl/s4DXsy2Q=; b=MqKrpmvFcsM5A9TOyCBoDKEPp4Fyf6tPus7gJ/lY3iLEBGIm9j5xcETOeOHCd2/APa r9Af29QVvu+NIHey71oR4k0X0K4LO/shbShnkMyVcajPBKBAKxnJgWus2egrra8kc+EZ MmiWWqqo4bixh5oF0e9JvgUeHBgZqghI50IysMF/hY70lahWCBmmcu6x3ZkQhXUI03UB IMPsrzrGqgPlpSsD20s0d4jDE5bCgor4EAdIWthhlfGPuPuOkL+KegBi+Dh68VI6Th+j GzrZYlOnM3XQ6ZKXGXuHt/t6+ungaliGmTATd6taaODTVcbhyVcssdB0w+2s3siPTj/m REew== X-Forwarded-Encrypted: i=1; AJvYcCVchI+ofTZdfQFubRISBU5NS1YqGJ/3lUEv1xjNkUr9JvU0mSUCAIyUsrB13yuV807FmwxcsMSG@vger.kernel.org X-Gm-Message-State: AOJu0Ywos8Gc4AGK1KvZ7BZ3AaYj4w1NXWHQ6V6VNmnH8uHN9tnrUkj1 08xpS/zGa0PZrPjgNQSus9yNVIWJzxxzygKkb3oFF5dktHxdLYG5iF8l X-Gm-Gg: AZuq6aJKB9ZG9uvQCWbcU+EQeh/Tf7FYQEGn7E8Vo/4600kOyLpmqKYHFu0B+fD88VL ahGh8OYPh7+gbjaFHzePR1HEc9He4enS12CLHlSPEs81jN4nr7FJBZr8VzrGWRLnZrdRBUD5Y6T Te4p3nyyW+qNF1LKSIN8LlLCRr4Z02aSMDpaI4i0jphzHG6ILgPhhWjfh3qiOdE/3mfywkp3h2B 3Aq3bWQWJQgtf2Cjp/rvxp+mPNE+L2GC9JDcFclieDU64GPNWiV1k/TE+W50g9+YvBMnn2zro0k SbC9yT3w7ifscL29CboyYdJF1BQKv7bWKHz93eCibRm5X5S/nD5yp5cYYVhv61mAS6C53pA0yA7 4kwLrw5/ulBdoO6laox9hlMc/ZQ8MORUJ0ibNupYJyXuEw4sgrsB2B/akSmnfpnJQ8LRR4SOyqJ WOa/5//D84jPnjU/CGdbg7Q/DLOkUGBG1HExjr6nrHdVXzqyI= X-Received: by 2002:a17:902:c949:b0:2a1:3cd8:d2df with SMTP id d9443c01a7336-2a7177db71fmr179679185ad.54.1769007515562; Wed, 21 Jan 2026 06:58:35 -0800 (PST) Received: from KASONG-MC4 ([101.32.222.185]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2a7646119b4sm53170045ad.71.2026.01.21.06.58.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 21 Jan 2026 06:58:34 -0800 (PST) Date: Wed, 21 Jan 2026 22:58:27 +0800 From: Kairui Song To: Chen Ridong Cc: akpm@linux-foundation.org, axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com, david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, corbet@lwn.net, skhan@linuxfoundation.org, hannes@cmpxchg.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, zhengqi.arch@bytedance.com, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, lujialin4@huawei.com Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim Message-ID: References: <20260120134256.2271710-1-chenridong@huaweicloud.com> <20260120134256.2271710-2-chenridong@huaweicloud.com> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260120134256.2271710-2-chenridong@huaweicloud.com> On Tue, Jan 20, 2026 at 01:42:50PM +0800, Chen Ridong wrote: > From: Chen Ridong > > The memcg LRU was originally introduced to improve scalability during > global reclaim. However, it is complex and only works with gen lru > global reclaim. Moreover, its implementation complexity has led to > performance regressions when handling a large number of memory cgroups [1]. > > This patch introduces a per-memcg heat level for reclaim, aiming to unify > gen lru and traditional LRU global reclaim. The core idea is to track > per-node per-memcg reclaim state, including heat, last_decay, and > last_refault. The last_refault records the total reclaimed data from the > previous memcg reclaim. The last_decay is a time-based parameter; the heat > level decays over time if the memcg is not reclaimed again. Both last_decay > and last_refault are used to calculate the current heat level when reclaim > starts. > > Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are > reclaimed first; only if cold memcgs cannot reclaim enough pages, warm > memcgs become eligible for reclaim. Hot memcgs are reclaimed last. > > While this design can be applied to all memcg reclaim scenarios, this patch > is conservative and only introduces heat levels for traditional LRU global > reclaim. Subsequent patches will replace the memcg LRU with > heat-level-based reclaim. > > Based on tests provided by YU Zhao, traditional LRU global reclaim shows > significant performance improvement with heat-level reclaim enabled. > > The results below are from a 2-hour run of the test [2]. > > Throughput (number of requests) before after Change > Total 1734169 2353717 +35% > > Tail latency (number of requests) before after Change > [128s, inf) 1231 1057 -14% > [64s, 128s) 586 444 -24% > [32s, 64s) 1658 1061 -36% > [16s, 32s) 4611 2863 -38% > > [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org > [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ Hi Ridong, Thanks very much for checking the test! The benchmark looks good. While I don't have strong opinion on the whole approach yet as I'm still checking the whole series. But I have some comment and question for this patch: > > Signed-off-by: Chen Ridong > --- > include/linux/memcontrol.h | 7 ++ > mm/memcontrol.c | 3 + > mm/vmscan.c | 227 +++++++++++++++++++++++++++++-------- > 3 files changed, 192 insertions(+), 45 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index af352cabedba..b293caf70034 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -76,6 +76,12 @@ struct memcg_vmstats; > struct lruvec_stats_percpu; > struct lruvec_stats; > > +struct memcg_reclaim_state { > + atomic_long_t heat; > + unsigned long last_decay; > + atomic_long_t last_refault; > +}; > + > struct mem_cgroup_reclaim_iter { > struct mem_cgroup *position; > /* scan generation, increased every round-trip */ > @@ -114,6 +120,7 @@ struct mem_cgroup_per_node { > CACHELINE_PADDING(_pad2_); > unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; > struct mem_cgroup_reclaim_iter iter; > + struct memcg_reclaim_state reclaim; > > #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC > /* slab stats for nmi context */ > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f2b87e02574e..675d49ad7e2c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) > > lruvec_init(&pn->lruvec); > pn->memcg = memcg; > + atomic_long_set(&pn->reclaim.heat, 0); > + pn->reclaim.last_decay = jiffies; > + atomic_long_set(&pn->reclaim.last_refault, 0); > > memcg->nodeinfo[node] = pn; > return true; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4aa73f125772..3759cd52c336 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat, > return inactive_lru_pages > pages_for_compaction; > } > > +enum memcg_scan_level { > + MEMCG_LEVEL_COLD, > + MEMCG_LEVEL_WARM, > + MEMCG_LEVEL_HOT, > + MEMCG_LEVEL_MAX, > +}; This looks similar to MEMCG_LRU_HEAD, MEMCG_LRU_TAIL, MEMCG_LRU_OLD, MEMCG_LRU_YOUNG of the memcg LRU? But now it's unaware of the aging event? > + > +#define MEMCG_HEAT_WARM 4 > +#define MEMCG_HEAT_HOT 8 > +#define MEMCG_HEAT_MAX 12 > +#define MEMCG_HEAT_DECAY_STEP 1 > +#define MEMCG_HEAT_DECAY_INTERVAL (1 * HZ) This is a hardcoded interval (1s), but memcg_decay_heat is driven by reclaim which is kind of random, could be very frequent or not happening at all, that doesn't look pretty by first look. > + > +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta) > +{ > + long heat, new_heat; > + > + if (mem_cgroup_is_root(pn->memcg)) > + return; > + > + heat = atomic_long_read(&pn->reclaim.heat); > + do { > + new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX); The hotness range is 0 - 12, is that a suitable value for all setup and workloads? > + if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat) > + break; > + heat = atomic_long_read(&pn->reclaim.heat); > + } while (1); > +} > + > +static void memcg_decay_heat(struct mem_cgroup_per_node *pn) > +{ > + unsigned long last; > + unsigned long now = jiffies; > + > + if (mem_cgroup_is_root(pn->memcg)) > + return; > + > + last = READ_ONCE(pn->reclaim.last_decay); > + if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL)) > + return; > + > + if (cmpxchg(&pn->reclaim.last_decay, last, now) != last) > + return; > + > + memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP); > +} > + > +static int memcg_heat_level(struct mem_cgroup_per_node *pn) > +{ > + long heat; > + > + if (mem_cgroup_is_root(pn->memcg)) > + return MEMCG_LEVEL_COLD; > + > + memcg_decay_heat(pn); > + heat = atomic_long_read(&pn->reclaim.heat); > + > + if (heat >= MEMCG_HEAT_HOT) > + return MEMCG_LEVEL_HOT; > + if (heat >= MEMCG_HEAT_WARM) > + return MEMCG_LEVEL_WARM; > + return MEMCG_LEVEL_COLD; > +} > + > +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn, > + struct lruvec *lruvec, > + unsigned long scanned, > + unsigned long reclaimed) > +{ > + long delta; > + > + if (mem_cgroup_is_root(pn->memcg)) > + return; > + > + memcg_decay_heat(pn); > + > + /* > + * Memory cgroup heat adjustment algorithm: > + * - If scanned == 0: mark as hottest (+MAX_HEAT) > + * - If reclaimed >= 50% * scanned: strong cool (-2) > + * - If reclaimed >= 25% * scanned: mild cool (-1) > + * - Otherwise: warm up (+1) The naming is bit of confusing I think, no scan doesn't mean it's all hot. Maybe you mean no reclaim? No scan could also mean a empty memcg? > + */ > + if (!scanned) > + delta = MEMCG_HEAT_MAX; > + else if (reclaimed * 2 >= scanned) > + delta = -2; > + else if (reclaimed * 4 >= scanned) > + delta = -1; > + else > + delta = 1; > + > + /* > + * Refault-based heat adjustment: > + * - If refault increase > reclaimed pages: heat up (more cautious reclaim) > + * - If no refaults and currently warm: cool down (allow more reclaim) > + * This prevents thrashing by backing off when refaults indicate over-reclaim. > + */ > + if (lruvec) { > + unsigned long total_refaults; > + unsigned long prev; > + long refault_delta; > + > + total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON); > + total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE); I think you want WORKINGSET_REFAULT_* or WORKINGSET_RESTORE_* here. > + > + prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults); > + refault_delta = total_refaults - prev; > + > + if (refault_delta > reclaimed) > + delta++; > + else if (!refault_delta && delta > 0) > + delta--; > + } > + > + memcg_adjust_heat(pn, delta); > +} > + > static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > { > struct mem_cgroup *target_memcg = sc->target_mem_cgroup; > @@ -5986,7 +6104,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > }; > struct mem_cgroup_reclaim_cookie *partial = &reclaim; > struct mem_cgroup *memcg; > - > + int level; > + int max_level = root_reclaim(sc) ? MEMCG_LEVEL_MAX : MEMCG_LEVEL_WARM; Why limit to MEMCG_LEVEL_WARM when it's not a root reclaim? > /* > * In most cases, direct reclaimers can do partial walks > * through the cgroup tree, using an iterator state that > @@ -5999,62 +6118,80 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > if (current_is_kswapd() || sc->memcg_full_walk) > partial = NULL; > > - memcg = mem_cgroup_iter(target_memcg, NULL, partial); > - do { > - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > - unsigned long reclaimed; > - unsigned long scanned; > - > - /* > - * This loop can become CPU-bound when target memcgs > - * aren't eligible for reclaim - either because they > - * don't have any reclaimable pages, or because their > - * memory is explicitly protected. Avoid soft lockups. > - */ > - cond_resched(); > + for (level = MEMCG_LEVEL_COLD; level < max_level; level++) { > + bool need_next_level = false; > > - mem_cgroup_calculate_protection(target_memcg, memcg); > + memcg = mem_cgroup_iter(target_memcg, NULL, partial); > + do { > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); > + unsigned long reclaimed; > + unsigned long scanned; > + struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id]; > > - if (mem_cgroup_below_min(target_memcg, memcg)) { > - /* > - * Hard protection. > - * If there is no reclaimable memory, OOM. > - */ > - continue; > - } else if (mem_cgroup_below_low(target_memcg, memcg)) { > /* > - * Soft protection. > - * Respect the protection only as long as > - * there is an unprotected supply > - * of reclaimable memory from other cgroups. > + * This loop can become CPU-bound when target memcgs > + * aren't eligible for reclaim - either because they > + * don't have any reclaimable pages, or because their > + * memory is explicitly protected. Avoid soft lockups. > */ > - if (!sc->memcg_low_reclaim) { > - sc->memcg_low_skipped = 1; > + cond_resched(); > + > + mem_cgroup_calculate_protection(target_memcg, memcg); > + > + if (mem_cgroup_below_min(target_memcg, memcg)) { > + /* > + * Hard protection. > + * If there is no reclaimable memory, OOM. > + */ > continue; > + } else if (mem_cgroup_below_low(target_memcg, memcg)) { > + /* > + * Soft protection. > + * Respect the protection only as long as > + * there is an unprotected supply > + * of reclaimable memory from other cgroups. > + */ > + if (!sc->memcg_low_reclaim) { > + sc->memcg_low_skipped = 1; > + continue; > + } > + memcg_memory_event(memcg, MEMCG_LOW); > } > - memcg_memory_event(memcg, MEMCG_LOW); > - } > > - reclaimed = sc->nr_reclaimed; > - scanned = sc->nr_scanned; > + if (root_reclaim(sc) && memcg_heat_level(pn) > level) { > + need_next_level = true; > + continue; > + } > > - shrink_lruvec(lruvec, sc); > + reclaimed = sc->nr_reclaimed; > + scanned = sc->nr_scanned; > > - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, > - sc->priority); > + shrink_lruvec(lruvec, sc); > + if (!memcg || memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B)) If we might have memcg == NULL here, the pn = memcg->nodeinfo[pgdat->node_id] and other memcg operations above looks kind of dangerous. Also why check NR_SLAB_RECLAIMABLE_B if there wasn't such a check previously? Maybe worth a separate patch. > + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, > + sc->priority); > > - /* Record the group's reclaim efficiency */ > - if (!sc->proactive) > - vmpressure(sc->gfp_mask, memcg, false, > - sc->nr_scanned - scanned, > - sc->nr_reclaimed - reclaimed); > + if (root_reclaim(sc)) > + memcg_record_reclaim_result(pn, lruvec, > + sc->nr_scanned - scanned, > + sc->nr_reclaimed - reclaimed); Why only record the reclaim result for root_reclaim? > > - /* If partial walks are allowed, bail once goal is reached */ > - if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) { > - mem_cgroup_iter_break(target_memcg, memcg); > + /* Record the group's reclaim efficiency */ > + if (!sc->proactive) > + vmpressure(sc->gfp_mask, memcg, false, > + sc->nr_scanned - scanned, > + sc->nr_reclaimed - reclaimed); > + > + /* If partial walks are allowed, bail once goal is reached */ > + if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) { > + mem_cgroup_iter_break(target_memcg, memcg); > + break; > + } > + } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial))); > + > + if (!need_next_level) > break; > - } > - } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial))); > + } IIUC you are iterating all the memcg's for up to MEMCG_LEVEL_MAX times and only reclaim certain memcg in each iteration. I think in theory some workload may have a higher overhead since there are actually more iterations, and will this break the reclaim fairness? > } > > static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > -- > 2.34.1