Date: Sat, 30 May 2026 01:37:18 +0000
From: Yosry Ahmed
To: Hao Jia
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
	nphamcs@gmail.com, chengming.zhou@linux.dev, muchun.song@linux.dev,
	roman.gushchin@linux.dev, cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
Subject: Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
References: <20260526114601.67041-1-jiahao.kernel@gmail.com>
	<20260526114601.67041-3-jiahao.kernel@gmail.com>
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>
X-Mailing-List: linux-doc@vger.kernel.org

On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia
>
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
>
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
>
> Example usage:
>   # Write back 100MB of pages from zswap to the backing swap
>   echo "100M zswap_writeback_only" > memory.reclaim
>
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
>
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
>
> Suggested-by: Yosry Ahmed
> Suggested-by: Nhat Pham
> Signed-off-by: Hao Jia
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  18 +++-
>  Documentation/admin-guide/mm/zswap.rst  |  11 +-
>  include/linux/zswap.h                   |   7 ++
>  mm/vmscan.c                             |  14 +++
>  mm/zswap.c                              | 138 ++++++++++++++++++++++++
>  5 files changed, 185 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>
>  	The following nested keys are defined.
>
> -	  ==========            ================================
> +	  ====================  ==================================================
>  	  swappiness            Swappiness value to reclaim with
> -	  ==========            ================================
> +	  zswap_writeback_only  Only perform proactive zswap writeback
> +	  ====================  ==================================================
>
>  	Specifying a swappiness value instructs the kernel to perform
>  	the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
>  	The valid range for swappiness is [0-200, max], setting
>  	swappiness=max exclusively reclaims anonymous memory.
>
> +	The zswap_writeback_only key skips ordinary memory reclaim and
> +	writes back pages from zswap to the backing swap device until
> +	the requested amount has been written or no further candidates
> +	are found. This is useful to proactively offload cold pages from
> +	the zswap pool to the swap device. It is only available if
> +	zswap writeback is enabled. zswap_writeback_only cannot be combined
> +	with swappiness; specifying both returns -EINVAL.
> +
> +	Example::
> +
> +	  # Write back up to 100MB of pages from zswap to the backing swap
> +	  echo "100M zswap_writeback_only" > memory.reclaim

memcg folks need to chime in about the interface here. An alternative
would be a separate interface (e.g. memory.zswap.do_writeback or
memory.zswap.writeback.reclaim or sth).

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
>  	return 0;
>  }
>
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH 128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> +					 unsigned long nr_to_write)
> +{
> +	unsigned long nr_written = 0;
> +	int nid;
> +
> +	if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +		return -ENOENT;
> +
> +	if (!mem_cgroup_online(memcg))
> +		return -ENOENT;
> +
> +	for_each_node_state(nid, N_NORMAL_MEMORY) {
> +		bool encountered_page_in_swapcache = false;
> +		unsigned long nr_to_scan, nr_scanned = 0;
> +
> +		/*
> +		 * Cap by LRU length: bounds rewalks when referenced
> +		 * entries keep rotating to the tail.
> +		 */
> +		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> +		if (!nr_to_scan)
> +			continue;
> +
> +		/*
> +		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> +		 * to the remaining writeback budget.
> +		 */
> +		nr_to_scan = min(nr_to_scan,
> +			(nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> +		while (nr_scanned < nr_to_scan) {
> +			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> +						       nr_to_scan - nr_scanned);
> +
> +			if (signal_pending(current))
> +				return nr_written;
> +
> +			/*
> +			 * Account for the committed budget rather than the walker's
> +			 * actual delta. If the list is emptied concurrently, the
> +			 * walker visits nothing and nr_scanned would never advance.
> +			 */
> +			nr_scanned += nr_to_walk;
> +
> +			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> +							&shrink_memcg_cb,
> +							&encountered_page_in_swapcache,
> +							&nr_to_walk);
> +
> +			if (nr_written >= nr_to_write)
> +				return nr_written;
> +			if (encountered_page_in_swapcache)
> +				break;
> +
> +			cond_resched();
> +		}
> +	}
> +
> +	return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> +			      unsigned long nr_to_writeback)
> +{
> +	struct mem_cgroup *iter_memcg;
> +	unsigned long nr_written = 0;
> +	int failures = 0, attempts = 0;
> +
> +	if (!memcg)
> +		return -EINVAL;
> +	if (!nr_to_writeback)
> +		return 0;
> +
> +	/*
> +	 * Writeback will be aborted with -EAGAIN if we encounter
> +	 * the following MAX_RECLAIM_RETRIES times:
> +	 * - No writeback-candidate memcgs found in a subtree walk.
> +	 * - A writeback-candidate memcg wrote back zero pages.
> +	 */
> +	while (nr_written < nr_to_writeback) {
> +		unsigned long batch_size;
> +		long shrunk;
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> +		if (!iter_memcg) {
> +			/*
> +			 * Continue without incrementing failures if we found
> +			 * candidate memcgs in the last subtree walk.
> +			 */
> +			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> +				return -EAGAIN;
> +			attempts = 0;
> +			continue;
> +		}
> +
> +		batch_size = min(nr_to_writeback - nr_written,
> +				 ZSWAP_PROACTIVE_WB_BATCH);
> +		shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> +		mem_cgroup_put(iter_memcg);
> +
> +		/* Writeback-disabled or offline: skip without counting. */
> +		if (shrunk == -ENOENT)
> +			continue;
> +
> +		++attempts;
> +		if (shrunk > 0)
> +			nr_written += shrunk;
> +		else if (++failures == MAX_RECLAIM_RETRIES)
> +			return -EAGAIN;
> +
> +		cond_resched();
> +	}
> +
> +	return 0;
> +}
> +

There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().
Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient.

Over-reclaim is of course a concern, especially in the zswap_store()
path, where the overhead can be noticeable. Maybe we can parameterize
the batch size based on the code path.

Nhat, what do you think?
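
To make the batch-size idea a bit more concrete, here is a rough
userspace sketch. It is not kernel code and not a proposed patch:
enum wb_path, wb_batch_size(), struct toy_lru, toy_walk() and
toy_shrink() are all made-up names, the toy ring buffer only imitates
the second-chance rotation, and the batch values 1 and 32 are
placeholders (128 is borrowed from ZSWAP_PROACTIVE_WB_BATCH above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical call sites that would share one shrink helper. */
enum wb_path {
	WB_PATH_STORE,		/* pool limit hit while storing: latency sensitive */
	WB_PATH_SHRINKER,	/* background shrinker */
	WB_PATH_PROACTIVE,	/* user-requested writeback */
};

/* Placeholder values; only 128 comes from the patch under review. */
static unsigned long wb_batch_size(enum wb_path path)
{
	switch (path) {
	case WB_PATH_STORE:
		return 1;
	case WB_PATH_SHRINKER:
		return 32;
	case WB_PATH_PROACTIVE:
		return 128;
	}
	return 1;
}

/* Toy stand-in for one LRU: a ring buffer of "referenced" bits. */
struct toy_lru {
	bool *referenced;
	size_t cap;	/* ring capacity */
	size_t head;	/* index of the oldest entry */
	size_t len;	/* number of live entries */
};

/* Visit up to nr_to_walk entries; return how many were written back. */
static unsigned long toy_walk(struct toy_lru *lru, unsigned long nr_to_walk)
{
	unsigned long written = 0;

	while (nr_to_walk && lru->len) {
		bool ref = lru->referenced[lru->head];

		nr_to_walk--;
		lru->head = (lru->head + 1) % lru->cap;
		lru->len--;
		if (ref) {
			/* Second chance: clear the bit, rotate to the tail. */
			size_t tail = (lru->head + lru->len) % lru->cap;

			lru->referenced[tail] = false;
			lru->len++;
		} else {
			written++;
		}
	}
	return written;
}

/* One shrink loop for every path; only the batch size differs. */
static unsigned long toy_shrink(struct toy_lru *lru, unsigned long nr_to_write,
				enum wb_path path)
{
	unsigned long batch = wb_batch_size(path);
	/* Two passes at most, so rotated entries can be revisited once. */
	unsigned long nr_to_scan = 2 * lru->len;
	unsigned long nr_scanned = 0, nr_written = 0;

	assert(batch > 0);
	while (nr_written < nr_to_write && nr_scanned < nr_to_scan && lru->len) {
		unsigned long nr_to_walk = batch;

		if (nr_to_walk > nr_to_scan - nr_scanned)
			nr_to_walk = nr_to_scan - nr_scanned;
		nr_scanned += nr_to_walk;
		nr_written += toy_walk(lru, nr_to_walk);
	}
	return nr_written;
}
```

With eight cold entries and a request for one page, toy_shrink() under
WB_PATH_STORE writes back exactly one entry, while under
WB_PATH_PROACTIVE a single batch drains all eight. That is the
over-reclaim trade-off in miniature, and why the store path would
probably want the smallest batch.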