Date: Tue, 30 Jun 2026 09:49:03 +0800
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 4/6] mm/zswap: Implement proactive writeback
To: Yosry Ahmed
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
 shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
 nphamcs@gmail.com, chengming.zhou@linux.dev, muchun.song@linux.dev,
 roman.gushchin@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Hao Jia
References: <20260629112032.20423-1-jiahao.kernel@gmail.com>
 <20260629112032.20423-5-jiahao.kernel@gmail.com>
From: Hao Jia

On 2026/6/30 08:15, Yosry Ahmed wrote:
> On Mon, Jun 29, 2026 at 07:20:30PM +0800, Hao Jia wrote:
>> From: Hao Jia
>>
>> Zswap currently writes back pages to backing swap reactively, triggered
>> either by the shrinker or when the pool reaches its size limit. There is
>> no mechanism to control the amount of writeback for a specific memory
>> cgroup. However, users may want to proactively write back zswap pages,
>> e.g., to free up memory for other applications or to prepare for
>> memory-intensive workloads.
>>
>> Introduce a "source=" key to the memory.reclaim cgroup interface,
>> currently accepting the single value "zswap". When set to "zswap", it
>> bypasses standard memory reclaim and exclusively performs proactive
>> zswap writeback up to the requested budget. If omitted, the default
>> reclaim behavior remains unchanged.
>>
>> Example usage:
>>   # Write back 10MB of compressed data from zswap to the backing swap
>>   echo "10M source=zswap" > memory.reclaim
>>
>> Note that the actual amount of compressed data written back may be less
>> than requested due to the zswap second-chance algorithm: referenced
>> entries are rotated on the LRU on the first encounter and only written
>> back on a second pass. If fewer bytes are written back than requested,
>> -EAGAIN is returned, matching the existing memory.reclaim semantics.
>>
>> Internally, extend user_proactive_reclaim() to parse the new "source="
>> key and invoke the dedicated handler zswap_proactive_writeback() when it
>> is set to "zswap". This handler walks the target memcg subtree in a
>> round-robin fashion and drains each memcg's per-node zswap LRUs through
>> shrink_memcg(), accumulating the compressed bytes written back until the
>> requested budget is met.
>>
>> Suggested-by: Yosry Ahmed
>> Suggested-by: Nhat Pham
>> Signed-off-by: Hao Jia
>> ---
>
> Before going through more versions we need to figure out if this will
> pivot to be a proactive demotion interface for swap tiering.
>

Yes. Should I drop patches 4-6 in the next version and wait for swap
tiering to be finalized?

We can try to get the non-memcg parts (patches 1-3) merged upstream
first. This would also give them plenty of time to bake and catch any
potential regressions. Thoughts?

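For reference, here is a rough userspace model of how the proposed
"10M source=zswap" string would be interpreted. It is only a sketch under
stated assumptions, not the kernel code: struct reclaim_req, parse_size()
and parse_reclaim() are made-up names, parse_size() merely approximates
memparse() for the K/M/G suffixes, "swappiness=max" is not handled, and
the "source=" key itself only exists in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of a memory.reclaim request. */
struct reclaim_req {
	uint64_t bytes;		/* requested budget in bytes */
	int swappiness;		/* -1 when not given */
	bool zswap_only;	/* true when "source=zswap" was given */
};

/* Rough userspace stand-in for memparse(): number plus optional K/M/G. */
static uint64_t parse_size(char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G':
	case 'g':
		v <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		v <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Returns 0 on success or -EINVAL, mirroring the checks in the patch. */
static int parse_reclaim(const char *input, struct reclaim_req *req)
{
	char buf[64], *p, *tok;

	assert(input && req);
	if (strlen(input) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, input);

	req->swappiness = -1;
	req->zswap_only = false;
	req->bytes = parse_size(buf, &p);
	if (p == buf)
		return -EINVAL;	/* no size was given */

	for (tok = strtok(p, " "); tok; tok = strtok(NULL, " ")) {
		if (!strncmp(tok, "source=", 7)) {
			/* Only "zswap" is accepted as a source. */
			if (strcmp(tok + 7, "zswap"))
				return -EINVAL;
			req->zswap_only = true;
		} else if (!strncmp(tok, "swappiness=", 11)) {
			req->swappiness = atoi(tok + 11);
		} else {
			return -EINVAL;
		}
	}

	/* source=zswap and swappiness are mutually exclusive. */
	if (req->zswap_only && req->swappiness != -1)
		return -EINVAL;
	return 0;
}
```

With this model, "10M source=zswap" yields a 10 MiB byte budget with
zswap_only set, while "10M source=zswap swappiness=60" is rejected.
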
>> @@ -7869,9 +7872,12 @@ int user_proactive_reclaim(char *buf,
>>  	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>>  	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>>  	int swappiness = -1;
>> +	bool zswap_writeback_only = false;
>>  	char *old_buf, *start;
>> +	char source[16];
>>  	substring_t args[MAX_OPT_ARGS];
>>  	gfp_t gfp_mask = GFP_KERNEL;
>> +	u64 nr_bytes;
>>
>>  	if (!buf || (!memcg && !pgdat) || (memcg && pgdat))
>>  		return -EINVAL;
>> @@ -7879,7 +7885,8 @@ int user_proactive_reclaim(char *buf,
>>  	buf = strstrip(buf);
>>
>>  	old_buf = buf;
>> -	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
>> +	nr_bytes = memparse(buf, &buf);
>> +	nr_to_reclaim = nr_bytes / PAGE_SIZE;
>
> Nit: if we keep this as part of memory.reclaim, we probably want to
> choose clearer names (e.g. pages_to_reclaim and bytes_to_reclaim).

Will do.

>
>>  	if (buf == old_buf)
>>  		return -EINVAL;
>>
>> @@ -7899,11 +7906,26 @@ int user_proactive_reclaim(char *buf,
>>  		case MEMORY_RECLAIM_SWAPPINESS_MAX:
>>  			swappiness = SWAPPINESS_ANON_ONLY;
>>  			break;
>> +		case MEMORY_RECLAIM_SOURCE:
>> +			if (match_strlcpy(source, &args[0], sizeof(source)) >= sizeof(source))
>> +				return -EINVAL;
>> +			/* Only zswap is supported as a reclaim source for now. */
>> +			if (strcmp(source, "zswap"))
>> +				return -EINVAL;
>> +			zswap_writeback_only = true;
>> +			break;
>>  		default:
>>  			return -EINVAL;
>>  		}
>>  	}
>>
>> +	if (zswap_writeback_only) {
>> +		/* source=zswap and swappiness are mutually exclusive. */
>> +		if (swappiness != -1)
>> +			return -EINVAL;
>> +		return zswap_proactive_writeback(memcg, nr_bytes);
>> +	}
>> +
>>  	while (nr_reclaimed < nr_to_reclaim) {
>>  		/* Will converge on zero, but reclaim enforces a minimum */
>>  		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index ba01bf0e44e9..9cda96f05508 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1713,6 +1713,56 @@ int zswap_load(struct folio *folio)
>>  	return 0;
>>  }
>>
>> +int zswap_proactive_writeback(struct mem_cgroup *memcg, u64 bytes_to_writeback)
>> +{
>> +	struct zswap_shrink_state s = {};
>> +	struct mem_cgroup *iter = NULL;
>> +	u64 bytes_written = 0;
>> +	int ret = 0;
>> +
>> +	if (!memcg)
>> +		return -EINVAL;
>
> Can this ever happen? It would be a bug in the caller.

IIRC, writing the following to the NUMA node sysfs entry triggers this
check:

    echo "10M source=zswap" > /sys/devices/system/node/nodeN/reclaim

>
>> +	if (!mem_cgroup_zswap_writeback_enabled(memcg))
>> +		return -EINVAL;
>> +	if (!bytes_to_writeback)
>> +		return 0;
>
> Do we need this? I think the loop will just never enter and
> mem_cgroup_iter_break() will do nothing.

Right, it is not needed. Will drop it.

>
>> +
>> +	while (bytes_written < bytes_to_writeback) {
>> +		long shrunk;
>> +
>> +		cond_resched();
>> +
>> +		if (signal_pending(current)) {
>> +			ret = -EINTR;
>> +			break;
>> +		}
>> +
>> +		/*
>> +		 * Use a local iterator to walk the memcg and its online descendants
>> +		 * in a round-robin manner. Upon exiting the loop, mem_cgroup_iter_break()
>> +		 * must be called to drop the iterator reference.
>> +		 */
>> +		do {
>> +			iter = mem_cgroup_iter(memcg, iter, NULL);
>> +		} while (iter && !mem_cgroup_tryget_online(iter));
>> +
>> +		shrunk = zswap_shrink_one_memcg(iter, &s);
>> +		if (shrunk > 0)
>> +			bytes_written += shrunk;
>> +
>> +		/* drop the extra reference taken by mem_cgroup_tryget_online() */
>> +		mem_cgroup_put(iter);
>
>
> Can we just use mem_cgroup_online() instead since mem_cgroup_iter()
> already grabs a ref?
>

Will do.

Thanks,
Hao
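
P.S. As an aside, the round-robin budget loop and the second-chance
behavior discussed in this thread can be modeled in userspace roughly as
follows. This is a simplified sketch, not the patch: struct entry,
struct group, shrink_one() and writeback_budget() are hypothetical names,
each "group" stands in for one memcg with a single LRU, and the cutoff of
two idle rounds is an assumption of this sketch, not something taken from
the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one compressed entry on a zswap LRU. */
struct entry {
	uint64_t bytes;		/* compressed size */
	bool referenced;	/* second-chance bit */
	bool written;		/* already written back */
};

/* Hypothetical stand-in for one memcg with a single LRU. */
struct group {
	struct entry *lru;
	size_t nr;
};

/*
 * Write back at most one entry. Referenced entries get their bit
 * cleared and are skipped on this pass (the "second chance").
 * Returns the compressed bytes written, or 0 if nothing was written.
 */
static uint64_t shrink_one(struct group *g)
{
	for (size_t i = 0; i < g->nr; i++) {
		struct entry *e = &g->lru[i];

		if (e->written)
			continue;
		if (e->referenced) {
			e->referenced = false;
			continue;
		}
		e->written = true;
		return e->bytes;
	}
	return 0;
}

/*
 * Visit the groups round-robin until the byte budget is met. Give up
 * after two full rounds without progress (assumed cutoff) and report
 * -EAGAIN when fewer bytes than requested were written back.
 */
static int writeback_budget(struct group *groups, size_t nr_groups,
			    uint64_t budget, uint64_t *written)
{
	size_t next = 0, idle = 0;

	assert(groups && written);
	*written = 0;
	while (*written < budget && nr_groups && idle < 2 * nr_groups) {
		uint64_t got = shrink_one(&groups[next]);

		next = (next + 1) % nr_groups;
		if (got) {
			*written += got;
			idle = 0;
		} else {
			idle++;
		}
	}
	return *written >= budget ? 0 : -EAGAIN;
}
```

In this model a referenced entry is only written on the second visit, so a
request can overshoot slightly (whole entries are written) or come back
short with -EAGAIN when the LRUs run dry.
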