From: Hao Jia <jiahao.kernel@gmail.com>
To: Yosry Ahmed <yosry@kernel.org>, Nhat Pham <nphamcs@gmail.com>
Cc: akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
chengming.zhou@linux.dev, muchun.song@linux.dev,
roman.gushchin@linux.dev, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
Date: Tue, 12 May 2026 17:32:32 +0800
Message-ID: <12e4784e-2add-d849-7e54-bde8abfa6e78@gmail.com>
In-Reply-To: <CAO9r8zNOPdpJuTmccvQ6ZAVS+tXxp-_ofA765DbnfaUZOPPO-g@mail.gmail.com>
On 2026/5/12 03:57, Yosry Ahmed wrote:
> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>
>>> From: Hao Jia <jiahao1@lixiang.com>
>>>
>>> Zswap currently writes back pages to backing swap devices reactively,
>>> triggered either by memory pressure via the shrinker or by the pool
>>> reaching its size limit. This reactive approach offers no precise
>>> control over when writeback happens, which can disturb latency-sensitive
>>> workloads, and it cannot direct writeback at a specific memory cgroup.
>>> However, there are scenarios where users might want to proactively
>>> write back cold pages from zswap to the backing swap device, for
>>> example, to free up memory for other applications or to prepare for
>>> upcoming memory-intensive workloads.
>>>
>>> Therefore, implement a proactive writeback mechanism for zswap by
>>> adding a new cgroup interface file memory.zswap.proactive_writeback
>>> within the memory controller.
>>
Thanks Nhat, Yosry. Let me address both comments together.
>>
>> We already have memory.reclaim, no? Would that not work to create
>> headroom generally for your use case? Is there a reason why we are
>> treating zswap memory as special here?
>
Apologies for the lack of detail in the patch description, which led to
the confusion.

We already use memory.reclaim, but it does not fully address our
requirements.

Our deployment runs a userspace proactive reclaimer that drives
memory.reclaim based on the system's runtime state (memory/CPU/IO
pressure, refault rate, ...) and workload-specific policy. That first
stage compresses cold anon pages into zswap. Entries that then remain in
zswap past a policy-defined age threshold are considered "twice cold",
and the reclaimer wants to write them back to the backing swap device at
a moment of its own choosing, to further reclaim the DRAM still held by
the compressed data.

This is the "second-level offloading" pattern described in Meta's TMO
paper [1]. The zswap proactive writeback introduced by this series
addresses that second-level offloading stage.

[1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
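
As a rough illustration of the two stages described above, a userspace
reclaimer could drive them along the lines of the sketch below. This is
only a sketch under stated assumptions: memory.reclaim is the existing
cgroup v2 file and accepts a byte count, while the value format written
to memory.zswap.proactive_writeback (a byte count here) is a guess,
since this message does not spell it out. The helper names are
hypothetical.

```python
from pathlib import Path


def write_control(cgroup_dir: Path, name: str, value: str) -> None:
    """Write one value to a control file inside a cgroup directory."""
    (cgroup_dir / name).write_text(value)


def first_level_offload(cgroup_dir: Path, nr_bytes: int) -> None:
    # Stage 1: ask the kernel to reclaim nr_bytes from this memcg via the
    # existing memory.reclaim file; cold anon pages may end up in zswap.
    write_control(cgroup_dir, "memory.reclaim", str(nr_bytes))


def second_level_offload(cgroup_dir: Path, nr_bytes: int) -> None:
    # Stage 2: ask for writeback of zswap entries to the backing swap device.
    # ASSUMPTION: the file name comes from this series; passing a byte count
    # is a guess at its value format, not something stated in this message.
    write_control(cgroup_dir, "memory.zswap.proactive_writeback", str(nr_bytes))
```

Both calls are one-shot requests: neither leaves a limit behind in the
cgroup once the write returns. On a real cgroup, a write to
memory.reclaim can fail if the kernel cannot reclaim the requested
amount, so a real reclaimer would need to handle that error.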
> +1, why do we need to specifically proactively reclaim the compressed memory?
>
> Also, if we do need to minimize the compressed memory and force higher
> writeback rates, we can do so with memory.zswap.max, right?
Here are a few reasons why memory.zswap.max is not enough:

1. Writing memory.zswap.max does not itself trigger any writeback. In a
   memcg that has reached steady state (where the userspace reclaimer is
   no longer invoking memory.reclaim), nothing drives the
   zswap_store() -> shrink_memcg() path. So even after enough time has
   passed, lowering memory.zswap.max gives the reclaimer no good way to
   trigger second-level offloading, and it still has no control over
   when writeback happens.

2. memory.zswap.max currently triggers zswap writeback via
   zswap_store() -> shrink_memcg(), and each over-limit event can write
   back at most NR_NODES entries. If zswap residency is far above
   memory.zswap.max, converging to the target size requires at least
   O(over-limit pages / NR_NODES) zswap_store() events, with no
   batching, so writeback driven this way has significant latency.

3. memory.zswap.max is a stateful interface. If the userspace reclaimer
   crashes mid-operation, it may leave memory.zswap.max at whatever
   value it last set, putting the application in a persistently
   throttled state.

4. Once the userspace reclaimer has lowered memory.zswap.max, if the
   workload is rapidly expanding and triggers memory reclaim via
   memory.high / kswapd / etc., the amount actually written back can
   exceed what was intended.

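
To put a number on point 2, the lower bound on zswap_store() events can
be worked out as in the sketch below. It assumes, as stated above, that
one over-limit event writes back at most NR_NODES entries; the function
name and the example figures are made up for illustration.

```python
def min_store_events(over_limit_entries: int, nr_nodes: int) -> int:
    """Lower bound on over-limit zswap_store() events needed to converge.

    Assumes each event writes back at most nr_nodes entries (point 2).
    """
    if nr_nodes <= 0:
        raise ValueError("nr_nodes must be positive")
    if over_limit_entries <= 0:
        return 0
    # Ceiling division without floating point.
    return -(-over_limit_entries // nr_nodes)


# Example: 1,000,000 entries over the limit on a 2-node machine.
print(min_store_events(1_000_000, 2))
```

With those example figures, at least 500,000 separate store events are
needed, and in steady state there may be very few of them.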
Thanks,
Hao