From: Hao Jia <jiahao.kernel@gmail.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry@kernel.org>,
	akpm@linux-foundation.org, tj@kernel.org, hannes@cmpxchg.org,
	shakeel.butt@linux.dev, mhocko@kernel.org, mkoutny@suse.com,
	chengming.zhou@linux.dev, muchun.song@linux.dev,
	roman.gushchin@linux.dev, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, Hao Jia <jiahao1@lixiang.com>,
	Alexandre Ghiti <alex@ghiti.fr>
Subject: Re: [PATCH 2/3] mm/zswap: Implement proactive writeback
Date: Thu, 14 May 2026 16:15:38 +0800	[thread overview]
Message-ID: <6c531b1a-ab35-e5a3-b9ca-40a639cca55f@gmail.com> (raw)
In-Reply-To: <CAKEwX=M=6AQVYA7ROM0YOP7irpxbdMrEOAHKGKYo0Qgr+-uhSw@mail.gmail.com>



On 2026/5/14 05:09, Nhat Pham wrote:
> On Wed, May 13, 2026 at 1:04 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>
>>
>>
>> On 2026/5/12 23:47, Nhat Pham wrote:
>>> On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2026/5/12 03:57, Yosry Ahmed wrote:
>>>>> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@gmail.com> wrote:
>>>>>>
>>>>>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>>>>>>
>>>>>>> From: Hao Jia <jiahao1@lixiang.com>
>>>>>>>
>>>>>>> Zswap currently writes back pages to backing swap devices reactively,
>>>>>>> triggered either by memory pressure via the shrinker or by the pool
>>>>>>> reaching its size limit. This reactive approach offers no precise
>>>>>>> control over when writeback happens, which can disturb latency-sensitive
>>>>>>> workloads, and it cannot direct writeback at a specific memory cgroup.
>>>>>>> However, there are scenarios where users might want to proactively
>>>>>>> write back cold pages from zswap to the backing swap device, for
>>>>>>> example, to free up memory for other applications or to prepare for
>>>>>>> upcoming memory-intensive workloads.
>>>>>>>
>>>>>>> Therefore, implement a proactive writeback mechanism for zswap by
>>>>>>> adding a new cgroup interface file memory.zswap.proactive_writeback
>>>>>>> within the memory controller.
>>>>>>
>>>>
>>>> Thanks Nhat, Yosry — let me address both comments together.
>>>>
>>>>>>
>>>>>> We already have memory.reclaim, no? Would that not work to create
>>>>>> headroom generally for your use case? Is there a reason why we are
>>>>>> treating zswap memory as special here?
>>>>>
>>>>
>>>> Apologies for the lack of detailed explanation in the patch description,
>>>> which led to the confusion.
>>>>
>>>> While we are already utilizing memory.reclaim, it does not fully address
>>>> our requirements.
>>>>
>>>> Our deployment runs a userspace proactive reclaimer that drives
>>>> memory.reclaim based on the system's runtime state (memory/CPU/IO
>>>> pressure, refault rate, ...) and workload-specific
>>>> policy. That first stage compresses cold anon pages into zswap. Entries
>>>> that then remain in zswap past a policy-defined age threshold are
>>>> considered "twice cold", and the reclaimer wants
>>>> to write them back to the backing swap device at a moment of its own
>>>> choosing, to further reclaim the DRAM still held by the compressed data.
>>>>
>>>> This is the "second-level offloading" pattern described in Meta's TMO
>>>> paper [1]. zswap proactive writeback is what this series introduces to
>>>> address that second-level offloading stage.
>>>>
>>>> [1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
>>>
>>> Yeah that's what we've been trying to work on as well :) We are
>>> working on a couple of improvements to the mechanism side of this path
>>> (cc Alex) - hopefully it will help your use case too!
>>>
>>> Anyway, back to my original inquiry: I understand your use case. It's
>>> pretty similar to our goal. What I'm not getting is why is
>>> memory.reclaim (which you already use) not sufficient for zswap ->
>>> disk swap offloading too?
>>>
>>> Zswap objects are organized into an LRU and exposed to the shrinker
>>> interface. Echoing to memory.reclaim should also offload some zswap
>>> entries, correct? Are there still cold zswap entries that escape this,
>>> somehow?
>>>
>>
>> Yes, the memory.reclaim path does drive some zswap writeback, but
>> it is not enough for our case.
>>
>> 1. For a memcg that has reached steady state (a common case being
>> when memory.current is below the policy target), the userspace
>> reclaimer may not invoke memory.reclaim on it for a long time,
>> and so no second-level offloading happens through
>> memory.reclaim. In this state we want
>> memory.zswap.proactive_writeback to write back entries that
>> have sat in zswap past an age threshold, to further reclaim
>> the DRAM still held by the compressed data.
>>
>> 2. Even when memory.reclaim is running, the fraction of zswap
>> residency that ends up reaching the backing swap device is
>> still very small for many of our workloads, and the userspace
>> reclaimer has no way to participate in or control the
>> granularity of zswap writeback. So in our deployment we prefer
>> to leave the zswap shrinker disabled, decouple LRU -> zswap
>> from zswap -> swap, and use a dedicated proactive-writeback
>> interface that lifts the writeback policy into userspace where
>> it can evolve independently of the kernel.
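For context, the first-level stage described above is driven through the
standard memory.reclaim interface; a minimal sketch (the cgroup path and
the amount are illustrative, not from this series):

```shell
# First-level offloading: ask the kernel to reclaim up to 128 MiB from
# this cgroup. Cold anonymous pages are compressed into zswap; only what
# the zswap shrinker (or the pool limit) picks then reaches the backing
# swap device.
echo "128M" > /sys/fs/cgroup/workload/memory.reclaim
```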
> 
> I see. It's interesting - we've been dealing with the opposite
> problem (reclaiming too much from zswap), so it's refreshing to see
> the other end of the spectrum :) We should invest more into this to
> see why we are not reclaiming enough, but I see the value of adding a
> knob to hit zswap exclusively.
> 
> Regarding age-based reclaim, I agree with Yosry here. Let us try to
> land an interface to do targeted reclaim on compressed memory first. I
> do see the value of age information: with it, you can track zswap
> entry ages and the distribution of refault ages, and only reclaim
> the tail. However, I wonder if you can just build a system that adapts
> the reclaim request size based on PSI, refault rate, etc., similar to
> how you're adjusting memory.reclaim on uncompressed memory with a
> senpai-like system. Something along the lines of: if we are swapping
> in too much from disk (or if IO pressure is high), back off; if not,
> steal a bit more from the zswap pool (perhaps with a bigger step
> size), etc. Is there a reason why zswap cannot adopt a similar
> strategy?

I'm not sure, as we haven't tested tuning proactive zswap writeback
without using age. As you pointed out, age provides a deterministic
target that allows the userspace reclaimer to converge faster in a
closed loop, which helps avoid performance jitter.

That said, using age as a zswap writeback parameter indeed warrants
further independent discussion, so I'll remove the age-related parts in v2.
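To make the intended second-stage usage concrete, a sketch of how our
reclaimer would drive the new knob (the file name comes from this
series; the assumption that it accepts a byte count like memory.reclaim
is mine, and the final value semantics may differ):

```shell
# Second-level offloading: proactively write back cold zswap entries of
# this cgroup to the backing swap device, independently of any
# memory.reclaim invocation.
# (Hypothetical value semantics: a byte count, as with memory.reclaim.)
echo "64M" > /sys/fs/cgroup/workload/memory.zswap.proactive_writeback
```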

Thanks,
Hao


Thread overview: 21+ messages
2026-05-11 10:51 [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback Hao Jia
2026-05-11 10:51 ` [PATCH 1/3] mm/zswap: Make shrink_worker writeback cursor per-memcg Hao Jia
2026-05-11 10:51 ` [PATCH 2/3] mm/zswap: Implement proactive writeback Hao Jia
2026-05-11 19:49   ` Nhat Pham
2026-05-11 19:57     ` Yosry Ahmed
2026-05-12  9:32       ` Hao Jia
2026-05-12 15:47         ` Nhat Pham
2026-05-13  8:04           ` Hao Jia
2026-05-13 18:54             ` Yosry Ahmed
2026-05-13 20:53               ` Nhat Pham
2026-05-14  8:13                 ` Hao Jia
2026-05-13 21:09             ` Nhat Pham
2026-05-14  8:15               ` Hao Jia [this message]
2026-05-11 19:54   ` Nhat Pham
2026-05-12  9:37     ` Hao Jia
2026-05-11 10:51 ` [PATCH 3/3] mm/zswap: Add per-memcg stat for " Hao Jia
2026-05-13 21:21   ` Nhat Pham
2026-05-14  8:21     ` Hao Jia
2026-05-11 11:39 ` [PATCH 0/3] mm/zswap: Implement per-cgroup " Michal Koutný
2026-05-12 11:23   ` Hao Jia
2026-05-11 19:53 ` Nhat Pham
