All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	yosry.ahmed@linux.dev, chengming.zhou@linux.dev, sj@kernel.org,
	kernel-team@meta.com, linux-kernel@vger.kernel.org,
	gourry@gourry.net, ying.huang@linux.alibaba.com,
	jonathan.cameron@huawei.com, dan.j.williams@intel.com,
	linux-cxl@vger.kernel.org, minchan@kernel.org,
	senozhatsky@chromium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the the original page's node for compressed data
Date: Wed, 2 Apr 2025 15:57:41 -0400	[thread overview]
Message-ID: <20250402195741.GD198651@cmpxchg.org> (raw)
In-Reply-To: <20250402191145.2841864-1-nphamcs@gmail.com>

On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> any policy for the allocation of memory for the compressed data,
> instead just adopting the memory policy of the task entering reclaim,
> or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
> 
> 1. Systems with CXL-based memory tiering can encounter the following
>    inversion with zswap: the coldest pages demoted to the CXL tier
>    can return to the high tier when they are zswapped out, creating
>    memory pressure on the high tier.
> 
> 2. Consider a direct reclaimer scanning nodes in order of allocation
>    preference. If it ventures into remote nodes, the memory it
>    compresses there should stay there. Trying to shift those contents
>    over to the reclaiming thread's preferred node further *increases*
>    its local pressure, and provoking more spills. The remote node is
>    also the most likely to refault this data again. This undesirable
>    behavior was pointed out by Johannes Weiner in [1].
> 
> 3. For zswap writeback, the zswap entries are organized in
>    node-specific LRUs, based on the node placement of the original
>    pages, allowing for targeted zswap writeback for specific nodes.
> 
>    However, the compressed data of a zswap entry can be placed on a
>    different node from the LRU it is placed on. This means that reclaim
>    targeted at one node might not free up memory used for zswap entries
>    in that node, but instead reclaiming memory in a different node.
> 
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap pass the node of the original page to zsmalloc, and have
> zsmalloc prefer the specified node if we need to allocate new (zs)pages
> for the compressed data.
> 
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
> 
> 1. On a system with demotion enabled, we will generally prefer
>    demotions over zswapping, and only zswap when pages have
>    already gone to the lowest tier. This patch should achieve the
>    desired effect for the most part.
> 
> 2. If the preferred node is out of memory, letting the compressed data
>    going to other nodes can be better than the alternative (OOMs,
>    keeping cold memory unreclaimed, disk swapping, etc.).
> 
> 3. If the allocation go to a separate node because we have a zspage
>    with slots available, at least we're not creating extra immediate
>    memory pressure (since the space is already allocated).
> 
> 3. While there can be mixings, we generally reclaim pages in
>    same-node batches, which encourage zspage grouping that is more
>    likely to go to the right node.
> 
> 4. A strict binding would require partitioning zsmalloc by node, which
>    is more complicated, and more prone to regression, since it reduces
>    the storage density of zsmalloc. We need to evaluate the tradeoff
>    and benchmark carefully before adopting such an involved solution.
> 
> This patch does not fix zram, leaving its memory allocation behavior
> unchanged. We leave this effort to future work.

zram's zs_malloc() calls all have page context. It seems a lot easier
to just fix the bug for them as well than to have two allocation APIs
and verbose commentary?

> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int *nid)
>  {
> -	struct page *page = alloc_page(gfp);
> +	struct page *page;
> +
> +	if (nid)
> +		page = alloc_pages_node(*nid, gfp, 0);
> +	else {
> +		/*
> +		 * XXX: this is the zram path. We should consider fixing zram to also
> +		 * use alloc_pages_node() and prefer the same node as the original page.
> +		 *
> +		 * Note that alloc_pages_node(NUMA_NO_NODE, gfp, 0) is not equivalent
> +		 * to allloc_page(gfp). The former will prefer the local/closest node,
> +		 * whereas the latter will try to follow the memory policy of the current
> +		 * process.
> +		 */
> +		page = alloc_page(gfp);
> +	}
>  
>  	return page_zpdesc(page);
>  }
> @@ -461,10 +476,13 @@ static void zs_zpool_destroy(void *pool)
>  	zs_destroy_pool(pool);
>  }
>  
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> +				gfp_t gfp, const int *nid);
> +
>  static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> -			unsigned long *handle)
> +			unsigned long *handle, const int nid)
>  {
> -	*handle = zs_malloc(pool, size, gfp);
> +	*handle = zs_malloc_node(pool, size, gfp, &nid);
>  
>  	if (IS_ERR_VALUE(*handle))
>  		return PTR_ERR((void *)*handle);

>  }
>  
>  
> -/**
> - * zs_malloc - Allocate block of given size from pool.
> - * @pool: pool to allocate from
> - * @size: size of block to allocate
> - * @gfp: gfp flags when allocating object
> - *
> - * On success, handle to the allocated object is returned,
> - * otherwise an ERR_PTR().
> - * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> - */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +static unsigned long zs_malloc_node(struct zs_pool *pool, size_t size,
> +				gfp_t gfp, const int *nid)
>  {
>  	unsigned long handle;
>  	struct size_class *class;

> @@ -1397,6 +1406,21 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  
>  	return handle;
>  }
> +
> +/**
> + * zs_malloc - Allocate block of given size from pool.
> + * @pool: pool to allocate from
> + * @size: size of block to allocate
> + * @gfp: gfp flags when allocating object
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise an ERR_PTR().
> + * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> + */
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +{
> +	return zs_malloc_node(pool, size, gfp, NULL);
> +}
>  EXPORT_SYMBOL_GPL(zs_malloc);
>  
>  static void obj_free(int class_size, unsigned long obj)

  reply	other threads:[~2025-04-02 19:57 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-02 19:11 [PATCH] zswap/zsmalloc: prefer the the original page's node for compressed data Nhat Pham
2025-04-02 19:57 ` Johannes Weiner [this message]
2025-04-02 20:09   ` Nhat Pham
2025-04-02 20:24     ` Johannes Weiner
2025-04-02 20:25       ` Nhat Pham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250402195741.GD198651@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=dan.j.williams@intel.com \
    --cc=gourry@gourry.net \
    --cc=jonathan.cameron@huawei.com \
    --cc=kernel-team@meta.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan@kernel.org \
    --cc=nphamcs@gmail.com \
    --cc=senozhatsky@chromium.org \
    --cc=sj@kernel.org \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yosry.ahmed@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.