public inbox for linux-mm@kvack.org
* [PATCH 00/11] mm/zswap, zsmalloc: Per-memcg-lruvec zswap accounting
@ 2026-03-11 19:51 Joshua Hahn
From: Joshua Hahn @ 2026-03-11 19:51 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky
  Cc: Johannes Weiner, Yosry Ahmed, Nhat Pham,
	Chengming Zhou, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Harry Yoo, Andrew Morton, cgroups, linux-mm,
	linux-kernel, kernel-team

INTRODUCTION
============
The current design for zswap and zsmalloc leaves a clean divide between
layers of the memory stack. At the higher level, we have zswap, which
interacts directly with memory consumers, compression algorithms, and
handles memory usage accounting via memcg limits. At the lower level,
we have zsmalloc, which handles the page allocation and migration of
physical pages.

While this logical separation simplifies the codebase, it creates
problems for accounting that requires both memory-cgroup awareness
and knowledge of physical memory location. To name a few:

 - On tiered systems, it is impossible to understand how much toptier
   memory a cgroup is using, since zswap has no understanding of where
   the compressed memory is physically stored.
   + With SeongJae Park's work to store incompressible pages as-is in
     zswap [1], the size of compressed memory can become non-trivial,
     and easily consume a meaningful portion of memory.

 - cgroups that restrict memory nodes have no control over which nodes
   their zswapped objects live on. This can lead to unexpectedly high
   fault times for workloads, which must pay the remote-access latency
   cost of retrieving compressed objects from a remote node.
   + Nhat Pham addressed this issue via a best-effort attempt to place
     compressed objects in the same page as the original page, but this
     cannot guarantee complete isolation [2].

 - On the flip side, zsmalloc's ignorance of cgroups also makes its
   shrinker memcg-unaware, which can lead to ineffective reclaim when
   pressure is localized to a single cgroup.

Until recently, zpool acted as another layer of indirection between
zswap and zsmalloc, which made bridging memcg and physical location
difficult. Now that zsmalloc is the only allocator backend for zswap and
zram [3], it is possible to move memory-cgroup accounting to the
zsmalloc layer.

Introduce a new per-zspage array of objcg pointers to track
per-memcg-lruvec memory usage by zswap, while leaving zram users
mostly unaffected.
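As a rough sketch of the idea (simplified stand-in structures and
hypothetical names, not the actual patch code), the per-zspage array
pairs each object slot with the obj_cgroup it is charged to:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct obj_cgroup { int id; };

struct zspage {
	unsigned int objs_per_zspage;
	struct obj_cgroup **objcgs;  /* one slot per object, NULL if uncharged */
};

/* Allocate the objcgs array lazily, only for memcg-aware pools. */
static int zspage_init_objcgs(struct zspage *zspage)
{
	zspage->objcgs = calloc(zspage->objs_per_zspage,
				sizeof(struct obj_cgroup *));
	return zspage->objcgs ? 0 : -1;
}

/* Record which objcg owns the object at obj_idx. */
static void zspage_set_objcg(struct zspage *zspage, unsigned int obj_idx,
			     struct obj_cgroup *objcg)
{
	zspage->objcgs[obj_idx] = objcg;
}
```

Since zram never passes an objcg down, its zspages simply never allocate
the array, which is how zram users stay mostly unaffected.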

In addition, move the accounting of memcg charges from the consumer
layer (zswap, zram) to the zsmalloc layer. Stat indices are
parameterized at pool creation time, meaning future consumers that wish
to account memory statistics can do so using the compressed object
memory accounting infrastructure introduced here.
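The parameterization can be sketched roughly like this (the enum values,
field names, and helpers below are illustrative assumptions, not the
series' actual interface): a consumer names its stat counters at pool
creation, and a consumer that passes no counters opts out of memcg
awareness entirely.

```c
#include <assert.h>

/* Hypothetical stat indices, standing in for the kernel's real
 * node_stat_item / memcg_stat_item values. */
enum stat_item { STAT_NONE = -1, STAT_ZSWAP_B = 0, STAT_ZSWAPPED_B = 1 };

/* Simplified sketch of a pool that records, at creation time, which
 * stat counters its consumer wants accounted. */
struct zs_pool {
	const char *name;
	int usage_stat;    /* bytes of backing memory, e.g. STAT_ZSWAP_B */
	int stored_stat;   /* bytes of stored objects, e.g. STAT_ZSWAPPED_B */
};

static void zs_pool_set_stats(struct zs_pool *pool, int usage, int stored)
{
	pool->usage_stat = usage;
	pool->stored_stat = stored;
}

/* A zram-style consumer simply opts out by passing STAT_NONE. */
static int zs_pool_memcg_aware(const struct zs_pool *pool)
{
	return pool->usage_stat != STAT_NONE;
}
```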

PERFORMANCE
===========
The experiments were performed across 5 trials on a 2-NUMA machine.

Experiment 1:
Node-bound workload, churning memory by allocating 2GB in a 1GB cgroup.
0.638% regression, standard deviation: +/- 0.603%

Experiment 2:
Writeback with zswap pressure
0.295% gain, standard deviation: +/- 0.456%

Experiment 3:
1 cgroup, 2 workloads each bound to a NUMA node.
2.126% regression, standard deviation: +/- 3.008%

Experiment 4:
Reading memory.stat 10000x
1.464% gain, standard deviation: +/- 2.239%

Experiment 5:
Reading memory.numa_stat 10000x
0.281% gain, standard deviation: +/- 1.878%

All of the gains and regressions fall within the standard deviation,
so they are likely noise. One caveat: workloads that span NUMA nodes
may see some contention, as the zsmalloc migration path becomes more
expensive.

PATCH OUTLINE
=============
Patches 1 and 2 are small cleanups that make the codebase consistent and
easier to digest.

Patch 3 introduces memcg accounting-awareness to struct zs_pool, and
allows consumers to provide the memcg stat item indices that should be
accounted. The awareness is not functional at this point.

Patches 4, 5, and 6 allocate and populate the new zspage->objcgs field
with compressed objects' obj_cgroups. zswap_entry->objcg is removed
and lookups are redirected to the zspage for memcg information.

Patch 7 moves the charging and lifetime management of obj_cgroups to the
zsmalloc layer, which leaves zswap as only a plumbing layer that hands
cgroup information to zsmalloc at compression time.
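The resulting division of labor can be sketched as follows (a simplified
model with hypothetical function names, not the patch itself): the
consumer only hands the objcg down, and zsmalloc charges at allocation
and uncharges at free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal model of an obj_cgroup and its charge API. */
struct obj_cgroup { long charged; };

static int obj_cgroup_charge(struct obj_cgroup *objcg, size_t size)
{
	objcg->charged += size;   /* may sleep in the real kernel */
	return 0;
}

static void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
{
	objcg->charged -= size;
}

/* zsmalloc-side allocation: charge before handing the object out. */
static int zs_malloc_charged(struct obj_cgroup *objcg, size_t size)
{
	if (objcg && obj_cgroup_charge(objcg, size))
		return -1;
	/* ... actual object allocation, objcg recorded in the zspage ... */
	return 0;
}

/* zsmalloc-side free: drop the charge after the object is gone. */
static void zs_free_charged(struct obj_cgroup *objcg, size_t size)
{
	/* ... actual object free ... */
	if (objcg)
		obj_cgroup_uncharge(objcg, size);
}
```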

Patches 8 and 9 introduce node counters and memcg-lruvec counters for
zswap.

Patches 10 and 11 handle charge migrations for the two types of compressed
object migration in zsmalloc. Special care is taken for compressed
objects that span multiple nodes.

CHANGELOG V1 [4] --> V2
=======================
A lot has changed between v1 and v2, thanks to the generous suggestions
from reviewers.
- Harry Yoo's suggestion to make the objcgs array per-zspage instead of
  per-zpdesc simplified much of the code needed to handle boundary
  cases, since most of the index translation (from per-zspage to
  per-zpdesc) goes away. The reverse translation (per-zpdesc to
  per-zspage) is harder now, but that only really matters during the
  charge migration case in patch 10. Thank you Harry!
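  For the curious, the translation trade-off can be sketched in a few
  lines (PAGE_SIZE and helper names here are illustrative assumptions):
  with a per-zspage array, an object is addressed directly by obj_idx,
  while going back to a zpdesc needs a division over the object's byte
  offset.

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Hypothetical helpers: per-zspage objcgs are indexed by obj_idx
 * directly; finding which zpdesc an object starts in (needed only on
 * the charge migration path) requires computing its byte offset. */
static unsigned int obj_offset(unsigned int obj_idx, unsigned int class_size)
{
	return obj_idx * class_size;
}

static unsigned int obj_start_zpdesc(unsigned int obj_idx,
				     unsigned int class_size)
{
	return obj_offset(obj_idx, class_size) / PAGE_SIZE;
}
```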

- Yosry Ahmed's suggestion to make memcg awareness a per-zspool decision
  has simplified much of the #ifdef casing needed, which makes the code
  a lot easier to follow (and makes changes less invasive for zram).

- Yosry Ahmed's suggestion to parameterize the memcg stat indices as
  zs_pool parameter makes the awkward hardcoding of zswap stat indices
  in zsmalloc code more natural & leaves room for future consumers to
  follow. Thank you Yosry!

- Shakeel Butt's suggestion to turn the objcgs array from an unsigned
  long into an obj_cgroup ** pointer made the code much cleaner.
  However, after moving the pointer from zpdesc to zspage, there is no
  longer a need to tag the pointer. Thank you, Shakeel!

- v1 only handled the migration case for single compressed objects.
  Patch 10 in v2 is written to handle the migration case for zpdesc
  replacement.
  + Special-casing compressed objects living at the boundary is a bit
    harder with per-zspage objcgs. I felt that this difficulty was
    outweighed by the simplification in the "typical" write/free case,
    though.

REVIEWERS NOTE
==============
Patches 10 and 11 are a bit hairy, since they have to deal with
special-case scenarios for objects that span pages. I originally
implemented a very simple approach using the existing zs_charge_objcg
functions, but later realized that these migration paths take spin
locks and therefore cannot tolerate obj_cgroup_charge() sleeping.

The workaround is less elegant, but gets the job done. Feedback on these
two commits would be greatly appreciated!
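For context, the shape of the constraint can be illustrated with a
generic pattern (this is an illustration of the locking rule, not
necessarily what the series does): any sleeping charge must be resolved
outside the spinlocked section, leaving only non-sleeping bookkeeping
under the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Generic illustration: obj_cgroup_charge() may sleep, so charging must
 * happen before the migration spinlock is taken. */
static bool in_atomic_section;

static int might_sleep_charge(long *counter, long size)
{
	assert(!in_atomic_section);  /* sleeping while atomic is a bug */
	*counter += size;
	return 0;
}

static void migrate_charge(long *src, long *dst, long size)
{
	/* Charge the destination up front, while we may still sleep. */
	might_sleep_charge(dst, size);

	in_atomic_section = true;    /* spin_lock(...) in real code */
	*src -= size;                /* only non-sleeping work here */
	in_atomic_section = false;   /* spin_unlock(...) */
}
```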

[1] https://lore.kernel.org/linux-mm/20250822190817.49287-1-sj@kernel.org/
[2] https://lore.kernel.org/linux-mm/20250402204416.3435994-1-nphamcs@gmail.com/#t3
[3] https://lore.kernel.org/linux-mm/20250829162212.208258-1-hannes@cmpxchg.org/
[4] https://lore.kernel.org/all/20260226192936.3190275-1-joshua.hahnjy@gmail.com/

Joshua Hahn (11):
  mm/zsmalloc: Rename zs_object_copy to zs_obj_copy
  mm/zsmalloc: Make all obj_idx unsigned ints
  mm/zsmalloc: Introduce conditional memcg awareness to zs_pool
  mm/zsmalloc: Introduce objcgs pointer in struct zspage
  mm/zsmalloc: Store obj_cgroup pointer in zspage
  mm/zsmalloc, zswap: Redirect zswap_entry->objcg to zspage
  mm/zsmalloc, zswap: Handle objcg charging and lifetime in zsmalloc
  mm/memcontrol: Track MEMCG_ZSWAPPED in bytes
  mm/vmstat, memcontrol: Track ZSWAP_B, ZSWAPPED_B per-memcg-lruvec
  mm/zsmalloc: Handle single object charge migration in migrate_zspage
  mm/zsmalloc: Handle charge migration in zpdesc substitution

 drivers/block/zram/zram_drv.c |  10 +-
 include/linux/memcontrol.h    |  20 +-
 include/linux/mmzone.h        |   2 +
 include/linux/zsmalloc.h      |   9 +-
 mm/memcontrol.c               |  75 ++-----
 mm/vmstat.c                   |   2 +
 mm/zsmalloc.c                 | 381 ++++++++++++++++++++++++++++++++--
 mm/zswap.c                    |  66 +++---
 8 files changed, 431 insertions(+), 134 deletions(-)

-- 
2.52.0


