Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/5] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter
@ 2026-06-23 18:01 Joshua Hahn
  2026-06-23 18:01 ` [PATCH v4 1/5] mm/page_counter: introduce per-page_counter stock Joshua Hahn
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team

This series is intended for the next release cycle.

v3 --> v4
=========
- Reduced memory footprint by 4x, from 16 bytes per-(cpu x memcg) to
  4 bytes per-(cpu x memcg). Each page_counter_stock is a thin wrapper
  around an atomic_t.
- Removed locking completely and uses atomic operations to use stock.
- Removed synchronous work_on_cpu. All work is done via remote
  atomic_xchgs.
- Added a patch to flatten page_counter charging in try_charge_memcg
- Split page_counter_try_charge into stocked and non-stocked variants.

INTRO
=====
Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small and frequent allocations to avoid walking
the expensive mem_cgroup hierarchy traversal each time. This fastpath
offers real improvements, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
   more than 7 mem_cgroups have stock present on a single CPU, a random
   victim is evicted and its associated stock is drained.

2. When one cgroup runs out of memory and needs to drain stock across
   all CPUs it has stock cached in, those CPUs will drain all other
   memcgs' stock present in that CPU. This leads to inefficient stock
   caching and cross-memcg interference under memory pressure.

3. Stock management is tightly coupled to struct mem_cgroup, which makes
   it difficult to add a new page_counter to mem_cgroup and have
   multiple sources of stock management.

This series moves the per-cpu stock down into page_counter which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge_stock. This eliminates the 7 memcg-per-cpu slot
limit, the random cross-memcg stock drains, and slot traversal. We also
simplify memcontrol code, since we no longer need to maintain separate
draining functions or manage the asynchronous workqueue.

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

There are a few tradeoffs, however.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_cpus x nr_cgroups x 64 pages. On large machines
with many cgroups, this could be significant.

There are 4 qualifying points:
1. Larger machines should be able to tolerate the additional overhead.
2. Stock should not remain stale as long as the cgroups are actively
   charging memory.
3. Getting anywhere close to this overhead is difficult and rare. It
   would require processes to bounce across CPUs and refill stock.
4. These charges are not "real" allocated memory, but rather accounting
   done in memcg; they are easily returned on pressure.

Secondly, we introduce a small memory footprint.
The new struct page_counter_stock is a wrapper around an atomic_t,
which adds 4 bytes of overhead per-(cpu x memcg). On a 1024-CPU,
1024-memcg system, this adds 4MB of overhead. Smaller machines will
see much smaller overhead.

One small side effect for cgroupv1: this series decouples swap for
the memory and memsw page_counters. Since stock charging can go out of
sync, this means that users can transiently see memsw usage go below
memory usage.

Finally, by moving from asynchronous workqueue scheduling for draining
to synchronous atomic_xchg, drain_all_stock holds the
percpu_charge_mutex longer while it performs the work. This means that
chargers may be more likely to be unable to grab the mutex lock and
exhaust MAX_RECLAIM_RETRIES and OOM, in theory. In practice, I have not
been able to replicate this behavior in my experiments.

The series was built on top of latest akpm/mm-new as of Jun 23 2026,
which is cdad4d4e4fc2e "mm/swap, PM: hibernate: atomically replace
hibernation pin".

TESTING
=======
I tried to demonstrate the worst-case overhead this series introduces
by writing a microbenchmark that pins multiple cgroup jobs to a single
CPU and repeatedly faulting and releasing anon pages using
madv(MADV_DONTNEED) in each cgroup. The data was collected over
30 trials of 15 iterations. Metric here is time each iteration took (ms)

+----------+--------+-------+-----------+--------------+
| #cgroups | before | after | delta (%) | variance (%) |
+----------+--------+-------+-----------+--------------+
|        1 |    112 |   112 |     0.000 |          1.1 |
|        4 |    443 |   451 |    +1.806 |          1.1 |
|       32 |   3512 |  3584 |    +2.051 |          2.0 |
+----------+--------+-------+-----------+--------------+

It appears as though there is some small regression, although the
magnitude is similar to the coefficient of variation (stddev / mean).

CHANGELOG
=========
v2 --> v3:
- Dropped the cgroup v2 optimization, since it could indeed lead to too
  much time held with the cgroup_mutex. Instead we let the stock
  accumulate in the parent cgroups, which is not so bad; charges can
  still land on these cgroups, and if we ever reach the mem_cgroup
  limit, we can easily return those charges.
- page_counter_disable_stock no longer drains, just prevents
  accumulating stock. The actual draining is done in the free_stock
  variant, where we know for sure there are no in-flight charges.
- Reordering the page_counter_disable_stock path to disable before
  draining as to prevent accumulating stock first.
- Skip isolated CPUs when draining synchronously
- Rebase on newest mm-new
- Wordsmithing

v1 --> v2:
- Dropped stock returning on uncharge to preserve same behavior as memcg
  stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
  in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
  memcgs, to reduce the number of synchronous CPU work scheduling
- Optimized cgroup v2 further to drain only on the first child and skip
  the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@gmail.com/

Joshua Hahn (5):
  mm/page_counter: introduce per-page_counter stock
  mm/memcontrol: flatten try_charge_memcg control flow
  mm/page_counter: introduce page_counter_try_charge_stock()
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  19 +++
 mm/memcontrol.c              | 280 ++++++-----------------------------
 mm/page_counter.c            |  90 +++++++++++
 3 files changed, 155 insertions(+), 234 deletions(-)

-- 
2.53.0-Meta



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-06-24 16:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-23 18:01 [PATCH v4 0/5] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter Joshua Hahn
2026-06-23 18:01 ` [PATCH v4 1/5] mm/page_counter: introduce per-page_counter stock Joshua Hahn
2026-06-23 18:01 ` [PATCH v4 2/5] mm/memcontrol: flatten try_charge_memcg control flow Joshua Hahn
2026-06-23 18:01 ` [PATCH v4 3/5] mm/page_counter: introduce page_counter_try_charge_stock() Joshua Hahn
2026-06-23 18:01 ` [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock Joshua Hahn
2026-06-24 14:43   ` Usama Arif
2026-06-24 15:23     ` Joshua Hahn
2026-06-24 16:43       ` Usama Arif
2026-06-23 18:01 ` [PATCH v4 5/5] mm/memcontrol: remove unused memcg_stock code Joshua Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox