* [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
@ 2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost @ 2026-05-01 22:28 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara
Cc: cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi,
Alireza Haghdoost
Add two cgroup v2 memory-controller knobs that bring
balance_dirty_pages() throttling into per-cgroup scope so one noisy
writer cannot stall peers sharing the same host:
memory.dirty_ratio Per-cgroup dirty-page ceiling, 0..100 percent of
the cgroup's dirtyable memory. 0 (default) leaves
the cgroup subject to the global threshold only.
memory.dirty_min Guaranteed dirty-page floor, byte value (default 0).
The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue; dirty_min guarantees a floor below which it is
never throttled.
Motivation, design trade-offs, cost analysis, validation data, and
open questions are in the cover letter.
Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
Assisted-by: Cursor:claude-sonnet-4.5
---
Hi all,
This RFC adds two cgroup v2 memory-controller knobs that give operators
per-cgroup control over dirty-page throttling in balance_dirty_pages():
memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
once we have validated the application site (see "Follow-ups" below).
We are posting this as an RFC, as a single squashed patch, to get design
feedback before splitting the prototype into a per-logical-change series.
Motivation
==========
balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
once the host-wide dirty count crosses a single threshold. On a
container host that threshold is shared across cgroups. A cgroup
that dirties pages faster than storage can drain them pushes the
count over the limit. Every writer on the host then parks in
io_schedule_timeout() -- including writers in cgroups that have not
dirtied a single page of their own.
cgroup v2 already has per-memcg dirty accounting, but that accounting
does not translate into per-memcg dirty throttling.
We see this in production: a buffered write-heavy container generates
multi-second stalls for co-located latency-sensitive workloads.
Moreover, dirty-page accumulation from a single noisy neighbor is a
recurring contributor to mount-responsiveness degradation on shared hosts.
Prior work
==========
Per-memcg dirty-page limits have been proposed before. Andrea Righi
posted an initial RFC in February 2010 [1]; Greg Thelen continued the
work through v9 in August 2011 [2]. That series added per-memcg dirty
counters and hooked them into balance_dirty_pages(), but it bolted
per-cgroup limits onto the global writeback path without making
writeback itself cgroup-aware. Without cgroup-aware flusher threads,
a cgroup exceeding its limit triggered writeback of inodes from any
cgroup, giving poor isolation. The series was not merged.
Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
dirty-set controller" in January 2015 [4], which proposed
memory.dirty_ratio (the same interface name this series uses) via an
inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
and rejected it as a "dead end" that duplicated lower-layer policy
without solving the underlying isolation problem; this rejection
directly motivated Tejun's native cgwb rework described below.
Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
(commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
different approach of restructuring writeback to be natively
cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
threads [3]. That work deliberately deferred user-facing policy knobs.
This series adds the policy surface that consumes Tejun's
infrastructure. The dirty_min reservation concept is, to our
knowledge, new.
A November 2023 LKML thread by Chengming Zhou [5] independently
identified the identical throttling regression on cgroup v2 (a 5 GB
container constantly throttled because memory.max * dirty_ratio yields
too small a threshold for bursty workloads). Jan Kara participated and
endorsed a bytes-based per-memcg dirty limit; no patches followed that
discussion, confirming the gap this series fills.
[1] https://lwn.net/Articles/408349/
[2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
[3] https://lwn.net/Articles/648292/
[4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
[5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/
Proposed interface
==================
Two new cgroup v2 files under the memory controller, absent on the root
cgroup (CFTYPE_NOT_ON_ROOT):
memory.dirty_ratio
Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
the cgroup's dirtyable memory (mdtc->avail: file cache plus
reclaimable slack), the same base the global vm_dirty_ratio scales
against. 0 (the default) disables the per-cgroup ceiling and leaves
the cgroup subject to the global threshold only. A non-zero value
that is stricter than vm_dirty_ratio overrides the global ratio for
this cgroup via min(mdtc->thresh, cg_thresh); because both sides
scale off the same base, the knob can never widen the cgroup past
the global ceiling. A memory.dirty_bytes companion for byte-precise
caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
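A worked example with illustrative numbers: for a cgroup with 2 GB of
dirtyable memory and a global vm_dirty_ratio of 20, the global-derived
memcg threshold is 20% of 2 GB ~= 410 MB. memory.dirty_ratio=5 gives
cg_thresh = 5% of 2 GB ~= 102 MB, and min(410 MB, 102 MB) = 102 MB
becomes this cgroup's ceiling; memory.dirty_ratio=30 gives ~614 MB,
which the min() discards, leaving the cgroup at the global-derived
410 MB.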
The prototype reads the value for the immediate memcg only;
hierarchical enforcement (clamping against ancestors, like
memory.max) is not implemented yet. We would like guidance on
whether this is required for v1 or can follow in a subsequent
series.
memory.dirty_min
Byte value (K/M/G suffixes accepted), default 0 (no reservation).
Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
throttling is bypassed (goto free_running). Lets a latency-sensitive
cgroup buffer a small write burst even under global dirty pressure.
dirty_min is an admission guarantee, so we have to prevent it from
breaking the global dirty invariant. Two aspects:
- Global cap. The sum of dirty_min reservations across all cgroups
must not exceed a fraction of the global dirty threshold (our
working number is 80%), so the system always retains some shared
capacity. The prototype does not enforce this cap yet; we expect
to either reject at write() time or clamp on read in a cheap
precomputed effective_dirty_min. We would appreciate feedback on
which approach the cgroup maintainers prefer.
- Per-cgroup cap. A cgroup should not be able to reserve more dirty
capacity than it can hold. Our tentative rule is
effective_dirty_min = min(dirty_min, memory.max - memory.min),
evaluated at enforcement time so it tracks live memory.max changes,
rather than being rejected at write time. This is similar to how
memory.low composes with memory.max.
Neither cap is implemented in the prototype; both would land before
a non-RFC posting.
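To make the intended cap semantics concrete, here is a rough sketch of
the enforcement-time clamp we have in mind (hypothetical helper, not
part of this patch; the name effective_dirty_min() and the exact field
reads are assumptions, and all values are in pages):

	static unsigned long effective_dirty_min(struct mem_cgroup *memcg)
	{
		unsigned long want = READ_ONCE(memcg->dirty_min);
		unsigned long memmax = READ_ONCE(memcg->memory.max);
		unsigned long memmin = READ_ONCE(memcg->memory.min);

		if (!want)
			return 0;

		/* Per-cgroup cap: cannot reserve more than the cgroup can hold. */
		if (memmax != PAGE_COUNTER_MAX && memmax > memmin)
			want = min(want, memmax - memmin);

		/*
		 * The global "sum of reservations <= ~80% of the global
		 * dirty threshold" cap is left out of this sketch because
		 * its mechanism (reject at write vs. clamp on read) is
		 * still an open question above.
		 */
		return want;
	}

The result would be consumed where the prototype currently reads
memcg->dirty_min directly.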
The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue; dirty_min guarantees a floor below which it is never
throttled.
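As a concrete, purely illustrative configuration: an operator
protecting a latency-sensitive cgroup would leave its
memory.dirty_ratio at 0 and write a small byte value such as 16M to
its memory.dirty_min, while a write-heavy batch sibling could be
capped with, say, memory.dirty_ratio=10 and dirty_min left at 0. The
reproducer below exercises the dirty_min half of this split.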
Test setup and results
======================
To show the problem and the fix, we built a single reproducer that runs
on an unmodified stock kernel and then on the patched kernel, using the
same setup for both.
Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
(bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
(freerun = 24 MB). Files pre-allocated with fallocate before dirty
pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
fills global dirty to the 32 MB cap, then victim runs contended for
30 s.
- noisy: single fio job, unlimited write rate, fills global dirty pool.
- victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
4 KiB sequential write().
The high freerun ceiling (24 MB, i.e. (dirty_bytes +
dirty_background_bytes) / 2) ensures the victim's solo dirty
accumulation (500 KB/s dirtied minus the 256 KB/s drain = 244 KB/s,
times 30 s = 7.3 MB) stays below the threshold. BDP does not fire
during the solo phase on either kernel, giving identical baselines.
Stock kernel results (the problem):
              solo (no contention)   contended   inflation
victim IOPS   125                    5.1         24.5x worse
victim p99    0.6 ms                 152 ms      253x worse
The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
sleep. The victim has near-zero dirty pages of its own but is forced to
sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
queued in NR_WRITEBACK waiting for the throttled disk.
memory.events.local on both cgroups shows max 0 throughout the run;
this is not memory pressure inside either cgroup.
Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:
              solo (no contention)   contended   inflation
victim IOPS   125                    125         1.0x
victim p99    0.9 ms                 0.7 ms      1.0x
The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
computing any sleep. Because the rate-limited victim's resident dirty
set stays well below the reservation, the check fires on every write()
-> goto free_running -> write() returns in ~7 us. The victim is fully
protected.
The figures above are per-kernel medians of N=5 iterations and reflect
a deterministic outcome on every iteration: stock saw contended IOPS of
4.4..6.1 (retention 0.035..0.049) on all five iterations; patched saw
contended IOPS of 125.0 (retention 1.000) on all five. Every stock
iteration hit BDP's throttled regime; every patched iteration bypassed
it via the dirty_min check.
Implementation
==============
The patch touches five files:
- include/linux/memcontrol.h: two new fields on struct mem_cgroup
(dirty_ratio, dirty_min).
- include/linux/writeback.h: a per-pass cg_dirty_cap field on
struct dirty_throttle_control used to publish the memcg clamp to
BDP's setpoint and rate-limit math.
- include/trace/events/writeback.h: cg_dirty_cap added to the
balance_dirty_pages tracepoint so operators can distinguish the
memcg clamp from the global dirty_limit at runtime.
- mm/memcontrol.c: registers the two cgroup v2 files with input
validation.
- mm/page-writeback.c: the throttling changes.
The key changes in page-writeback.c:
- The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
on wb->memcg_css (the inode owner's memcg). Every consumer of
the memcg dtc -- writer throttle, flusher kworker,
cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
The clamp uses mult_frac() so a small memcg does not collapse
to a zero ceiling.
- The dirty_min bypass lives in balance_dirty_pages() and is
writer-keyed: the writing task's memcg is looked up under RCU,
and when its dirty+in-flight backlog is below dirty_min the
loop jumps to free_running, bypassing both the global and the
per-memcg BDP gates. dirty_min is an admission guarantee for
the writer's own cgroup, not for inode owners.
- When the clamp engages, mdtc->dirty is replaced with the
memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
setpoint / rate-limit smoothing see the real backlog and pages
migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
widen the cap.
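In condensed form (excerpted and lightly trimmed from the diff below;
see the patch for the full context and comments), the two hooks look
roughly like this:

	/* domain_dirty_limits(), memcg dtc only: dirty_ratio ceiling */
	if (cg_ratio) {
		unsigned long cg_thresh = mult_frac(available_memory,
						    cg_ratio, 100);

		if (cg_thresh < thresh) {
			bg_thresh = mult_frac(bg_thresh, cg_thresh, thresh);
			thresh = cg_thresh;
			dtc->cg_dirty_cap = cg_thresh;
		}
	}

	/* balance_dirty_pages(), per iteration: dirty_min admission bypass */
	if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
		goto free_running;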
Fast-path cost when neither knob is set: one rcu_read_lock/unlock
pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
call. The memcg counter reads are gated on "knob armed" and do
not fire on the default path. We have not measured the added
cost yet, but we expect it to be in the noise of existing BDP
bookkeeping. A tight pwrite() microbenchmark will confirm this
before a non-RFC posting.
Scope
=====
This series affects the writer-side throttle (balance_dirty_pages())
only. It does not partition the flusher-side writeback queue. A
cgroup's fsync() can still block behind inodes from other cgroups in
writeback_sb_inodes(). We document this limit explicitly and expect
writeback-queue partitioning to be a separate, larger effort.
Interaction with block-layer throttles
======================================
The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
runs before the bio reaches the block layer, so dirty_min simply allows
a cgroup to keep accepting write() syscalls up to its reservation; the
actual I/O is still subject to whatever block throttle is in effect. In
the reproducer above, the disk-level bandwidth limit (256 KB/s) is
applied at the QEMU virtio-blk layer, and the protected victim dirties
pages at roughly the rate it can drain them, so the block throttle is
exercised on both kernels. We have not yet tested interaction with
guest-side io.max settings; this is on the list before a non-RFC
posting.
Questions for maintainers
=========================
1. Is writer-throttle-only scope (no flusher/writeback-queue work)
acceptable for the first series?
2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
has it) or on the memcg's wb_domain? Routing it through wb_domain
would let us reuse __wb_calc_thresh() and keep all threshold
policy in one place; a possible split is dirty_ratio on wb_domain
(it is threshold policy), dirty_min on the memcg (throttle
bypass). We can go either way.
3. For dirty_min safety caps (per-cgroup and global sum), which
approach do you prefer: reject at write time, or clamp on read in
an effective_dirty_min?
4. Is hierarchical enforcement of dirty_ratio (clamp against
ancestors) required for v1, or can it follow in a subsequent
series?
What's missing before a non-RFC posting
=======================================
- Split the monolithic prototype into a proper series (one patch per
concept + Documentation + selftest).
- Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
- tools/testing/selftests/cgroup/ test for interface surface and
noisy-neighbor protection.
- Implement the per-cgroup and global dirty_min safety caps described
in the memory.dirty_min bullet.
- Fast-path microbenchmark: confirm zero measurable regression for
cgroups that have neither knob set.
- Larger-N validation on real hardware (the current N=5 data is from
a QEMU guest on a throttled virtio-blk).
Follow-ups (out of scope for this series)
=========================================
- memory.dirty_weight: a priority weight knob that scales the BDP
pause length, planned as a separate series. The prototype validated
the interface surface but the application site (post-pause scaling
vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
before we ship it. Happy to discuss in advance of that posting.
- memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
mirroring the global vm_dirty_bytes. For operators who want a
byte-predictable per-cgroup dirty cap rather than a ratio of the
cgroup's dirtyable memory. We have not prototyped this yet; we
are listing it so reviewers know it is on the roadmap, since the
ratio-only interface does not cover that use case.
- Writeback-queue partitioning: flusher-side fairness across
cgroups, as noted in Scope above.
Looking forward to feedback.
Thanks,
Alireza Haghdoost <haghdoost@uber.com>
Kshitij Doshi <kshitijd@uber.com>
---
include/linux/memcontrol.h | 10 +++
include/linux/writeback.h | 4 +
include/trace/events/writeback.h | 5 +-
mm/memcontrol.c | 62 ++++++++++++++
mm/page-writeback.c | 179 ++++++++++++++++++++++++++++++++++++---
5 files changed, 249 insertions(+), 11 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..45ca949a4c68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,16 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+ /* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
+ /*
+ * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
+ * matching the global vm_dirty_ratio base; 0 inherits the global
+ * threshold.
+ * dirty_min: dirty-page reservation, in pages; 0 disables the bypass.
+ */
+ unsigned int dirty_ratio;
+ unsigned long dirty_min;
+
struct mem_cgroup_per_node *nodeinfo[];
};
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 62552a2ce5b9..e37632f728be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -318,6 +318,10 @@ struct dirty_throttle_control {
unsigned long thresh; /* dirty threshold */
unsigned long bg_thresh; /* dirty background threshold */
unsigned long limit; /* hard dirty limit */
+ unsigned long cg_dirty_cap; /* per-memcg dirty_ratio clamp for
+ * this pass, or PAGE_COUNTER_MAX
+ * when no memcg clamp applies
+ */
unsigned long wb_dirty; /* per-wb counterparts */
unsigned long wb_thresh;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..0bf86b3c903c 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
__array( char, bdi, 32)
__field(u64, cgroup_ino)
__field(unsigned long, limit)
+ __field(unsigned long, cg_dirty_cap)
__field(unsigned long, setpoint)
__field(unsigned long, dirty)
__field(unsigned long, wb_setpoint)
@@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
__entry->limit = dtc->limit;
+ __entry->cg_dirty_cap = dtc->cg_dirty_cap;
__entry->setpoint = (dtc->limit + freerun) / 2;
__entry->dirty = dtc->dirty;
__entry->wb_setpoint = __entry->setpoint *
@@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
TP_printk("bdi %s: "
- "limit=%lu setpoint=%lu dirty=%lu "
+ "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
"wb_setpoint=%lu wb_dirty=%lu "
"dirty_ratelimit=%lu task_ratelimit=%lu "
"dirtied=%u dirtied_pause=%u "
"paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
__entry->bdi,
__entry->limit,
+ __entry->cg_dirty_cap,
__entry->setpoint,
__entry->dirty,
__entry->wb_setpoint,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..c43fe4f394eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_dirty_ratio_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
+ return 0;
+}
+
+static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned int ratio;
+ int err;
+
+ err = kstrtouint(strstrip(buf), 0, &ratio);
+ if (err)
+ return err;
+
+ if (ratio > 100)
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->dirty_ratio, ratio);
+ return nbytes;
+}
+
+static int memory_dirty_min_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ /* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
+ return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
+}
+
+static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long dirty_min;
+ int err;
+
+ /* page_counter_memparse converts strings like "512M" into a page count */
+ err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
+ if (err)
+ return err;
+
+ WRITE_ONCE(memcg->dirty_min, dirty_min);
+ return nbytes;
+}
+
/*
* Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
* if any new events become available.
@@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+ {
+ .name = "dirty_ratio",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_dirty_ratio_show,
+ .write = memory_dirty_ratio_write,
+ },
+ {
+ .name = "dirty_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_dirty_min_show,
+ .write = memory_dirty_min_write,
+ },
{ } /* terminate */
};
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..2847b2c1e59a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
#define GDTC_INIT(__wb) .wb = (__wb), \
.dom = &global_wb_domain, \
- .wb_completions = &(__wb)->completions
+ .wb_completions = &(__wb)->completions, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
-#define GDTC_INIT_NO_WB .dom = &global_wb_domain
+#define GDTC_INIT_NO_WB .dom = &global_wb_domain, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
#define MDTC_INIT(__wb, __gdtc) .wb = (__wb), \
.dom = mem_cgroup_wb_domain(__wb), \
.wb_completions = &(__wb)->memcg_completions, \
- .gdtc = __gdtc
+ .gdtc = __gdtc, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
static bool mdtc_valid(struct dirty_throttle_control *dtc)
{
@@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
#else /* CONFIG_CGROUP_WRITEBACK */
#define GDTC_INIT(__wb) .wb = (__wb), \
- .wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
+ .wb_completions = &(__wb)->completions, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
+#define GDTC_INIT_NO_WB .cg_dirty_cap = PAGE_COUNTER_MAX
#define MDTC_INIT(__wb, __gdtc)
static bool mdtc_valid(struct dirty_throttle_control *dtc)
@@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
+
+ /*
+ * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
+ * iff @dtc is a memcg dtc). dirty_ratio is scaled against
+ * the memcg's own dirtyable memory (@available_memory), matching
+ * the semantics of vm_dirty_ratio so the two knobs share a base
+ * and compose via a plain min() on thresh. The clamp is keyed
+ * on wb->memcg_css (the inode-owner's memcg) rather than on
+ * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
+ * (flusher kworker context), and cgwb_calc_thresh() all see the
+ * same clamped value.
+ *
+ * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
+ * callers in balance_dirty_pages() can ignore the slower
+ * dom->dirty_limit smoothing when deriving setpoint/
+ * rate-limit from the clamped ceiling.
+ *
+ * Clamp is applied after the rt/dl boost: dirty_ratio is a
+ * strict override, not widened by priority. bg_thresh is
+ * scaled by the same factor we apply to thresh so the
+ * user-configured bg/thresh ratio survives clamping instead
+ * of snapping to thresh/2 via the bg_thresh >= thresh guard
+ * below. mult_frac() preserves precision for small memcgs
+ * where a plain "(avail / 100) * ratio" would collapse to 0.
+ */
+ if (gdtc) {
+ struct mem_cgroup *memcg =
+ mem_cgroup_from_css(dtc->wb->memcg_css);
+ unsigned int cg_ratio = memcg ?
+ READ_ONCE(memcg->dirty_ratio) : 0;
+
+ /*
+ * dtc is reused across balance_dirty_pages() iterations,
+ * so reset the published clamp every call -- an admin
+ * clearing memory.dirty_ratio mid-flight must take effect
+ * on the next pass.
+ */
+ dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
+
+ if (cg_ratio) {
+ unsigned long cg_thresh = mult_frac(available_memory,
+ cg_ratio, 100);
+
+ if (cg_thresh < thresh) {
+ bg_thresh = mult_frac(bg_thresh, cg_thresh,
+ thresh);
+ thresh = cg_thresh;
+ dtc->cg_dirty_cap = cg_thresh;
+ }
+ }
+ }
+
/*
* Dirty throttling logic assumes the limits in page units fit into
* 32-bits. This gives 16TB dirty limits max which is hopefully enough.
@@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+ unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
+ dtc->thresh),
+ dtc->cg_dirty_cap);
unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
@@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
struct bdi_writeback *wb = dtc->wb;
unsigned long dirty = dtc->dirty;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+ unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
+ dtc->cg_dirty_cap);
unsigned long setpoint = (freerun + limit) / 2;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
int ret = 0;
for (;;) {
+ unsigned long cg_dirty_min = 0;
+ unsigned long cg_dirty_pages = 0;
unsigned long now = jiffies;
nr_dirty = global_node_page_state(NR_FILE_DIRTY);
balance_domain_limits(gdtc, strictlimit);
+
+ /*
+ * Under RCU, snapshot the current memcg's memory.dirty_min
+ * reservation. When it is non-zero, also snapshot the
+ * memcg-wide dirty backlog. These feed the per-writer
+ * dirty_min bypass below; the dirty_ratio clamp itself
+ * is applied inside domain_dirty_limits() keyed on
+ * wb->memcg_css so balance_dirty_pages(),
+ * wb_over_bg_thresh() (flusher kworker context), and
+ * cgwb_calc_thresh() all see a consistent clamped
+ * threshold.
+ *
+ * rcu_read_lock() is held only for the __rcu dereference
+ * of current->cgroups; the memcg pointer does not escape
+ * the critical section. The counter read matches
+ * domain_dirty_avail(mdtc, true) so the bypass compares
+ * the same dirty+in-flight backlog the global path uses.
+ */
+ rcu_read_lock();
+ {
+ struct mem_cgroup *memcg =
+ mem_cgroup_from_task(current);
+
+ if (memcg) {
+ cg_dirty_min = READ_ONCE(memcg->dirty_min);
+ if (cg_dirty_min)
+ cg_dirty_pages =
+ memcg_page_state(memcg,
+ NR_FILE_DIRTY) +
+ memcg_page_state(memcg,
+ NR_WRITEBACK);
+ }
+ }
+ rcu_read_unlock();
+
if (mdtc) {
/*
- * If @wb belongs to !root memcg, repeat the same
- * basic calculations for the memcg domain.
+ * For !root memcg, repeat the same three-step
+ * sequence as balance_domain_limits(gdtc):
+ * avail -> limits -> freerun. We inline it here
+ * so we can insert the mdtc->dirty override
+ * between step 2 (domain_dirty_limits, which
+ * publishes the per-memcg dirty_ratio clamp on
+ * cg_dirty_cap) and step 3 (domain_dirty_freerun,
+ * which consumes mdtc->dirty along with
+ * thresh/bg_thresh).
+ */
+ domain_dirty_avail(mdtc, true);
+ domain_dirty_limits(mdtc);
+
+ /*
+ * When the dirty_ratio clamp engaged, replace the
+ * per-wb dirty count from mem_cgroup_wb_stats()
+ * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
+ * sum so freerun, the setpoint, and the rate-limit
+ * smoothing see the true memcg backlog instead of
+ * the subset that has migrated to this cgwb (cgwb
+ * migration is lazy and can lag by many seconds),
+ * and so a burst of buffered writes cannot silently
+ * bypass the clamp by shifting pages from
+ * NR_FILE_DIRTY into NR_WRITEBACK.
+ *
+ * Keyed on wb->memcg_css to match the clamp itself.
+ * The cgwb holds a css reference, so the memcg
+ * pointer is stable without additional locking.
+ *
+ * Caveat: memcg_page_state() aggregates across ALL
+ * backing devices owned by this memcg, while mdtc
+ * is scoped to one wb. A writer to a fast BDI may
+ * observe backlog accumulated on slow BDIs in the
+ * same memcg and throttle more than strictly needed.
+ * Accepted for v1; the alternative (summing per-wb
+ * dirty across the memcg's cgwbs) walks the cgwb
+ * list under RCU on a hot path.
*/
- balance_domain_limits(mdtc, strictlimit);
+ if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
+ struct mem_cgroup *wb_memcg =
+ mem_cgroup_from_css(mdtc->wb->memcg_css);
+
+ if (wb_memcg)
+ mdtc->dirty =
+ memcg_page_state(wb_memcg,
+ NR_FILE_DIRTY) +
+ memcg_page_state(wb_memcg,
+ NR_WRITEBACK);
+ }
+
+ domain_dirty_freerun(mdtc, strictlimit);
}
if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
wb_start_background_writeback(wb);
+ /*
+ * dirty_min bypass: when the current memcg's dirty+in-flight
+ * backlog is below its memory.dirty_min reservation, let the
+ * writer proceed without throttling. This check must live
+ * outside the if (mdtc) block because a writer's file may not
+ * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
+ * and the per-memcg block above is skipped entirely.
+ *
+ * cg_dirty_min and cg_dirty_pages come from the per-iteration
+ * snapshot taken above under rcu_read_lock; both are stored
+ * in pages (page_counter_memparse converts bytes -> pages for
+ * dirty_min), so no unit conversion is needed.
+ */
+ if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
+ goto free_running;
+
/*
* If memcg domain is in effect, @dirty should be under
* both global and memcg freerun ceilings.
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a
Best regards,
--
Alireza Haghdoost <haghdoost@uber.com>
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
@ 2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost via B4 Relay @ 2026-05-01 22:28 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara
Cc: cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi,
Alireza Haghdoost
From: Alireza Haghdoost <haghdoost@uber.com>
Add two cgroup v2 memory-controller knobs that bring
balance_dirty_pages() throttling into per-cgroup scope so one noisy
writer cannot stall peers sharing the same host:
memory.dirty_ratio Per-cgroup dirty-page ceiling, 0..100 percent of
the cgroup's dirtyable memory. 0 (default) leaves
the cgroup subject to the global threshold only.
memory.dirty_min Guaranteed dirty-page floor, byte value (default 0).
The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is
never throttled.
Motivation, design trade-offs, cost analysis, validation data, and
open questions are in the cover letter.
Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
Assisted-by: Cursor:claude-sonnet-4.5
---
Hi all,
This RFC adds two cgroup v2 memory-controller knobs that give operators
per-cgroup control over dirty-page throttling in balance_dirty_pages():
memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
once we have validated the application site (see "Follow-ups" below).
We are posting this as an RFC, as a single squashed patch, to get design
feedback before splitting the prototype into a per-logical-change series.
Motivation
==========
balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
once the host-wide dirty count crosses a single threshold. On a
container host that threshold is shared across cgroups. A cgroup
that dirties pages faster than storage can drain them pushes the
count over the limit. Every writer on the host then parks in
io_schedule_timeout() -- including cgroups that have not dirtied a
single page of their own.
cgroup v2 already has per-memcg dirty accounting, but that accounting
does not translate into per-memcg dirty throttling.
We see this in production: a buffered write-heavy container generates
multi-second stalls for co-located latency-sensitive workloads.
Moreover, dirty-page accumulation from a single noisy neighbor is a
recurring contributor to mount-responsiveness degradation on shared hosts.
Prior work
==========
Per-memcg dirty-page limits have been proposed before. Andrea Righi
posted an initial RFC in February 2010 [1]; Greg Thelen continued the
work through v9 in August 2011 [2]. That series added per-memcg dirty
counters and hooked them into balance_dirty_pages(), but it bolted
per-cgroup limits onto the global writeback path without making
writeback itself cgroup-aware. Without cgroup-aware flusher threads,
a cgroup exceeding its limit triggered writeback of inodes from any
cgroup, giving poor isolation. The series was not merged.
Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
dirty-set controller" in January 2015 [4], which proposed
memory.dirty_ratio (the same interface name this series uses) via an
inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
and rejected it as a "dead end" that duplicated lower-layer policy
without solving the underlying isolation problem; this rejection
directly motivated Tejun's native cgwb rework described below.
Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
(commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
different approach of restructuring writeback to be natively
cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
threads [3]. That work deliberately deferred user-facing policy knobs.
This series adds the policy surface that consumes Tejun's
infrastructure. The dirty_min reservation concept is, to our
knowledge, new.
A November 2023 LKML thread by Chengming Zhou [5] independently
identified the identical throttling regression on cgroup v2 (a 5 GB
container constantly throttled because memory.max * dirty_ratio yields
too small a threshold for bursty workloads). Jan Kara participated and
endorsed a bytes-based per-memcg dirty limit; no patches followed that
discussion, confirming the gap this series fills.
[1] https://lwn.net/Articles/408349/
[2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
[3] https://lwn.net/Articles/648292/
[4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
[5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/
Proposed interface
==================
Two new cgroup v2 files under the memory controller, absent on the root
cgroup (CFTYPE_NOT_ON_ROOT):
memory.dirty_ratio
Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
the cgroup's dirtyable memory (mdtc->avail: file cache plus
reclaimable slack), the same base the global vm_dirty_ratio scales
against. 0 (the default) disables the per-cgroup ceiling and leaves
the cgroup subject to the global threshold only. A non-zero value
that is stricter than vm_dirty_ratio overrides the global ratio for
this cgroup via min(mdtc->thresh, cg_thresh); because both sides
scale off the same base, the knob can never widen the cgroup past
the global ceiling. A memory.dirty_bytes companion for byte-precise
caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
The prototype reads the value for the immediate memcg only;
hierarchical enforcement (clamping against ancestors, like
memory.max) is not implemented yet. We would like guidance on
whether this is required for v1 or can follow in a subsequent
series.
memory.dirty_min
Byte value (K/M/G suffixes accepted), default 0 (no reservation).
Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
throttling is bypassed (goto free_running). Lets a latency-sensitive
cgroup buffer a small write burst even under global dirty pressure.
dirty_min is an admission guarantee, so we have to prevent it from
breaking the global dirty invariant. Two aspects:
- Global cap. The sum of dirty_min reservations across all cgroups
must not exceed a fraction of the global dirty threshold (our
working number is 80%), so the system always retains some shared
capacity. The prototype does not enforce this cap yet; we expect
to either reject at write() time or clamp on read in a cheap
precomputed effective_dirty_min. We would appreciate feedback on
which approach the cgroup maintainers prefer.
- Per-cgroup cap. A cgroup should not be able to reserve more dirty
capacity than it can hold. Our tentative rule is
effective_dirty_min = min(dirty_min, memory.max - memory.min),
evaluated at enforcement time so it tracks live memory.max changes,
rather than rejected at write. This is similar to how memory.low
composes with memory.max.
Neither cap is implemented in the prototype; both would land before
a non-RFC posting.
The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is never
throttled.
Test setup and results
======================
To show the problem and the fix, we built a single reproducer that runs
on an unmodified stock kernel and then on the patched kernel, using the
same setup for both.
Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
(bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
(freerun = 24 MB). Files pre-allocated with fallocate before dirty
pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
fills global dirty to the 32 MB cap, then victim runs contended for
30 s.
- noisy: single fio job, unlimited write rate, fills global dirty pool.
- victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
4 KiB sequential write().
The high freerun (24 MB) ensures victim's solo dirty accumulation
(244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
during the solo phase on either kernel, giving identical baselines.
Stock kernel results (the problem):
solo (no contention) contended inflation
victim IOPS 125 5.1 24.5x worse
victim p99 0.6 ms 152 ms 253x worse
The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
sleep. The victim has near-zero dirty pages of its own but is forced to
sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
queued in NR_WRITEBACK waiting for the throttled disk.
memory.events.local on both cgroups shows max 0 throughout the run;
this is not memory pressure inside either cgroup.
Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:
solo (no contention) contended inflation
victim IOPS 125 125 1.0x
victim p99 0.9 ms 0.7 ms 1.0x
The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
computing any sleep. Because the rate-limited victim's resident dirty
set stays well below the reservation, the check fires on every write()
-> goto free_running -> write() returns in ~7 us. The victim is fully
protected.
The figures above are the per-kernel medians of N=5 iterations and
reflect a deterministic outcome on every iter: stock had cont_iops in
4.4..6.1 (retention 0.035..0.049) on all five iters, patched had
cont_iops = 125.0 (retention = 1.000) on all five iters. The 5/5 stock
iters all hit BDP's throttled regime; the 5/5 patched iters all
bypassed it via the dirty_min check.
Implementation
==============
The patch touches five files:
- include/linux/memcontrol.h: two new fields on struct mem_cgroup
(dirty_ratio, dirty_min).
- include/linux/writeback.h: a per-pass cg_dirty_cap field on
struct dirty_throttle_control used to publish the memcg clamp to
BDP's setpoint and rate-limit math.
- include/trace/events/writeback.h: cg_dirty_cap added to the
balance_dirty_pages tracepoint so operators can distinguish the
memcg clamp from the global dirty_limit at runtime.
- mm/memcontrol.c: registers the two cgroup v2 files with input
validation.
- mm/page-writeback.c: the throttling changes.
The key changes in page-writeback.c:
- The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
on wb->memcg_css (the inode owner's memcg). Every consumer of
the memcg dtc -- writer throttle, flusher kworker,
cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
The clamp uses mult_frac() so a small memcg does not collapse
to a zero ceiling.
- The dirty_min bypass lives in balance_dirty_pages() and is
writer-keyed: the writing task's memcg is looked up under RCU,
and when its dirty+in-flight backlog is below dirty_min the
loop jumps to free_running, bypassing both the global and the
per-memcg BDP gates. dirty_min is an admission guarantee for
the writer's own cgroup, not for inode owners.
- When the clamp engages, mdtc->dirty is replaced with the
memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
setpoint / rate-limit smoothing see the real backlog and pages
migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
widen the cap.
Fast-path cost when neither knob is set: one rcu_read_lock/unlock
pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
call. The memcg counter reads are gated on "knob armed" and do
not fire on the default path. We have not measured the added
cost yet, but we expect it to be in the noise of existing BDP
bookkeeping. A tight pwrite() microbenchmark will confirm this
before a non-RFC posting.
Scope
=====
This series affects the writer-side throttle (balance_dirty_pages())
only. It does not partition the flusher-side writeback queue. A
cgroup's fsync() can still block behind inodes from other cgroups in
writeback_sb_inodes(). We document this limit explicitly and expect
writeback-queue partitioning to be a separate, larger effort.
Interaction with block-layer throttles
======================================
The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
runs before the bio reaches the block layer, so dirty_min simply allows
a cgroup to keep accepting write() syscalls up to its reservation; the
actual I/O is still subject to whatever block throttle is in effect. In
the reproducer above, the disk-level bandwidth limit (256 KB/s) is
applied at the QEMU virtio-blk layer, and the protected victim dirties
roughly equal to the rate it can drain, so the block throttle is
exercised on both kernels. We have not yet tested interaction with
guest-side io.max settings; this is on the list before a non-RFC
posting.
Questions for maintainers
=========================
1. Is writer-throttle-only scope (no flusher/writeback-queue work)
acceptable for the first series?
2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
has it) or on the memcg's wb_domain? Routing it through wb_domain
would let us reuse __wb_calc_thresh() and keep all threshold
policy in one place; a possible split is dirty_ratio on wb_domain
(it is threshold policy), dirty_min on the memcg (throttle
bypass). We can go either way.
3. For dirty_min safety caps (per-cgroup and global sum), which
approach do you prefer: reject at write time, or clamp on read in
an effective_dirty_min?
4. Is hierarchical enforcement of dirty_ratio (clamp against
ancestors) required for v1, or can it follow in a subsequent
series?
What's missing before a non-RFC posting
=======================================
- Split the monolithic prototype into a proper series (one patch per
concept + Documentation + selftest).
- Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
- tools/testing/selftests/cgroup/ test for interface surface and
noisy-neighbor protection.
- Implement the per-cgroup and global dirty_min safety caps described
in the memory.dirty_min bullet.
- Fast-path microbenchmark: confirm zero measurable regression for
cgroups that have neither knob set.
- Larger-N validation on real hardware (the current N=5 data is from
a QEMU guest on a throttled virtio-blk).
Follow-ups (out of scope for this series)
=========================================
- memory.dirty_weight: a priority weight knob that scales the BDP
pause length, planned as a separate series. The prototype validated
the interface surface but the application site (post-pause scaling
vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
before we ship it. Happy to discuss in advance of that posting.
- memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
mirroring the global vm_dirty_bytes. For operators who want a
byte-predictable per-cgroup dirty cap rather than a ratio of the
cgroup's dirtyable memory. We have not prototyped this yet; we
are listing it so reviewers know it is on the roadmap, since
the ratio-only interface omits that use case.
- Writeback-queue partitioning: flusher-side fairness across
cgroups, as noted in Scope above.
Looking forward to feedback.
Thanks,
Alireza Haghdoost <haghdoost@uber.com>
Kshitij Doshi <kshitijd@uber.com>
---
include/linux/memcontrol.h | 10 +++
include/linux/writeback.h | 4 +
include/trace/events/writeback.h | 5 +-
mm/memcontrol.c | 62 ++++++++++++++
mm/page-writeback.c | 179 ++++++++++++++++++++++++++++++++++++---
5 files changed, 249 insertions(+), 11 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..45ca949a4c68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,16 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+ /* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
+ /*
+ * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
+ * matching the global vm_dirty_ratio base; 0 inherits the global
+ * threshold.
+ * dirty_min: dirty-page reservation, in pages; 0 disables the bypass.
+ */
+ unsigned int dirty_ratio;
+ unsigned long dirty_min;
+
struct mem_cgroup_per_node *nodeinfo[];
};
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 62552a2ce5b9..e37632f728be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -318,6 +318,10 @@ struct dirty_throttle_control {
unsigned long thresh; /* dirty threshold */
unsigned long bg_thresh; /* dirty background threshold */
unsigned long limit; /* hard dirty limit */
+ unsigned long cg_dirty_cap; /* per-memcg dirty_ratio clamp for
+ * this pass, or PAGE_COUNTER_MAX
+ * when no memcg clamp applies
+ */
unsigned long wb_dirty; /* per-wb counterparts */
unsigned long wb_thresh;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..0bf86b3c903c 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
__array( char, bdi, 32)
__field(u64, cgroup_ino)
__field(unsigned long, limit)
+ __field(unsigned long, cg_dirty_cap)
__field(unsigned long, setpoint)
__field(unsigned long, dirty)
__field(unsigned long, wb_setpoint)
@@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
__entry->limit = dtc->limit;
+ __entry->cg_dirty_cap = dtc->cg_dirty_cap;
__entry->setpoint = (dtc->limit + freerun) / 2;
__entry->dirty = dtc->dirty;
__entry->wb_setpoint = __entry->setpoint *
@@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
TP_printk("bdi %s: "
- "limit=%lu setpoint=%lu dirty=%lu "
+ "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
"wb_setpoint=%lu wb_dirty=%lu "
"dirty_ratelimit=%lu task_ratelimit=%lu "
"dirtied=%u dirtied_pause=%u "
"paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
__entry->bdi,
__entry->limit,
+ __entry->cg_dirty_cap,
__entry->setpoint,
__entry->dirty,
__entry->wb_setpoint,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..c43fe4f394eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static int memory_dirty_ratio_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
+ return 0;
+}
+
+static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned int ratio;
+ int err;
+
+ err = kstrtouint(strstrip(buf), 0, &ratio);
+ if (err)
+ return err;
+
+ if (ratio > 100)
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->dirty_ratio, ratio);
+ return nbytes;
+}
+
+static int memory_dirty_min_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+ /* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
+ return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
+}
+
+static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long dirty_min;
+ int err;
+
+ /* page_counter_memparse converts strings like "512M" into a page count */
+ err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
+ if (err)
+ return err;
+
+ WRITE_ONCE(memcg->dirty_min, dirty_min);
+ return nbytes;
+}
+
/*
* Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
* if any new events become available.
@@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+ {
+ .name = "dirty_ratio",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_dirty_ratio_show,
+ .write = memory_dirty_ratio_write,
+ },
+ {
+ .name = "dirty_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_dirty_min_show,
+ .write = memory_dirty_min_write,
+ },
{ } /* terminate */
};
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..2847b2c1e59a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
#define GDTC_INIT(__wb) .wb = (__wb), \
.dom = &global_wb_domain, \
- .wb_completions = &(__wb)->completions
+ .wb_completions = &(__wb)->completions, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
-#define GDTC_INIT_NO_WB .dom = &global_wb_domain
+#define GDTC_INIT_NO_WB .dom = &global_wb_domain, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
#define MDTC_INIT(__wb, __gdtc) .wb = (__wb), \
.dom = mem_cgroup_wb_domain(__wb), \
.wb_completions = &(__wb)->memcg_completions, \
- .gdtc = __gdtc
+ .gdtc = __gdtc, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
static bool mdtc_valid(struct dirty_throttle_control *dtc)
{
@@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
#else /* CONFIG_CGROUP_WRITEBACK */
#define GDTC_INIT(__wb) .wb = (__wb), \
- .wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
+ .wb_completions = &(__wb)->completions, \
+ .cg_dirty_cap = PAGE_COUNTER_MAX
+#define GDTC_INIT_NO_WB .cg_dirty_cap = PAGE_COUNTER_MAX
#define MDTC_INIT(__wb, __gdtc)
static bool mdtc_valid(struct dirty_throttle_control *dtc)
@@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
+
+ /*
+ * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
+ * iff @dtc is a memcg dtc). dirty_ratio is scaled against
+ * the memcg's own dirtyable memory (@available_memory), matching
+ * the semantics of vm_dirty_ratio so the two knobs share a base
+ * and compose via a plain min() on thresh. The clamp is keyed
+ * on wb->memcg_css (the inode-owner's memcg) rather than on
+ * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
+ * (flusher kworker context), and cgwb_calc_thresh() all see the
+ * same clamped value.
+ *
+ * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
+ * callers in balance_dirty_pages() can ignore the slower
+ * dom->dirty_limit smoothing when deriving setpoint/
+ * rate-limit from the clamped ceiling.
+ *
+ * Clamp is applied after the rt/dl boost: dirty_ratio is a
+ * strict override, not widened by priority. bg_thresh is
+ * scaled by the same factor we apply to thresh so the
+ * user-configured bg/thresh ratio survives clamping instead
+ * of snapping to thresh/2 via the bg_thresh >= thresh guard
+ * below. mult_frac() preserves precision for small memcgs
+ * where a plain "(avail / 100) * ratio" would collapse to 0.
+ */
+ if (gdtc) {
+ struct mem_cgroup *memcg =
+ mem_cgroup_from_css(dtc->wb->memcg_css);
+ unsigned int cg_ratio = memcg ?
+ READ_ONCE(memcg->dirty_ratio) : 0;
+
+ /*
+ * dtc is reused across balance_dirty_pages() iterations,
+ * so reset the published clamp every call -- an admin
+ * clearing memory.dirty_ratio mid-flight must take effect
+ * on the next pass.
+ */
+ dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
+
+ if (cg_ratio) {
+ unsigned long cg_thresh = mult_frac(available_memory,
+ cg_ratio, 100);
+
+ if (cg_thresh < thresh) {
+ bg_thresh = mult_frac(bg_thresh, cg_thresh,
+ thresh);
+ thresh = cg_thresh;
+ dtc->cg_dirty_cap = cg_thresh;
+ }
+ }
+ }
+
/*
* Dirty throttling logic assumes the limits in page units fit into
* 32-bits. This gives 16TB dirty limits max which is hopefully enough.
@@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+ unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
+ dtc->thresh),
+ dtc->cg_dirty_cap);
unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
@@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
struct bdi_writeback *wb = dtc->wb;
unsigned long dirty = dtc->dirty;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+ unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
+ dtc->cg_dirty_cap);
unsigned long setpoint = (freerun + limit) / 2;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
int ret = 0;
for (;;) {
+ unsigned long cg_dirty_min = 0;
+ unsigned long cg_dirty_pages = 0;
unsigned long now = jiffies;
nr_dirty = global_node_page_state(NR_FILE_DIRTY);
balance_domain_limits(gdtc, strictlimit);
+
+ /*
+ * Under RCU, snapshot the current memcg's memory.dirty_min
+ * reservation. When it is non-zero, also snapshot the
+ * memcg-wide dirty backlog. These feed the per-writer
+ * dirty_min bypass below; the dirty_ratio clamp itself
+ * is applied inside domain_dirty_limits() keyed on
+ * wb->memcg_css so balance_dirty_pages(),
+ * wb_over_bg_thresh() (flusher kworker context), and
+ * cgwb_calc_thresh() all see a consistent clamped
+ * threshold.
+ *
+ * rcu_read_lock() is held only for the __rcu dereference
+ * of current->cgroups; the memcg pointer does not escape
+ * the critical section. The counter read matches
+ * domain_dirty_avail(mdtc, true) so the bypass compares
+ * the same dirty+in-flight backlog the global path uses.
+ */
+ rcu_read_lock();
+ {
+ struct mem_cgroup *memcg =
+ mem_cgroup_from_task(current);
+
+ if (memcg) {
+ cg_dirty_min = READ_ONCE(memcg->dirty_min);
+ if (cg_dirty_min)
+ cg_dirty_pages =
+ memcg_page_state(memcg,
+ NR_FILE_DIRTY) +
+ memcg_page_state(memcg,
+ NR_WRITEBACK);
+ }
+ }
+ rcu_read_unlock();
+
if (mdtc) {
/*
- * If @wb belongs to !root memcg, repeat the same
- * basic calculations for the memcg domain.
+ * For !root memcg, repeat the same three-step
+ * sequence as balance_domain_limits(gdtc):
+ * avail -> limits -> freerun. We inline it here
+ * so we can insert the mdtc->dirty override
+ * between step 2 (domain_dirty_limits, which
+ * publishes the per-memcg dirty_ratio clamp on
+ * cg_dirty_cap) and step 3 (domain_dirty_freerun,
+ * which consumes mdtc->dirty along with
+ * thresh/bg_thresh).
+ */
+ domain_dirty_avail(mdtc, true);
+ domain_dirty_limits(mdtc);
+
+ /*
+ * When the dirty_ratio clamp engaged, replace the
+ * per-wb dirty count from mem_cgroup_wb_stats()
+ * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
+ * sum so freerun, the setpoint, and the rate-limit
+ * smoothing see the true memcg backlog instead of
+ * the subset that has migrated to this cgwb (cgwb
+ * migration is lazy and can lag by many seconds),
+ * and so a burst of buffered writes cannot silently
+ * bypass the clamp by shifting pages from
+ * NR_FILE_DIRTY into NR_WRITEBACK.
+ *
+ * Keyed on wb->memcg_css to match the clamp itself.
+ * The cgwb holds a css reference, so the memcg
+ * pointer is stable without additional locking.
+ *
+ * Caveat: memcg_page_state() aggregates across ALL
+ * backing devices owned by this memcg, while mdtc
+ * is scoped to one wb. A writer to a fast BDI may
+ * observe backlog accumulated on slow BDIs in the
+ * same memcg and throttle more than strictly needed.
+ * Accepted for v1; the alternative (summing per-wb
+ * dirty across the memcg's cgwbs) walks the cgwb
+ * list under RCU on a hot path.
*/
- balance_domain_limits(mdtc, strictlimit);
+ if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
+ struct mem_cgroup *wb_memcg =
+ mem_cgroup_from_css(mdtc->wb->memcg_css);
+
+ if (wb_memcg)
+ mdtc->dirty =
+ memcg_page_state(wb_memcg,
+ NR_FILE_DIRTY) +
+ memcg_page_state(wb_memcg,
+ NR_WRITEBACK);
+ }
+
+ domain_dirty_freerun(mdtc, strictlimit);
}
if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
wb_start_background_writeback(wb);
+ /*
+ * dirty_min bypass: when the current memcg's dirty+in-flight
+ * backlog is below its memory.dirty_min reservation, let the
+ * writer proceed without throttling. This check must live
+ * outside the if (mdtc) block because a writer's file may not
+ * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
+ * and the per-memcg block above is skipped entirely.
+ *
+ * cg_dirty_min and cg_dirty_pages come from the per-iteration
+ * snapshot taken above under rcu_read_lock; both are stored
+ * in pages (page_counter_memparse converts bytes -> pages for
+ * dirty_min), so no unit conversion is needed.
+ */
+ if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
+ goto free_running;
+
/*
* If memcg domain is in effect, @dirty should be under
* both global and memcg freerun ceilings.
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a
Best regards,
--
Alireza Haghdoost <haghdoost@uber.com>
* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
@ 2026-05-03 8:59 ` kernel test robot
-1 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-05-03 8:59 UTC (permalink / raw)
To: Alireza Haghdoost via B4 Relay; +Cc: oe-kbuild-all
Hi Alireza,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on 254f49634ee16a731174d2ae34bc50bd5f45e731]
url: https://github.com/intel-lab-lkp/linux/commits/Alireza-Haghdoost-via-B4-Relay/memcg-add-per-cgroup-dirty-page-controls-dirty_ratio-dirty_min/20260502-235916
base: 254f49634ee16a731174d2ae34bc50bd5f45e731
patch link: https://lore.kernel.org/r/20260501-rfc-memcg-dirty-v1-v1-1-9a8c80036ec1%40uber.com
patch subject: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031656.ByQM5ZN9-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031656.ByQM5ZN9-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031656.ByQM5ZN9-lkp@intel.com/
All errors (new ones prefixed by >>):
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~
include/linux/compiler_types.h:635:53: note: in definition of macro '__unqual_scalar_typeof'
635 | #define __unqual_scalar_typeof(x) __typeof_unqual__(x)
| ^
include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
50 | __READ_ONCE(x); \
| ^~~~~~~~~~~
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
In file included from ./arch/openrisc/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:369,
from include/linux/array_size.h:5,
from include/linux/kernel.h:16,
from mm/page-writeback.c:15:
mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~
include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
44 | #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
| ^
mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ^~~~~~~~~
mm/page-writeback.c: In function 'balance_dirty_pages':
mm/page-writeback.c:1912:33: error: implicit declaration of function 'mem_cgroup_from_task'; did you mean 'mem_cgroup_from_css'? [-Wimplicit-function-declaration]
1912 | mem_cgroup_from_task(current);
| ^~~~~~~~~~~~~~~~~~~~
| mem_cgroup_from_css
>> mm/page-writeback.c:1912:33: error: initialization of 'struct mem_cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~~~~~~~~
mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~~~~~~~~
mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~~~~~~~~
mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~~~~~~~~
mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ^~
include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
679 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
699 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
vim +1912 mm/page-writeback.c
1853
1854 /*
1855 * balance_dirty_pages() must be called by processes which are generating dirty
1856 * data. It looks at the number of dirty pages in the machine and will force
1857 * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
1858 * If we're over `background_thresh' then the writeback threads are woken to
1859 * perform some writeout.
1860 */
1861 static int balance_dirty_pages(struct bdi_writeback *wb,
1862 unsigned long pages_dirtied, unsigned int flags)
1863 {
1864 struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
1865 struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
1866 struct dirty_throttle_control * const gdtc = &gdtc_stor;
1867 struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
1868 &mdtc_stor : NULL;
1869 struct dirty_throttle_control *sdtc;
1870 unsigned long nr_dirty;
1871 long period;
1872 long pause;
1873 long max_pause;
1874 long min_pause;
1875 int nr_dirtied_pause;
1876 unsigned long task_ratelimit;
1877 unsigned long dirty_ratelimit;
1878 struct backing_dev_info *bdi = wb->bdi;
1879 bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
1880 unsigned long start_time = jiffies;
1881 int ret = 0;
1882
1883 for (;;) {
1884 unsigned long cg_dirty_min = 0;
1885 unsigned long cg_dirty_pages = 0;
1886 unsigned long now = jiffies;
1887
1888 nr_dirty = global_node_page_state(NR_FILE_DIRTY);
1889
1890 balance_domain_limits(gdtc, strictlimit);
1891
1892 /*
1893 * Under RCU, snapshot the current memcg's memory.dirty_min
1894 * reservation. When it is non-zero, also snapshot the
1895 * memcg-wide dirty backlog. These feed the per-writer
1896 * dirty_min bypass below; the dirty_ratio clamp itself
1897 * is applied inside domain_dirty_limits() keyed on
1898 * wb->memcg_css so balance_dirty_pages(),
1899 * wb_over_bg_thresh() (flusher kworker context), and
1900 * cgwb_calc_thresh() all see a consistent clamped
1901 * threshold.
1902 *
1903 * rcu_read_lock() is held only for the __rcu dereference
1904 * of current->cgroups; the memcg pointer does not escape
1905 * the critical section. The counter read matches
1906 * domain_dirty_avail(mdtc, true) so the bypass compares
1907 * the same dirty+in-flight backlog the global path uses.
1908 */
1909 rcu_read_lock();
1910 {
1911 struct mem_cgroup *memcg =
> 1912 mem_cgroup_from_task(current);
1913
1914 if (memcg) {
1915 cg_dirty_min = READ_ONCE(memcg->dirty_min);
1916 if (cg_dirty_min)
1917 cg_dirty_pages =
1918 memcg_page_state(memcg,
1919 NR_FILE_DIRTY) +
1920 memcg_page_state(memcg,
1921 NR_WRITEBACK);
1922 }
1923 }
1924 rcu_read_unlock();
1925
1926 if (mdtc) {
1927 /*
1928 * For !root memcg, repeat the same three-step
1929 * sequence as balance_domain_limits(gdtc):
1930 * avail -> limits -> freerun. We inline it here
1931 * so we can insert the mdtc->dirty override
1932 * between step 2 (domain_dirty_limits, which
1933 * publishes the per-memcg dirty_ratio clamp on
1934 * cg_dirty_cap) and step 3 (domain_dirty_freerun,
1935 * which consumes mdtc->dirty along with
1936 * thresh/bg_thresh).
1937 */
1938 domain_dirty_avail(mdtc, true);
1939 domain_dirty_limits(mdtc);
1940
1941 /*
1942 * When the dirty_ratio clamp engaged, replace the
1943 * per-wb dirty count from mem_cgroup_wb_stats()
1944 * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
1945 * sum so freerun, the setpoint, and the rate-limit
1946 * smoothing see the true memcg backlog instead of
1947 * the subset that has migrated to this cgwb (cgwb
1948 * migration is lazy and can lag by many seconds),
1949 * and so a burst of buffered writes cannot silently
1950 * bypass the clamp by shifting pages from
1951 * NR_FILE_DIRTY into NR_WRITEBACK.
1952 *
1953 * Keyed on wb->memcg_css to match the clamp itself.
1954 * The cgwb holds a css reference, so the memcg
1955 * pointer is stable without additional locking.
1956 *
1957 * Caveat: memcg_page_state() aggregates across ALL
1958 * backing devices owned by this memcg, while mdtc
1959 * is scoped to one wb. A writer to a fast BDI may
1960 * observe backlog accumulated on slow BDIs in the
1961 * same memcg and throttle more than strictly needed.
1962 * Accepted for v1; the alternative (summing per-wb
1963 * dirty across the memcg's cgwbs) walks the cgwb
1964 * list under RCU on a hot path.
1965 */
1966 if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
1967 struct mem_cgroup *wb_memcg =
1968 mem_cgroup_from_css(mdtc->wb->memcg_css);
1969
1970 if (wb_memcg)
1971 mdtc->dirty =
1972 memcg_page_state(wb_memcg,
1973 NR_FILE_DIRTY) +
1974 memcg_page_state(wb_memcg,
1975 NR_WRITEBACK);
1976 }
1977
1978 domain_dirty_freerun(mdtc, strictlimit);
1979 }
1980
1981 if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
1982 wb_start_background_writeback(wb);
1983
1984 /*
1985 * dirty_min bypass: when the current memcg's dirty+in-flight
1986 * backlog is below its memory.dirty_min reservation, let the
1987 * writer proceed without throttling. This check must live
1988 * outside the if (mdtc) block because a writer's file may not
1989 * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
1990 * and the per-memcg block above is skipped entirely.
1991 *
1992 * cg_dirty_min and cg_dirty_pages come from the per-iteration
1993 * snapshot taken above under rcu_read_lock; both are stored
1994 * in pages (page_counter_memparse converts bytes -> pages for
1995 * dirty_min), so no unit conversion is needed.
1996 */
1997 if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
1998 goto free_running;
1999
2000 /*
2001 * If memcg domain is in effect, @dirty should be under
2002 * both global and memcg freerun ceilings.
2003 */
2004 if (gdtc->freerun && (!mdtc || mdtc->freerun)) {
2005 unsigned long intv;
2006 unsigned long m_intv;
2007
2008 free_running:
2009 intv = domain_poll_intv(gdtc, strictlimit);
2010 m_intv = ULONG_MAX;
2011
2012 current->dirty_paused_when = now;
2013 current->nr_dirtied = 0;
2014 if (mdtc)
2015 m_intv = domain_poll_intv(mdtc, strictlimit);
2016 current->nr_dirtied_pause = min(intv, m_intv);
2017 break;
2018 }
2019
2020 /*
2021 * Unconditionally start background writeback if it's not
2022 * already in progress. We need to do this because the global
2023 * dirty threshold check above (nr_dirty > gdtc->bg_thresh)
2024 * doesn't account for these cases:
2025 *
2026 * a) strictlimit BDIs: throttling is calculated using per-wb
2027 * thresholds. The per-wb threshold can be exceeded even when
2028 * nr_dirty < gdtc->bg_thresh
2029 *
2030 * b) memcg-based throttling: memcg uses its own dirty count and
2031 * thresholds and can trigger throttling even when global
2032 * nr_dirty < gdtc->bg_thresh
2033 *
2034 * Writeback needs to be started else the writer stalls in the
2035 * throttle loop waiting for dirty pages to be written back
2036 * while no writeback is running.
2037 */
2038 if (unlikely(!writeback_in_progress(wb)))
2039 wb_start_background_writeback(wb);
2040
2041 mem_cgroup_flush_foreign(wb);
2042
2043 /*
2044 * Calculate global domain's pos_ratio and select the
2045 * global dtc by default.
2046 */
2047 balance_wb_limits(gdtc, strictlimit);
2048 if (gdtc->freerun)
2049 goto free_running;
2050 sdtc = gdtc;
2051
2052 if (mdtc) {
2053 /*
2054 * If memcg domain is in effect, calculate its
2055 * pos_ratio. @wb should satisfy constraints from
2056 * both global and memcg domains. Choose the one
2057 * w/ lower pos_ratio.
2058 */
2059 balance_wb_limits(mdtc, strictlimit);
2060 if (mdtc->freerun)
2061 goto free_running;
2062 if (mdtc->pos_ratio < gdtc->pos_ratio)
2063 sdtc = mdtc;
2064 }
2065
2066 wb->dirty_exceeded = gdtc->dirty_exceeded ||
2067 (mdtc && mdtc->dirty_exceeded);
2068 if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
2069 BANDWIDTH_INTERVAL))
2070 __wb_update_bandwidth(gdtc, mdtc, true);
2071
2072 /* throttle according to the chosen dtc */
2073 dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit);
2074 task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
2075 RATELIMIT_CALC_SHIFT;
2076 max_pause = wb_max_pause(wb, sdtc->wb_dirty);
2077 min_pause = wb_min_pause(wb, max_pause,
2078 task_ratelimit, dirty_ratelimit,
2079 &nr_dirtied_pause);
2080
2081 if (unlikely(task_ratelimit == 0)) {
2082 period = max_pause;
2083 pause = max_pause;
2084 goto pause;
2085 }
2086 period = HZ * pages_dirtied / task_ratelimit;
2087 pause = period;
2088 if (current->dirty_paused_when)
2089 pause -= now - current->dirty_paused_when;
2090 /*
2091 * For less than 1s think time (ext3/4 may block the dirtier
2092 * for up to 800ms from time to time on 1-HDD; so does xfs,
2093 * however at much less frequency), try to compensate it in
2094 * future periods by updating the virtual time; otherwise just
2095 * do a reset, as it may be a light dirtier.
2096 */
2097 if (pause < min_pause) {
2098 trace_balance_dirty_pages(wb,
2099 sdtc,
2100 dirty_ratelimit,
2101 task_ratelimit,
2102 pages_dirtied,
2103 period,
2104 min(pause, 0L),
2105 start_time);
2106 if (pause < -HZ) {
2107 current->dirty_paused_when = now;
2108 current->nr_dirtied = 0;
2109 } else if (period) {
2110 current->dirty_paused_when += period;
2111 current->nr_dirtied = 0;
2112 } else if (current->nr_dirtied_pause <= pages_dirtied)
2113 current->nr_dirtied_pause += pages_dirtied;
2114 break;
2115 }
2116 if (unlikely(pause > max_pause)) {
2117 /* for occasional dropped task_ratelimit */
2118 now += min(pause - max_pause, max_pause);
2119 pause = max_pause;
2120 }
2121
2122 pause:
2123 trace_balance_dirty_pages(wb,
2124 sdtc,
2125 dirty_ratelimit,
2126 task_ratelimit,
2127 pages_dirtied,
2128 period,
2129 pause,
2130 start_time);
2131 if (flags & BDP_ASYNC) {
2132 ret = -EAGAIN;
2133 break;
2134 }
2135 __set_current_state(TASK_KILLABLE);
2136 bdi->last_bdp_sleep = jiffies;
2137 io_schedule_timeout(pause);
2138
2139 current->dirty_paused_when = now + pause;
2140 current->nr_dirtied = 0;
2141 current->nr_dirtied_pause = nr_dirtied_pause;
2142
2143 /*
2144 * This is typically equal to (dirty < thresh) and can also
2145 * keep "1000+ dd on a slow USB stick" under control.
2146 */
2147 if (task_ratelimit)
2148 break;
2149
2150 /*
2151 * In the case of an unresponsive NFS server and the NFS dirty
2152 * pages exceeds dirty_thresh, give the other good wb's a pipe
2153 * to go through, so that tasks on them still remain responsive.
2154 *
2155 * In theory 1 page is enough to keep the consumer-producer
2156 * pipe going: the flusher cleans 1 page => the task dirties 1
2157 * more page. However wb_dirty has accounting errors. So use
2158 * the larger and more IO friendly wb_stat_error.
2159 */
2160 if (sdtc->wb_dirty <= wb_stat_error())
2161 break;
2162
2163 if (fatal_signal_pending(current))
2164 break;
2165 }
2166 return ret;
2167 }
2168
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
@ 2026-05-03 9:55 ` kernel test robot
-1 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-05-03 9:55 UTC (permalink / raw)
To: Alireza Haghdoost via B4 Relay; +Cc: llvm, oe-kbuild-all
Hi Alireza,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on 254f49634ee16a731174d2ae34bc50bd5f45e731]
url: https://github.com/intel-lab-lkp/linux/commits/Alireza-Haghdoost-via-B4-Relay/memcg-add-per-cgroup-dirty-page-controls-dirty_ratio-dirty_min/20260502-235916
base: 254f49634ee16a731174d2ae34bc50bd5f45e731
patch link: https://lore.kernel.org/r/20260501-rfc-memcg-dirty-v1-v1-1-9a8c80036ec1%40uber.com
patch subject: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031710.4QHTfWdf-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 5bac06718f502014fade905512f1d26d578a18f3)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031710.4QHTfWdf-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031710.4QHTfWdf-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/page-writeback.c:426:33: error: no member named 'memcg_css' in 'struct bdi_writeback'
426 | mem_cgroup_from_css(dtc->wb->memcg_css);
| ~~~~~~~ ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
428 | READ_ONCE(memcg->dirty_ratio) : 0;
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
>> mm/page-writeback.c:1912:5: error: call to undeclared function 'mem_cgroup_from_task'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1912 | mem_cgroup_from_task(current);
| ^
mm/page-writeback.c:1912:5: note: did you mean 'mem_cgroup_from_css'?
include/linux/memcontrol.h:1211:20: note: 'mem_cgroup_from_css' declared here
1211 | struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
| ^
>> mm/page-writeback.c:1911:23: error: incompatible integer to pointer conversion initializing 'struct mem_cgroup *' with an expression of type 'int' [-Wint-conversion]
1911 | struct mem_cgroup *memcg =
| ^
1912 | mem_cgroup_from_task(current);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
1915 | cg_dirty_min = READ_ONCE(memcg->dirty_min);
| ~~~~~^
include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
55 | struct mem_cgroup *memcg;
| ^
mm/page-writeback.c:1968:36: error: no member named 'memcg_css' in 'struct bdi_writeback'
1968 | mem_cgroup_from_css(mdtc->wb->memcg_css);
| ~~~~~~~~ ^
18 errors generated.
vim +426 mm/page-writeback.c
341
342 /**
343 * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
344 * @dtc: dirty_throttle_control of interest
345 *
346 * Calculate @dtc->thresh and ->bg_thresh considering
347 * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
348 * must ensure that @dtc->avail is set before calling this function. The
349 * dirty limits will be lifted by 1/4 for real-time tasks.
350 */
351 static void domain_dirty_limits(struct dirty_throttle_control *dtc)
352 {
353 const unsigned long available_memory = dtc->avail;
354 struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
355 unsigned long bytes = vm_dirty_bytes;
356 unsigned long bg_bytes = dirty_background_bytes;
357 /* convert ratios to per-PAGE_SIZE for higher precision */
358 unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
359 unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
360 unsigned long thresh;
361 unsigned long bg_thresh;
362 struct task_struct *tsk;
363
364 /* gdtc is !NULL iff @dtc is for memcg domain */
365 if (gdtc) {
366 unsigned long global_avail = gdtc->avail;
367
368 /*
369 * The byte settings can't be applied directly to memcg
370 * domains. Convert them to ratios by scaling against
371 * globally available memory. As the ratios are in
372 * per-PAGE_SIZE, they can be obtained by dividing bytes by
373 * number of pages.
374 */
375 if (bytes)
376 ratio = min(DIV_ROUND_UP(bytes, global_avail),
377 PAGE_SIZE);
378 if (bg_bytes)
379 bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
380 PAGE_SIZE);
381 bytes = bg_bytes = 0;
382 }
383
384 if (bytes)
385 thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
386 else
387 thresh = (ratio * available_memory) / PAGE_SIZE;
388
389 if (bg_bytes)
390 bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
391 else
392 bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
393
394 tsk = current;
395 if (rt_or_dl_task(tsk)) {
396 bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
397 thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
398 }
399
400 /*
401 * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
402 * iff @dtc is a memcg dtc). dirty_ratio is scaled against
403 * the memcg's own dirtyable memory (@available_memory), matching
404 * the semantics of vm_dirty_ratio so the two knobs share a base
405 * and compose via a plain min() on thresh. The clamp is keyed
406 * on wb->memcg_css (the inode-owner's memcg) rather than on
407 * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
408 * (flusher kworker context), and cgwb_calc_thresh() all see the
409 * same clamped value.
410 *
411 * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
412 * callers in balance_dirty_pages() can ignore the slower
413 * dom->dirty_limit smoothing when deriving setpoint/
414 * rate-limit from the clamped ceiling.
415 *
416 * Clamp is applied after the rt/dl boost: dirty_ratio is a
417 * strict override, not widened by priority. bg_thresh is
418 * scaled by the same factor we apply to thresh so the
419 * user-configured bg/thresh ratio survives clamping instead
420 * of snapping to thresh/2 via the bg_thresh >= thresh guard
421 * below. mult_frac() preserves precision for small memcgs
422 * where a plain "(avail / 100) * ratio" would collapse to 0.
423 */
424 if (gdtc) {
425 struct mem_cgroup *memcg =
> 426 mem_cgroup_from_css(dtc->wb->memcg_css);
427 unsigned int cg_ratio = memcg ?
> 428 READ_ONCE(memcg->dirty_ratio) : 0;
429
430 /*
431 * dtc is reused across balance_dirty_pages() iterations,
432 * so reset the published clamp every call -- an admin
433 * clearing memory.dirty_ratio mid-flight must take effect
434 * on the next pass.
435 */
436 dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
437
438 if (cg_ratio) {
439 unsigned long cg_thresh = mult_frac(available_memory,
440 cg_ratio, 100);
441
442 if (cg_thresh < thresh) {
443 bg_thresh = mult_frac(bg_thresh, cg_thresh,
444 thresh);
445 thresh = cg_thresh;
446 dtc->cg_dirty_cap = cg_thresh;
447 }
448 }
449 }
450
451 /*
452 * Dirty throttling logic assumes the limits in page units fit into
453 * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
454 */
455 if (thresh > UINT_MAX)
456 thresh = UINT_MAX;
457 /* This makes sure bg_thresh is within 32-bits as well */
458 if (bg_thresh >= thresh)
459 bg_thresh = thresh / 2;
460 dtc->thresh = thresh;
461 dtc->bg_thresh = bg_thresh;
462
463 /* we should eventually report the domain in the TP */
464 if (!gdtc)
465 trace_global_dirty_state(bg_thresh, thresh);
466 }
467
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
` (2 preceding siblings ...)
@ 2026-05-06 14:21 ` Jan Kara
2026-05-14 4:10 ` Alireza Haghdoost
-1 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2026-05-06 14:21 UTC (permalink / raw)
To: haghdoost
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara,
cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi
On Fri 01-05-26 22:28:38, Alireza Haghdoost via B4 Relay wrote:
> From: Alireza Haghdoost <haghdoost@uber.com>
>
> Add two cgroup v2 memory-controller knobs that bring
> balance_dirty_pages() throttling into per-cgroup scope so one noisy
> writer cannot stall peers sharing the same host:
>
> memory.dirty_ratio Per-cgroup dirty-page ceiling, 0..100 percent of
> the cgroup's dirtyable memory. 0 (default) leaves
> the cgroup subject to the global threshold only.
>
> memory.dirty_min Guaranteed dirty-page floor, byte value (default 0).
>
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is
> never throttled.
>
> Motivation, design trade-offs, cost analysis, validation data, and
> open questions are in the cover letter.
>
> Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
> Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
> Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
> Assisted-by: Cursor:claude-sonnet-4.5
> ---
Things like motivation actually belong in the changelog itself, and so do
measured results showing how the patch helps. On the other hand, stuff like
history is largely irrelevant here; frankly, I don't have the bandwidth to
carefully read the huge amount of text the LLM has generated below, so
please try to make it more concise next time.
> This RFC adds two cgroup v2 memory-controller knobs that give operators
> per-cgroup control over dirty-page throttling in balance_dirty_pages():
> memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
> floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
> once we have validated the application site (see "Follow-ups" below).
> We are posting this as an RFC, as a single squashed patch, to get design
> feedback before splitting the prototype into a per-logical-change series.
>
> Motivation
> ==========
>
> balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
> once the host-wide dirty count crosses a single threshold. On a
> container host that threshold is shared across cgroups. A cgroup
> that dirties pages faster than storage can drain them pushes the
> count over the limit. Every writer on the host then parks in
> io_schedule_timeout() -- including cgroups that have not dirtied a
> single page of their own.
>
> cgroup v2 already has per-memcg dirty accounting, but that accounting
> does not translate into per-memcg dirty throttling.
Not quite true. We do have per-memcg writeback workers and we do have
per-memcg dirty limits (inferred from global dirty limit tunings) that are
enforced...
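For reference, that inference for a !root memcg looks roughly like this
(a simplified sketch of the mdtc path in domain_dirty_limits(); names
follow mm/page-writeback.c, but memcg_inferred_thresh() is a made-up
helper for illustration only, not code from the patch):

	/*
	 * The global vm_dirty_ratio / vm_dirty_bytes tunings are rescaled
	 * against the memcg's own dirtyable memory, so every memcg already
	 * gets a dirty threshold proportional to its share of page cache.
	 * Simplified: no rt-task boost, no UINT_MAX clamping.
	 */
	static unsigned long memcg_inferred_thresh(unsigned long memcg_avail,
						   unsigned long global_avail)
	{
		unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;

		if (vm_dirty_bytes)	/* bytes become a per-PAGE_SIZE ratio */
			ratio = min(DIV_ROUND_UP(vm_dirty_bytes, global_avail),
				    PAGE_SIZE);

		return (ratio * memcg_avail) / PAGE_SIZE;
	}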
> We see this in production: a buffered write-heavy container generates
> multi-second stalls for co-located latency-sensitive workloads.
> Moreover, dirty-page accumulation from a single noisy neighbor is a
> recurring contributor to mount-responsiveness degradation on shared hosts.
... and I don't quite see how the multi-second stalls you are describing would
happen. There must be something I am missing. The throttling works as
follows: until we cross the global freerun limit (that is, (background_limit +
dirty_limit) / 2) nobody is throttled. Once we cross it, memcg dirty limits
start to be taken into account. If we are still below the freerun limit in
the memcg, a task dirtying folios from that memcg shouldn't be throttled at
all; once we get above it, we throttle by the maximum of the throttling
delays derived from the global and the memcg situation, so long delays can
start happening then, but that would mean the "innocent" task's memcg had to
get at least over its freerun limit.
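To put numbers on that with the tunables from your reproducer quoted below
(dirty_bytes = 32 MB, dirty_background_bytes = 16 MB), assuming 4 KiB pages
(illustration only):

	unsigned long thresh    = 32 << (20 - PAGE_SHIFT);	/* 8192 pages */
	unsigned long bg_thresh = 16 << (20 - PAGE_SHIFT);	/* 4096 pages */
	unsigned long freerun   = dirty_freerun_ceiling(thresh, bg_thresh);
							/* 6144 pages = 24 MB */

i.e. with those settings nobody should be paused at all until the host-wide
dirty + writeback count crosses roughly 24 MB.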
So can you perhaps share more details about the configuration in which you
observe these delays to innocent tasks caused by another task dirtying a lot
of memory? How much page cache in total, and how many dirty pages, are there
in each memcg (both for the aggressive dirtier and for the wrongly delayed
task)? Is the delayed task really throttled in balance_dirty_pages()?
Honza
> Prior work
> ==========
>
> Per-memcg dirty-page limits have been proposed before. Andrea Righi
> posted an initial RFC in February 2010 [1]; Greg Thelen continued the
> work through v9 in August 2011 [2]. That series added per-memcg dirty
> counters and hooked them into balance_dirty_pages(), but it bolted
> per-cgroup limits onto the global writeback path without making
> writeback itself cgroup-aware. Without cgroup-aware flusher threads,
> a cgroup exceeding its limit triggered writeback of inodes from any
> cgroup, giving poor isolation. The series was not merged.
>
> Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
> dirty-set controller" in January 2015 [4], which proposed
> memory.dirty_ratio (the same interface name this series uses) via an
> inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
> and rejected it as a "dead end" that duplicated lower-layer policy
> without solving the underlying isolation problem; this rejection
> directly motivated Tejun's native cgwb rework described below.
>
> Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
> (commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
> different approach of restructuring writeback to be natively
> cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
> NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
> threads [3]. That work deliberately deferred user-facing policy knobs.
> This series adds the policy surface that consumes Tejun's
> infrastructure. The dirty_min reservation concept is, to our
> knowledge, new.
>
> A November 2023 LKML thread by Chengming Zhou [5] independently
> identified the identical throttling regression on cgroup v2 (a 5 GB
> container constantly throttled because memory.max * dirty_ratio yields
> too small a threshold for bursty workloads). Jan Kara participated and
> endorsed a bytes-based per-memcg dirty limit; no patches followed that
> discussion, confirming the gap this series fills.
>
> [1] https://lwn.net/Articles/408349/
> [2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
> [3] https://lwn.net/Articles/648292/
> [4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
> [5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/
>
> Proposed interface
> ==================
>
> Two new cgroup v2 files under the memory controller, absent on the root
> cgroup (CFTYPE_NOT_ON_ROOT):
>
> memory.dirty_ratio
> Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
> the cgroup's dirtyable memory (mdtc->avail: file cache plus
> reclaimable slack), the same base the global vm_dirty_ratio scales
> against. 0 (the default) disables the per-cgroup ceiling and leaves
> the cgroup subject to the global threshold only. A non-zero value
> that is stricter than vm_dirty_ratio overrides the global ratio for
> this cgroup via min(mdtc->thresh, cg_thresh); because both sides
> scale off the same base, the knob can never widen the cgroup past
> the global ceiling. A memory.dirty_bytes companion for byte-precise
> caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
> The prototype reads the value for the immediate memcg only;
> hierarchical enforcement (clamping against ancestors, like
> memory.max) is not implemented yet. We would like guidance on
> whether this is required for v1 or can follow in a subsequent
> series.
>
> memory.dirty_min
> Byte value (K/M/G suffixes accepted), default 0 (no reservation).
> Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
> throttling is bypassed (goto free_running). Lets a latency-sensitive
> cgroup buffer a small write burst even under global dirty pressure.
>
> dirty_min is an admission guarantee, so we have to prevent it from
> breaking the global dirty invariant. Two aspects:
>
> - Global cap. The sum of dirty_min reservations across all cgroups
> must not exceed a fraction of the global dirty threshold (our
> working number is 80%), so the system always retains some shared
> capacity. The prototype does not enforce this cap yet; we expect
> to either reject at write() time or clamp on read in a cheap
> precomputed effective_dirty_min. We would appreciate feedback on
> which approach the cgroup maintainers prefer.
>
> - Per-cgroup cap. A cgroup should not be able to reserve more dirty
> capacity than it can hold. Our tentative rule is
> effective_dirty_min = min(dirty_min, memory.max - memory.min),
> evaluated at enforcement time so it tracks live memory.max changes,
> rather than rejected at write. This is similar to how memory.low
> composes with memory.max.
>
> Neither cap is implemented in the prototype; both would land before
> a non-RFC posting.
>
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is never
> throttled.
>
> Test setup and results
> ======================
>
> To show the problem and the fix, we built a single reproducer that runs
> on an unmodified stock kernel and then on the patched kernel, using the
> same setup for both.
>
> Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
> (bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
> bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
> (freerun = 24 MB). Files pre-allocated with fallocate before dirty
> pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
> fills global dirty to the 32 MB cap, then victim runs contended for
> 30 s.
>
> - noisy: single fio job, unlimited write rate, fills global dirty pool.
> - victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
> 4 KiB sequential write().
>
> The high freerun (24 MB) ensures victim's solo dirty accumulation
> (244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
> during the solo phase on either kernel, giving identical baselines.
>
> Stock kernel results (the problem):
>
> solo (no contention) contended inflation
> victim IOPS 125 5.1 24.5x worse
> victim p99 0.6 ms 152 ms 253x worse
>
> The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
> HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
> sleep. The victim has near-zero dirty pages of its own but is forced to
> sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
> NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
> queued in NR_WRITEBACK waiting for the throttled disk.
> memory.events.local on both cgroups shows max 0 throughout the run;
> this is not memory pressure inside either cgroup.
>
> Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:
>
> solo (no contention) contended inflation
> victim IOPS 125 125 1.0x
> victim p99 0.9 ms 0.7 ms 1.0x
>
> The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
> computing any sleep. Because the rate-limited victim's resident dirty
> set stays well below the reservation, the check fires on every write()
> -> goto free_running -> write() returns in ~7 us. The victim is fully
> protected.
>
> The figures above are the per-kernel medians of N=5 iterations and
> reflect a deterministic outcome on every iter: stock had cont_iops in
> 4.4..6.1 (retention 0.035..0.049) on all five iters, patched had
> cont_iops = 125.0 (retention = 1.000) on all five iters. The 5/5 stock
> iters all hit BDP's throttled regime; the 5/5 patched iters all
> bypassed it via the dirty_min check.
>
> Implementation
> ==============
>
> The patch touches five files:
>
> - include/linux/memcontrol.h: two new fields on struct mem_cgroup
> (dirty_ratio, dirty_min).
> - include/linux/writeback.h: a per-pass cg_dirty_cap field on
> struct dirty_throttle_control used to publish the memcg clamp to
> BDP's setpoint and rate-limit math.
> - include/trace/events/writeback.h: cg_dirty_cap added to the
> balance_dirty_pages tracepoint so operators can distinguish the
> memcg clamp from the global dirty_limit at runtime.
> - mm/memcontrol.c: registers the two cgroup v2 files with input
> validation.
> - mm/page-writeback.c: the throttling changes.
>
> The key changes in page-writeback.c:
>
> - The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
> on wb->memcg_css (the inode owner's memcg). Every consumer of
> the memcg dtc -- writer throttle, flusher kworker,
> cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
> The clamp uses mult_frac() so a small memcg does not collapse
> to a zero ceiling.
>
> - The dirty_min bypass lives in balance_dirty_pages() and is
> writer-keyed: the writing task's memcg is looked up under RCU,
> and when its dirty+in-flight backlog is below dirty_min the
> loop jumps to free_running, bypassing both the global and the
> per-memcg BDP gates. dirty_min is an admission guarantee for
> the writer's own cgroup, not for inode owners.
>
> - When the clamp engages, mdtc->dirty is replaced with the
> memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
> setpoint / rate-limit smoothing see the real backlog and pages
> migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
> widen the cap.
>
> Fast-path cost when neither knob is set: one rcu_read_lock/unlock
> pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
> iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
> call. The memcg counter reads are gated on "knob armed" and do
> not fire on the default path. We have not measured the added
> cost yet, but we expect it to be in the noise of existing BDP
> bookkeeping. A tight pwrite() microbenchmark will confirm this
> before a non-RFC posting.
>
> Scope
> =====
>
> This series affects the writer-side throttle (balance_dirty_pages())
> only. It does not partition the flusher-side writeback queue. A
> cgroup's fsync() can still block behind inodes from other cgroups in
> writeback_sb_inodes(). We document this limit explicitly and expect
> writeback-queue partitioning to be a separate, larger effort.
>
> Interaction with block-layer throttles
> ======================================
>
> The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
> runs before the bio reaches the block layer, so dirty_min simply allows
> a cgroup to keep accepting write() syscalls up to its reservation; the
> actual I/O is still subject to whatever block throttle is in effect. In
> the reproducer above, the disk-level bandwidth limit (256 KB/s) is
> applied at the QEMU virtio-blk layer, and the protected victim dirties
> roughly equal to the rate it can drain, so the block throttle is
> exercised on both kernels. We have not yet tested interaction with
> guest-side io.max settings; this is on the list before a non-RFC
> posting.
>
> Questions for maintainers
> =========================
>
> 1. Is writer-throttle-only scope (no flusher/writeback-queue work)
> acceptable for the first series?
> 2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
> has it) or on the memcg's wb_domain? Routing it through wb_domain
> would let us reuse __wb_calc_thresh() and keep all threshold
> policy in one place; a possible split is dirty_ratio on wb_domain
> (it is threshold policy), dirty_min on the memcg (throttle
> bypass). We can go either way.
> 3. For dirty_min safety caps (per-cgroup and global sum), which
> approach do you prefer: reject at write time, or clamp on read in
> an effective_dirty_min?
> 4. Is hierarchical enforcement of dirty_ratio (clamp against
> ancestors) required for v1, or can it follow in a subsequent
> series?
>
> What's missing before a non-RFC posting
> =======================================
>
> - Split the monolithic prototype into a proper series (one patch per
> concept + Documentation + selftest).
> - Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
> - tools/testing/selftests/cgroup/ test for interface surface and
> noisy-neighbor protection.
> - Implement the per-cgroup and global dirty_min safety caps described
> in the memory.dirty_min bullet.
> - Fast-path microbenchmark: confirm zero measurable regression for
> cgroups that have neither knob set.
> - Larger-N validation on real hardware (the current N=5 data is from
> a QEMU guest on a throttled virtio-blk).
>
> Follow-ups (out of scope for this series)
> =========================================
>
> - memory.dirty_weight: a priority weight knob that scales the BDP
> pause length, planned as a separate series. The prototype validated
> the interface surface but the application site (post-pause scaling
> vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
> before we ship it. Happy to discuss in advance of that posting.
> - memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
> mirroring the global vm_dirty_bytes. For operators who want a
> byte-predictable per-cgroup dirty cap rather than a ratio of the
> cgroup's dirtyable memory. We have not prototyped this yet; we
> are listing it so reviewers know it is on the roadmap, since
> the ratio-only interface omits that use case.
> - Writeback-queue partitioning: flusher-side fairness across
> cgroups, as noted in Scope above.
>
> Looking forward to feedback.
>
> Thanks,
> Alireza Haghdoost <haghdoost@uber.com>
> Kshitij Doshi <kshitijd@uber.com>
> ---
> include/linux/memcontrol.h | 10 +++
> include/linux/writeback.h | 4 +
> include/trace/events/writeback.h | 5 +-
> mm/memcontrol.c | 62 ++++++++++++++
> mm/page-writeback.c | 179 ++++++++++++++++++++++++++++++++++++---
> 5 files changed, 249 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dc3fa687759b..45ca949a4c68 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -323,6 +323,16 @@ struct mem_cgroup {
> spinlock_t event_list_lock;
> #endif /* CONFIG_MEMCG_V1 */
>
> + /* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
> + /*
> + * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
> + * matching the global vm_dirty_ratio base; 0 inherits the global
> + * threshold.
> + * dirty_min: dirty-page reservation, in pages; 0 disables the bypass.
> + */
> + unsigned int dirty_ratio;
> + unsigned long dirty_min;
> +
> struct mem_cgroup_per_node *nodeinfo[];
> };
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 62552a2ce5b9..e37632f728be 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -318,6 +318,10 @@ struct dirty_throttle_control {
> unsigned long thresh; /* dirty threshold */
> unsigned long bg_thresh; /* dirty background threshold */
> unsigned long limit; /* hard dirty limit */
> + unsigned long cg_dirty_cap; /* per-memcg dirty_ratio clamp for
> + * this pass, or PAGE_COUNTER_MAX
> + * when no memcg clamp applies
> + */
>
> unsigned long wb_dirty; /* per-wb counterparts */
> unsigned long wb_thresh;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..0bf86b3c903c 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
> __array( char, bdi, 32)
> __field(u64, cgroup_ino)
> __field(unsigned long, limit)
> + __field(unsigned long, cg_dirty_cap)
> __field(unsigned long, setpoint)
> __field(unsigned long, dirty)
> __field(unsigned long, wb_setpoint)
> @@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
> strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
>
> __entry->limit = dtc->limit;
> + __entry->cg_dirty_cap = dtc->cg_dirty_cap;
> __entry->setpoint = (dtc->limit + freerun) / 2;
> __entry->dirty = dtc->dirty;
> __entry->wb_setpoint = __entry->setpoint *
> @@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
>
>
> TP_printk("bdi %s: "
> - "limit=%lu setpoint=%lu dirty=%lu "
> + "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
> "wb_setpoint=%lu wb_dirty=%lu "
> "dirty_ratelimit=%lu task_ratelimit=%lu "
> "dirtied=%u dirtied_pause=%u "
> "paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
> __entry->bdi,
> __entry->limit,
> + __entry->cg_dirty_cap,
> __entry->setpoint,
> __entry->dirty,
> __entry->wb_setpoint,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c3d98ab41f1f..c43fe4f394eb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static int memory_dirty_ratio_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
> + return 0;
> +}
> +
> +static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned int ratio;
> + int err;
> +
> + err = kstrtouint(strstrip(buf), 0, &ratio);
> + if (err)
> + return err;
> +
> + if (ratio > 100)
> + return -EINVAL;
> +
> + WRITE_ONCE(memcg->dirty_ratio, ratio);
> + return nbytes;
> +}
> +
> +static int memory_dirty_min_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + /* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
> + return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
> +}
> +
> +static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long dirty_min;
> + int err;
> +
> + /* page_counter_memparse converts strings like "512M" into a page count */
> + err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
> + if (err)
> + return err;
> +
> + WRITE_ONCE(memcg->dirty_min, dirty_min);
> + return nbytes;
> +}
> +
> /*
> * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
> * if any new events become available.
> @@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
> .flags = CFTYPE_NS_DELEGATABLE,
> .write = memory_reclaim,
> },
> + {
> + .name = "dirty_ratio",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_dirty_ratio_show,
> + .write = memory_dirty_ratio_write,
> + },
> + {
> + .name = "dirty_min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_dirty_min_show,
> + .write = memory_dirty_min_write,
> + },
> { } /* terminate */
> };
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 88cd53d4ba09..2847b2c1e59a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
>
> #define GDTC_INIT(__wb) .wb = (__wb), \
> .dom = &global_wb_domain, \
> - .wb_completions = &(__wb)->completions
> + .wb_completions = &(__wb)->completions, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> -#define GDTC_INIT_NO_WB .dom = &global_wb_domain
> +#define GDTC_INIT_NO_WB .dom = &global_wb_domain, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> #define MDTC_INIT(__wb, __gdtc) .wb = (__wb), \
> .dom = mem_cgroup_wb_domain(__wb), \
> .wb_completions = &(__wb)->memcg_completions, \
> - .gdtc = __gdtc
> + .gdtc = __gdtc, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> static bool mdtc_valid(struct dirty_throttle_control *dtc)
> {
> @@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
> #else /* CONFIG_CGROUP_WRITEBACK */
>
> #define GDTC_INIT(__wb) .wb = (__wb), \
> - .wb_completions = &(__wb)->completions
> -#define GDTC_INIT_NO_WB
> + .wb_completions = &(__wb)->completions, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
> +#define GDTC_INIT_NO_WB .cg_dirty_cap = PAGE_COUNTER_MAX
> #define MDTC_INIT(__wb, __gdtc)
>
> static bool mdtc_valid(struct dirty_throttle_control *dtc)
> @@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
> bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
> thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
> }
> +
> + /*
> + * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
> + * iff @dtc is a memcg dtc). dirty_ratio is scaled against
> + * the memcg's own dirtyable memory (@available_memory), matching
> + * the semantics of vm_dirty_ratio so the two knobs share a base
> + * and compose via a plain min() on thresh. The clamp is keyed
> + * on wb->memcg_css (the inode-owner's memcg) rather than on
> + * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
> + * (flusher kworker context), and cgwb_calc_thresh() all see the
> + * same clamped value.
> + *
> + * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
> + * callers in balance_dirty_pages() can ignore the slower
> + * dom->dirty_limit smoothing when deriving setpoint/
> + * rate-limit from the clamped ceiling.
> + *
> + * Clamp is applied after the rt/dl boost: dirty_ratio is a
> + * strict override, not widened by priority. bg_thresh is
> + * scaled by the same factor we apply to thresh so the
> + * user-configured bg/thresh ratio survives clamping instead
> + * of snapping to thresh/2 via the bg_thresh >= thresh guard
> + * below. mult_frac() preserves precision for small memcgs
> + * where a plain "(avail / 100) * ratio" would collapse to 0.
> + */
> + if (gdtc) {
> + struct mem_cgroup *memcg =
> + mem_cgroup_from_css(dtc->wb->memcg_css);
> + unsigned int cg_ratio = memcg ?
> + READ_ONCE(memcg->dirty_ratio) : 0;
> +
> + /*
> + * dtc is reused across balance_dirty_pages() iterations,
> + * so reset the published clamp every call -- an admin
> + * clearing memory.dirty_ratio mid-flight must take effect
> + * on the next pass.
> + */
> + dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
> +
> + if (cg_ratio) {
> + unsigned long cg_thresh = mult_frac(available_memory,
> + cg_ratio, 100);
> +
> + if (cg_thresh < thresh) {
> + bg_thresh = mult_frac(bg_thresh, cg_thresh,
> + thresh);
> + thresh = cg_thresh;
> + dtc->cg_dirty_cap = cg_thresh;
> + }
> + }
> + }
> +
> /*
> * Dirty throttling logic assumes the limits in page units fit into
> * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
> @@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
> struct bdi_writeback *wb = dtc->wb;
> unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
> unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> - unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> + unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
> + dtc->thresh),
> + dtc->cg_dirty_cap);
> unsigned long wb_thresh = dtc->wb_thresh;
> unsigned long x_intercept;
> unsigned long setpoint; /* dirty pages' target balance point */
> @@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
> struct bdi_writeback *wb = dtc->wb;
> unsigned long dirty = dtc->dirty;
> unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> - unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> + unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
> + dtc->cg_dirty_cap);
> unsigned long setpoint = (freerun + limit) / 2;
> unsigned long write_bw = wb->avg_write_bandwidth;
> unsigned long dirty_ratelimit = wb->dirty_ratelimit;
> @@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
> int ret = 0;
>
> for (;;) {
> + unsigned long cg_dirty_min = 0;
> + unsigned long cg_dirty_pages = 0;
> unsigned long now = jiffies;
>
> nr_dirty = global_node_page_state(NR_FILE_DIRTY);
>
> balance_domain_limits(gdtc, strictlimit);
> +
> + /*
> + * Under RCU, snapshot the current memcg's memory.dirty_min
> + * reservation. When it is non-zero, also snapshot the
> + * memcg-wide dirty backlog. These feed the per-writer
> + * dirty_min bypass below; the dirty_ratio clamp itself
> + * is applied inside domain_dirty_limits() keyed on
> + * wb->memcg_css so balance_dirty_pages(),
> + * wb_over_bg_thresh() (flusher kworker context), and
> + * cgwb_calc_thresh() all see a consistent clamped
> + * threshold.
> + *
> + * rcu_read_lock() is held only for the __rcu dereference
> + * of current->cgroups; the memcg pointer does not escape
> + * the critical section. The counter read matches
> + * domain_dirty_avail(mdtc, true) so the bypass compares
> + * the same dirty+in-flight backlog the global path uses.
> + */
> + rcu_read_lock();
> + {
> + struct mem_cgroup *memcg =
> + mem_cgroup_from_task(current);
> +
> + if (memcg) {
> + cg_dirty_min = READ_ONCE(memcg->dirty_min);
> + if (cg_dirty_min)
> + cg_dirty_pages =
> + memcg_page_state(memcg,
> + NR_FILE_DIRTY) +
> + memcg_page_state(memcg,
> + NR_WRITEBACK);
> + }
> + }
> + rcu_read_unlock();
> +
> if (mdtc) {
> /*
> - * If @wb belongs to !root memcg, repeat the same
> - * basic calculations for the memcg domain.
> + * For !root memcg, repeat the same three-step
> + * sequence as balance_domain_limits(gdtc):
> + * avail -> limits -> freerun. We inline it here
> + * so we can insert the mdtc->dirty override
> + * between step 2 (domain_dirty_limits, which
> + * publishes the per-memcg dirty_ratio clamp on
> + * cg_dirty_cap) and step 3 (domain_dirty_freerun,
> + * which consumes mdtc->dirty along with
> + * thresh/bg_thresh).
> + */
> + domain_dirty_avail(mdtc, true);
> + domain_dirty_limits(mdtc);
> +
> + /*
> + * When the dirty_ratio clamp engaged, replace the
> + * per-wb dirty count from mem_cgroup_wb_stats()
> + * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
> + * sum so freerun, the setpoint, and the rate-limit
> + * smoothing see the true memcg backlog instead of
> + * the subset that has migrated to this cgwb (cgwb
> + * migration is lazy and can lag by many seconds),
> + * and so a burst of buffered writes cannot silently
> + * bypass the clamp by shifting pages from
> + * NR_FILE_DIRTY into NR_WRITEBACK.
> + *
> + * Keyed on wb->memcg_css to match the clamp itself.
> + * The cgwb holds a css reference, so the memcg
> + * pointer is stable without additional locking.
> + *
> + * Caveat: memcg_page_state() aggregates across ALL
> + * backing devices owned by this memcg, while mdtc
> + * is scoped to one wb. A writer to a fast BDI may
> + * observe backlog accumulated on slow BDIs in the
> + * same memcg and throttle more than strictly needed.
> + * Accepted for v1; the alternative (summing per-wb
> + * dirty across the memcg's cgwbs) walks the cgwb
> + * list under RCU on a hot path.
> */
> - balance_domain_limits(mdtc, strictlimit);
> + if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
> + struct mem_cgroup *wb_memcg =
> + mem_cgroup_from_css(mdtc->wb->memcg_css);
> +
> + if (wb_memcg)
> + mdtc->dirty =
> + memcg_page_state(wb_memcg,
> + NR_FILE_DIRTY) +
> + memcg_page_state(wb_memcg,
> + NR_WRITEBACK);
> + }
> +
> + domain_dirty_freerun(mdtc, strictlimit);
> }
>
> if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
> wb_start_background_writeback(wb);
>
> + /*
> + * dirty_min bypass: when the current memcg's dirty+in-flight
> + * backlog is below its memory.dirty_min reservation, let the
> + * writer proceed without throttling. This check must live
> + * outside the if (mdtc) block because a writer's file may not
> + * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
> + * and the per-memcg block above is skipped entirely.
> + *
> + * cg_dirty_min and cg_dirty_pages come from the per-iteration
> + * snapshot taken above under rcu_read_lock; both are stored
> + * in pages (page_counter_memparse converts bytes -> pages for
> + * dirty_min), so no unit conversion is needed.
> + */
> + if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
> + goto free_running;
> +
> /*
> * If memcg domain is in effect, @dirty should be under
> * both global and memcg freerun ceilings.
>
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a
>
> Best regards,
> --
> Alireza Haghdoost <haghdoost@uber.com>
>
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
2026-05-06 14:21 ` Jan Kara
@ 2026-05-14 4:10 ` Alireza Haghdoost
0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost @ 2026-05-14 4:10 UTC (permalink / raw)
To: Jan Kara
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), cgroups,
linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi
On Wed 06-05-26 16:21:00, Jan Kara wrote:
> Things like motivation actually belong to the changelog itself, measured
> results how the patch helps as well. On the other hand stuff like history
> is largely irrelevant here, frankly I don't have a bandwidth to carefully
> read the huge amount of text LLM has generated below so please try to make
> it more concise for next time.
Understood. Will trim for the non-RFC posting; apologies for the
volume.
> ... I quite don't see how a multisecond stalls you are describing would
> happen [...] If we are below freerun in the memcg, the task dirtying
> folios from that memcg shouldn't be throttled at all, once we get above
> freerun we throttle by maximum of throttling delay decided from global
> and memcg situation [...]
The stall is reachable even while the victim's memcg sits well below
its own freerun ceiling. The freerun shortcut in
balance_dirty_pages() is an AND, not an OR:

    if (gdtc->freerun && (!mdtc || mdtc->freerun))
            goto free_running;
Once gdtc is over freerun (because the noisy neighbour pushed it
there), the shortcut does not fire even when mdtc->freerun is true.
After the shortcut fails, the per-task pause is computed from the
dtc with the smaller pos_ratio:

    if (mdtc->pos_ratio < gdtc->pos_ratio)
            sdtc = mdtc;
When global is the worse domain, the victim sleeps against global
state, not memcg state.
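To make the interaction concrete, here is a small userspace model of
those two checks. Purely illustrative, not kernel code: the struct
and helper names (dtc_model, pick_sleep_domain) are made up for the
sketch, and pos_ratio is reduced to "lower means the worse domain".

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct dtc_model {
            bool freerun;            /* below (thresh + bg_thresh) / 2 */
            unsigned long pos_ratio; /* lower means the worse domain   */
    };

    static const struct dtc_model *
    pick_sleep_domain(const struct dtc_model *gdtc,
                      const struct dtc_model *mdtc)
    {
            /* The shortcut is an AND: both domains must be in freerun. */
            if (gdtc->freerun && (!mdtc || mdtc->freerun))
                    return NULL;            /* free running, no sleep */

            /* Otherwise pace the writer against the worse domain. */
            if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
                    return mdtc;
            return gdtc;
    }

    int main(void)
    {
            /* Noisy neighbour pushed global over freerun; victim idle. */
            struct dtc_model gdtc = { .freerun = false, .pos_ratio = 100 };
            struct dtc_model mdtc = { .freerun = true,  .pos_ratio = 900 };
            const struct dtc_model *sdtc = pick_sleep_domain(&gdtc, &mdtc);

            printf("victim paced by the %s domain\n",
                   sdtc == &mdtc ? "memcg" : "global");
            return 0;
    }

Run it and the victim comes out paced by the global domain even though
its own memcg is entirely in freerun, which is exactly the situation
in the reproducer below.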
> So can you perhaps share more details about the configuration where
> you observe these delays to innocent tasks due to another task
> dirtying a lot of memory? How many page cache in total and dirty
> pages are there in each memcg [...]? Is the delayed task really
> throttled in balance_dirty_pages()?
Yes. I re-ran the reproducer: stock 7.0-rc5, ext4 on virtio-blk
throttled to 256 KB/s, dirty_bytes=32M, dirty_background_bytes=16M
(so the freerun ceiling is (32M + 16M) / 2 = 24 MB), noisy = a single
fio job doing unlimited buffered randwrite, victim = a single fio job
doing 4 KiB sequential writes rate-limited to 500 KB/s.
Per-memcg snapshot during the contended phase, ~10 s into the run:
                      noisy memcg   victim memcg   global
    memory.current        47 MB         21 MB          --
    file (cache)          38 MB         14 MB          --
    file_dirty            26 MB        1.7 MB       27 MB
    file_writeback       1.5 MB        4.0 MB      5.3 MB
The victim memcg holds 1.7 MB dirty, far below any reasonable
per-memcg freerun. Global dirty (NR_FILE_DIRTY + NR_WRITEBACK,
~32 MB) is over the 24 MB freerun ceiling, driven overwhelmingly by
the noisy cgroup.
The victim writer (fio with psync) is in fact sleeping in
balance_dirty_pages(). One stack snapshot during a stall:
[<0>] balance_dirty_pages+0x5c5/0xac0
[<0>] balance_dirty_pages_ratelimited_flags+0x2a1/0x380
[<0>] generic_perform_write+0x194/0x280
[<0>] ext4_buffered_write_iter+0x63/0x110
[<0>] vfs_write+0x28d/0x450
[<0>] __x64_sys_pwrite64+0x8c/0xc0
[<0>] do_syscall_64+0xfa/0x520
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Sampling /proc/<pid>/wchan at 100 Hz across the contended phase
yields the histogram:
    104  balance_dirty_pages
     88  hrtimer_nanosleep   (the fio rate-limit sleep between writes)
     12  RUNNING
      4  p9_client_rpc       (virtfs, host-guest filesystem RPC)
      3  d_alloc_parallel
The vast majority of non-rate-limit samples have the writer parked
in balance_dirty_pages(). The victim's per-IO completion latency
(fio clat) in this run reaches a 3 s tail (worst single 4 KiB pwrite
blocked for ~3.0 s) while its own memcg holds < 2 MB dirty.
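For reference, the sampler was essentially the loop below. A minimal
sketch, not the exact tool: error handling is trimmed and the
histogram aggregation was a separate counting pass over its output.

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            char path[64], wchan[128];

            if (argc != 2)
                    return 1;
            snprintf(path, sizeof(path), "/proc/%s/wchan", argv[1]);

            for (;;) {
                    FILE *f = fopen(path, "r");

                    if (!f)
                            break;          /* task exited */
                    if (fgets(wchan, sizeof(wchan), f))
                            puts(wchan[0] == '0' ? "RUNNING" : wchan);
                    fclose(f);
                    usleep(10 * 1000);      /* ~100 Hz */
            }
            return 0;
    }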
I'm happy to share the full traces and the reproducer if useful.
Thanks for the review,
Alireza