All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
@ 2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
  0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost @ 2026-05-01 22:28 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara
  Cc: cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi,
	Alireza Haghdoost

Add two cgroup v2 memory-controller knobs that bring
balance_dirty_pages() throttling into per-cgroup scope so one noisy
writer cannot stall peers sharing the same host:

  memory.dirty_ratio  Per-cgroup dirty-page ceiling, 0..100 percent of
                      the cgroup's dirtyable memory.  0 (default) leaves
                      the cgroup subject to the global threshold only.

  memory.dirty_min    Guaranteed dirty-page floor, byte value (default 0).

The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is
never throttled.

Motivation, design trade-offs, cost analysis, validation data, and
open questions are in the cover letter.

Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
Assisted-by: Cursor:claude-sonnet-4.5
---
Hi all,

This RFC adds two cgroup v2 memory-controller knobs that give operators
per-cgroup control over dirty-page throttling in balance_dirty_pages():
memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
once we have validated the application site (see "Follow-ups" below).
We are posting this as an RFC, as a single squashed patch, to get design
feedback before splitting the prototype into a per-logical-change series.

Motivation
==========

balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
once the host-wide dirty count crosses a single threshold. On a
container host that threshold is shared across cgroups. A cgroup
that dirties pages faster than storage can drain them pushes the
count over the limit. Every writer on the host then parks in
io_schedule_timeout() -- including cgroups that have not dirtied a
single page of their own.

cgroup v2 already has per-memcg dirty accounting, but that accounting
does not translate into per-memcg dirty throttling.

We see this in production: a buffered write-heavy container generates
multi-second stalls for co-located latency-sensitive workloads.
Moreover, dirty-page accumulation from a single noisy neighbor is a
recurring contributor to mount-responsiveness degradation on shared hosts.

Prior work
==========

Per-memcg dirty-page limits have been proposed before. Andrea Righi
posted an initial RFC in February 2010 [1]; Greg Thelen continued the
work through v9 in August 2011 [2]. That series added per-memcg dirty
counters and hooked them into balance_dirty_pages(), but it bolted
per-cgroup limits onto the global writeback path without making
writeback itself cgroup-aware. Without cgroup-aware flusher threads,
a cgroup exceeding its limit triggered writeback of inodes from any
cgroup, giving poor isolation. The series was not merged.

Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
dirty-set controller" in January 2015 [4], which proposed
memory.dirty_ratio (the same interface name this series uses) via an
inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
and rejected it as a "dead end" that duplicated lower-layer policy
without solving the underlying isolation problem; this rejection
directly motivated Tejun's native cgwb rework described below.

Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
(commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
different approach of restructuring writeback to be natively
cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
threads [3]. That work deliberately deferred user-facing policy knobs.
This series adds the policy surface that consumes Tejun's
infrastructure. The dirty_min reservation concept is, to our
knowledge, new.

A November 2023 LKML thread by Chengming Zhou [5] independently
identified the identical throttling regression on cgroup v2 (a 5 GB
container constantly throttled because memory.max * dirty_ratio yields
too small a threshold for bursty workloads). Jan Kara participated and
endorsed a bytes-based per-memcg dirty limit; no patches followed that
discussion, confirming the gap this series fills.

[1] https://lwn.net/Articles/408349/
[2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
[3] https://lwn.net/Articles/648292/
[4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
[5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/

Proposed interface
==================

Two new cgroup v2 files under the memory controller, absent on the root
cgroup (CFTYPE_NOT_ON_ROOT):

memory.dirty_ratio
  Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
  the cgroup's dirtyable memory (mdtc->avail: file cache plus
  reclaimable slack), the same base the global vm_dirty_ratio scales
  against. 0 (the default) disables the per-cgroup ceiling and leaves
  the cgroup subject to the global threshold only. A non-zero value
  that is stricter than vm_dirty_ratio overrides the global ratio for
  this cgroup via min(mdtc->thresh, cg_thresh); because both sides
  scale off the same base, the knob can never widen the cgroup past
  the global ceiling. A memory.dirty_bytes companion for byte-precise
  caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
  The prototype reads the value for the immediate memcg only;
  hierarchical enforcement (clamping against ancestors, like
  memory.max) is not implemented yet. We would like guidance on
  whether this is required for v1 or can follow in a subsequent
  series.

memory.dirty_min
  Byte value (K/M/G suffixes accepted), default 0 (no reservation).
  Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
  throttling is bypassed (goto free_running). Lets a latency-sensitive
  cgroup buffer a small write burst even under global dirty pressure.

  dirty_min is an admission guarantee, so we have to prevent it from
  breaking the global dirty invariant. Two aspects:

   - Global cap. The sum of dirty_min reservations across all cgroups
     must not exceed a fraction of the global dirty threshold (our
     working number is 80%), so the system always retains some shared
     capacity. The prototype does not enforce this cap yet; we expect
     to either reject at write() time or clamp on read in a cheap
     precomputed effective_dirty_min. We would appreciate feedback on
     which approach the cgroup maintainers prefer.

   - Per-cgroup cap. A cgroup should not be able to reserve more dirty
     capacity than it can hold. Our tentative rule is
     effective_dirty_min = min(dirty_min, memory.max - memory.min),
     evaluated at enforcement time so it tracks live memory.max changes,
     rather than rejected at write. This is similar to how memory.low
     composes with memory.max.

  Neither cap is implemented in the prototype; both would land before
  a non-RFC posting.

The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is never
throttled.

Test setup and results
======================

To show the problem and the fix, we built a single reproducer that runs
on an unmodified stock kernel and then on the patched kernel, using the
same setup for both.

Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
(bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
(freerun = 24 MB). Files pre-allocated with fallocate before dirty
pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
fills global dirty to the 32 MB cap, then victim runs contended for
30 s.

  - noisy:  single fio job, unlimited write rate, fills global dirty pool.
  - victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
            4 KiB sequential write().

The high freerun (24 MB) ensures victim's solo dirty accumulation
(244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
during the solo phase on either kernel, giving identical baselines.

Stock kernel results (the problem):

                       solo (no contention)  contended  inflation
  victim IOPS          125                   5.1        24.5x worse
  victim p99           0.6 ms                152 ms     253x worse

The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
sleep. The victim has near-zero dirty pages of its own but is forced to
sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
queued in NR_WRITEBACK waiting for the throttled disk.
memory.events.local on both cgroups shows max 0 throughout the run;
this is not memory pressure inside either cgroup.

Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:

                       solo (no contention)  contended  inflation
  victim IOPS          125                   125        1.0x
  victim p99           0.9 ms                0.7 ms     1.0x

The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
computing any sleep. Because the rate-limited victim's resident dirty
set stays well below the reservation, the check fires on every write()
-> goto free_running -> write() returns in ~7 us. The victim is fully
protected.

The figures above are the per-kernel medians of N=5 iterations and
reflect a deterministic outcome on every iter: stock had cont_iops in
4.4..6.1 (retention 0.035..0.049) on all five iters, patched had
cont_iops = 125.0 (retention = 1.000) on all five iters. The 5/5 stock
iters all hit BDP's throttled regime; the 5/5 patched iters all
bypassed it via the dirty_min check.

Implementation
==============

The patch touches five files:

  - include/linux/memcontrol.h: two new fields on struct mem_cgroup
    (dirty_ratio, dirty_min).
  - include/linux/writeback.h: a per-pass cg_dirty_cap field on
    struct dirty_throttle_control used to publish the memcg clamp to
    BDP's setpoint and rate-limit math.
  - include/trace/events/writeback.h: cg_dirty_cap added to the
    balance_dirty_pages tracepoint so operators can distinguish the
    memcg clamp from the global dirty_limit at runtime.
  - mm/memcontrol.c: registers the two cgroup v2 files with input
    validation.
  - mm/page-writeback.c: the throttling changes.

The key changes in page-writeback.c:

  - The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
    on wb->memcg_css (the inode owner's memcg).  Every consumer of
    the memcg dtc -- writer throttle, flusher kworker,
    cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
    The clamp uses mult_frac() so a small memcg does not collapse
    to a zero ceiling.

  - The dirty_min bypass lives in balance_dirty_pages() and is
    writer-keyed: the writing task's memcg is looked up under RCU,
    and when its dirty+in-flight backlog is below dirty_min the
    loop jumps to free_running, bypassing both the global and the
    per-memcg BDP gates.  dirty_min is an admission guarantee for
    the writer's own cgroup, not for inode owners.

  - When the clamp engages, mdtc->dirty is replaced with the
    memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
    setpoint / rate-limit smoothing see the real backlog and pages
    migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
    widen the cap.

Fast-path cost when neither knob is set: one rcu_read_lock/unlock
pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
call.  The memcg counter reads are gated on "knob armed" and do
not fire on the default path.  We have not measured the added
cost yet, but we expect it to be in the noise of existing BDP
bookkeeping.  A tight pwrite() microbenchmark will confirm this
before a non-RFC posting.

Scope
=====

This series affects the writer-side throttle (balance_dirty_pages())
only. It does not partition the flusher-side writeback queue. A
cgroup's fsync() can still block behind inodes from other cgroups in
writeback_sb_inodes(). We document this limit explicitly and expect
writeback-queue partitioning to be a separate, larger effort.

Interaction with block-layer throttles
======================================

The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
runs before the bio reaches the block layer, so dirty_min simply allows
a cgroup to keep accepting write() syscalls up to its reservation; the
actual I/O is still subject to whatever block throttle is in effect. In
the reproducer above, the disk-level bandwidth limit (256 KB/s) is
applied at the QEMU virtio-blk layer, and the protected victim dirties
roughly equal to the rate it can drain, so the block throttle is
exercised on both kernels. We have not yet tested interaction with
guest-side io.max settings; this is on the list before a non-RFC
posting.

Questions for maintainers
=========================

 1. Is writer-throttle-only scope (no flusher/writeback-queue work)
    acceptable for the first series?
 2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
    has it) or on the memcg's wb_domain? Routing it through wb_domain
    would let us reuse __wb_calc_thresh() and keep all threshold
    policy in one place; a possible split is dirty_ratio on wb_domain
    (it is threshold policy), dirty_min on the memcg (throttle
    bypass). We can go either way.
 3. For dirty_min safety caps (per-cgroup and global sum), which
    approach do you prefer: reject at write time, or clamp on read in
    an effective_dirty_min?
 4. Is hierarchical enforcement of dirty_ratio (clamp against
    ancestors) required for v1, or can it follow in a subsequent
    series?

What's missing before a non-RFC posting
=======================================

  - Split the monolithic prototype into a proper series (one patch per
    concept + Documentation + selftest).
  - Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
  - tools/testing/selftests/cgroup/ test for interface surface and
    noisy-neighbor protection.
  - Implement the per-cgroup and global dirty_min safety caps described
    in the memory.dirty_min bullet.
  - Fast-path microbenchmark: confirm zero measurable regression for
    cgroups that have neither knob set.
  - Larger-N validation on real hardware (the current N=5 data is from
    a QEMU guest on a throttled virtio-blk).

Follow-ups (out of scope for this series)
=========================================

  - memory.dirty_weight: a priority weight knob that scales the BDP
    pause length, planned as a separate series. The prototype validated
    the interface surface but the application site (post-pause scaling
    vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
    before we ship it. Happy to discuss in advance of that posting.
  - memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
    mirroring the global vm_dirty_bytes. For operators who want a
    byte-predictable per-cgroup dirty cap rather than a ratio of the
    cgroup's dirtyable memory. We have not prototyped this yet; we
    are listing it so reviewers know it is on the roadmap, since
    the ratio-only interface omits that use case.
  - Writeback-queue partitioning: flusher-side fairness across
    cgroups, as noted in Scope above.

Looking forward to feedback.

Thanks,
Alireza Haghdoost <haghdoost@uber.com>
Kshitij Doshi <kshitijd@uber.com>
---
 include/linux/memcontrol.h       |  10 +++
 include/linux/writeback.h        |   4 +
 include/trace/events/writeback.h |   5 +-
 mm/memcontrol.c                  |  62 ++++++++++++++
 mm/page-writeback.c              | 179 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 249 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..45ca949a4c68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,16 @@ struct mem_cgroup {
 	spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
 
+	/* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
+	/*
+	 * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
+	 *   matching the global vm_dirty_ratio base; 0 inherits the global
+	 *   threshold.
+	 * dirty_min:   dirty-page reservation, in pages; 0 disables the bypass.
+	 */
+	unsigned int dirty_ratio;
+	unsigned long dirty_min;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 62552a2ce5b9..e37632f728be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -318,6 +318,10 @@ struct dirty_throttle_control {
 	unsigned long		thresh;		/* dirty threshold */
 	unsigned long		bg_thresh;	/* dirty background threshold */
 	unsigned long		limit;		/* hard dirty limit */
+	unsigned long		cg_dirty_cap;	/* per-memcg dirty_ratio clamp for
+						 * this pass, or PAGE_COUNTER_MAX
+						 * when no memcg clamp applies
+						 */
 
 	unsigned long		wb_dirty;	/* per-wb counterparts */
 	unsigned long		wb_thresh;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..0bf86b3c903c 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
 		__array(	 char,	bdi, 32)
 		__field(u64,		cgroup_ino)
 		__field(unsigned long,	limit)
+		__field(unsigned long,	cg_dirty_cap)
 		__field(unsigned long,	setpoint)
 		__field(unsigned long,	dirty)
 		__field(unsigned long,	wb_setpoint)
@@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
 		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 
 		__entry->limit		= dtc->limit;
+		__entry->cg_dirty_cap	= dtc->cg_dirty_cap;
 		__entry->setpoint	= (dtc->limit + freerun) / 2;
 		__entry->dirty		= dtc->dirty;
 		__entry->wb_setpoint	= __entry->setpoint *
@@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
 
 
 	TP_printk("bdi %s: "
-		  "limit=%lu setpoint=%lu dirty=%lu "
+		  "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
 		  "wb_setpoint=%lu wb_dirty=%lu "
 		  "dirty_ratelimit=%lu task_ratelimit=%lu "
 		  "dirtied=%u dirtied_pause=%u "
 		  "paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
 		  __entry->bdi,
 		  __entry->limit,
+		  __entry->cg_dirty_cap,
 		  __entry->setpoint,
 		  __entry->dirty,
 		  __entry->wb_setpoint,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..c43fe4f394eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_dirty_ratio_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
+	return 0;
+}
+
+static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int ratio;
+	int err;
+
+	err = kstrtouint(strstrip(buf), 0, &ratio);
+	if (err)
+		return err;
+
+	if (ratio > 100)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->dirty_ratio, ratio);
+	return nbytes;
+}
+
+static int memory_dirty_min_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	/* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
+	return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
+}
+
+static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long dirty_min;
+	int err;
+
+	/* page_counter_memparse converts strings like "512M" into a page count */
+	err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
+	if (err)
+		return err;
+
+	WRITE_ONCE(memcg->dirty_min, dirty_min);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "dirty_ratio",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_dirty_ratio_show,
+		.write = memory_dirty_ratio_write,
+	},
+	{
+		.name = "dirty_min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_dirty_min_show,
+		.write = memory_dirty_min_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..2847b2c1e59a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
 
 #define GDTC_INIT(__wb)		.wb = (__wb),				\
 				.dom = &global_wb_domain,		\
-				.wb_completions = &(__wb)->completions
+				.wb_completions = &(__wb)->completions,	\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
-#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
+#define GDTC_INIT_NO_WB		.dom = &global_wb_domain,		\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
 #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
 				.dom = mem_cgroup_wb_domain(__wb),	\
 				.wb_completions = &(__wb)->memcg_completions, \
-				.gdtc = __gdtc
+				.gdtc = __gdtc,				\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
 {
@@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 #define GDTC_INIT(__wb)		.wb = (__wb),                           \
-				.wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
+				.wb_completions = &(__wb)->completions,	\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
+#define GDTC_INIT_NO_WB		.cg_dirty_cap = PAGE_COUNTER_MAX
 #define MDTC_INIT(__wb, __gdtc)
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
@@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
 		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
 	}
+
+	/*
+	 * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
+	 * iff @dtc is a memcg dtc).  dirty_ratio is scaled against
+	 * the memcg's own dirtyable memory (@available_memory), matching
+	 * the semantics of vm_dirty_ratio so the two knobs share a base
+	 * and compose via a plain min() on thresh.  The clamp is keyed
+	 * on wb->memcg_css (the inode-owner's memcg) rather than on
+	 * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
+	 * (flusher kworker context), and cgwb_calc_thresh() all see the
+	 * same clamped value.
+	 *
+	 * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
+	 * callers in balance_dirty_pages() can ignore the slower
+	 * dom->dirty_limit smoothing when deriving setpoint/
+	 * rate-limit from the clamped ceiling.
+	 *
+	 * Clamp is applied after the rt/dl boost: dirty_ratio is a
+	 * strict override, not widened by priority.  bg_thresh is
+	 * scaled by the same factor we apply to thresh so the
+	 * user-configured bg/thresh ratio survives clamping instead
+	 * of snapping to thresh/2 via the bg_thresh >= thresh guard
+	 * below.  mult_frac() preserves precision for small memcgs
+	 * where a plain "(avail / 100) * ratio" would collapse to 0.
+	 */
+	if (gdtc) {
+		struct mem_cgroup *memcg =
+			mem_cgroup_from_css(dtc->wb->memcg_css);
+		unsigned int cg_ratio = memcg ?
+			READ_ONCE(memcg->dirty_ratio) : 0;
+
+		/*
+		 * dtc is reused across balance_dirty_pages() iterations,
+		 * so reset the published clamp every call -- an admin
+		 * clearing memory.dirty_ratio mid-flight must take effect
+		 * on the next pass.
+		 */
+		dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
+
+		if (cg_ratio) {
+			unsigned long cg_thresh = mult_frac(available_memory,
+							    cg_ratio, 100);
+
+			if (cg_thresh < thresh) {
+				bg_thresh = mult_frac(bg_thresh, cg_thresh,
+						      thresh);
+				thresh = cg_thresh;
+				dtc->cg_dirty_cap = cg_thresh;
+			}
+		}
+	}
+
 	/*
 	 * Dirty throttling logic assumes the limits in page units fit into
 	 * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
@@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+	unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
+							       dtc->thresh),
+					       dtc->cg_dirty_cap);
 	unsigned long wb_thresh = dtc->wb_thresh;
 	unsigned long x_intercept;
 	unsigned long setpoint;		/* dirty pages' target balance point */
@@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long dirty = dtc->dirty;
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+	unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
+				  dtc->cg_dirty_cap);
 	unsigned long setpoint = (freerun + limit) / 2;
 	unsigned long write_bw = wb->avg_write_bandwidth;
 	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 	int ret = 0;
 
 	for (;;) {
+		unsigned long cg_dirty_min = 0;
+		unsigned long cg_dirty_pages = 0;
 		unsigned long now = jiffies;
 
 		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
 
 		balance_domain_limits(gdtc, strictlimit);
+
+		/*
+		 * Under RCU, snapshot the current memcg's memory.dirty_min
+		 * reservation.  When it is non-zero, also snapshot the
+		 * memcg-wide dirty backlog.  These feed the per-writer
+		 * dirty_min bypass below; the dirty_ratio clamp itself
+		 * is applied inside domain_dirty_limits() keyed on
+		 * wb->memcg_css so balance_dirty_pages(),
+		 * wb_over_bg_thresh() (flusher kworker context), and
+		 * cgwb_calc_thresh() all see a consistent clamped
+		 * threshold.
+		 *
+		 * rcu_read_lock() is held only for the __rcu dereference
+		 * of current->cgroups; the memcg pointer does not escape
+		 * the critical section.  The counter read matches
+		 * domain_dirty_avail(mdtc, true) so the bypass compares
+		 * the same dirty+in-flight backlog the global path uses.
+		 */
+		rcu_read_lock();
+		{
+			struct mem_cgroup *memcg =
+				mem_cgroup_from_task(current);
+
+			if (memcg) {
+				cg_dirty_min = READ_ONCE(memcg->dirty_min);
+				if (cg_dirty_min)
+					cg_dirty_pages =
+						memcg_page_state(memcg,
+								 NR_FILE_DIRTY) +
+						memcg_page_state(memcg,
+								 NR_WRITEBACK);
+			}
+		}
+		rcu_read_unlock();
+
 		if (mdtc) {
 			/*
-			 * If @wb belongs to !root memcg, repeat the same
-			 * basic calculations for the memcg domain.
+			 * For !root memcg, repeat the same three-step
+			 * sequence as balance_domain_limits(gdtc):
+			 * avail -> limits -> freerun.  We inline it here
+			 * so we can insert the mdtc->dirty override
+			 * between step 2 (domain_dirty_limits, which
+			 * publishes the per-memcg dirty_ratio clamp on
+			 * cg_dirty_cap) and step 3 (domain_dirty_freerun,
+			 * which consumes mdtc->dirty along with
+			 * thresh/bg_thresh).
+			 */
+			domain_dirty_avail(mdtc, true);
+			domain_dirty_limits(mdtc);
+
+			/*
+			 * When the dirty_ratio clamp engaged, replace the
+			 * per-wb dirty count from mem_cgroup_wb_stats()
+			 * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
+			 * sum so freerun, the setpoint, and the rate-limit
+			 * smoothing see the true memcg backlog instead of
+			 * the subset that has migrated to this cgwb (cgwb
+			 * migration is lazy and can lag by many seconds),
+			 * and so a burst of buffered writes cannot silently
+			 * bypass the clamp by shifting pages from
+			 * NR_FILE_DIRTY into NR_WRITEBACK.
+			 *
+			 * Keyed on wb->memcg_css to match the clamp itself.
+			 * The cgwb holds a css reference, so the memcg
+			 * pointer is stable without additional locking.
+			 *
+			 * Caveat: memcg_page_state() aggregates across ALL
+			 * backing devices owned by this memcg, while mdtc
+			 * is scoped to one wb.  A writer to a fast BDI may
+			 * observe backlog accumulated on slow BDIs in the
+			 * same memcg and throttle more than strictly needed.
+			 * Accepted for v1; the alternative (summing per-wb
+			 * dirty across the memcg's cgwbs) walks the cgwb
+			 * list under RCU on a hot path.
 			 */
-			balance_domain_limits(mdtc, strictlimit);
+			if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
+				struct mem_cgroup *wb_memcg =
+					mem_cgroup_from_css(mdtc->wb->memcg_css);
+
+				if (wb_memcg)
+					mdtc->dirty =
+						memcg_page_state(wb_memcg,
+								 NR_FILE_DIRTY) +
+						memcg_page_state(wb_memcg,
+								 NR_WRITEBACK);
+			}
+
+			domain_dirty_freerun(mdtc, strictlimit);
 		}
 
 		if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
 			wb_start_background_writeback(wb);
 
+		/*
+		 * dirty_min bypass: when the current memcg's dirty+in-flight
+		 * backlog is below its memory.dirty_min reservation, let the
+		 * writer proceed without throttling.  This check must live
+		 * outside the if (mdtc) block because a writer's file may not
+		 * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
+		 * and the per-memcg block above is skipped entirely.
+		 *
+		 * cg_dirty_min and cg_dirty_pages come from the per-iteration
+		 * snapshot taken above under rcu_read_lock; both are stored
+		 * in pages (page_counter_memparse converts bytes -> pages for
+		 * dirty_min), so no unit conversion is needed.
+		 */
+		if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
+			goto free_running;
+
 		/*
 		 * If memcg domain is in effect, @dirty should be under
 		 * both global and memcg freerun ceilings.

---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a

Best regards,
-- 
Alireza Haghdoost <haghdoost@uber.com>


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
@ 2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
  0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost via B4 Relay @ 2026-05-01 22:28 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara
  Cc: cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi,
	Alireza Haghdoost

From: Alireza Haghdoost <haghdoost@uber.com>

Add two cgroup v2 memory-controller knobs that bring
balance_dirty_pages() throttling into per-cgroup scope so one noisy
writer cannot stall peers sharing the same host:

  memory.dirty_ratio  Per-cgroup dirty-page ceiling, 0..100 percent of
                      the cgroup's dirtyable memory.  0 (default) leaves
                      the cgroup subject to the global threshold only.

  memory.dirty_min    Guaranteed dirty-page floor, byte value (default 0).

The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is
never throttled.

Motivation, design trade-offs, cost analysis, validation data, and
open questions are in the cover letter.

Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
Assisted-by: Cursor:claude-sonnet-4.5
---
Hi all,

This RFC adds two cgroup v2 memory-controller knobs that give operators
per-cgroup control over dirty-page throttling in balance_dirty_pages():
memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
once we have validated the application site (see "Follow-ups" below).
We are posting this as an RFC, as a single squashed patch, to get design
feedback before splitting the prototype into a per-logical-change series.

Motivation
==========

balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
once the host-wide dirty count crosses a single threshold. On a
container host that threshold is shared across cgroups. A cgroup
that dirties pages faster than storage can drain them pushes the
count over the limit. Every writer on the host then parks in
io_schedule_timeout() -- including cgroups that have not dirtied a
single page of their own.

cgroup v2 already has per-memcg dirty accounting, but that accounting
does not translate into per-memcg dirty throttling.

We see this in production: a buffered write-heavy container generates
multi-second stalls for co-located latency-sensitive workloads.
Moreover, dirty-page accumulation from a single noisy neighbor is a
recurring contributor to mount-responsiveness degradation on shared hosts.

Prior work
==========

Per-memcg dirty-page limits have been proposed before. Andrea Righi
posted an initial RFC in February 2010 [1]; Greg Thelen continued the
work through v9 in August 2011 [2]. That series added per-memcg dirty
counters and hooked them into balance_dirty_pages(), but it bolted
per-cgroup limits onto the global writeback path without making
writeback itself cgroup-aware. Without cgroup-aware flusher threads,
a cgroup exceeding its limit triggered writeback of inodes from any
cgroup, giving poor isolation. The series was not merged.

Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
dirty-set controller" in January 2015 [4], which proposed
memory.dirty_ratio (the same interface name this series uses) via an
inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
and rejected it as a "dead end" that duplicated lower-layer policy
without solving the underlying isolation problem; this rejection
directly motivated Tejun's native cgwb rework described below.

Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
(commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
different approach of restructuring writeback to be natively
cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
threads [3]. That work deliberately deferred user-facing policy knobs.
This series adds the policy surface that consumes Tejun's
infrastructure. The dirty_min reservation concept is, to our
knowledge, new.

A November 2023 LKML thread by Chengming Zhou [5] independently
identified the identical throttling regression on cgroup v2 (a 5 GB
container constantly throttled because memory.max * dirty_ratio yields
too small a threshold for bursty workloads). Jan Kara participated and
endorsed a bytes-based per-memcg dirty limit; no patches followed that
discussion, confirming the gap this series fills.

[1] https://lwn.net/Articles/408349/
[2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
[3] https://lwn.net/Articles/648292/
[4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
[5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/

Proposed interface
==================

Two new cgroup v2 files under the memory controller, absent on the root
cgroup (CFTYPE_NOT_ON_ROOT):

memory.dirty_ratio
  Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
  the cgroup's dirtyable memory (mdtc->avail: file cache plus
  reclaimable slack), the same base the global vm_dirty_ratio scales
  against. 0 (the default) disables the per-cgroup ceiling and leaves
  the cgroup subject to the global threshold only. A non-zero value
  that is stricter than vm_dirty_ratio overrides the global ratio for
  this cgroup via min(mdtc->thresh, cg_thresh); because both sides
  scale off the same base, the knob can never widen the cgroup past
  the global ceiling. A memory.dirty_bytes companion for byte-precise
  caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
  The prototype reads the value for the immediate memcg only;
  hierarchical enforcement (clamping against ancestors, like
  memory.max) is not implemented yet. We would like guidance on
  whether this is required for v1 or can follow in a subsequent
  series.

memory.dirty_min
  Byte value (K/M/G suffixes accepted), default 0 (no reservation).
  Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
  throttling is bypassed (goto free_running). Lets a latency-sensitive
  cgroup buffer a small write burst even under global dirty pressure.

  dirty_min is an admission guarantee, so we have to prevent it from
  breaking the global dirty invariant. Two aspects:

   - Global cap. The sum of dirty_min reservations across all cgroups
     must not exceed a fraction of the global dirty threshold (our
     working number is 80%), so the system always retains some shared
     capacity. The prototype does not enforce this cap yet; we expect
     to either reject at write() time or clamp on read in a cheap
     precomputed effective_dirty_min. We would appreciate feedback on
     which approach the cgroup maintainers prefer.

   - Per-cgroup cap. A cgroup should not be able to reserve more dirty
     capacity than it can hold. Our tentative rule is
     effective_dirty_min = min(dirty_min, memory.max - memory.min),
     evaluated at enforcement time so it tracks live memory.max changes,
     rather than rejected at write. This is similar to how memory.low
     composes with memory.max.

  Neither cap is implemented in the prototype; both would land before
  a non-RFC posting.

The two knobs compose: dirty_ratio bounds how much dirty memory a
cgroup may accrue, dirty_min guarantees a floor below which it is never
throttled.

Test setup and results
======================

To show the problem and the fix, we built a single reproducer that runs
on an unmodified stock kernel and then on the patched kernel, using the
same setup for both.

Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
(bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
(freerun = 24 MB). Files pre-allocated with fallocate before dirty
pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
fills global dirty to the 32 MB cap, then victim runs contended for
30 s.

  - noisy:  single fio job, unlimited write rate, fills global dirty pool.
  - victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
            4 KiB sequential write().

The high freerun (24 MB) ensures victim's solo dirty accumulation
(244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
during the solo phase on either kernel, giving identical baselines.

Stock kernel results (the problem):

                       solo (no contention)  contended  inflation
  victim IOPS          125                   5.1        24.5x worse
  victim p99           0.6 ms                152 ms     253x worse

The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
sleep. The victim has near-zero dirty pages of its own but is forced to
sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
queued in NR_WRITEBACK waiting for the throttled disk.
memory.events.local on both cgroups shows max 0 throughout the run;
this is not memory pressure inside either cgroup.

Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:

                       solo (no contention)  contended  inflation
  victim IOPS          125                   125        1.0x
  victim p99           0.9 ms                0.7 ms     1.0x

The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
computing any sleep. Because the rate-limited victim's resident dirty
set stays well below the reservation, the check fires on every write()
-> goto free_running -> write() returns in ~7 us. The victim is fully
protected.

The figures above are the per-kernel medians of N=5 iterations and
reflect a deterministic outcome on every iter: stock had cont_iops in
4.4..6.1 (retention 0.035..0.049) on all five iters, patched had
cont_iops = 125.0 (retention = 1.000) on all five iters. The 5/5 stock
iters all hit BDP's throttled regime; the 5/5 patched iters all
bypassed it via the dirty_min check.

Implementation
==============

The patch touches five files:

  - include/linux/memcontrol.h: two new fields on struct mem_cgroup
    (dirty_ratio, dirty_min).
  - include/linux/writeback.h: a per-pass cg_dirty_cap field on
    struct dirty_throttle_control used to publish the memcg clamp to
    BDP's setpoint and rate-limit math.
  - include/trace/events/writeback.h: cg_dirty_cap added to the
    balance_dirty_pages tracepoint so operators can distinguish the
    memcg clamp from the global dirty_limit at runtime.
  - mm/memcontrol.c: registers the two cgroup v2 files with input
    validation.
  - mm/page-writeback.c: the throttling changes.

The key changes in page-writeback.c:

  - The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
    on wb->memcg_css (the inode owner's memcg).  Every consumer of
    the memcg dtc -- writer throttle, flusher kworker,
    cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
    The clamp uses mult_frac() so a small memcg does not collapse
    to a zero ceiling.

  - The dirty_min bypass lives in balance_dirty_pages() and is
    writer-keyed: the writing task's memcg is looked up under RCU,
    and when its dirty+in-flight backlog is below dirty_min the
    loop jumps to free_running, bypassing both the global and the
    per-memcg BDP gates.  dirty_min is an admission guarantee for
    the writer's own cgroup, not for inode owners.

  - When the clamp engages, mdtc->dirty is replaced with the
    memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
    setpoint / rate-limit smoothing see the real backlog and pages
    migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
    widen the cap.

Fast-path cost when neither knob is set: one rcu_read_lock/unlock
pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
call.  The memcg counter reads are gated on "knob armed" and do
not fire on the default path.  We have not measured the added
cost yet, but we expect it to be in the noise of existing BDP
bookkeeping.  A tight pwrite() microbenchmark will confirm this
before a non-RFC posting.

Scope
=====

This series affects the writer-side throttle (balance_dirty_pages())
only. It does not partition the flusher-side writeback queue. A
cgroup's fsync() can still block behind inodes from other cgroups in
writeback_sb_inodes(). We document this limit explicitly and expect
writeback-queue partitioning to be a separate, larger effort.

Interaction with block-layer throttles
======================================

The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
runs before the bio reaches the block layer, so dirty_min simply allows
a cgroup to keep accepting write() syscalls up to its reservation; the
actual I/O is still subject to whatever block throttle is in effect. In
the reproducer above, the disk-level bandwidth limit (256 KB/s) is
applied at the QEMU virtio-blk layer, and the protected victim dirties
roughly equal to the rate it can drain, so the block throttle is
exercised on both kernels. We have not yet tested interaction with
guest-side io.max settings; this is on the list before a non-RFC
posting.

Questions for maintainers
=========================

 1. Is writer-throttle-only scope (no flusher/writeback-queue work)
    acceptable for the first series?
 2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
    has it) or on the memcg's wb_domain? Routing it through wb_domain
    would let us reuse __wb_calc_thresh() and keep all threshold
    policy in one place; a possible split is dirty_ratio on wb_domain
    (it is threshold policy), dirty_min on the memcg (throttle
    bypass). We can go either way.
 3. For dirty_min safety caps (per-cgroup and global sum), which
    approach do you prefer: reject at write time, or clamp on read in
    an effective_dirty_min?
 4. Is hierarchical enforcement of dirty_ratio (clamp against
    ancestors) required for v1, or can it follow in a subsequent
    series?

What's missing before a non-RFC posting
=======================================

  - Split the monolithic prototype into a proper series (one patch per
    concept + Documentation + selftest).
  - Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
  - tools/testing/selftests/cgroup/ test for interface surface and
    noisy-neighbor protection.
  - Implement the per-cgroup and global dirty_min safety caps described
    in the memory.dirty_min bullet.
  - Fast-path microbenchmark: confirm zero measurable regression for
    cgroups that have neither knob set.
  - Larger-N validation on real hardware (the current N=5 data is from
    a QEMU guest on a throttled virtio-blk).

Follow-ups (out of scope for this series)
=========================================

  - memory.dirty_weight: a priority weight knob that scales the BDP
    pause length, planned as a separate series. The prototype validated
    the interface surface but the application site (post-pause scaling
    vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
    before we ship it. Happy to discuss in advance of that posting.
  - memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
    mirroring the global vm_dirty_bytes. For operators who want a
    byte-predictable per-cgroup dirty cap rather than a ratio of the
    cgroup's dirtyable memory. We have not prototyped this yet; we
    are listing it so reviewers know it is on the roadmap, since
    the ratio-only interface omits that use case.
  - Writeback-queue partitioning: flusher-side fairness across
    cgroups, as noted in Scope above.

Looking forward to feedback.

Thanks,
Alireza Haghdoost <haghdoost@uber.com>
Kshitij Doshi <kshitijd@uber.com>
---
 include/linux/memcontrol.h       |  10 +++
 include/linux/writeback.h        |   4 +
 include/trace/events/writeback.h |   5 +-
 mm/memcontrol.c                  |  62 ++++++++++++++
 mm/page-writeback.c              | 179 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 249 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..45ca949a4c68 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,16 @@ struct mem_cgroup {
 	spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
 
+	/* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
+	/*
+	 * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
+	 *   matching the global vm_dirty_ratio base; 0 inherits the global
+	 *   threshold.
+	 * dirty_min:   dirty-page reservation, in pages; 0 disables the bypass.
+	 */
+	unsigned int dirty_ratio;
+	unsigned long dirty_min;
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 62552a2ce5b9..e37632f728be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -318,6 +318,10 @@ struct dirty_throttle_control {
 	unsigned long		thresh;		/* dirty threshold */
 	unsigned long		bg_thresh;	/* dirty background threshold */
 	unsigned long		limit;		/* hard dirty limit */
+	unsigned long		cg_dirty_cap;	/* per-memcg dirty_ratio clamp for
+						 * this pass, or PAGE_COUNTER_MAX
+						 * when no memcg clamp applies
+						 */
 
 	unsigned long		wb_dirty;	/* per-wb counterparts */
 	unsigned long		wb_thresh;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index bdac0d685a98..0bf86b3c903c 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
 		__array(	 char,	bdi, 32)
 		__field(u64,		cgroup_ino)
 		__field(unsigned long,	limit)
+		__field(unsigned long,	cg_dirty_cap)
 		__field(unsigned long,	setpoint)
 		__field(unsigned long,	dirty)
 		__field(unsigned long,	wb_setpoint)
@@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
 		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 
 		__entry->limit		= dtc->limit;
+		__entry->cg_dirty_cap	= dtc->cg_dirty_cap;
 		__entry->setpoint	= (dtc->limit + freerun) / 2;
 		__entry->dirty		= dtc->dirty;
 		__entry->wb_setpoint	= __entry->setpoint *
@@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
 
 
 	TP_printk("bdi %s: "
-		  "limit=%lu setpoint=%lu dirty=%lu "
+		  "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
 		  "wb_setpoint=%lu wb_dirty=%lu "
 		  "dirty_ratelimit=%lu task_ratelimit=%lu "
 		  "dirtied=%u dirtied_pause=%u "
 		  "paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
 		  __entry->bdi,
 		  __entry->limit,
+		  __entry->cg_dirty_cap,
 		  __entry->setpoint,
 		  __entry->dirty,
 		  __entry->wb_setpoint,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..c43fe4f394eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_dirty_ratio_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
+	return 0;
+}
+
+static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int ratio;
+	int err;
+
+	err = kstrtouint(strstrip(buf), 0, &ratio);
+	if (err)
+		return err;
+
+	if (ratio > 100)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->dirty_ratio, ratio);
+	return nbytes;
+}
+
+static int memory_dirty_min_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	/* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
+	return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
+}
+
+static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
+				      char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long dirty_min;
+	int err;
+
+	/* page_counter_memparse converts strings like "512M" into a page count */
+	err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
+	if (err)
+		return err;
+
+	WRITE_ONCE(memcg->dirty_min, dirty_min);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "dirty_ratio",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_dirty_ratio_show,
+		.write = memory_dirty_ratio_write,
+	},
+	{
+		.name = "dirty_min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_dirty_min_show,
+		.write = memory_dirty_min_write,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 88cd53d4ba09..2847b2c1e59a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
 
 #define GDTC_INIT(__wb)		.wb = (__wb),				\
 				.dom = &global_wb_domain,		\
-				.wb_completions = &(__wb)->completions
+				.wb_completions = &(__wb)->completions,	\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
-#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
+#define GDTC_INIT_NO_WB		.dom = &global_wb_domain,		\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
 #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
 				.dom = mem_cgroup_wb_domain(__wb),	\
 				.wb_completions = &(__wb)->memcg_completions, \
-				.gdtc = __gdtc
+				.gdtc = __gdtc,				\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
 {
@@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 #define GDTC_INIT(__wb)		.wb = (__wb),                           \
-				.wb_completions = &(__wb)->completions
-#define GDTC_INIT_NO_WB
+				.wb_completions = &(__wb)->completions,	\
+				.cg_dirty_cap = PAGE_COUNTER_MAX
+#define GDTC_INIT_NO_WB		.cg_dirty_cap = PAGE_COUNTER_MAX
 #define MDTC_INIT(__wb, __gdtc)
 
 static bool mdtc_valid(struct dirty_throttle_control *dtc)
@@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
 		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
 	}
+
+	/*
+	 * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
+	 * iff @dtc is a memcg dtc).  dirty_ratio is scaled against
+	 * the memcg's own dirtyable memory (@available_memory), matching
+	 * the semantics of vm_dirty_ratio so the two knobs share a base
+	 * and compose via a plain min() on thresh.  The clamp is keyed
+	 * on wb->memcg_css (the inode-owner's memcg) rather than on
+	 * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
+	 * (flusher kworker context), and cgwb_calc_thresh() all see the
+	 * same clamped value.
+	 *
+	 * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
+	 * callers in balance_dirty_pages() can ignore the slower
+	 * dom->dirty_limit smoothing when deriving setpoint/
+	 * rate-limit from the clamped ceiling.
+	 *
+	 * Clamp is applied after the rt/dl boost: dirty_ratio is a
+	 * strict override, not widened by priority.  bg_thresh is
+	 * scaled by the same factor we apply to thresh so the
+	 * user-configured bg/thresh ratio survives clamping instead
+	 * of snapping to thresh/2 via the bg_thresh >= thresh guard
+	 * below.  mult_frac() preserves precision for small memcgs
+	 * where a plain "(avail / 100) * ratio" would collapse to 0.
+	 */
+	if (gdtc) {
+		struct mem_cgroup *memcg =
+			mem_cgroup_from_css(dtc->wb->memcg_css);
+		unsigned int cg_ratio = memcg ?
+			READ_ONCE(memcg->dirty_ratio) : 0;
+
+		/*
+		 * dtc is reused across balance_dirty_pages() iterations,
+		 * so reset the published clamp every call -- an admin
+		 * clearing memory.dirty_ratio mid-flight must take effect
+		 * on the next pass.
+		 */
+		dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
+
+		if (cg_ratio) {
+			unsigned long cg_thresh = mult_frac(available_memory,
+							    cg_ratio, 100);
+
+			if (cg_thresh < thresh) {
+				bg_thresh = mult_frac(bg_thresh, cg_thresh,
+						      thresh);
+				thresh = cg_thresh;
+				dtc->cg_dirty_cap = cg_thresh;
+			}
+		}
+	}
+
 	/*
 	 * Dirty throttling logic assumes the limits in page units fit into
 	 * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
@@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+	unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
+							       dtc->thresh),
+					       dtc->cg_dirty_cap);
 	unsigned long wb_thresh = dtc->wb_thresh;
 	unsigned long x_intercept;
 	unsigned long setpoint;		/* dirty pages' target balance point */
@@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
 	struct bdi_writeback *wb = dtc->wb;
 	unsigned long dirty = dtc->dirty;
 	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
-	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+	unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
+				  dtc->cg_dirty_cap);
 	unsigned long setpoint = (freerun + limit) / 2;
 	unsigned long write_bw = wb->avg_write_bandwidth;
 	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
 	int ret = 0;
 
 	for (;;) {
+		unsigned long cg_dirty_min = 0;
+		unsigned long cg_dirty_pages = 0;
 		unsigned long now = jiffies;
 
 		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
 
 		balance_domain_limits(gdtc, strictlimit);
+
+		/*
+		 * Under RCU, snapshot the current memcg's memory.dirty_min
+		 * reservation.  When it is non-zero, also snapshot the
+		 * memcg-wide dirty backlog.  These feed the per-writer
+		 * dirty_min bypass below; the dirty_ratio clamp itself
+		 * is applied inside domain_dirty_limits() keyed on
+		 * wb->memcg_css so balance_dirty_pages(),
+		 * wb_over_bg_thresh() (flusher kworker context), and
+		 * cgwb_calc_thresh() all see a consistent clamped
+		 * threshold.
+		 *
+		 * rcu_read_lock() is held only for the __rcu dereference
+		 * of current->cgroups; the memcg pointer does not escape
+		 * the critical section.  The counter read matches
+		 * domain_dirty_avail(mdtc, true) so the bypass compares
+		 * the same dirty+in-flight backlog the global path uses.
+		 */
+		rcu_read_lock();
+		{
+			struct mem_cgroup *memcg =
+				mem_cgroup_from_task(current);
+
+			if (memcg) {
+				cg_dirty_min = READ_ONCE(memcg->dirty_min);
+				if (cg_dirty_min)
+					cg_dirty_pages =
+						memcg_page_state(memcg,
+								 NR_FILE_DIRTY) +
+						memcg_page_state(memcg,
+								 NR_WRITEBACK);
+			}
+		}
+		rcu_read_unlock();
+
 		if (mdtc) {
 			/*
-			 * If @wb belongs to !root memcg, repeat the same
-			 * basic calculations for the memcg domain.
+			 * For !root memcg, repeat the same three-step
+			 * sequence as balance_domain_limits(gdtc):
+			 * avail -> limits -> freerun.  We inline it here
+			 * so we can insert the mdtc->dirty override
+			 * between step 2 (domain_dirty_limits, which
+			 * publishes the per-memcg dirty_ratio clamp on
+			 * cg_dirty_cap) and step 3 (domain_dirty_freerun,
+			 * which consumes mdtc->dirty along with
+			 * thresh/bg_thresh).
+			 */
+			domain_dirty_avail(mdtc, true);
+			domain_dirty_limits(mdtc);
+
+			/*
+			 * When the dirty_ratio clamp engaged, replace the
+			 * per-wb dirty count from mem_cgroup_wb_stats()
+			 * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
+			 * sum so freerun, the setpoint, and the rate-limit
+			 * smoothing see the true memcg backlog instead of
+			 * the subset that has migrated to this cgwb (cgwb
+			 * migration is lazy and can lag by many seconds),
+			 * and so a burst of buffered writes cannot silently
+			 * bypass the clamp by shifting pages from
+			 * NR_FILE_DIRTY into NR_WRITEBACK.
+			 *
+			 * Keyed on wb->memcg_css to match the clamp itself.
+			 * The cgwb holds a css reference, so the memcg
+			 * pointer is stable without additional locking.
+			 *
+			 * Caveat: memcg_page_state() aggregates across ALL
+			 * backing devices owned by this memcg, while mdtc
+			 * is scoped to one wb.  A writer to a fast BDI may
+			 * observe backlog accumulated on slow BDIs in the
+			 * same memcg and throttle more than strictly needed.
+			 * Accepted for v1; the alternative (summing per-wb
+			 * dirty across the memcg's cgwbs) walks the cgwb
+			 * list under RCU on a hot path.
 			 */
-			balance_domain_limits(mdtc, strictlimit);
+			if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
+				struct mem_cgroup *wb_memcg =
+					mem_cgroup_from_css(mdtc->wb->memcg_css);
+
+				if (wb_memcg)
+					mdtc->dirty =
+						memcg_page_state(wb_memcg,
+								 NR_FILE_DIRTY) +
+						memcg_page_state(wb_memcg,
+								 NR_WRITEBACK);
+			}
+
+			domain_dirty_freerun(mdtc, strictlimit);
 		}
 
 		if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
 			wb_start_background_writeback(wb);
 
+		/*
+		 * dirty_min bypass: when the current memcg's dirty+in-flight
+		 * backlog is below its memory.dirty_min reservation, let the
+		 * writer proceed without throttling.  This check must live
+		 * outside the if (mdtc) block because a writer's file may not
+		 * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
+		 * and the per-memcg block above is skipped entirely.
+		 *
+		 * cg_dirty_min and cg_dirty_pages come from the per-iteration
+		 * snapshot taken above under rcu_read_lock; both are stored
+		 * in pages (page_counter_memparse converts bytes -> pages for
+		 * dirty_min), so no unit conversion is needed.
+		 */
+		if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
+			goto free_running;
+
 		/*
 		 * If memcg domain is in effect, @dirty should be under
 		 * both global and memcg freerun ceilings.

---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a

Best regards,
-- 
Alireza Haghdoost <haghdoost@uber.com>



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
  2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
  (?)
@ 2026-05-03  8:59 ` kernel test robot
  -1 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-05-03  8:59 UTC (permalink / raw)
  To: Alireza Haghdoost via B4 Relay; +Cc: oe-kbuild-all

Hi Alireza,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on 254f49634ee16a731174d2ae34bc50bd5f45e731]

url:    https://github.com/intel-lab-lkp/linux/commits/Alireza-Haghdoost-via-B4-Relay/memcg-add-per-cgroup-dirty-page-controls-dirty_ratio-dirty_min/20260502-235916
base:   254f49634ee16a731174d2ae34bc50bd5f45e731
patch link:    https://lore.kernel.org/r/20260501-rfc-memcg-dirty-v1-v1-1-9a8c80036ec1%40uber.com
patch subject: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031656.ByQM5ZN9-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031656.ByQM5ZN9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031656.ByQM5ZN9-lkp@intel.com/

All errors (new ones prefixed by >>):

   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                        ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                        ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                        ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                        ^~
   include/linux/compiler_types.h:635:53: note: in definition of macro '__unqual_scalar_typeof'
     635 | #define __unqual_scalar_typeof(x) __typeof_unqual__(x)
         |                                                     ^
   include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |         ^~~~~~~~~~~
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   In file included from ./arch/openrisc/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:369,
                    from include/linux/array_size.h:5,
                    from include/linux/kernel.h:16,
                    from mm/page-writeback.c:15:
   mm/page-writeback.c:428:40: error: invalid use of undefined type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                        ^~
   include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                         ^
   mm/page-writeback.c:428:25: note: in expansion of macro 'READ_ONCE'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                         ^~~~~~~~~
   mm/page-writeback.c: In function 'balance_dirty_pages':
   mm/page-writeback.c:1912:33: error: implicit declaration of function 'mem_cgroup_from_task'; did you mean 'mem_cgroup_from_css'? [-Wimplicit-function-declaration]
    1912 |                                 mem_cgroup_from_task(current);
         |                                 ^~~~~~~~~~~~~~~~~~~~
         |                                 mem_cgroup_from_css
>> mm/page-writeback.c:1912:33: error: initialization of 'struct mem_cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                               ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                ^~~~~~~~~
   mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                               ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                ^~~~~~~~~
   mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                               ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                ^~~~~~~~~
   mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                               ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                ^~~~~~~~~
   mm/page-writeback.c:1915:63: error: invalid use of undefined type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                               ^~
   include/linux/compiler_types.h:679:23: note: in definition of macro '__compiletime_assert'
     679 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:699:9: note: in expansion of macro '_compiletime_assert'
     699 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:48: note: in expansion of macro 'READ_ONCE'


vim +1912 mm/page-writeback.c

  1853	
  1854	/*
  1855	 * balance_dirty_pages() must be called by processes which are generating dirty
  1856	 * data.  It looks at the number of dirty pages in the machine and will force
  1857	 * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  1858	 * If we're over `background_thresh' then the writeback threads are woken to
  1859	 * perform some writeout.
  1860	 */
  1861	static int balance_dirty_pages(struct bdi_writeback *wb,
  1862				       unsigned long pages_dirtied, unsigned int flags)
  1863	{
  1864		struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
  1865		struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
  1866		struct dirty_throttle_control * const gdtc = &gdtc_stor;
  1867		struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
  1868							     &mdtc_stor : NULL;
  1869		struct dirty_throttle_control *sdtc;
  1870		unsigned long nr_dirty;
  1871		long period;
  1872		long pause;
  1873		long max_pause;
  1874		long min_pause;
  1875		int nr_dirtied_pause;
  1876		unsigned long task_ratelimit;
  1877		unsigned long dirty_ratelimit;
  1878		struct backing_dev_info *bdi = wb->bdi;
  1879		bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
  1880		unsigned long start_time = jiffies;
  1881		int ret = 0;
  1882	
  1883		for (;;) {
  1884			unsigned long cg_dirty_min = 0;
  1885			unsigned long cg_dirty_pages = 0;
  1886			unsigned long now = jiffies;
  1887	
  1888			nr_dirty = global_node_page_state(NR_FILE_DIRTY);
  1889	
  1890			balance_domain_limits(gdtc, strictlimit);
  1891	
  1892			/*
  1893			 * Under RCU, snapshot the current memcg's memory.dirty_min
  1894			 * reservation.  When it is non-zero, also snapshot the
  1895			 * memcg-wide dirty backlog.  These feed the per-writer
  1896			 * dirty_min bypass below; the dirty_ratio clamp itself
  1897			 * is applied inside domain_dirty_limits() keyed on
  1898			 * wb->memcg_css so balance_dirty_pages(),
  1899			 * wb_over_bg_thresh() (flusher kworker context), and
  1900			 * cgwb_calc_thresh() all see a consistent clamped
  1901			 * threshold.
  1902			 *
  1903			 * rcu_read_lock() is held only for the __rcu dereference
  1904			 * of current->cgroups; the memcg pointer does not escape
  1905			 * the critical section.  The counter read matches
  1906			 * domain_dirty_avail(mdtc, true) so the bypass compares
  1907			 * the same dirty+in-flight backlog the global path uses.
  1908			 */
  1909			rcu_read_lock();
  1910			{
  1911				struct mem_cgroup *memcg =
> 1912					mem_cgroup_from_task(current);
  1913	
  1914				if (memcg) {
  1915					cg_dirty_min = READ_ONCE(memcg->dirty_min);
  1916					if (cg_dirty_min)
  1917						cg_dirty_pages =
  1918							memcg_page_state(memcg,
  1919									 NR_FILE_DIRTY) +
  1920							memcg_page_state(memcg,
  1921									 NR_WRITEBACK);
  1922				}
  1923			}
  1924			rcu_read_unlock();
  1925	
  1926			if (mdtc) {
  1927				/*
  1928				 * For !root memcg, repeat the same three-step
  1929				 * sequence as balance_domain_limits(gdtc):
  1930				 * avail -> limits -> freerun.  We inline it here
  1931				 * so we can insert the mdtc->dirty override
  1932				 * between step 2 (domain_dirty_limits, which
  1933				 * publishes the per-memcg dirty_ratio clamp on
  1934				 * cg_dirty_cap) and step 3 (domain_dirty_freerun,
  1935				 * which consumes mdtc->dirty along with
  1936				 * thresh/bg_thresh).
  1937				 */
  1938				domain_dirty_avail(mdtc, true);
  1939				domain_dirty_limits(mdtc);
  1940	
  1941				/*
  1942				 * When the dirty_ratio clamp engaged, replace the
  1943				 * per-wb dirty count from mem_cgroup_wb_stats()
  1944				 * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
  1945				 * sum so freerun, the setpoint, and the rate-limit
  1946				 * smoothing see the true memcg backlog instead of
  1947				 * the subset that has migrated to this cgwb (cgwb
  1948				 * migration is lazy and can lag by many seconds),
  1949				 * and so a burst of buffered writes cannot silently
  1950				 * bypass the clamp by shifting pages from
  1951				 * NR_FILE_DIRTY into NR_WRITEBACK.
  1952				 *
  1953				 * Keyed on wb->memcg_css to match the clamp itself.
  1954				 * The cgwb holds a css reference, so the memcg
  1955				 * pointer is stable without additional locking.
  1956				 *
  1957				 * Caveat: memcg_page_state() aggregates across ALL
  1958				 * backing devices owned by this memcg, while mdtc
  1959				 * is scoped to one wb.  A writer to a fast BDI may
  1960				 * observe backlog accumulated on slow BDIs in the
  1961				 * same memcg and throttle more than strictly needed.
  1962				 * Accepted for v1; the alternative (summing per-wb
  1963				 * dirty across the memcg's cgwbs) walks the cgwb
  1964				 * list under RCU on a hot path.
  1965				 */
  1966				if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
  1967					struct mem_cgroup *wb_memcg =
  1968						mem_cgroup_from_css(mdtc->wb->memcg_css);
  1969	
  1970					if (wb_memcg)
  1971						mdtc->dirty =
  1972							memcg_page_state(wb_memcg,
  1973									 NR_FILE_DIRTY) +
  1974							memcg_page_state(wb_memcg,
  1975									 NR_WRITEBACK);
  1976				}
  1977	
  1978				domain_dirty_freerun(mdtc, strictlimit);
  1979			}
  1980	
  1981			if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
  1982				wb_start_background_writeback(wb);
  1983	
  1984			/*
  1985			 * dirty_min bypass: when the current memcg's dirty+in-flight
  1986			 * backlog is below its memory.dirty_min reservation, let the
  1987			 * writer proceed without throttling.  This check must live
  1988			 * outside the if (mdtc) block because a writer's file may not
  1989			 * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
  1990			 * and the per-memcg block above is skipped entirely.
  1991			 *
  1992			 * cg_dirty_min and cg_dirty_pages come from the per-iteration
  1993			 * snapshot taken above under rcu_read_lock; both are stored
  1994			 * in pages (page_counter_memparse converts bytes -> pages for
  1995			 * dirty_min), so no unit conversion is needed.
  1996			 */
  1997			if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
  1998				goto free_running;
  1999	
  2000			/*
  2001			 * If memcg domain is in effect, @dirty should be under
  2002			 * both global and memcg freerun ceilings.
  2003			 */
  2004			if (gdtc->freerun && (!mdtc || mdtc->freerun)) {
  2005				unsigned long intv;
  2006				unsigned long m_intv;
  2007	
  2008	free_running:
  2009				intv = domain_poll_intv(gdtc, strictlimit);
  2010				m_intv = ULONG_MAX;
  2011	
  2012				current->dirty_paused_when = now;
  2013				current->nr_dirtied = 0;
  2014				if (mdtc)
  2015					m_intv = domain_poll_intv(mdtc, strictlimit);
  2016				current->nr_dirtied_pause = min(intv, m_intv);
  2017				break;
  2018			}
  2019	
  2020			/*
  2021			 * Unconditionally start background writeback if it's not
  2022			 * already in progress. We need to do this because the global
  2023			 * dirty threshold check above (nr_dirty > gdtc->bg_thresh)
  2024			 * doesn't account for these cases:
  2025			 *
  2026			 * a) strictlimit BDIs: throttling is calculated using per-wb
  2027			 * thresholds. The per-wb threshold can be exceeded even when
  2028			 * nr_dirty < gdtc->bg_thresh
  2029			 *
  2030			 * b) memcg-based throttling: memcg uses its own dirty count and
  2031			 * thresholds and can trigger throttling even when global
  2032			 * nr_dirty < gdtc->bg_thresh
  2033			 *
  2034			 * Writeback needs to be started else the writer stalls in the
  2035			 * throttle loop waiting for dirty pages to be written back
  2036			 * while no writeback is running.
  2037			 */
  2038			if (unlikely(!writeback_in_progress(wb)))
  2039				wb_start_background_writeback(wb);
  2040	
  2041			mem_cgroup_flush_foreign(wb);
  2042	
  2043			/*
  2044			 * Calculate global domain's pos_ratio and select the
  2045			 * global dtc by default.
  2046			 */
  2047			balance_wb_limits(gdtc, strictlimit);
  2048			if (gdtc->freerun)
  2049				goto free_running;
  2050			sdtc = gdtc;
  2051	
  2052			if (mdtc) {
  2053				/*
  2054				 * If memcg domain is in effect, calculate its
  2055				 * pos_ratio.  @wb should satisfy constraints from
  2056				 * both global and memcg domains.  Choose the one
  2057				 * w/ lower pos_ratio.
  2058				 */
  2059				balance_wb_limits(mdtc, strictlimit);
  2060				if (mdtc->freerun)
  2061					goto free_running;
  2062				if (mdtc->pos_ratio < gdtc->pos_ratio)
  2063					sdtc = mdtc;
  2064			}
  2065	
  2066			wb->dirty_exceeded = gdtc->dirty_exceeded ||
  2067					     (mdtc && mdtc->dirty_exceeded);
  2068			if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
  2069						   BANDWIDTH_INTERVAL))
  2070				__wb_update_bandwidth(gdtc, mdtc, true);
  2071	
  2072			/* throttle according to the chosen dtc */
  2073			dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit);
  2074			task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
  2075								RATELIMIT_CALC_SHIFT;
  2076			max_pause = wb_max_pause(wb, sdtc->wb_dirty);
  2077			min_pause = wb_min_pause(wb, max_pause,
  2078						 task_ratelimit, dirty_ratelimit,
  2079						 &nr_dirtied_pause);
  2080	
  2081			if (unlikely(task_ratelimit == 0)) {
  2082				period = max_pause;
  2083				pause = max_pause;
  2084				goto pause;
  2085			}
  2086			period = HZ * pages_dirtied / task_ratelimit;
  2087			pause = period;
  2088			if (current->dirty_paused_when)
  2089				pause -= now - current->dirty_paused_when;
  2090			/*
  2091			 * For less than 1s think time (ext3/4 may block the dirtier
  2092			 * for up to 800ms from time to time on 1-HDD; so does xfs,
  2093			 * however at much less frequency), try to compensate it in
  2094			 * future periods by updating the virtual time; otherwise just
  2095			 * do a reset, as it may be a light dirtier.
  2096			 */
  2097			if (pause < min_pause) {
  2098				trace_balance_dirty_pages(wb,
  2099							  sdtc,
  2100							  dirty_ratelimit,
  2101							  task_ratelimit,
  2102							  pages_dirtied,
  2103							  period,
  2104							  min(pause, 0L),
  2105							  start_time);
  2106				if (pause < -HZ) {
  2107					current->dirty_paused_when = now;
  2108					current->nr_dirtied = 0;
  2109				} else if (period) {
  2110					current->dirty_paused_when += period;
  2111					current->nr_dirtied = 0;
  2112				} else if (current->nr_dirtied_pause <= pages_dirtied)
  2113					current->nr_dirtied_pause += pages_dirtied;
  2114				break;
  2115			}
  2116			if (unlikely(pause > max_pause)) {
  2117				/* for occasional dropped task_ratelimit */
  2118				now += min(pause - max_pause, max_pause);
  2119				pause = max_pause;
  2120			}
  2121	
  2122	pause:
  2123			trace_balance_dirty_pages(wb,
  2124						  sdtc,
  2125						  dirty_ratelimit,
  2126						  task_ratelimit,
  2127						  pages_dirtied,
  2128						  period,
  2129						  pause,
  2130						  start_time);
  2131			if (flags & BDP_ASYNC) {
  2132				ret = -EAGAIN;
  2133				break;
  2134			}
  2135			__set_current_state(TASK_KILLABLE);
  2136			bdi->last_bdp_sleep = jiffies;
  2137			io_schedule_timeout(pause);
  2138	
  2139			current->dirty_paused_when = now + pause;
  2140			current->nr_dirtied = 0;
  2141			current->nr_dirtied_pause = nr_dirtied_pause;
  2142	
  2143			/*
  2144			 * This is typically equal to (dirty < thresh) and can also
  2145			 * keep "1000+ dd on a slow USB stick" under control.
  2146			 */
  2147			if (task_ratelimit)
  2148				break;
  2149	
  2150			/*
  2151			 * In the case of an unresponsive NFS server and the NFS dirty
  2152			 * pages exceeds dirty_thresh, give the other good wb's a pipe
  2153			 * to go through, so that tasks on them still remain responsive.
  2154			 *
  2155			 * In theory 1 page is enough to keep the consumer-producer
  2156			 * pipe going: the flusher cleans 1 page => the task dirties 1
  2157			 * more page. However wb_dirty has accounting errors.  So use
  2158			 * the larger and more IO friendly wb_stat_error.
  2159			 */
  2160			if (sdtc->wb_dirty <= wb_stat_error())
  2161				break;
  2162	
  2163			if (fatal_signal_pending(current))
  2164				break;
  2165		}
  2166		return ret;
  2167	}
  2168	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
  2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
  (?)
  (?)
@ 2026-05-03  9:55 ` kernel test robot
  -1 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-05-03  9:55 UTC (permalink / raw)
  To: Alireza Haghdoost via B4 Relay; +Cc: llvm, oe-kbuild-all

Hi Alireza,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on 254f49634ee16a731174d2ae34bc50bd5f45e731]

url:    https://github.com/intel-lab-lkp/linux/commits/Alireza-Haghdoost-via-B4-Relay/memcg-add-per-cgroup-dirty-page-controls-dirty_ratio-dirty_min/20260502-235916
base:   254f49634ee16a731174d2ae34bc50bd5f45e731
patch link:    https://lore.kernel.org/r/20260501-rfc-memcg-dirty-v1-v1-1-9a8c80036ec1%40uber.com
patch subject: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20260503/202605031710.4QHTfWdf-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 5bac06718f502014fade905512f1d26d578a18f3)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031710.4QHTfWdf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031710.4QHTfWdf-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/page-writeback.c:426:33: error: no member named 'memcg_css' in 'struct bdi_writeback'
     426 |                         mem_cgroup_from_css(dtc->wb->memcg_css);
         |                                             ~~~~~~~  ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:428:19: error: incomplete definition of type 'struct mem_cgroup'
     428 |                         READ_ONCE(memcg->dirty_ratio) : 0;
         |                                   ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
>> mm/page-writeback.c:1912:5: error: call to undeclared function 'mem_cgroup_from_task'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1912 |                                 mem_cgroup_from_task(current);
         |                                 ^
   mm/page-writeback.c:1912:5: note: did you mean 'mem_cgroup_from_css'?
   include/linux/memcontrol.h:1211:20: note: 'mem_cgroup_from_css' declared here
    1211 | struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
         |                    ^
>> mm/page-writeback.c:1911:23: error: incompatible integer to pointer conversion initializing 'struct mem_cgroup *' with an expression of type 'int' [-Wint-conversion]
    1911 |                         struct mem_cgroup *memcg =
         |                                            ^
    1912 |                                 mem_cgroup_from_task(current);
         |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1915:35: error: incomplete definition of type 'struct mem_cgroup'
    1915 |                                 cg_dirty_min = READ_ONCE(memcg->dirty_min);
         |                                                          ~~~~~^
   include/linux/shrinker.h:55:9: note: forward declaration of 'struct mem_cgroup'
      55 |         struct mem_cgroup *memcg;
         |                ^
   mm/page-writeback.c:1968:36: error: no member named 'memcg_css' in 'struct bdi_writeback'
    1968 |                                         mem_cgroup_from_css(mdtc->wb->memcg_css);
         |                                                             ~~~~~~~~  ^
   18 errors generated.


vim +426 mm/page-writeback.c

   341	
   342	/**
   343	 * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
   344	 * @dtc: dirty_throttle_control of interest
   345	 *
   346	 * Calculate @dtc->thresh and ->bg_thresh considering
   347	 * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}.  The caller
   348	 * must ensure that @dtc->avail is set before calling this function.  The
   349	 * dirty limits will be lifted by 1/4 for real-time tasks.
   350	 */
   351	static void domain_dirty_limits(struct dirty_throttle_control *dtc)
   352	{
   353		const unsigned long available_memory = dtc->avail;
   354		struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
   355		unsigned long bytes = vm_dirty_bytes;
   356		unsigned long bg_bytes = dirty_background_bytes;
   357		/* convert ratios to per-PAGE_SIZE for higher precision */
   358		unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
   359		unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
   360		unsigned long thresh;
   361		unsigned long bg_thresh;
   362		struct task_struct *tsk;
   363	
   364		/* gdtc is !NULL iff @dtc is for memcg domain */
   365		if (gdtc) {
   366			unsigned long global_avail = gdtc->avail;
   367	
   368			/*
   369			 * The byte settings can't be applied directly to memcg
   370			 * domains.  Convert them to ratios by scaling against
   371			 * globally available memory.  As the ratios are in
   372			 * per-PAGE_SIZE, they can be obtained by dividing bytes by
   373			 * number of pages.
   374			 */
   375			if (bytes)
   376				ratio = min(DIV_ROUND_UP(bytes, global_avail),
   377					    PAGE_SIZE);
   378			if (bg_bytes)
   379				bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
   380					       PAGE_SIZE);
   381			bytes = bg_bytes = 0;
   382		}
   383	
   384		if (bytes)
   385			thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
   386		else
   387			thresh = (ratio * available_memory) / PAGE_SIZE;
   388	
   389		if (bg_bytes)
   390			bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
   391		else
   392			bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
   393	
   394		tsk = current;
   395		if (rt_or_dl_task(tsk)) {
   396			bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
   397			thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
   398		}
   399	
   400		/*
   401		 * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
   402		 * iff @dtc is a memcg dtc).  dirty_ratio is scaled against
   403		 * the memcg's own dirtyable memory (@available_memory), matching
   404		 * the semantics of vm_dirty_ratio so the two knobs share a base
   405		 * and compose via a plain min() on thresh.  The clamp is keyed
   406		 * on wb->memcg_css (the inode-owner's memcg) rather than on
   407		 * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
   408		 * (flusher kworker context), and cgwb_calc_thresh() all see the
   409		 * same clamped value.
   410		 *
   411		 * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
   412		 * callers in balance_dirty_pages() can ignore the slower
   413		 * dom->dirty_limit smoothing when deriving setpoint/
   414		 * rate-limit from the clamped ceiling.
   415		 *
   416		 * Clamp is applied after the rt/dl boost: dirty_ratio is a
   417		 * strict override, not widened by priority.  bg_thresh is
   418		 * scaled by the same factor we apply to thresh so the
   419		 * user-configured bg/thresh ratio survives clamping instead
   420		 * of snapping to thresh/2 via the bg_thresh >= thresh guard
   421		 * below.  mult_frac() preserves precision for small memcgs
   422		 * where a plain "(avail / 100) * ratio" would collapse to 0.
   423		 */
   424		if (gdtc) {
   425			struct mem_cgroup *memcg =
 > 426				mem_cgroup_from_css(dtc->wb->memcg_css);
   427			unsigned int cg_ratio = memcg ?
 > 428				READ_ONCE(memcg->dirty_ratio) : 0;
   429	
   430			/*
   431			 * dtc is reused across balance_dirty_pages() iterations,
   432			 * so reset the published clamp every call -- an admin
   433			 * clearing memory.dirty_ratio mid-flight must take effect
   434			 * on the next pass.
   435			 */
   436			dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
   437	
   438			if (cg_ratio) {
   439				unsigned long cg_thresh = mult_frac(available_memory,
   440								    cg_ratio, 100);
   441	
   442				if (cg_thresh < thresh) {
   443					bg_thresh = mult_frac(bg_thresh, cg_thresh,
   444							      thresh);
   445					thresh = cg_thresh;
   446					dtc->cg_dirty_cap = cg_thresh;
   447				}
   448			}
   449		}
   450	
   451		/*
   452		 * Dirty throttling logic assumes the limits in page units fit into
   453		 * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
   454		 */
   455		if (thresh > UINT_MAX)
   456			thresh = UINT_MAX;
   457		/* This makes sure bg_thresh is within 32-bits as well */
   458		if (bg_thresh >= thresh)
   459			bg_thresh = thresh / 2;
   460		dtc->thresh = thresh;
   461		dtc->bg_thresh = bg_thresh;
   462	
   463		/* we should eventually report the domain in the TP */
   464		if (!gdtc)
   465			trace_global_dirty_state(bg_thresh, thresh);
   466	}
   467	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
  2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
                   ` (2 preceding siblings ...)
  (?)
@ 2026-05-06 14:21 ` Jan Kara
  2026-05-14  4:10   ` Alireza Haghdoost
  -1 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2026-05-06 14:21 UTC (permalink / raw)
  To: haghdoost
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), Jan Kara,
	cgroups, linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi

On Fri 01-05-26 22:28:38, Alireza Haghdoost via B4 Relay wrote:
> From: Alireza Haghdoost <haghdoost@uber.com>
> 
> Add two cgroup v2 memory-controller knobs that bring
> balance_dirty_pages() throttling into per-cgroup scope so one noisy
> writer cannot stall peers sharing the same host:
> 
>   memory.dirty_ratio  Per-cgroup dirty-page ceiling, 0..100 percent of
>                       the cgroup's dirtyable memory.  0 (default) leaves
>                       the cgroup subject to the global threshold only.
> 
>   memory.dirty_min    Guaranteed dirty-page floor, byte value (default 0).
> 
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is
> never throttled.
> 
> Motivation, design trade-offs, cost analysis, validation data, and
> open questions are in the cover letter.
> 
> Co-developed-by: Kshitij Doshi <kshitijd@uber.com>
> Signed-off-by: Kshitij Doshi <kshitijd@uber.com>
> Signed-off-by: Alireza Haghdoost <haghdoost@uber.com>
> Assisted-by: Cursor:claude-sonnet-4.5
> ---

Things like motivation actually belong to the changelog itself, measured
results how the patch helps as well. On the other hand stuff like history
is largely irrelevant here, frankly I don't have a bandwidth to carefully
read the huge amount of text LLM has generated below so please try to make
it more concise for next time.

> This RFC adds two cgroup v2 memory-controller knobs that give operators
> per-cgroup control over dirty-page throttling in balance_dirty_pages():
> memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
> floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
> once we have validated the application site (see "Follow-ups" below).
> We are posting this as an RFC, as a single squashed patch, to get design
> feedback before splitting the prototype into a per-logical-change series.
> 
> Motivation
> ==========
> 
> balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
> once the host-wide dirty count crosses a single threshold. On a
> container host that threshold is shared across cgroups. A cgroup
> that dirties pages faster than storage can drain them pushes the
> count over the limit. Every writer on the host then parks in
> io_schedule_timeout() -- including cgroups that have not dirtied a
> single page of their own.
> 
> cgroup v2 already has per-memcg dirty accounting, but that accounting
> does not translate into per-memcg dirty throttling.

Not quite true. We do have per-memcg writeback workers and we do have
per-memcg dirty limits (inferred from global dirty limit tunings) that are
enforced...

> We see this in production: a buffered write-heavy container generates
> multi-second stalls for co-located latency-sensitive workloads.
> Moreover, dirty-page accumulation from a single noisy neighbor is a
> recurring contributor to mount-responsiveness degradation on shared hosts.

... and I quite don't see how a multisecond stalls you are describing would
happen. There is something I must be missing. The throttling works as
follows: Until we cross global freerun limit (that is (background_limit +
dirty_limit) / 2) nobody is throttled. Once we cross it, memcg dirty limits
start to be taken into account. If we are below freerun in the memcg, the
task dirtying folios from that memcg shouldn't be throttled at all, once we
get above freerun we throttle by maximum of throttling delay decided from
global and memcg situation so then long delays can start happening but it
would mean the "innocent" task's memcg had to get at least over the freerun
limit.

So can you perhaps share more details about the configuration where you
observe these delays to innocent tasks due to another task dirtying a lot of
memory? How many page cache in total and dirty pages are there in each
memcg (both for the aggressive dirtier and wrongly delayed task)? Is the
delayed task really throttled in balance_dirty_pages()?

								Honza


> Prior work
> ==========
> 
> Per-memcg dirty-page limits have been proposed before. Andrea Righi
> posted an initial RFC in February 2010 [1]; Greg Thelen continued the
> work through v9 in August 2011 [2]. That series added per-memcg dirty
> counters and hooked them into balance_dirty_pages(), but it bolted
> per-cgroup limits onto the global writeback path without making
> writeback itself cgroup-aware. Without cgroup-aware flusher threads,
> a cgroup exceeding its limit triggered writeback of inodes from any
> cgroup, giving poor isolation. The series was not merged.
> 
> Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
> dirty-set controller" in January 2015 [4], which proposed
> memory.dirty_ratio (the same interface name this series uses) via an
> inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
> and rejected it as a "dead end" that duplicated lower-layer policy
> without solving the underlying isolation problem; this rejection
> directly motivated Tejun's native cgwb rework described below.
> 
> Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
> (commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
> different approach of restructuring writeback to be natively
> cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
> NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
> threads [3]. That work deliberately deferred user-facing policy knobs.
> This series adds the policy surface that consumes Tejun's
> infrastructure. The dirty_min reservation concept is, to our
> knowledge, new.
> 
> A November 2023 LKML thread by Chengming Zhou [5] independently
> identified the identical throttling regression on cgroup v2 (a 5 GB
> container constantly throttled because memory.max * dirty_ratio yields
> too small a threshold for bursty workloads). Jan Kara participated and
> endorsed a bytes-based per-memcg dirty limit; no patches followed that
> discussion, confirming the gap this series fills.
> 
> [1] https://lwn.net/Articles/408349/
> [2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@google.com/
> [3] https://lwn.net/Articles/648292/
> [4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
> [5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@linux.dev/
> 
> Proposed interface
> ==================
> 
> Two new cgroup v2 files under the memory controller, absent on the root
> cgroup (CFTYPE_NOT_ON_ROOT):
> 
> memory.dirty_ratio
>   Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
>   the cgroup's dirtyable memory (mdtc->avail: file cache plus
>   reclaimable slack), the same base the global vm_dirty_ratio scales
>   against. 0 (the default) disables the per-cgroup ceiling and leaves
>   the cgroup subject to the global threshold only. A non-zero value
>   that is stricter than vm_dirty_ratio overrides the global ratio for
>   this cgroup via min(mdtc->thresh, cg_thresh); because both sides
>   scale off the same base, the knob can never widen the cgroup past
>   the global ceiling. A memory.dirty_bytes companion for byte-precise
>   caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
>   The prototype reads the value for the immediate memcg only;
>   hierarchical enforcement (clamping against ancestors, like
>   memory.max) is not implemented yet. We would like guidance on
>   whether this is required for v1 or can follow in a subsequent
>   series.
> 
> memory.dirty_min
>   Byte value (K/M/G suffixes accepted), default 0 (no reservation).
>   Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
>   throttling is bypassed (goto free_running). Lets a latency-sensitive
>   cgroup buffer a small write burst even under global dirty pressure.
> 
>   dirty_min is an admission guarantee, so we have to prevent it from
>   breaking the global dirty invariant. Two aspects:
> 
>    - Global cap. The sum of dirty_min reservations across all cgroups
>      must not exceed a fraction of the global dirty threshold (our
>      working number is 80%), so the system always retains some shared
>      capacity. The prototype does not enforce this cap yet; we expect
>      to either reject at write() time or clamp on read in a cheap
>      precomputed effective_dirty_min. We would appreciate feedback on
>      which approach the cgroup maintainers prefer.
> 
>    - Per-cgroup cap. A cgroup should not be able to reserve more dirty
>      capacity than it can hold. Our tentative rule is
>      effective_dirty_min = min(dirty_min, memory.max - memory.min),
>      evaluated at enforcement time so it tracks live memory.max changes,
>      rather than rejected at write. This is similar to how memory.low
>      composes with memory.max.
> 
>   Neither cap is implemented in the prototype; both would land before
>   a non-RFC posting.
> 
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is never
> throttled.
> 
> Test setup and results
> ======================
> 
> To show the problem and the fix, we built a single reproducer that runs
> on an unmodified stock kernel and then on the patched kernel, using the
> same setup for both.
> 
> Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
> (bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
> bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
> (freerun = 24 MB). Files pre-allocated with fallocate before dirty
> pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
> fills global dirty to the 32 MB cap, then victim runs contended for
> 30 s.
> 
>   - noisy:  single fio job, unlimited write rate, fills global dirty pool.
>   - victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
>             4 KiB sequential write().
> 
> The high freerun (24 MB) ensures victim's solo dirty accumulation
> (244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
> during the solo phase on either kernel, giving identical baselines.
> 
> Stock kernel results (the problem):
> 
>                        solo (no contention)  contended  inflation
>   victim IOPS          125                   5.1        24.5x worse
>   victim p99           0.6 ms                152 ms     253x worse
> 
> The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
> HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
> sleep. The victim has near-zero dirty pages of its own but is forced to
> sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
> NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
> queued in NR_WRITEBACK waiting for the throttled disk.
> memory.events.local on both cgroups shows max 0 throughout the run;
> this is not memory pressure inside either cgroup.
> 
> Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:
> 
>                        solo (no contention)  contended  inflation
>   victim IOPS          125                   125        1.0x
>   victim p99           0.9 ms                0.7 ms     1.0x
> 
> The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
> computing any sleep. Because the rate-limited victim's resident dirty
> set stays well below the reservation, the check fires on every write()
> -> goto free_running -> write() returns in ~7 us. The victim is fully
> protected.
> 
> The figures above are the per-kernel medians of N=5 iterations and
> reflect a deterministic outcome on every iter: stock had cont_iops in
> 4.4..6.1 (retention 0.035..0.049) on all five iters, patched had
> cont_iops = 125.0 (retention = 1.000) on all five iters. The 5/5 stock
> iters all hit BDP's throttled regime; the 5/5 patched iters all
> bypassed it via the dirty_min check.
> 
> Implementation
> ==============
> 
> The patch touches five files:
> 
>   - include/linux/memcontrol.h: two new fields on struct mem_cgroup
>     (dirty_ratio, dirty_min).
>   - include/linux/writeback.h: a per-pass cg_dirty_cap field on
>     struct dirty_throttle_control used to publish the memcg clamp to
>     BDP's setpoint and rate-limit math.
>   - include/trace/events/writeback.h: cg_dirty_cap added to the
>     balance_dirty_pages tracepoint so operators can distinguish the
>     memcg clamp from the global dirty_limit at runtime.
>   - mm/memcontrol.c: registers the two cgroup v2 files with input
>     validation.
>   - mm/page-writeback.c: the throttling changes.
> 
> The key changes in page-writeback.c:
> 
>   - The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
>     on wb->memcg_css (the inode owner's memcg).  Every consumer of
>     the memcg dtc -- writer throttle, flusher kworker,
>     cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
>     The clamp uses mult_frac() so a small memcg does not collapse
>     to a zero ceiling.
> 
>   - The dirty_min bypass lives in balance_dirty_pages() and is
>     writer-keyed: the writing task's memcg is looked up under RCU,
>     and when its dirty+in-flight backlog is below dirty_min the
>     loop jumps to free_running, bypassing both the global and the
>     per-memcg BDP gates.  dirty_min is an admission guarantee for
>     the writer's own cgroup, not for inode owners.
> 
>   - When the clamp engages, mdtc->dirty is replaced with the
>     memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
>     setpoint / rate-limit smoothing see the real backlog and pages
>     migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
>     widen the cap.
> 
> Fast-path cost when neither knob is set: one rcu_read_lock/unlock
> pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
> iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
> call.  The memcg counter reads are gated on "knob armed" and do
> not fire on the default path.  We have not measured the added
> cost yet, but we expect it to be in the noise of existing BDP
> bookkeeping.  A tight pwrite() microbenchmark will confirm this
> before a non-RFC posting.
> 
> Scope
> =====
> 
> This series affects the writer-side throttle (balance_dirty_pages())
> only. It does not partition the flusher-side writeback queue. A
> cgroup's fsync() can still block behind inodes from other cgroups in
> writeback_sb_inodes(). We document this limit explicitly and expect
> writeback-queue partitioning to be a separate, larger effort.
> 
> Interaction with block-layer throttles
> ======================================
> 
> The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
> runs before the bio reaches the block layer, so dirty_min simply allows
> a cgroup to keep accepting write() syscalls up to its reservation; the
> actual I/O is still subject to whatever block throttle is in effect. In
> the reproducer above, the disk-level bandwidth limit (256 KB/s) is
> applied at the QEMU virtio-blk layer, and the protected victim dirties
> roughly equal to the rate it can drain, so the block throttle is
> exercised on both kernels. We have not yet tested interaction with
> guest-side io.max settings; this is on the list before a non-RFC
> posting.
> 
> Questions for maintainers
> =========================
> 
>  1. Is writer-throttle-only scope (no flusher/writeback-queue work)
>     acceptable for the first series?
>  2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
>     has it) or on the memcg's wb_domain? Routing it through wb_domain
>     would let us reuse __wb_calc_thresh() and keep all threshold
>     policy in one place; a possible split is dirty_ratio on wb_domain
>     (it is threshold policy), dirty_min on the memcg (throttle
>     bypass). We can go either way.
>  3. For dirty_min safety caps (per-cgroup and global sum), which
>     approach do you prefer: reject at write time, or clamp on read in
>     an effective_dirty_min?
>  4. Is hierarchical enforcement of dirty_ratio (clamp against
>     ancestors) required for v1, or can it follow in a subsequent
>     series?
> 
> What's missing before a non-RFC posting
> =======================================
> 
>   - Split the monolithic prototype into a proper series (one patch per
>     concept + Documentation + selftest).
>   - Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
>   - tools/testing/selftests/cgroup/ test for interface surface and
>     noisy-neighbor protection.
>   - Implement the per-cgroup and global dirty_min safety caps described
>     in the memory.dirty_min bullet.
>   - Fast-path microbenchmark: confirm zero measurable regression for
>     cgroups that have neither knob set.
>   - Larger-N validation on real hardware (the current N=5 data is from
>     a QEMU guest on a throttled virtio-blk).
> 
> Follow-ups (out of scope for this series)
> =========================================
> 
>   - memory.dirty_weight: a priority weight knob that scales the BDP
>     pause length, planned as a separate series. The prototype validated
>     the interface surface but the application site (post-pause scaling
>     vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
>     before we ship it. Happy to discuss in advance of that posting.
>   - memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
>     mirroring the global vm_dirty_bytes. For operators who want a
>     byte-predictable per-cgroup dirty cap rather than a ratio of the
>     cgroup's dirtyable memory. We have not prototyped this yet; we
>     are listing it so reviewers know it is on the roadmap, since
>     the ratio-only interface omits that use case.
>   - Writeback-queue partitioning: flusher-side fairness across
>     cgroups, as noted in Scope above.
> 
> Looking forward to feedback.
> 
> Thanks,
> Alireza Haghdoost <haghdoost@uber.com>
> Kshitij Doshi <kshitijd@uber.com>
> ---
>  include/linux/memcontrol.h       |  10 +++
>  include/linux/writeback.h        |   4 +
>  include/trace/events/writeback.h |   5 +-
>  mm/memcontrol.c                  |  62 ++++++++++++++
>  mm/page-writeback.c              | 179 ++++++++++++++++++++++++++++++++++++---
>  5 files changed, 249 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dc3fa687759b..45ca949a4c68 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -323,6 +323,16 @@ struct mem_cgroup {
>  	spinlock_t event_list_lock;
>  #endif /* CONFIG_MEMCG_V1 */
>  
> +	/* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
> +	/*
> +	 * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
> +	 *   matching the global vm_dirty_ratio base; 0 inherits the global
> +	 *   threshold.
> +	 * dirty_min:   dirty-page reservation, in pages; 0 disables the bypass.
> +	 */
> +	unsigned int dirty_ratio;
> +	unsigned long dirty_min;
> +
>  	struct mem_cgroup_per_node *nodeinfo[];
>  };
>  
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 62552a2ce5b9..e37632f728be 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -318,6 +318,10 @@ struct dirty_throttle_control {
>  	unsigned long		thresh;		/* dirty threshold */
>  	unsigned long		bg_thresh;	/* dirty background threshold */
>  	unsigned long		limit;		/* hard dirty limit */
> +	unsigned long		cg_dirty_cap;	/* per-memcg dirty_ratio clamp for
> +						 * this pass, or PAGE_COUNTER_MAX
> +						 * when no memcg clamp applies
> +						 */
>  
>  	unsigned long		wb_dirty;	/* per-wb counterparts */
>  	unsigned long		wb_thresh;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..0bf86b3c903c 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
>  		__array(	 char,	bdi, 32)
>  		__field(u64,		cgroup_ino)
>  		__field(unsigned long,	limit)
> +		__field(unsigned long,	cg_dirty_cap)
>  		__field(unsigned long,	setpoint)
>  		__field(unsigned long,	dirty)
>  		__field(unsigned long,	wb_setpoint)
> @@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
>  		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
>  
>  		__entry->limit		= dtc->limit;
> +		__entry->cg_dirty_cap	= dtc->cg_dirty_cap;
>  		__entry->setpoint	= (dtc->limit + freerun) / 2;
>  		__entry->dirty		= dtc->dirty;
>  		__entry->wb_setpoint	= __entry->setpoint *
> @@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
>  
>  
>  	TP_printk("bdi %s: "
> -		  "limit=%lu setpoint=%lu dirty=%lu "
> +		  "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
>  		  "wb_setpoint=%lu wb_dirty=%lu "
>  		  "dirty_ratelimit=%lu task_ratelimit=%lu "
>  		  "dirtied=%u dirtied_pause=%u "
>  		  "paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
>  		  __entry->bdi,
>  		  __entry->limit,
> +		  __entry->cg_dirty_cap,
>  		  __entry->setpoint,
>  		  __entry->dirty,
>  		  __entry->wb_setpoint,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c3d98ab41f1f..c43fe4f394eb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static int memory_dirty_ratio_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +	seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
> +	return 0;
> +}
> +
> +static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
> +					char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int ratio;
> +	int err;
> +
> +	err = kstrtouint(strstrip(buf), 0, &ratio);
> +	if (err)
> +		return err;
> +
> +	if (ratio > 100)
> +		return -EINVAL;
> +
> +	WRITE_ONCE(memcg->dirty_ratio, ratio);
> +	return nbytes;
> +}
> +
> +static int memory_dirty_min_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +	/* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
> +	return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
> +}
> +
> +static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
> +				      char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long dirty_min;
> +	int err;
> +
> +	/* page_counter_memparse converts strings like "512M" into a page count */
> +	err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
> +	if (err)
> +		return err;
> +
> +	WRITE_ONCE(memcg->dirty_min, dirty_min);
> +	return nbytes;
> +}
> +
>  /*
>   * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
>   * if any new events become available.
> @@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
>  		.flags = CFTYPE_NS_DELEGATABLE,
>  		.write = memory_reclaim,
>  	},
> +	{
> +		.name = "dirty_ratio",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_dirty_ratio_show,
> +		.write = memory_dirty_ratio_write,
> +	},
> +	{
> +		.name = "dirty_min",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_dirty_min_show,
> +		.write = memory_dirty_min_write,
> +	},
>  	{ }	/* terminate */
>  };
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 88cd53d4ba09..2847b2c1e59a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
>  
>  #define GDTC_INIT(__wb)		.wb = (__wb),				\
>  				.dom = &global_wb_domain,		\
> -				.wb_completions = &(__wb)->completions
> +				.wb_completions = &(__wb)->completions,	\
> +				.cg_dirty_cap = PAGE_COUNTER_MAX
>  
> -#define GDTC_INIT_NO_WB		.dom = &global_wb_domain
> +#define GDTC_INIT_NO_WB		.dom = &global_wb_domain,		\
> +				.cg_dirty_cap = PAGE_COUNTER_MAX
>  
>  #define MDTC_INIT(__wb, __gdtc)	.wb = (__wb),				\
>  				.dom = mem_cgroup_wb_domain(__wb),	\
>  				.wb_completions = &(__wb)->memcg_completions, \
> -				.gdtc = __gdtc
> +				.gdtc = __gdtc,				\
> +				.cg_dirty_cap = PAGE_COUNTER_MAX
>  
>  static bool mdtc_valid(struct dirty_throttle_control *dtc)
>  {
> @@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
>  #else	/* CONFIG_CGROUP_WRITEBACK */
>  
>  #define GDTC_INIT(__wb)		.wb = (__wb),                           \
> -				.wb_completions = &(__wb)->completions
> -#define GDTC_INIT_NO_WB
> +				.wb_completions = &(__wb)->completions,	\
> +				.cg_dirty_cap = PAGE_COUNTER_MAX
> +#define GDTC_INIT_NO_WB		.cg_dirty_cap = PAGE_COUNTER_MAX
>  #define MDTC_INIT(__wb, __gdtc)
>  
>  static bool mdtc_valid(struct dirty_throttle_control *dtc)
> @@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
>  		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
>  	}
> +
> +	/*
> +	 * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
> +	 * iff @dtc is a memcg dtc).  dirty_ratio is scaled against
> +	 * the memcg's own dirtyable memory (@available_memory), matching
> +	 * the semantics of vm_dirty_ratio so the two knobs share a base
> +	 * and compose via a plain min() on thresh.  The clamp is keyed
> +	 * on wb->memcg_css (the inode-owner's memcg) rather than on
> +	 * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
> +	 * (flusher kworker context), and cgwb_calc_thresh() all see the
> +	 * same clamped value.
> +	 *
> +	 * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
> +	 * callers in balance_dirty_pages() can ignore the slower
> +	 * dom->dirty_limit smoothing when deriving setpoint/
> +	 * rate-limit from the clamped ceiling.
> +	 *
> +	 * Clamp is applied after the rt/dl boost: dirty_ratio is a
> +	 * strict override, not widened by priority.  bg_thresh is
> +	 * scaled by the same factor we apply to thresh so the
> +	 * user-configured bg/thresh ratio survives clamping instead
> +	 * of snapping to thresh/2 via the bg_thresh >= thresh guard
> +	 * below.  mult_frac() preserves precision for small memcgs
> +	 * where a plain "(avail / 100) * ratio" would collapse to 0.
> +	 */
> +	if (gdtc) {
> +		struct mem_cgroup *memcg =
> +			mem_cgroup_from_css(dtc->wb->memcg_css);
> +		unsigned int cg_ratio = memcg ?
> +			READ_ONCE(memcg->dirty_ratio) : 0;
> +
> +		/*
> +		 * dtc is reused across balance_dirty_pages() iterations,
> +		 * so reset the published clamp every call -- an admin
> +		 * clearing memory.dirty_ratio mid-flight must take effect
> +		 * on the next pass.
> +		 */
> +		dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
> +
> +		if (cg_ratio) {
> +			unsigned long cg_thresh = mult_frac(available_memory,
> +							    cg_ratio, 100);
> +
> +			if (cg_thresh < thresh) {
> +				bg_thresh = mult_frac(bg_thresh, cg_thresh,
> +						      thresh);
> +				thresh = cg_thresh;
> +				dtc->cg_dirty_cap = cg_thresh;
> +			}
> +		}
> +	}
> +
>  	/*
>  	 * Dirty throttling logic assumes the limits in page units fit into
>  	 * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
> @@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
>  	struct bdi_writeback *wb = dtc->wb;
>  	unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
>  	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> -	unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> +	unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
> +							       dtc->thresh),
> +					       dtc->cg_dirty_cap);
>  	unsigned long wb_thresh = dtc->wb_thresh;
>  	unsigned long x_intercept;
>  	unsigned long setpoint;		/* dirty pages' target balance point */
> @@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
>  	struct bdi_writeback *wb = dtc->wb;
>  	unsigned long dirty = dtc->dirty;
>  	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> -	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> +	unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
> +				  dtc->cg_dirty_cap);
>  	unsigned long setpoint = (freerun + limit) / 2;
>  	unsigned long write_bw = wb->avg_write_bandwidth;
>  	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
> @@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
>  	int ret = 0;
>  
>  	for (;;) {
> +		unsigned long cg_dirty_min = 0;
> +		unsigned long cg_dirty_pages = 0;
>  		unsigned long now = jiffies;
>  
>  		nr_dirty = global_node_page_state(NR_FILE_DIRTY);
>  
>  		balance_domain_limits(gdtc, strictlimit);
> +
> +		/*
> +		 * Under RCU, snapshot the current memcg's memory.dirty_min
> +		 * reservation.  When it is non-zero, also snapshot the
> +		 * memcg-wide dirty backlog.  These feed the per-writer
> +		 * dirty_min bypass below; the dirty_ratio clamp itself
> +		 * is applied inside domain_dirty_limits() keyed on
> +		 * wb->memcg_css so balance_dirty_pages(),
> +		 * wb_over_bg_thresh() (flusher kworker context), and
> +		 * cgwb_calc_thresh() all see a consistent clamped
> +		 * threshold.
> +		 *
> +		 * rcu_read_lock() is held only for the __rcu dereference
> +		 * of current->cgroups; the memcg pointer does not escape
> +		 * the critical section.  The counter read matches
> +		 * domain_dirty_avail(mdtc, true) so the bypass compares
> +		 * the same dirty+in-flight backlog the global path uses.
> +		 */
> +		rcu_read_lock();
> +		{
> +			struct mem_cgroup *memcg =
> +				mem_cgroup_from_task(current);
> +
> +			if (memcg) {
> +				cg_dirty_min = READ_ONCE(memcg->dirty_min);
> +				if (cg_dirty_min)
> +					cg_dirty_pages =
> +						memcg_page_state(memcg,
> +								 NR_FILE_DIRTY) +
> +						memcg_page_state(memcg,
> +								 NR_WRITEBACK);
> +			}
> +		}
> +		rcu_read_unlock();
> +
>  		if (mdtc) {
>  			/*
> -			 * If @wb belongs to !root memcg, repeat the same
> -			 * basic calculations for the memcg domain.
> +			 * For !root memcg, repeat the same three-step
> +			 * sequence as balance_domain_limits(gdtc):
> +			 * avail -> limits -> freerun.  We inline it here
> +			 * so we can insert the mdtc->dirty override
> +			 * between step 2 (domain_dirty_limits, which
> +			 * publishes the per-memcg dirty_ratio clamp on
> +			 * cg_dirty_cap) and step 3 (domain_dirty_freerun,
> +			 * which consumes mdtc->dirty along with
> +			 * thresh/bg_thresh).
> +			 */
> +			domain_dirty_avail(mdtc, true);
> +			domain_dirty_limits(mdtc);
> +
> +			/*
> +			 * When the dirty_ratio clamp engaged, replace the
> +			 * per-wb dirty count from mem_cgroup_wb_stats()
> +			 * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
> +			 * sum so freerun, the setpoint, and the rate-limit
> +			 * smoothing see the true memcg backlog instead of
> +			 * the subset that has migrated to this cgwb (cgwb
> +			 * migration is lazy and can lag by many seconds),
> +			 * and so a burst of buffered writes cannot silently
> +			 * bypass the clamp by shifting pages from
> +			 * NR_FILE_DIRTY into NR_WRITEBACK.
> +			 *
> +			 * Keyed on wb->memcg_css to match the clamp itself.
> +			 * The cgwb holds a css reference, so the memcg
> +			 * pointer is stable without additional locking.
> +			 *
> +			 * Caveat: memcg_page_state() aggregates across ALL
> +			 * backing devices owned by this memcg, while mdtc
> +			 * is scoped to one wb.  A writer to a fast BDI may
> +			 * observe backlog accumulated on slow BDIs in the
> +			 * same memcg and throttle more than strictly needed.
> +			 * Accepted for v1; the alternative (summing per-wb
> +			 * dirty across the memcg's cgwbs) walks the cgwb
> +			 * list under RCU on a hot path.
>  			 */
> -			balance_domain_limits(mdtc, strictlimit);
> +			if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
> +				struct mem_cgroup *wb_memcg =
> +					mem_cgroup_from_css(mdtc->wb->memcg_css);
> +
> +				if (wb_memcg)
> +					mdtc->dirty =
> +						memcg_page_state(wb_memcg,
> +								 NR_FILE_DIRTY) +
> +						memcg_page_state(wb_memcg,
> +								 NR_WRITEBACK);
> +			}
> +
> +			domain_dirty_freerun(mdtc, strictlimit);
>  		}
>  
>  		if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
>  			wb_start_background_writeback(wb);
>  
> +		/*
> +		 * dirty_min bypass: when the current memcg's dirty+in-flight
> +		 * backlog is below its memory.dirty_min reservation, let the
> +		 * writer proceed without throttling.  This check must live
> +		 * outside the if (mdtc) block because a writer's file may not
> +		 * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
> +		 * and the per-memcg block above is skipped entirely.
> +		 *
> +		 * cg_dirty_min and cg_dirty_pages come from the per-iteration
> +		 * snapshot taken above under rcu_read_lock; both are stored
> +		 * in pages (page_counter_memparse converts bytes -> pages for
> +		 * dirty_min), so no unit conversion is needed.
> +		 */
> +		if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
> +			goto free_running;
> +
>  		/*
>  		 * If memcg domain is in effect, @dirty should be under
>  		 * both global and memcg freerun ceilings.
> 
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a
> 
> Best regards,
> -- 
> Alireza Haghdoost <haghdoost@uber.com>
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
  2026-05-06 14:21 ` Jan Kara
@ 2026-05-14  4:10   ` Alireza Haghdoost
  0 siblings, 0 replies; 6+ messages in thread
From: Alireza Haghdoost @ 2026-05-14  4:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Matthew Wilcox (Oracle), cgroups,
	linux-mm, linux-kernel, linux-fsdevel, Kshitij Doshi

On Wed 06-05-26 16:21:00, Jan Kara wrote:
> Things like motivation actually belong to the changelog itself, measured
> results how the patch helps as well. On the other hand stuff like history
> is largely irrelevant here, frankly I don't have a bandwidth to carefully
> read the huge amount of text LLM has generated below so please try to make
> it more concise for next time.

Understood. Will trim for the non-RFC posting; apologies for the
volume.

> ... I quite don't see how a multisecond stalls you are describing would
> happen [...] If we are below freerun in the memcg, the task dirtying
> folios from that memcg shouldn't be throttled at all, once we get above
> freerun we throttle by maximum of throttling delay decided from global
> and memcg situation [...]

The stall is reachable even with the victim's memcg well below its
own freerun. The freerun shortcut in balance_dirty_pages() is an AND,
not OR:

  if (gdtc->freerun && (!mdtc || mdtc->freerun))
          goto free_running;

Once gdtc is over freerun (because the noisy neighbour pushed it
there) the shortcut does not fire, even when mdtc->freerun is true.
After the shortcut fails, the per-task pause is computed from the
dtc with the smaller pos_ratio:

  if (mdtc->pos_ratio < gdtc->pos_ratio)
          sdtc = mdtc;

When global is the worse domain, the victim sleeps against global
state, not memcg state.

> So can you perhaps share more details about the configuration where
> you observe these delays to innocent tasks due to another task
> dirtying a lot of memory? How many page cache in total and dirty
> pages are there in each memcg [...]? Is the delayed task really
> throttled in balance_dirty_pages()?

Yes. Re-ran the reproducer: stock 7.0-rc5, ext4 on virtio-blk
throttled to 256 KB/s, dirty_bytes=32M, dirty_background_bytes=16M
(freerun = 24 MB), noisy = single fio job doing unlimited buffered
randwrite, victim = single fio job doing 4 KiB sequential write
rate-limited to 500 KB/s.

Per-memcg snapshot during the contended phase, ~10 s into the run:

                       noisy memcg     victim memcg     global
  memory.current        47 MB            21 MB           --
  file (cache)          38 MB            14 MB           --
  file_dirty            26 MB           1.7 MB           27 MB
  file_writeback       1.5 MB           4.0 MB          5.3 MB

Victim memcg holds 1.7 MB dirty, far below any reasonable per-memcg
freerun. Global dirty (NR_FILE_DIRTY + NR_WRITEBACK ~ 32 MB) is over
the 24 MB freerun ceiling, driven entirely by noisy.

The victim writer (fio with psync) is in fact sleeping in
balance_dirty_pages(). One stack snapshot during a stall:

  [<0>] balance_dirty_pages+0x5c5/0xac0
  [<0>] balance_dirty_pages_ratelimited_flags+0x2a1/0x380
  [<0>] generic_perform_write+0x194/0x280
  [<0>] ext4_buffered_write_iter+0x63/0x110
  [<0>] vfs_write+0x28d/0x450
  [<0>] __x64_sys_pwrite64+0x8c/0xc0
  [<0>] do_syscall_64+0xfa/0x520
  [<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f

Sampling /proc/<pid>/wchan at 100 Hz across the contended phase
yields the histogram:

  104  balance_dirty_pages
   88  hrtimer_nanosleep    (the fio rate-limit sleep between writes)
   12  RUNNING
    4  p9_client_rpc        (virtfs, host-guest filesystem RPC)
    3  d_alloc_parallel

The vast majority of non-rate-limit samples have the writer parked
in balance_dirty_pages(). Victim per-IO clat in this run reaches a
3 s tail (worst single 4 KiB pwrite blocked ~3.0 s) while its own
memcg holds < 2 MB dirty.

I'm happy to share the full traces and the reproducer if useful.

Thanks for the review,
Alireza


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-05-14  4:10 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-01 22:28 [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min) Alireza Haghdoost
2026-05-01 22:28 ` Alireza Haghdoost via B4 Relay
2026-05-03  8:59 ` kernel test robot
2026-05-03  9:55 ` kernel test robot
2026-05-06 14:21 ` Jan Kara
2026-05-14  4:10   ` Alireza Haghdoost

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.