Linux cgroups development


* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-23 20:45 UTC (permalink / raw)
  To: Jing Wu; +Cc: Jing Wu, Waiman Long, linux-kernel, rcu, cgroups, Qiliang Yuan
In-Reply-To: <20260623043641.2391662-1-realwujing@gmail.com>

On Tue, Jun 23 2026 at 12:36, Jing Wu wrote:
> On Thu, Jun 18 2026 at 22:27, Thomas Gleixner wrote:
> That said, I fully accept the architectural feedback: the on-the-fly
> subsystem modification approach in v3 is wrong, and v4 should use the
> CPU hotplug machinery.
>
> We are open to coordinating with Waiman on a unified approach that
> covers both use cases. Before starting v4, two questions:
>
>   1. Is the "no boot parameter required" use case worth pursuing
>      independently, or should it be folded into Waiman's series?

Sort it out with him.

>   2. For the hotplug path: is CPU-by-CPU offline/online the expected
>      mechanism, given that you rejected the cpuhp_offline_cb() bulk
>      approach in Waiman's v1?

I think so. It makes the most sense.


* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-23 20:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <CAO9r8zPwHYj284gyyjqnH6Z-NNLLbftKqzoOKycaMzm3+ifSdA@mail.gmail.com>

On Tue, 23 Jun 2026 13:06:10 -0700 Yosry Ahmed <yosry@kernel.org> wrote:

> On Tue, Jun 23, 2026 at 11:56 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> > On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > > > To get back to the question of how the auto-tuning should work, the
> > > > main question is to which ratio we scale the swap limits to.
> > > > Do we set the swap limits proportional to how much swap is present
> > > > in the system, or how much swap is available to the cgroup?
> > > >
> > > > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > > > respectively, how much should a cgroup with swap.max = 10G have if
> > > > it is limited to tiers A and B?
> > > >
> > > > This is what I was getting at earlier when I said we have to calculate
> > > > different ratios for different cgroups, based on what tiers they have
> > > > access to.
> > >
> > > That's a good question. I think the case that is particularly
> > > interesting is whether or not the limits of other tiers should change
> > > when another tier is disabled/enabled.
> > >
> > > So basically in your example, assuming everything starts as "max",
> > > when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> > > 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> > > userspace sets the limit of tier C to 0, should the limits for tiers A
> > > and B change?
> > >
> > > On one hand, it's simpler to just keep the autoscaled limits unchanged
> > > in this case. However, this means that the effective swap limit is now
> > > 8G, which is not great :/
> > >
> > > The alternative is to recalculate all the limits when one of them
> > > changes, in which case the limits of A and B would change to 6.25G and
> > > 3.75G. But I don't know if this will work well if we allow custom
> > > limits. What happens if the limit of tier C is written as 1 (or 4096)
> > > instead of 0? It's effectively the same scenario, but the tier is
> > > technically allowed.
> >
> > I think the one problem with this is that it becomes quite easy to
> > accidentally overcommit. As a toy example, if you have 10 workloads and
> > 100G swap (as in the example I gave above), intuitively setting
> > swap.max = 10G for all 10 workloads shouldn't ever cause any contention
> > on capacity. But if you start excluding some tiers from some workloads,
> > you actually get overcommitting on the tiers that can service the
> > most workloads.
> >
> > I am not sure how concerning swap overcommit was, but at least in the
> > memory tiering scenario accidental overcommitting of toptier memory
> > seemed bad enough that I wanted to avoid the problem entirely.
> >
> > > The more I think about it, the more I realize it may be best to drop
> > > the autoscaling thing. I imagine memory tiering might run into similar
> > > issues too :/
> >
> > And that's why I didn't include opt-in/opt-out for any of the tiers;
> > if you have system-wide ratios, there's no need to change the ratios
> > at all, and as long as the sum of your memory.limit for each workload
> > is under the total capacity, all tiers will also not be overcommitted.
> 
> I think eventually there may be use cases to opt some memcgs out for
> some memory tiers. For example, limit sensitive workloads to the top
> tier (or vice versa).

Yup, that makes sense to me too.

One of the things that did concern me a bit with my model for tiered
memcg limit was that system-critical processes would also be susceptible
to being demoted and churned, when we would much rather make sure
those are kept protected at the toptier.

> > Now, all of these complications aside, I think we might be overthinking
> > a bit here : -) The auto-scaling should just provide some sort of
> > "reasonable" default, the users can always override the per-tier
> > limits if they are unhappy with the autoscaled values.
> 
> I agree, but it seems like both options are not ideal here. I think it
> might make more sense to not present a default value at all, have
> "max" be the default for all the tiers, even if memory.max or swap.max
> isn't. Userspace can set the limits if they need to. Autoscaling the
> limits in userspace should be easy.

I like this idea a lot. That would basically make swap tiers a no-op
unless you opt-into setting the limits yourself, so we don't run the
risk of accidentally enabling tiers.

On that note, maybe it makes sense for me to change my
memory tiering series to also just not present a default setting for
tiered limits, and instead just set them as max until the user comes
and configures them?

I think this is a better question for the memcg maintainers, who might
have more to say on this. Johannes, Michal, Roman, and Shakeel,
what do you guys think? Could an approach to just make the memory
tier limits writable from the get-go and not expose any defaults
make sense to you?

I think that would simplify the code quite a bit and also help mitigate
the possible side effects on system-critical workloads.

Thanks!
Joshua


* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed @ 2026-06-23 20:06 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <20260623185618.1488231-1-joshua.hahnjy@gmail.com>

On Tue, Jun 23, 2026 at 11:56 AM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > To get back to the question of how the auto-tuning should work, the
> > > main question is to which ratio we scale the swap limits to.
> > > Do we set the swap limits proportional to how much swap is present
> > > in the system, or how much swap is available to the cgroup?
> > >
> > > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > > respectively, how much should a cgroup with swap.max = 10G have if
> > > it is limited to tiers A and B?
> > >
> > > This is what I was getting at earlier when I said we have to calculate
> > > different ratios for different cgroups, based on what tiers they have
> > > access to.
> >
> > That's a good question. I think the case that is particularly
> > interesting is whether or not the limits of other tiers should change
> > when another tier is disabled/enabled.
> >
> > So basically in your example, assuming everything starts as "max",
> > when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> > 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> > userspace sets the limit of tier C to 0, should the limits for tiers A
> > and B change?
> >
> > On one hand, it's simpler to just keep the autoscaled limits unchanged
> > in this case. However, this means that the effective swap limit is now
> > 8G, which is not great :/
> >
> > The alternative is to recalculate all the limits when one of them
> > changes, in which case the limits of A and B would change to 6.25G and
> > 3.75G. But I don't know if this will work well if we allow custom
> > limits. What happens if the limit of tier C is written as 1 (or 4096)
> > instead of 0? It's effectively the same scenario, but the tier is
> > technically allowed.
>
> I think the one problem with this is that it becomes quite easy to
> accidentally overcommit. As a toy example, if you have 10 workloads and
> 100G swap (as in the example I gave above), intuitively setting
> swap.max = 10G for all 10 workloads shouldn't ever cause any contention
> on capacity. But if you start excluding some tiers from some workloads,
> you actually get overcommitting on the tiers that can service the
> most workloads.
>
> I am not sure how concerning swap overcommit was, but at least in the
> memory tiering scenario accidental overcommitting of toptier memory
> seemed bad enough that I wanted to avoid the problem entirely.
>
> > The more I think about it, the more I realize it may be best to drop
> > the autoscaling thing. I imagine memory tiering might run into similar
> > issues too :/
>
> And that's why I didn't include opt-in/opt-out for any of the tiers;
> if you have system-wide ratios, there's no need to change the ratios
> at all, and as long as the sum of your memory.limit for each workload
> is under the total capacity, all tiers will also not be overcommitted.

I think eventually there may be use cases to opt some memcgs out for
some memory tiers. For example, limit sensitive workloads to the top
tier (or vice versa).

>
> Now, all of these complications aside, I think we might be overthinking
> a bit here : -) The auto-scaling should just provide some sort of
> "reasonable" default, the users can always override the per-tier
> limits if they are unhappy with the autoscaled values.

I agree, but it seems like both options are not ideal here. I think it
might make more sense to not present a default value at all, have
"max" be the default for all the tiers, even if memory.max or swap.max
isn't. Userspace can set the limits if they need to. Autoscaling the
limits in userspace should be easy.
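
A minimal sketch of what such userspace autoscaling could look like, using the 50G/30G/20G tiers from the example above. The helper name and signature are hypothetical and not part of any proposed kernel interface; the only requirement on units is that they are consistent (MiB here).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace helper: split swap_max across the swap tiers in
 * proportion to tier capacity, skipping tiers the cgroup has disabled.
 * Disabled tiers get a limit of 0.
 */
static void autoscale_tier_limits(const unsigned long long *capacity,
				  const int *enabled, size_t nr_tiers,
				  unsigned long long swap_max,
				  unsigned long long *limit)
{
	unsigned long long total = 0;
	size_t i;

	assert(nr_tiers > 0);

	/* Only tiers the cgroup may use contribute to the ratio. */
	for (i = 0; i < nr_tiers; i++)
		if (enabled[i])
			total += capacity[i];

	for (i = 0; i < nr_tiers; i++) {
		if (!enabled[i] || total == 0)
			limit[i] = 0;
		else
			limit[i] = swap_max * capacity[i] / total;
	}
}
```

With capacities 50:30:20 and swap_max = 10240 MiB, this gives 5120/3072/2048 MiB (5G/3G/2G). With tier C disabled it gives 6400/3840/0 MiB (6.25G/3.75G), which is the "recalculate" variant discussed above.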

>
> In fact, maybe it even makes sense to have sum of swap tier limits >
> swap.max.
>
> (I actually recall having a really similar discussion when I was working
> on weighted interleave auto-tuning a year ago, on how weights should be
> set when switching between manually-set limits and relying on
> auto-scaled defaults [1]. I don't think there's a need to follow this
> convention, but we should think about what the expected behavior should
> be if a user manually sets a limit, but later wants to go back to
> auto-scaling limits).
>
> Anyways, I think these are important questions. Youngjun, Nhat, Shakeel,
> any thoughts from you all? : -)
>
> [1] https://lore.kernel.org/all/8734hbiq7j.fsf@DESKTOP-5N7EMDA/


* Re: [PATCH v2] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Joe Simmons-Talbott @ 2026-06-23 19:42 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Joe Simmons-Talbott, Tejun Heo, Johannes Weiner, Shuah Khan,
	cgroups, linux-kselftest, linux-kernel, Sebastian Chlad
In-Reply-To: <ajqBjmJ-VT3UDPMr@localhost.localdomain>

On Tue, Jun 23, 2026 at 03:52:30PM +0200, Michal Koutný wrote:
> On Mon, Jun 22, 2026 at 03:43:04PM -0400, Joe Simmons-Talbott <joest@redhat.com> wrote:
> > +static long
> > +_get_config_hz(void)
> > +{
> > +	long hz = -1;
> > +	FILE *f;
> > +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> > +
> > +	f = popen(cmd, "r");
> > +
> > +	if (!f)
> > +		goto out;
> > +
> > +	fscanf(f, "CONFIG_HZ=%ld", &hz);
> > +
> > +out:
> > +	pclose(f);
> > +	return hz;
> > +}
> 
> I like that you voiced this dependency on CONFIG_HZ and also that
> _SC_CLK_TCK is useless in this regard.
> (I see that BPF selftests have similar infra for this.)
> 
> 
> > +
> >  /*
> >   * This test creates a cgroup with some maximum value within a period, and
> >   * verifies that a process in the cgroup is not overscheduled.
> > @@ -646,7 +669,8 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
> >  static int test_cpucg_max(const char *root)
> >  {
> >  	int ret = KSFT_FAIL;
> > -	long quota_usec = 1000;
> > +	long hz = _get_config_hz();
> > +	long quota_usec;
> >  	long default_period_usec = 100000; /* cpu.max's default period */
> >  	long duration_seconds = 1;
> 
> I would not bend the tested value but its expectation (so that
> approximately the same quantity is tested across configs).
> 
> I reckon the problem might be tasks that overrun the quota due to long
> tick, fortunately, we can assume this is compensated over multiple
> periods, so _on average_ quota should be honored (more) precisely.
> But the test duration may not be well aligned with all the compensation
> periods, so that must be accounted for in the expectation.
> 
> When I write it all down, I get this:
> 
> --- a/tools/testing/selftests/cgroup/test_cpu.c
> +++ b/tools/testing/selftests/cgroup/test_cpu.c
> @@ -651,7 +651,9 @@ static int test_cpucg_max(const char *root)
>         long duration_seconds = 1;
> 
>         long duration_usec = duration_seconds * USEC_PER_SEC;
> -       long usage_usec, n_periods, remainder_usec, expected_usage_usec;
> +       long usage_usec, expected_usage_usec;
> +       long n_periods, spread_periods, unaligned;
> +       long tick_usec, low_usage, high_usage;
>         char *cpucg;
>         char quota_buf[32];
> 
> @@ -687,9 +689,16 @@ static int test_cpucg_max(const char *root)
>          * the cpu hog is set to run as per wall-clock time
>          */
>         n_periods = duration_usec / default_period_usec;
> -       remainder_usec = duration_usec - n_periods * default_period_usec;
> -       expected_usage_usec
> -               = n_periods * quota_usec + MIN(remainder_usec, quota_usec);
> +       tick_usec = USEC_PER_SEC / hz;
> +       /* Up to tick_usec (over)run is compensated over multiple periods */
> +       spread_periods = MAX(1, tick_usec / quota_usec);
> +       low_usage = n_periods / spread_periods;
> +       high_usage = (n_periods + spread_periods - 1) / spread_periods;
> +       unaligned = n_periods % spread_periods;
> +
> +       expected_usage_usec = quota_usec * (
> +               unaligned * high_usage +
> +               (spread_periods - unaligned) * low_usage);
> 
>         if (!values_close_report(usage_usec, expected_usage_usec, 10))
>                 goto cleanup;
> 
> 
> (I neglected (and dropped) remainder_usec because it is zero with
> default values)
> 
> However, not all preemptions are tick-based, so there'd be noise
> and one has to tune the values_close_report(,,err) anyway.
> 
> Then to reduce noise, the simpler solution is to let the test run
> longer
> 
> duration_usec = duration_seconds * USEC_PER_SEC * 1000 / hz;
> 
> (where 1000 is the CONFIG_HZ=1000 where the test runs sufficiently [1] well.)
> 
> Joe, how do the two variants above (unalignment accounting and prolonged
> duration) affect test_cpu behavior on your setup?

Hi Michal,

Thank you for your review.

I tried both approaches, unalignment accounting and prolonged duration, and
both allowed me to run 10 iterations of the test_cpu tests without any
failures. I will use the simpler prolonged duration approach in v3 if
that is okay.

Thanks,
Joe

> 
> (I'm personally wondering what is bigger quantity: systemic error due to
> HZ quantization or random (SMP) error.)
> 
> Thanks,
> Michal
> 
> [1] Even there one runs into noise depending on nr_cpus, thus even that
>     fixed err=10 is not ideal.
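
A minimal sketch of the prolonged-duration variant, assuming the test keeps CONFIG_HZ=1000 as its reference configuration. The function name, the REFERENCE_HZ macro, and the fallback for an unreadable CONFIG_HZ (the quoted helper returns -1 in that case) are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>

#define USEC_PER_SEC	1000000LL
#define REFERENCE_HZ	1000LL	/* config where the 1s run works well enough */

/*
 * Scale the run time inversely with HZ so that the number of ticks in the
 * measurement window stays roughly the same across configs.
 * Assumption: if HZ could not be determined (hz <= 0), keep the
 * unscaled duration rather than dividing by a bogus value.
 */
static long long scaled_duration_usec(long duration_seconds, long hz)
{
	assert(duration_seconds > 0);

	if (hz <= 0)
		hz = REFERENCE_HZ;

	return duration_seconds * USEC_PER_SEC * REFERENCE_HZ / hz;
}
```

For a 1 second base duration this yields 1000000 usec at HZ=1000, 4000000 usec at HZ=250, and 10000000 usec at HZ=100, so the test gets proportionally longer on coarse-tick kernels.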




* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Yosry Ahmed @ 2026-06-23 19:01 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260612193738.2183968-4-nphamcs@gmail.com>

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 466f8a182716..5daff7a25f67 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -993,6 +993,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         struct folio *folio;
>         struct mempolicy *mpol;
>         struct swap_info_struct *si;
> +       swp_entry_t phys = {};
>         int ret = 0;
>
>         /* try to allocate swap cache folio */
> @@ -1000,16 +1001,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         if (!si)
>                 return -EEXIST;
>
> -       /*
> -        * Vswap entries have no physical backing - writeback would fail
> -        * and SIGBUS the caller. Bail before we waste a swap-cache folio
> -        * allocation.
> -        */
> -       if (si->flags & SWP_VSWAP) {
> -               put_swap_device(si);
> -               return -EINVAL;
> -       }
> -
>         mpol = get_task_policy(current);
>         folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
>                                        NO_INTERLEAVE_INDEX);
> @@ -1028,40 +1019,78 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         /*
>          * folio is locked, and the swapcache is now secured against
>          * concurrent swapping to and from the slot, and concurrent
> -        * swapoff so we can safely dereference the zswap tree here.
> -        * Verify that the swap entry hasn't been invalidated and recycled
> -        * behind our backs, to avoid overwriting a new swap folio with
> -        * old compressed data. Only when this is successful can the entry
> -        * be dereferenced.
> +        * swapoff so we can safely dereference the zswap tree (or vswap
> +        * vtable) here. Verify that the swap entry hasn't been
> +        * invalidated and recycled behind our backs, to avoid overwriting
> +        * a new swap folio with old compressed data. Only when this is
> +        * successful can the entry be dereferenced.
>          */
> -       tree = swap_zswap_tree(swpentry);
> -       if (entry != xa_load(tree, offset)) {
> -               ret = -ENOMEM;
> -               goto out;
> +       if (swap_is_vswap(si)) {
> +               if (entry != vswap_zswap_load(swpentry)) {
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +               /*
> +                * Allocate physical backing BEFORE decompress - if it fails,
> +                * no wasted work. folio_realloc_swap sets vtable to PHYS,
> +                * overwriting ZSWAP - the old entry pointer is only held
> +                * by the caller now.
> +                */
> +               phys = folio_realloc_swap(folio);
> +               if (!phys.val) {
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +       } else {
> +               tree = swap_zswap_tree(swpentry);
> +               if (entry != xa_load(tree, offset)) {
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }

There's a lot of divergence in the code (in this patch and previous
ones). Seems like a lot of it is to do xarray operations vs vswap
operations. I wonder if we can abstract these into helpers, e.g.
zswap_tree_store(), zswap_tree_load(), etc. Maybe the name is not the
best, but you get the point :)

Here we can then do zswap_tree_load() for both code paths and only the
folio_realloc_swap() needs to be different for vswap. We can do
similar cleanups for the load/store paths as well.

>         }
>
>         if (!zswap_decompress(entry, folio)) {
>                 ret = -EIO;
> +               /*
> +                * For vswap: folio_realloc_swap already moved the entry
> +                * out of the vtable. Restore it via vswap_zswap_store so
> +                * the entry stays tracked (and the just-allocated PHYS
> +                * slot is freed). For non-vswap: entry is still in the
> +                * zswap tree.
> +                */
> +               if (swap_is_vswap(si) && phys.val)
> +                       vswap_zswap_store(swpentry, entry);

Should this go in the cleanup path instead (i.e. in the 'out' label?).

>                 goto out;
>         }
>
> -       xa_erase(tree, offset);
> +       if (!swap_is_vswap(si))
> +               xa_erase(tree, offset);

Maybe this can also be abstracted into a helper, but I wonder what the
corresponding vswap operation would be. I think folio_realloc_swap()
will have already "erased" the zswap entry from vswap. Maybe have a
vswap helper that will only remove it if it's a zswap entry? We can
probably do a lockless check first to make it cheap?

It's probably silly to do this, and maybe there's a better way.
Generally, I think the code would be easier to follow if we abstract
away the xarray vs. vswap stuff into helpers (where it's reasonable).

>
>         count_vm_event(ZSWPWB);
>         if (entry->objcg)
>                 count_objcg_events(entry->objcg, ZSWPWB, 1);
>
> -       zswap_entry_free(entry);
> -
>         /* folio is up to date */
>         folio_mark_uptodate(folio);
>
>         /* move it to the tail of the inactive list after end_writeback */
>         folio_set_reclaim(folio);
>
> -       /* start writeback */
> -       ret = __swap_writepage(folio, NULL);
> -       WARN_ON_ONCE(ret);
> +       /*
> +        * Start writeback. __swap_writepage_phys is void; __swap_writepage
> +        * returns 0 today (async IO errors surface in the bio end_io
> +        * callback). Either way the entry has been moved out of its prior
> +        * location (vtable PHYS for vswap, removed from tree otherwise),
> +        * so we own the free.
> +        */
> +       if (swap_is_vswap(si)) {
> +               __swap_writepage_phys(folio, NULL, phys);
> +       } else {
> +               ret = __swap_writepage(folio, NULL);
> +               WARN_ON_ONCE(ret);
> +       }
> +
> +       zswap_entry_free(entry);
>
>  out:
>         if (ret) {
> @@ -1212,6 +1241,18 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
>                 return 0;
>
> +       /*
> +        * With CONFIG_VSWAP, vswap-backed zswap entries need a physical
> +        * swap slot allocated on demand (via folio_realloc_swap) for
> +        * writeback. If no physical slots are available, writeback will
> +        * fail - skip the shrinker to avoid spinning on entries we cannot
> +        * drain. Vanilla zswap-on-swapfile is unaffected because every
> +        * zswap entry already has a backing slot; gate on CONFIG_VSWAP so
> +        * the check compiles out there.
> +        */
> +       if (IS_ENABLED(CONFIG_VSWAP) && !get_nr_swap_pages())
> +               return 0;
> +
>         /*
>          * The shrinker resumes swap writeback, which will enter block
>          * and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS
> @@ -1558,7 +1599,7 @@ bool zswap_store(struct folio *folio)
>          * writeback could overwrite the new data in the swapfile.
>          */
>         if (partial_store && is_vswap_entry(swp))
> -               folio_release_vswap_backing(folio);
> +               folio_release_non_phys_swap_backing(folio);
>         else if (!ret && !is_vswap_entry(swp)) {
>                 unsigned type = swp_type(swp);
>                 pgoff_t offset = swp_offset(swp);
> --
> 2.53.0-Meta
>


* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-23 18:56 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <CAO9r8zOQQwDxmHGsRw_9k2eu6r=HU_HiXxbB4cbpwhc1GGgHOw@mail.gmail.com>

On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@kernel.org> wrote:

> > To get back to the question of how the auto-tuning should work, the
> > main question is to which ratio we scale the swap limits to.
> > Do we set the swap limits proportional to how much swap is present
> > in the system, or how much swap is available to the cgroup?
> >
> > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > respectively, how much should a cgroup with swap.max = 10G have if
> > it is limited to tiers A and B?
> >
> > This is what I was getting at earlier when I said we have to calculate
> > different ratios for different cgroups, based on what tiers they have
> > access to.
> 
> That's a good question. I think the case that is particularly
> interesting is whether or not the limits of other tiers should change
> when another tier is disabled/enabled.
> 
> So basically in your example, assuming everything starts as "max",
> when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> userspace sets the limit of tier C to 0, should the limits for tiers A
> and B change?
> 
> On one hand, it's simpler to just keep the autoscaled limits unchanged
> in this case. However, this means that the effective swap limit is now
> 8G, which is not great :/
> 
> The alternative is to recalculate all the limits when one of them
> changes, in which case the limits of A and B would change to 6.25G and
> 3.75G. But I don't know if this will work well if we allow custom
> limits. What happens if the limit of tier C is written as 1 (or 4096)
> instead of 0? It's effectively the same scenario, but the tier is
> technically allowed.

I think the one problem with this is that it becomes quite easy to
accidentally overcommit. As a toy example, if you have 10 workloads and
100G swap (as in the example I gave above), intuitively setting
swap.max = 10G for all 10 workloads shouldn't ever cause any contention
on capacity. But if you start excluding some tiers from some workloads,
you actually get overcommitting on the tiers that can service the
most workloads.

I am not sure how concerning swap overcommit was, but at least in the
memory tiering scenario accidental overcommitting of toptier memory
seemed bad enough that I wanted to avoid the problem entirely.
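
To put numbers on the overcommit concern, here is a rough sketch under stated assumptions: tiers A/B/C of 50G/30G/20G, ten workloads with swap.max = 10G each, some of which are barred from one tier and have their limits rescaled across the tiers they may still use. The helper and its parameters are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TIERS 3

/*
 * Sum the per-tier limits over all workloads. nr_excluding of the
 * nr_workloads cgroups are barred from excluded_tier and get their
 * swap_max rescaled over the remaining tiers; the rest use every tier.
 */
static void committed_per_tier(const unsigned long long capacity[NR_TIERS],
			       size_t nr_workloads, size_t nr_excluding,
			       size_t excluded_tier,
			       unsigned long long swap_max,
			       unsigned long long committed[NR_TIERS])
{
	unsigned long long total = 0, reduced, full, part;
	size_t t;

	assert(excluded_tier < NR_TIERS);
	assert(nr_excluding <= nr_workloads);

	for (t = 0; t < NR_TIERS; t++)
		total += capacity[t];
	reduced = total - capacity[excluded_tier];
	assert(reduced > 0);

	for (t = 0; t < NR_TIERS; t++) {
		/* Share for a workload that may use every tier. */
		full = swap_max * capacity[t] / total;
		/* Share for a workload barred from excluded_tier. */
		part = (t == excluded_tier) ? 0 :
			swap_max * capacity[t] / reduced;
		committed[t] = (nr_workloads - nr_excluding) * full +
			       nr_excluding * part;
	}
}
```

In MiB, with five of the ten workloads excluding tier C, the committed limits come to 57600 on tier A (capacity 51200) and 34560 on tier B (capacity 30720), while tier C sits at 10240 of 20480. With no exclusions, the committed limits equal the capacities exactly.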

> The more I think about it, the more I realize it may be best to drop
> the autoscaling thing. I imagine memory tiering might run into similar
> issues too :/

And that's why I didn't include opt-in/opt-out for any of the tiers;
if you have system-wide ratios, there's no need to change the ratios
at all, and as long as the sum of your memory.limit for each workload
is under the total capacity, all tiers will also not be overcommitted.

Now, all of these complications aside, I think we might be overthinking
a bit here : -) The auto-scaling should just provide some sort of
"reasonable" default, the users can always override the per-tier
limits if they are unhappy with the autoscaled values.

In fact, maybe it even makes sense to have sum of swap tier limits >
swap.max.

(I actually recall having a really similar discussion when I was working
on weighted interleave auto-tuning a year ago, on how weights should be
set when switching between manually-set limits and relying on
auto-scaled defaults [1]. I don't think there's a need to follow this
convention, but we should think about what the expected behavior should
be if a user manually sets a limit, but later wants to go back to
auto-scaling limits). 

Anyways, I think these are important questions. Youngjun, Nhat, Shakeel,
any thoughts from you all? : -)

[1] https://lore.kernel.org/all/8734hbiq7j.fsf@DESKTOP-5N7EMDA/


* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Nhat Pham @ 2026-06-23 18:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajnRulrxAKnZavOl@google.com>

On Mon, Jun 22, 2026 at 5:23 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Fri, Jun 12, 2026 at 12:37:34PM -0700, Nhat Pham wrote:
> > Add physical swap as a backend for the virtual swap layer.
> >
> > With physical swap backing, vswap can allocate a physical slot on
> > demand when needed: as a fallback for zswap_store failures, or as
> > the destination for zswap writeback.
> >
> > Each vswap entry's physical slot is tracked via a Pointer-tagged
> > swap_table entry on the physical cluster (rmap back to the vswap
> > entry).
> >
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
>
> I didn't look through the rest of the series, but are there use cases
> for calling folio_realloc_swap() without calling vswap_zswap_load()
> first? I wonder if the realloc_swap API should take the swpentry
> directly and do the load within? Something like
> vswap_alloc_phys(swpentry, folio)?

It's also used on the swapout fallback path! If zswap rejects the page
or is disabled, and memory.zswap.writeback=1, then we allocate phys
swap space. We probably don't wanna do zswap_load() there again :)


* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed @ 2026-06-23 18:10 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <20260623004018.1864121-1-joshua.hahnjy@gmail.com>

On Mon, Jun 22, 2026 at 5:40 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Mon, 22 Jun 2026 23:46:31 +0000 Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > > > If that is the case, I think auto-scaling makes sense but can be a bit
> > > > > tricky, since there is no universal tiered ratio; each workload will
> > > > > have different tiers it can swap to, so they will all have to calculate
> > > > > their own ratios. Tiered memory limits escapes this difficulty since we
> > > > > assume all memory can be placed on all tiers, so we have a system-wide
> > > > > ratio : -)
> > > >
> > > > Hmm I don't follow. It's also possible (maybe not initially) that a
> > > > memcg cannot use specific memory tiers, right? I am not sure what the
> > > > difference is.
> > >
> > > You're right, I was speaking more to the current state of memory tiers.
> > > > The majority of the feedback I received was that we already have too
> > > many memcg knobs, so I just opted to make tiered memcg limits a
> > > cgroup mount, with no ability for individual memcgs to tune their
> > > limits or opt-in/out.
> >
> > Right, I think this is similar to the approach taken here. We have a
> > single interface for per-tier limits. The main difference is that we're
> > allowing 0/max values to disable/enable different swap tiers per-memcg,
> > as there's a use case for that.
> >
> > Seems like for memory tiering there's no use case for that yet.
>
> Yes, I would agree with that.
>
> > > What do you think Yosry? Would it make sense for us to be able to
> > > tune these values? Personally I think it makes sense but just wanted to
> > > make the basic features merged before I went to push for making those
> > > knobs tunable.
> >
> > Right now we're not proposing to allow tuning swap tier limits either,
> > just enable or disable a tier. My main question is about the default
> > values.
> >
> > IIUC, for memory tiering, if you set memory.max, then the limits for
> > tiers are auto-scaled. I think it makes sense to do the same for swap
> > tiers for consistency. Or am I wrong about the memory tiering limits
> > behavior?
>
> No, you're right about that. Sorry for steering the thread to my
> series ;-)
>
> To get back to the question of how the auto-tuning should work, the
> main question is to which ratio we scale the swap limits to.
> Do we set the swap limits proportional to how much swap is present
> in the system, or how much swap is available to the cgroup?
>
> So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> respectively, how much should a cgroup with swap.max = 10G have if
> it is limited to tiers A and B?
>
> This is what I was getting at earlier when I said we have to calculate
> different ratios for different cgroups, based on what tiers they have
> access to.

That's a good question. I think the case that is particularly
interesting is whether or not the limits of other tiers should change
when another tier is disabled/enabled.

So basically in your example, assuming everything starts as "max",
when swap.max is set to 10G, the autoscaled limits would be: (tier A,
5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
userspace sets the limit of tier C to 0, should the limits for tiers A
and B change?

On one hand, it's simpler to just keep the autoscaled limits unchanged
in this case. However, this means that the effective swap limit is now
8G, which is not great :/

The alternative is to recalculate all the limits when one of them
changes, in which case the limits of A and B would change to 6.25G and
3.75G. But I don't know if this will work well if we allow custom
limits. What happens if the limit of tier C is written as 1 (or 4096)
instead of 0? It's effectively the same scenario, but the tier is
technically allowed.

The more I think about it, the more I realize it may be best to drop
the autoscaling thing. I imagine memory tiering might run into similar
issues too :/
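
To make the arithmetic above concrete, here is a minimal userspace
sketch of proportional autoscaling. It is only an illustration: the
helper name autoscale_tier_limits() and its parameters are hypothetical
and do not come from the series. Sizes are in MiB, so 10G is 10240.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the series): split swap_max across
 * the enabled tiers in proportion to each tier's capacity. Disabled
 * tiers get a limit of 0. Values are assumed small enough (e.g. MiB)
 * that swap_max * capacity[i] does not overflow 64 bits.
 */
static void autoscale_tier_limits(const uint64_t *capacity,
				  const int *enabled, size_t nr_tiers,
				  uint64_t swap_max, uint64_t *limit)
{
	uint64_t total = 0;
	size_t i;

	assert(capacity && enabled && limit);

	/* Only tiers the cgroup may use contribute to the ratio. */
	for (i = 0; i < nr_tiers; i++)
		if (enabled[i])
			total += capacity[i];

	for (i = 0; i < nr_tiers; i++) {
		if (!enabled[i] || !total)
			limit[i] = 0;
		else
			limit[i] = swap_max * capacity[i] / total;
	}
}
```

With capacities of 51200, 30720 and 20480 MiB and swap_max = 10240 MiB,
this gives 5120/3072/2048 MiB when all tiers are enabled, and
6400/3840/0 MiB once tier C is disabled, matching the 6.25G and 3.75G
figures above. It does not answer what to do with a custom limit of 1
or 4096 on tier C.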

^ permalink raw reply

* [PATCH v4 3/5] mm/page_counter: introduce page_counter_try_charge_stock()
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-1-joshua.hahnjy@gmail.com>

Add a stock-aware variant of page_counter_try_charge.

As before with try_charge_memcg, it first tries to satisfy the charge by
consuming the per-cpu stock (and skipping the hierarchical charge). On a
miss, it tries to greedily overcharge up to counter->batch pages to
refill the stock. Finally, if this fails, it tries to charge exactly the
requested number of pages.

The number of pages that were charged to the page_counter is reported
back to the caller, so that stock hits don't trigger memory limit
checks.

The variant is unused for now; memcg is converted in a later patch.
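
The three steps described above can be modelled in userspace roughly as
follows. This is a simplified sketch and not the kernel code: struct
toy_counter, toy_try_charge() and toy_try_charge_stock() are
hypothetical names, there is a single non-hierarchical counter, and one
C11 atomic_int stands in for the stock of the current CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page_counter plus one CPU's stock. */
struct toy_counter {
	unsigned long usage;	/* pages charged to the counter */
	unsigned long max;	/* hard limit, in pages */
	unsigned long batch;	/* refill target, in pages */
	atomic_int stock;	/* pre-charged pages cached locally */
};

/* Plain charge against the limit, no stock involved. */
static bool toy_try_charge(struct toy_counter *c, unsigned long nr)
{
	if (c->usage + nr > c->max)
		return false;
	c->usage += nr;
	return true;
}

static bool toy_try_charge_stock(struct toy_counter *c,
				 unsigned long nr_pages,
				 unsigned long *nr_charged)
{
	unsigned long charge = 0;
	int old = atomic_load(&c->stock);

	assert(nr_charged);

	/* 1. Stock hit: consume cached pages, nothing new is charged. */
	while (old >= 0 && (unsigned long)old >= nr_pages) {
		if (atomic_compare_exchange_weak(&c->stock, &old,
						 old - (int)nr_pages))
			goto out;
	}

	/* 2. Miss: greedily charge a full batch, stash the surplus. */
	charge = c->batch > nr_pages ? c->batch : nr_pages;
	if (charge > nr_pages && toy_try_charge(c, charge)) {
		atomic_fetch_add(&c->stock, (int)(charge - nr_pages));
		goto out;
	}

	/* 3. Fall back to charging exactly what was asked for. */
	charge = nr_pages;
	if (!toy_try_charge(c, charge))
		return false;
out:
	*nr_charged = charge;
	return true;
}
```

In this model a stock hit reports 0 pages charged, which is what lets a
caller skip the limit checks on that path.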

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/page_counter.h |  4 +++
 mm/page_counter.c            | 48 ++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 4abc7fe7c3494..b97b5491447e4 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -84,6 +84,10 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
 bool page_counter_try_charge(struct page_counter *counter,
 			     unsigned long nr_pages,
 			     struct page_counter **fail);
+bool page_counter_try_charge_stock(struct page_counter *counter,
+				   unsigned long nr_pages,
+				   struct page_counter **fail,
+				   unsigned long *nr_charged);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 6bb48a913a90d..cce3af3f19e03 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -172,6 +172,54 @@ bool page_counter_try_charge(struct page_counter *counter,
 	return false;
 }
 
+bool page_counter_try_charge_stock(struct page_counter *counter,
+				   unsigned long nr_pages,
+				   struct page_counter **fail,
+				   unsigned long *nr_charged)
+{
+	struct page_counter_stock *stock;
+	unsigned long charge = 0;
+	int old;
+
+	if (!counter->stock)
+		goto charge_exact;
+
+	preempt_disable();
+	stock = this_cpu_ptr(counter->stock);
+	old = atomic_read(&stock->nr_pages);
+	while ((unsigned long)old >= nr_pages) {
+		if (atomic_try_cmpxchg(&stock->nr_pages, &old,
+				       old - (int)nr_pages)) {
+			preempt_enable();
+			goto out_success;
+		}
+	}
+	preempt_enable();
+
+	charge = max_t(unsigned long, READ_ONCE(counter->batch), nr_pages);
+	if (charge <= nr_pages)
+		goto charge_exact;
+
+	if (page_counter_try_charge(counter, charge, fail)) {
+		preempt_disable();
+		stock = this_cpu_ptr(counter->stock);
+		atomic_add((int)(charge - nr_pages), &stock->nr_pages);
+		preempt_enable();
+		goto out_success;
+	}
+
+charge_exact:
+	/* stock is not enabled, no need for surplus, or greedy charge failed */
+	charge = nr_pages;
+	if (!page_counter_try_charge(counter, charge, fail))
+		return false;
+
+out_success:
+	if (nr_charged)
+		*nr_charged = charge;
+	return true;
+}
+
 /**
  * page_counter_uncharge - hierarchically uncharge pages
  * @counter: counter
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v4 5/5] mm/memcontrol: remove unused memcg_stock code
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-1-joshua.hahnjy@gmail.com>

Now that all memcg_stock logic has been moved to page_counter_stock, we
can remove all code related to handling memcg_stock. Note that obj_stock
is untouched and is still needed. FLUSHING_CACHED_CHARGE is preserved
so that it can be used by obj_stock as well.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/memcontrol.c | 186 ------------------------------------------------
 1 file changed, 186 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 846800917af49..762fb8914c308 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1998,25 +1998,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 	pr_cont(" are going to be killed due to memory.oom.group set\n");
 }
 
-/*
- * The value of NR_MEMCG_STOCK is selected to keep the cached memcgs and their
- * nr_pages in a single cacheline. This may change in future.
- */
-#define NR_MEMCG_STOCK 7
 #define FLUSHING_CACHED_CHARGE	0
-struct memcg_stock_pcp {
-	local_trylock_t lock;
-	uint8_t nr_pages[NR_MEMCG_STOCK];
-	struct mem_cgroup *cached[NR_MEMCG_STOCK];
-
-	struct work_struct work;
-	unsigned long flags;
-	uint8_t drain_idx;
-};
-
-static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
-	.lock = INIT_LOCAL_TRYLOCK(lock),
-};
 
 /*
  * NR_OBJ_STOCK is sized so the entire hot path of obj_stock_pcp
@@ -2065,47 +2047,6 @@ static void drain_obj_stock(struct obj_stock_pcp *stock);
 static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
-/**
- * consume_stock: Try to consume stocked charge on this cpu.
- * @memcg: memcg to consume from.
- * @nr_pages: how many pages to charge.
- *
- * Consume the cached charge if enough nr_pages are present otherwise return
- * failure. Also return failure for charge request larger than
- * MEMCG_CHARGE_BATCH or if the local lock is already taken.
- *
- * returns true if successful, false otherwise.
- */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	struct memcg_stock_pcp *stock;
-	uint8_t stock_pages;
-	bool ret = false;
-	int i;
-
-	if (nr_pages > MEMCG_CHARGE_BATCH ||
-	    !local_trylock(&memcg_stock.lock))
-		return ret;
-
-	stock = this_cpu_ptr(&memcg_stock);
-
-	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
-		if (memcg != READ_ONCE(stock->cached[i]))
-			continue;
-
-		stock_pages = READ_ONCE(stock->nr_pages[i]);
-		if (stock_pages >= nr_pages) {
-			WRITE_ONCE(stock->nr_pages[i], stock_pages - nr_pages);
-			ret = true;
-		}
-		break;
-	}
-
-	local_unlock(&memcg_stock.lock);
-
-	return ret;
-}
-
 static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	page_counter_uncharge(&memcg->memory, nr_pages);
@@ -2113,51 +2054,6 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
-/*
- * Returns stocks cached in percpu and reset cached information.
- */
-static void drain_stock(struct memcg_stock_pcp *stock, int i)
-{
-	struct mem_cgroup *old = READ_ONCE(stock->cached[i]);
-	uint8_t stock_pages;
-
-	if (!old)
-		return;
-
-	stock_pages = READ_ONCE(stock->nr_pages[i]);
-	if (stock_pages) {
-		memcg_uncharge(old, stock_pages);
-		WRITE_ONCE(stock->nr_pages[i], 0);
-	}
-
-	css_put(&old->css);
-	WRITE_ONCE(stock->cached[i], NULL);
-}
-
-static void drain_stock_fully(struct memcg_stock_pcp *stock)
-{
-	int i;
-
-	for (i = 0; i < NR_MEMCG_STOCK; ++i)
-		drain_stock(stock, i);
-}
-
-static void drain_local_memcg_stock(struct work_struct *dummy)
-{
-	struct memcg_stock_pcp *stock;
-
-	if (WARN_ONCE(!in_task(), "drain in non-task context"))
-		return;
-
-	local_lock(&memcg_stock.lock);
-
-	stock = this_cpu_ptr(&memcg_stock);
-	drain_stock_fully(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
-
-	local_unlock(&memcg_stock.lock);
-}
-
 static void drain_local_obj_stock(struct work_struct *dummy)
 {
 	struct obj_stock_pcp *stock;
@@ -2174,88 +2070,6 @@ static void drain_local_obj_stock(struct work_struct *dummy)
 	local_unlock(&obj_stock.lock);
 }
 
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *cached;
-	uint8_t stock_pages;
-	bool success = false;
-	int empty_slot = -1;
-	int i;
-
-	/*
-	 * For now limit MEMCG_CHARGE_BATCH to 127 and less. In future if we
-	 * decide to increase it more than 127 then we will need more careful
-	 * handling of nr_pages[] in struct memcg_stock_pcp.
-	 */
-	BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S8_MAX);
-
-	VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));
-
-	if (nr_pages > MEMCG_CHARGE_BATCH ||
-	    !local_trylock(&memcg_stock.lock)) {
-		/*
-		 * In case of larger than batch refill or unlikely failure to
-		 * lock the percpu memcg_stock.lock, uncharge memcg directly.
-		 */
-		memcg_uncharge(memcg, nr_pages);
-		return;
-	}
-
-	stock = this_cpu_ptr(&memcg_stock);
-	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
-		cached = READ_ONCE(stock->cached[i]);
-		if (!cached && empty_slot == -1)
-			empty_slot = i;
-		if (memcg == READ_ONCE(stock->cached[i])) {
-			stock_pages = READ_ONCE(stock->nr_pages[i]) + nr_pages;
-			WRITE_ONCE(stock->nr_pages[i], stock_pages);
-			if (stock_pages > MEMCG_CHARGE_BATCH)
-				drain_stock(stock, i);
-			success = true;
-			break;
-		}
-	}
-
-	if (!success) {
-		i = empty_slot;
-		if (i == -1) {
-			i = stock->drain_idx++;
-			if (stock->drain_idx == NR_MEMCG_STOCK)
-				stock->drain_idx = 0;
-			drain_stock(stock, i);
-		}
-		css_get(&memcg->css);
-		WRITE_ONCE(stock->cached[i], memcg);
-		WRITE_ONCE(stock->nr_pages[i], nr_pages);
-	}
-
-	local_unlock(&memcg_stock.lock);
-}
-
-static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock,
-				  struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *memcg;
-	bool flush = false;
-	int i;
-
-	rcu_read_lock();
-	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
-		memcg = READ_ONCE(stock->cached[i]);
-		if (!memcg)
-			continue;
-
-		if (READ_ONCE(stock->nr_pages[i]) &&
-		    mem_cgroup_is_descendant(memcg, root_memcg)) {
-			flush = true;
-			break;
-		}
-	}
-	rcu_read_unlock();
-	return flush;
-}
-
 static void schedule_drain_work(int cpu, struct work_struct *work)
 {
 	/*
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-1-joshua.hahnjy@gmail.com>

Now with all of the memcg_stock handling logic replicated in
page_counter_stock, switch memcg to use the page_counter_stock for the
memory (and for cgroup v1 users, memsw) page_counters.

There are a few details that have changed:

First, the old special-casing for the !allow_spinning check to avoid
refilling and flushing of the old stock is removed. This special casing
was important previously, because refilling the stock could do a lot of
extra work by evicting one of 7 random victim memcgs in the percpu
memcg_stock slots. In the new per-counter design, refilling stock just
adds pages to the counter's own local cache without affecting other memcgs,
so the original reason for the special case no longer applies.

Also, we can now fail during page_counter_alloc_stock(), if there is
not enough memory to allocate a percpu page_counter_stock. This failure
is rare and nonfatal; the system can continue to operate, with the page
counter working without stock and falling back to walking the hierarchy.

drain_all_stock and memcg_hotplug_cpu_dead also now use the page_counter
stock drain variant, which uses remote atomic_xchg to retrieve stock
across CPUs, instead of scheduling asynchronous work.

Finally, as a side-effect of separating the per-memcg stock to per-
page_counter, the memsw and memory page_counters have independent stock.
This means that the reported memsw may transiently be lower than memory
usage if the stock for memory and memsw page_counters go out of sync.

Note that obj_stock is untouched by this change.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/memcontrol.c | 87 +++++++++++++++++++++++--------------------------
 1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 306658fd55512..846800917af49 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2269,39 +2269,36 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
 		queue_work_on(cpu, memcg_wq, work);
 }
 
+static void memcg_drain_stock(struct mem_cgroup *memcg, int cpu)
+{
+	page_counter_drain_stock(&memcg->memory, cpu);
+	if (do_memsw_account())
+		page_counter_drain_stock(&memcg->memsw, cpu);
+}
+
 /*
  * Drains all per-CPU charge caches for given root_memcg resp. subtree
  * of the hierarchy under it.
  */
 void drain_all_stock(struct mem_cgroup *root_memcg)
 {
+	struct mem_cgroup *memcg;
 	int cpu, curcpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
 	if (!mutex_trylock(&percpu_charge_mutex))
 		return;
-	/*
-	 * Notify other cpus that system-wide "drain" is running
-	 * We do not care about races with the cpu hotplug because cpu down
-	 * as well as workers from this path always operate on the local
-	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
-	 */
+
+	for_each_mem_cgroup_tree(memcg, root_memcg) {
+		for_each_online_cpu(cpu)
+			memcg_drain_stock(memcg, cpu);
+	}
+
 	migrate_disable();
 	curcpu = smp_processor_id();
 	for_each_online_cpu(cpu) {
-		struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu);
 		struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu);
 
-		if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) &&
-		    is_memcg_drain_needed(memcg_st, root_memcg) &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
-				      &memcg_st->flags)) {
-			if (cpu == curcpu)
-				drain_local_memcg_stock(&memcg_st->work);
-			else
-				schedule_drain_work(cpu, &memcg_st->work);
-		}
-
 		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
 		    obj_stock_flush_required(obj_st, root_memcg) &&
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
@@ -2318,9 +2315,13 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
+	struct mem_cgroup *memcg;
+
 	/* no need for the local lock */
 	drain_obj_stock(&per_cpu(obj_stock, cpu));
-	drain_stock_fully(&per_cpu(memcg_stock, cpu));
+
+	for_each_mem_cgroup(memcg)
+		memcg_drain_stock(memcg, cpu);
 
 	return 0;
 }
@@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			    unsigned int nr_pages)
 {
-	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MAX_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
 	struct page_counter *counter;
@@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool raised_max_event = false;
 	unsigned long pflags;
 	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
+	unsigned long nr_charged = 0;
 
 retry:
-	if (consume_stock(memcg, nr_pages))
-		return 0;
-
-	if (!allow_spinning)
-		/* Avoid the refill and flush of the older stock */
-		batch = nr_pages;
-
 	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
 	if (do_memsw_account() &&
-	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
+	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
+					   &counter, NULL)) {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
 		goto reclaim;
 	}
 
-	if (page_counter_try_charge(&memcg->memory, batch, &counter))
-		goto done_restock;
+	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
+					  &counter, &nr_charged)) {
+		if (!nr_charged)
+			return 0;
+		goto handle_high;
+	}
 
 	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, batch);
+		page_counter_uncharge(&memcg->memsw, nr_pages);
 	mem_over_limit = mem_cgroup_from_counter(counter, memory);
 
 reclaim:
-	if (batch > nr_pages) {
-		batch = nr_pages;
-		goto retry;
-	}
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2731,10 +2725,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	return 0;
 
-done_restock:
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
-
+handle_high:
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland.  We can perform reclaim here
@@ -2771,7 +2762,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			 * and distribute reclaim work and delay penalties
 			 * based on how much each task is actually allocating.
 			 */
-			current->memcg_nr_pages_over_high += batch;
+			current->memcg_nr_pages_over_high += nr_charged;
 			set_notify_resume(current);
 			break;
 		}
@@ -3076,7 +3067,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 	account_kmem_nmi_safe(memcg, -nr_pages);
 	memcg1_account_kmem(memcg, -nr_pages);
 	if (!mem_cgroup_is_root(memcg))
-		refill_stock(memcg, nr_pages);
+		memcg_uncharge(memcg, nr_pages);
 
 	css_put(&memcg->css);
 }
@@ -4080,6 +4071,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
+	page_counter_free_stock(&memcg->memory);
+	page_counter_free_stock(&memcg->memsw);
 	lru_gen_exit_memcg(memcg);
 	memcg_wb_domain_exit(memcg);
 	__mem_cgroup_free(memcg);
@@ -4247,6 +4240,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
 
+	/* failure is nonfatal, charges fall back to direct hierarchy */
+	page_counter_alloc_stock(&memcg->memory, MEMCG_CHARGE_BATCH);
+	if (do_memsw_account())
+		page_counter_alloc_stock(&memcg->memsw, MEMCG_CHARGE_BATCH);
+
 	/*
 	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
 	 *
@@ -5502,7 +5500,7 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
 
-	refill_stock(memcg, nr_pages);
+	page_counter_uncharge(&memcg->memory, nr_pages);
 }
 
 void mem_cgroup_flush_workqueue(void)
@@ -5555,12 +5553,9 @@ int __init mem_cgroup_init(void)
 	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);
 	WARN_ON(!memcg_wq);
 
-	for_each_possible_cpu(cpu) {
-		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-			  drain_local_memcg_stock);
+	for_each_possible_cpu(cpu)
 		INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work,
 			  drain_local_obj_stock);
-	}
 
 	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
 	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v4 2/5] mm/memcontrol: flatten try_charge_memcg control flow
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-1-joshua.hahnjy@gmail.com>

Refactor try_charge_memcg by flattening the nested memsw/memory
page_counter operations to separate the logic between the two.

When page_counter_try_charge is made stock-aware, this flattening makes
the control flow easier to follow since each page counter now has its
own success/failure paths.

No functional changes intended.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/memcontrol.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af082326..306658fd55512 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2616,18 +2616,21 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		batch = nr_pages;
 
 	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
-	if (!do_memsw_account() ||
-	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-			goto done_restock;
-		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, batch);
-		mem_over_limit = mem_cgroup_from_counter(counter, memory);
-	} else {
+	if (do_memsw_account() &&
+	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
+		goto reclaim;
 	}
 
+	if (page_counter_try_charge(&memcg->memory, batch, &counter))
+		goto done_restock;
+
+	if (do_memsw_account())
+		page_counter_uncharge(&memcg->memsw, batch);
+	mem_over_limit = mem_cgroup_from_counter(counter, memory);
+
+reclaim:
 	if (batch > nr_pages) {
 		batch = nr_pages;
 		goto retry;
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v4 1/5] mm/page_counter: introduce per-page_counter stock
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-1-joshua.hahnjy@gmail.com>

In order to avoid expensive hierarchy walks on every memcg charge and
limit check, memcontrol uses per-cpu stocks (memcg_stock_pcp) to cache
pre-charged pages and introduce a fast path to try_charge_memcg.
However, there are a few quirks with the current implementation that
could be improved upon.

First, each memcg_stock_pcp can only cache the charges of 7 memcgs
(defined as NR_MEMCG_STOCK), which means that once a CPU starts handling
the charging of more than 7 memcgs, it randomly selects a victim memcg
to evict and drain from the cpu, which can cause unnecessarily increased
latencies and thrashing as memcgs continually evict each other's stock.

Flushing a memcg's stock on a CPU also means that all other stock
present on that CPU is also flushed, leading to poor caching for systems
running multiple memcgs competing for the same CPUs.

Finally, stock is tightly coupled with memcg, which means that all page
counters in a memcg share the same resource. This may simplify some of
the charging logic, but it prevents new page counters from being added
and using a separate stock.

We can address these concerns by pushing the concept of stock down to
the page_counter level, which addresses the random eviction problem by
getting rid of the 7 slot limit, and makes enabling separate stock
caches for other page_counters simpler.

Introduce a generic per-cpu stock directly in struct page_counter.
Stock can optionally be enabled per-page_counter, limiting the overhead
increase for page_counters who do not benefit greatly from caching
charges.

In this scheme, stock usage and refills happen via lockless atomic
operations, eliminating the need for asynchronous workqueues as well.
In this commit we introduce the alloc, free, and drain operations,
although they are unused for now.
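
As a rough userspace illustration of the drain operation, the sketch
below empties a per-CPU slot with a single atomic exchange and returns
the pages to the counter. All names (struct toy_stocked_counter,
toy_drain_stock(), toy_drain_all(), TOY_NR_CPUS) are hypothetical, and
a fixed array stands in for the percpu allocation.

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count */

/* Hypothetical stand-in for a page_counter with per-CPU stock. */
struct toy_stocked_counter {
	unsigned long usage;			/* pages charged */
	atomic_int stock[TOY_NR_CPUS];		/* cached pages per CPU */
};

/*
 * Take whatever is cached on @cpu in one atomic step and give it back
 * to the counter. Returns the number of pages drained.
 */
static unsigned long toy_drain_stock(struct toy_stocked_counter *c, int cpu)
{
	int nr_pages;

	assert(cpu >= 0 && cpu < TOY_NR_CPUS);

	nr_pages = atomic_exchange(&c->stock[cpu], 0);
	if (nr_pages <= 0)
		return 0;

	c->usage -= (unsigned long)nr_pages;
	return (unsigned long)nr_pages;
}

/* Drain every CPU's slot, as a free or a full flush would. */
static void toy_drain_all(struct toy_stocked_counter *c)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_drain_stock(c, cpu);
}
```

Because the exchange both reads and zeroes the slot, a remote drainer
never needs a lock or a work item on the owning CPU in this model.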

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/page_counter.h | 15 +++++++++++++
 mm/page_counter.c            | 42 ++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d649b6bbbc871..4abc7fe7c3494 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -5,8 +5,17 @@
 #include <linux/atomic.h>
 #include <linux/cache.h>
 #include <linux/limits.h>
+#include <linux/percpu.h>
 #include <asm/page.h>

+struct page_counter_stock {
+	/*
+	 * Consumption/refills can only come from the owning cpu via
+	 * atomic_cmpxchg. Remote access only happens on drain via atomic_xchg.
+	 */
+	atomic_t nr_pages;
+};
+
 struct page_counter {
 	/*
 	 * Make sure 'usage' does not share cacheline with any other field in
@@ -41,6 +50,8 @@ struct page_counter {
 	unsigned long high;
 	unsigned long max;
 	struct page_counter *parent;
+	struct page_counter_stock __percpu *stock;
+	unsigned int batch;
 } ____cacheline_internodealigned_in_smp;

 #if BITS_PER_LONG == 32
@@ -99,6 +110,10 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 	counter->watermark = usage;
 }

+void page_counter_drain_stock(struct page_counter *counter, unsigned int cpu);
+int page_counter_alloc_stock(struct page_counter *counter, unsigned int batch);
+void page_counter_free_stock(struct page_counter *counter);
+
 #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 661e0f2a5127a..6bb48a913a90d 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -8,6 +8,7 @@
 #include <linux/page_counter.h>
 #include <linux/atomic.h>
 #include <linux/kernel.h>
+#include <linux/percpu.h>
 #include <linux/string.h>
 #include <linux/sched.h>
 #include <linux/bug.h>
@@ -289,6 +290,47 @@ int page_counter_memparse(const char *buf, const char *max,
 	return 0;
 }

+void page_counter_drain_stock(struct page_counter *counter, unsigned int cpu)
+{
+	struct page_counter_stock *stock;
+	int nr_pages;
+
+	if (!counter->stock)
+		return;
+
+	stock = per_cpu_ptr(counter->stock, cpu);
+	nr_pages = atomic_xchg(&stock->nr_pages, 0);
+	if (nr_pages)
+		page_counter_uncharge(counter, nr_pages);
+}
+
+int page_counter_alloc_stock(struct page_counter *counter, unsigned int batch)
+{
+	struct page_counter_stock __percpu *stock;
+
+	stock = alloc_percpu(struct page_counter_stock);
+	if (!stock)
+		return -ENOMEM;
+
+	counter->stock = stock;
+	counter->batch = batch;
+
+	return 0;
+}
+
+void page_counter_free_stock(struct page_counter *counter)
+{
+	int cpu;
+
+	if (!counter->stock)
+		return;
+
+	for_each_possible_cpu(cpu)
+		page_counter_drain_stock(counter, cpu);
+
+	free_percpu(counter->stock);
+	counter->stock = NULL;
+}

 #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM)
 /*
-- 
2.53.0-Meta

^ permalink raw reply related

* [PATCH v4 0/5] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter
From: Joshua Hahn @ 2026-06-23 18:01 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, cgroups,
	linux-mm, linux-kernel, kernel-team

This series is intended for the next release cycle.

v3 --> v4
=========
- Reduced memory footprint by 4x, from 16 bytes per-(cpu x memcg) to
  4 bytes per-(cpu x memcg). Each page_counter_stock is a thin wrapper
  around an atomic_t.
- Removed locking completely; stock is now consumed and refilled via
  atomic operations.
- Removed synchronous work_on_cpu. All work is done via remote
  atomic_xchgs.
- Added a patch to flatten page_counter charging in try_charge_memcg
- Split page_counter_try_charge into stocked and non-stocked variants.

INTRO
=====
Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small and frequent allocations to avoid the
expensive mem_cgroup hierarchy walk each time. This fastpath offers
real gains, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
   more than 7 mem_cgroups have stock present on a single CPU, a random
   victim is evicted and its associated stock is drained.

2. When one cgroup runs out of memory and needs to drain stock across
   all CPUs it has stock cached in, those CPUs will drain all other
   memcgs' stock present in that CPU. This leads to inefficient stock
   caching and cross-memcg interference under memory pressure.

3. Stock management is tightly coupled to struct mem_cgroup, which makes
   it difficult to add a new page_counter to mem_cgroup and have
   multiple sources of stock management.

This series moves the per-cpu stock down into page_counter which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge_stock. This eliminates the 7 memcg-per-cpu slot
limit, the random cross-memcg stock drains, and slot traversal. We also
simplify memcontrol code, since we no longer need to maintain separate
draining functions or manage the asynchronous workqueue.

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

There are a few tradeoffs, however.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_cpus x nr_cgroups x 64 pages. On large machines
with many cgroups, this could be significant.

There are 4 qualifying points:
1. Larger machines should be able to tolerate the additional overhead.
2. Stock should not remain stale as long as the cgroups are actively
   charging memory.
3. Getting anywhere close to this overhead is difficult and rare. It
   would require processes to bounce across CPUs and refill stock.
4. These charges are not "real" allocated memory, but rather accounting
   done in memcg; they are easily returned on pressure.

Secondly, we introduce a small memory footprint.
The new struct page_counter_stock is a wrapper around an atomic_t,
which adds 4 bytes of overhead per-(cpu x memcg). On a 1024-CPU,
1024-memcg system, this adds 4MB of overhead. Smaller machines will
see much smaller overhead.
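
The two bounds above are plain multiplications, so they can be checked
with a few lines of C. This is only a sketch of the arithmetic: the
helper names are hypothetical and do not appear in the series, and the
constants are the ones quoted in this cover letter.

```c
#include <assert.h>
#include <stdint.h>

#define STOCK_PAGES	64	/* pages cached per stock, as stated above */
#define OLD_SLOTS	7	/* NR_MEMCG_STOCK, as stated above */
#define STOCK_BYTES	4	/* one atomic_t per (cpu x memcg), as stated above */

/* Old worst-case stale stock, in pages: nr_cpus x 7 x 64. */
static uint64_t old_bound_pages(uint64_t nr_cpus)
{
	return nr_cpus * OLD_SLOTS * STOCK_PAGES;
}

/* New worst-case stale stock, in pages: nr_cpus x nr_cgroups x 64. */
static uint64_t new_bound_pages(uint64_t nr_cpus, uint64_t nr_cgroups)
{
	assert(nr_cpus > 0 && nr_cgroups > 0);
	return nr_cpus * nr_cgroups * STOCK_PAGES;
}

/* Extra memory for the per-cpu stock, in bytes: 4 x nr_cpus x nr_memcgs. */
static uint64_t stock_footprint_bytes(uint64_t nr_cpus, uint64_t nr_memcgs)
{
	return nr_cpus * nr_memcgs * STOCK_BYTES;
}
```

For 1024 CPUs and 1024 cgroups this gives 458752 pages for the old bound,
67108864 pages for the new one, and a footprint of 4194304 bytes (4MB).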

One small side effect for cgroupv1: this series decouples swap for
the memory and memsw page_counters. Since stock charging can go out of
sync, this means that users can transiently see memsw usage go below
memory usage.

Finally, by moving from asynchronous workqueue scheduling for draining
to synchronous atomic_xchg, drain_all_stock holds the
percpu_charge_mutex longer while it performs the work. In theory, this
makes chargers more likely to fail to grab the mutex, exhaust
MAX_RECLAIM_RETRIES, and OOM. In practice, I have not been able to
replicate this behavior in my experiments.

The series was built on top of latest akpm/mm-new as of Jun 23 2026,
which is cdad4d4e4fc2e "mm/swap, PM: hibernate: atomically replace
hibernation pin".

TESTING
=======
I tried to demonstrate the worst-case overhead this series introduces
by writing a microbenchmark that pins multiple cgroup jobs to a single
CPU and repeatedly faults and releases anon pages using
madvise(MADV_DONTNEED) in each cgroup. The data was collected over
30 trials of 15 iterations. The metric is the time each iteration took (ms).

+----------+--------+-------+-----------+--------------+
| #cgroups | before | after | delta (%) |      CoV (%) |
+----------+--------+-------+-----------+--------------+
|        1 |    112 |   112 |     0.000 |          1.1 |
|        4 |    443 |   451 |    +1.806 |          1.1 |
|       32 |   3512 |  3584 |    +2.051 |          2.0 |
+----------+--------+-------+-----------+--------------+

It appears as though there is some small regression, although the
magnitude is similar to the coefficient of variation (stddev / mean).

CHANGELOG
=========
v2 --> v3:
- Dropped the cgroup v2 optimization, since it could indeed lead to too
  much time held with the cgroup_mutex. Instead we let the stock
  accumulate in the parent cgroups, which is not so bad; charges can
  still land on these cgroups, and if we ever reach the mem_cgroup
  limit, we can easily return those charges.
- page_counter_disable_stock no longer drains, just prevents
  accumulating stock. The actual draining is done in the free_stock
  variant, where we know for sure there are no in-flight charges.
- Reordered the page_counter_disable_stock path to disable before
  draining, so that no new stock accumulates in the meantime.
- Skip isolated CPUs when draining synchronously
- Rebase on newest mm-new
- Wordsmithing

v1 --> v2:
- Dropped stock returning on uncharge to preserve same behavior as memcg
  stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
  in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
  memcgs, to reduce the number of synchronous CPU work scheduling
- Optimized cgroup v2 further to drain only on the first child and skip
  the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@gmail.com/

Joshua Hahn (5):
  mm/page_counter: introduce per-page_counter stock
  mm/memcontrol: flatten try_charge_memcg control flow
  mm/page_counter: introduce page_counter_try_charge_stock()
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  19 +++
 mm/memcontrol.c              | 280 ++++++-----------------------------
 mm/page_counter.c            |  90 +++++++++++
 3 files changed, 155 insertions(+), 234 deletions(-)

-- 
2.53.0-Meta

^ permalink raw reply

* Re: [PATCH v6 6/6] drm/amdgpu: Wire up dmem cgroup reclaim for VRAM manager
From: Thadeu Lima de Souza Cascardo @ 2026-06-23 17:45 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Natalie Vock, Johannes Weiner, Tejun Heo,
	Michal Koutný, cgroups, Huang Rui, Matthew Brost,
	Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <20260611173301.17473-7-thomas.hellstrom@linux.intel.com>

On Thu, Jun 11, 2026 at 07:33:01PM +0200, Thomas Hellström wrote:
> Register the VRAM manager with the dmem cgroup reclaim infrastructure
> so that lowering dmem.max below current VRAM usage triggers TTM
> eviction rather than failing with -EBUSY.
> 
> Guard place->flags in amdgpu_ttm_bo_eviction_valuable() against NULL,
> as the TTM reclaim path passes a NULL place in cgroup drain mode.
> 
> v3:
> - Rebased on fix for uninitialized list and buddy allocator on the
>   drmm_cgroup_register_region() error path.
> 
> v5:
> - Rebased on the introduction of struct dmem_cgroup_init.
> - Clear the reclaim callback in amdgpu_vram_mgr_fini() to prevent
>   use-after-free if cgroup reclaim is triggered after driver unbind
>   while userspace holds an open DRM file descriptor. (Sashiko-bot)
> - Switch from drmm_cgroup_register_region() to the raw
>   dmem_cgroup_register_region() and store the region in
>   amdgpu_vram_mgr.cg_region. Call dmem_cgroup_unregister_region()
>   in amdgpu_vram_mgr_fini() after ttm_resource_manager_evict_all()
>   to drain in-flight reclaim callbacks, and clear man->cg afterwards.
>   This is required because amdgpu's vram manager fini is called
>   explicitly during driver unbind, which may precede the DRM device
>   release and thus precede any drmm-based cleanup. (Sashiko-bot)
> 
> v6:
> - Fix mgr->cg_region never being assigned, so
>   dmem_cgroup_unregister_region() in fini silently no-ops on NULL
>   and leaks the region. (Sashiko-bot)
> - Reorder fini to call set_used(false) and evict_all() before
>   dmem_cgroup_unregister_region(), so ttm_resource_free() can
>   uncharge via man->cg during eviction; clear man->cg after
>   unregister. (Sashiko-bot)
> 
> Assisted-by: GitHub_Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Hi, Thomas.

I needed the following fixup for this. Otherwise, it regresses on the dmem
region name.

Regards.
Cascardo


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2250bab0970d..d93bb88e8b25 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -942,7 +942,8 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
 					     .size = adev->gmc.real_vram_size,
 					     .ops = &amdgpu_vram_mgr_dmem_ops,
 					     .reclaim_priv = man,
-					 }, "vram");
+					 }, "drm/%s/vram",
+					    adev_to_drm(adev)->unique);
 	if (IS_ERR(cg))
 		return PTR_ERR(cg);
 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 31 ++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h |  2 ++
>  3 files changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 2740de94e93c..8cbcd33f51a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1488,7 +1488,7 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
>  	dma_resv_for_each_fence(&resv_cursor, bo->base.resv,
>  				DMA_RESV_USAGE_BOOKKEEP, f) {
>  		if (amdkfd_fence_check_mm(f, current->mm) &&
> -		    !(place->flags & TTM_PL_FLAG_CONTIGUOUS))
> +		    !(place && (place->flags & TTM_PL_FLAG_CONTIGUOUS)))
>  			return false;
>  	}
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 08f05c3aed1d..2250bab0970d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -906,6 +906,10 @@ static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
>  	.debug	= amdgpu_vram_mgr_debug
>  };
>  
> +static const struct dmem_cgroup_ops amdgpu_vram_mgr_dmem_ops = {
> +	.reclaim = ttm_resource_manager_dmem_reclaim,
> +};
> +
>  /**
>   * amdgpu_vram_mgr_init - init VRAM manager and DRM MM
>   *
> @@ -917,6 +921,7 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_vram_mgr *mgr = &adev->mman.vram_mgr;
>  	struct ttm_resource_manager *man = &mgr->manager;
> +	struct dmem_cgroup_region *cg;
>  	int err;
>  
>  	ttm_resource_manager_init(man, &adev->mman.bdev,
> @@ -933,12 +938,16 @@ int amdgpu_vram_mgr_init(struct amdgpu_device *adev)
>  	if (err)
>  		return err;
>  
> -	man->cg = drmm_cgroup_register_region(adev_to_drm(adev), "vram",
> -					      &(struct dmem_cgroup_init){
> -						.size = adev->gmc.real_vram_size,
> -					      });
> -	if (IS_ERR(man->cg))
> -		return PTR_ERR(man->cg);
> +	cg = dmem_cgroup_register_region(&(struct dmem_cgroup_init){
> +					     .size = adev->gmc.real_vram_size,
> +					     .ops = &amdgpu_vram_mgr_dmem_ops,
> +					     .reclaim_priv = man,
> +					 }, "vram");
> +	if (IS_ERR(cg))
> +		return PTR_ERR(cg);
> +
> +	mgr->cg_region = cg;
> +	ttm_resource_manager_set_dmem_region(man, cg);
>  
>  	ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);
>  	ttm_resource_manager_set_used(man, true);
> @@ -966,6 +975,16 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
>  	if (ret)
>  		return;
>  
> +	/*
> +	 * Drain any in-flight dmem cgroup reclaim callbacks and remove the
> +	 * region from the global list.  This must happen after evict_all()
> +	 * so that ttm_resource_free() can still uncharge via man->cg while
> +	 * BOs are being evicted.
> +	 */
> +	dmem_cgroup_unregister_region(mgr->cg_region);
> +	mgr->cg_region = NULL;
> +	man->cg = NULL;
> +
>  	mutex_lock(&mgr->lock);
>  	list_for_each_entry_safe(rsv, temp, &mgr->reservations_pending, blocks)
>  		kfree(rsv);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> index 429a21a2e9b2..07103cddb335 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h
> @@ -36,6 +36,8 @@ struct amdgpu_vram_mgr {
>  	atomic64_t vis_usage;
>  	u64 default_page_size;
>  	struct list_head allocated_vres_list;
> +	/** @cg_region: dmem cgroup region for VRAM; unregistered in fini. */
> +	struct dmem_cgroup_region *cg_region;
>  };
>  
>  struct amdgpu_vres_task {
> -- 
> 2.54.0
> 

^ permalink raw reply related

* Re: [PATCH v7 4/9] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
From: Gregory Price @ 2026-06-23 16:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
	Li Zefan, Farhad Alemi, Andrew Morton, cgroups, linux-kernel,
	Aaron Tomlin, Guopeng Zhang, David Hildenbrand
In-Reply-To: <20260621032816.1806773-5-longman@redhat.com>

On Sat, Jun 20, 2026 at 11:28:11PM -0400, Waiman Long wrote:
> Extract the DL bandwidth allocation code in cpuset_attach() to a new
> cpuset_reserve_dl_bw() helper to simplify code.
> 
> No functional change is expected.
> 
> Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Gregory Price <gourry@gourry.net>


^ permalink raw reply

* Re: [PATCH v7 2/9] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
From: Gregory Price @ 2026-06-23 16:16 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
	Li Zefan, Farhad Alemi, Andrew Morton, cgroups, linux-kernel,
	Aaron Tomlin, Guopeng Zhang, David Hildenbrand
In-Reply-To: <20260621032816.1806773-3-longman@redhat.com>

On Sat, Jun 20, 2026 at 11:28:09PM -0400, Waiman Long wrote:
> Whenever memory node mask is changed, there are 4 places where the node
> mask has to be updated or used.
>  1) task's node mask via cpuset_change_task_nodemask()
>  2) memory policy binding via mpol_rebind_mm()
>  3) if memory migration is enabled, migrate from old_mems_allowed to
>     the new node mask via cpuset_migrate_mm().
>  4) setting old_mems_allowed
> 
> These memory actions are done in cpuset_update_tasks_nodemask() and
> cpuset_attach(). However there are inconsistencies in what node masks
> are being used in these 2 functions.
> 
> In cpuset_update_tasks_nodemask(),
>  - cpuset_change_task_nodemask(): guarantee_online_mems()
>  - mpol_rebind_mm(): mems_allowed
>  - cpuset_migrate_mm(): guarantee_online_mems()
>  - old_mems_allowed: guarantee_online_mems()
> 
> In cpuset_attach(),
>  - cpuset_change_task_nodemask(): guarantee_online_mems()
>  - mpol_rebind_mm(): effective_mems
>  - cpuset_migrate_mm(): effective_mems
>  - old_mems_allowed: effective_mems
> 
> These inconsistencies date back quite a long time and it is
> hard to say what the correct values should be.
> 
> The guarantee_online_mems() function returns a node mask from current or
> an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
> node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
> However, node in node_states[N_ONLINE] may not have memory. So
> node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].
> 
> The guarantee_online_mems() function should mostly be useful for v1
> where mems_allowed is the same as effective_mems. With v2, the memory
> nodes in effective_mems should be a subset of node_states[N_MEMORY]
> except when a memory hot-unplug operation is in progress and a memory
> node is removed from node_states[N_MEMORY] but not yet reflected in
> the effective_mems's as cpuset_handle_hotplug() has not been called
> from cpuset_track_online_nodes().
> 
> Let's use the following setup for both of them and make them consistent.
>  - cpuset_change_task_nodemask(): guarantee_online_mems()
>  - mpol_rebind_mm(): effective_mems
>  - cpuset_migrate_mm(): guarantee_online_mems()
>  - old_mems_allowed: guarantee_online_mems()
> 
> So for v2, it is effectively all effective_mems most of the time. For
> v1, mpol_rebind_mm() uses mems_allowed which may differ from what
> guarantee_online_mems() returns, but it conforms to what the cpuset v1
> documentation says with respect to setting memory policy.
> 
> Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cpuset.c | 30 ++++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply

* Re: [PATCH v7 1/9] cgroup/cpuset: rebind mm mempolicy to effective_mems, not mems_allowed
From: Gregory Price @ 2026-06-23 16:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
	Li Zefan, Farhad Alemi, Andrew Morton, cgroups, linux-kernel,
	Aaron Tomlin, Guopeng Zhang, David Hildenbrand, stable
In-Reply-To: <20260621032816.1806773-2-longman@redhat.com>

On Sat, Jun 20, 2026 at 11:28:08PM -0400, Waiman Long wrote:
> From: Farhad Alemi <farhad.alemi@berkeley.edu>
> 
> Creating a child cpuset where cpuset.mems is never set leads to a div/0
> when a VMA mempolicy with MPOL_F_RELATIVE_NODES rebinds in response to a
> CPU hotplug event.
> 
> Reproduction steps:
>  1) Create a cgroup w/ cpuset controls (do not set cpuset.mems)
>  2) Move the task into the child cpuset
>  3) Create a VMA mempolicy for that task with MPOL_F_RELATIVE_NODES
>  4) unplug and hotplug a cpu
>       echo 0 > /sys/devices/system/cpu/cpu1/online
>       echo 1 > /sys/devices/system/cpu/cpu1/online
>  5) mempolicy rebind does a div/0 in mpol_relative_nodemask on the
>     call to __nodes_fold()
> 
> The cpuset code passes (cs->mems_allowed) which is not guaranteed to have
> nodes to the rebind routine.  Use cs->effective_mems instead, which is
> guaranteed to have a non-empty nodemask.
> 
> Closes: https://lore.kernel.org/linux-mm/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> Link: https://lore.kernel.org/all/CA+0ovCiEz6SP_sn3kN4Tb+_oC=eHMXy_Ffj=usV3wREdQrUtww@mail.gmail.com/
> Fixes: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}")
> Suggested-by: Gregory Price <gourry@gourry.net>
> Suggested-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Acked-by: Waiman Long <longman@redhat.com>
> Cc: stable@vger.kernel.org

Reviewed-by: Gregory Price <gourry@gourry.net>

> ---
>  kernel/cgroup/cpuset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 591e3aa487fc..b21c31650583 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2653,7 +2653,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>  
>  		migrate = is_memory_migrate(cs);
>  
> -		mpol_rebind_mm(mm, &cs->mems_allowed);
> +		mpol_rebind_mm(mm, &cs->effective_mems);
>  		if (migrate)
>  			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
>  		else
> -- 
> 2.54.0
> 

^ permalink raw reply

* Re: [PATCH v2] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Michal Koutný @ 2026-06-23 13:52 UTC (permalink / raw)
  To: Joe Simmons-Talbott
  Cc: Tejun Heo, Johannes Weiner, Shuah Khan, cgroups, linux-kselftest,
	linux-kernel, Sebastian Chlad
In-Reply-To: <20260622194305.601392-1-joest@redhat.com>

On Mon, Jun 22, 2026 at 03:43:04PM -0400, Joe Simmons-Talbott <joest@redhat.com> wrote:
> +static long
> +_get_config_hz(void)
> +{
> +	long hz = -1;
> +	FILE *f;
> +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> +
> +	f = popen(cmd, "r");
> +
> +	if (!f)
> +		goto out;
> +
> +	fscanf(f, "CONFIG_HZ=%ld", &hz);
> +
> +out:
> +	pclose(f);
> +	return hz;
> +}

I like that you voiced this dependency on CONFIG_HZ and also that
_SC_CLK_TCK is useless in this regard.
(I see that BPF selftests have similar infra for this.)
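
A slightly more defensive variant of the quoted helper might look like
the sketch below. It is an illustration only: parse_config_hz() is a
hypothetical name that is not in the patch, and the split into two
functions is my assumption about how one could make the parsing testable.
It skips pclose() when popen() fails and only trusts a value that was
actually parsed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Parse one "CONFIG_HZ=<n>" line; return n, or -1 if it does not match. */
static long parse_config_hz(const char *line)
{
	long hz;

	assert(line != NULL);
	if (sscanf(line, "CONFIG_HZ=%ld", &hz) != 1 || hz <= 0)
		return -1;
	return hz;
}

/* Best effort: return the kernel's HZ from /proc/config.gz, or -1. */
static long get_config_hz(void)
{
	char line[64];
	long hz = -1;
	FILE *f;

	f = popen("zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='", "r");
	if (!f)
		return -1;	/* nothing to pclose() on this path */

	if (fgets(line, sizeof(line), f))
		hz = parse_config_hz(line);

	pclose(f);
	return hz;
}
```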


> +
>  /*
>   * This test creates a cgroup with some maximum value within a period, and
>   * verifies that a process in the cgroup is not overscheduled.
> @@ -646,7 +669,8 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
>  static int test_cpucg_max(const char *root)
>  {
>  	int ret = KSFT_FAIL;
> -	long quota_usec = 1000;
> +	long hz = _get_config_hz();
> +	long quota_usec;
>  	long default_period_usec = 100000; /* cpu.max's default period */
>  	long duration_seconds = 1;

I would not bend the tested value but its expectation (so that
approximately the same quantity is tested across configs).

I reckon the problem might be tasks that overrun the quota due to a long
tick; fortunately, we can assume this is compensated over multiple
periods, so _on average_ the quota should be honored (more) precisely.
But the test duration may not be well aligned with all the compensation
periods, so that must be accounted for in the expectation.

When I write it all down, I get this:

--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -651,7 +651,9 @@ static int test_cpucg_max(const char *root)
        long duration_seconds = 1;

        long duration_usec = duration_seconds * USEC_PER_SEC;
-       long usage_usec, n_periods, remainder_usec, expected_usage_usec;
+       long usage_usec, expected_usage_usec;
+       long n_periods, spread_periods, unaligned;
+       long tick_usec, low_usage, high_usage;
        char *cpucg;
        char quota_buf[32];

@@ -687,9 +689,16 @@ static int test_cpucg_max(const char *root)
         * the cpu hog is set to run as per wall-clock time
         */
        n_periods = duration_usec / default_period_usec;
-       remainder_usec = duration_usec - n_periods * default_period_usec;
-       expected_usage_usec
-               = n_periods * quota_usec + MIN(remainder_usec, quota_usec);
+       tick_usec = USEC_PER_SEC / hz;
+       /* Up to tick_usec (over)run is compensated over multiple periods */
+       spread_periods = MAX(1, tick_usec / quota_usec);
+       low_usage = n_periods / spread_periods;
+       high_usage = (n_periods + spread_periods - 1) / spread_periods;
+       unaligned = n_periods % spread_periods;
+
+       expected_usage_usec = quota_usec * (
+               unaligned * high_usage +
+               (spread_periods - unaligned) * low_usage);

        if (!values_close_report(usage_usec, expected_usage_usec, 10))
                goto cleanup;


(I neglected (and dropped) remainder_usec because it is zero with
default values)

However, not all preemptions are tick-based, so there'd be noise
and one has to tune the values_close_report(,,err) anyway.

Then to reduce noise, the simpler solution is to let the test run
longer

duration_usec = duration_seconds * USEC_PER_SEC * 1000 / hz;

(where 1000 is the CONFIG_HZ=1000 where the test runs sufficiently [1] well.)
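
The prolonged-duration variant can be sketched as a small helper. This is
only an illustration: scaled_duration_usec() and REFERENCE_HZ are
hypothetical names, not part of test_cpu.c.

```c
#include <assert.h>

#define USEC_PER_SEC	1000000L
#define REFERENCE_HZ	1000L	/* the HZ at which a 1s run is long enough */

/* Stretch the test duration in proportion to the tick length. */
static long scaled_duration_usec(long duration_seconds, long hz)
{
	assert(hz > 0);
	return duration_seconds * USEC_PER_SEC * REFERENCE_HZ / hz;
}
```

With duration_seconds = 1 this yields 1000000 usec at HZ=1000, 4000000 at
HZ=250 and 10000000 at HZ=100, so the number of ticks per run stays the
same across configs.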

Joe, how do the two variants above (unalignment accounting and prolonged
duration) affect test_cpu behavior on your setup?

(I'm personally wondering which is the bigger quantity: systematic error
due to HZ quantization or random (SMP) error.)

Thanks,
Michal

[1] Even there one runs into noise depending on nr_cpus, thus even that
    fixed err=10 is not ideal.

^ permalink raw reply

* Re: [PATCH v2] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Joe Simmons-Talbott @ 2026-06-23 11:51 UTC (permalink / raw)
  To: Tao Cui
  Cc: Joe Simmons-Talbott, Tejun Heo, Johannes Weiner,
	Michal Koutný, Shuah Khan, cgroups, linux-kselftest,
	linux-kernel
In-Reply-To: <d8bdf9ef-a393-4734-8639-308ac3eaa05c@linux.dev>

On Tue, Jun 23, 2026 at 01:32:08PM +0800, Tao Cui wrote:
> 
> Hi Joe,
> 
> One comment on the fallback:
> 
>   quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
> 
> When HZ can't be determined (no CONFIG_IKCONFIG_PROC, or zcat missing),
> the fallback to 1000 is the exact value that fails at low HZ — so this
> doesn't actually fix such kernels. A larger fallback (e.g. 10000, the
> HZ=100 equivalent) would make the tests robust regardless of whether the
> config is exposed.

Hi Tao Cui,

Thank you for your review.

I am happy to use 10000 as the fallback value.  I will address this as
well as the sashiko comments in v3.
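
For reference, the quota selection with the larger fallback could look
roughly like this. It is a sketch, not the v3 code: quota_from_hz() and
FALLBACK_QUOTA_USEC are hypothetical names, and treating any
non-positive hz as "unknown" (rather than only -1) is my assumption.

```c
#include <assert.h>

#define USEC_PER_SEC		1000000L
#define FALLBACK_QUOTA_USEC	10000L	/* one tick at HZ=100 */

/* One tick's worth of microseconds, or the fallback if HZ is unknown. */
static long quota_from_hz(long hz)
{
	long quota = hz > 0 ? USEC_PER_SEC / hz : FALLBACK_QUOTA_USEC;

	assert(quota > 0);
	return quota;
}
```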

Thanks,
Joe
> 
> On 2026/6/23 03:43, Joe Simmons-Talbott wrote:
> > For lower HZ values a quota of 1000us is much lower than the amount
> > of microseconds per tick which makes the tests test_cpucg_max and
> > test_cpucg_max_nested fail. Use the amount of microseconds per tick
> > as the quota value.
> > 
> > Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
> > ---
> > changes since v1:
> > - Try checking /proc/config.gz to get the actual kernel HZ value and
> >   fallback to 1000 if the value cannot be determined.
> > 
> >  tools/testing/selftests/cgroup/test_cpu.c | 33 +++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
> > index 7a40d76b9548..65e09555309f 100644
> > --- a/tools/testing/selftests/cgroup/test_cpu.c
> > +++ b/tools/testing/selftests/cgroup/test_cpu.c
> > @@ -639,6 +639,29 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
> >  	return run_cpucg_nested_weight_test(root, false);
> >  }
> >  
> > +/*
> > + * Best effort attempt to get the kernel's HZ value from the config.
> > + * Return the HZ value if found otherwise return -1 to indicate failure.
> > + */
> > +static long
> > +_get_config_hz(void)
> > +{
> > +	long hz = -1;
> > +	FILE *f;
> > +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> > +
> > +	f = popen(cmd, "r");
> > +
> > +	if (!f)
> > +		goto out;
> > +
> > +	fscanf(f, "CONFIG_HZ=%ld", &hz);
> > +
> > +out:
> > +	pclose(f);
> > +	return hz;
> > +}
> > +
> >  /*
> >   * This test creates a cgroup with some maximum value within a period, and
> >   * verifies that a process in the cgroup is not overscheduled.
> > @@ -646,7 +669,8 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
> >  static int test_cpucg_max(const char *root)
> >  {
> >  	int ret = KSFT_FAIL;
> > -	long quota_usec = 1000;
> > +	long hz = _get_config_hz();
> > +	long quota_usec;
> >  	long default_period_usec = 100000; /* cpu.max's default period */
> >  	long duration_seconds = 1;
> >  
> > @@ -655,6 +679,8 @@ static int test_cpucg_max(const char *root)
> >  	char *cpucg;
> >  	char quota_buf[32];
> >  
> > +	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
> > +
> >  	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
> >  
> >  	cpucg = cg_name(root, "cpucg_test");
> > @@ -710,7 +736,8 @@ static int test_cpucg_max(const char *root)
> >  static int test_cpucg_max_nested(const char *root)
> >  {
> >  	int ret = KSFT_FAIL;
> > -	long quota_usec = 1000;
> > +	long quota_usec;
> > +	long hz = _get_config_hz();
> >  	long default_period_usec = 100000; /* cpu.max's default period */
> >  	long duration_seconds = 1;
> >  
> > @@ -719,6 +746,8 @@ static int test_cpucg_max_nested(const char *root)
> >  	char *parent, *child;
> >  	char quota_buf[32];
> >  
> > +	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
> > +
> >  	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
> >  
> >  	parent = cg_name(root, "cpucg_parent");
> 
> 


^ permalink raw reply

* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli @ 2026-06-23 10:15 UTC (permalink / raw)
  To: Yuri Andriaccio
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Luca Abeni, Yuri Andriaccio
In-Reply-To: <20260608121546.69910-1-yurand2000@gmail.com>

Hi Yuri,

On 08/06/26 14:15, Yuri Andriaccio wrote:
> Hello,
> 
> This is the v6 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25 and OSPM26
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1
> which talks in more detail what this patchset is all about and how it is
> implemented.
> 
> This v6 version works on the comments by the reviewers and introduces the
> following meaningful changes:
> - Update to kernel version 7.1.
> - Refactorings and general cleanups.
> - Removal of substantial duplicated code.
> - Express more locking constraints in code.
> - New cpu.rt.max interface.
> - Refactoring of migration code to reduce code duplication.
>   The new migration code now reuses the existing push/pull and similar functions
>   and specializes where needed, substantially reducing the footprint of group
>   migration code from previous versions.

I've been working on a simple demo and benchmark suite for HCBS to
explore real-world-like use cases and characterize the feature's
behavior. It takes a different angle from your unit test suite (I believe).

The demo is available at:

  https://github.com/jlelli/hcbs-demo

The demo models three scenarios — Industrial PLC Convergence, Robotics
Compute Platform, and Precision Motion Control — each with multiple
cooperating SCHED_FIFO tasks at different priorities sharing a cgroup,
which is the key differentiator vs. plain SCHED_DEADLINE. An aggressor
subsystem overloads the system while HCBS contains it to its budget.

A side-by-side compare mode runs baseline and HCBS simultaneously on
separate cpuset partitions, with a live terminal dashboard showing the
contrast in real time.

Key findings from testing on an Intel Xeon Gold 6433N (4 isolated CPUs
via cpuset partition):

 - At 10ms task periods, HCBS provides perfect temporal isolation: zero
   victim deadline misses across all scenarios, while aggressors are
   correctly throttled to their budget.

 - At 1ms task periods, the dl-server period is the critical tuning
   parameter, more so than the bandwidth. A 10ms dl-server with 60% bandwidth
   caused ~10% miss rates because the worst-case throttle gap (4ms)
   spanned multiple 1ms deadlines. Switching to a 2ms dl-server period
   at just 30% bandwidth eliminated all misses.

 - A simple rule of thumb might be to set the dl-server period to at
   most 2x the shortest task period in the cgroup (e.g., 2ms dl-server
   for 1ms tasks, 10ms for 10ms tasks). Would you (and Luca?) agree or
   would you suggest something different?

 - dl-server overhead itself appears negligible: a parameter sweep
   confirmed zero misses for a single task at all bandwidth/period
   combinations tested.

 - The current v6 has been quite stable throughout my testing — no
   warnings, no crashes, and the bandwidth isolation works as expected
   across all the scenarios and workload combinations I've tried.
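
The 4ms throttle gap quoted above is consistent with a simple "period
minus budget" estimate. The sketch below assumes exactly that model,
which is a simplification and not something taken from the patchset;
throttle_gap_usec() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Time per dl-server period during which the group gets no budget,
 * assuming the budget is period * bandwidth and is consumed in one go.
 */
static long throttle_gap_usec(long period_usec, long bandwidth_pct)
{
	assert(period_usec > 0);
	assert(bandwidth_pct >= 0 && bandwidth_pct <= 100);
	return period_usec - period_usec * bandwidth_pct / 100;
}
```

Under this model a 10ms period at 60% leaves a 4000 usec gap, while a 2ms
period at 30% leaves 1400 usec, even though the second configuration
reserves less bandwidth.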

The demo supports three workload backends:
 - rt-app (default): synthetic periodic tasks with configurable periods
 - RT-Bench TACLeBench (--rtbench): real algorithms (susan, dijkstra,
   FFT, mpeg2, etc.) matching each scenario's computation profile
 - stress-ng (--stress): mixed CPU/IO/cache/memory aggressors for
   realistic interference patterns

Recorded demos (asciinema):
  Industrial: https://asciinema.org/a/5e4BMdWxS7hmm4hI
  Robotics:   https://asciinema.org/a/Msj48XnJGgcCev7M
  Precision:  https://asciinema.org/a/56WC3bu7yrcQe9nz

Thanks for the great work on this patchset. Happy to hear if this demo
is of any interest, discuss any of the findings and/or understand if/how
this can be further expanded.

Best,
Juri


^ permalink raw reply

* Re: [PATCH] mm: memcg: remove stray text from obj_stock_pcp comment
From: Guopeng Zhang @ 2026-06-23  9:02 UTC (permalink / raw)
  To: Harry Yoo, Guopeng Zhang, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: cgroups, linux-mm, linux-kernel
In-Reply-To: <82703e62-4061-4241-b12b-c46b927cc67d@kernel.org>


On 2026/6/23 16:42, Harry Yoo wrote:
>
> On 6/23/26 5:26 PM, Guopeng Zhang wrote:
>> From: Guopeng Zhang <zhangguopeng@kylinos.cn>
>>
>> A patch filename was accidentally inserted into the comment describing
>> the nr_bytes field of struct obj_stock_pcp. Remove it.
> nit: perhaps add something like
> "Fix a typo in the comment (target -> targets)"?
Hi Harry,

Thanks for the review and the Ack.

Yes, I also fixed the "target -> targets" typo, but missed mentioning it
in the commit message. I'll be more careful about describing all changes
clearly next time. If a respin is needed, I'll add it to the commit
message and carry your Acked-by.

Thanks,
Guopeng

>> No functional change.
>>
>> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
>> ---
> FWIW,
> Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
>
> Thanks!
>
>>  mm/memcontrol.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6dc4888a90f3..3eedfc4e84a0 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2039,7 +2039,7 @@ struct obj_stock_pcp {
>>  	/*
>>  	 * On rare archs with 256KiB base page size (hexagon and powerpc 44x)
>>  	 * keep nr_bytes to unsigned int as uint16_t cannot represent the full
>> -e patches/memcg-uint16_t-for-nr_bytes-in-obj_stock_pcp.patch	 * sub-page remainder. Such archs are not cacheline optimization target.
>> +	 * sub-page remainder. Such archs are not cacheline optimization targets.
>>  	 */
>>  	unsigned int nr_bytes[NR_OBJ_STOCK];
>>  #else

^ permalink raw reply

* Re: [PATCH] mm: memcg: remove stray text from obj_stock_pcp comment
From: Harry Yoo @ 2026-06-23  8:42 UTC (permalink / raw)
  To: Guopeng Zhang, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton
  Cc: cgroups, linux-mm, linux-kernel, Guopeng Zhang
In-Reply-To: <20260623082614.81621-1-guopeng.zhang@linux.dev>


On 6/23/26 5:26 PM, Guopeng Zhang wrote:
> From: Guopeng Zhang <zhangguopeng@kylinos.cn>
> 
> A patch filename was accidentally inserted into the comment describing
> the nr_bytes field of struct obj_stock_pcp. Remove it.

nit: perhaps add something like
"Fix a typo in the comment (target -> targets)"?

> No functional change.
> 
> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
> ---

FWIW,
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>

Thanks!

>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6dc4888a90f3..3eedfc4e84a0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2039,7 +2039,7 @@ struct obj_stock_pcp {
>  	/*
>  	 * On rare archs with 256KiB base page size (hexagon and powerpc 44x)
>  	 * keep nr_bytes to unsigned int as uint16_t cannot represent the full
> -e patches/memcg-uint16_t-for-nr_bytes-in-obj_stock_pcp.patch	 * sub-page remainder. Such archs are not cacheline optimization target.
> +	 * sub-page remainder. Such archs are not cacheline optimization targets.
>  	 */
>  	unsigned int nr_bytes[NR_OBJ_STOCK];
>  #else

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply

* [PATCH] mm: memcg: remove stray text from obj_stock_pcp comment
From: Guopeng Zhang @ 2026-06-23  8:26 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton
  Cc: cgroups, linux-mm, linux-kernel, Guopeng Zhang

From: Guopeng Zhang <zhangguopeng@kylinos.cn>

A patch filename was accidentally inserted into the comment describing
the nr_bytes field of struct obj_stock_pcp. Remove it.

No functional change.

Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6dc4888a90f3..3eedfc4e84a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2039,7 +2039,7 @@ struct obj_stock_pcp {
 	/*
 	 * On rare archs with 256KiB base page size (hexagon and powerpc 44x)
 	 * keep nr_bytes to unsigned int as uint16_t cannot represent the full
-e patches/memcg-uint16_t-for-nr_bytes-in-obj_stock_pcp.patch	 * sub-page remainder. Such archs are not cacheline optimization target.
+	 * sub-page remainder. Such archs are not cacheline optimization targets.
 	 */
 	unsigned int nr_bytes[NR_OBJ_STOCK];
 #else
-- 
2.25.1
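
For readers following the comment being fixed, the arithmetic behind it
can be checked with a small userspace sketch. This is not kernel code:
max_subpage_remainder() and remainder_fits_u16() are hypothetical helper
names, and the sketch assumes the largest sub-page remainder is one byte
less than the base page size.

```c
#include <stdint.h>

/*
 * Assumption for illustration: the largest sub-page remainder, in bytes,
 * is one less than the base page size.
 */
static unsigned long max_subpage_remainder(unsigned long page_size)
{
	return page_size - 1;
}

/* Return 1 if every sub-page remainder fits in a uint16_t, else 0. */
static int remainder_fits_u16(unsigned long page_size)
{
	return max_subpage_remainder(page_size) <= UINT16_MAX;
}
```

Under that assumption, 4KiB and 64KiB base pages give a maximum
remainder of at most 65535, which a uint16_t can hold, while a 256KiB
base page gives 262143, which it cannot. That matches the comment's
reason for keeping nr_bytes as unsigned int on those archs.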


^ permalink raw reply related

* [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-23  6:27 UTC (permalink / raw)
  To: linux-mm
  Cc: yingfu.zhou, jiayuan.chen, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260623062800.298514-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen <jiayuan.chen@shopee.com>

Proactive reclaim via memory.reclaim can run for a long time, with swap
I/O or thrashing again dominating the latency, and it delays cgroup
removal in the same way.

Mitigate this by stopping the reclaim once memcg_is_dying() returns true.

Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8190c4abec84..1162b7f76655 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
 		if (memcg) {
 			unsigned int reclaim_options;
 
+			if (memcg_is_dying(memcg))
+				break;
+
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-- 
2.43.0
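
The shape of the check added above can be modelled in userspace as a
batched loop that tests a "dying" flag before each batch. This is only a
rough sketch: struct model_memcg, model_reclaim_batch() and
model_proactive_reclaim() are hypothetical names for illustration, not
the kernel's types or functions, and the batch accounting is simplified.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup, not struct mem_cgroup. */
struct model_memcg {
	bool dying;	/* set once removal of the group has started */
	size_t pages;	/* pages still charged to the group */
};

/* Hypothetical batch step: free up to 'batch' pages, return the count. */
static size_t model_reclaim_batch(struct model_memcg *cg, size_t batch)
{
	size_t n = cg->pages < batch ? cg->pages : batch;

	cg->pages -= n;
	return n;
}

/*
 * Reclaim up to 'target' pages in batches. The dying flag is checked
 * before every batch, so a group under removal stops reclaiming early.
 * Returns the number of pages reclaimed.
 */
static size_t model_proactive_reclaim(struct model_memcg *cg,
				      size_t target, size_t batch)
{
	size_t done = 0;

	while (done < target) {
		size_t want, got;

		if (cg->dying)
			break;	/* group is going away: bail out */

		want = target - done < batch ? target - done : batch;
		got = model_reclaim_batch(cg, want);
		if (got == 0)
			break;	/* nothing left to reclaim */
		done += got;
	}
	return done;
}
```

In this model, a group whose flag is already set reclaims nothing, and a
flag set between batches ends the loop at the next check rather than
after the full target has been reached.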


^ permalink raw reply related
