Linux cgroups development
* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Jani Nikula @ 2026-06-25 11:00 UTC
  To: Kaitao Cheng, David Laight, Christian König,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <0ed6b5c3-e955-46e2-9fc6-075a0dfd1c4f@linux.dev>

On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> On 2026/6/24 22:23, David Laight wrote:
>> On Wed, 24 Jun 2026 15:23:47 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>> On 2026/6/22 16:42, David Laight wrote:
>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>  
>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>
>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>>> every call site, even though most users only need it for the iterator
>>>>>> implementation and never reference it in the loop body.
>>>>>>
>>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>> a unique internal cursor.  
>>>>>
>>>>> I'm not really sure 'mutable' means anything either.
>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>
>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>> but there is little point doing all the churn just to rename things.
>>>>>  
>>>>>>
>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>> compatibility.
>>>>>>
>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>> ---
>>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>> --- a/include/linux/list.h
>>>>>> +++ b/include/linux/list.h
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/stddef.h>
>>>>>>  #include <linux/poison.h>
>>>>>>  #include <linux/const.h>
>>>>>> +#include <linux/args.h>
>>>>>>  
>>>>>>  #include <asm/barrier.h>
>>>>>>  
>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>>  #define list_for_each_prev(pos, head) \
>>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>  
>>>>>> -/**
>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> +/*
>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>>   */
>>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>>  	     !list_is_head(pos, (head)); \
>>>>>>  	     pos = n, n = pos->next)
>>>>>>  
>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>>
>>>>> Use auto
>>>>>  
>>>>>> +	     !list_is_head(pos, (head));				\
>>>>>> +	     pos = tmp, tmp = pos->next)
>>>>>> +
>>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>> +
>>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>>> +	list_for_each_safe(pos, next, head)
>>>>>> +
>>>>>>  /**
>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> + * @...:	either (head) or (next, head)
>>>>>> + *
>>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>>> + *		the caller.
>>>>>> + * head:	the head for your list.
>>>>>> + */
>>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>>> +		(pos, __VA_ARGS__)  
>>>>>
>>>>> The variable argument count logic really just slows down compilation.
>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>> But just because you can do it doesn't mean it is a good idea.
>>>>> I'm also not sure it really adds anything to the readability.
>>>>>
>>>>> And, if you are going to make the middle argument optional there is
>>>>> no need to change the macro name.  
>>>>
>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>> implementation approach. If we abandon that method, it means we will
>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>> suggestions for better alternatives would be welcome; coming up with a
>>>> suitable name is indeed rather tricky.  
>>>
>>> I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.
>>>
>>> If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for that sounds perfectly fine to me.
>> 
>> IIRC currently you have a choice of either:
>>     Macro                  Item that can't be deleted
>>     list_for_each()        The current item.
>>     list_for_each_safe()   The next item.
>> There is also likely to be code that updates the variables to allow
>> for other scenarios.
>> 
>> Note that if you increase a reference count and release a lock, then list_for_each()
>> is likely safer than list_for_each_safe() :-)
>> 
>> list.h has 9 variants of the 'safe' loop.
>> The bloat of another 9 is getting excessive.
>> 
>> It has to be said that this is one of my least favourite types of list...
>
> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
> Andy Shevchenko, Alexei Starovoitov
>
> For ease of discussion, I need to summarize the currently possible
> approaches and briefly describe their respective pros and cons,
> using the list_for_each_entry* interfaces as examples.
>
> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
> would be used specifically for safe deletion scenarios that do not
> need to expose the temporary cursor externally. See v1 for the code.
>
> Pros: Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: Requires adding a whole set of mutable interfaces, which makes the
>       code somewhat redundant.

Seems fine, and the original _safe naming is ambiguous anyway.

> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
> and define it inside the loop instead, changing the interface from four
> arguments to three.
>
> Pros: Does not add redundant interfaces.
> Cons: (1) Users need to manually update the special cases that use the
>       traversal variable of list_for_each_entry_safe; the new
>       list_for_each_entry_safe would no longer apply there, so they
>       would need to be open-coded.
>       (2) Because the macro arguments change, all list_for_each_entry_safe
>       callers would need to be modified and merged together, and it is
>       difficult to merge such a large amount of code at once.

This won't fly because there are literally thousands of
list_for_each_entry_safe() users.

> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
> so that it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can
>       be merged directly.
> Cons: (1) Increases compile time.
>       (2) Makes the interface harder for users to use.

Basically I'm against any variadic macro tricks where the optional
argument is not the last argument. That's just way too surprising, and
goes against common practice in just about all other languages.
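
To make the objection concrete, here is a minimal userspace sketch of how
argument-count dispatch of this kind works. Everything named demo_* or
DEMO_* is hypothetical and only stands in for the kernel's COUNT_ARGS()
and CONCATENATE(); it is not the code from the patch.

```c
/* Pick the third argument; with the padding below this yields 1 or 2. */
#define DEMO_NTH(_1, _2, n, ...) n
#define DEMO_COUNT(...) DEMO_NTH(__VA_ARGS__, 2, 1, 0)

/* Two-level paste so DEMO_COUNT() is expanded before ## is applied. */
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)

/* Records which back end was selected and how the arguments were bound. */
struct demo_call {
	int arity;
	const char *pos, *next, *head;
};

/* Hypothetical back ends: one without and one with the middle argument. */
#define demo_walk1(pos, head) \
	((struct demo_call){ 1, pos, "<internal>", head })
#define demo_walk2(pos, next, head) \
	((struct demo_call){ 2, pos, next, head })

/* Front end: the number of trailing arguments selects the back end. */
#define demo_walk(pos, ...) \
	DEMO_CAT(demo_walk, DEMO_COUNT(__VA_ARGS__))(pos, __VA_ARGS__)
```

With this, demo_walk("pos", "head") and demo_walk("pos", "n", "head") both
compile, and the meaning of the second argument silently depends on whether
a third one follows it.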

> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
> making it compatible with the functionality of list_for_each_entry_safe.
> See v2 for the code.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) The number of externally visible arguments of list_for_each_entry
>       remains unchanged, still three.
> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.
>       (2) Users need to manually update the special cases that use the
>       traversal variable of list_for_each_entry; the new list_for_each_entry
>       would no longer apply there, so they would need to be open-coded.
>       There are 15 such cases in total.

This sounds good to me, though I take it there's some code size increase
and/or performance penalty?

Maybe the 15 cases are questionable anyway?

> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
> it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: (1) Increases compile time.
>       (2) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.

Please don't do the macro tricks.

> 6. Make no changes, keep the current logic unchanged, and close the current
> email discussion.

I like hiding the temporary stuff when possible.
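
For illustration, a minimal userspace sketch of what hiding the temporary
could look like. The list helpers below are simplified stand-ins for the
kernel ones, and demo_for_each_hidden() is a hypothetical name, not a macro
from any version of the series. It assumes C99 loop-scoped declarations.

```c
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
}

/*
 * Hypothetical macro: the saved "next" pointer is declared in the for-init
 * clause, so it is scoped to the loop and never appears at the call site.
 */
#define demo_for_each_hidden(pos, head)                                  \
	for (struct list_head *demo_next = ((pos) = (head)->next)->next; \
	     (pos) != (head);                                            \
	     (pos) = demo_next, demo_next = (pos)->next)

struct item {
	int value;
	struct list_head node;
};

/* Unlink every even-valued item while iterating; return how many remain. */
static int drop_even(struct list_head *head)
{
	struct list_head *pos;
	int kept = 0;

	demo_for_each_hidden(pos, head) {
		struct item *it = (struct item *)
			((char *)pos - offsetof(struct item, node));

		if (it->value % 2 == 0)
			list_del(pos);	/* fine: demo_next was saved earlier */
		else
			kept++;
	}
	return kept;
}
```

The fixed name demo_next is a simplification; the quoted patch generates the
name with __UNIQUE_ID() instead. As with the existing _safe helpers, only
removal of the current entry is covered, not removal of the saved next entry.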


BR,
Jani.

-- 
Jani Nikula, Intel


* Re: [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Hongfu Li @ 2026-06-25 10:21 UTC
  To: Natalie Vock, tj
  Cc: cgroups, corbet, dev, dri-devel, hannes, linux-doc, linux-kernel,
	mkoutny, mripard, skhan, hongfu.li
In-Reply-To: <b549422c-7c35-434d-ad4a-49a4676970ac@gmx.de>

Hi,

On 6/25/26 4:57 PM, Natalie Vock wrote:
> Hi,
>
> On 6/25/26 04:10, Hongfu Li wrote:
>> Hi, Tejun
>> Thanks for the review comments.
>>
>>>> Add dmem.events to report hierarchical low/max event counts per DMEM
>>>> region.  Increment counters on dmem.max allocation failures and
>>>> dmem.low protection events.  The file is available for non-root cgroups
>>>> only.
>>>
>>> Please don't double space in descs or comments. Also, maybe it's obvious but
>>> it'd help if you list why and how this is useful. Why do we want to add
>>> this?
>>
>> I'll fix the double spacing in the commit message and comments.
>>
>> As for the motivation: dmem already exposes per-region limits and current
>> usage, but not how often those limits actually matter at runtime. Without
>> event counters, it's hard to tell whether allocation failures come from
>> this cgroup, a parent limit, or pressure elsewhere in the hierarchy.
>> dmem.events provides that visibility for tuning dmem.low/dmem.max and
>> diagnosing recurring device memory pressure.
>
> Shouldn't you be able to deduce this rather trivially from just looking
> at the current usage together with the low/max limits you already set?
> I'm not sure I really see anything this events file provides that
> analysis of current usage and set limits doesn't? If your usage is
> highly variable, the separately-developed dmem.peak file might also suit
> your needs, but still, not sure what you can do with dmem.events that
> you can't already do with these tools.

Thanks for the question.

Besides exposing counters, dmem.events notifies userspace on changes via
cgroup_file_notify(). This allows tools to monitor limit-related events
(for example, allocation failures or low-protection fallbacks)
asynchronously, without the need to periodically poll dmem.current against
the limits. While you could infer some conditions from current usage and
limits, polling is inefficient and cannot capture transient events in real
time. dmem.peak only records the highest usage, not these specific events.

So dmem.events provides both lower overhead and richer, actionable
information.

Best regards,
Hongfu





* [tj-cgroup:for-7.2-fixes] BUILD SUCCESS 46d65096ce8d278abf4528e254878c14ddd0b459
From: kernel test robot @ 2026-06-25  9:48 UTC
  To: Tejun Heo; +Cc: cgroups

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-7.2-fixes
branch HEAD: 46d65096ce8d278abf4528e254878c14ddd0b459  Docs/admin-guide/cgroup-v2: fix memory.stat doc details

elapsed time: 824m

configs tested: 248
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-16.1.0
alpha                            allyesconfig    gcc-16.1.0
alpha                               defconfig    gcc-16.1.0
arc                              allmodconfig    clang-23
arc                              allmodconfig    gcc-16.1.0
arc                               allnoconfig    gcc-16.1.0
arc                              allyesconfig    clang-23
arc                                 defconfig    gcc-16.1.0
arc                            randconfig-001    gcc-16.1.0
arc                   randconfig-001-20260625    clang-23
arc                   randconfig-001-20260625    gcc-16.1.0
arc                            randconfig-002    gcc-16.1.0
arc                   randconfig-002-20260625    clang-23
arc                   randconfig-002-20260625    gcc-16.1.0
arm                               allnoconfig    gcc-16.1.0
arm                              allyesconfig    clang-23
arm                              allyesconfig    gcc-16.1.0
arm                                 defconfig    gcc-16.1.0
arm                         nhk8815_defconfig    clang-23
arm                            randconfig-001    gcc-16.1.0
arm                   randconfig-001-20260625    clang-23
arm                   randconfig-001-20260625    gcc-16.1.0
arm                            randconfig-002    gcc-16.1.0
arm                   randconfig-002-20260625    clang-23
arm                   randconfig-002-20260625    gcc-16.1.0
arm                            randconfig-003    gcc-16.1.0
arm                   randconfig-003-20260625    clang-23
arm                   randconfig-003-20260625    gcc-16.1.0
arm                            randconfig-004    gcc-16.1.0
arm                   randconfig-004-20260625    clang-23
arm                   randconfig-004-20260625    gcc-16.1.0
arm                        shmobile_defconfig    gcc-16.1.0
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-16.1.0
arm64                               defconfig    gcc-16.1.0
arm64                 randconfig-001-20260625    clang-23
arm64                 randconfig-002-20260625    clang-23
arm64                 randconfig-003-20260625    clang-23
arm64                 randconfig-004-20260625    clang-23
csky                             allmodconfig    gcc-16.1.0
csky                              allnoconfig    gcc-16.1.0
csky                                defconfig    gcc-16.1.0
csky                  randconfig-001-20260625    clang-23
csky                  randconfig-002-20260625    clang-23
hexagon                          allmodconfig    clang-23
hexagon                          allmodconfig    gcc-16.1.0
hexagon                           allnoconfig    gcc-16.1.0
hexagon                             defconfig    gcc-16.1.0
hexagon                        randconfig-001    gcc-11.5.0
hexagon               randconfig-001-20260625    gcc-11.5.0
hexagon                        randconfig-002    gcc-11.5.0
hexagon               randconfig-002-20260625    gcc-11.5.0
i386                             allmodconfig    clang-22
i386                              allnoconfig    gcc-16.1.0
i386                             allyesconfig    clang-22
i386                 buildonly-randconfig-001    gcc-14
i386        buildonly-randconfig-001-20260625    gcc-14
i386                 buildonly-randconfig-002    gcc-14
i386        buildonly-randconfig-002-20260625    gcc-14
i386                 buildonly-randconfig-003    gcc-14
i386        buildonly-randconfig-003-20260625    gcc-14
i386                 buildonly-randconfig-004    gcc-14
i386        buildonly-randconfig-004-20260625    gcc-14
i386                 buildonly-randconfig-005    gcc-14
i386        buildonly-randconfig-005-20260625    gcc-14
i386                 buildonly-randconfig-006    gcc-14
i386        buildonly-randconfig-006-20260625    gcc-14
i386                                defconfig    gcc-16.1.0
i386                           randconfig-001    clang-22
i386                  randconfig-001-20260625    clang-22
i386                           randconfig-002    clang-22
i386                  randconfig-002-20260625    clang-22
i386                           randconfig-003    clang-22
i386                  randconfig-003-20260625    clang-22
i386                           randconfig-004    clang-22
i386                  randconfig-004-20260625    clang-22
i386                           randconfig-005    clang-22
i386                  randconfig-005-20260625    clang-22
i386                           randconfig-006    clang-22
i386                  randconfig-006-20260625    clang-22
i386                           randconfig-007    clang-22
i386                  randconfig-007-20260625    clang-22
i386                  randconfig-011-20260625    clang-22
i386                  randconfig-012-20260625    clang-22
i386                  randconfig-013-20260625    clang-22
i386                  randconfig-014-20260625    clang-22
i386                  randconfig-015-20260625    clang-22
i386                  randconfig-016-20260625    clang-22
i386                  randconfig-017-20260625    clang-22
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    gcc-16.1.0
loongarch                           defconfig    clang-23
loongarch                      randconfig-001    gcc-11.5.0
loongarch             randconfig-001-20260625    gcc-11.5.0
loongarch                      randconfig-002    gcc-11.5.0
loongarch             randconfig-002-20260625    gcc-11.5.0
m68k                             allmodconfig    gcc-16.1.0
m68k                              allnoconfig    gcc-16.1.0
m68k                             allyesconfig    clang-23
m68k                             allyesconfig    gcc-16.1.0
m68k                                defconfig    clang-23
m68k                         nettel_defconfig    gcc-16.1.0
microblaze                        allnoconfig    gcc-16.1.0
microblaze                       allyesconfig    gcc-16.1.0
microblaze                          defconfig    clang-23
mips                             allmodconfig    gcc-16.1.0
mips                              allnoconfig    gcc-16.1.0
mips                             allyesconfig    gcc-16.1.0
mips                          ath25_defconfig    clang-23
nios2                            allmodconfig    clang-20
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    clang-23
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    clang-23
nios2                          randconfig-001    gcc-11.5.0
nios2                 randconfig-001-20260625    gcc-11.5.0
nios2                          randconfig-002    gcc-11.5.0
nios2                 randconfig-002-20260625    gcc-11.5.0
openrisc                         allmodconfig    clang-20
openrisc                         allmodconfig    gcc-16.1.0
openrisc                          allnoconfig    clang-23
openrisc                          allnoconfig    gcc-16.1.0
openrisc                            defconfig    gcc-16.1.0
parisc                           allmodconfig    gcc-16.1.0
parisc                            allnoconfig    clang-23
parisc                            allnoconfig    gcc-16.1.0
parisc                           allyesconfig    clang-17
parisc                           allyesconfig    gcc-16.1.0
parisc                              defconfig    gcc-16.1.0
parisc                randconfig-001-20260625    gcc-13.4.0
parisc                randconfig-002-20260625    gcc-13.4.0
parisc64                            defconfig    clang-23
powerpc                          allmodconfig    gcc-16.1.0
powerpc                           allnoconfig    clang-23
powerpc                           allnoconfig    gcc-16.1.0
powerpc               randconfig-001-20260625    gcc-13.4.0
powerpc               randconfig-002-20260625    gcc-13.4.0
powerpc                     tqm5200_defconfig    gcc-16.1.0
powerpc64             randconfig-001-20260625    gcc-13.4.0
powerpc64             randconfig-002-20260625    gcc-13.4.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                             allnoconfig    gcc-16.1.0
riscv                            allyesconfig    clang-23
riscv                               defconfig    gcc-16.1.0
riscv                          randconfig-001    gcc-8.5.0
riscv                 randconfig-001-20260625    gcc-8.5.0
riscv                          randconfig-002    gcc-8.5.0
riscv                 randconfig-002-20260625    gcc-8.5.0
s390                             allmodconfig    clang-17
s390                             allmodconfig    clang-23
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-16.1.0
s390                                defconfig    gcc-16.1.0
s390                           randconfig-001    gcc-8.5.0
s390                  randconfig-001-20260625    gcc-8.5.0
s390                           randconfig-002    gcc-8.5.0
s390                  randconfig-002-20260625    gcc-8.5.0
sh                               allmodconfig    gcc-16.1.0
sh                                allnoconfig    clang-23
sh                                allnoconfig    gcc-16.1.0
sh                               allyesconfig    clang-17
sh                               allyesconfig    gcc-16.1.0
sh                                  defconfig    gcc-14
sh                             randconfig-001    gcc-8.5.0
sh                    randconfig-001-20260625    gcc-8.5.0
sh                             randconfig-002    gcc-8.5.0
sh                    randconfig-002-20260625    gcc-8.5.0
sparc                             allnoconfig    clang-23
sparc                             allnoconfig    gcc-16.1.0
sparc                               defconfig    gcc-16.1.0
sparc                 randconfig-001-20260625    gcc-8.5.0
sparc                 randconfig-002-20260625    gcc-8.5.0
sparc64                          allmodconfig    clang-20
sparc64                             defconfig    gcc-14
sparc64               randconfig-001-20260625    gcc-8.5.0
sparc64               randconfig-002-20260625    gcc-8.5.0
um                               allmodconfig    clang-17
um                                allnoconfig    clang-17
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                               allyesconfig    gcc-16.1.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                    randconfig-001-20260625    gcc-8.5.0
um                    randconfig-002-20260625    gcc-8.5.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-22
x86_64                            allnoconfig    clang-22
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-22
x86_64               buildonly-randconfig-001    clang-22
x86_64      buildonly-randconfig-001-20260625    clang-22
x86_64               buildonly-randconfig-002    clang-22
x86_64      buildonly-randconfig-002-20260625    clang-22
x86_64               buildonly-randconfig-003    clang-22
x86_64      buildonly-randconfig-003-20260625    clang-22
x86_64               buildonly-randconfig-004    clang-22
x86_64      buildonly-randconfig-004-20260625    clang-22
x86_64               buildonly-randconfig-005    clang-22
x86_64      buildonly-randconfig-005-20260625    clang-22
x86_64               buildonly-randconfig-006    clang-22
x86_64      buildonly-randconfig-006-20260625    clang-22
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-22
x86_64                randconfig-001-20260625    gcc-14
x86_64                randconfig-002-20260625    gcc-14
x86_64                randconfig-003-20260625    gcc-14
x86_64                randconfig-004-20260625    gcc-14
x86_64                randconfig-005-20260625    gcc-14
x86_64                randconfig-006-20260625    gcc-14
x86_64                         randconfig-011    clang-22
x86_64                randconfig-011-20260625    clang-22
x86_64                         randconfig-012    clang-22
x86_64                randconfig-012-20260625    clang-22
x86_64                         randconfig-013    clang-22
x86_64                randconfig-013-20260625    clang-22
x86_64                         randconfig-014    clang-22
x86_64                randconfig-014-20260625    clang-22
x86_64                         randconfig-015    clang-22
x86_64                randconfig-015-20260625    clang-22
x86_64                         randconfig-016    clang-22
x86_64                randconfig-016-20260625    clang-22
x86_64                         randconfig-071    clang-22
x86_64                randconfig-071-20260625    clang-22
x86_64                         randconfig-072    clang-22
x86_64                randconfig-072-20260625    clang-22
x86_64                         randconfig-073    clang-22
x86_64                randconfig-073-20260625    clang-22
x86_64                         randconfig-074    clang-22
x86_64                randconfig-074-20260625    clang-22
x86_64                         randconfig-075    clang-22
x86_64                randconfig-075-20260625    clang-22
x86_64                         randconfig-076    clang-22
x86_64                randconfig-076-20260625    clang-22
x86_64                               rhel-9.4    clang-22
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-22
x86_64                    rhel-9.4-kselftests    clang-22
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-22
xtensa                            allnoconfig    clang-23
xtensa                            allnoconfig    gcc-16.1.0
xtensa                           allyesconfig    clang-20
xtensa                           allyesconfig    gcc-16.1.0
xtensa                randconfig-001-20260625    gcc-8.5.0
xtensa                randconfig-002-20260625    gcc-8.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v6 0/6] [PATCH v6 0/6] Add reclaim to the dmem cgroup controller
From: Michal Koutný @ 2026-06-25  9:19 UTC
  To: Thadeu Lima de Souza Cascardo
  Cc: Tejun Heo, Thomas Hellström, intel-xe, Natalie Vock,
	Johannes Weiner, cgroups, Huang Rui, Matthew Brost, Matthew Auld,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter, David Airlie, Christian König, Alex Deucher,
	Rodrigo Vivi, dri-devel, amd-gfx, linux-kernel
In-Reply-To: <ajlUPmaMsa2gxOLg@quatroqueijos.cascardo.eti.br>


On Mon, Jun 22, 2026 at 12:26:54PM -0300, Thadeu Lima de Souza Cascardo <cascardo@igalia.com> wrote:
> As far as I understood the patchset, it doesn't fail the write if it fails
> to reclaim. It sets the new max, then, if the write is blocking, starts
> reclaim and eventually returns after multiple attempts. But it still
> returns success.
> 
> So I believe this is behaving as you would expect.

I was alarmed by the EBUSY mention similarly to Tejun but then I
couldn't find it in pre-patch (840ef6c78e6a2) nor in patched (v5) code.
Please make sure the EBUSY return behavior is not introduced
(essentially match memory.max behavior) and that the accompanying
message refers to up-to-date code ;-)

Michal



* Re: [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Natalie Vock @ 2026-06-25  8:57 UTC
  To: Hongfu Li, tj
  Cc: cgroups, corbet, dev, dri-devel, hannes, linux-doc, linux-kernel,
	mkoutny, mripard, skhan, hongfu.li
In-Reply-To: <20260625021053.488107-1-lihongfu@kylinos.cn>

Hi,

On 6/25/26 04:10, Hongfu Li wrote:
> Hi, Tejun
> Thanks for the review comments.
> 
>>> Add dmem.events to report hierarchical low/max event counts per DMEM
>>> region.  Increment counters on dmem.max allocation failures and
>>> dmem.low protection events.  The file is available for non-root cgroups
>>> only.
>>
>> Please don't double space in descs or comments. Also, maybe it's obvious but
>> it'd help if you list why and how this is useful. Why do we want to add
>> this?
> 
> I'll fix the double spacing in the commit message and comments.
> 
> As for the motivation: dmem already exposes per-region limits and current
> usage, but not how often those limits actually matter at runtime. Without
> event counters, it's hard to tell whether allocation failures come from
> this cgroup, a parent limit, or pressure elsewhere in the hierarchy.
> dmem.events provides that visibility for tuning dmem.low/dmem.max and
> diagnosing recurring device memory pressure.

Shouldn't you be able to deduce this rather trivially from just looking 
at the current usage together with the low/max limits you already set? 
I'm not sure I really see anything this events file provides that 
analysis of current usage and set limits doesn't? If your usage is 
highly variable, the separately-developed dmem.peak file might also suit 
your needs, but still, not sure what you can do with dmem.events that 
you can't already do with these tools.

Best,
Natalie

> 
> I'll expand the commit message to cover this.
>   
>>> +  dmem.events
>>> +	A read-only file that reports the number of times each cgroup
>>> +	has hit its configured memory limits.  The format lists each
>>> +	region on a single line, followed by the event counters::
>>> +
>>> +	  drm/0000:03:00.0/vram0 low 0 max 3
>>> +	  drm/0000:03:00.0/stolen low 0 max 0
>>
>> This isn't a supported file format. Please read the documentation on allowed
>> formats.
> 
> Thanks for catching this. I'll switch dmem.events to nested-keyed format (region low=N max=M).
> 
> Thanks again for the valuable feedback.
> 
> Best regards,
> Hongfu



* Re: [PATCH v3] selftests/cgroup: Adjust cpu test duration based on HZ
From: Michal Koutný @ 2026-06-25  8:23 UTC (permalink / raw)
  To: Joe Simmons-Talbott
  Cc: Tejun Heo, Johannes Weiner, Shuah Khan, cgroups, linux-kselftest,
	linux-kernel
In-Reply-To: <20260624160358.430354-1-joest@redhat.com>

Hi.

On Wed, Jun 24, 2026 at 12:03:57PM -0400, Joe Simmons-Talbott <joest@redhat.com> wrote:
> +/*
> + * Best effort attempt to get the kernel's HZ value from the config.
> + * Return the HZ value if found otherwise return -1 to indicate failure.
> + */
> +static long
> +_get_config_hz(void)

drop underscore from the static function

> +{
> +	long hz = -1;

use the default 1000 here to simplify the callers

> +	FILE *f;
> +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> +
> +	f = popen(cmd, "r");
> +
> +	if (!f)
> +		return hz;
> +
> +	if (fscanf(f, "CONFIG_HZ=%ld", &hz) == EOF)
> +		goto out;
> +
> +out:
> +	pclose(f);
> +	return hz;
> +}
> +
>  /*
>   * This test creates a cgroup with some maximum value within a period, and
>   * verifies that a process in the cgroup is not overscheduled.
> @@ -646,15 +670,21 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
>  static int test_cpucg_max(const char *root)
>  {
>  	int ret = KSFT_FAIL;
> +	long hz = _get_config_hz();
>  	long quota_usec = 1000;
>  	long default_period_usec = 100000; /* cpu.max's default period */
> -	long duration_seconds = 1;
> +	long duration_seconds;
>  
> -	long duration_usec = duration_seconds * USEC_PER_SEC;
> +	long duration_usec;
>  	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
>  	char *cpucg;
>  	char quota_buf[32];
>  
> +	if (hz == -1)
> +		hz = 1000;
> +	duration_seconds = 1000 / hz;
> +	duration_usec = duration_seconds * USEC_PER_SEC;

I'd do the calculation in usecs

	duration_usec = duration_seconds * USEC_PER_SEC * 1000 / hz;

so that the actual duration is more precise (for hz=300, which is the only
HZ option that doesn't divide 1000)
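
To illustrate the precision point, here is a minimal userspace sketch. The
helper names (parse_config_hz, duration_usec_for_hz,
duration_usec_via_seconds) are hypothetical and not part of the patch; the
parser only assumes a line of the form "CONFIG_HZ=<n>" and falls back to
1000 otherwise, as suggested above.

```c
#include <stdio.h>

#define USEC_PER_SEC 1000000L

/*
 * Hypothetical helper: parse a "CONFIG_HZ=<n>" line.
 * Returns 1000 when the line is missing or malformed, so callers
 * do not need a separate failure path.
 */
static long parse_config_hz(const char *line)
{
	long hz = 1000;
	long parsed;

	if (line && sscanf(line, "CONFIG_HZ=%ld", &parsed) == 1 && parsed > 0)
		hz = parsed;
	return hz;
}

/* Scale a 1 second base duration by 1000/hz, computed in usecs. */
static long duration_usec_for_hz(long hz)
{
	const long duration_seconds = 1;

	return duration_seconds * USEC_PER_SEC * 1000 / hz;
}

/* Seconds-first variant: the integer division truncates before scaling. */
static long duration_usec_via_seconds(long hz)
{
	return (1000 / hz) * USEC_PER_SEC;
}
```

With hz=300 the seconds-first variant yields 3000000 usec, while the
usec-based one yields 3333333 usec; for 100, 250 and 1000 both agree.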

All in all, please make the HZ adjustments with less code (since I expect
this will need adjustments for SMP in the future).

Thanks,
Michal



* Re: [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs
From: Waiman Long @ 2026-06-25  5:27 UTC (permalink / raw)
  To: Jing Wu; +Cc: Thomas Gleixner, linux-kernel, rcu, cgroups, Qiliang Yuan
In-Reply-To: <20260624063404.2106807-1-realwujing@gmail.com>

On 6/24/26 2:34 AM, Jing Wu wrote:
> Hi Waiman,
>
> Thomas Gleixner suggested we coordinate, so reaching out directly.
>
> We have been working on a similar feature called Dynamic Housekeeping
> Management (DHM) [1][2][3][4]. The RFC was posted on 2026-02-06, v1 on
> 2026-03-25, and v2 on 2026-04-13 — a week before your series appeared.
> It seems we developed these independently in parallel.
>
> After Thomas's review of DHM v3, we are rebuilding v4 around the
> CPU-by-CPU offline/online hotplug mechanism, which aligns with the
> direction of your series.
>
> There is one key difference in scope worth discussing:
>
>    Your series requires "nohz_full=" to be present at boot (even with
>    an empty CPU list) to opt into runtime updates. DHM targets systems
>    where nohz_full= was never configured at boot — enabling CPU noise
>    isolation purely at runtime without any boot-time setup.
>
>    This requires making the nohz_full infrastructure activatable at
>    runtime for the first time, rather than just extending an already-
>    initialized boot configuration.
>
> Before we start coding v4, a few questions:
>
>    1. Are you planning a v2 of your series? If so, what is your
>       timeline? We want to avoid duplicating effort on the subsystem
>       patches (tick, RCU, genirq).

Yes, I am planning to send out a v2 in a few weeks, depending on whether
I can finish the other work that I am doing right now.


>
>    2. Would you be open to extending your series to cover the
>       "no boot parameter" use case, or do you think it is better kept
>       as a separate series?
Making the v1 series depend on the nohz_full parameter is basically a
shortcut, as some code changes its behavior slightly depending on whether
the nohz_full parameter is set. Making it optional just means adding more
code to enable those paths. It is more work, but doable. I will make that
optional in the next version, but I probably won't have all the needed
code beyond the essentials, and the rest will be handled in a follow-up
patch series.
>
>    3. Are there specific patches in your series where you would welcome
>       our contribution directly?

I have broken down the shutdown callback into separate portions as
suggested by Thomas. The other major change that I am working on is to
shut down only to the CPUHP_AP_OFFLINE state instead of all the way
down to CPUHP_OFFLINE. That will require some adjustments to the
nohz_full related hotplug functions. I have some ideas of what needs to
be done. However, I haven't looked into RCU yet. I know RCU supports
changing the nocb mask for fully offline CPUs; I will need to find out
if it is possible to do that for partially offline CPUs.

The work has been suspended for a while as I have other work to do.
Hopefully I can restart it soon to further refresh my memory and we can
discuss collaboration at that point.

Cheers,
Longman


>
> Happy to collaborate on a unified approach.
>
> [1] DHM RFC (2026-02-06): https://lore.kernel.org/r/20260206-feature-dynamic_isolcpus_dhei-v1-0-00a711eb0c74@gmail.com
> [2] DHM v1  (2026-03-25): https://lore.kernel.org/r/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
> [3] DHM v2  (2026-04-13): https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
> [4] DHM v3  (2026-06-18): https://lore.kernel.org/r/20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com
> [5] Your series v1 (2026-04-20): https://lore.kernel.org/r/20260421030351.281436-1-longman@redhat.com
>
> Jing Wu <realwujing@gmail.com>
> Qiliang Yuan <yuanql9@chinatelecom.cn>
>



* [ANNOUNCE/CFP] Linux Plumbers 2026 Containers and Checkpoint/Restore Microconference
From: Kamalesh Babulal @ 2026-06-25  3:55 UTC (permalink / raw)
  To: cgroups, containers, bpf, linux-fsdevel, linux-api,
	linux-integrity, criu, lxc-devel, fuse-devel
  Cc: Stéphane Graber, Mike Rapoport, Christian Brauner,
	Michal Koutný, Adrian Reber, Kamalesh Babulal

Hello,

We are pleased to announce the Call for Proposals for the Containers and
Checkpoint/Restore Microconference[0] at Linux Plumbers Conference 2026,
taking place in Prague, Czechia, from October 5 to 7, 2026.

This microconference will focus on current work and open problems in
containers, checkpoint/restore, kernel interfaces, and related userspace
tooling. We hope to bring together people working on container
runtimes, CRIU, init systems, distributions, orchestration systems, and
the kernel interfaces that make these pieces work together.

Topics of interest include, but are not limited to:

  - New VFS and syscall interfaces relevant to containers, including
    work around idmapped mounts

  - Closing remaining gaps between cgroup v1 and cgroup v2, and making
    migration easier

  - The growing role of eBPF in container runtimes, observability,
    policy enforcement, and checkpoint/restore

  - Mechanisms for mediating and intercepting increasingly complex
    system calls

  - Lowering the barriers to practical use of user namespaces

  - Attestation, measurement, and other approaches to establishing
    container integrity

  - Better resource-control interfaces and limits for containerized
    workloads

  - Keeping CRIU working smoothly on modern Linux distributions

  - Checkpoint/restore support for GPUs and similar accelerators

  - Restoring FUSE daemons and related userspace services

  - Handling restartable sequences correctly during checkpoint and
    restore

  - Support for newly added kernel features and interfaces

  - Shadow stack support on x86 and arm64

  - Support for madvise(MADV_GUARD_INSTALL) and mseal()

  - pidfd-based checkpoint/restore, including process-exit information

We are also interested in additional topics that may emerge as work
evolves over the coming months. Ongoing development work, operational
experience, unresolved kernel API questions, and cross-project
coordination topics are all welcome.

We encourage you to bring open questions, unresolved issues, or problems
that would benefit from input from others. In your proposal, please
include a short description of the topic, what you would like to
discuss, and what kind of feedback or collaboration would help move the
work forward.

Allocated time per session is expected to be between 15 and 30 minutes.

Please submit proposals through the LPC 2026 abstracts page by August 7:

        https://lpc.events/event/20/abstracts/

Linux Plumbers Conference 2026 will be a hybrid event. While in-person
presentation is preferred to help keep the sessions smooth and
interactive, remote presentation will also be available.

We are looking forward to your proposals and to seeing you in Prague.

[0] https://lpc.events/event/20/contributions/2332/

Thanks,
Containers & Checkpoint/Restart Microconference Team


* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC (permalink / raw)
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

On 2026/6/24 22:23, David Laight wrote:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> On 2026/6/22 16:42, David Laight wrote:
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a good idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, if you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for it sounds perfectly fine to me.
> 
> IIRC currently you have a choice of either:
> 	define               Item that can't be deleted
> 	list_for_each()	     The current item.
> 	list_for_each_safe() The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if you increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite type of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

For ease of discussion, I need to summarize the currently possible
approaches and briefly describe their respective pros and cons,
using the list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used specifically for safe deletion scenarios that do not
need to expose the temporary cursor externally. The code can refer to
the v1 version.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Users need to manually update special cases that use the
      traversal variable of list_for_each_entry_safe; the new
      list_for_each_entry_safe would no longer apply there, so those
      loops would need to be open-coded.
      (2) Because the macro arguments change, all list_for_each_entry_safe
      callers would need to be modified and merged together, making it
      difficult to merge such a large amount of code at once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
making it compatible with the functionality of list_for_each_entry_safe.
The code can refer to the v2 version.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Users need to manually update special cases that use the traversal
      variable of list_for_each_entry; the new list_for_each_entry would no
      longer apply there, so those loops would need to be open-coded. There
      are 15 such cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes, keep the current logic unchanged, and close the current
email discussion.


Which of the six solutions above do people prefer?
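
For reference, the variadic idea behind approaches 3 and 5 can be sketched
in plain userspace C as follows. This is only an illustration under stated
assumptions, not the code from any version of the series: every demo_* and
DEMO_* name is hypothetical, the list helpers are minimal stand-ins for
the kernel ones, and a fixed cursor name replaces the kernel's
__UNIQUE_ID().

```c
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;
}

/* Argument counting, limited to one or two variadic arguments. */
#define DEMO_COUNT_ARGS_(_1, _2, n, ...) n
#define DEMO_COUNT_ARGS(...) DEMO_COUNT_ARGS_(__VA_ARGS__, 2, 1, 0)
#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)

/* (pos, head): the lookahead cursor lives inside the for statement. */
#define demo_for_each_mutable_1(pos, head)				\
	for (struct list_head *demo_next_ = ((pos) = (head)->next)->next; \
	     (pos) != (head);						\
	     (pos) = demo_next_, demo_next_ = (pos)->next)

/* (pos, cur, head): the caller supplies the cursor, as with *_safe(). */
#define demo_for_each_mutable_2(pos, cur, head)				\
	for ((pos) = (head)->next, (cur) = (pos)->next;			\
	     (pos) != (head);						\
	     (pos) = (cur), (cur) = (pos)->next)

/* Dispatch on the number of arguments after pos. */
#define demo_for_each_mutable(pos, ...)					\
	DEMO_CONCAT(demo_for_each_mutable_, DEMO_COUNT_ARGS(__VA_ARGS__)) \
		(pos, __VA_ARGS__)

struct item { int value; struct list_head node; };
#define item_of(ptr) \
	((struct item *)((char *)(ptr) - offsetof(struct item, node)))

/* Remove the even-valued items from 1..5, then sum what is left. */
static int demo_sum_after_removing_even(void)
{
	struct list_head head, *pos, *tmp;
	struct item items[5];
	int i, sum = 0;

	list_init(&head);
	for (i = 0; i < 5; i++) {
		items[i].value = i + 1;
		list_add_tail(&items[i].node, &head);
	}

	/* Short form: no cursor visible at the call site. */
	demo_for_each_mutable(pos, &head) {
		if (item_of(pos)->value % 2 == 0)
			list_del(pos);
	}

	/* Long form: explicit cursor, same semantics as *_safe(). */
	demo_for_each_mutable(pos, tmp, &head)
		sum += item_of(pos)->value;

	return sum;
}
```

Dropping the dispatch macro and keeping only demo_for_each_mutable_1
gives the shape of approaches 2 and 4, where the cursor is always
internal.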

-- 
Thanks
Kaitao Cheng



* [PATCH v3 4/4] blk-cgroup: factor policy pd teardown loop into helper
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability.

No functional change intended.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 57 ++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d06915045bc4..0b28420c108f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1527,10 +1527,35 @@ struct cgroup_subsys io_cgrp_subsys = {
 	.depends_on = 1 << memory_cgrp_id,
 #endif
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+/*
+ * Tear down per-blkg policy data for @pol on @q.
+ */
+static void blkcg_policy_teardown_pds(struct request_queue *q,
+				      const struct blkcg_policy *pol)
+{
+	struct blkcg_gq *blkg;
+
+	list_for_each_entry(blkg, &q->blkg_list, q_node) {
+		struct blkcg *blkcg = blkg->blkcg;
+		struct blkg_policy_data *pd;
+
+		spin_lock(&blkcg->lock);
+		pd = blkg->pd[pol->plid];
+		if (pd) {
+			if (pd->online && pol->pd_offline_fn)
+				pol->pd_offline_fn(pd);
+			pd->online = false;
+			pol->pd_free_fn(pd);
+			WRITE_ONCE(blkg->pd[pol->plid], NULL);
+		}
+		spin_unlock(&blkcg->lock);
+	}
+}
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a gendisk
  * @disk: gendisk of interest
  * @pol: blkcg policy to activate
  *
@@ -1642,25 +1667,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	return ret;
 
 enomem:
 	/* alloc failed, take down everything */
 	spin_lock_irq(&q->queue_lock);
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-		struct blkg_policy_data *pd;
-
-		spin_lock(&blkcg->lock);
-		pd = blkg->pd[pol->plid];
-		if (pd) {
-			if (pd->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(pd);
-			pd->online = false;
-			pol->pd_free_fn(pd);
-			WRITE_ONCE(blkg->pd[pol->plid], NULL);
-		}
-		spin_unlock(&blkcg->lock);
-	}
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	ret = -ENOMEM;
 	goto out;
 }
 EXPORT_SYMBOL_GPL(blkcg_activate_policy);
@@ -1675,11 +1686,10 @@ EXPORT_SYMBOL_GPL(blkcg_activate_policy);
  */
 void blkcg_deactivate_policy(struct gendisk *disk,
 			     const struct blkcg_policy *pol)
 {
 	struct request_queue *q = disk->queue;
-	struct blkcg_gq *blkg;
 	unsigned int memflags;
 
 	if (!blkcg_policy_enabled(q, pol))
 		return;
 
@@ -1688,24 +1698,11 @@ void blkcg_deactivate_policy(struct gendisk *disk,
 
 	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
-
-	list_for_each_entry(blkg, &q->blkg_list, q_node) {
-		struct blkcg *blkcg = blkg->blkcg;
-
-		spin_lock(&blkcg->lock);
-		if (blkg->pd[pol->plid]) {
-			if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
-				pol->pd_offline_fn(blkg->pd[pol->plid]);
-			pol->pd_free_fn(blkg->pd[pol->plid]);
-			blkg->pd[pol->plid] = NULL;
-		}
-		spin_unlock(&blkcg->lock);
-	}
-
+	blkcg_policy_teardown_pds(q, pol);
 	spin_unlock_irq(&q->queue_lock);
 	mutex_unlock(&q->blkcg_mutex);
 
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
-- 
2.51.0



* [PATCH v3 3/4] blk-cgroup: skip dying blkg in blkcg_activate_policy()
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.

T1:                               T2:
                                  blkg_destroy
                                  kill(&blkg->refcnt) // blkg->refcnt=1->0
                                  blkg_release // call_rcu(__blkg_release)
                                  ...
                                  blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf

Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.
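
As a rough illustration of the check, here is a userspace toy model. It
assumes nothing beyond the usual hlist convention that an unhashed node
has a NULL pprev; the toy_* names are hypothetical, an array stands in
for q->blkg_list, and there is no locking, RCU or refcounting.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel hlist types. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void hlist_node_init(struct hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

/* A node is "unhashed" when it is not linked on any list. */
static int hlist_unhashed(const struct hlist_node *n)
{
	return !n->pprev;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink and reinitialize, so hlist_unhashed() becomes true. */
static void hlist_del_init(struct hlist_node *n)
{
	if (hlist_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	hlist_node_init(n);
}

/* Hypothetical, heavily simplified blkg. */
struct toy_blkg {
	struct hlist_node blkcg_node;	/* link on the owning blkcg's list */
	int has_pd;			/* stands in for blkg->pd[pol->plid] */
};

/* Give policy data to every live blkg; skip those being destroyed. */
static int toy_activate_policy(struct toy_blkg *blkgs, size_t nr)
{
	int allocated = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (blkgs[i].has_pd)
			continue;
		if (hlist_unhashed(&blkgs[i].blkcg_node))
			continue;
		blkgs[i].has_pd = 1;
		allocated++;
	}
	return allocated;
}

/* Three blkgs, the middle one already destroyed: expect two allocations. */
static int toy_demo(void)
{
	struct hlist_head blkcg_list = { NULL };
	struct toy_blkg blkgs[3];
	int i;

	for (i = 0; i < 3; i++) {
		hlist_node_init(&blkgs[i].blkcg_node);
		blkgs[i].has_pd = 0;
		hlist_add_head(&blkgs[i].blkcg_node, &blkcg_list);
	}
	hlist_del_init(&blkgs[1].blkcg_node);
	return toy_activate_policy(blkgs, 3);
}
```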

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fd1eed67924b..d06915045bc4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1575,10 +1575,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
 		struct blkg_policy_data *pd;
 
 		if (blkg->pd[pol->plid])
 			continue;
+		if (hlist_unhashed(&blkg->blkcg_node))
+			continue;
 
 		/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
 		if (blkg == pinned_blkg) {
 			pd = pd_prealloc;
 			pd_prealloc = NULL;
-- 
2.51.0



* [PATCH v3 2/4] blk-cgroup: fix race between policy activation and blkg destruction
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Zheng Qixing <zhengqixing@huawei.com>

When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d22a43c545b6..fd1eed67924b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1564,10 +1564,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
 		return -EINVAL;
 
 	if (queue_is_mq(q))
 		memflags = blk_mq_freeze_queue(q);
+
+	mutex_lock(&q->blkcg_mutex);
 retry:
 	spin_lock_irq(&q->queue_lock);
 
 	/* blkg_list is pushed at the head, reverse walk to initialize parents first */
 	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1626,10 +1628,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
 	__set_bit(pol->plid, q->blkcg_pols);
 	ret = 0;
 
 	spin_unlock_irq(&q->queue_lock);
 out:
+	mutex_unlock(&q->blkcg_mutex);
 	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q, memflags);
 	if (pinned_blkg)
 		blkg_put(pinned_blkg);
 	if (pd_prealloc)
-- 
2.51.0



* [PATCH v3 1/4] blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with blkcg_mutex
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel
In-Reply-To: <20260625025739.2459651-1-yukuai@kernel.org>

From: Yu Kuai <yukuai@fygo.io>

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the list
while holding blkcg_mutex.

Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.

Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
 block/blk-cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d2a1f5903f24..d22a43c545b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -567,10 +567,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 	struct blkcg_gq *blkg;
 	int count = BLKG_DESTROY_BATCH_SIZE;
 	int i;
 
 restart:
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		struct blkcg *blkcg = blkg->blkcg;
 
 		if (hlist_unhashed(&blkg->blkcg_node))
@@ -585,10 +586,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 		 * it when a batch of blkgs are destroyed.
 		 */
 		if (!(--count)) {
 			count = BLKG_DESTROY_BATCH_SIZE;
 			spin_unlock_irq(&q->queue_lock);
+			mutex_unlock(&q->blkcg_mutex);
 			cond_resched();
 			goto restart;
 		}
 	}
 
@@ -604,10 +606,11 @@ static void blkg_destroy_all(struct gendisk *disk)
 			__clear_bit(pol->plid, q->blkcg_pols);
 	}
 
 	q->root_blkg = NULL;
 	spin_unlock_irq(&q->queue_lock);
+	mutex_unlock(&q->blkcg_mutex);
 
 	wake_up_var(&q->root_blkg);
 }
 
 static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
-- 
2.51.0



* [PATCH v3 0/4] blk-cgroup: fix blkg list and policy data races
From: Yu Kuai @ 2026-06-25  2:57 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zheng Qixing, Christoph Hellwig, Tang Yizhou,
	Yu Kuai, cgroups, linux-block, linux-kernel

From: Yu Kuai <yukuai@fygo.io>

Hi,

This series fixes races around q->blkg_list and blkg policy data
lifetime.

Patch 1 protects blkg_destroy_all()'s q->blkg_list walk with
blkcg_mutex.

Patches 2-3 fix races between blkcg_activate_policy() and concurrent
blkg destruction.

Patch 4 factors the policy data teardown loop into a helper after the
race fixes.

Changes since v2:
- Rebase on the latest block-7.2 branch.

Changes since v1:
- Drop the BFQ q->blkg_list patch because the current block tree already
  has a stronger fix in commit 17b2d950a3c0 ("block, bfq: protect async
  queue reset with blkcg locks").
- Add Reviewed-by tags from Tang Yizhou.

Yu Kuai (1):
  blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
    blkcg_mutex

Zheng Qixing (3):
  blk-cgroup: fix race between policy activation and blkg destruction
  blk-cgroup: skip dying blkg in blkcg_activate_policy()
  blk-cgroup: factor policy pd teardown loop into helper

 block/blk-cgroup.c | 65 +++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

-- 
2.51.0


* Re: [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Hongfu Li @ 2026-06-25  2:10 UTC (permalink / raw)
  To: tj
  Cc: cgroups, corbet, dev, dri-devel, hannes, lihongfu, linux-doc,
	linux-kernel, mkoutny, mripard, natalie.vock, skhan, hongfu.li
In-Reply-To: <ajwnf0uzT4PMHYZx@slm.duckdns.org>

Hi, Tejun
Thanks for the review comments.

> > Add dmem.events to report hierarchical low/max event counts per DMEM
> > region.  Increment counters on dmem.max allocation failures and
> > dmem.low protection events.  The file is available for non-root cgroups
> > only.
> 
> Please don't double space in descs or comments. Also, maybe it's obvious but
> it'd help if you list why and how this is useful. Why do we want to add
> this?

I'll fix the double spacing in the commit message and comments.

As for the motivation: dmem already exposes per-region limits and current
usage, but not how often those limits actually matter at runtime. Without
event counters, it's hard to tell whether allocation failures come from
this cgroup, a parent limit, or pressure elsewhere in the hierarchy.
dmem.events provides that visibility for tuning dmem.low/dmem.max and
diagnosing recurring device memory pressure.

I'll expand the commit message to cover this.
 
> > +  dmem.events
> > +	A read-only file that reports the number of times each cgroup
> > +	has hit its configured memory limits.  The format lists each
> > +	region on a single line, followed by the event counters::
> > +
> > +	  drm/0000:03:00.0/vram0 low 0 max 3
> > +	  drm/0000:03:00.0/stolen low 0 max 0
> 
> This isn't a supported file format. Please read the documentation on allowed
> formats.

Thanks for catching this. I'll switch dmem.events to nested-keyed format (region low=N max=M).
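
For illustration only, here is a user-space sketch of how one nested keyed
line per region could be formatted. The helper name and its signature are
hypothetical and not taken from the patch; kernel code would emit this
through seq_file rather than snprintf():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one nested keyed line,
 * "<region> low=<N> max=<M>\n", into buf.
 * Returns the snprintf() result (length that would have been written).
 */
static int format_dmem_events_line(char *buf, size_t size, const char *region,
				   unsigned long long low,
				   unsigned long long max)
{
	assert(buf && region && size > 0);
	/* The leading key of a nested keyed line must not contain whitespace. */
	assert(strchr(region, ' ') == NULL);

	return snprintf(buf, size, "%s low=%llu max=%llu\n", region, low, max);
}
```

With the counters from the example above, the vram0 region would read
"drm/0000:03:00.0/vram0 low=0 max=3".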

Thanks again for the valuable feedback.

Best regards,
Hongfu


* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: yu kuai @ 2026-06-25  1:42 UTC (permalink / raw)
  To: Jens Axboe, yukuai, nilay, tom.leiming, bvanassche, tj, josef
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <34d48fb5-4952-4a48-b92a-f189bc3edd0b@kernel.dk>

Hi,

On 2026/6/24 20:43, Jens Axboe wrote:
> On 6/24/26 12:57 AM, yu kuai wrote:
>> Friendly ping ...
>>
>> This set can still be applied cleanly for block-7.2 branch.
> Not sure how you checked that, because patch 3 very much needs some
> manual attention to get applied. I have applied it now.

Thanks!

This was built on top of my other set:
blk-cgroup: fix blkg list and policy data races

I'll rebase and resend this set :)

>
-- 
Thanks,
Kuai


* [PATCH v2] cgroup: Use data_race() for task->flags in task_css_set_check()
From: Guopeng Zhang @ 2026-06-25  1:39 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Guopeng Zhang

From: Guopeng Zhang <zhangguopeng@kylinos.cn>

task_css_set_check() uses rcu_dereference_check() to verify that
task->cgroups can be dereferenced. One accepted condition is that the
task is already exiting, tested by checking PF_EXITING in task->flags.

This check is only part of the CONFIG_PROVE_RCU lockdep predicate, but
the plain read of task->flags can still trigger a KCSAN data-race report
when another task flag bit is updated concurrently. One report, found
during fuzz testing, shows pids_release() reading task->flags through
task_css_set_check() while do_task_dead() sets PF_NOFREEZE:

  KCSAN: data-race in task_css() [inline]
  KCSAN: data-race in pids_release()

  task_css()
  pids_release()
  cgroup_release()
  release_task()
  wait_task_zombie()

  value changed: 0x0040004c -> 0x0040804c

The changed bit is PF_NOFREEZE, not PF_EXITING. PF_EXITING remains set
before and after the update, so the task_css_set_check() condition does
not change. This is not a race on task->cgroups and does not indicate
incorrect pids charging or uncharging.
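
As a quick check of the arithmetic above, here is a small user-space
sketch. The PF_* values are assumed to match include/linux/sched.h, and
the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match include/linux/sched.h; verify against your tree. */
#define PF_EXITING	0x00000004u
#define PF_NOFREEZE	0x00008000u

/* Hypothetical helper: which bits differ between two flag snapshots? */
static uint32_t changed_bits(uint32_t before, uint32_t after)
{
	return before ^ after;
}

/* Hypothetical helper: the PF_EXITING part of the lockdep predicate. */
static int task_is_exiting(uint32_t flags)
{
	return (flags & PF_EXITING) != 0;
}
```

XORing the two values from the report leaves only bit 0x8000 set, and the
PF_EXITING test evaluates the same way on both snapshots.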

tools/memory-model/Documentation/access-marking.txt recommends
data_race() for data-racy loads used only for diagnostic purposes. Use
data_race() here to mark the intended diagnostic-only access.

No functional change intended.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
Changes in v2:
- Use data_race() instead of READ_ONCE() for the diagnostic-only
  CONFIG_PROVE_RCU predicate, as suggested by Tejun.
- Update the changelog to match access-marking.txt guidance.

 include/linux/cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f2aa46a4f871..b905208942bf 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -480,7 +480,7 @@ static inline void cgroup_unlock(void)
 		rcu_read_lock_sched_held() ||				\
 		lockdep_is_held(&cgroup_mutex) ||			\
 		lockdep_is_held(&css_set_lock) ||			\
-		((task)->flags & PF_EXITING) || (__c))
+		(data_race((task)->flags) & PF_EXITING) || (__c))
 #else
 #define task_css_set_check(task, __c)					\
 	rcu_dereference((task)->cgroups)
-- 
2.25.1


* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Yosry Ahmed @ 2026-06-25  0:19 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAKEwX=PT_ABx51--Qv9AAZwkuH+_Wp_TeiUYVQBY=1=SCf1HJA@mail.gmail.com>

> > > +       if (swap_is_vswap(si)) {
> > > +               if (entry != vswap_zswap_load(swpentry)) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> > > +               /*
> > > +                * Allocate physical backing BEFORE decompress - if it fails,
> > > +                * no wasted work. folio_realloc_swap sets vtable to PHYS,
> > > +                * overwriting ZSWAP - the old entry pointer is only held
> > > +                * by the caller now.
> > > +                */
> > > +               phys = folio_realloc_swap(folio);
> > > +               if (!phys.val) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> > > +       } else {
> > > +               tree = swap_zswap_tree(swpentry);
> > > +               if (entry != xa_load(tree, offset)) {
> > > +                       ret = -ENOMEM;
> > > +                       goto out;
> > > +               }
> >
> > There's a lot of divergence in the code (in this patch and previous
> > ones). Seems like a lot of it is to do xarray operations vs vswap
> > operations. I wonder if we can abstract these into helpers, e.g.
> > zswap_tree_store(), zswap_tree_load(), etc. Maybe the name is not the
> > best, but you get the point :)
>
> How about zswap_entry_load() and zswap_entry_store()? :)

Even better!


> > > -       xa_erase(tree, offset);
> > > +       if (!swap_is_vswap(si))
> > > +               xa_erase(tree, offset);
> >
> > Maybe this can also be abstracted into a helper, but I wonder what the
> > corresponding vswap operation would be. I think folio_realloc_swap()
> > will have already "erased" the zswap entry from vswap. Maybe have a
>
> Yup that's the right logic. We already change the backend to physical
> swap slot here, so there's no real "erase".
>
> > vswap helper that will only remove it if it's a zswap entry? We can
> > probably do a lockless check first to make it cheap?
> >
> > It's probably silly to do this, and maybe there's a better way.
> > Generally, I think the code would be easier to follow if we abstract
> > away the xarray vs. vswap stuff into helpers (where it's reasonable).
>
> I'm not entirely sure if its worth it either, yeah. Unlike load and
> store, erase seems a bit asymmetric in the sense that we only need to
> do it for non-vswap cases.

Yeah :/

Maybe just add a comment why no erase is needed for the vswap case.


* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Nhat Pham @ 2026-06-25  0:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <CAO9r8zPXk2eRbVcEMQDTCH1j-w241h189=p04FenAfKAjkkQtA@mail.gmail.com>

On Tue, Jun 23, 2026 at 12:02 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 466f8a182716..5daff7a25f67 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -993,6 +993,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         struct folio *folio;
> >         struct mempolicy *mpol;
> >         struct swap_info_struct *si;
> > +       swp_entry_t phys = {};
> >         int ret = 0;
> >
> >         /* try to allocate swap cache folio */
> > @@ -1000,16 +1001,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         if (!si)
> >                 return -EEXIST;
> >
> > -       /*
> > -        * Vswap entries have no physical backing - writeback would fail
> > -        * and SIGBUS the caller. Bail before we waste a swap-cache folio
> > -        * allocation.
> > -        */
> > -       if (si->flags & SWP_VSWAP) {
> > -               put_swap_device(si);
> > -               return -EINVAL;
> > -       }
> > -
> >         mpol = get_task_policy(current);
> >         folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
> >                                        NO_INTERLEAVE_INDEX);
> > @@ -1028,40 +1019,78 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >         /*
> >          * folio is locked, and the swapcache is now secured against
> >          * concurrent swapping to and from the slot, and concurrent
> > -        * swapoff so we can safely dereference the zswap tree here.
> > -        * Verify that the swap entry hasn't been invalidated and recycled
> > -        * behind our backs, to avoid overwriting a new swap folio with
> > -        * old compressed data. Only when this is successful can the entry
> > -        * be dereferenced.
> > +        * swapoff so we can safely dereference the zswap tree (or vswap
> > +        * vtable) here. Verify that the swap entry hasn't been
> > +        * invalidated and recycled behind our backs, to avoid overwriting
> > +        * a new swap folio with old compressed data. Only when this is
> > +        * successful can the entry be dereferenced.
> >          */
> > -       tree = swap_zswap_tree(swpentry);
> > -       if (entry != xa_load(tree, offset)) {
> > -               ret = -ENOMEM;
> > -               goto out;
> > +       if (swap_is_vswap(si)) {
> > +               if (entry != vswap_zswap_load(swpentry)) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
> > +               /*
> > +                * Allocate physical backing BEFORE decompress - if it fails,
> > +                * no wasted work. folio_realloc_swap sets vtable to PHYS,
> > +                * overwriting ZSWAP - the old entry pointer is only held
> > +                * by the caller now.
> > +                */
> > +               phys = folio_realloc_swap(folio);
> > +               if (!phys.val) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
> > +       } else {
> > +               tree = swap_zswap_tree(swpentry);
> > +               if (entry != xa_load(tree, offset)) {
> > +                       ret = -ENOMEM;
> > +                       goto out;
> > +               }
>
> There's a lot of divergence in the code (in this patch and previous
> ones). Seems like a lot of it is to do xarray operations vs vswap
> operations. I wonder if we can abstract these into helpers, e.g.
> zswap_tree_store(), zswap_tree_load(), etc. Maybe the name is not the
> best, but you get the point :)

How about zswap_entry_load() and zswap_entry_store()? :)
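
Roughly what I have in mind, as a user-space mock rather than real kernel
code. Everything here except the two helper names is a stand-in: the mock
device struct and its slot arrays are hypothetical, and the real helpers
would wrap xa_load()/xa_store() and vswap_zswap_load()/vswap_zswap_store()
instead of indexing arrays:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

struct zswap_entry {
	int id;
};

/* Mock device: one array models the xarray, the other the vswap vtable. */
struct mock_swap_device {
	bool is_vswap;
	struct zswap_entry *xa_slots[NR_SLOTS];
	struct zswap_entry *vtable_slots[NR_SLOTS];
};

/* Pick the backing slot once so callers never branch on the backend. */
static struct zswap_entry **zswap_entry_slot(struct mock_swap_device *si,
					     size_t offset)
{
	assert(si && offset < NR_SLOTS);
	return si->is_vswap ? &si->vtable_slots[offset] : &si->xa_slots[offset];
}

static struct zswap_entry *zswap_entry_load(struct mock_swap_device *si,
					    size_t offset)
{
	return *zswap_entry_slot(si, offset);
}

/* Returns the previous entry (or NULL) so the caller can free it. */
static struct zswap_entry *zswap_entry_store(struct mock_swap_device *si,
					     size_t offset,
					     struct zswap_entry *entry)
{
	struct zswap_entry **slot = zswap_entry_slot(si, offset);
	struct zswap_entry *old = *slot;

	*slot = entry;
	return old;
}
```

With something like this, the writeback path could do a single
zswap_entry_load() check for both cases and keep only the
folio_realloc_swap() call under the vswap branch.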

>
> Here we can then do zswap_tree_load() for both code paths and only the
> folio_realloc_swap() needs to be different for vswap. We can do
> similar cleanups for the load/store paths as well.
>
> >         }
> >
> >         if (!zswap_decompress(entry, folio)) {
> >                 ret = -EIO;
> > +               /*
> > +                * For vswap: folio_realloc_swap already moved the entry
> > +                * out of the vtable. Restore it via vswap_zswap_store so
> > +                * the entry stays tracked (and the just-allocated PHYS
> > +                * slot is freed). For non-vswap: entry is still in the
> > +                * zswap tree.
> > +                */
> > +               if (swap_is_vswap(si) && phys.val)
> > +                       vswap_zswap_store(swpentry, entry);
>
> Should this go in the cleanup path instead (i.e. in the 'out' label?).

Ah, maybe if (ret == -EIO &&)...

>
> >                 goto out;
> >         }
> >
> > -       xa_erase(tree, offset);
> > +       if (!swap_is_vswap(si))
> > +               xa_erase(tree, offset);
>
> Maybe this can also be abstracted into a helper, but I wonder what the
> corresponding vswap operation would be. I think folio_realloc_swap()
> will have already "erased" the zswap entry from vswap. Maybe have a

Yup that's the right logic. We already change the backend to physical
swap slot here, so there's no real "erase".

> vswap helper that will only remove it if it's a zswap entry? We can
> probably do a lockless check first to make it cheap?
>
> It's probably silly to do this, and maybe there's a better way.
> Generally, I think the code would be easier to follow if we abstract
> away the xarray vs. vswap stuff into helpers (where it's reasonable).

I'm not entirely sure if it's worth it either, yeah. Unlike load and
store, erase seems a bit asymmetric in the sense that we only need to
do it for non-vswap cases.


* Re: [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: SeongJae Park @ 2026-06-25  0:10 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: SeongJae Park, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, cgroups,
	linux-kernel, kernel-team
In-Reply-To: <20260624183700.1152742-1-joshua.hahnjy@gmail.com>

On Wed, 24 Jun 2026 11:36:59 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
> last caller of for_each_mem_cgroup back in 2021, and there have not been
> any new callers since. Remove the macro.
> 
> A comment in mem_cgroup_css_online has also been out of date since 2021,
> when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
> code") open-coded the for_each_mem_cgroup iterator. Update the comment.
> 
> Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
> statistics and events") added a second declaration for memcg_events to
> include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
> Let's clean that up too.
> 
> No functional changes intended.

Nice cleanup, thank you!

> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Nhat Pham @ 2026-06-24 23:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajnNWRO7apBq2-kQ@google.com>

On Mon, Jun 22, 2026 at 5:16 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> On Fri, Jun 12, 2026 at 12:37:33PM -0700, Nhat Pham wrote:
> > Build the virtual swap layer on top of the swap-table infrastructure.
> > Virtual swap entries decouple PTE swap entries from physical backing,
> > allowing pages to be compressed by zswap (or detected as zero-filled)
> > without pre-allocating a physical swap slot.
> >
> > This patch only supports zswap and zero-page backends. If zswap_store
> > fails, the page stays dirty in the swap cache (AOP_WRITEPAGE_ACTIVATE)
> > - physical disk backing fallback comes in the next patch. Zswap
> > writeback of vswap-backed entries is also disabled - the shrinker
> > skips when no physical swap pages are available.
> >
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> [..]
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 993406074d58..466f8a182716 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/zsmalloc.h>
> >
> >  #include "swap.h"
> > +#include "vswap.h"
> >  #include "internal.h"
> >
> >  /*********************************
> > @@ -762,7 +763,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> >   * Carries out the common pattern of freeing an entry's zsmalloc allocation,
> >   * freeing the entry itself, and decrementing the number of stored pages.
> >   */
> > -static void zswap_entry_free(struct zswap_entry *entry)
> > +void zswap_entry_free(struct zswap_entry *entry)
> >  {
> >       zswap_lru_del(&zswap_list_lru, entry);
> >       zs_free(entry->pool->zs_pool, entry->handle);
> > @@ -994,16 +995,21 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >       struct swap_info_struct *si;
> >       int ret = 0;
> >
> > +     /* try to allocate swap cache folio */
> >       si = get_swap_device(swpentry);
> >       if (!si)
> >               return -EEXIST;
> >
> > +     /*
> > +      * Vswap entries have no physical backing - writeback would fail
> > +      * and SIGBUS the caller. Bail before we waste a swap-cache folio
> > +      * allocation.
> > +      */
>
> Seems like this comment belongs in the previous patch, and the other
> comment movement is undoing what last patch did.

Yeah, this comment belongs in the first patch. I added it after the
fact but committed it to the second patch.

TBH, the first patch doesn't do much. It just declares a new special
struct swap_info_struct, with some helpers and checks, but it's not
hooked up to any allocation path. Logically it should be squashed into
this patch, but this patch is already 600 LoC, lol.

>
> >       if (si->flags & SWP_VSWAP) {
> >               put_swap_device(si);
> >               return -EINVAL;
> >       }
> >
> > -     /* try to allocate swap cache folio */
> >       mpol = get_task_policy(current);
> >       folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
> >                                      NO_INTERLEAVE_INDEX);
> > @@ -1416,25 +1422,25 @@ static bool zswap_store_page(struct page *page,
> >       if (!zswap_compress(page, entry, pool))
> >               goto compress_failed;
> >
> > -     old = xa_store(swap_zswap_tree(page_swpentry),
> > -                    swp_offset(page_swpentry),
> > -                    entry, GFP_KERNEL);
> > -     if (xa_is_err(old)) {
> > -             int err = xa_err(old);
> > +     if (is_vswap_entry(page_swpentry)) {
> > +             vswap_zswap_store(page_swpentry, entry);
> > +     } else {
> > +             old = xa_store(swap_zswap_tree(page_swpentry),
> > +                            swp_offset(page_swpentry),
> > +                            entry, GFP_KERNEL);
> > +             if (xa_is_err(old)) {
> > +                     int err = xa_err(old);
> > +
> > +                     WARN_ONCE(err != -ENOMEM,
> > +                               "unexpected xarray error: %d\n", err);
> > +                     zswap_reject_alloc_fail++;
> > +                     goto store_failed;
> > +             }
> >
> > -             WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > -             zswap_reject_alloc_fail++;
> > -             goto store_failed;
> > +             if (old)
> > +                     zswap_entry_free(old);
> >       }
> >
> > -     /*
> > -      * We may have had an existing entry that became stale when
> > -      * the folio was redirtied and now the new version is being
> > -      * swapped out. Get rid of the old.
> > -      */
> > -     if (old)
> > -             zswap_entry_free(old);
> > -
> >       /*
> >        * The entry is successfully compressed and stored in the tree, there is
> >        * no further possibility of failure. Grab refs to the pool and objcg,
> > @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
> >       struct mem_cgroup *memcg = NULL;
> >       struct zswap_pool *pool;
> >       bool ret = false;
> > +     bool partial_store = false;
> >       long index;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1524,8 +1531,10 @@ bool zswap_store(struct folio *folio)
> >       for (index = 0; index < nr_pages; ++index) {
> >               struct page *page = folio_page(folio, index);
> >
> > -             if (!zswap_store_page(page, objcg, pool))
> > +             if (!zswap_store_page(page, objcg, pool)) {
> > +                     partial_store = index > 0;
> >                       goto put_pool;
> > +             }
> >       }
> >
> >       if (objcg)
> > @@ -1548,7 +1557,9 @@ bool zswap_store(struct folio *folio)
> >        * offsets corresponding to each page of the folio. Otherwise,
> >        * writeback could overwrite the new data in the swapfile.
> >        */
> > -     if (!ret) {
> > +     if (partial_store && is_vswap_entry(swp))
> > +             folio_release_vswap_backing(folio);
>
> Hmm the above should also only happen in the !ret case, but that's not
> obvious from the code here. I think all of this should go under if
> (!ret), but maybe reverse the polarity to avoid the indentation?

Yeah, that's just me avoiding indentation lol. But yes, it only happens
in the !ret case.

>
>         if (ret)
>                 return ret;
>
>         if (is_vswap_entry(swp)) {
>                 if (partial_store)
>                         folio_release_vswap_backing(folio);
>                 return ret;
>         }
>
>         ...
>
> Alternatively you can move the check_old code for xarray into a helper
> and do:
>
>         if (!ret) {
>                 if (is_vswap_entry(swp)) {
>                         if (partial_store)
>                                 folio_release_vswap_backing(folio);
>                 } else {
>                         zswap_free_old_xa_entries(swp, nr_pages)
>                 }
>         }

Yup! I can switch to this if you think it's cleaner.

>
> Also, I think you can probably drop partial_store and check the index
> directly here.

Ah yeah. That's true!

>
> > +     else if (!ret && !is_vswap_entry(swp)) {
> >               unsigned type = swp_type(swp);
> >               pgoff_t offset = swp_offset(swp);
> >               struct zswap_entry *entry;
> > @@ -1588,8 +1599,7 @@ bool zswap_store(struct folio *folio)
> >  int zswap_load(struct folio *folio)
> >  {
> >       swp_entry_t swp = folio->swap;
> > -     pgoff_t offset = swp_offset(swp);
> > -     struct xarray *tree = swap_zswap_tree(swp);
> > +     struct swap_info_struct *si = __swap_entry_to_info(swp);
> >       struct zswap_entry *entry;
> >
> >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > @@ -1599,16 +1609,25 @@ int zswap_load(struct folio *folio)
> >               return -ENOENT;
> >
> >       /*
> > -      * Large folios should not be swapped in while zswap is being used, as
> > -      * they are not properly handled. Zswap does not properly load large
> > -      * folios, and a large folio may only be partially in zswap.
> > +      * zswap_load() does not support large folios. For non-vswap
> > +      * entries this is unexpected on the swapin path: WARN and
> > +      * sigbus. For vswap entries __swap_cache_add_check() has already
> > +      * filtered out ZSWAP-backed THPs under the cluster lock, so the
> > +      * large folio here is zero- or phys-backed; return -ENOENT to
> > +      * fall through to the phys/zero IO path.
>
> Hmm should we start simple and avoid THP swapin for vswap initially?
>
> IIUC, it isn't really vswap specific. Even without vswap, it's possible
> that an entire folio is on-disk, not in zswap, in which case THP swap
> should be allowed.
>
> I assume it's not common for zswap to be enabled and an entire THP worth
> of pages are not in zswap, so maybe we can add this later?

I was thinking of removing it altogether haha. Are we even doing THP
swap in for non-sync IO devices?

if (!folio) {
    /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
    if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
        folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
[...]
else
    folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);

So I guess it's primarily zram that does THP swap in here? On
non-SWP_SYNCHRONOUS_IO devices, it seems like we only do "THP swapin" if
we catch the page in swap cache (minor page fault). :) Will zram users
like vswap?

OTOH, zswap might be getting THP zswap-in support soon, so it's not
just the zram backend that cares about these kinds of checks? :)

Or maybe I can keep it, but separate it from this big patch to make it
easier to review :) Lemme play with it.

>
> >        */
> > -     if (WARN_ON_ONCE(folio_test_large(folio))) {
> > -             folio_unlock(folio);
> > -             return -EINVAL;
> > +     if (folio_test_large(folio)) {
> > +             if (WARN_ON_ONCE(!swap_is_vswap(si))) {
> > +                     folio_unlock(folio);
> > +                     return -EINVAL;
> > +             }
> > +             return -ENOENT;
> >       }
> >
> > -     entry = xa_load(tree, offset);
> > +     if (swap_is_vswap(si))
> > +             entry = vswap_zswap_load(swp);
> > +     else
> > +             entry = xa_load(swap_zswap_tree(swp), swp_offset(swp));
> >       if (!entry)
> >               return -ENOENT;
> >
> > @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
> >       if (entry->objcg)
> >               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >
> > -     /*
> > -      * We are reading into the swapcache, invalidate zswap entry.
> > -      * The swapcache is the authoritative owner of the page and
> > -      * its mappings, and the pressure that results from having two
> > -      * in-memory copies outweighs any benefits of caching the
> > -      * compression work.
> > -      */
> >       folio_mark_dirty(folio);
> > -     xa_erase(tree, offset);
> > -     zswap_entry_free(entry);
> > +
> > +     if (swap_is_vswap(si)) {
> > +             folio_release_vswap_backing(folio);
>
> Is there any advantage to calling folio_release_vswap_backing() over
> zswap_entry_free()? Seems like __vswap_release_backing() ends up just
> calling zswap_entry_free() -- and I don't see any vswap-specific state
> being cleaned up.
>
> I wonder if the zswap code should call zswap_entry_free() directly? Same
> goes for the call in zswap_store() above.

Mostly just to avoid repeating the vtable lookup-and-lock and whatnot. :)
The pattern is repeated a third time in swapoff once I allow phys swap
to be a backend of vswap in the next patch, so I figured I should
probably add a helper.

>
> > +     } else {
> > +             xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> > +             zswap_entry_free(entry);
> > +     }
> >
> >       folio_unlock(folio);
> >       return 0;
> > --
> > 2.53.0-Meta
> >


* Re: [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-06-24 23:06 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups, linux-kernel,
	Aaron Tomlin, Guopeng Zhang, Ridong Chen
In-Reply-To: <ajutWBoJqkhktkvX@localhost.localdomain>

On 6/24/26 11:45 AM, Michal Koutný wrote:
> Hello Waiman.
>
> On Mon, Jun 01, 2026 at 10:32:03PM -0400, Waiman Long <longman@redhat.com> wrote:
>> This problem is less an issue when enabling the cpuset controller as all
>> the newly created child cpusets will have exactly the same set of CPUs
>> and memory nodes except when deadline tasks are involved in migration
>> as the deadline task accounting data can be off.
>>
>> It can be more problematic when the cpuset controller is disabled as
>> their set of CPUs and memory nodes may differ from their parent or with
>> the moving of multi-threaded process from different threaded cgroups.
> When I generalize that it can be an issue for any threaded controller
> that somehow relies on the _difference_ between old and new thread
> membership.
>
> So I checked some: pids and perf_events look alright (no
> diff-dependency) but I noticed the very same issue is tackled in
> sched_change_group/scx_cgroup_move_task and that there is a member
> inside task_struct allocated for this state tracking already:
>    task_struct::scx::cgrp_moving_from
>
>> Fix that by tracking the set of source (old) and destination cpusets
>> in singly linked lists and iterating them all to properly update the
>> internal data. Also keep the current cs and oldcs variables up-to-date
>> with the css and task iterators.
> So there would be more than a single use for something conceptually
> like:
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 004e6d56a499a..740c02f220c75 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1326,6 +1326,9 @@ struct task_struct {
>   #ifdef CONFIG_PREEMPT_RT
>          struct llist_node               cg_dead_lnode;
>   #endif /* CONFIG_PREEMPT_RT */
> +#ifdef CONFIG_CGROUPS_MOVING_FROM
> +       struct cgroup                   *cgrp_moving_from;
> +#endif
>   #endif /* CONFIG_CGROUPS */
>   #ifdef CONFIG_X86_CPU_RESCTRL
>          u32                             closid;
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 1a3af2ea2a794..5b63afe83f333 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -240,9 +240,6 @@ struct sched_ext_entity {
>          bool                    disallow;       /* reject switching into SCX */
>   
>          /* cold fields */
> -#ifdef CONFIG_EXT_GROUP_SCHED
> -       struct cgroup           *cgrp_moving_from;
> -#endif
>          struct list_head        tasks_node;
>   };
>   
> diff --git a/init/Kconfig b/init/Kconfig
> index 2937c4d308aec..d7e7d4477f862 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1186,6 +1186,7 @@ config EXT_GROUP_SCHED
>          depends on SCHED_CLASS_EXT && CGROUP_SCHED
>          select GROUP_SCHED_WEIGHT
>          select GROUP_SCHED_BANDWIDTH
> +       select CGROUPS_MOVING_FROM
>          default y
>   
>   endif #CGROUP_SCHED
> @@ -1288,6 +1289,7 @@ config CPUSETS
>          depends on SMP
>          select UNION_FIND
>          select CPU_ISOLATION
> +       select CGROUPS_MOVING_FROM
>          help
>            This option will let you create and manage CPUSETs which
>            allow dynamically partitioning a system into sets of CPUs and
>
> I think this could simplify the before-after state tracking generally,
> WDYT?

I had actually introduced a new task_struct field in an early version to
track the old cpuset to handle memory migration. However, Chen Ridong
showed me that we may not really need such granular detail, so I dropped
it in the newer versions. Also, sharing a common field between cpuset and
sched_ext can introduce complications, as we have to make sure that we
don't step on each other.

Thanks for the suggestion anyway. I will reconsider it if it turns out
that we really need such information to do the right thing.

Cheers,
Longman


* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Nhat Pham @ 2026-06-24 22:41 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <ajnQxMY0W3VGyAUE@google.com>

On Mon, Jun 22, 2026 at 5:18 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> [..]
> > @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
> >       if (entry->objcg)
> >               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >
> > -     /*
> > -      * We are reading into the swapcache, invalidate zswap entry.
> > -      * The swapcache is the authoritative owner of the page and
> > -      * its mappings, and the pressure that results from having two
> > -      * in-memory copies outweighs any benefits of caching the
> > -      * compression work.
> > -      */
>
> Forgot to ask, is dropping this comment intentional?

Oops. Let me restore it.

>
> >       folio_mark_dirty(folio);
> > -     xa_erase(tree, offset);
> > -     zswap_entry_free(entry);
> > +
> > +     if (swap_is_vswap(si)) {
> > +             folio_release_vswap_backing(folio);
> > +     } else {
> > +             xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> > +             zswap_entry_free(entry);
> > +     }
> >
> >       folio_unlock(folio);
> >       return 0;
> > --
> > 2.53.0-Meta
> >

* Re: [syzbot] [cgroups?] INFO: task hung in cgroup_subtree_control_write (2)
From: Tejun Heo @ 2026-06-24 22:34 UTC (permalink / raw)
  To: syzbot+bb2e19a1190a556c01b1
  Cc: cgroups, hannes, linux-kernel, mkoutny, syzkaller-bugs
In-Reply-To: <6a2fb248.8812e0fc.3c3fa4.001a.GAE@google.com>

I tried to reproduce this locally with the syz reproducer on a matching
PREEMPT_RT + KASAN build and could not trigger it, including looping the
minimized reproducer with the matched controller set and high concurrency
over many VM-hours.

Thanks.
--
tejun

* Re: [PATCH] cgroup: Fix a typo of the function name in comment
From: Tejun Heo @ 2026-06-24 21:13 UTC (permalink / raw)
  To: Zenghui Yu; +Cc: Johannes Weiner, Michal Koutný, cgroups, linux-kernel
In-Reply-To: <20260622110708.15593-1-zenghui.yu@linux.dev>

Hello,

Applied to cgroup/for-7.3.

Thanks.

--
tejun
