cgroups.vger.kernel.org archive mirror
* [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
@ 2025-05-20  3:15 Zhongkun He
  2025-05-20 13:13 ` kernel test robot
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Zhongkun He @ 2025-05-20  3:15 UTC (permalink / raw)
  To: tj, hannes, longman; +Cc: cgroups, linux-kernel, muchun.song, Zhongkun He

Setting cpuset.mems in cgroup v2 can trigger memory
migration in cpuset. This behavior is fine for newly
created cgroups, but it can cause issues for existing
cgroups. In our scenario, modifying the cpuset.mems
setting during peak times frequently leads to noticeable
service latency or stuttering.

It is important for cpus and memory to have consistent
behavior, but the migration does cause issues at times,
so we would like to have a flexible option.

This idea comes from the non-blocking limit setting
option in the memory controller:

https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
 kernel/cgroup/cpuset.c                  | 11 +++++++++++
 2 files changed, 18 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..d9e8e2a770af 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2408,6 +2408,13 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	If cpuset.mems is opened with O_NONBLOCK then the migration is
+	bypassed. This is useful for admin processes that need to adjust
+	the cpuset.mems dynamically without blocking. However, there is
+	a risk that previously allocated pages are not within the new
+	cpuset.mems range, which may be altered by move_pages syscall or
+	numa_balance.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 24b70ea3e6ce..2a0867e0c6d2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3208,7 +3208,18 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		retval = update_exclusive_cpumask(cs, trialcs, buf);
 		break;
 	case FILE_MEMLIST:
+		bool skip_migrate_once = false;
+
+		if ((of->file->f_flags & O_NONBLOCK) &&
+			is_memory_migrate(cs) &&
+			!cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 0))
+			skip_migrate_once = true;
+
 		retval = update_nodemask(cs, trialcs, buf);
+
+		/* Restore the migrate flag */
+		if (skip_migrate_once)
+			cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 1);
 		break;
 	default:
 		retval = -EINVAL;
-- 
2.39.5
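
As a sketch of how the proposed interface would be used (the cgroup path
in the usage note is a placeholder, and the migration-skipping behavior
only exists with this patch applied), an admin tool would open
cpuset.mems with O_NONBLOCK before writing the new node list:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative sketch only: with the proposed patch applied, opening
 * cpuset.mems with O_NONBLOCK makes the subsequent write update the
 * allowed node mask without triggering synchronous page migration.
 * The open/write pattern itself works on any writable file.
 */
static ssize_t write_mems_nonblocking(const char *path, const char *mems)
{
	int fd = open(path, O_WRONLY | O_NONBLOCK);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, mems, strlen(mems));
	close(fd);
	return ret;
}
```

e.g. write_mems_nonblocking("/sys/fs/cgroup/mypod/cpuset.mems", "0-1");
the write succeeds either way, only the migration step is skipped.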



* Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-20  3:15 [PATCH] cpuset: introduce non-blocking cpuset.mems setting option Zhongkun He
@ 2025-05-20 13:13 ` kernel test robot
  2025-05-20 13:25 ` kernel test robot
  2025-05-20 13:34 ` Waiman Long
  2 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-05-20 13:13 UTC (permalink / raw)
  To: Zhongkun He, tj, hannes, longman
  Cc: oe-kbuild-all, cgroups, linux-kernel, muchun.song, Zhongkun He

Hi Zhongkun,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on linus/master v6.15-rc7 next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zhongkun-He/cpuset-introduce-non-blocking-cpuset-mems-setting-option/20250520-111737
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250520031552.1931598-1-hezhongkun.hzk%40bytedance.com
patch subject: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
config: sparc64-randconfig-001-20250520 (https://download.01.org/0day-ci/archive/20250520/202505202106.sXzGXeU4-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505202106.sXzGXeU4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505202106.sXzGXeU4-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/cpuset.c: In function 'cpuset_write_resmask':
>> kernel/cgroup/cpuset.c:3246:3: error: a label can only be part of a statement and a declaration is not a statement
      bool skip_migrate_once = false;
      ^~~~
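
The robot's error reflects a C language rule (pre-C23): a label such as
`case` must be followed by a statement, and a declaration is not a
statement. The usual fix is to wrap the case body in braces so the
declaration sits inside a compound statement. A minimal standalone
illustration (simplified names, not the actual kernel code):

```c
/*
 * Pre-C23, "case 1: int x = ...;" is invalid because the label must be
 * followed by a statement, and a declaration is not one. Braces turn
 * the case body into a compound statement, where declarations are
 * allowed.
 */
static int classify(int op)
{
	int retval;

	switch (op) {
	case 1: {			/* braces make the declaration legal */
		int skip_migrate_once = 1;

		retval = skip_migrate_once;
		break;
	}
	default:
		retval = 0;
		break;
	}
	return retval;
}
```

For the patch itself, the equivalent fix would be enclosing the
FILE_MEMLIST case body in braces, or hoisting the declaration to the
top of the function.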


vim +3246 kernel/cgroup/cpuset.c

  3215	
  3216	/*
  3217	 * Common handling for a write to a "cpus" or "mems" file.
  3218	 */
  3219	ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
  3220					    char *buf, size_t nbytes, loff_t off)
  3221	{
  3222		struct cpuset *cs = css_cs(of_css(of));
  3223		struct cpuset *trialcs;
  3224		int retval = -ENODEV;
  3225	
  3226		buf = strstrip(buf);
  3227		cpus_read_lock();
  3228		mutex_lock(&cpuset_mutex);
  3229		if (!is_cpuset_online(cs))
  3230			goto out_unlock;
  3231	
  3232		trialcs = alloc_trial_cpuset(cs);
  3233		if (!trialcs) {
  3234			retval = -ENOMEM;
  3235			goto out_unlock;
  3236		}
  3237	
  3238		switch (of_cft(of)->private) {
  3239		case FILE_CPULIST:
  3240			retval = update_cpumask(cs, trialcs, buf);
  3241			break;
  3242		case FILE_EXCLUSIVE_CPULIST:
  3243			retval = update_exclusive_cpumask(cs, trialcs, buf);
  3244			break;
  3245		case FILE_MEMLIST:
> 3246			bool skip_migrate_once = false;
  3247	
  3248			if ((of->file->f_flags & O_NONBLOCK) &&
  3249				is_memory_migrate(cs) &&
  3250				!cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 0))
  3251				skip_migrate_once = true;
  3252	
  3253			retval = update_nodemask(cs, trialcs, buf);
  3254	
  3255			/* Restore the migrate flag */
  3256			if (skip_migrate_once)
  3257				cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 1);
  3258			break;
  3259		default:
  3260			retval = -EINVAL;
  3261			break;
  3262		}
  3263	
  3264		free_cpuset(trialcs);
  3265		if (force_sd_rebuild)
  3266			rebuild_sched_domains_locked();
  3267	out_unlock:
  3268		mutex_unlock(&cpuset_mutex);
  3269		cpus_read_unlock();
  3270		flush_workqueue(cpuset_migrate_mm_wq);
  3271		return retval ?: nbytes;
  3272	}
  3273	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-20  3:15 [PATCH] cpuset: introduce non-blocking cpuset.mems setting option Zhongkun He
  2025-05-20 13:13 ` kernel test robot
@ 2025-05-20 13:25 ` kernel test robot
  2025-05-20 13:34 ` Waiman Long
  2 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-05-20 13:25 UTC (permalink / raw)
  To: Zhongkun He, tj, hannes, longman
  Cc: llvm, oe-kbuild-all, cgroups, linux-kernel, muchun.song,
	Zhongkun He

Hi Zhongkun,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on linus/master v6.15-rc7 next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zhongkun-He/cpuset-introduce-non-blocking-cpuset-mems-setting-option/20250520-111737
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/20250520031552.1931598-1-hezhongkun.hzk%40bytedance.com
patch subject: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
config: s390-randconfig-002-20250520 (https://download.01.org/0day-ci/archive/20250520/202505202112.tmU9BTzA-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250520/202505202112.tmU9BTzA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505202112.tmU9BTzA-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/cgroup/cpuset.c:3246:3: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
    3246 |                 bool skip_migrate_once = false;
         |                 ^
   1 warning generated.


vim +3246 kernel/cgroup/cpuset.c

  3215	
  3216	/*
  3217	 * Common handling for a write to a "cpus" or "mems" file.
  3218	 */
  3219	ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
  3220					    char *buf, size_t nbytes, loff_t off)
  3221	{
  3222		struct cpuset *cs = css_cs(of_css(of));
  3223		struct cpuset *trialcs;
  3224		int retval = -ENODEV;
  3225	
  3226		buf = strstrip(buf);
  3227		cpus_read_lock();
  3228		mutex_lock(&cpuset_mutex);
  3229		if (!is_cpuset_online(cs))
  3230			goto out_unlock;
  3231	
  3232		trialcs = alloc_trial_cpuset(cs);
  3233		if (!trialcs) {
  3234			retval = -ENOMEM;
  3235			goto out_unlock;
  3236		}
  3237	
  3238		switch (of_cft(of)->private) {
  3239		case FILE_CPULIST:
  3240			retval = update_cpumask(cs, trialcs, buf);
  3241			break;
  3242		case FILE_EXCLUSIVE_CPULIST:
  3243			retval = update_exclusive_cpumask(cs, trialcs, buf);
  3244			break;
  3245		case FILE_MEMLIST:
> 3246			bool skip_migrate_once = false;
  3247	
  3248			if ((of->file->f_flags & O_NONBLOCK) &&
  3249				is_memory_migrate(cs) &&
  3250				!cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 0))
  3251				skip_migrate_once = true;
  3252	
  3253			retval = update_nodemask(cs, trialcs, buf);
  3254	
  3255			/* Restore the migrate flag */
  3256			if (skip_migrate_once)
  3257				cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 1);
  3258			break;
  3259		default:
  3260			retval = -EINVAL;
  3261			break;
  3262		}
  3263	
  3264		free_cpuset(trialcs);
  3265		if (force_sd_rebuild)
  3266			rebuild_sched_domains_locked();
  3267	out_unlock:
  3268		mutex_unlock(&cpuset_mutex);
  3269		cpus_read_unlock();
  3270		flush_workqueue(cpuset_migrate_mm_wq);
  3271		return retval ?: nbytes;
  3272	}
  3273	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-20  3:15 [PATCH] cpuset: introduce non-blocking cpuset.mems setting option Zhongkun He
  2025-05-20 13:13 ` kernel test robot
  2025-05-20 13:25 ` kernel test robot
@ 2025-05-20 13:34 ` Waiman Long
  2025-05-21  2:35   ` [External] " Zhongkun He
  2 siblings, 1 reply; 23+ messages in thread
From: Waiman Long @ 2025-05-20 13:34 UTC (permalink / raw)
  To: Zhongkun He, tj, hannes; +Cc: cgroups, linux-kernel, muchun.song

On 5/19/25 11:15 PM, Zhongkun He wrote:
> Setting the cpuset.mems in cgroup v2 can trigger memory
> migrate in cpuset. This behavior is fine for newly created
> cgroups but it can cause issues for the existing cgroups.
> In our scenario, modifying the cpuset.mems setting during
> peak times frequently leads to noticeable service latency
> or stuttering.
>
> It is important to have a consistent set of behavior for
> both cpus and memory. But it does cause issues at times,
> so we would hope to have a flexible option.
>
> This idea is from the non-blocking limit setting option in
> memory control.
>
> https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/
>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>   Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
>   kernel/cgroup/cpuset.c                  | 11 +++++++++++
>   2 files changed, 18 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1a16ce68a4d7..d9e8e2a770af 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2408,6 +2408,13 @@ Cpuset Interface Files
>   	a need to change "cpuset.mems" with active tasks, it shouldn't
>   	be done frequently.
>   
> +	If cpuset.mems is opened with O_NONBLOCK then the migration is
> +	bypassed. This is useful for admin processes that need to adjust
> +	the cpuset.mems dynamically without blocking. However, there is
> +	a risk that previously allocated pages are not within the new
> +	cpuset.mems range, which may be altered by move_pages syscall or
> +	numa_balance.
> +
>     cpuset.mems.effective
>   	A read-only multiple values file which exists on all
>   	cpuset-enabled cgroups.
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 24b70ea3e6ce..2a0867e0c6d2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3208,7 +3208,18 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   		retval = update_exclusive_cpumask(cs, trialcs, buf);
>   		break;
>   	case FILE_MEMLIST:
> +		bool skip_migrate_once = false;
> +
> +		if ((of->file->f_flags & O_NONBLOCK) &&
> +			is_memory_migrate(cs) &&
> +			!cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 0))
> +			skip_migrate_once = true;
> +
>   		retval = update_nodemask(cs, trialcs, buf);
> +
> +		/* Restore the migrate flag */
> +		if (skip_migrate_once)
> +			cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 1);
>   		break;
>   	default:
>   		retval = -EINVAL;

I would prefer to temporarily make is_memory_migrate() helper return 
false by also checking an internal variable, for example, instead of 
messing with the cpuset flags.

Cheers,
Longman
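
A rough user-space sketch of that suggestion (the struct and names below
are simplified stand-ins for the kernel's, not the actual
implementation): is_memory_migrate() consults an internal one-shot
variable before the user-visible flag, so the flag itself is never
touched:

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel's cpuset flag machinery. */
#define CS_MEMORY_MIGRATE_BIT	0

struct demo_cpuset {
	unsigned long flags;
	bool skip_migration_once;	/* internal: suppress migration once */
};

/*
 * Returns 0 while the one-shot suppression is armed, regardless of the
 * user-visible memory_migrate flag, so callers skip migration without
 * the flag ever changing value.
 */
static int demo_is_memory_migrate(const struct demo_cpuset *cs)
{
	if (cs->skip_migration_once)
		return 0;
	return !!(cs->flags & (1UL << CS_MEMORY_MIGRATE_BIT));
}
```

This is essentially the shape the v2 patch later takes with its
skip_migration_once field.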



* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-20 13:34 ` Waiman Long
@ 2025-05-21  2:35   ` Zhongkun He
  2025-05-21 17:14     ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-05-21  2:35 UTC (permalink / raw)
  To: Waiman Long; +Cc: tj, hannes, cgroups, linux-kernel, muchun.song

On Tue, May 20, 2025 at 9:35 PM Waiman Long <llong@redhat.com> wrote:
>
> On 5/19/25 11:15 PM, Zhongkun He wrote:
> > Setting the cpuset.mems in cgroup v2 can trigger memory
> > migrate in cpuset. This behavior is fine for newly created
> > cgroups but it can cause issues for the existing cgroups.
> > In our scenario, modifying the cpuset.mems setting during
> > peak times frequently leads to noticeable service latency
> > or stuttering.
> >
> > It is important to have a consistent set of behavior for
> > both cpus and memory. But it does cause issues at times,
> > so we would hope to have a flexible option.
> >
> > This idea is from the non-blocking limit setting option in
> > memory control.
> >
> > https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/
> >
> > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > ---
> >   Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
> >   kernel/cgroup/cpuset.c                  | 11 +++++++++++
> >   2 files changed, 18 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 1a16ce68a4d7..d9e8e2a770af 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2408,6 +2408,13 @@ Cpuset Interface Files
> >       a need to change "cpuset.mems" with active tasks, it shouldn't
> >       be done frequently.
> >
> > +     If cpuset.mems is opened with O_NONBLOCK then the migration is
> > +     bypassed. This is useful for admin processes that need to adjust
> > +     the cpuset.mems dynamically without blocking. However, there is
> > +     a risk that previously allocated pages are not within the new
> > +     cpuset.mems range, which may be altered by move_pages syscall or
> > +     numa_balance.
> > +
> >     cpuset.mems.effective
> >       A read-only multiple values file which exists on all
> >       cpuset-enabled cgroups.
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 24b70ea3e6ce..2a0867e0c6d2 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3208,7 +3208,18 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> >               retval = update_exclusive_cpumask(cs, trialcs, buf);
> >               break;
> >       case FILE_MEMLIST:
> > +             bool skip_migrate_once = false;
> > +
> > +             if ((of->file->f_flags & O_NONBLOCK) &&
> > +                     is_memory_migrate(cs) &&
> > +                     !cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 0))
> > +                     skip_migrate_once = true;
> > +
> >               retval = update_nodemask(cs, trialcs, buf);
> > +
> > +             /* Restore the migrate flag */
> > +             if (skip_migrate_once)
> > +                     cpuset_update_flag(CS_MEMORY_MIGRATE, cs, 1);
> >               break;
> >       default:
> >               retval = -EINVAL;
>
> I would prefer to temporarily make is_memory_migrate() helper return
> false by also checking an internal variable, for example, instead of
> messing with the cpuset flags.
>

Sounds reasonable, thanks for the feedback. I'll give it a try later.

> Cheers,
> Longman
>


* [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
@ 2025-05-21  3:45 Zhongkun He
  2025-05-21 17:15 ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-05-21  3:45 UTC (permalink / raw)
  To: tj, hannes, longman; +Cc: cgroups, linux-kernel, muchun.song, Zhongkun He

Setting cpuset.mems in cgroup v2 can trigger memory
migration in cpuset. This behavior is fine for newly
created cgroups, but it can cause issues for existing
cgroups. In our scenario, modifying the cpuset.mems
setting during peak times frequently leads to noticeable
service latency or stuttering.

It is important for cpus and memory to have consistent
behavior, but the migration does cause issues at times,
so we would like to have a flexible option.

This idea comes from the non-blocking limit setting
option in the memory controller:

https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
 kernel/cgroup/cpuset-internal.h         | 6 ++++++
 kernel/cgroup/cpuset.c                  | 7 +++++++
 3 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..d9e8e2a770af 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2408,6 +2408,13 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	If cpuset.mems is opened with O_NONBLOCK then the migration is
+	bypassed. This is useful for admin processes that need to adjust
+	the cpuset.mems dynamically without blocking. However, there is
+	a risk that previously allocated pages are not within the new
+	cpuset.mems range, which may be altered by move_pages syscall or
+	numa_balance.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index 383963e28ac6..5686bb08c4fe 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -162,6 +162,9 @@ struct cpuset {
 	/* partition root state */
 	int partition_root_state;
 
+	/* Do not migrate memory when modifying cpuset.mems this time */
+	bool skip_migration_once;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
@@ -227,6 +230,9 @@ static inline int is_sched_load_balance(const struct cpuset *cs)
 
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
+	if (cs->skip_migration_once)
+		return 0;
+
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 24b70ea3e6ce..f43d7b291cde 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3208,7 +3208,14 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 		retval = update_exclusive_cpumask(cs, trialcs, buf);
 		break;
 	case FILE_MEMLIST:
+		if (of->file->f_flags & O_NONBLOCK)
+			cs->skip_migration_once = true;
+
 		retval = update_nodemask(cs, trialcs, buf);
+
+		/* Restore skip_migration */
+		if (cs->skip_migration_once)
+			cs->skip_migration_once = false;
 		break;
 	default:
 		retval = -EINVAL;
-- 
2.39.5



* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-21  2:35   ` [External] " Zhongkun He
@ 2025-05-21 17:14     ` Tejun Heo
  2025-05-22  3:37       ` Zhongkun He
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-05-21 17:14 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Waiman Long, hannes, cgroups, linux-kernel, muchun.song

On Wed, May 21, 2025 at 10:35:57AM +0800, Zhongkun He wrote:
> On Tue, May 20, 2025 at 9:35 PM Waiman Long <llong@redhat.com> wrote:
> >
> > On 5/19/25 11:15 PM, Zhongkun He wrote:
> > > Setting the cpuset.mems in cgroup v2 can trigger memory
> > > migrate in cpuset. This behavior is fine for newly created
> > > cgroups but it can cause issues for the existing cgroups.
> > > In our scenario, modifying the cpuset.mems setting during
> > > peak times frequently leads to noticeable service latency
> > > or stuttering.
> > >
> > > It is important to have a consistent set of behavior for
> > > both cpus and memory. But it does cause issues at times,
> > > so we would hope to have a flexible option.
> > >
> > > This idea is from the non-blocking limit setting option in
> > > memory control.
> > >
> > > https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/
> > >
> > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > > ---
> > >   Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
> > >   kernel/cgroup/cpuset.c                  | 11 +++++++++++
> > >   2 files changed, 18 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 1a16ce68a4d7..d9e8e2a770af 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -2408,6 +2408,13 @@ Cpuset Interface Files
> > >       a need to change "cpuset.mems" with active tasks, it shouldn't
> > >       be done frequently.
> > >
> > > +     If cpuset.mems is opened with O_NONBLOCK then the migration is
> > > +     bypassed. This is useful for admin processes that need to adjust
> > > +     the cpuset.mems dynamically without blocking. However, there is
> > > +     a risk that previously allocated pages are not within the new
> > > +     cpuset.mems range, which may be altered by move_pages syscall or
> > > +     numa_balance.

I don't think this is a good idea. O_NONBLOCK means "don't wait", not "skip
this".

Thanks.

-- 
tejun
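
For context on that objection: conventional O_NONBLOCK semantics make an
operation that would have to wait fail immediately with EAGAIN so the
caller can retry, rather than silently skipping part of the work. A
small demonstration on a pipe:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Reading from an empty non-blocking pipe does not "skip" the read: it
 * fails with EAGAIN so the caller can retry later. Returns the errno
 * observed (EAGAIN expected), or -1 on setup failure.
 */
static int empty_pipe_read_errno(void)
{
	int fds[2];
	char buf[8];
	ssize_t n;
	int err;

	if (pipe(fds) < 0)
		return -1;
	fcntl(fds[0], F_SETFL, O_NONBLOCK);	/* non-blocking read end */

	errno = 0;
	n = read(fds[0], buf, sizeof(buf));	/* no data has been written */
	err = (n < 0) ? errno : 0;

	close(fds[0]);
	close(fds[1]);
	return err;
}
```

The patch instead made the flag mean "do less work", which is the
mismatch Tejun is pointing at.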


* Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-21  3:45 Zhongkun He
@ 2025-05-21 17:15 ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-05-21 17:15 UTC (permalink / raw)
  To: Zhongkun He; +Cc: hannes, longman, cgroups, linux-kernel, muchun.song

On Wed, May 21, 2025 at 11:45:27AM +0800, Zhongkun He wrote:
> Setting the cpuset.mems in cgroup v2 can trigger memory
> migrate in cpuset. This behavior is fine for newly created
> cgroups but it can cause issues for the existing cgroups.
> In our scenario, modifying the cpuset.mems setting during
> peak times frequently leads to noticeable service latency
> or stuttering.
> 
> It is important to have a consistent set of behavior for
> both cpus and memory. But it does cause issues at times,
> so we would like to have a flexible option.
> 
> This idea is from the non-blocking limit setting option in
> memory control.
> 
> https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/
> 
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
>  kernel/cgroup/cpuset-internal.h         | 6 ++++++
>  kernel/cgroup/cpuset.c                  | 7 +++++++
>  3 files changed, 20 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1a16ce68a4d7..d9e8e2a770af 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2408,6 +2408,13 @@ Cpuset Interface Files
>  	a need to change "cpuset.mems" with active tasks, it shouldn't
>  	be done frequently.
>  
> +	If cpuset.mems is opened with O_NONBLOCK then the migration is
> +	bypassed. This is useful for admin processes that need to adjust
> +	the cpuset.mems dynamically without blocking. However, there is
> +	a risk that previously allocated pages are not within the new
> +	cpuset.mems range, which may be altered by move_pages syscall or
> +	numa_balance.

As said in the other thread, nack on this approach.

Thanks.

-- 
tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-21 17:14     ` Tejun Heo
@ 2025-05-22  3:37       ` Zhongkun He
  2025-05-22 19:03         ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-05-22  3:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Waiman Long, hannes, cgroups, linux-kernel, muchun.song

On Thu, May 22, 2025 at 1:14 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, May 21, 2025 at 10:35:57AM +0800, Zhongkun He wrote:
> > On Tue, May 20, 2025 at 9:35 PM Waiman Long <llong@redhat.com> wrote:
> > >
> > > On 5/19/25 11:15 PM, Zhongkun He wrote:
> > > > Setting the cpuset.mems in cgroup v2 can trigger memory
> > > > migrate in cpuset. This behavior is fine for newly created
> > > > cgroups but it can cause issues for the existing cgroups.
> > > > In our scenario, modifying the cpuset.mems setting during
> > > > peak times frequently leads to noticeable service latency
> > > > or stuttering.
> > > >
> > > > It is important to have a consistent set of behavior for
> > > > both cpus and memory. But it does cause issues at times,
> > > > so we would hope to have a flexible option.
> > > >
> > > > This idea is from the non-blocking limit setting option in
> > > > memory control.
> > > >
> > > > https://lore.kernel.org/all/20250506232833.3109790-1-shakeel.butt@linux.dev/
> > > >
> > > > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > > > ---
> > > >   Documentation/admin-guide/cgroup-v2.rst |  7 +++++++
> > > >   kernel/cgroup/cpuset.c                  | 11 +++++++++++
> > > >   2 files changed, 18 insertions(+)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 1a16ce68a4d7..d9e8e2a770af 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -2408,6 +2408,13 @@ Cpuset Interface Files
> > > >       a need to change "cpuset.mems" with active tasks, it shouldn't
> > > >       be done frequently.
> > > >
> > > > +     If cpuset.mems is opened with O_NONBLOCK then the migration is
> > > > +     bypassed. This is useful for admin processes that need to adjust
> > > > +     the cpuset.mems dynamically without blocking. However, there is
> > > > +     a risk that previously allocated pages are not within the new
> > > > +     cpuset.mems range, which may be altered by move_pages syscall or
> > > > +     numa_balance.
>
> I don't think this is a good idea. O_NONBLOCK means "don't wait", not "skip
> this".

Yes, I agree.  However, we have been experiencing this issue for a long time,
so we hope to have an option to disable memory migration in v2.

Would it be possible to re-enable the memory.migrate interface and
disable memory migration by default in v2?

Alternatively, could we introduce an option in cpuset.mems to explicitly
indicate that memory migration should not occur?

Please feel free to share any suggestions you might have.

>
> Thanks.

>
> --
> tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-22  3:37       ` Zhongkun He
@ 2025-05-22 19:03         ` Tejun Heo
  2025-05-23 15:35           ` Zhongkun He
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-05-22 19:03 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Waiman Long, hannes, cgroups, linux-kernel, muchun.song

Hello,

On Thu, May 22, 2025 at 11:37:44AM +0800, Zhongkun He wrote:
> > I don't think this is a good idea. O_NONBLOCK means "don't wait", not "skip
> > this".
> 
> Yes, I agree.  However, we have been experiencing this issue for a long time,
> so we hope to have an option to disable memory migration in v2.
> 
> Would it be possible to re-enable the memory.migrate interface and
> disable memory migration by default in v2?
> 
> Alternatively, could we introduce an option in cpuset.mems to explicitly
> indicate that memory migration should not occur?
> 
> Please feel free to share any suggestions you might have.

Is this something you want on the whole machine? If so, would global cgroup
mount option work?

Thanks.

-- 
tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-22 19:03         ` Tejun Heo
@ 2025-05-23 15:35           ` Zhongkun He
  2025-05-23 16:51             ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-05-23 15:35 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Waiman Long, hannes, cgroups, linux-kernel, muchun.song

On Fri, May 23, 2025 at 3:03 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, May 22, 2025 at 11:37:44AM +0800, Zhongkun He wrote:
> > > I don't think this is a good idea. O_NONBLOCK means "don't wait", not "skip
> > > this".
> >
> > Yes, I agree.  However, we have been experiencing this issue for a long time,
> > so we hope to have an option to disable memory migration in v2.
> >
> > Would it be possible to re-enable the memory.migrate interface and
> > disable memory migration by default in v2?
> >
> > Alternatively, could we introduce an option in cpuset.mems to explicitly
> > indicate that memory migration should not occur?
> >
> > Please feel free to share any suggestions you might have.
>
> Is this something you want on the whole machine? If so, would global cgroup
> mount option work?

It doesn't apply to the whole machine. It is only relevant to pods
with huge pages, where the service will be unavailable for over ten
seconds if cpuset.mems is modified. Therefore, it would be ideal if
there were an option to disable the migration for this special case.

Thanks.

>
> Thanks.
>
> --
> tejun
>


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-23 15:35           ` Zhongkun He
@ 2025-05-23 16:51             ` Tejun Heo
  2025-05-24  1:10               ` Zhongkun He
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-05-23 16:51 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Waiman Long, hannes, cgroups, linux-kernel, muchun.song

Hello,

On Fri, May 23, 2025 at 11:35:57PM +0800, Zhongkun He wrote:
> > Is this something you want on the whole machine? If so, would global cgroup
> > mount option work?
> 
> It doesn't apply to the whole machine. It is only relevant to the pod with
> huge pages, where the service will be unavailable for over ten seconds if
> modify the cpuset.mems. Therefore, it would be ideal if there were an
> option to disable the migration for this special case.

I suppose we can add back an interface similar to cgroup1 but can you detail
the use case a bit? If you relocate threads without relocating memory, you'd
be paying on-going cost for memory access. It'd be great if you can
elaborate why such mode of operation is desirable.

Thanks.

-- 
tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-23 16:51             ` Tejun Heo
@ 2025-05-24  1:10               ` Zhongkun He
  2025-05-24  1:14                 ` Tejun Heo
  2025-06-17 12:40                 ` Michal Koutný
  0 siblings, 2 replies; 23+ messages in thread
From: Zhongkun He @ 2025-05-24  1:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Waiman Long, cgroups, linux-kernel, muchun.song

On Sat, May 24, 2025 at 12:51 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 23, 2025 at 11:35:57PM +0800, Zhongkun He wrote:
> > > Is this something you want on the whole machine? If so, would global cgroup
> > > mount option work?
> >
> > It doesn't apply to the whole machine. It is only relevant to the pod with
> > huge pages, where the service will be unavailable for over ten seconds if
> > modify the cpuset.mems. Therefore, it would be ideal if there were an
> > option to disable the migration for this special case.
>
> I suppose we can add back an interface similar to cgroup1 but can you detail
> the use case a bit? If you relocate threads without relocating memory, you'd

Thanks, that sounds great.

> be paying on-going cost for memory access. It'd be great if you can
> elaborate why such mode of operation is desirable.
>
> Thanks.

This is a story about optimizing CPU and memory bandwidth utilization.
In our production environment, the application exhibits distinct peak
and off-peak cycles, and the cpuset.mems interface is modified several
times a day.

During off-peak periods, tasks are evenly distributed across all NUMA nodes.
When peak periods arrive, we collectively migrate tasks to a designated node,
freeing up another node to accommodate new resource-intensive tasks.

We move the tasks by modifying cpuset.cpus and cpuset.mems; in v1,
memory migration was optional via the cpuset.memory_migrate interface.
After we relocate the threads, the memory is migrated slowly from
userspace with the move_pages() syscall, over a few minutes.

Presently, cpuset.mems triggers synchronous memory migration in v2,
leading to prolonged and unacceptable service downtime.

So we hope to add back an interface similar to cgroup v1 that makes
the migration optional.

Thanks.

>
> --
> tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-24  1:10               ` Zhongkun He
@ 2025-05-24  1:14                 ` Tejun Heo
  2025-05-24  2:09                   ` Zhongkun He
  2025-06-17 12:40                 ` Michal Koutný
  1 sibling, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-05-24  1:14 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Waiman Long, cgroups, linux-kernel, muchun.song

Hello,

On Sat, May 24, 2025 at 09:10:21AM +0800, Zhongkun He wrote:
...
> We move the task by modifying the cpuset.cpus and cpuset.mems and
> the memory migration is an option with cpuset.memory_migrate
> interface in V1. After we relocate the threads, the memory will be
> migrated by syscall move_pages in userspace slowly, within a few
> minutes.
> 
> Presently, cpuset.mems triggers synchronous memory migration,
> leading to prolonged and unacceptable service downtime in V2.
> 
> So we hope to add back an interface similar to cgroup v1, optional
> the migration.

Ah, I see, so it's not that you aren't migrating the memory but more that
the migration through cpuset.mems is too aggressive and causes disruption.
Is that the right understanding?

If so, would an interface to specify the rate of migration be a better
interface?

Thanks.

-- 
tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-24  1:14                 ` Tejun Heo
@ 2025-05-24  2:09                   ` Zhongkun He
  2025-05-27 19:04                     ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-05-24  2:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Waiman Long, cgroups, linux-kernel, muchun.song

On Sat, May 24, 2025 at 9:14 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Sat, May 24, 2025 at 09:10:21AM +0800, Zhongkun He wrote:
> ...
> > We move the task by modifying the cpuset.cpus and cpuset.mems and
> > the memory migration is an option with cpuset.memory_migrate
> > interface in V1. After we relocate the threads, the memory will be
> > migrated by syscall move_pages in userspace slowly, within a few
> > minutes.
> >
> > Presently, cpuset.mems triggers synchronous memory migration,
> > leading to prolonged and unacceptable service downtime in V2.
> >
> > So we hope to add back an interface similar to cgroup v1, optional
> > the migration.
>
> Ah, I see, so it's not that you aren't migrating the memory but more that
> the migration through cpuset.mems is too aggressive and causes disruption.
> Is that the right understanding?

Yes, exactly.

>
> If so, would an interface to specify the rate of migration be a better
> interface?
>

Per my understanding, a migration-rate interface is far more complex.
Moving the migration to userspace not only slows it down but also lets
us decide when to carry out the operation.

Perhaps we can give it a try if there is an elegant implementation
that spares userspace from doing the migration itself.
If that path doesn't work, it's okay for us to simply disable the migration.

> Thanks.
>
> --
> tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-24  2:09                   ` Zhongkun He
@ 2025-05-27 19:04                     ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-05-27 19:04 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Waiman Long, cgroups, linux-kernel, muchun.song, Johannes Weiner

Hello,

On Sat, May 24, 2025 at 10:09:36AM +0800, Zhongkun He wrote:
> On Sat, May 24, 2025 at 9:14 AM Tejun Heo <tj@kernel.org> wrote:
> Per my understanding,  the interface of migration rate is far more complex.
> To slow down the migration, moving it to the userspace can also help determine
> when to carry out this operation.

(cc'ing Johannes for mm)

The user interface can be pretty simple. It can just be an approximate
bandwidth of scan or migration, but yeah, I don't know whether this is going
to be too complex. Pacing migration from user side isn't trivial either,
tho. If this is something necessary, it'd be nice if kernel can provide
something relatively simple to use and can cover most usecases.

Johannes, what do you think?

Thanks.

-- 
tejun


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-05-24  1:10               ` Zhongkun He
  2025-05-24  1:14                 ` Tejun Heo
@ 2025-06-17 12:40                 ` Michal Koutný
  2025-06-18  2:46                   ` Zhongkun He
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Koutný @ 2025-06-17 12:40 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song


Hello.

On Sat, May 24, 2025 at 09:10:21AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> This is a story about optimizing CPU and memory bandwidth utilization.
> In our production environment, the application exhibits distinct peak
> and off-peak cycles and the cpuset.mems interface is modified
> several times within a day.
> 
> During off-peak periods, tasks are evenly distributed across all NUMA nodes.
> When peak periods arrive, we collectively migrate tasks to a designated node,
> freeing up another node to accommodate new resource-intensive tasks.
> 
> We move the task by modifying the cpuset.cpus and cpuset.mems and
> the memory migration is an option with cpuset.memory_migrate
> interface in V1. After we relocate the threads, the memory will be
> migrated by syscall move_pages in userspace slowly, within a few
> minutes.

Why do you need cpuset.mems at all?
IIUC, you could configure cpuset.mems to a union of possible nodes for
the pod and then you leave up the adjustments of affinity upon the
userspace.

Thanks,
Michal



* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-17 12:40                 ` Michal Koutný
@ 2025-06-18  2:46                   ` Zhongkun He
  2025-06-18  9:04                     ` Michal Koutný
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-06-18  2:46 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song

On Tue, Jun 17, 2025 at 8:40 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Sat, May 24, 2025 at 09:10:21AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> > This is a story about optimizing CPU and memory bandwidth utilization.
> > In our production environment, the application exhibits distinct peak
> > and off-peak cycles and the cpuset.mems interface is modified
> > several times within a day.
> >
> > During off-peak periods, tasks are evenly distributed across all NUMA nodes.
> > When peak periods arrive, we collectively migrate tasks to a designated node,
> > freeing up another node to accommodate new resource-intensive tasks.
> >
> > We move the task by modifying the cpuset.cpus and cpuset.mems and
> > the memory migration is an option with cpuset.memory_migrate
> > interface in V1. After we relocate the threads, the memory will be
> > migrated by syscall move_pages in userspace slowly, within a few
> > minutes.
>
> Why do you need cpuset.mems at all?
> IIUC, you could configure cpuset.mems to a union of possible nodes for
> the pod and then you leave up the adjustments of affinity upon the
> userspace.

It is unnecessary to adjust memory affinity periodically from userspace,
as it is a costly operation. Instead, we need to shrink cpuset.mems to
explicitly specify the NUMA nodes from which newly allocated pages
should come, and then migrate the existing pages once, slowly, from
userspace, or let NUMA balancing adjust them.

Thanks,
Zhongkun
>
> Thanks,
> Michal


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-18  2:46                   ` Zhongkun He
@ 2025-06-18  9:04                     ` Michal Koutný
  2025-06-19  3:49                       ` Zhongkun He
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Koutný @ 2025-06-18  9:04 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song


On Wed, Jun 18, 2025 at 10:46:02AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> It is unnecessary to adjust memory affinity periodically from userspace,
> as it is a costly operation.

It'd be always costly when there's lots of data to migrate.

> Instead, we need to shrink cpuset.mems to explicitly specify the NUMA
> node from which newly allocated pages should come, and migrate the
> pages once in userspace slowly  or adjusted by numa balance.

IIUC, the issue is that there's no set_mempolicy(2) for 3rd party
threads (it only operates on current) OR that the migration path should
be optimized to avoid those latencies -- do you know what is the
contention point?

Thanks,
Michal



* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-18  9:04                     ` Michal Koutný
@ 2025-06-19  3:49                       ` Zhongkun He
  2025-06-19 12:10                         ` Michal Koutný
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-06-19  3:49 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song

On Wed, Jun 18, 2025 at 5:05 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Wed, Jun 18, 2025 at 10:46:02AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> > It is unnecessary to adjust memory affinity periodically from userspace,
> > as it is a costly operation.
>
> It'd be always costly when there's lots of data to migrate.
>
> > Instead, we need to shrink cpuset.mems to explicitly specify the NUMA
> > node from which newly allocated pages should come, and migrate the
> > pages once in userspace slowly  or adjusted by numa balance.
>
> IIUC, the issue is that there's no set_mempolicy(2) for 3rd party
> threads (it only operates on current) OR that the migration path should
> be optimized to avoid those latencies -- do you know what is the
> contention point?

Hi Michal

In our scenario, when we shrink the allowed cpuset.mems, for example
from nodes 1-3 to just nodes 2-3, there may still be a large number of
pages residing on node 1. Currently, modifying cpuset.mems triggers
synchronous memory migration, which results in prolonged and
unacceptable service downtime under cgroup v2. This behavior has
become a major blocker for us in adopting cgroup v2.

Tejun suggested adding an interface to control the migration rate, and
I plan to try that later. However, we believe that the cpuset.migrate
interface in cgroup v1 is also sufficient for our use case and is
easier to work with.  :)

Thanks,
Zhongkun

>
> Thanks,
> Michal


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-19  3:49                       ` Zhongkun He
@ 2025-06-19 12:10                         ` Michal Koutný
  2025-06-24  8:11                           ` Zhongkun He
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Koutný @ 2025-06-19 12:10 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song


On Thu, Jun 19, 2025 at 11:49:58AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> In our scenario, when we shrink the allowed cpuset.mems —for example,
> from nodes 1, 2, 3 to just nodes 2,3—there may still be a large number of pages
> residing on node 1. Currently, modifying cpuset.mems triggers synchronous memory
> migration, which results in prolonged and unacceptable service downtime under
> cgroup v2. This behavior has become a major blocker for us in adopting
> cgroup v2.
> 
> Tejun suggested adding an interface to control the migration rate, and
> I plan to try that later.

It sounds unnecessarily non-work-conserving, and in principle adding
cond_resched()s (or eventually having a preemptible kernel) should
achieve the same. Or how would that project onto service metrics?
(But I'm not familiar with this migration path, thus I was asking about
the contention points.)

> However, we believe that the cpuset.migrate interface in cgroup v1 is
> also sufficient for our use case and is easier to work with.  :)

Too easy I think, it'd make cpuset.mems only "advisory" constraint. (I
know it could be justified too but perhaps not as a solution to costly
migrations.)

Michal



* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-19 12:10                         ` Michal Koutný
@ 2025-06-24  8:11                           ` Zhongkun He
  2025-07-01  8:16                             ` Michal Koutný
  0 siblings, 1 reply; 23+ messages in thread
From: Zhongkun He @ 2025-06-24  8:11 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song


On Thu, Jun 19, 2025 at 8:10 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Jun 19, 2025 at 11:49:58AM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> > In our scenario, when we shrink the allowed cpuset.mems —for example,
> > from nodes 1, 2, 3 to just nodes 2,3—there may still be a large number of pages
> > residing on node 1. Currently, modifying cpuset.mems triggers synchronous memory
> > migration, which results in prolonged and unacceptable service downtime under
> > cgroup v2. This behavior has become a major blocker for us in adopting
> > cgroup v2.
> >
> > Tejun suggested adding an interface to control the migration rate, and
> > I plan to try that later.
>
> It sounds unnecessarily not work-conserving and in principle adding
> cond_resched()s (or eventually having a preemptible kernel) should
> achieve the same. Or how would that project onto service metrics?
> (But I'm not familiar with this migration path, thus I was asking about
> the contention points.)

The cond_resched() is already there; please have a look at
migrate_pages_batch().

The issue (the contention point) lies in the fact that, during page
migration, the PTE is replaced with a migration entry. If a task
attempts to access such a page, it will block in migration_entry_wait()
until the migration completes. When a large number of hot pages are
involved, this can cause significant service disruption due to
prolonged blocking.

Thanks
Zhongkun

>
> > However, we believe that the cpuset.migrate interface in cgroup v1 is
> > also sufficient for our use case and is easier to work with.  :)
>
> Too easy I think, it'd make cpuset.mems only "advisory" constraint. (I
> know it could be justified too but perhaps not as a solution to costly
> migrations.)
>
> Michal


* Re: [External] Re: [PATCH] cpuset: introduce non-blocking cpuset.mems setting option
  2025-06-24  8:11                           ` Zhongkun He
@ 2025-07-01  8:16                             ` Michal Koutný
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Koutný @ 2025-07-01  8:16 UTC (permalink / raw)
  To: Zhongkun He; +Cc: Tejun Heo, Waiman Long, cgroups, linux-kernel, muchun.song


On Tue, Jun 24, 2025 at 04:11:01PM +0800, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
> The cond_resched() is already there, please have a look in
> migrate_pages_batch().

Thanks, this is enlightening.

> The issue(contention ) lies in the fact that, during page migration, the PTE
> is replaced with a migration_entry(). If a task attempts to access such a page,
> it will be blocked in migration_entry_wait() until the migration completes.
> When a large number of hot pages are involved, this can cause significant
> service disruption due to prolonged blocking.

migration_entry_wait() waits only for a single page (folio?) to be
migrated. How can the number of pages affect the disruption? Or do you
mean that these individual waits add up and the service is generally
slowed down by that? If the migration were spread out over a longer
time, the cumulative slowdown would be the same.

Thanks,
Michal



end of thread, other threads:[~2025-07-01  8:16 UTC | newest]

Thread overview: 23+ messages
2025-05-20  3:15 [PATCH] cpuset: introduce non-blocking cpuset.mems setting option Zhongkun He
2025-05-20 13:13 ` kernel test robot
2025-05-20 13:25 ` kernel test robot
2025-05-20 13:34 ` Waiman Long
2025-05-21  2:35   ` [External] " Zhongkun He
2025-05-21 17:14     ` Tejun Heo
2025-05-22  3:37       ` Zhongkun He
2025-05-22 19:03         ` Tejun Heo
2025-05-23 15:35           ` Zhongkun He
2025-05-23 16:51             ` Tejun Heo
2025-05-24  1:10               ` Zhongkun He
2025-05-24  1:14                 ` Tejun Heo
2025-05-24  2:09                   ` Zhongkun He
2025-05-27 19:04                     ` Tejun Heo
2025-06-17 12:40                 ` Michal Koutný
2025-06-18  2:46                   ` Zhongkun He
2025-06-18  9:04                     ` Michal Koutný
2025-06-19  3:49                       ` Zhongkun He
2025-06-19 12:10                         ` Michal Koutný
2025-06-24  8:11                           ` Zhongkun He
2025-07-01  8:16                             ` Michal Koutný
  -- strict thread matches above, loose matches on Subject: below --
2025-05-21  3:45 Zhongkun He
2025-05-21 17:15 ` Tejun Heo
