linux-rt-devel.lists.linux.dev archive mirror
* [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
@ 2025-11-12  3:57 Waiman Long
  2025-11-12  8:51 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2025-11-12  3:57 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt
  Cc: linux-kernel, cgroups, linux-rt-devel, Chen Ridong, Pingfan Liu,
	Juri Lelli, Waiman Long

The callback_lock is a spinlock_t which is acquired either to read
a stable set of cpu or node masks or to modify those masks when
cpuset_mutex is also acquired. Sometimes it may need to go up the
cgroup hierarchy while holding the lock to find the right set of masks
to use. Assuming that the depth of the cgroup hierarchy is finite and
typically small, the lock hold time should be limited.
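
To make this locking pattern concrete, a condensed sketch of the reader
side is shown below. The helpers (task_cs(), parent_cs()) and the
effective_mems field follow the existing cpuset code, but the function
name and body are simplified for illustration and are not the exact
upstream implementation:

/*
 * Simplified reader: snapshot a task's usable memory nodes.  The walk up
 * the hierarchy happens entirely under callback_lock, so the masks cannot
 * change underneath us; writers take cpuset_mutex *and* callback_lock.
 */
static nodemask_t cpuset_mems_snapshot(struct task_struct *tsk)
{
        struct cpuset *cs;
        nodemask_t mask;
        unsigned long flags;

        spin_lock_irqsave(&callback_lock, flags);
        cs = task_cs(tsk);
        /* Walk toward the root until a usable mask is found. */
        while (!nodes_intersects(cs->effective_mems, node_states[N_MEMORY]))
                cs = parent_cs(cs);
        nodes_and(mask, cs->effective_mems, node_states[N_MEMORY]);
        spin_unlock_irqrestore(&callback_lock, flags);

        return mask;
}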

Some externally callable cpuset APIs like cpuset_cpus_allowed() and
cpuset_mems_allowed() acquire callback_lock with irq disabled to ensure
stable cpuset data. These APIs currently have the restriction that they
can't be called while a raw spinlock is held; this restriction is needed
for correct operation in a PREEMPT_RT kernel, but working around it
requires additional code changes. See [1] for a discussion of that.

Make these external cpuset APIs more useful by changing callback_lock
to a raw_spinlock_t to remove this limitation so that they can be called
from within other raw spinlock critical sections if needed.

[1] https://lore.kernel.org/lkml/20251110014706.8118-1-piliu@redhat.com/
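
For reference, the PREEMPT_RT rule behind this restriction is that a
spinlock_t becomes a sleeping (rt_mutex based) lock, so it must not be
acquired inside a raw_spinlock_t critical section. A schematic
illustration follows; the lock and function names are made up and this
is not code from this patch:

static DEFINE_RAW_SPINLOCK(example_raw_lock);
static DEFINE_SPINLOCK(example_lock);           /* spinlock_t: sleeps on RT */

static void rt_nesting_example(void)
{
        raw_spin_lock(&example_raw_lock);       /* preemption truly disabled */
        /*
         * Invalid on PREEMPT_RT: acquiring a sleeping spinlock_t inside a
         * raw spinlock section triggers "sleeping function called from
         * invalid context".  With callback_lock turned into a
         * raw_spinlock_t, callers already holding a raw lock no longer
         * run into this case.
         */
        spin_lock(&example_lock);
        spin_unlock(&example_lock);
        raw_spin_unlock(&example_raw_lock);
}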

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 94 +++++++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 47 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 976bce6e5673..7b3a0345ee76 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -289,16 +289,16 @@ void cpuset_full_unlock(void)
 	cpus_read_unlock();
 }
 
-static DEFINE_SPINLOCK(callback_lock);
+static DEFINE_RAW_SPINLOCK(callback_lock);
 
 void cpuset_callback_lock_irq(void)
 {
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 }
 
 void cpuset_callback_unlock_irq(void)
 {
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 }
 
 static struct workqueue_struct *cpuset_migrate_mm_wq;
@@ -1616,11 +1616,11 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
 	    prstate_housekeeping_conflict(new_prs, tmp->new_cpus))
 		return PERR_HKEEPING;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
 	cs->remote_partition = true;
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
 	cpuset_force_rebuild();
 	cs->prs_err = 0;
@@ -1647,7 +1647,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(cs->effective_xcpus, subpartitions_cpus));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->remote_partition = false;
 	partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
 	if (cs->prs_err)
@@ -1658,7 +1658,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 	/* effective_xcpus may need to be changed */
 	compute_excpus(cs, cs->effective_xcpus);
 	reset_partition_data(cs);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
 	cpuset_force_rebuild();
 
@@ -1717,7 +1717,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 			goto invalidate;
 	}
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (adding)
 		partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
@@ -1729,7 +1729,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
 	cpumask_copy(cs->effective_xcpus, excpus);
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
 	if (adding || deleting)
 		cpuset_force_rebuild();
@@ -2060,7 +2060,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	 * Newly added CPUs will be removed from effective_cpus and
 	 * newly deleted ones will be added back to effective_cpus.
 	 */
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (old_prs != new_prs)
 		cs->partition_root_state = new_prs;
 
@@ -2073,7 +2073,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 	if (deleting)
 		partition_xcpus_add(new_prs, parent, tmp->delmask);
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
 
 	if ((old_prs != new_prs) && (cmd == partcmd_update))
@@ -2154,9 +2154,9 @@ static void compute_partition_effective_cpumask(struct cpuset *cs,
 			/*
 			 * Invalidate child partition
 			 */
-			spin_lock_irq(&callback_lock);
+			raw_spin_lock_irq(&callback_lock);
 			make_partition_invalid(child);
-			spin_unlock_irq(&callback_lock);
+			raw_spin_unlock_irq(&callback_lock);
 			notify_partition_change(child, old_prs);
 			continue;
 		}
@@ -2301,7 +2301,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 			new_prs = cp->partition_root_state;
 		}
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
 		cp->partition_root_state = new_prs;
 		if (!cpumask_empty(cp->exclusive_cpus) && (cp != cs))
@@ -2316,7 +2316,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				    cp->cpus_allowed, parent->effective_xcpus);
 		else if (new_prs < 0)
 			reset_partition_data(cp);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		notify_partition_change(cp, old_prs);
 
@@ -2566,12 +2566,12 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 
 	partition_cpus_change(cs, trialcs, &tmp);
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	cpumask_copy(cs->effective_xcpus, trialcs->effective_xcpus);
 	if ((old_prs > 0) && !is_partition_valid(cs))
 		reset_partition_data(cs);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* effective_cpus/effective_xcpus will be updated here */
 	update_cpumasks_hier(cs, &tmp, force);
@@ -2631,12 +2631,12 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	trialcs->prs_err = PERR_NONE;
 	partition_cpus_change(cs, trialcs, &tmp);
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->exclusive_cpus, trialcs->exclusive_cpus);
 	cpumask_copy(cs->effective_xcpus, trialcs->effective_xcpus);
 	if ((old_prs > 0) && !is_partition_valid(cs))
 		reset_partition_data(cs);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/*
 	 * Call update_cpumasks_hier() to update effective_cpus/effective_xcpus
@@ -2851,9 +2851,9 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 			continue;
 		rcu_read_unlock();
 
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -2906,9 +2906,9 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 
 	check_insane_mems_config(&trialcs->mems_allowed);
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
@@ -2962,9 +2962,9 @@ int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed) {
 		if (cpuset_v2())
@@ -3082,14 +3082,14 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 		update_partition_exclusive_flag(cs, new_prs);
 	}
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->partition_root_state = new_prs;
 	WRITE_ONCE(cs->prs_err, err);
 	if (!is_partition_valid(cs))
 		reset_partition_data(cs);
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	update_isolation_cpumasks();
 
 	/* Force update if switching back to member & update effective_xcpus */
@@ -3403,7 +3403,7 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	cpuset_filetype_t type = seq_cft(sf)->private;
 	int ret = 0;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	switch (type) {
 	case FILE_CPULIST:
@@ -3434,7 +3434,7 @@ int cpuset_common_seq_show(struct seq_file *sf, void *v)
 		ret = -EINVAL;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	return ret;
 }
 
@@ -3627,12 +3627,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 
 	cpuset_inc();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	if (is_in_v2_mode()) {
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
 		cs->effective_mems = parent->effective_mems;
 	}
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -3659,12 +3659,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	}
 	rcu_read_unlock();
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cs->mems_allowed = parent->mems_allowed;
 	cs->effective_mems = parent->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent->cpus_allowed);
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 out_unlock:
 	cpuset_full_unlock();
 	return 0;
@@ -3715,7 +3715,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
 	mutex_lock(&cpuset_mutex);
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
 		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
@@ -3727,7 +3727,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 		top_cpuset.mems_allowed = top_cpuset.effective_mems;
 	}
 
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3890,10 +3890,10 @@ hotplug_update_tasks(struct cpuset *cs,
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
 
-	spin_lock_irq(&callback_lock);
+	raw_spin_lock_irq(&callback_lock);
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
-	spin_unlock_irq(&callback_lock);
+	raw_spin_unlock_irq(&callback_lock);
 
 	if (cpus_updated)
 		cpuset_update_tasks_cpumask(cs, new_cpus);
@@ -4055,7 +4055,7 @@ static void cpuset_handle_hotplug(void)
 	/* For v1, synchronize cpus_allowed to cpu_active_mask */
 	if (cpus_updated) {
 		cpuset_force_rebuild();
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		/*
@@ -4073,17 +4073,17 @@ static void cpuset_handle_hotplug(void)
 			}
 		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		spin_lock_irq(&callback_lock);
+		raw_spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
-		spin_unlock_irq(&callback_lock);
+		raw_spin_unlock_irq(&callback_lock);
 		cpuset_update_tasks_nodemask(&top_cpuset);
 	}
 
@@ -4176,7 +4176,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	unsigned long flags;
 	struct cpuset *cs;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	cs = task_cs(tsk);
 	if (cs != &top_cpuset)
@@ -4198,7 +4198,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 			cpumask_copy(pmask, possible_mask);
 	}
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 }
 
 /**
@@ -4269,9 +4269,9 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 	nodemask_t mask;
 	unsigned long flags;
 
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 	guarantee_online_mems(task_cs(tsk), &mask);
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 
 	return mask;
 }
@@ -4363,12 +4363,12 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
-	spin_lock_irqsave(&callback_lock, flags);
+	raw_spin_lock_irqsave(&callback_lock, flags);
 
 	cs = nearest_hardwall_ancestor(task_cs(current));
 	allowed = node_isset(node, cs->mems_allowed);
 
-	spin_unlock_irqrestore(&callback_lock, flags);
+	raw_spin_unlock_irqrestore(&callback_lock, flags);
 	return allowed;
 }
 
-- 
2.51.1



* Re: [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
  2025-11-12  3:57 [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t Waiman Long
@ 2025-11-12  8:51 ` Sebastian Andrzej Siewior
  2025-11-12 18:21   ` Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-12  8:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, linux-kernel, cgroups, linux-rt-devel,
	Chen Ridong, Pingfan Liu, Juri Lelli

On 2025-11-11 22:57:59 [-0500], Waiman Long wrote:
> The callback_lock is a spinlock_t which is acquired either to read
> a stable set of cpu or node masks or to modify those masks when
> cpuset_mutex is also acquired. Sometimes it may need to go up the
> cgroup hierarchy while holding the lock to find the right set of masks
> to use. Assuming that the depth of the cgroup hierarchy is finite and
> typically small, the lock hold time should be limited.

We can't assume that, can we?

> Some externally callable cpuset APIs like cpuset_cpus_allowed() and

cpuset_cpus_allowed() has three callers in kernel/sched/ and all use
GFP_KERNEL shortly before invoking the function in question.

> cpuset_mems_allowed() acquires callback_lock with irq disabled to ensure
This I did not find. But I would ask to rework it somehow so that we don't
need to use raw_spinlock_t as a hammer that solves it all.

> stable cpuset data. These APIs currently have the restriction that they
> can't be called while a raw spinlock is held; this restriction is needed
> for correct operation in a PREEMPT_RT kernel, but working around it
> requires additional code changes. See [1] for a discussion of that.
> 
> Make these external cpuset APIs more useful by changing callback_lock
> to a raw_spinlock_t to remove this limitation so that they can be called
> from within other raw spinlock critical sections if needed.
> 
> [1] https://lore.kernel.org/lkml/20251110014706.8118-1-piliu@redhat.com/
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Sebastian


* Re: [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
  2025-11-12  8:51 ` Sebastian Andrzej Siewior
@ 2025-11-12 18:21   ` Waiman Long
  2025-11-13  7:53     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2025-11-12 18:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, linux-kernel, cgroups, linux-rt-devel,
	Chen Ridong, Pingfan Liu, Juri Lelli

On 11/12/25 3:51 AM, Sebastian Andrzej Siewior wrote:
> On 2025-11-11 22:57:59 [-0500], Waiman Long wrote:
>> The callback_lock is a spinlock_t which is acquired either to read
>> a stable set of cpu or node masks or to modify those masks when
>> cpuset_mutex is also acquired. Sometimes it may need to go up the
>> cgroup hierarchy while holding the lock to find the right set of masks
>> to use. Assuming that the depth of the cgroup hierarchy is finite and
>> typically small, the lock hold time should be limited.
> We can't assume that, can we?
We can theoretically create a cgroup hierarchy with many levels, but no 
sane users will actually do that. If this is a concern to you, I can 
certainly drop this patch.
>> Some externally callable cpuset APIs like cpuset_cpus_allowed() and
> cpuset_cpus_allowed() has three callers in kernel/sched/ and all use
> GFP_KERNEL shortly before invoking the function in question.
The current callers of these APIs are fine. What I am talking about is
new callers that may want to call them when holding a raw_spinlock_t.
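
As an illustration only, such a hypothetical caller could look like the
sketch below once callback_lock is a raw_spinlock_t (the lock and
function names are invented for the example and are not from any posted
patch):

static DEFINE_RAW_SPINLOCK(example_caller_lock);

static void example_pick_cpus(struct task_struct *p, struct cpumask *mask)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&example_caller_lock, flags);
        /*
         * Legal only if callback_lock is a raw_spinlock_t; with a
         * spinlock_t callback_lock this would nest a sleeping lock
         * inside a raw spinlock section on PREEMPT_RT.
         */
        cpuset_cpus_allowed(p, mask);
        raw_spin_unlock_irqrestore(&example_caller_lock, flags);
}
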
>
>> cpuset_mems_allowed() acquires callback_lock with irq disabled to ensure
> This I did not find. But I would ask to rework it somehow so that we don't
> need to use raw_spinlock_t as a hammer that solves it all.

OK.

Cheers,
Longman

>> stable cpuset data. These APIs currently have the restriction that they
>> can't be called while a raw spinlock is held; this restriction is needed
>> for correct operation in a PREEMPT_RT kernel, but working around it
>> requires additional code changes. See [1] for a discussion of that.
>>
>> Make these external cpuset APIs more useful by changing callback_lock
>> to a raw_spinlock_t to remove this limitation so that they can be called
>> from within other raw spinlock critical sections if needed.
>>
>> [1] https://lore.kernel.org/lkml/20251110014706.8118-1-piliu@redhat.com/
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Sebastian
>



* Re: [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
  2025-11-12 18:21   ` Waiman Long
@ 2025-11-13  7:53     ` Sebastian Andrzej Siewior
  2025-11-13 16:25       ` Tejun Heo
  2025-11-13 16:35       ` Waiman Long
  0 siblings, 2 replies; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-13  7:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, linux-kernel, cgroups, linux-rt-devel,
	Chen Ridong, Pingfan Liu, Juri Lelli

On 2025-11-12 13:21:12 [-0500], Waiman Long wrote:
> On 11/12/25 3:51 AM, Sebastian Andrzej Siewior wrote:
> > On 2025-11-11 22:57:59 [-0500], Waiman Long wrote:
> > > The callback_lock is a spinlock_t which is acquired either to read
> > > a stable set of cpu or node masks or to modify those masks when
> > > cpuset_mutex is also acquired. Sometimes it may need to go up the
> > > cgroup hierarchy while holding the lock to find the right set of masks
> > > to use. Assuming that the depth of the cgroup hierarchy is finite and
> > > typically small, the lock hold time should be limited.
> > We can't assume that, can we?
> We can theoretically create a cgroup hierarchy with many levels, but no sane
> users will actually do that. If this is a concern to you, I can certainly
> drop this patch.

Someone will think this is sane and will wonder. We usually don't impose
limits but make sure things are preemptible so it does not matter.

> > > Some externally callable cpuset APIs like cpuset_cpus_allowed() and
> > cpuset_cpus_allowed() has three callers in kernel/sched/ and all use
> > GFP_KERNEL shortly before invoking the function in question.
> The current callers of these APIs are fine. What I am talking about is new
> callers that may want to call them when holding a raw_spinlock_t.

No, please don't proactively make changes like this that are not fixes
for something that was or is broken.

> > > cpuset_mems_allowed() acquires callback_lock with irq disabled to ensure
> > This I did not find. But I would ask to rework it somehow so that we don't
> > need to use raw_spinlock_t as a hammer that solves it all.
> 
> OK.
> 
> Cheers,
> Longman

Sebastian


* Re: [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
  2025-11-13  7:53     ` Sebastian Andrzej Siewior
@ 2025-11-13 16:25       ` Tejun Heo
  2025-11-13 16:35       ` Waiman Long
  1 sibling, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2025-11-13 16:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Waiman Long, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, linux-kernel, cgroups, linux-rt-devel,
	Chen Ridong, Pingfan Liu, Juri Lelli

Hello,

On Thu, Nov 13, 2025 at 08:53:56AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-12 13:21:12 [-0500], Waiman Long wrote:
> > On 11/12/25 3:51 AM, Sebastian Andrzej Siewior wrote:
> > > On 2025-11-11 22:57:59 [-0500], Waiman Long wrote:
> > > > The callback_lock is a spinlock_t which is acquired either to read
> > > > a stable set of cpu or node masks or to modify those masks when
> > > > cpuset_mutex is also acquired. Sometimes it may need to go up the
> > > > cgroup hierarchy while holding the lock to find the right set of masks
> > > > to use. Assuming that the depth of the cgroup hierarchy is finite and
> > > > typically small, the lock hold time should be limited.
> > > We can't assume that, can we?
> > We can theoretically create a cgroup hierarchy with many levels, but no sane
> > users will actually do that. If this is a concern to you, I can certainly
> > drop this patch.
> 
> Someone will think this is sane and will wonder. We usually don't impose
> limits but make sure things are preemptible so it does not matter.

It's always better to be scalable, but note that there are cases where the
overhead of nesting can't be hidden completely without significant
sacrifices in other areas, and we don't want to over-index on depth
scalability at the cost of practical capabilities. This is also why cgroup
depth is a limited resource controlled by the cgroup.max.depth knob.

If something works well with, say, 16 levels of nesting, it's already mostly
acceptable.

Thanks.

-- 
tejun


* Re: [cgroup/for-6.19 PATCH] cgroup/cpuset: Make callback_lock a raw_spinlock_t
  2025-11-13  7:53     ` Sebastian Andrzej Siewior
  2025-11-13 16:25       ` Tejun Heo
@ 2025-11-13 16:35       ` Waiman Long
  1 sibling, 0 replies; 6+ messages in thread
From: Waiman Long @ 2025-11-13 16:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Waiman Long
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, linux-kernel, cgroups, linux-rt-devel,
	Chen Ridong, Pingfan Liu, Juri Lelli

On 11/13/25 2:53 AM, Sebastian Andrzej Siewior wrote:
> On 2025-11-12 13:21:12 [-0500], Waiman Long wrote:
>> On 11/12/25 3:51 AM, Sebastian Andrzej Siewior wrote:
>>> On 2025-11-11 22:57:59 [-0500], Waiman Long wrote:
>>>> The callback_lock is a spinlock_t which is acquired either to read
>>>> a stable set of cpu or node masks or to modify those masks when
>>>> cpuset_mutex is also acquired. Sometimes it may need to go up the
>>>> cgroup hierarchy while holding the lock to find the right set of masks
>>>> to use. Assuming that the depth of the cgroup hierarchy is finite and
>>>> typically small, the lock hold time should be limited.
>>> We can't assume that, can we?
>> We can theoretically create a cgroup hierarchy with many levels, but no sane
>> users will actually do that. If this is a concern to you, I can certainly
>> drop this patch.
> Someone will think this is sane and will wonder. We usually don't impose
> limits but make sure things are preemptible so it does not matter.
>
>>>> Some externally callable cpuset APIs like cpuset_cpus_allowed() and
>>> cpuset_cpus_allowed() has three callers in kernel/sched/ and all use
>>> GFP_KERNEL shortly before invoking the function in question.
>> The current callers of these APIs are fine. What I am talking about is new
>> callers that may want to call them when holding a raw_spinlock_t.
> No, please don't proactively make changes like this that are not fixes
> for something that was or is broken.

I posted this patch as a response to my review of Pingfan Liu's deadline 
scheduler patch, as it needs to call cpuset_cpus_allowed() to get a proper
cpumask. However, there is an alternative way to solve this problem, so 
I am dropping this patch now. In the future, if there is a better use 
case that needs this change, I will push this patch again.

Thanks,
Longman


