* [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity
@ 2025-10-13 20:31 Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
` (32 more replies)
0 siblings, 33 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, David S . Miller, Danilo Krummrich,
Johannes Weiner, Catalin Marinas, Rafael J . Wysocki, Ingo Molnar,
Jens Axboe, linux-block, cgroups, Michal Koutny, Shakeel Butt,
Simon Horman, Waiman Long, Phil Auld, linux-pci, Muchun Song,
Peter Zijlstra, Eric Dumazet, Thomas Gleixner, Vlastimil Babka,
Greg Kroah-Hartman, Marco Crivellari, Will Deacon, Roman Gushchin,
Michal Hocko, Lai Jiangshan, linux-mm, Gabriele Monaco,
Andrew Morton, Tejun Heo, Bjorn Helgaas, Paolo Abeni, netdev,
linux-arm-kernel, Jakub Kicinski
Hi,
The kthread code was recently enhanced to provide an infrastructure that
manages the preferred affinity of unbound kthreads (node or custom
cpumask) against housekeeping constraints and CPU hotplug events.
One crucial missing piece is cpuset: when an isolated partition is
created, deleted, or has its CPUs updated, all the unbound kthreads in
the top cpuset are affined to _all_ the non-isolated CPUs, possibly
breaking their preferred affinity along the way.
Solve this by moving the kthreads affinity update out of cpuset and into
the consolidated relevant kthread code, so that preferred affinities are
honoured.
The dispatch of the new cpumasks to workqueues and kthreads is performed
by housekeeping, as per Tejun's nice suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from isolcpus= and cpuset isolated partitions. Housekeeping cpumasks are
now modifiable with specific synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future.
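For illustration, here is a rough sketch of the update flow this series
converges to (simplified; housekeeping_update() is introduced in patch
13, the flushes and the kthread re-affining in later patches):

	/*
	 * Called from cpuset when an isolated partition is created,
	 * deleted or has its CPUs updated:
	 */
	ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);

	/* which, under the hood, roughly amounts to: */
	rcu_assign_pointer(housekeeping.cpumasks[HK_TYPE_DOMAIN], new_mask);
	synchronize_rcu();	/* wait for readers of the old mask */
	/* flush works queued by old readers (memcg, vmstat, PCI probe) */
	/* re-affine unbound kthreads, honouring preferred affinities */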
Changes since v2:
* Keep static key (peterz)
* Handle PCI work flush
* Comment why RCU is held until PCI work is queued (Waiman)
* Add new tags
* Add CONFIG_LOCKDEP ifdeffery (Waiman)
* Rename workqueue_unbound_exclude_cpumask() to workqueue_unbound_housekeeping_update()
and invert the parameter (Waiman)
* Fix a few changelogs that used to mention that HK_TYPE_KERNEL_NOISE
must depend on HK_TYPE_DOMAIN. It's strongly advised but not mandatory (Waiman)
* Cherry-pick latest version of "cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping"
(Waiman and Gabriele)
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
kthread/core-v3
HEAD: 4ba707cdced479592e9f461e1944b7fa6f75910f
Thanks,
Frederic
---
Frederic Weisbecker (32):
PCI: Prepare to protect against concurrent isolated cpuset change
cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug"
memcg: Prepare to protect against concurrent isolated cpuset change
mm: vmstat: Prepare to protect against concurrent isolated cpuset change
sched/isolation: Save boot defined domain flags
cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
driver core: cpu: Convert /sys/devices/system/cpu/isolated to use HK_TYPE_DOMAIN_BOOT
net: Keep ignoring isolated cpuset change
block: Protect against concurrent isolated cpuset change
cpu: Provide lockdep check for CPU hotplug lock write-held
cpuset: Provide lockdep check for cpuset lock held
sched/isolation: Convert housekeeping cpumasks to rcu pointers
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
PCI: Flush PCI probe workqueue on cpuset isolated partition change
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
cpuset: Remove cpuset_cpu_is_isolated()
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
PCI: Remove superfluous HK_TYPE_WQ check
kthread: Refine naming of affinity related fields
kthread: Include unbound kthreads in the managed affinity list
kthread: Include kthreadd to the managed affinity list
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
kthread: Honour kthreads preferred affinity after cpuset changes
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Add API to update preferred affinity on kthread runtime
kthread: Document kthread_affine_preferred()
genirq: Correctly handle preferred kthreads affinity
doc: Add housekeeping documentation
Gabriele Monaco (1):
cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
Documentation/cpu_isolation/housekeeping.rst | 111 +++++++++++++++
arch/arm64/kernel/cpufeature.c | 18 ++-
block/blk-mq.c | 6 +-
drivers/base/cpu.c | 2 +-
drivers/pci/pci-driver.c | 71 +++++++---
include/linux/cpu.h | 4 +
include/linux/cpuhplock.h | 1 +
include/linux/cpuset.h | 8 +-
include/linux/kthread.h | 2 +
include/linux/memcontrol.h | 4 +
include/linux/mmu_context.h | 2 +-
include/linux/pci.h | 3 +
include/linux/percpu-rwsem.h | 1 +
include/linux/sched/isolation.h | 7 +-
include/linux/vmstat.h | 2 +
include/linux/workqueue.h | 2 +-
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 134 +++++++++++++-----
kernel/cpu.c | 42 +++---
kernel/irq/manage.c | 47 ++++---
kernel/kthread.c | 195 +++++++++++++++++++--------
kernel/sched/isolation.c | 137 +++++++++++++++----
kernel/sched/sched.h | 4 +
kernel/workqueue.c | 17 ++-
mm/memcontrol.c | 25 +++-
mm/vmstat.c | 15 ++-
net/core/net-sysfs.c | 2 +-
27 files changed, 647 insertions(+), 216 deletions(-)
* [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-14 20:53 ` Bjorn Helgaas
2025-10-13 20:31 ` [PATCH 02/33] cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug" Frederic Weisbecker
` (31 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and
therefore be made modifiable at runtime. Synchronize against the cpumask
update using RCU.
The RCU read side section includes both the housekeeping CPU target
election for the PCI probe work and the work enqueue.
This way the housekeeping update side will simply need to flush the
related pending works after updating the housekeeping mask, in order to
make sure that no PCI work ever executes on an isolated CPU. This part
will be handled in a subsequent patch.
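To picture why the election and the enqueue must share a single RCU read
side section, here is a simplified timeline (the flush on the update
side is added later in this series):

	/*
	 *  Reader (pci_call_probe())         Updater (housekeeping)
	 *  -------------------------         ----------------------
	 *  rcu_read_lock()
	 *  cpu = pick from HK_TYPE_DOMAIN    publish new HK_TYPE_DOMAIN mask
	 *  schedule_work_on(cpu, &arg.work)  synchronize_rcu()  // waits
	 *  rcu_read_unlock()
	 *                                    flush pending probe works
	 *
	 * Any reader starting after synchronize_rcu() returns is
	 * guaranteed to see the new mask, so the flush only has to catch
	 * works queued by old readers.
	 */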
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/pci/pci-driver.c | 47 ++++++++++++++++++++++++++++++++--------
1 file changed, 38 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 302d61783f6c..7b74d22b20f7 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -302,9 +302,8 @@ struct drv_dev_and_id {
const struct pci_device_id *id;
};
-static long local_pci_probe(void *_ddi)
+static int local_pci_probe(struct drv_dev_and_id *ddi)
{
- struct drv_dev_and_id *ddi = _ddi;
struct pci_dev *pci_dev = ddi->dev;
struct pci_driver *pci_drv = ddi->drv;
struct device *dev = &pci_dev->dev;
@@ -338,6 +337,19 @@ static long local_pci_probe(void *_ddi)
return 0;
}
+struct pci_probe_arg {
+ struct drv_dev_and_id *ddi;
+ struct work_struct work;
+ int ret;
+};
+
+static void local_pci_probe_callback(struct work_struct *work)
+{
+ struct pci_probe_arg *arg = container_of(work, struct pci_probe_arg, work);
+
+ arg->ret = local_pci_probe(arg->ddi);
+}
+
static bool pci_physfn_is_probed(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_IOV
@@ -362,34 +374,51 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
dev->is_probed = 1;
cpu_hotplug_disable();
-
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
*/
if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
pci_physfn_is_probed(dev)) {
- cpu = nr_cpu_ids;
+ error = local_pci_probe(&ddi);
} else {
cpumask_var_t wq_domain_mask;
+ struct pci_probe_arg arg = { .ddi = &ddi };
+
+ INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
error = -ENOMEM;
goto out;
}
+
+ /*
+ * The target election and the enqueue of the work must be within
+ * the same RCU read side section so that when the workqueue pool
+ * is flushed after a housekeeping cpumask update, further readers
+ * are guaranteed to queue the probing work to the appropriate
+ * targets.
+ */
+ rcu_read_lock();
cpumask_and(wq_domain_mask,
housekeeping_cpumask(HK_TYPE_WQ),
housekeeping_cpumask(HK_TYPE_DOMAIN));
cpu = cpumask_any_and(cpumask_of_node(node),
wq_domain_mask);
+ if (cpu < nr_cpu_ids) {
+ schedule_work_on(cpu, &arg.work);
+ rcu_read_unlock();
+ flush_work(&arg.work);
+ error = arg.ret;
+ } else {
+ rcu_read_unlock();
+ error = local_pci_probe(&ddi);
+ }
+
free_cpumask_var(wq_domain_mask);
+ destroy_work_on_stack(&arg.work);
}
-
- if (cpu < nr_cpu_ids)
- error = work_on_cpu(cpu, local_pci_probe, &ddi);
- else
- error = local_pci_probe(&ddi);
out:
dev->is_probed = 0;
cpu_hotplug_enable();
--
2.51.0
* [PATCH 02/33] cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug"
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 03/33] memcg: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
` (30 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
1) The commit:
2b8272ff4a70 ("cpu/hotplug: Prevent self deadlock on CPU hot-unplug")
was added to fix an issue where the hotplug control task (BP) was
throttled between CPUHP_AP_IDLE_DEAD and CPUHP_HRTIMERS_PREPARE waiting
in the hrtimer blindspot for the bandwidth callback queued in the dead
CPU.
2) Later on, the commit:
38685e2a0476 ("cpu/hotplug: Don't offline the last non-isolated CPU")
piggybacked on the target selection for the workqueue-offloaded CPU down
process to prevent destroying the last CPU domain.
3) Finally:
5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
entirely removed the conditions for the race exposed and partially fixed
in 1). The offloading of the CPU down process to a workqueue on another
CPU then becomes unnecessary. But the last CPU belonging to scheduler
domains must still remain online.
Therefore revert the now obsolete commit
2b8272ff4a70b866106ae13c36be7ecbef5d5da2 and move the housekeeping check
under the write-held cpu_hotplug_lock. Since HK_TYPE_DOMAIN will include
both isolcpus= and cpuset isolated partitions, the hotplug lock will
synchronize against concurrent cpuset partition updates.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cpu.c | 37 +++++++++++--------------------------
1 file changed, 11 insertions(+), 26 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index db9f6c539b28..453a806af2ee 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1410,6 +1410,16 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
cpus_write_lock();
+ /*
+ * Keep at least one housekeeping cpu onlined to avoid generating
+ * an empty sched_domain span.
+ */
+ if (cpumask_any_and(cpu_online_mask,
+ housekeeping_cpumask(HK_TYPE_DOMAIN)) >= nr_cpu_ids) {
+ ret = -EBUSY;
+ goto out;
+ }
+
cpuhp_tasks_frozen = tasks_frozen;
prev_state = cpuhp_set_state(cpu, st, target);
@@ -1456,22 +1466,8 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
return ret;
}
-struct cpu_down_work {
- unsigned int cpu;
- enum cpuhp_state target;
-};
-
-static long __cpu_down_maps_locked(void *arg)
-{
- struct cpu_down_work *work = arg;
-
- return _cpu_down(work->cpu, 0, work->target);
-}
-
static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
{
- struct cpu_down_work work = { .cpu = cpu, .target = target, };
-
/*
* If the platform does not support hotplug, report it explicitly to
* differentiate it from a transient offlining failure.
@@ -1480,18 +1476,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
-
- /*
- * Ensure that the control task does not run on the to be offlined
- * CPU to prevent a deadlock against cfs_b->period_timer.
- * Also keep at least one housekeeping cpu onlined to avoid generating
- * an empty sched_domain span.
- */
- for_each_cpu_and(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
- if (cpu != work.cpu)
- return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
- }
- return -EBUSY;
+ return _cpu_down(cpu, 0, target);
}
static int cpu_down(unsigned int cpu, enum cpuhp_state target)
--
2.51.0
* [PATCH 03/33] memcg: Prepare to protect against concurrent isolated cpuset change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 02/33] cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug" Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 04/33] mm: vmstat: " Frederic Weisbecker
` (29 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the memcg workqueue to make sure
that no asynchronous draining is pending or executing on a newly made
isolated CPU, target and queue a drain work under the same RCU read
side critical section.
Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a memcg
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.
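For clarity, the guard(rcu)() based helper added below is equivalent to
this open-coded form:

	static void schedule_drain_work(int cpu, struct work_struct *work)
	{
		rcu_read_lock();
		/* The isolation test and the enqueue share the RCU section */
		if (!cpu_is_isolated(cpu))
			schedule_work_on(cpu, work);
		rcu_read_unlock();
	}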
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
mm/memcontrol.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..1033e52ab6cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1971,6 +1971,13 @@ static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock,
return flush;
}
+static void schedule_drain_work(int cpu, struct work_struct *work)
+{
+ guard(rcu)();
+ if (!cpu_is_isolated(cpu))
+ schedule_work_on(cpu, work);
+}
+
/*
* Drains all per-CPU charge caches for given root_memcg resp. subtree
* of the hierarchy under it.
@@ -2000,8 +2007,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
&memcg_st->flags)) {
if (cpu == curcpu)
drain_local_memcg_stock(&memcg_st->work);
- else if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, &memcg_st->work);
+ else
+ schedule_drain_work(cpu, &memcg_st->work);
}
if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
@@ -2010,8 +2017,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
&obj_st->flags)) {
if (cpu == curcpu)
drain_local_obj_stock(&obj_st->work);
- else if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, &obj_st->work);
+ else
+ schedule_drain_work(cpu, &obj_st->work);
}
}
migrate_enable();
--
2.51.0
* [PATCH 04/33] mm: vmstat: Prepare to protect against concurrent isolated cpuset change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (2 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 03/33] memcg: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 05/33] sched/isolation: Save boot defined domain flags Frederic Weisbecker
` (28 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against the vmstat workqueue to make sure
that no asynchronous vmstat work is pending or executing on a newly made
isolated CPU, target and queue a vmstat work under the same RCU read
side critical section.
Whenever housekeeping updates the HK_TYPE_DOMAIN cpumask, a vmstat
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
mm/vmstat.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..7afb2981501f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2135,11 +2135,13 @@ static void vmstat_shepherd(struct work_struct *w)
* infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
* for all isolated CPUs to avoid interference with the isolated workload.
*/
- if (cpu_is_isolated(cpu))
- continue;
+ scoped_guard(rcu) {
+ if (cpu_is_isolated(cpu))
+ continue;
- if (!delayed_work_pending(dw) && need_update(cpu))
- queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+ if (!delayed_work_pending(dw) && need_update(cpu))
+ queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+ }
cond_resched();
}
--
2.51.0
* [PATCH 05/33] sched/isolation: Save boot defined domain flags
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (3 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 04/33] mm: vmstat: " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-23 15:45 ` Valentin Schneider
2025-10-13 20:31 ` [PATCH 06/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
` (27 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
HK_TYPE_DOMAIN will soon integrate not only the boot-defined isolcpus= CPUs
but also cpuset isolated partitions.
Housekeeping still needs a way to record what was initially passed
to isolcpus= in order to keep these CPUs isolated after a cpuset
isolated partition is modified or destroyed while containing some of
them.
Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
---
include/linux/sched/isolation.h | 1 +
kernel/sched/isolation.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index d8501f4709b5..da22b038942a 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -7,6 +7,7 @@
#include <linux/tick.h>
enum hk_type {
+ HK_TYPE_DOMAIN_BOOT,
HK_TYPE_DOMAIN,
HK_TYPE_MANAGED_IRQ,
HK_TYPE_KERNEL_NOISE,
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a4cf17b1fab0..8690fb705089 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -11,6 +11,7 @@
#include "sched.h"
enum hk_flags {
+ HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
@@ -216,7 +217,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
if (!strncmp(str, "domain,", 7)) {
str += 7;
- flags |= HK_FLAG_DOMAIN;
+ flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
continue;
}
@@ -246,7 +247,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
/* Default behaviour for isolcpus without flags */
if (!flags)
- flags |= HK_FLAG_DOMAIN;
+ flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
return housekeeping_setup(str, flags);
}
--
2.51.0
* [PATCH 06/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (4 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 05/33] sched/isolation: Save boot defined domain flags Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 07/33] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
` (26 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
boot_hk_cpus is an ad-hoc copy of HK_TYPE_DOMAIN_BOOT. Remove it and use
the official version.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
---
kernel/cgroup/cpuset.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 52468d2c178a..8595f1eadf23 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -81,12 +81,6 @@ static cpumask_var_t subpartitions_cpus;
*/
static cpumask_var_t isolated_cpus;
-/*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
- */
-static cpumask_var_t boot_hk_cpus;
-static bool have_boot_isolcpus;
-
/* List of remote partition root children */
static struct list_head remote_children;
@@ -1686,15 +1680,16 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
* @new_cpus: cpu mask
* Return: true if there is conflict, false otherwise
*
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
+ * CPUs outside of HK_TYPE_DOMAIN_BOOT, if defined, can only be used in an
* isolated partition.
*/
static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
{
- if (!have_boot_isolcpus)
+ if (!housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
return false;
- if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
+ if ((prstate != PRS_ISOLATED) &&
+ !cpumask_subset(new_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)))
return true;
return false;
@@ -3824,12 +3819,9 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
- have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
- if (have_boot_isolcpus) {
- BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
- cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
- cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
- }
+ if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+ cpumask_andnot(isolated_cpus, cpu_possible_mask,
+ housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
return 0;
}
--
2.51.0
* [PATCH 07/33] driver core: cpu: Convert /sys/devices/system/cpu/isolated to use HK_TYPE_DOMAIN_BOOT
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (5 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 06/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 08/33] net: Keep ignoring isolated cpuset change Frederic Weisbecker
` (25 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Make sure /sys/devices/system/cpu/isolated only prints what was passed
through the isolcpus= parameter, before HK_TYPE_DOMAIN also
integrates cpuset isolated partitions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/base/cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index fa0a2eef93ac..050f87d5b8d4 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -291,7 +291,7 @@ static ssize_t print_cpus_isolated(struct device *dev,
return -ENOMEM;
cpumask_andnot(isolated, cpu_possible_mask,
- housekeeping_cpumask(HK_TYPE_DOMAIN));
+ housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
len = sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(isolated));
free_cpumask_var(isolated);
--
2.51.0
* [PATCH 08/33] net: Keep ignoring isolated cpuset change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (6 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 07/33] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 09/33] block: Protect against concurrent " Frederic Weisbecker
` (24 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The RPS cpumask can be overridden through sysfs/sysctl. The boot-defined
isolated CPUs are then excluded from that cpumask.
However HK_TYPE_DOMAIN will soon integrate cpuset isolated CPU
updates, and the RPS infrastructure needs more thought to be able
to propagate such changes and synchronize against them.
Keep handling only what was passed through "isolcpus=" for now.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
net/core/net-sysfs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ca878525ad7c..07624b682b08 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1022,7 +1022,7 @@ static int netdev_rx_queue_set_rps_mask(struct netdev_rx_queue *queue,
int rps_cpumask_housekeeping(struct cpumask *mask)
{
if (!cpumask_empty(mask)) {
- cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
if (cpumask_empty(mask))
return -EINVAL;
--
2.51.0
* [PATCH 09/33] block: Protect against concurrent isolated cpuset change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (7 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 08/33] net: Keep ignoring isolated cpuset change Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 10/33] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
` (23 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The block subsystem prevents block kworkers from running on isolated
CPUs, including those defined by cpuset isolated partitions. Since
HK_TYPE_DOMAIN will soon contain both and be subject to runtime
modifications, synchronize against housekeeping using the relevant lock.
For full support of cpuset changes, the block subsystem may need to
propagate changes to the isolated cpumask through the workqueue in the
future.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
block/blk-mq.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f579414161..ed1b8b149a8f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4240,12 +4240,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
/*
* Rule out isolated CPUs from hctx->cpumask to avoid
- * running block kworker on isolated CPUs
+ * running block kworker on isolated CPUs.
+ * FIXME: cpuset should propagate further changes to isolated CPUs
+ * here.
*/
+ rcu_read_lock();
for_each_cpu(cpu, hctx->cpumask) {
if (cpu_is_isolated(cpu))
cpumask_clear_cpu(cpu, hctx->cpumask);
}
+ rcu_read_unlock();
/*
* Initialize batch roundrobin counts
--
2.51.0
* [PATCH 10/33] cpu: Provide lockdep check for CPU hotplug lock write-held
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (8 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 09/33] block: Protect against concurrent " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
` (22 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
cpuset modifies partitions, including isolated ones, while read-holding
the CPU hotplug lock.
This means that write-holding the CPU hotplug lock is safe to
synchronize against housekeeping cpumask changes.
Provide a lockdep check to validate that.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuhplock.h | 1 +
include/linux/percpu-rwsem.h | 1 +
kernel/cpu.c | 5 +++++
3 files changed, 7 insertions(+)
diff --git a/include/linux/cpuhplock.h b/include/linux/cpuhplock.h
index f7aa20f62b87..286b3ab92e15 100644
--- a/include/linux/cpuhplock.h
+++ b/include/linux/cpuhplock.h
@@ -13,6 +13,7 @@
struct device;
extern int lockdep_is_cpus_held(void);
+extern int lockdep_is_cpus_write_held(void);
#ifdef CONFIG_HOTPLUG_CPU
void cpus_write_lock(void);
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 288f5235649a..c8cb010d655e 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -161,6 +161,7 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
__percpu_init_rwsem(sem, #sem, &rwsem_key); \
})
+#define percpu_rwsem_is_write_held(sem) lockdep_is_held_type(sem, 0)
#define percpu_rwsem_is_held(sem) lockdep_is_held(sem)
#define percpu_rwsem_assert_held(sem) lockdep_assert_held(sem)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 453a806af2ee..3b0443f7c486 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -534,6 +534,11 @@ int lockdep_is_cpus_held(void)
{
return percpu_rwsem_is_held(&cpu_hotplug_lock);
}
+
+int lockdep_is_cpus_write_held(void)
+{
+ return percpu_rwsem_is_write_held(&cpu_hotplug_lock);
+}
#endif
static void lockdep_acquire_cpus_lock(void)
--
2.51.0
* [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (9 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 10/33] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-14 13:29 ` Chen Ridong
2025-10-13 20:31 ` [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
` (21 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
cpuset modifies partitions, including isolated ones, while holding the cpuset
mutex.
This means that holding the cpuset mutex is safe to synchronize against
housekeeping cpumask changes.
Provide a lockdep check to validate that.
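A minimal sketch of how such a check can gate an RCU dereference, as a
later patch in this series does for HK_TYPE_DOMAIN:

	/*
	 * Readers must either be within an RCU read side section or hold
	 * one of the locks that prevent cpuset partition updates:
	 */
	mask = rcu_dereference_check(housekeeping.cpumasks[type],
				     lockdep_is_cpus_write_held() ||
				     lockdep_is_cpuset_held());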
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 2 ++
kernel/cgroup/cpuset.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 2ddb256187b5..051d36fec578 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -18,6 +18,8 @@
#include <linux/mmu_context.h>
#include <linux/jump_label.h>
+extern bool lockdep_is_cpuset_held(void);
+
#ifdef CONFIG_CPUSETS
/*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8595f1eadf23..aa1ac7bcf2ea 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -279,6 +279,13 @@ void cpuset_full_unlock(void)
cpus_read_unlock();
}
+#ifdef CONFIG_LOCKDEP
+bool lockdep_is_cpuset_held(void)
+{
+ return lockdep_is_held(&cpuset_mutex);
+}
+#endif
+
static DEFINE_SPINLOCK(callback_lock);
void cpuset_callback_lock_irq(void)
--
2.51.0
* [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (10 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-21 1:46 ` Chen Ridong
2025-10-21 3:49 ` Waiman Long
2025-10-13 20:31 ` [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
` (20 subsequent siblings)
32 siblings, 2 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
HK_TYPE_DOMAIN's cpumask will soon be made modifiable by cpuset.
A mechanism is then needed to synchronize the updates with the
housekeeping cpumask readers.
Turn the housekeeping cpumasks into RCU pointers. Whenever a housekeeping
cpumask is modified, the update side will wait for an RCU grace
period and then propagate the change to interested subsystems when deemed
necessary.
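As an example, a reader electing a target CPU and queueing a work to it
would follow this pattern (a sketch, assuming `work` is already
initialized):

	int cpu;

	rcu_read_lock();
	cpu = cpumask_any_and(housekeeping_cpumask(HK_TYPE_DOMAIN),
			      cpu_online_mask);
	if (cpu < nr_cpu_ids)
		schedule_work_on(cpu, &work);
	rcu_read_unlock();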
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/sched/isolation.c | 58 +++++++++++++++++++++++++---------------
kernel/sched/sched.h | 1 +
2 files changed, 37 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 8690fb705089..b46c20b5437f 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -21,7 +21,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
EXPORT_SYMBOL_GPL(housekeeping_overridden);
struct housekeeping {
- cpumask_var_t cpumasks[HK_TYPE_MAX];
+ struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
unsigned long flags;
};
@@ -33,17 +33,28 @@ bool housekeeping_enabled(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+const struct cpumask *housekeeping_cpumask(enum hk_type type)
+{
+ if (static_branch_unlikely(&housekeeping_overridden)) {
+ if (housekeeping.flags & BIT(type)) {
+ return rcu_dereference_check(housekeeping.cpumasks[type], 1);
+ }
+ }
+ return cpu_possible_mask;
+}
+EXPORT_SYMBOL_GPL(housekeeping_cpumask);
+
int housekeeping_any_cpu(enum hk_type type)
{
int cpu;
if (static_branch_unlikely(&housekeeping_overridden)) {
if (housekeeping.flags & BIT(type)) {
- cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
+ cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
if (cpu < nr_cpu_ids)
return cpu;
- cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
+ cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
if (likely(cpu < nr_cpu_ids))
return cpu;
/*
@@ -59,28 +70,18 @@ int housekeeping_any_cpu(enum hk_type type)
}
EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
-const struct cpumask *housekeeping_cpumask(enum hk_type type)
-{
- if (static_branch_unlikely(&housekeeping_overridden))
- if (housekeeping.flags & BIT(type))
- return housekeeping.cpumasks[type];
- return cpu_possible_mask;
-}
-EXPORT_SYMBOL_GPL(housekeeping_cpumask);
-
void housekeeping_affine(struct task_struct *t, enum hk_type type)
{
if (static_branch_unlikely(&housekeeping_overridden))
if (housekeeping.flags & BIT(type))
- set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
+ set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
}
EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
- if (static_branch_unlikely(&housekeeping_overridden))
- if (housekeeping.flags & BIT(type))
- return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
+ if (housekeeping.flags & BIT(type))
+ return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
@@ -96,20 +97,33 @@ void __init housekeeping_init(void)
if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
sched_tick_offload_init();
-
+ /*
+ * Realloc with a proper allocator so that any cpumask update
+ * can indifferently free the old version with kfree().
+ */
for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
+ struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
+
+ if (WARN_ON_ONCE(!nmask))
+ return;
+
+ omask = rcu_dereference(housekeeping.cpumasks[type]);
+
/* We need at least one CPU to handle housekeeping work */
- WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
+ WARN_ON_ONCE(cpumask_empty(omask));
+ cpumask_copy(nmask, omask);
+ RCU_INIT_POINTER(housekeeping.cpumasks[type], nmask);
+ memblock_free(omask, cpumask_size());
}
}
static void __init housekeeping_setup_type(enum hk_type type,
cpumask_var_t housekeeping_staging)
{
+ struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
- alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
- cpumask_copy(housekeeping.cpumasks[type],
- housekeeping_staging);
+ cpumask_copy(mask, housekeeping_staging);
+ RCU_INIT_POINTER(housekeeping.cpumasks[type], mask);
}
static int __init housekeeping_setup(char *str, unsigned long flags)
@@ -162,7 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
if (!cpumask_equal(housekeeping_staging,
- housekeeping.cpumasks[type])) {
+ housekeeping_cpumask(type))) {
pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
goto free_housekeeping_staging;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f5d07067f60..0c0ef8999fd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,6 +42,7 @@
#include <linux/ktime_api.h>
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
+#include <linux/memblock.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/module.h>
--
2.51.0
* [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (11 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-21 4:10 ` Waiman Long
2025-10-21 13:39 ` Waiman Long
2025-10-13 20:31 ` [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
` (19 subsequent siblings)
32 siblings, 2 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Until now, HK_TYPE_DOMAIN only included the boot-defined isolated
CPUs passed through the isolcpus= boot option. Users interested in also
knowing the runtime-defined isolated CPUs through cpuset must use
different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...
There are many drawbacks to that approach:
1) Most interested subsystems want to know about all isolated CPUs, not
just those defined at boot time.
2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
concurrent cpuset changes.
3) Further cpuset modifications are not propagated to subsystems.
Solve 1) and 2) by centralizing all isolated CPUs within the
HK_TYPE_DOMAIN housekeeping cpumask.
Subsystems can rely on RCU to synchronize against concurrent changes.
The propagation mentioned in 3) will be handled in further patches.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 2 +
kernel/cgroup/cpuset.c | 2 +
kernel/sched/isolation.c | 75 ++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 1 +
4 files changed, 74 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index da22b038942a..94d5c835121b 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -32,6 +32,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
extern bool housekeeping_enabled(enum hk_type type);
extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
+extern int housekeeping_update(struct cpumask *mask, enum hk_type type);
extern void __init housekeeping_init(void);
#else
@@ -59,6 +60,7 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
return true;
}
+static inline int housekeeping_update(struct cpumask *mask, enum hk_type type) { return 0; }
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aa1ac7bcf2ea..b04a4242f2fa 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1403,6 +1403,8 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
WARN_ON_ONCE(ret < 0);
+ ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
+ WARN_ON_ONCE(ret < 0);
}
/**
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b46c20b5437f..95d69c2102f6 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -29,18 +29,48 @@ static struct housekeeping housekeeping;
bool housekeeping_enabled(enum hk_type type)
{
- return !!(housekeeping.flags & BIT(type));
+ return !!(READ_ONCE(housekeeping.flags) & BIT(type));
}
EXPORT_SYMBOL_GPL(housekeeping_enabled);
+static bool housekeeping_dereference_check(enum hk_type type)
+{
+ if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
+ /* Cpuset isn't even writable yet? */
+ if (system_state <= SYSTEM_SCHEDULING)
+ return true;
+
+ /* CPU hotplug write locked, so cpuset partition can't be overwritten */
+ if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+ return true;
+
+ /* Cpuset lock held, partitions not writable */
+ if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+ return true;
+
+ return false;
+ }
+
+ return true;
+}
+
+static inline struct cpumask *housekeeping_cpumask_dereference(enum hk_type type)
+{
+ return rcu_dereference_check(housekeeping.cpumasks[type],
+ housekeeping_dereference_check(type));
+}
+
const struct cpumask *housekeeping_cpumask(enum hk_type type)
{
+ const struct cpumask *mask = NULL;
+
if (static_branch_unlikely(&housekeeping_overridden)) {
- if (housekeeping.flags & BIT(type)) {
- return rcu_dereference_check(housekeeping.cpumasks[type], 1);
- }
+ if (READ_ONCE(housekeeping.flags) & BIT(type))
+ mask = housekeeping_cpumask_dereference(type);
}
- return cpu_possible_mask;
+ if (!mask)
+ mask = cpu_possible_mask;
+ return mask;
}
EXPORT_SYMBOL_GPL(housekeeping_cpumask);
@@ -80,12 +110,45 @@ EXPORT_SYMBOL_GPL(housekeeping_affine);
bool housekeeping_test_cpu(int cpu, enum hk_type type)
{
- if (housekeeping.flags & BIT(type))
+ if (READ_ONCE(housekeeping.flags) & BIT(type))
return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
return true;
}
EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
+int housekeeping_update(struct cpumask *mask, enum hk_type type)
+{
+ struct cpumask *trial, *old = NULL;
+
+ if (type != HK_TYPE_DOMAIN)
+ return -ENOTSUPP;
+
+ trial = kmalloc(sizeof(*trial), GFP_KERNEL);
+ if (!trial)
+ return -ENOMEM;
+
+ cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), mask);
+ if (!cpumask_intersects(trial, cpu_online_mask)) {
+ kfree(trial);
+ return -EINVAL;
+ }
+
+ if (!housekeeping.flags)
+ static_branch_enable(&housekeeping_overridden);
+
+	if (housekeeping.flags & BIT(type))
+		old = housekeeping_cpumask_dereference(type);
+	else
+		WRITE_ONCE(housekeeping.flags, housekeeping.flags | BIT(type));
+ rcu_assign_pointer(housekeeping.cpumasks[type], trial);
+
+ synchronize_rcu();
+
+ kfree(old);
+
+ return 0;
+}
+
void __init housekeeping_init(void)
{
enum hk_type type;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0c0ef8999fd6..8fac8aa451c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -30,6 +30,7 @@
#include <linux/context_tracking.h>
#include <linux/cpufreq.h>
#include <linux/cpumask_api.h>
+#include <linux/cpuset.h>
#include <linux/ctype.h>
#include <linux/file.h>
#include <linux/fs_api.h>
--
2.51.0
* [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (12 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-21 19:16 ` Waiman Long
2025-10-13 20:31 ` [PATCH 15/33] sched/isolation: Flush vmstat " Frederic Weisbecker
` (18 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against the memcg workqueue and make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping subsystem must flush the memcg
workqueues.
However the memcg works can't be flushed easily since they are
queued to the main per-CPU workqueue pool.
Solve this by creating a memcg-specific workqueue and by providing and
using the appropriate flushing API.
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/memcontrol.h | 4 ++++
kernel/sched/isolation.c | 2 ++
kernel/sched/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++++-
4 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 873e510d6f8d..001200df63cf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1074,6 +1074,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return id;
}
+void mem_cgroup_flush_workqueue(void);
+
extern int mem_cgroup_init(void);
#else /* CONFIG_MEMCG */
@@ -1481,6 +1483,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
return 0;
}
+static inline void mem_cgroup_flush_workqueue(void) { }
+
static inline int mem_cgroup_init(void) { return 0; }
#endif /* CONFIG_MEMCG */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 95d69c2102f6..9ec365dea921 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -144,6 +144,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
+ mem_cgroup_flush_workqueue();
+
kfree(old);
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8fac8aa451c6..8bfc0b4b133f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
#include <linux/memblock.h>
+#include <linux/memcontrol.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/module.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1033e52ab6cf..1aa14e543f35 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,8 @@ static bool cgroup_memory_nokmem __ro_after_init;
/* BPF memory accounting disabled? */
static bool cgroup_memory_nobpf __ro_after_init;
+static struct workqueue_struct *memcg_wq __ro_after_init;
+
static struct kmem_cache *memcg_cachep;
static struct kmem_cache *memcg_pn_cachep;
@@ -1975,7 +1977,7 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
{
guard(rcu)();
if (!cpu_is_isolated(cpu))
- schedule_work_on(cpu, work);
+ queue_work_on(cpu, memcg_wq, work);
}
/*
@@ -5092,6 +5094,11 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+void mem_cgroup_flush_workqueue(void)
+{
+ flush_workqueue(memcg_wq);
+}
+
static int __init cgroup_memory(char *s)
{
char *token;
@@ -5134,6 +5141,9 @@ int __init mem_cgroup_init(void)
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);
+ memcg_wq = alloc_workqueue("memcg", 0, 0);
+ WARN_ON(!memcg_wq);
+
for_each_possible_cpu(cpu) {
INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
drain_local_memcg_stock);
--
2.51.0
* [PATCH 15/33] sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (13 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 16/33] PCI: Flush PCI probe workqueue " Frederic Weisbecker
` (17 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to make sure that no asynchronous vmstat work is still pending
or executing on a newly made isolated CPU, the housekeeping subsystem
must flush the vmstat workqueues.
This involves flushing the whole mm_percpu_wq workqueue, which is also
shared with the LRU drain works, making their flush a welcome side
effect.
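For reference, housekeeping_update() then publishes the new mask and
performs the flushes in this order. This is only a simplified sketch:
the trial mask allocation, locking and error handling of the real
function are omitted, and swap_housekeeping_mask() is a hypothetical
stand-in for the RCU pointer update:

	int housekeeping_update(struct cpumask *mask, enum hk_type type)
	{
		/* Hypothetical helper: publish new mask, return the old one */
		struct cpumask *old = swap_housekeeping_mask(mask, type);

		/* Wait for all readers to observe the new housekeeping mask */
		synchronize_rcu();

		/* Flush works that may already target newly isolated CPUs */
		mem_cgroup_flush_workqueue();
		vmstat_flush_workqueue();

		kfree(old);
		return 0;
	}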
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/vmstat.h | 2 ++
kernel/sched/isolation.c | 1 +
kernel/sched/sched.h | 1 +
mm/vmstat.c | 5 +++++
4 files changed, 9 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index c287998908bf..a81aa5635b47 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -303,6 +303,7 @@ int calculate_pressure_threshold(struct zone *zone);
int calculate_normal_threshold(struct zone *zone);
void set_pgdat_percpu_threshold(pg_data_t *pgdat,
int (*calculate_pressure)(struct zone *));
+void vmstat_flush_workqueue(void);
#else /* CONFIG_SMP */
/*
@@ -403,6 +404,7 @@ static inline void __dec_node_page_state(struct page *page,
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
+static inline void vmstat_flush_workqueue(void) { }
static inline void drain_zonestat(struct zone *zone,
struct per_cpu_zonestat *pzstats) { }
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9ec365dea921..5cd3d98a2663 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -145,6 +145,7 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
mem_cgroup_flush_workqueue();
+ vmstat_flush_workqueue();
kfree(old);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8bfc0b4b133f..84525885a3de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -68,6 +68,7 @@
#include <linux/types.h>
#include <linux/u64_stats_sync_api.h>
#include <linux/uaccess.h>
+#include <linux/vmstat.h>
#include <linux/wait_api.h>
#include <linux/wait_bit.h>
#include <linux/workqueue_api.h>
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7afb2981501f..506d3ca2e47f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2115,6 +2115,11 @@ static void vmstat_shepherd(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
+void vmstat_flush_workqueue(void)
+{
+ flush_workqueue(mm_percpu_wq);
+}
+
static void vmstat_shepherd(struct work_struct *w)
{
int cpu;
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 16/33] PCI: Flush PCI probe workqueue on cpuset isolated partition change
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (14 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 15/33] sched/isolation: Flush vmstat " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-14 20:50 ` Bjorn Helgaas
2025-10-13 20:31 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
` (16 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to make sure that no asynchronous PCI probe work is still
pending or executing on a newly made isolated CPU, the housekeeping
subsystem must flush the PCI probe works.
However the PCI probe works can't easily be flushed as long as they are
queued to the main per-CPU workqueue pool.
Solve this by creating a PCI probe specific pool and by providing and
using the appropriate flushing API.
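The design point: flush_workqueue() waits for every work item queued on
the given workqueue, so flushing the shared system per-CPU pool from the
housekeeping code would block on arbitrary unrelated works. A dedicated
queue narrows the wait down to PCI probe works. A minimal sketch, with
pci_probe_queue() as a hypothetical wrapper around the queueing site in
the diff below:

	static struct workqueue_struct *pci_probe_wq;

	static void pci_probe_queue(int cpu, struct work_struct *work)
	{
		/* Probe works land on the dedicated per-CPU pool... */
		queue_work_on(cpu, pci_probe_wq, work);
	}

	void pci_probe_flush_workqueue(void)
	{
		/* ...so flushing it only waits for PCI probe works */
		flush_workqueue(pci_probe_wq);
	}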
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/pci/pci-driver.c | 17 ++++++++++++++++-
include/linux/pci.h | 3 +++
kernel/sched/isolation.c | 2 ++
3 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7b74d22b20f7..ac86aaec8bcf 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -337,6 +337,8 @@ static int local_pci_probe(struct drv_dev_and_id *ddi)
return 0;
}
+static struct workqueue_struct *pci_probe_wq;
+
struct pci_probe_arg {
struct drv_dev_and_id *ddi;
struct work_struct work;
@@ -407,7 +409,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
cpu = cpumask_any_and(cpumask_of_node(node),
wq_domain_mask);
if (cpu < nr_cpu_ids) {
- schedule_work_on(cpu, &arg.work);
+ struct workqueue_struct *wq = pci_probe_wq;
+
+ if (WARN_ON_ONCE(!wq))
+ wq = system_percpu_wq;
+ queue_work_on(cpu, wq, &arg.work);
rcu_read_unlock();
flush_work(&arg.work);
error = arg.ret;
@@ -425,6 +431,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
return error;
}
+void pci_probe_flush_workqueue(void)
+{
+ flush_workqueue(pci_probe_wq);
+}
+
/**
* __pci_device_probe - check if a driver wants to claim a specific PCI device
* @drv: driver to call to check if it wants the PCI device
@@ -1760,6 +1771,10 @@ static int __init pci_driver_init(void)
{
int ret;
+ pci_probe_wq = alloc_workqueue("pci_probe_wq", WQ_PERCPU, 0);
+ if (!pci_probe_wq)
+ return -ENOMEM;
+
ret = bus_register(&pci_bus_type);
if (ret)
return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d1fdf81fbe1e..3281c235b895 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1175,6 +1175,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
struct pci_ops *ops, void *sysdata,
struct list_head *resources);
int pci_host_probe(struct pci_host_bridge *bridge);
+void pci_probe_flush_workqueue(void);
int pci_bus_insert_busn_res(struct pci_bus *b, int bus, int busmax);
int pci_bus_update_busn_res_end(struct pci_bus *b, int busmax);
void pci_bus_release_busn_res(struct pci_bus *b);
@@ -2037,6 +2038,8 @@ static inline int pci_has_flag(int flag) { return 0; }
_PCI_NOP_ALL(read, *)
_PCI_NOP_ALL(write,)
+static inline void pci_probe_flush_workqueue(void) { }
+
static inline struct pci_dev *pci_get_device(unsigned int vendor,
unsigned int device,
struct pci_dev *from)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5cd3d98a2663..b1eea5440484 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,7 @@
*
*/
#include <linux/sched/isolation.h>
+#include <linux/pci.h>
#include "sched.h"
enum hk_flags {
@@ -144,6 +145,7 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
synchronize_rcu();
+ pci_probe_flush_workqueue();
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (15 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 16/33] PCI: Flush PCI probe workqueue " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
` (15 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.
Since housekeeping now centralizes, synchronizes and propagates
isolation cpumask changes, perform that work from the housekeeping
subsystem instead, for consolidation and consistency purposes.
For simplification, the target function is adapted to take the new
housekeeping mask instead of the isolated mask.
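To illustrate the new semantics on a hypothetical 8 CPU system: if the
requested unbound cpumask (set at boot and rewritable through the
workqueue/cpumask sysfs file, as the comment below notes) is 0-7 and an
isolated partition takes CPUs 6-7, the new HK_TYPE_DOMAIN housekeeping
mask hk is 0-5, so that:

	wq_unbound_cpumask  = wq_requested_unbound_cpumask & hk	/* 0-5 */
	wq_isolated_cpumask = cpu_possible_mask & ~hk		/* 6-7 */

Should the intersection turn out empty, the requested mask is kept
as-is, as before.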
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/workqueue.h | 2 +-
init/Kconfig | 1 +
kernel/cgroup/cpuset.c | 14 ++++++--------
kernel/sched/isolation.c | 4 +++-
kernel/workqueue.c | 17 ++++++++++-------
5 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index dabc351cc127..a4749f56398f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -588,7 +588,7 @@ struct workqueue_attrs *alloc_workqueue_attrs_noprof(void);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs);
-extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_housekeeping_update(const struct cpumask *hk);
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad28ca49..a1b3a3b66bfc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1247,6 +1247,7 @@ config CPUSETS
bool "Cpuset controller"
depends on SMP
select UNION_FIND
+ select CPU_ISOLATION
help
This option will let you create and manage CPUSETs which
allow dynamically partitioning a system into sets of CPUs and
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b04a4242f2fa..ea102e4695a5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1392,7 +1392,7 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
-static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
+static void update_housekeeping_cpumask(bool isolcpus_updated)
{
int ret;
@@ -1401,8 +1401,6 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
if (!isolcpus_updated)
return;
- ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
- WARN_ON_ONCE(ret < 0);
ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
WARN_ON_ONCE(ret < 0);
}
@@ -1558,7 +1556,7 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
list_add(&cs->remote_sibling, &remote_children);
cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
cs->prs_err = 0;
@@ -1599,7 +1597,7 @@ static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
compute_excpus(cs, cs->effective_xcpus);
reset_partition_data(cs);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
cpuset_force_rebuild();
/*
@@ -1668,7 +1666,7 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
if (xcpus)
cpumask_copy(cs->exclusive_cpus, xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if (adding || deleting)
cpuset_force_rebuild();
@@ -2027,7 +2025,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
WARN_ON_ONCE(parent->nr_subparts < 0);
}
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
if ((old_prs != new_prs) && (cmd == partcmd_update))
update_partition_exclusive_flag(cs, new_prs);
@@ -3047,7 +3045,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
else if (isolcpus_updated)
isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
spin_unlock_irq(&callback_lock);
- update_unbound_workqueue_cpumask(isolcpus_updated);
+ update_housekeeping_cpumask(isolcpus_updated);
/* Force update if switching back to member & update effective_xcpus */
update_cpumasks_hier(cs, &tmpmask, !new_prs);
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index b1eea5440484..691f045ab758 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -120,6 +120,7 @@ EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
int housekeeping_update(struct cpumask *mask, enum hk_type type)
{
struct cpumask *trial, *old = NULL;
+ int err;
if (type != HK_TYPE_DOMAIN)
return -ENOTSUPP;
@@ -148,10 +149,11 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
pci_probe_flush_workqueue();
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
+ err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(type));
kfree(old);
- return 0;
+ return err;
}
void __init housekeeping_init(void)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 45320e27a16c..32a436b76137 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6945,13 +6945,16 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
}
/**
- * workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
- * @exclude_cpumask: the cpumask to be excluded from wq_unbound_cpumask
+ * workqueue_unbound_housekeeping_update - Propagate housekeeping cpumask update
+ * @hk: the new housekeeping cpumask
*
- * This function can be called from cpuset code to provide a set of isolated
- * CPUs that should be excluded from wq_unbound_cpumask.
+ * Update the unbound workqueue cpumask on top of the new housekeeping cpumask such
+ * that the effective unbound affinity is the intersection of the new housekeeping
+ * with the requested affinity set via nohz_full=/isolcpus= or sysfs.
+ *
+ * Return: 0 on success and -errno on failure.
*/
-int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+int workqueue_unbound_housekeeping_update(const struct cpumask *hk)
{
cpumask_var_t cpumask;
int ret = 0;
@@ -6967,14 +6970,14 @@ int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
* (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten
* by any subsequent write to workqueue/cpumask sysfs file.
*/
- if (!cpumask_andnot(cpumask, wq_requested_unbound_cpumask, exclude_cpumask))
+ if (!cpumask_and(cpumask, wq_requested_unbound_cpumask, hk))
cpumask_copy(cpumask, wq_requested_unbound_cpumask);
if (!cpumask_equal(cpumask, wq_unbound_cpumask))
ret = workqueue_apply_unbound_cpumask(cpumask);
/* Save the current isolated cpumask & export it via sysfs */
if (!ret)
- cpumask_copy(wq_isolated_cpumask, exclude_cpumask);
+ cpumask_andnot(wq_isolated_cpumask, cpu_possible_mask, hk);
mutex_unlock(&wq_pool_mutex);
free_cpumask_var(cpumask);
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated()
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (16 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-29 18:05 ` Waiman Long
2025-10-13 20:31 ` [PATCH 19/33] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
` (14 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The set of cpuset isolated CPUs is now integrated into the
HK_TYPE_DOMAIN housekeeping state. No use case is left that is
interested in checking only what is isolated by cpuset, as opposed to
what is isolated by the isolcpus= kernel boot parameter. Remove the now
unused cpuset_cpu_is_isolated() helper accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/cpuset.h | 6 ------
include/linux/sched/isolation.h | 3 +--
kernel/cgroup/cpuset.c | 12 ------------
3 files changed, 1 insertion(+), 20 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 051d36fec578..a10775a4f702 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -78,7 +78,6 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -208,11 +207,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
return false;
}
-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
- return false;
-}
-
static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
return node_possible_map;
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 94d5c835121b..0f50c152cf68 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -76,8 +76,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
- cpuset_cpu_is_isolated(cpu);
+ !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
}
#endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ea102e4695a5..e19d3375a4ec 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -29,7 +29,6 @@
#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/memory.h>
-#include <linux/export.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/sched/deadline.h>
@@ -1405,17 +1404,6 @@ static void update_housekeeping_cpumask(bool isolcpus_updated)
WARN_ON_ONCE(ret < 0);
}
-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
- return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
-
/**
* rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
* @parent: Parent cpuset containing all siblings
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 19/33] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (17 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 20/33] PCI: Remove superfluous HK_TYPE_WQ check Frederic Weisbecker
` (13 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
It doesn't make sense to use nohz_full without also isolating the
related CPUs from the domain topology, either through the use of
isolcpus= or cpuset isolated partitions.
And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs.
This means that the HK_TYPE_KERNEL_NOISE isolated set (of which
HK_TYPE_TICK is only an alias) should always be a subset of the
HK_TYPE_DOMAIN isolated set. Therefore a CPU that passes the
HK_TYPE_DOMAIN housekeeping test also passes the HK_TYPE_TICK one, and
a CPU that fails the latter also fails the former: testing
HK_TYPE_DOMAIN alone is enough.
Simplify cpu_is_isolated() accordingly.
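As an illustration: booting with nohz_full=4-7 makes CPUs 4-7 noise
isolated, and putting CPUs 2-7 in a cpuset isolated partition makes
CPUs 2-7 domain isolated. The noise isolated set {4-7} is a subset of
the domain isolated set {2-7}, so every CPU failing the HK_TYPE_TICK
housekeeping test also fails the HK_TYPE_DOMAIN one, making the second
test redundant.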
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/sched/isolation.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 0f50c152cf68..eef539820de2 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -75,8 +75,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
static inline bool cpu_is_isolated(int cpu)
{
- return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
- !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+ return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN);
}
#endif /* _LINUX_SCHED_ISOLATION_H */
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 20/33] PCI: Remove superfluous HK_TYPE_WQ check
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (18 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 19/33] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 21/33] kthread: Refine naming of affinity related fields Frederic Weisbecker
` (12 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
It doesn't make sense to use nohz_full without also isolating the
related CPUs from the domain topology, either through the use of
isolcpus= or cpuset isolated partitions.
And now HK_TYPE_DOMAIN includes all kinds of domain isolated CPUs.
This means that the HK_TYPE_KERNEL_NOISE isolated set (of which
HK_TYPE_WQ is only an alias) should always be a subset of the
HK_TYPE_DOMAIN isolated set.
Therefore sane configurations satisfy, in terms of isolated CPU sets:
HK_TYPE_KERNEL_NOISE | HK_TYPE_DOMAIN == HK_TYPE_DOMAIN
Simplify the PCI probe target election accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
drivers/pci/pci-driver.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index ac86aaec8bcf..e731aaf28c76 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -384,16 +384,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
pci_physfn_is_probed(dev)) {
error = local_pci_probe(&ddi);
} else {
- cpumask_var_t wq_domain_mask;
struct pci_probe_arg arg = { .ddi = &ddi };
INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
-
- if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
- error = -ENOMEM;
- goto out;
- }
-
/*
* The target election and the enqueue of the work must be within
* the same RCU read side section so that when the workqueue pool
@@ -402,12 +395,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
* targets.
*/
rcu_read_lock();
- cpumask_and(wq_domain_mask,
- housekeeping_cpumask(HK_TYPE_WQ),
- housekeeping_cpumask(HK_TYPE_DOMAIN));
-
cpu = cpumask_any_and(cpumask_of_node(node),
- wq_domain_mask);
+ housekeeping_cpumask(HK_TYPE_DOMAIN));
+
if (cpu < nr_cpu_ids) {
struct workqueue_struct *wq = pci_probe_wq;
@@ -422,10 +412,9 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
error = local_pci_probe(&ddi);
}
- free_cpumask_var(wq_domain_mask);
destroy_work_on_stack(&arg.work);
}
-out:
+
dev->is_probed = 0;
cpu_hotplug_enable();
return error;
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 21/33] kthread: Refine naming of affinity related fields
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (19 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 20/33] PCI: Remove superfluous HK_TYPE_WQ check Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
` (11 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The kthread preferred affinity related fields use "hotplug" as the
base of their naming because the affinity management was initially
only meant to deal with CPU hotplug.
The scope of this role is now going to broaden and also deal with
cpuset isolated partition updates.
Switch the naming accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 31b072e8d427..c4dd967e9e9c 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,8 +35,8 @@ static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;
-static LIST_HEAD(kthreads_hotplug);
-static DEFINE_MUTEX(kthreads_hotplug_lock);
+static LIST_HEAD(kthread_affinity_list);
+static DEFINE_MUTEX(kthread_affinity_lock);
struct kthread_create_info
{
@@ -69,7 +69,7 @@ struct kthread {
/* To store the full name if task comm is truncated. */
char *full_name;
struct task_struct *task;
- struct list_head hotplug_node;
+ struct list_head affinity_node;
struct cpumask *preferred_affinity;
};
@@ -128,7 +128,7 @@ bool set_kthread_struct(struct task_struct *p)
init_completion(&kthread->exited);
init_completion(&kthread->parked);
- INIT_LIST_HEAD(&kthread->hotplug_node);
+ INIT_LIST_HEAD(&kthread->affinity_node);
p->vfork_done = &kthread->exited;
kthread->task = p;
@@ -323,10 +323,10 @@ void __noreturn kthread_exit(long result)
{
struct kthread *kthread = to_kthread(current);
kthread->result = result;
- if (!list_empty(&kthread->hotplug_node)) {
- mutex_lock(&kthreads_hotplug_lock);
- list_del(&kthread->hotplug_node);
- mutex_unlock(&kthreads_hotplug_lock);
+ if (!list_empty(&kthread->affinity_node)) {
+ mutex_lock(&kthread_affinity_lock);
+ list_del(&kthread->affinity_node);
+ mutex_unlock(&kthread_affinity_lock);
if (kthread->preferred_affinity) {
kfree(kthread->preferred_affinity);
@@ -390,9 +390,9 @@ static void kthread_affine_node(void)
return;
}
- mutex_lock(&kthreads_hotplug_lock);
- WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
- list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+ mutex_lock(&kthread_affinity_lock);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
/*
* The node cpumask is racy when read from kthread() but:
* - a racing CPU going down will either fail on the subsequent
@@ -402,7 +402,7 @@ static void kthread_affine_node(void)
*/
kthread_fetch_affinity(kthread, affinity);
set_cpus_allowed_ptr(current, affinity);
- mutex_unlock(&kthreads_hotplug_lock);
+ mutex_unlock(&kthread_affinity_lock);
free_cpumask_var(affinity);
}
@@ -876,10 +876,10 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
goto out;
}
- mutex_lock(&kthreads_hotplug_lock);
+ mutex_lock(&kthread_affinity_lock);
cpumask_copy(kthread->preferred_affinity, mask);
- WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
- list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
kthread_fetch_affinity(kthread, affinity);
/* It's safe because the task is inactive. */
@@ -887,7 +887,7 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
do_set_cpus_allowed(p, affinity);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
- mutex_unlock(&kthreads_hotplug_lock);
+ mutex_unlock(&kthread_affinity_lock);
out:
free_cpumask_var(affinity);
@@ -908,9 +908,9 @@ static int kthreads_online_cpu(unsigned int cpu)
struct kthread *k;
int ret;
- guard(mutex)(&kthreads_hotplug_lock);
+ guard(mutex)(&kthread_affinity_lock);
- if (list_empty(&kthreads_hotplug))
+ if (list_empty(&kthread_affinity_list))
return 0;
if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
@@ -918,7 +918,7 @@ static int kthreads_online_cpu(unsigned int cpu)
ret = 0;
- list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
+ list_for_each_entry(k, &kthread_affinity_list, affinity_node) {
if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
kthread_is_per_cpu(k->task))) {
ret = -EINVAL;
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (20 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 21/33] kthread: Refine naming of affinity related fields Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-21 22:42 ` Waiman Long
2025-10-13 20:31 ` [PATCH 23/33] kthread: Include kthreadd to " Frederic Weisbecker
` (10 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The managed affinity list currently contains only unbound kthreads
that have affinity preferences. Unbound kthreads that are globally
affine by default are kept off the list because their affinity is
automatically managed by the scheduler (through the fallback
housekeeping mask) and by cpuset.
However, in order to preserve the preferred affinity of kthreads,
cpuset will delegate the propagation of isolated partition updates to
the housekeeping and kthread code.
Prepare for that by including all unbound kthreads in the managed
affinity list.
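For background, a kthread enters the managed affinity list with a
preference through kthread_affine_preferred(), the API visible in the
diff below. A minimal usage sketch, where my_thread_fn(),
my_driver_init() and the "my_kthread" name are hypothetical:

	static int my_thread_fn(void *unused)
	{
		while (!kthread_should_stop())
			schedule_timeout_interruptible(HZ);
		return 0;
	}

	static int my_driver_init(void)
	{
		struct task_struct *t;

		t = kthread_create(my_thread_fn, NULL, "my_kthread");
		if (IS_ERR(t))
			return PTR_ERR(t);

		/* Must be called while the task is still inactive */
		kthread_affine_preferred(t, cpumask_of(1));
		wake_up_process(t);
		return 0;
	}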
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 59 ++++++++++++++++++++++++------------------------
1 file changed, 30 insertions(+), 29 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index c4dd967e9e9c..cba3d297f267 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -365,9 +365,10 @@ static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpum
if (kthread->preferred_affinity) {
pref = kthread->preferred_affinity;
} else {
- if (WARN_ON_ONCE(kthread->node == NUMA_NO_NODE))
- return;
- pref = cpumask_of_node(kthread->node);
+ if (kthread->node == NUMA_NO_NODE)
+ pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
+ else
+ pref = cpumask_of_node(kthread->node);
}
cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
@@ -380,32 +381,29 @@ static void kthread_affine_node(void)
struct kthread *kthread = to_kthread(current);
cpumask_var_t affinity;
- WARN_ON_ONCE(kthread_is_per_cpu(current));
+ if (WARN_ON_ONCE(kthread_is_per_cpu(current)))
+ return;
- if (kthread->node == NUMA_NO_NODE) {
- housekeeping_affine(current, HK_TYPE_KTHREAD);
- } else {
- if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
- WARN_ON_ONCE(1);
- return;
- }
-
- mutex_lock(&kthread_affinity_lock);
- WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
- list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
- /*
- * The node cpumask is racy when read from kthread() but:
- * - a racing CPU going down will either fail on the subsequent
- * call to set_cpus_allowed_ptr() or be migrated to housekeepers
- * afterwards by the scheduler.
- * - a racing CPU going up will be handled by kthreads_online_cpu()
- */
- kthread_fetch_affinity(kthread, affinity);
- set_cpus_allowed_ptr(current, affinity);
- mutex_unlock(&kthread_affinity_lock);
-
- free_cpumask_var(affinity);
+ if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
+ WARN_ON_ONCE(1);
+ return;
}
+
+ mutex_lock(&kthread_affinity_lock);
+ WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+ list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
+ /*
+ * The node cpumask is racy when read from kthread() but:
+ * - a racing CPU going down will either fail on the subsequent
+ * call to set_cpus_allowed_ptr() or be migrated to housekeepers
+ * afterwards by the scheduler.
+ * - a racing CPU going up will be handled by kthreads_online_cpu()
+ */
+ kthread_fetch_affinity(kthread, affinity);
+ set_cpus_allowed_ptr(current, affinity);
+ mutex_unlock(&kthread_affinity_lock);
+
+ free_cpumask_var(affinity);
}
static int kthread(void *_create)
@@ -924,8 +922,11 @@ static int kthreads_online_cpu(unsigned int cpu)
ret = -EINVAL;
continue;
}
- kthread_fetch_affinity(k, affinity);
- set_cpus_allowed_ptr(k->task, affinity);
+
+ if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
+ kthread_fetch_affinity(k, affinity);
+ set_cpus_allowed_ptr(k->task, affinity);
+ }
}
free_cpumask_var(affinity);
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 23/33] kthread: Include kthreadd to the managed affinity list
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (21 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 24/33] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
` (9 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The unbound kthread affinity management performed by cpuset is going
to be moved into the kthread core code for consolidation purposes.
Treat kthreadd just like any other unbound kthread.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index cba3d297f267..cb0be05d6091 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -820,12 +820,13 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, comm);
ignore_signals(tsk);
- set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_KTHREAD));
set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
cgroup_init_kthreadd();
+ kthread_affine_node();
+
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (list_empty(&kthread_create_list))
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 24/33] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (22 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 23/33] kthread: Include kthreadd to " Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 25/33] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
` (8 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Unbound kthreads must run neither on nohz_full CPUs nor on domain
isolated CPUs. And since nohz_full implies domain isolation, checking
the latter is enough to cover both.
Therefore rely on HK_TYPE_DOMAIN alone to keep kthreads off isolated
CPUs.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index cb0be05d6091..8d0c8c4c7e46 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -362,18 +362,20 @@ static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpum
{
const struct cpumask *pref;
+ guard(rcu)();
+
if (kthread->preferred_affinity) {
pref = kthread->preferred_affinity;
} else {
if (kthread->node == NUMA_NO_NODE)
- pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
+ pref = housekeeping_cpumask(HK_TYPE_DOMAIN);
else
pref = cpumask_of_node(kthread->node);
}
- cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
+ cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_DOMAIN));
if (cpumask_empty(cpumask))
- cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+ cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
}
static void kthread_affine_node(void)
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 25/33] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (23 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 24/33] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
` (7 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Tasks that have all their allowed CPUs offline shouldn't have their
affinity fall back on either nohz_full CPUs or domain isolated CPUs.
And since nohz_full implies domain isolation, checking the latter is
enough to cover both.
Therefore exclude domain isolated CPUs from the fallback task affinity.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/mmu_context.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index ac01dc4eb2ce..ed3dd0f3fe19 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -24,7 +24,7 @@ static inline void leave_mm(void) { }
#ifndef task_cpu_possible_mask
# define task_cpu_possible_mask(p) cpu_possible_mask
# define task_cpu_possible(cpu, p) true
-# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_TICK)
+# define task_cpu_fallback_mask(p) housekeeping_cpumask(HK_TYPE_DOMAIN)
#else
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (24 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 25/33] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 27/33] sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
` (6 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Gabriele Monaco, Michal Koutný, Andrew Morton, Bjorn Helgaas,
Catalin Marinas, Danilo Krummrich, David S . Miller, Eric Dumazet,
Frederic Weisbecker, Greg Kroah-Hartman, Ingo Molnar,
Jakub Kicinski, Jens Axboe, Johannes Weiner, Lai Jiangshan,
Marco Crivellari, Michal Hocko, Muchun Song, Paolo Abeni,
Peter Zijlstra, Phil Auld, Rafael J . Wysocki, Roman Gushchin,
Shakeel Butt, Simon Horman, Tejun Heo, Thomas Gleixner,
Vlastimil Babka, Waiman Long, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
From: Gabriele Monaco <gmonaco@redhat.com>
Currently the user can set up isolated CPUs via cpuset and nohz_full
in such a way that leaves no housekeeping CPU (i.e. no CPU that is
neither domain isolated nor nohz_full). This can be a problem for other
subsystems (e.g. the timer wheel migration).
Prevent this configuration by rejecting any assignment that would cause
the union of domain isolated CPUs and nohz_full CPUs to cover all CPUs.
For example, on an 8 CPU system booted with nohz_full=4-7, a request to
put CPUs 0-3 in an isolated partition must fail since it would leave no
housekeeping CPU.
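Restated against the implementation below: compute the CPUs that would
remain full housekeeping (neither domain isolated nor nohz_full, and
active), then refuse the update if the CPUs about to be isolated would
consume all of them. Roughly:

	full_hk = hk(KERNEL_NOISE) & hk(DOMAIN) & ~isolated_cpus & cpu_active_mask;
	ok = cpumask_weight_andnot(full_hk, add_cpus) != 0;

An update that de-isolates at least one non-nohz_full CPU (through
del_cpus) trivially leaves a housekeeping CPU and is accepted right
away.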
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/cgroup/cpuset.c | 63 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e19d3375a4ec..d1a799e361c3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1327,6 +1327,19 @@ static void isolated_cpus_update(int old_prs, int new_prs, struct cpumask *xcpus
cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
}
+/*
+ * isolated_cpus_should_update - Returns if the isolated_cpus mask needs update
+ * @prs: new or old partition_root_state
+ * @parent: parent cpuset
+ * Return: true if isolated_cpus needs modification, false otherwise
+ */
+static bool isolated_cpus_should_update(int prs, struct cpuset *parent)
+{
+ if (!parent)
+ parent = &top_cpuset;
+ return prs != parent->partition_root_state;
+}
+
/*
* partition_xcpus_add - Add new exclusive CPUs to partition
* @new_prs: new partition_root_state
@@ -1391,6 +1404,42 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
+/*
+ * isolated_cpus_can_update - check for isolated & nohz_full conflicts
+ * @add_cpus: cpu mask for cpus that are going to be isolated
+ * @del_cpus: cpu mask for cpus that are no longer isolated, can be NULL
+ * Return: false if there is conflict, true otherwise
+ *
+ * If nohz_full is enabled and we have isolated CPUs, their combination must
+ * still leave housekeeping CPUs.
+ */
+static bool isolated_cpus_can_update(struct cpumask *add_cpus,
+ struct cpumask *del_cpus)
+{
+ cpumask_var_t full_hk_cpus;
+ int res = true;
+
+ if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+ return true;
+
+ if (del_cpus && cpumask_weight_and(del_cpus,
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
+ return true;
+
+ if (!alloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
+ return false;
+
+ cpumask_and(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+ housekeeping_cpumask(HK_TYPE_DOMAIN));
+ cpumask_andnot(full_hk_cpus, full_hk_cpus, isolated_cpus);
+ cpumask_and(full_hk_cpus, full_hk_cpus, cpu_active_mask);
+ if (!cpumask_weight_andnot(full_hk_cpus, add_cpus))
+ res = false;
+
+ free_cpumask_var(full_hk_cpus);
+ return res;
+}
+
static void update_housekeeping_cpumask(bool isolcpus_updated)
{
int ret;
@@ -1538,6 +1587,9 @@ static int remote_partition_enable(struct cpuset *cs, int new_prs,
if (!cpumask_intersects(tmp->new_cpus, cpu_active_mask) ||
cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus))
return PERR_INVCPUS;
+ if (isolated_cpus_should_update(new_prs, NULL) &&
+ !isolated_cpus_can_update(tmp->new_cpus, NULL))
+ return PERR_HKEEPING;
spin_lock_irq(&callback_lock);
isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
@@ -1637,6 +1689,9 @@ static void remote_cpus_update(struct cpuset *cs, struct cpumask *xcpus,
else if (cpumask_intersects(tmp->addmask, subpartitions_cpus) ||
cpumask_subset(top_cpuset.effective_cpus, tmp->addmask))
cs->prs_err = PERR_NOCPUS;
+ else if (isolated_cpus_should_update(prs, NULL) &&
+ !isolated_cpus_can_update(tmp->addmask, tmp->delmask))
+ cs->prs_err = PERR_HKEEPING;
if (cs->prs_err)
goto invalidate;
}
@@ -1984,6 +2039,12 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
return err;
}
+ if (deleting && isolated_cpus_should_update(new_prs, parent) &&
+ !isolated_cpus_can_update(tmp->delmask, tmp->addmask)) {
+ cs->prs_err = PERR_HKEEPING;
+ return PERR_HKEEPING;
+ }
+
/*
* Change the parent's effective_cpus & effective_xcpus (top cpuset
* only).
@@ -2999,6 +3060,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* Need to update isolated_cpus.
*/
isolcpus_updated = true;
+ if (!isolated_cpus_can_update(cs->effective_xcpus, NULL))
+ err = PERR_HKEEPING;
} else {
/*
* Switching back to member is always allowed even if it
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 27/33] sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (25 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
` (5 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
When none of the allowed CPUs of a task are online, the task gets
migrated to the fallback cpumask, which is all the non nohz_full CPUs.
However, just like nohz_full CPUs, domain isolated CPUs don't want to
be disturbed by tasks that have lost their CPU affinities.
And since nohz_full relies on domain isolation to work correctly, the
set of domain isolated CPUs should always be a superset of the set of
nohz_full CPUs (there can be CPUs that are domain isolated but not
nohz_full, OTOH there shouldn't be nohz_full CPUs that are not domain
isolated). In terms of isolated CPU sets:
HK_TYPE_DOMAIN | HK_TYPE_KERNEL_NOISE == HK_TYPE_DOMAIN
Therefore use HK_TYPE_DOMAIN as the appropriate fallback target for
tasks. And since this cpumask can now be modified at runtime, make sure
that the 32-bit capable CPUs of mismatched ARM64 systems can not be
isolated by cpuset: tasks restricted to 32-bit execution may be
arbitrarily affine to those CPUs, which would defeat the isolation.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
arch/arm64/kernel/cpufeature.c | 18 +++++++++++++++---
include/linux/cpu.h | 4 ++++
kernel/cgroup/cpuset.c | 17 ++++++++++++++---
3 files changed, 33 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 5ed401ff79e3..4296b149ccf0 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1655,6 +1655,18 @@ has_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
return feature_matches(val, entry);
}
+/*
+ * 32 bits support CPUs can't be isolated because tasks may be
+ * arbitrarily affine to them, defeating the purpose of isolation.
+ */
+bool arch_isolated_cpus_can_update(struct cpumask *new_cpus)
+{
+ if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
+ return !cpumask_intersects(cpu_32bit_el0_mask, new_cpus);
+ else
+ return true;
+}
+
const struct cpumask *system_32bit_el0_cpumask(void)
{
if (!system_supports_32bit_el0())
@@ -1668,7 +1680,7 @@ const struct cpumask *system_32bit_el0_cpumask(void)
const struct cpumask *task_cpu_fallback_mask(struct task_struct *p)
{
- return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_TICK));
+ return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_DOMAIN));
}
static int __init parse_32bit_el0_param(char *str)
@@ -3922,8 +3934,8 @@ static int enable_mismatched_32bit_el0(unsigned int cpu)
bool cpu_32bit = false;
if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
- if (!housekeeping_cpu(cpu, HK_TYPE_TICK))
- pr_info("Treating adaptive-ticks CPU %u as 64-bit only\n", cpu);
+ if (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
+ pr_info("Treating domain isolated CPU %u as 64-bit only\n", cpu);
else
cpu_32bit = true;
}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 487b3bf2e1ea..0b48af25ab5c 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -229,4 +229,8 @@ static inline bool cpu_attack_vector_mitigated(enum cpu_attack_vectors v)
#define smt_mitigations SMT_MITIGATIONS_OFF
#endif
+struct cpumask;
+
+bool arch_isolated_cpus_can_update(struct cpumask *new_cpus);
+
#endif /* _LINUX_CPU_H_ */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d1a799e361c3..817c07a7a1b4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1404,14 +1404,22 @@ static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
return isolcpus_updated;
}
+bool __weak arch_isolated_cpus_can_update(struct cpumask *new_cpus)
+{
+ return true;
+}
+
/*
- * isolated_cpus_can_update - check for isolated & nohz_full conflicts
+ * isolated_cpus_can_update - check for conflicts against housekeeping and
+ * CPUs capabilities.
* @add_cpus: cpu mask for cpus that are going to be isolated
* @del_cpus: cpu mask for cpus that are no longer isolated, can be NULL
* Return: false if there is conflict, true otherwise
*
- * If nohz_full is enabled and we have isolated CPUs, their combination must
- * still leave housekeeping CPUs.
+ * Check for conflicts:
+ * - If nohz_full is enabled and there are isolated CPUs, their combination must
+ * still leave housekeeping CPUs.
+ * - Architecture has CPU capabilities incompatible with being isolated
*/
static bool isolated_cpus_can_update(struct cpumask *add_cpus,
struct cpumask *del_cpus)
@@ -1419,6 +1427,9 @@ static bool isolated_cpus_can_update(struct cpumask *add_cpus,
cpumask_var_t full_hk_cpus;
int res = true;
+ if (!arch_isolated_cpus_can_update(add_cpus))
+ return false;
+
if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
return true;
--
2.51.0
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (26 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 27/33] sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 29/33] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
` (4 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
When cpuset isolated partitions get updated, unbound kthreads get
indiscriminately affined to all non isolated CPUs, regardless of their
individual affinity preferences.
For example kswapd is a per-node kthread that prefers to be affine to
the node it serves. Whenever an isolated partition is created, updated
or deleted, kswapd's node affinity is going to be broken if any CPU in
the related node is not isolated, because kswapd will be affined
globally instead.
Fix this by letting the consolidated kthread affinity management code
do the affinity update on behalf of cpuset.
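A concrete example of the intended behaviour, assuming node 1 spans
CPUs 4-7: when CPUs 6-7 are moved into an isolated partition, kswapd1's
affinity is recomputed as cpumask_of_node(1) & HK_TYPE_DOMAIN, i.e.
CPUs 4-5, instead of being spread over all non isolated CPUs. Only if
the whole node were isolated would kswapd1 fall back to the full
HK_TYPE_DOMAIN housekeeping mask.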
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
include/linux/kthread.h | 1 +
kernel/cgroup/cpuset.c | 5 ++---
kernel/kthread.c | 38 +++++++++++++++++++++++++++++---------
kernel/sched/isolation.c | 2 ++
4 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 8d27403888ce..c92c1149ee6e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -100,6 +100,7 @@ void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
void kthread_exit(long result) __noreturn;
void kthread_complete_and_exit(struct completion *, long) __noreturn;
+int kthreads_update_housekeeping(void);
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 817c07a7a1b4..bc3f18ead7c8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1182,11 +1182,10 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
if (top_cs) {
/*
+ * PF_KTHREAD tasks are handled by housekeeping.
* PF_NO_SETAFFINITY tasks are ignored.
- * All per cpu kthreads should have PF_NO_SETAFFINITY
- * flag set, see kthread_set_per_cpu().
*/
- if (task->flags & PF_NO_SETAFFINITY)
+ if (task->flags & (PF_KTHREAD | PF_NO_SETAFFINITY))
continue;
cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
} else {
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8d0c8c4c7e46..4d3cc04e5e8b 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -896,14 +896,7 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
}
EXPORT_SYMBOL_GPL(kthread_affine_preferred);
-/*
- * Re-affine kthreads according to their preferences
- * and the newly online CPU. The CPU down part is handled
- * by select_fallback_rq() which default re-affines to
- * housekeepers from other nodes in case the preferred
- * affinity doesn't apply anymore.
- */
-static int kthreads_online_cpu(unsigned int cpu)
+static int kthreads_update_affinity(bool force)
{
cpumask_var_t affinity;
struct kthread *k;
@@ -926,7 +919,7 @@ static int kthreads_online_cpu(unsigned int cpu)
continue;
}
- if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
+ if (force || k->preferred_affinity || k->node != NUMA_NO_NODE) {
kthread_fetch_affinity(k, affinity);
set_cpus_allowed_ptr(k->task, affinity);
}
@@ -937,6 +930,33 @@ static int kthreads_online_cpu(unsigned int cpu)
return ret;
}
+/**
+ * kthreads_update_housekeeping - Update kthreads affinity on cpuset change
+ *
+ * When cpuset changes a partition type to/from "isolated" or updates related
+ * cpumasks, propagate the housekeeping cpumask change to preferred kthreads
+ * affinity.
+ *
+ * Returns 0 if successful, -ENOMEM if temporary mask couldn't
+ * be allocated or -EINVAL in case of internal error.
+ */
+int kthreads_update_housekeeping(void)
+{
+ return kthreads_update_affinity(true);
+}
+
+/*
+ * Re-affine kthreads according to their preferences
+ * and the newly online CPU. The CPU down part is handled
+ * by select_fallback_rq() which default re-affines to
+ * housekeepers from other nodes in case the preferred
+ * affinity doesn't apply anymore.
+ */
+static int kthreads_online_cpu(unsigned int cpu)
+{
+ return kthreads_update_affinity(false);
+}
+
static int kthreads_init(void)
{
return cpuhp_setup_state(CPUHP_AP_KTHREADS_ONLINE, "kthreads:online",
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 691f045ab758..93de1304e6d4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -150,6 +150,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(type));
+ WARN_ON_ONCE(err < 0);
+ err = kthreads_update_housekeeping();
kfree(old);
--
2.51.0
* [PATCH 29/33] kthread: Comment on the purpose and placement of kthread_affine_node() call
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (27 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime Frederic Weisbecker
` (3 subsequent siblings)
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
It may not be obvious why kthread_affine_node() is called after the
first wake-up instead of before the kthread creation completes.
The reason is that kthread_affine_node() applies a default affinity
behaviour that only takes place if no affinity preference has already
been passed by the kthread creation call site.
Add a comment to clarify that.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
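For illustration, the ordering documented here, as seen from a
hypothetical creation site (worker_fn and nid are made-up names):

        struct task_struct *t;

        t = kthread_create(worker_fn, NULL, "example");
        if (!IS_ERR(t)) {
                /* Record the preference before the first wake-up */
                kthread_affine_preferred(t, cpumask_of_node(nid));
                /* kthread() will then skip the kthread_affine_node() default */
                wake_up_process(t);
        }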
kernel/kthread.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4d3cc04e5e8b..d36bdfbd004e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -453,6 +453,10 @@ static int kthread(void *_create)
self->started = 1;
+ /*
+ * Apply default node affinity if no call to kthread_bind[_mask]() nor
+ * kthread_affine_preferred() was issued before the first wake-up.
+ */
if (!(current->flags & PF_NO_SETAFFINITY) && !self->preferred_affinity)
kthread_affine_node();
--
2.51.0
* [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (28 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 29/33] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-14 12:35 ` Simon Horman
2025-10-13 20:31 ` [PATCH 31/33] kthread: Document kthread_affine_preferred() Frederic Weisbecker
` (2 subsequent siblings)
32 siblings, 1 reply; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Kthreads can apply for a preferred affinity upon creation but they have
no means to update that preferred affinity after the first wake-up.
Indeed kthread_affine_preferred() is optimized under the assumption
that the kthread is still sleeping while the allowed cpumask is applied.
Therefore introduce a new API to further update the preferred affinity.
It will be used by IRQ kthreads.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
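A rough usage sketch (hypothetical caller, mask names made up):

        /* At creation time, before the first wake-up: */
        kthread_affine_preferred(t, initial_mask);
        wake_up_process(t);

        /* Later at runtime, once the wanted CPUs have changed: */
        if (kthread_affine_preferred_update(t, new_mask))
                pr_warn("preferred affinity update failed\n");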
include/linux/kthread.h | 1 +
kernel/kthread.c | 55 +++++++++++++++++++++++++++++++++++------
2 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index c92c1149ee6e..a06cae7f2c55 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask);
+int kthread_affine_preferred_update(struct task_struct *p, const struct cpumask *mask);
int kthread_stop(struct task_struct *k);
int kthread_stop_put(struct task_struct *k);
bool kthread_should_stop(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index d36bdfbd004e..f3397cf7542a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -322,17 +322,16 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
void __noreturn kthread_exit(long result)
{
struct kthread *kthread = to_kthread(current);
+ struct cpumask *to_free = NULL;
kthread->result = result;
- if (!list_empty(&kthread->affinity_node)) {
- mutex_lock(&kthread_affinity_lock);
- list_del(&kthread->affinity_node);
- mutex_unlock(&kthread_affinity_lock);
- if (kthread->preferred_affinity) {
- kfree(kthread->preferred_affinity);
- kthread->preferred_affinity = NULL;
- }
+ scoped_guard(mutex, &kthread_affinity_lock) {
+ if (!list_empty(&kthread->affinity_node))
+ list_del_init(&kthread->affinity_node);
+ to_free = kthread->preferred_affinity;
+ kthread->preferred_affinity = NULL;
}
+ kfree(to_free);
do_exit(0);
}
EXPORT_SYMBOL(kthread_exit);
@@ -900,6 +899,46 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
}
EXPORT_SYMBOL_GPL(kthread_affine_preferred);
+/**
+ * kthread_affine_preferred_update - update a kthread's preferred affinity
+ * @p: thread created by kthread_create().
+ * @cpumask: new mask of CPUs (might not be online, must be possible) for @k
+ * to run on.
+ *
+ * Update the cpumask of the desired kthread's affinity that was passed by
+ * a previous call to kthread_affine_preferred(). This can be called either
+ * before or after the first wakeup of the kthread.
+ *
+ * Returns 0 if the affinity has been applied.
+ */
+int kthread_affine_preferred_update(struct task_struct *p,
+ const struct cpumask *mask)
+{
+ struct kthread *kthread = to_kthread(p);
+ cpumask_var_t affinity;
+ int ret = 0;
+
+ if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+ return -ENOMEM;
+
+ scoped_guard(mutex, &kthread_affinity_lock) {
+ if (WARN_ON_ONCE(!kthread->preferred_affinity ||
+ list_empty(&kthread->affinity_node))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ cpumask_copy(kthread->preferred_affinity, mask);
+ kthread_fetch_affinity(kthread, affinity);
+ set_cpus_allowed_ptr(p, affinity);
+ }
+out:
+ free_cpumask_var(affinity);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kthread_affine_preferred_update);
+
static int kthreads_update_affinity(bool force)
{
cpumask_var_t affinity;
--
2.51.0
* [PATCH 31/33] kthread: Document kthread_affine_preferred()
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (29 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 32/33] genirq: Correctly handle preferred kthreads affinity Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 33/33] doc: Add housekeeping documentation Frederic Weisbecker
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
The documentation of this new API has been overlooked during its
introduction. Fill the gap.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/kthread.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f3397cf7542a..b989aeaa441a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -857,6 +857,18 @@ int kthreadd(void *unused)
return 0;
}
+/**
+ * kthread_affine_preferred - Define a kthread's preferred affinity
+ * @p: thread created by kthread_create().
+ * @cpumask: preferred mask of CPUs (might not be online, must be possible) for @k
+ * to run on.
+ *
+ * Similar to kthread_bind_mask() except that the affinity is not a requirement
+ * but rather a preference that can be constrained by CPU isolation or CPU hotplug.
+ * Must be called before the first wakeup of the kthread.
+ *
+ * Returns 0 if the affinity has been applied.
+ */
int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
{
struct kthread *kthread = to_kthread(p);
--
2.51.0
* [PATCH 32/33] genirq: Correctly handle preferred kthreads affinity
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (30 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 31/33] kthread: Document kthread_affine_preferred() Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 33/33] doc: Add housekeeping documentation Frederic Weisbecker
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
[CHECKME: Do some IRQ threads have strong affinity requirements? In
which case they should use kthread_bind()...]
The affinity of IRQ threads is applied through a direct call to the
scheduler. As a result this affinity may not be carried correctly
across hotplug events, cpuset isolated partition updates, or against
housekeeping constraints.
For example simply creating a cpuset isolated partition overwrites the
affinity of all IRQ threads to the non-isolated CPUs.
To prevent that, use the appropriate kthread affinity APIs, which take
care of the preferred affinity across these kinds of events.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
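For reference, the kthread core applies the preferred mask restricted
to the housekeeping CPUs, roughly as follows (a simplified sketch of
kthread_fetch_affinity()'s behaviour, assuming HK_TYPE_DOMAIN is the
relevant housekeeping type here):

        cpumask_and(effective, preferred, housekeeping_cpumask(HK_TYPE_DOMAIN));
        if (cpumask_empty(effective))
                cpumask_copy(effective, housekeeping_cpumask(HK_TYPE_DOMAIN));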
kernel/irq/manage.c | 47 +++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 19 deletions(-)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c94837382037..d96f6675c888 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -176,15 +176,15 @@ bool irq_can_set_affinity_usr(unsigned int irq)
}
/**
- * irq_set_thread_affinity - Notify irq threads to adjust affinity
+ * irq_thread_notify_affinity - Notify irq threads to adjust affinity
* @desc: irq descriptor which has affinity changed
*
* Just set IRQTF_AFFINITY and delegate the affinity setting to the
- * interrupt thread itself. We can not call set_cpus_allowed_ptr() here as
- * we hold desc->lock and this code can be called from hard interrupt
+ * interrupt thread itself. We can not call kthread_affine_preferred_update()
+ * here as we hold desc->lock and this code can be called from hard interrupt
* context.
*/
-static void irq_set_thread_affinity(struct irq_desc *desc)
+static void irq_thread_notify_affinity(struct irq_desc *desc)
{
struct irqaction *action;
@@ -283,7 +283,7 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
fallthrough;
case IRQ_SET_MASK_OK_NOCOPY:
irq_validate_effective_affinity(data);
- irq_set_thread_affinity(desc);
+ irq_thread_notify_affinity(desc);
ret = 0;
}
@@ -1032,11 +1032,26 @@ static void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *a
}
if (valid)
- set_cpus_allowed_ptr(current, mask);
+ kthread_affine_preferred_update(current, mask);
free_cpumask_var(mask);
}
+
+static inline void irq_thread_set_affinity(struct task_struct *t,
+ struct irq_desc *desc)
+{
+ const struct cpumask *mask;
+
+ if (cpumask_available(desc->irq_common_data.affinity))
+ mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
+ else
+ mask = cpu_possible_mask;
+
+ kthread_affine_preferred(t, mask);
+}
#else
static inline void irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
+static inline void irq_thread_set_affinity(struct task_struct *t,
+ struct irq_desc *desc) { }
#endif
static int irq_wait_for_interrupt(struct irq_desc *desc,
@@ -1384,7 +1399,8 @@ static void irq_nmi_teardown(struct irq_desc *desc)
}
static int
-setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
+setup_irq_thread(struct irqaction *new, struct irq_desc *desc,
+ unsigned int irq, bool secondary)
{
struct task_struct *t;
@@ -1405,16 +1421,9 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
* references an already freed task_struct.
*/
new->thread = get_task_struct(t);
- /*
- * Tell the thread to set its affinity. This is
- * important for shared interrupt handlers as we do
- * not invoke setup_affinity() for the secondary
- * handlers as everything is already set up. Even for
- * interrupts marked with IRQF_NO_BALANCE this is
- * correct as we want the thread to move to the cpu(s)
- * on which the requesting code placed the interrupt.
- */
- set_bit(IRQTF_AFFINITY, &new->thread_flags);
+
+ irq_thread_set_affinity(t, desc);
+
return 0;
}
@@ -1486,11 +1495,11 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
* thread.
*/
if (new->thread_fn && !nested) {
- ret = setup_irq_thread(new, irq, false);
+ ret = setup_irq_thread(new, desc, irq, false);
if (ret)
goto out_mput;
if (new->secondary) {
- ret = setup_irq_thread(new->secondary, irq, true);
+ ret = setup_irq_thread(new->secondary, desc, irq, true);
if (ret)
goto out_thread;
}
--
2.51.0
* [PATCH 33/33] doc: Add housekeeping documentation
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
` (31 preceding siblings ...)
2025-10-13 20:31 ` [PATCH 32/33] genirq: Correctly handle preferred kthreads affinity Frederic Weisbecker
@ 2025-10-13 20:31 ` Frederic Weisbecker
32 siblings, 0 replies; 50+ messages in thread
From: Frederic Weisbecker @ 2025-10-13 20:31 UTC (permalink / raw)
To: LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
Documentation/cpu_isolation/housekeeping.rst | 111 +++++++++++++++++++
1 file changed, 111 insertions(+)
create mode 100644 Documentation/cpu_isolation/housekeeping.rst
diff --git a/Documentation/cpu_isolation/housekeeping.rst b/Documentation/cpu_isolation/housekeeping.rst
new file mode 100644
index 000000000000..e5417302774c
--- /dev/null
+++ b/Documentation/cpu_isolation/housekeeping.rst
@@ -0,0 +1,111 @@
+======================================
+Housekeeping
+======================================
+
+
+CPU Isolation moves away kernel work that may otherwise run on any CPU.
+The purpose of its related features is to reduce the OS jitter that some
+extreme workloads can't stand, such as some DPDK use cases.
+
+The kernel work moved away by CPU isolation is commonly described as
+"housekeeping" because it includes ground work that performs cleanups,
+statistics maintenance and actions relying on them, memory release,
+various deferrals etc...
+
+Sometimes housekeeping is just some unbound work (unbound workqueues,
+unbound timers, ...) that gets easily assigned to non-isolated CPUs.
+But sometimes housekeeping is tied to a specific CPU and requires
+elaborate tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote
+scheduler tick, etc...).
+
+Thus, a housekeeping CPU can be considered as the reverse of an isolated
+CPU. It is simply a CPU that can execute housekeeping work. There must
+always be at least one online housekeeping CPU at any time. The CPUs that
+are not isolated are automatically assigned as housekeeping.
+
+Housekeeping is currently divided into four features, described
+by ``enum hk_type``:
+
+1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain
+ isolation performed through ``isolcpus=domain`` boot parameter or
+ isolated cpuset partitions in cgroup v2. This includes scheduler
+ load balancing, unbound workqueues and timers.
+
+2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation
+ performed through ``nohz_full=`` or ``isolcpus=nohz`` boot
+ parameters. This includes remote scheduler tick, vmstat and lockup
+ watchdog.
+
+3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed
+ IRQ isolation performed through ``isolcpus=managed_irq``.
+
+4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain
+ isolation performed through ``isolcpus=domain`` only. It is similar
+ to HK_TYPE_DOMAIN except it ignores the isolation performed by
+ cpusets.
+
+
+Housekeeping cpumasks
+=================================
+
+Housekeeping cpumasks include the CPUs that can execute the work moved
+away by the matching isolation feature. These cpumasks are returned by
+the following function::
+
+ const struct cpumask *housekeeping_cpumask(enum hk_type type)
+
+By default, if neither ``nohz_full=``, nor ``isolcpus``, nor cpuset's
+isolated partitions are used, which covers most use cases, this function
+returns the cpu_possible_mask.
+
+Otherwise the function returns the cpumask complement of the isolation
+feature. For example:
+
+With ``isolcpus=domain,7`` the following will return a mask with all
+possible CPUs except 7::
+
+ housekeeping_cpumask(HK_TYPE_DOMAIN)
+
+Similarly with ``nohz_full=5,6`` the following will return a mask with
+all possible CPUs except 5,6::
+
+ housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
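+
+Both kinds of isolation can combine. For example, booting a hypothetical
+four-CPU machine with ``isolcpus=domain,2,3 nohz_full=3`` would yield::
+
+  housekeeping_cpumask(HK_TYPE_DOMAIN)        /* CPUs 0-1 */
+  housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)   /* CPUs 0-1 */
+  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)  /* CPUs 0-2 */
+  housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)   /* CPUs 0-3, the default */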
+
+
+Synchronization against cpusets
+=================================
+
+Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating,
+modifying or deleting an isolated partition.
+
+The users of the HK_TYPE_DOMAIN cpumask must then synchronize properly
+against cpuset in order to guarantee that:
+
+1. The cpumask snapshot stays coherent.
+
+2. No housekeeping work is queued on a newly isolated CPU.
+
+3. Pending housekeeping work that was queued to a non-isolated
+   CPU which has just turned isolated through cpuset must be flushed
+   before the related created/modified isolated partition is made
+   available to userspace.
+
+This synchronization is maintained by an RCU-based scheme. The cpuset
+update side waits for an RCU grace period after updating the
+HK_TYPE_DOMAIN cpumask and before flushing pending works. On the read
+side, care must be taken to gather the housekeeping target election and
+the work enqueue within the same RCU read-side critical section.
+
+A typical layout example would look like this on the update side
+(``housekeeping_update()``)::
+
+ rcu_assign_pointer(housekeeping_cpumasks[type], trial);
+ synchronize_rcu();
+ flush_workqueue(example_workqueue);
+
+And then on the read side::
+
+ rcu_read_lock();
+ cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);
+ queue_work_on(cpu, example_workqueue, work);
+ rcu_read_unlock();
--
2.51.0
* Re: [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime
2025-10-13 20:31 ` [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime Frederic Weisbecker
@ 2025-10-14 12:35 ` Simon Horman
0 siblings, 0 replies; 50+ messages in thread
From: Simon Horman @ 2025-10-14 12:35 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Michal Koutný, Andrew Morton, Bjorn Helgaas,
Catalin Marinas, Danilo Krummrich, David S . Miller, Eric Dumazet,
Gabriele Monaco, Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski,
Jens Axboe, Johannes Weiner, Lai Jiangshan, Marco Crivellari,
Michal Hocko, Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Tejun Heo,
Thomas Gleixner, Vlastimil Babka, Waiman Long, Will Deacon,
cgroups, linux-arm-kernel, linux-block, linux-mm, linux-pci,
netdev
On Mon, Oct 13, 2025 at 10:31:43PM +0200, Frederic Weisbecker wrote:
...
> @@ -900,6 +899,46 @@ int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
> }
> EXPORT_SYMBOL_GPL(kthread_affine_preferred);
>
> +/**
> + * kthread_affine_preferred_update - update a kthread's preferred affinity
> + * @p: thread created by kthread_create().
> + * @cpumask: new mask of CPUs (might not be online, must be possible) for @k
> + * to run on.
nit: @mask: ...
Likewise for the documentation of kthread_affine_preferred()
in a subsequent patch in this series.
> + *
> + * Update the cpumask of the desired kthread's affinity that was passed by
> + * a previous call to kthread_affine_preferred(). This can be called either
> + * before or after the first wakeup of the kthread.
> + *
> + * Returns 0 if the affinity has been applied.
> + */
> +int kthread_affine_preferred_update(struct task_struct *p,
> + const struct cpumask *mask)
...
* Re: [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held
2025-10-13 20:31 ` [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
@ 2025-10-14 13:29 ` Chen Ridong
0 siblings, 0 replies; 50+ messages in thread
From: Chen Ridong @ 2025-10-14 13:29 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On 2025/10/14 4:31, Frederic Weisbecker wrote:
> cpuset modifies partitions, including isolated, while holding the cpuset
> mutex.
>
> This means that holding the cpuset mutex is safe to synchronize against
> housekeeping cpumask changes.
>
> Provide a lockdep check to validate that.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/cpuset.h | 2 ++
> kernel/cgroup/cpuset.c | 7 +++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 2ddb256187b5..051d36fec578 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -18,6 +18,8 @@
> #include <linux/mmu_context.h>
> #include <linux/jump_label.h>
>
> +extern bool lockdep_is_cpuset_held(void);
> +
> #ifdef CONFIG_CPUSETS
>
> /*
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 8595f1eadf23..aa1ac7bcf2ea 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -279,6 +279,13 @@ void cpuset_full_unlock(void)
> cpus_read_unlock();
> }
>
> +#ifdef CONFIG_LOCKDEP
> +bool lockdep_is_cpuset_held(void)
> +{
> + return lockdep_is_held(&cpuset_mutex);
> +}
> +#endif
> +
> static DEFINE_SPINLOCK(callback_lock);
>
> void cpuset_callback_lock_irq(void)
Is the lockdep_is_cpuset_held function actually being used?
If CONFIG_LOCKDEP is disabled, compilation would fail with an "undefined reference to
lockdep_is_cpuset_held" error.
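If it is meant to be available in !CONFIG_LOCKDEP builds too, the usual
pattern is a static inline stub, e.g. like rtnetlink's
lockdep_rtnl_is_held() (just a sketch):

        #ifdef CONFIG_LOCKDEP
        extern bool lockdep_is_cpuset_held(void);
        #else
        static inline bool lockdep_is_cpuset_held(void) { return true; }
        #endif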
--
Best regards,
Ridong
* Re: [PATCH 16/33] PCI: Flush PCI probe workqueue on cpuset isolated partition change
2025-10-13 20:31 ` [PATCH 16/33] PCI: Flush PCI probe workqueue " Frederic Weisbecker
@ 2025-10-14 20:50 ` Bjorn Helgaas
0 siblings, 0 replies; 50+ messages in thread
From: Bjorn Helgaas @ 2025-10-14 20:50 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Michal Koutný, Andrew Morton, Bjorn Helgaas,
Catalin Marinas, Danilo Krummrich, David S . Miller, Eric Dumazet,
Gabriele Monaco, Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski,
Jens Axboe, Johannes Weiner, Lai Jiangshan, Marco Crivellari,
Michal Hocko, Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On Mon, Oct 13, 2025 at 10:31:29PM +0200, Frederic Weisbecker wrote:
> The HK_TYPE_DOMAIN housekeeping cpumask is now modifyable at runtime. In
> order to synchronize against PCI probe works and make sure that no
> asynchronous probing is still pending or executing on a newly made
> isolated CPU, the housekeeping susbsystem must flush the PCI probe
> works.
>
> However the PCI probe works can't be flushed easily since they are
> queued to the main per-CPU workqueue pool.
>
> Solve this with creating a PCI probe specific pool and provide and use
> the appropriate flushing API.
s/modifyable/modifiable/
s/newly made isolated/newly isolated/
s/susbsystem/subsystem/
s/PCI probe specific pool/PCI probe-specific pool/
* Re: [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change
2025-10-13 20:31 ` [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
@ 2025-10-14 20:53 ` Bjorn Helgaas
0 siblings, 0 replies; 50+ messages in thread
From: Bjorn Helgaas @ 2025-10-14 20:53 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: LKML, Michal Koutný, Andrew Morton, Bjorn Helgaas,
Catalin Marinas, Danilo Krummrich, David S . Miller, Eric Dumazet,
Gabriele Monaco, Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski,
Jens Axboe, Johannes Weiner, Lai Jiangshan, Marco Crivellari,
Michal Hocko, Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On Mon, Oct 13, 2025 at 10:31:14PM +0200, Frederic Weisbecker wrote:
> HK_TYPE_DOMAIN will soon integrate cpuset isolated partitions and
> therefore be made modifyable at runtime. Synchronize against the cpumask
> update using RCU.
>
> The RCU locked section includes both the housekeeping CPU target
> election for the PCI probe work and the work enqueue.
>
> This way the housekeeping update side will simply need to flush the
> pending related works after updating the housekeeping mask in order to
> make sure that no PCI work ever executes on an isolated CPU. This part
> will be handled in a subsequent patch.
s/modifyable/modifiable/ (also in several other commit logs)
* Re: [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-10-13 20:31 ` [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
@ 2025-10-21 1:46 ` Chen Ridong
2025-10-21 1:57 ` Chen Ridong
2025-10-21 4:03 ` Waiman Long
2025-10-21 3:49 ` Waiman Long
1 sibling, 2 replies; 50+ messages in thread
From: Chen Ridong @ 2025-10-21 1:46 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On 2025/10/14 4:31, Frederic Weisbecker wrote:
> HK_TYPE_DOMAIN's cpumask will soon be made modifyable by cpuset.
> A synchronization mechanism is then needed to synchronize the updates
> with the housekeeping cpumask readers.
>
> Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
> cpumask will be modified, the update side will wait for an RCU grace
> period and propagate the change to interested subsystem when deemed
> necessary.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/sched/isolation.c | 58 +++++++++++++++++++++++++---------------
> kernel/sched/sched.h | 1 +
> 2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 8690fb705089..b46c20b5437f 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -21,7 +21,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>
> struct housekeeping {
> - cpumask_var_t cpumasks[HK_TYPE_MAX];
> + struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
> unsigned long flags;
> };
>
> @@ -33,17 +33,28 @@ bool housekeeping_enabled(enum hk_type type)
> }
> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>
> +const struct cpumask *housekeeping_cpumask(enum hk_type type)
> +{
> + if (static_branch_unlikely(&housekeeping_overridden)) {
> + if (housekeeping.flags & BIT(type)) {
> + return rcu_dereference_check(housekeeping.cpumasks[type], 1);
> + }
> + }
> + return cpu_possible_mask;
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_cpumask);
> +
> int housekeeping_any_cpu(enum hk_type type)
> {
> int cpu;
>
> if (static_branch_unlikely(&housekeeping_overridden)) {
> if (housekeeping.flags & BIT(type)) {
> - cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
> + cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
> if (likely(cpu < nr_cpu_ids))
> return cpu;
> /*
> @@ -59,28 +70,18 @@ int housekeeping_any_cpu(enum hk_type type)
> }
> EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
>
> -const struct cpumask *housekeeping_cpumask(enum hk_type type)
> -{
> - if (static_branch_unlikely(&housekeeping_overridden))
> - if (housekeeping.flags & BIT(type))
> - return housekeeping.cpumasks[type];
> - return cpu_possible_mask;
> -}
> -EXPORT_SYMBOL_GPL(housekeeping_cpumask);
> -
> void housekeeping_affine(struct task_struct *t, enum hk_type type)
> {
> if (static_branch_unlikely(&housekeeping_overridden))
> if (housekeeping.flags & BIT(type))
> - set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
> + set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
> }
> EXPORT_SYMBOL_GPL(housekeeping_affine);
>
> bool housekeeping_test_cpu(int cpu, enum hk_type type)
> {
> - if (static_branch_unlikely(&housekeeping_overridden))
> - if (housekeeping.flags & BIT(type))
> - return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
> + if (housekeeping.flags & BIT(type))
> + return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
> return true;
> }
> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
> @@ -96,20 +97,33 @@ void __init housekeeping_init(void)
>
> if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
> sched_tick_offload_init();
> -
> + /*
> + * Realloc with a proper allocator so that any cpumask update
> + * can indifferently free the old version with kfree().
> + */
> for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
> + struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
> +
> + if (WARN_ON_ONCE(!nmask))
> + return;
> +
> + omask = rcu_dereference(housekeeping.cpumasks[type]);
> +
> /* We need at least one CPU to handle housekeeping work */
> - WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
> + WARN_ON_ONCE(cpumask_empty(omask));
> + cpumask_copy(nmask, omask);
> + RCU_INIT_POINTER(housekeeping.cpumasks[type], nmask);
> + memblock_free(omask, cpumask_size());
> }
> }
>
> static void __init housekeeping_setup_type(enum hk_type type,
> cpumask_var_t housekeeping_staging)
> {
> + struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
>
> - alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
> - cpumask_copy(housekeeping.cpumasks[type],
> - housekeeping_staging);
> + cpumask_copy(mask, housekeeping_staging);
> + RCU_INIT_POINTER(housekeeping.cpumasks[type], mask);
> }
>
> static int __init housekeeping_setup(char *str, unsigned long flags)
> @@ -162,7 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
>
> for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
> if (!cpumask_equal(housekeeping_staging,
> - housekeeping.cpumasks[type])) {
> + housekeeping_cpumask(type))) {
> pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
> goto free_housekeeping_staging;
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1f5d07067f60..0c0ef8999fd6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -42,6 +42,7 @@
> #include <linux/ktime_api.h>
> #include <linux/lockdep_api.h>
> #include <linux/lockdep.h>
> +#include <linux/memblock.h>
> #include <linux/minmax.h>
> #include <linux/mm.h>
> #include <linux/module.h>
A warning was detected:
=============================
WARNING: suspicious RCU usage
6.17.0-next-20251009-00033-g4444da88969b #808 Not tainted
-----------------------------
kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by swapper/0/1:
#0: ffff888100600ce0 (&type->i_mutex_dir_key#3){++++}-{4:4}, at: walk_compone
stack backtrace:
CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-next-20251009-00033-g4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
lockdep_rcu_suspicious+0x148/0x1b0
housekeeping_cpumask+0xaa/0xb0
housekeeping_test_cpu+0x25/0x40
find_get_block_common+0x41/0x3e0
bdev_getblk+0x28/0xa0
ext4_getblk+0xba/0x2d0
ext4_bread_batch+0x56/0x170
__ext4_find_entry+0x17c/0x410
? lock_release+0xc6/0x290
ext4_lookup+0x7a/0x1d0
__lookup_slow+0xf9/0x1b0
walk_component+0xe0/0x150
link_path_walk+0x201/0x3e0
path_openat+0xb1/0xb30
? stack_depot_save_flags+0x41e/0xa00
do_filp_open+0xbc/0x170
? _raw_spin_unlock_irqrestore+0x2c/0x50
? __create_object+0x59/0x80
? trace_kmem_cache_alloc+0x1d/0xa0
? vprintk_emit+0x2b2/0x360
do_open_execat+0x56/0x100
alloc_bprm+0x1a/0x200
? __pfx_kernel_init+0x10/0x10
kernel_execve+0x4b/0x160
kernel_init+0xe5/0x1c0
ret_from_fork+0x185/0x1d0
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
random: crng init done
--
Best regards,
Ridong
* Re: [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-10-21 1:46 ` Chen Ridong
@ 2025-10-21 1:57 ` Chen Ridong
2025-10-21 4:03 ` Waiman Long
1 sibling, 0 replies; 50+ messages in thread
From: Chen Ridong @ 2025-10-21 1:57 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On 2025/10/21 9:46, Chen Ridong wrote:
>
>
> On 2025/10/14 4:31, Frederic Weisbecker wrote:
>> HK_TYPE_DOMAIN's cpumask will soon be made modifyable by cpuset.
>> A synchronization mechanism is then needed to synchronize the updates
>> with the housekeeping cpumask readers.
>>
>> Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
>> cpumask will be modified, the update side will wait for an RCU grace
>> period and propagate the change to interested subsystem when deemed
>> necessary.
>>
>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>> ---
>> kernel/sched/isolation.c | 58 +++++++++++++++++++++++++---------------
>> kernel/sched/sched.h | 1 +
>> 2 files changed, 37 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 8690fb705089..b46c20b5437f 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -21,7 +21,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
>> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>>
>> struct housekeeping {
>> - cpumask_var_t cpumasks[HK_TYPE_MAX];
>> + struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
>> unsigned long flags;
>> };
>>
>> @@ -33,17 +33,28 @@ bool housekeeping_enabled(enum hk_type type)
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>>
>> +const struct cpumask *housekeeping_cpumask(enum hk_type type)
>> +{
>> + if (static_branch_unlikely(&housekeeping_overridden)) {
>> + if (housekeeping.flags & BIT(type)) {
>> + return rcu_dereference_check(housekeeping.cpumasks[type], 1);
>> + }
>> + }
>> + return cpu_possible_mask;
>> +}
>> +EXPORT_SYMBOL_GPL(housekeeping_cpumask);
>> +
>> int housekeeping_any_cpu(enum hk_type type)
>> {
>> int cpu;
>>
>> if (static_branch_unlikely(&housekeeping_overridden)) {
>> if (housekeeping.flags & BIT(type)) {
>> - cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
>> + cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
>> if (cpu < nr_cpu_ids)
>> return cpu;
>>
>> - cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
>> + cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
>> if (likely(cpu < nr_cpu_ids))
>> return cpu;
>> /*
>> @@ -59,28 +70,18 @@ int housekeeping_any_cpu(enum hk_type type)
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
>>
>> -const struct cpumask *housekeeping_cpumask(enum hk_type type)
>> -{
>> - if (static_branch_unlikely(&housekeeping_overridden))
>> - if (housekeeping.flags & BIT(type))
>> - return housekeeping.cpumasks[type];
>> - return cpu_possible_mask;
>> -}
>> -EXPORT_SYMBOL_GPL(housekeeping_cpumask);
>> -
>> void housekeeping_affine(struct task_struct *t, enum hk_type type)
>> {
>> if (static_branch_unlikely(&housekeeping_overridden))
>> if (housekeeping.flags & BIT(type))
>> - set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
>> + set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_affine);
>>
>> bool housekeeping_test_cpu(int cpu, enum hk_type type)
>> {
>> - if (static_branch_unlikely(&housekeeping_overridden))
>> - if (housekeeping.flags & BIT(type))
>> - return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
>> + if (housekeeping.flags & BIT(type))
>> + return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
>> return true;
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>> @@ -96,20 +97,33 @@ void __init housekeeping_init(void)
>>
>> if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
>> sched_tick_offload_init();
>> -
>> + /*
>> + * Realloc with a proper allocator so that any cpumask update
>> + * can indifferently free the old version with kfree().
>> + */
>> for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
>> + struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
>> +
>> + if (WARN_ON_ONCE(!nmask))
>> + return;
>> +
>> + omask = rcu_dereference(housekeeping.cpumasks[type]);
>> +
>> /* We need at least one CPU to handle housekeeping work */
>> - WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
>> + WARN_ON_ONCE(cpumask_empty(omask));
>> + cpumask_copy(nmask, omask);
>> + RCU_INIT_POINTER(housekeeping.cpumasks[type], nmask);
>> + memblock_free(omask, cpumask_size());
>> }
>> }
>>
>> static void __init housekeeping_setup_type(enum hk_type type,
>> cpumask_var_t housekeeping_staging)
>> {
>> + struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
>>
>> - alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
>> - cpumask_copy(housekeeping.cpumasks[type],
>> - housekeeping_staging);
>> + cpumask_copy(mask, housekeeping_staging);
>> + RCU_INIT_POINTER(housekeeping.cpumasks[type], mask);
>> }
>>
>> static int __init housekeeping_setup(char *str, unsigned long flags)
>> @@ -162,7 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
>>
>> for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
>> if (!cpumask_equal(housekeeping_staging,
>> - housekeeping.cpumasks[type])) {
>> + housekeeping_cpumask(type))) {
>> pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
>> goto free_housekeeping_staging;
>> }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 1f5d07067f60..0c0ef8999fd6 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -42,6 +42,7 @@
>> #include <linux/ktime_api.h>
>> #include <linux/lockdep_api.h>
>> #include <linux/lockdep.h>
>> +#include <linux/memblock.h>
>> #include <linux/minmax.h>
>> #include <linux/mm.h>
>> #include <linux/module.h>
>
> A warning was detected:
>
> =============================
> WARNING: suspicious RCU usage
> 6.17.0-next-20251009-00033-g4444da88969b #808 Not tainted
> -----------------------------
> kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by swapper/0/1:
> #0: ffff888100600ce0 (&type->i_mutex_dir_key#3){++++}-{4:4}, at: walk_compone
>
> stack backtrace:
> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-next-20251009-00033-g4
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239
> Call Trace:
> <TASK>
> dump_stack_lvl+0x68/0xa0
> lockdep_rcu_suspicious+0x148/0x1b0
> housekeeping_cpumask+0xaa/0xb0
> housekeeping_test_cpu+0x25/0x40
> find_get_block_common+0x41/0x3e0
> bdev_getblk+0x28/0xa0
> ext4_getblk+0xba/0x2d0
> ext4_bread_batch+0x56/0x170
> __ext4_find_entry+0x17c/0x410
> ? lock_release+0xc6/0x290
> ext4_lookup+0x7a/0x1d0
> __lookup_slow+0xf9/0x1b0
> walk_component+0xe0/0x150
> link_path_walk+0x201/0x3e0
> path_openat+0xb1/0xb30
> ? stack_depot_save_flags+0x41e/0xa00
> do_filp_open+0xbc/0x170
> ? _raw_spin_unlock_irqrestore+0x2c/0x50
> ? __create_object+0x59/0x80
> ? trace_kmem_cache_alloc+0x1d/0xa0
> ? vprintk_emit+0x2b2/0x360
> do_open_execat+0x56/0x100
> alloc_bprm+0x1a/0x200
> ? __pfx_kernel_init+0x10/0x10
> kernel_execve+0x4b/0x160
> kernel_init+0xe5/0x1c0
> ret_from_fork+0x185/0x1d0
> ? __pfx_kernel_init+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> random: crng init done
>
This warning was likely introduced by patch 13, which added the housekeeping_dereference_check
condition, and is not caused by the current patch.
--
Best regards,
Ridong
* Re: [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-10-13 20:31 ` [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
2025-10-21 1:46 ` Chen Ridong
@ 2025-10-21 3:49 ` Waiman Long
1 sibling, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-21 3:49 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> HK_TYPE_DOMAIN's cpumask will soon be made modifyable by cpuset.
> A synchronization mechanism is then needed to synchronize the updates
> with the housekeeping cpumask readers.
>
> Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
> cpumask will be modified, the update side will wait for an RCU grace
> period and propagate the change to interested subsystem when deemed
> necessary.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/sched/isolation.c | 58 +++++++++++++++++++++++++---------------
> kernel/sched/sched.h | 1 +
> 2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 8690fb705089..b46c20b5437f 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -21,7 +21,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>
> struct housekeeping {
> - cpumask_var_t cpumasks[HK_TYPE_MAX];
> + struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
> unsigned long flags;
> };
>
> @@ -33,17 +33,28 @@ bool housekeeping_enabled(enum hk_type type)
> }
> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>
> +const struct cpumask *housekeeping_cpumask(enum hk_type type)
> +{
> + if (static_branch_unlikely(&housekeeping_overridden)) {
> + if (housekeeping.flags & BIT(type)) {
> + return rcu_dereference_check(housekeeping.cpumasks[type], 1);
> + }
> + }
> + return cpu_possible_mask;
> +}
> +EXPORT_SYMBOL_GPL(housekeeping_cpumask);
> +
> int housekeeping_any_cpu(enum hk_type type)
> {
> int cpu;
>
> if (static_branch_unlikely(&housekeeping_overridden)) {
> if (housekeeping.flags & BIT(type)) {
> - cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
> + cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
> if (cpu < nr_cpu_ids)
> return cpu;
>
> - cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> + cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
> if (likely(cpu < nr_cpu_ids))
> return cpu;
> /*
> @@ -59,28 +70,18 @@ int housekeeping_any_cpu(enum hk_type type)
> }
> EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
>
> -const struct cpumask *housekeeping_cpumask(enum hk_type type)
> -{
> - if (static_branch_unlikely(&housekeeping_overridden))
> - if (housekeeping.flags & BIT(type))
> - return housekeeping.cpumasks[type];
> - return cpu_possible_mask;
> -}
> -EXPORT_SYMBOL_GPL(housekeeping_cpumask);
> -
> void housekeeping_affine(struct task_struct *t, enum hk_type type)
> {
> if (static_branch_unlikely(&housekeeping_overridden))
> if (housekeeping.flags & BIT(type))
> - set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
> + set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
> }
> EXPORT_SYMBOL_GPL(housekeeping_affine);
>
> bool housekeeping_test_cpu(int cpu, enum hk_type type)
> {
> - if (static_branch_unlikely(&housekeeping_overridden))
> - if (housekeeping.flags & BIT(type))
> - return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
> + if (housekeeping.flags & BIT(type))
> + return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
> return true;
> }
The housekeeping_overridden static key check is kept in other places
except this one. Should we keep it for consistency?
Cheers,
Longman
* Re: [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers
2025-10-21 1:46 ` Chen Ridong
2025-10-21 1:57 ` Chen Ridong
@ 2025-10-21 4:03 ` Waiman Long
1 sibling, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-21 4:03 UTC (permalink / raw)
To: Chen Ridong, Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/20/25 9:46 PM, Chen Ridong wrote:
>
> On 2025/10/14 4:31, Frederic Weisbecker wrote:
>> HK_TYPE_DOMAIN's cpumask will soon be made modifyable by cpuset.
>> A synchronization mechanism is then needed to synchronize the updates
>> with the housekeeping cpumask readers.
>>
>> Turn the housekeeping cpumasks into RCU pointers. Once a housekeeping
>> cpumask will be modified, the update side will wait for an RCU grace
>> period and propagate the change to interested subsystem when deemed
>> necessary.
>>
>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>> ---
>> kernel/sched/isolation.c | 58 +++++++++++++++++++++++++---------------
>> kernel/sched/sched.h | 1 +
>> 2 files changed, 37 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 8690fb705089..b46c20b5437f 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -21,7 +21,7 @@ DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
>> EXPORT_SYMBOL_GPL(housekeeping_overridden);
>>
>> struct housekeeping {
>> - cpumask_var_t cpumasks[HK_TYPE_MAX];
>> + struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
>> unsigned long flags;
>> };
>>
>> @@ -33,17 +33,28 @@ bool housekeeping_enabled(enum hk_type type)
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>>
>> +const struct cpumask *housekeeping_cpumask(enum hk_type type)
>> +{
>> + if (static_branch_unlikely(&housekeeping_overridden)) {
>> + if (housekeeping.flags & BIT(type)) {
>> + return rcu_dereference_check(housekeeping.cpumasks[type], 1);
>> + }
>> + }
>> + return cpu_possible_mask;
>> +}
>> +EXPORT_SYMBOL_GPL(housekeeping_cpumask);
>> +
>> int housekeeping_any_cpu(enum hk_type type)
>> {
>> int cpu;
>>
>> if (static_branch_unlikely(&housekeeping_overridden)) {
>> if (housekeeping.flags & BIT(type)) {
>> - cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
>> + cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
>> if (cpu < nr_cpu_ids)
>> return cpu;
>>
>> - cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
>> + cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
>> if (likely(cpu < nr_cpu_ids))
>> return cpu;
>> /*
>> @@ -59,28 +70,18 @@ int housekeeping_any_cpu(enum hk_type type)
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
>>
>> -const struct cpumask *housekeeping_cpumask(enum hk_type type)
>> -{
>> - if (static_branch_unlikely(&housekeeping_overridden))
>> - if (housekeeping.flags & BIT(type))
>> - return housekeeping.cpumasks[type];
>> - return cpu_possible_mask;
>> -}
>> -EXPORT_SYMBOL_GPL(housekeeping_cpumask);
>> -
>> void housekeeping_affine(struct task_struct *t, enum hk_type type)
>> {
>> if (static_branch_unlikely(&housekeeping_overridden))
>> if (housekeeping.flags & BIT(type))
>> - set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
>> + set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_affine);
>>
>> bool housekeeping_test_cpu(int cpu, enum hk_type type)
>> {
>> - if (static_branch_unlikely(&housekeeping_overridden))
>> - if (housekeeping.flags & BIT(type))
>> - return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
>> + if (housekeeping.flags & BIT(type))
>> + return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
>> return true;
>> }
>> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>> @@ -96,20 +97,33 @@ void __init housekeeping_init(void)
>>
>> if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
>> sched_tick_offload_init();
>> -
>> + /*
>> + * Realloc with a proper allocator so that any cpumask update
>> + * can indifferently free the old version with kfree().
>> + */
>> for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
>> + struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
>> +
>> + if (WARN_ON_ONCE(!nmask))
>> + return;
>> +
>> + omask = rcu_dereference(housekeeping.cpumasks[type]);
>> +
>> /* We need at least one CPU to handle housekeeping work */
>> - WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
>> + WARN_ON_ONCE(cpumask_empty(omask));
>> + cpumask_copy(nmask, omask);
>> + RCU_INIT_POINTER(housekeeping.cpumasks[type], nmask);
>> + memblock_free(omask, cpumask_size());
>> }
>> }
>>
>> static void __init housekeeping_setup_type(enum hk_type type,
>> cpumask_var_t housekeeping_staging)
>> {
>> + struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
>>
>> - alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
>> - cpumask_copy(housekeeping.cpumasks[type],
>> - housekeeping_staging);
>> + cpumask_copy(mask, housekeeping_staging);
>> + RCU_INIT_POINTER(housekeeping.cpumasks[type], mask);
>> }
>>
>> static int __init housekeeping_setup(char *str, unsigned long flags)
>> @@ -162,7 +176,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
>>
>> for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
>> if (!cpumask_equal(housekeeping_staging,
>> - housekeeping.cpumasks[type])) {
>> + housekeeping_cpumask(type))) {
>> pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
>> goto free_housekeeping_staging;
>> }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 1f5d07067f60..0c0ef8999fd6 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -42,6 +42,7 @@
>> #include <linux/ktime_api.h>
>> #include <linux/lockdep_api.h>
>> #include <linux/lockdep.h>
>> +#include <linux/memblock.h>
>> #include <linux/minmax.h>
>> #include <linux/mm.h>
>> #include <linux/module.h>
> A warning was detected:
>
> =============================
> WARNING: suspicious RCU usage
> 6.17.0-next-20251009-00033-g4444da88969b #808 Not tainted
> -----------------------------
> kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by swapper/0/1:
> #0: ffff888100600ce0 (&type->i_mutex_dir_key#3){++++}-{4:4}, at: walk_compone
>
> stack backtrace:
> CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-next-20251009-00033-g4
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239
> Call Trace:
> <TASK>
> dump_stack_lvl+0x68/0xa0
> lockdep_rcu_suspicious+0x148/0x1b0
> housekeeping_cpumask+0xaa/0xb0
> housekeeping_test_cpu+0x25/0x40
> find_get_block_common+0x41/0x3e0
> bdev_getblk+0x28/0xa0
> ext4_getblk+0xba/0x2d0
> ext4_bread_batch+0x56/0x170
> __ext4_find_entry+0x17c/0x410
> ? lock_release+0xc6/0x290
> ext4_lookup+0x7a/0x1d0
> __lookup_slow+0xf9/0x1b0
> walk_component+0xe0/0x150
> link_path_walk+0x201/0x3e0
> path_openat+0xb1/0xb30
> ? stack_depot_save_flags+0x41e/0xa00
> do_filp_open+0xbc/0x170
> ? _raw_spin_unlock_irqrestore+0x2c/0x50
> ? __create_object+0x59/0x80
> ? trace_kmem_cache_alloc+0x1d/0xa0
> ? vprintk_emit+0x2b2/0x360
> do_open_execat+0x56/0x100
> alloc_bprm+0x1a/0x200
> ? __pfx_kernel_init+0x10/0x10
> kernel_execve+0x4b/0x160
> kernel_init+0xe5/0x1c0
> ret_from_fork+0x185/0x1d0
> ? __pfx_kernel_init+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> random: crng init done
>
This is because bh_lru_install() in fs/buffer.c calls cpu_is_isolated()
without holding the RCU read lock. We will need to add an
rcu_read_lock() there.
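A minimal sketch of such a fix, assuming the current shape of
bh_lru_install() and eliding the unrelated logic (this is not a posted
patch):

	check_irqs_on();
	bh_lru_lock();

	/*
	 * Hold the RCU read lock across the housekeeping cpumask
	 * lookup so the rcu_dereference() inside housekeeping_cpumask()
	 * is legal.
	 */
	rcu_read_lock();
	if (cpu_is_isolated(smp_processor_id())) {
		rcu_read_unlock();
		bh_lru_unlock();
		return;
	}
	rcu_read_unlock();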
Cheers,
Longman
* Re: [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-10-13 20:31 ` [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
@ 2025-10-21 4:10 ` Waiman Long
2025-10-22 1:36 ` Chen Ridong
2025-10-21 13:39 ` Waiman Long
1 sibling, 1 reply; 50+ messages in thread
From: Waiman Long @ 2025-10-21 4:10 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> Until now, HK_TYPE_DOMAIN only included the boot-defined isolated
> CPUs passed through the isolcpus= boot option. Users interested in also
> knowing the runtime-defined isolated CPUs through cpuset must use
> different APIs: cpuset_cpu_is_isolated(), cpu_is_isolated(), etc...
>
> There are many drawbacks to that approach:
>
> 1) Most interested subsystems want to know about all isolated CPUs, not
> just those defined at boot time.
>
> 2) cpuset_cpu_is_isolated() / cpu_is_isolated() are not synchronized with
> concurrent cpuset changes.
>
> 3) Further cpuset modifications are not propagated to subsystems.
>
> Solve 1) and 2) by centralizing all isolated CPUs within the
> HK_TYPE_DOMAIN housekeeping cpumask.
>
> Subsystems can rely on RCU to synchronize against concurrent changes.
>
> The propagation mentioned in 3) will be handled in further patches.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/sched/isolation.h | 2 +
> kernel/cgroup/cpuset.c | 2 +
> kernel/sched/isolation.c | 75 ++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 1 +
> 4 files changed, 74 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index da22b038942a..94d5c835121b 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -32,6 +32,7 @@ extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
> extern bool housekeeping_enabled(enum hk_type type);
> extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
> extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
> +extern int housekeeping_update(struct cpumask *mask, enum hk_type type);
> extern void __init housekeeping_init(void);
>
> #else
> @@ -59,6 +60,7 @@ static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
> return true;
> }
>
> +static inline int housekeeping_update(struct cpumask *mask, enum hk_type type) { return 0; }
> static inline void housekeeping_init(void) { }
> #endif /* CONFIG_CPU_ISOLATION */
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index aa1ac7bcf2ea..b04a4242f2fa 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1403,6 +1403,8 @@ static void update_unbound_workqueue_cpumask(bool isolcpus_updated)
>
> ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
> WARN_ON_ONCE(ret < 0);
> + ret = housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
> + WARN_ON_ONCE(ret < 0);
> }
>
> /**
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index b46c20b5437f..95d69c2102f6 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -29,18 +29,48 @@ static struct housekeeping housekeeping;
>
> bool housekeeping_enabled(enum hk_type type)
> {
> - return !!(housekeeping.flags & BIT(type));
> + return !!(READ_ONCE(housekeeping.flags) & BIT(type));
> }
> EXPORT_SYMBOL_GPL(housekeeping_enabled);
>
> +static bool housekeeping_dereference_check(enum hk_type type)
> +{
> + if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
> + /* Cpuset isn't even writable yet? */
> + if (system_state <= SYSTEM_SCHEDULING)
> + return true;
> +
> + /* CPU hotplug write locked, so cpuset partition can't be overwritten */
> + if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
> + return true;
> +
> + /* Cpuset lock held, partitions not writable */
> + if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
> + return true;
I have some doubt about this condition, as cpuset_mutex may be held
precisely while changes are being made to an isolated partition that
will impact the HK_TYPE_DOMAIN cpumask.
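For illustration, a hypothetical updater path (flow assumed from this
series, not copied from it) that satisfies the lockdep check on the
very path rewriting the mask being dereferenced:

	static void example_isolated_partition_update(struct cpumask *isolated_cpus)
	{
		/* lockdep_is_cpuset_held() is true for the writer itself... */
		WARN_ON_ONCE(!lockdep_is_cpuset_held());
		/* ...while this path rewrites the HK_TYPE_DOMAIN mask */
		housekeeping_update(isolated_cpus, HK_TYPE_DOMAIN);
	}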
Cheers,
Longman
* Re: [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-10-13 20:31 ` [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
2025-10-21 4:10 ` Waiman Long
@ 2025-10-21 13:39 ` Waiman Long
1 sibling, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-21 13:39 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> @@ -80,12 +110,45 @@ EXPORT_SYMBOL_GPL(housekeeping_affine);
>
> bool housekeeping_test_cpu(int cpu, enum hk_type type)
> {
> - if (housekeeping.flags & BIT(type))
> + if (READ_ONCE(housekeeping.flags) & BIT(type))
> return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
> return true;
> }
> EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
>
> +int housekeeping_update(struct cpumask *mask, enum hk_type type)
> +{
> + struct cpumask *trial, *old = NULL;
> +
> + if (type != HK_TYPE_DOMAIN)
> + return -ENOTSUPP;
> +
> + trial = kmalloc(sizeof(*trial), GFP_KERNEL);
Should you use cpumask_size() instead of sizeof(*trial), as the latter
can be much bigger?
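For reference, a sketch of why the two can differ (exact sizes depend
on the config):

	/*
	 * sizeof(struct cpumask) is always sized for NR_CPUS bits,
	 * while cpumask_size() scales with nr_cpu_ids when
	 * CONFIG_CPUMASK_OFFSTACK is set, so the allocation below can
	 * be much smaller than kmalloc(sizeof(*trial), GFP_KERNEL).
	 */
	trial = kmalloc(cpumask_size(), GFP_KERNEL);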
> + if (!trial)
> + return -ENOMEM;
> +
> + cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), mask);
> + if (!cpumask_intersects(trial, cpu_online_mask)) {
> + kfree(trial);
> + return -EINVAL;
> + }
> +
> + if (!housekeeping.flags)
> + static_branch_enable(&housekeeping_overridden);
> +
> + if (!(housekeeping.flags & BIT(type)))
> + old = housekeeping_cpumask_dereference(type);
> + else
> + WRITE_ONCE(housekeeping.flags, housekeeping.flags | BIT(type));
> + rcu_assign_pointer(housekeeping.cpumasks[type], trial);
> +
> + synchronize_rcu();
> +
> + kfree(old);
If "isolcpus" boot command line option is set, old can be a pointer to
the boot time memblock area which isn't a pointer that can be handled by
the slab allocator AFAIU. I don't know the exact consequence, but it may
not be good. One possible solution I can think of is to make
HK_TYPE_DOMAIN and HK_TYPE_DOMAIN_ROOT point to the same memblock
pointer and don't pass the old HK_TYPE_DOMAIN pointer to kfree() if it
matches HK_TYPE_DOMAIN_BOOT one. Alternatively, we can just set the
HK_TYPE_DOMAIN_BOOT pointer at boot and make HK_TYPE_DOMAIN falls back
to HK_TYPE_DOMAIN_BOOT if not set.
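A rough sketch of that second alternative (the helper name is
hypothetical, and callers are assumed to be in an RCU read-side
section, as with housekeeping_cpumask()):

	static const struct cpumask *domain_cpumask_or_boot(void)
	{
		const struct cpumask *mask;

		/* NULL until cpuset installs a runtime override */
		mask = rcu_dereference(housekeeping.cpumasks[HK_TYPE_DOMAIN]);
		if (!mask)
			mask = rcu_dereference(housekeeping.cpumasks[HK_TYPE_DOMAIN_BOOT]);
		/* the boot mask is never handed to kfree() this way */
		return mask;
	}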
Cheers,
Longman
* Re: [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
2025-10-13 20:31 ` [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
@ 2025-10-21 19:16 ` Waiman Long
2025-10-21 19:28 ` Waiman Long
0 siblings, 1 reply; 50+ messages in thread
From: Waiman Long @ 2025-10-21 19:16 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
> order to synchronize against the memcg workqueue and make sure that no
> asynchronous draining is still pending or executing on a newly made
> isolated CPU, the housekeeping subsystem must flush the memcg
> workqueues.
>
> However the memcg workqueues can't be flushed easily, since their work
> items are queued to the main per-CPU workqueue pool.
>
> Solve this by creating a memcg-specific pool and by providing and using
> the appropriate flushing API.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/memcontrol.h | 4 ++++
> kernel/sched/isolation.c | 2 ++
> kernel/sched/sched.h | 1 +
> mm/memcontrol.c | 12 +++++++++++-
> 4 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 873e510d6f8d..001200df63cf 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1074,6 +1074,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
> return id;
> }
>
> +void mem_cgroup_flush_workqueue(void);
> +
> extern int mem_cgroup_init(void);
> #else /* CONFIG_MEMCG */
>
> @@ -1481,6 +1483,8 @@ static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
> return 0;
> }
>
> +static inline void mem_cgroup_flush_workqueue(void) { }
> +
> static inline int mem_cgroup_init(void) { return 0; }
> #endif /* CONFIG_MEMCG */
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 95d69c2102f6..9ec365dea921 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -144,6 +144,8 @@ int housekeeping_update(struct cpumask *mask, enum hk_type type)
>
> synchronize_rcu();
>
> + mem_cgroup_flush_workqueue();
> +
> kfree(old);
>
> return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8fac8aa451c6..8bfc0b4b133f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -44,6 +44,7 @@
> #include <linux/lockdep_api.h>
> #include <linux/lockdep.h>
> #include <linux/memblock.h>
> +#include <linux/memcontrol.h>
> #include <linux/minmax.h>
> #include <linux/mm.h>
> #include <linux/module.h>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1033e52ab6cf..1aa14e543f35 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -95,6 +95,8 @@ static bool cgroup_memory_nokmem __ro_after_init;
> /* BPF memory accounting disabled? */
> static bool cgroup_memory_nobpf __ro_after_init;
>
> +static struct workqueue_struct *memcg_wq __ro_after_init;
> +
> static struct kmem_cache *memcg_cachep;
> static struct kmem_cache *memcg_pn_cachep;
>
> @@ -1975,7 +1977,7 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
> {
> guard(rcu)();
> if (!cpu_is_isolated(cpu))
> - schedule_work_on(cpu, work);
> + queue_work_on(cpu, memcg_wq, work);
> }
>
> /*
> @@ -5092,6 +5094,11 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
> refill_stock(memcg, nr_pages);
> }
>
> +void mem_cgroup_flush_workqueue(void)
> +{
> + flush_workqueue(memcg_wq);
> +}
> +
> static int __init cgroup_memory(char *s)
> {
> char *token;
> @@ -5134,6 +5141,9 @@ int __init mem_cgroup_init(void)
> cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
> memcg_hotplug_cpu_dead);
>
> + memcg_wq = alloc_workqueue("memcg", 0, 0);
Should we explicitly mark memcg_wq as WQ_PERCPU, even though I think
per-CPU is the default? schedule_work_on() schedules work on
system_percpu_wq.
Cheers,
Longman
> + WARN_ON(!memcg_wq);
> +
> for_each_possible_cpu(cpu) {
> INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> drain_local_memcg_stock);
* Re: [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change
2025-10-21 19:16 ` Waiman Long
@ 2025-10-21 19:28 ` Waiman Long
0 siblings, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-21 19:28 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/21/25 3:16 PM, Waiman Long wrote:
> On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
>> [...]
>> @@ -5134,6 +5141,9 @@ int __init mem_cgroup_init(void)
>> cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD,
>> "mm/memctrl:dead", NULL,
>> memcg_hotplug_cpu_dead);
>> + memcg_wq = alloc_workqueue("memcg", 0, 0);
>
> Should we explicitly mark memcg_wq as WQ_PERCPU, even though I think
> per-CPU is the default? schedule_work_on() schedules work on
> system_percpu_wq.
According to commit dadb3ebcf39 ("workqueue: WQ_PERCPU added to
alloc_workqueue users"), the default may be changed to WQ_UNBOUND in the
future.
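A one-line sketch of the resulting change:

	/* Explicitly request a per-CPU workqueue rather than rely on the default */
	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);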
Cheers,
Longman
* Re: [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list
2025-10-13 20:31 ` [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
@ 2025-10-21 22:42 ` Waiman Long
0 siblings, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-21 22:42 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> The managed affinity list currently contains only unbound kthreads that
> have affinity preferences. Unbound kthreads globally affine by default
> are outside of the list because their affinity is automatically managed
> by the scheduler (through the fallback housekeeping mask) and by cpuset.
>
> However in order to preserve the preferred affinity of kthreads, cpuset
> will delegate the isolated partition update propagation to the
> housekeeping and kthread code.
>
> Prepare for that by including all unbound kthreads in the managed
> affinity list.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/kthread.c | 59 ++++++++++++++++++++++++------------------------
> 1 file changed, 30 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index c4dd967e9e9c..cba3d297f267 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -365,9 +365,10 @@ static void kthread_fetch_affinity(struct kthread *kthread, struct cpumask *cpum
> if (kthread->preferred_affinity) {
> pref = kthread->preferred_affinity;
> } else {
> - if (WARN_ON_ONCE(kthread->node == NUMA_NO_NODE))
> - return;
> - pref = cpumask_of_node(kthread->node);
> + if (kthread->node == NUMA_NO_NODE)
> + pref = housekeeping_cpumask(HK_TYPE_KTHREAD);
> + else
> + pref = cpumask_of_node(kthread->node);
> }
>
> cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
> @@ -380,32 +381,29 @@ static void kthread_affine_node(void)
> struct kthread *kthread = to_kthread(current);
> cpumask_var_t affinity;
>
> - WARN_ON_ONCE(kthread_is_per_cpu(current));
> + if (WARN_ON_ONCE(kthread_is_per_cpu(current)))
> + return;
>
> - if (kthread->node == NUMA_NO_NODE) {
> - housekeeping_affine(current, HK_TYPE_KTHREAD);
> - } else {
> - if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
> - WARN_ON_ONCE(1);
> - return;
> - }
> -
> - mutex_lock(&kthread_affinity_lock);
> - WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
> - list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
> - /*
> - * The node cpumask is racy when read from kthread() but:
> - * - a racing CPU going down will either fail on the subsequent
> - * call to set_cpus_allowed_ptr() or be migrated to housekeepers
> - * afterwards by the scheduler.
> - * - a racing CPU going up will be handled by kthreads_online_cpu()
> - */
> - kthread_fetch_affinity(kthread, affinity);
> - set_cpus_allowed_ptr(current, affinity);
> - mutex_unlock(&kthread_affinity_lock);
> -
> - free_cpumask_var(affinity);
> + if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
> + WARN_ON_ONCE(1);
> + return;
> }
> +
> + mutex_lock(&kthread_affinity_lock);
> + WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
> + list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
> + /*
> + * The node cpumask is racy when read from kthread() but:
> + * - a racing CPU going down will either fail on the subsequent
> + * call to set_cpus_allowed_ptr() or be migrated to housekeepers
> + * afterwards by the scheduler.
> + * - a racing CPU going up will be handled by kthreads_online_cpu()
> + */
> + kthread_fetch_affinity(kthread, affinity);
> + set_cpus_allowed_ptr(current, affinity);
> + mutex_unlock(&kthread_affinity_lock);
> +
> + free_cpumask_var(affinity);
> }
>
> static int kthread(void *_create)
> @@ -924,8 +922,11 @@ static int kthreads_online_cpu(unsigned int cpu)
> ret = -EINVAL;
> continue;
> }
> - kthread_fetch_affinity(k, affinity);
> - set_cpus_allowed_ptr(k->task, affinity);
> +
> + if (k->preferred_affinity || k->node != NUMA_NO_NODE) {
> + kthread_fetch_affinity(k, affinity);
> + set_cpus_allowed_ptr(k->task, affinity);
> + }
> }
My understanding of kthreads_online_cpu() is that hotplug won't affect
the affinity returned from kthread_fetch_affinity(). However,
set_cpus_allowed_ptr() will mask out all the offline CPUs. So if the
given "cpu" being brought online is in the returned affinity, we should
call set_cpus_allowed_ptr() to add this CPU back into the task's
effective affinity mask, though the current code calls it even when it
is not strictly necessary. This change no longer does that update for a
NUMA_NO_NODE kthread with no preferred_affinity; is that a problem?
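One hypothetical shape of the fix being asked about (inside the
kthreads_online_cpu() loop; not a posted patch): since
kthread_fetch_affinity() now handles the NUMA_NO_NODE case itself, the
update could simply stay unconditional:

	/*
	 * Refresh every managed kthread so the newly onlined CPU
	 * re-enters its effective affinity mask.
	 */
	kthread_fetch_affinity(k, affinity);
	set_cpus_allowed_ptr(k->task, affinity);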
Cheers,
Longman
* Re: [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
2025-10-21 4:10 ` Waiman Long
@ 2025-10-22 1:36 ` Chen Ridong
0 siblings, 0 replies; 50+ messages in thread
From: Chen Ridong @ 2025-10-22 1:36 UTC (permalink / raw)
To: Waiman Long, Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 2025/10/21 12:10, Waiman Long wrote:
> On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
>> [...]
>> +static bool housekeeping_dereference_check(enum hk_type type)
>> +{
>> + if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
>> + /* Cpuset isn't even writable yet? */
>> + if (system_state <= SYSTEM_SCHEDULING)
>> + return true;
>> +
>> + /* CPU hotplug write locked, so cpuset partition can't be overwritten */
>> + if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
>> + return true;
>> +
>> + /* Cpuset lock held, partitions not writable */
>> + if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
>> + return true;
>
> I have some doubt about this condition, as cpuset_mutex may be held
> precisely while changes are being made to an isolated partition that
> will impact the HK_TYPE_DOMAIN cpumask.
>
> Cheers,
> Longman
>
+1
i.e. 'echo isolated > cpuset.cpus.partition'
--
Best regards,
Ridong
* Re: [PATCH 05/33] sched/isolation: Save boot defined domain flags
2025-10-13 20:31 ` [PATCH 05/33] sched/isolation: Save boot defined domain flags Frederic Weisbecker
@ 2025-10-23 15:45 ` Valentin Schneider
0 siblings, 0 replies; 50+ messages in thread
From: Valentin Schneider @ 2025-10-23 15:45 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Frederic Weisbecker, Michal Koutný, Andrew Morton,
Bjorn Helgaas, Catalin Marinas, Danilo Krummrich,
David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Waiman Long,
Will Deacon, cgroups, linux-arm-kernel, linux-block, linux-mm,
linux-pci, netdev
On 13/10/25 22:31, Frederic Weisbecker wrote:
> HK_TYPE_DOMAIN will soon integrate not only boot defined isolcpus= CPUs
> but also cpuset isolated partitions.
>
> Housekeeping still needs a way to record what was initially passed
> to isolcpus= in order to keep these CPUs isolated after a cpuset
> isolated partition is modified or destroyed while containing some of
> them.
>
> Create a new HK_TYPE_DOMAIN_BOOT to keep track of those.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> ---
> include/linux/sched/isolation.h | 1 +
> kernel/sched/isolation.c | 5 +++--
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index d8501f4709b5..da22b038942a 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -7,6 +7,7 @@
> #include <linux/tick.h>
>
> enum hk_type {
> + HK_TYPE_DOMAIN_BOOT,
> HK_TYPE_DOMAIN,
> HK_TYPE_MANAGED_IRQ,
> HK_TYPE_KERNEL_NOISE,
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index a4cf17b1fab0..8690fb705089 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -11,6 +11,7 @@
> #include "sched.h"
>
> enum hk_flags {
> + HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
> HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
> HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
> HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
> @@ -216,7 +217,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
>
> if (!strncmp(str, "domain,", 7)) {
> str += 7;
> - flags |= HK_FLAG_DOMAIN;
> + flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
> continue;
> }
>
> @@ -246,7 +247,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
>
> /* Default behaviour for isolcpus without flags */
> if (!flags)
> - flags |= HK_FLAG_DOMAIN;
> + flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
I got stupidly confused by the cpumask_andnot() used later on, since
these are housekeeping cpumasks and not isolated ones. AFAICT
HK_FLAG_DOMAIN_BOOT is meant to be a superset of HK_FLAG_DOMAIN, or,
put in a way my brain comprehends: NOT(HK_FLAG_DOMAIN) (i.e. the
runtime isolated cpumask) is a superset of NOT(HK_FLAG_DOMAIN_BOOT)
(i.e. the boot-time isolated cpumask). Thus the final shape of
cpu_is_isolated() makes sense:
static inline bool cpu_is_isolated(int cpu)
{
return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN);
}
Could we document that to make it a bit more explicit? Maybe something like
enum hk_type {
/* Set at boot-time via the isolcpus= cmdline argument */
HK_TYPE_DOMAIN_BOOT,
/*
* Updated at runtime via isolated cpusets; strict subset of
* HK_TYPE_DOMAIN_BOOT as it accounts for boot-time isolated CPUs.
*/
HK_TYPE_DOMAIN,
...
}
* Re: [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated()
2025-10-13 20:31 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
@ 2025-10-29 18:05 ` Waiman Long
0 siblings, 0 replies; 50+ messages in thread
From: Waiman Long @ 2025-10-29 18:05 UTC (permalink / raw)
To: Frederic Weisbecker, LKML
Cc: Michal Koutný, Andrew Morton, Bjorn Helgaas, Catalin Marinas,
Danilo Krummrich, David S . Miller, Eric Dumazet, Gabriele Monaco,
Greg Kroah-Hartman, Ingo Molnar, Jakub Kicinski, Jens Axboe,
Johannes Weiner, Lai Jiangshan, Marco Crivellari, Michal Hocko,
Muchun Song, Paolo Abeni, Peter Zijlstra, Phil Auld,
Rafael J . Wysocki, Roman Gushchin, Shakeel Butt, Simon Horman,
Tejun Heo, Thomas Gleixner, Vlastimil Babka, Will Deacon, cgroups,
linux-arm-kernel, linux-block, linux-mm, linux-pci, netdev
On 10/13/25 4:31 PM, Frederic Weisbecker wrote:
> The set of cpuset isolated CPUs is now included in the HK_TYPE_DOMAIN
> housekeeping cpumask. There is no use case left that is interested in
> checking only what is isolated by cpuset and not by the isolcpus=
> kernel boot parameter.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> include/linux/cpuset.h | 6 ------
> include/linux/sched/isolation.h | 3 +--
> kernel/cgroup/cpuset.c | 12 ------------
> 3 files changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 051d36fec578..a10775a4f702 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -78,7 +78,6 @@ extern void cpuset_lock(void);
> extern void cpuset_unlock(void);
> extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
> -extern bool cpuset_cpu_is_isolated(int cpu);
> extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
> #define cpuset_current_mems_allowed (current->mems_allowed)
> void cpuset_init_current_mems_allowed(void);
> @@ -208,11 +207,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
> return false;
> }
>
> -static inline bool cpuset_cpu_is_isolated(int cpu)
> -{
> - return false;
> -}
> -
> static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
> {
> return node_possible_map;
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 94d5c835121b..0f50c152cf68 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -76,8 +76,7 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
> static inline bool cpu_is_isolated(int cpu)
> {
> return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
> - !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
> - cpuset_cpu_is_isolated(cpu);
> + !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
> }
>
You can also remove the <linux/cpuset.h> include from isolation.h. It
was added by commit 3232e7aad11e5 ("cgroup/cpuset: Include isolated
cpuset CPUs in cpu_is_isolated() check"), which introduced
cpuset_cpu_is_isolated().
Cheers,
Longman
end of thread, other threads:[~2025-10-29 18:05 UTC | newest]
Thread overview: 50+ messages
2025-10-13 20:31 [PATCH 00/33 v3] cpuset/isolation: Honour kthreads preferred affinity Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 01/33] PCI: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
2025-10-14 20:53 ` Bjorn Helgaas
2025-10-13 20:31 ` [PATCH 02/33] cpu: Revert "cpu/hotplug: Prevent self deadlock on CPU hot-unplug" Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 03/33] memcg: Prepare to protect against concurrent isolated cpuset change Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 04/33] mm: vmstat: " Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 05/33] sched/isolation: Save boot defined domain flags Frederic Weisbecker
2025-10-23 15:45 ` Valentin Schneider
2025-10-13 20:31 ` [PATCH 06/33] cpuset: Convert boot_hk_cpus to use HK_TYPE_DOMAIN_BOOT Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 07/33] driver core: cpu: Convert /sys/devices/system/cpu/isolated " Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 08/33] net: Keep ignoring isolated cpuset change Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 09/33] block: Protect against concurrent " Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 10/33] cpu: Provide lockdep check for CPU hotplug lock write-held Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 11/33] cpuset: Provide lockdep check for cpuset lock held Frederic Weisbecker
2025-10-14 13:29 ` Chen Ridong
2025-10-13 20:31 ` [PATCH 12/33] sched/isolation: Convert housekeeping cpumasks to rcu pointers Frederic Weisbecker
2025-10-21 1:46 ` Chen Ridong
2025-10-21 1:57 ` Chen Ridong
2025-10-21 4:03 ` Waiman Long
2025-10-21 3:49 ` Waiman Long
2025-10-13 20:31 ` [PATCH 13/33] cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset Frederic Weisbecker
2025-10-21 4:10 ` Waiman Long
2025-10-22 1:36 ` Chen Ridong
2025-10-21 13:39 ` Waiman Long
2025-10-13 20:31 ` [PATCH 14/33] sched/isolation: Flush memcg workqueues on cpuset isolated partition change Frederic Weisbecker
2025-10-21 19:16 ` Waiman Long
2025-10-21 19:28 ` Waiman Long
2025-10-13 20:31 ` [PATCH 15/33] sched/isolation: Flush vmstat " Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 16/33] PCI: Flush PCI probe workqueue " Frederic Weisbecker
2025-10-14 20:50 ` Bjorn Helgaas
2025-10-13 20:31 ` [PATCH 17/33] cpuset: Propagate cpuset isolation update to workqueue through housekeeping Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 18/33] cpuset: Remove cpuset_cpu_is_isolated() Frederic Weisbecker
2025-10-29 18:05 ` Waiman Long
2025-10-13 20:31 ` [PATCH 19/33] sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated() Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 20/33] PCI: Remove superfluous HK_TYPE_WQ check Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 21/33] kthread: Refine naming of affinity related fields Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 22/33] kthread: Include unbound kthreads in the managed affinity list Frederic Weisbecker
2025-10-21 22:42 ` Waiman Long
2025-10-13 20:31 ` [PATCH 23/33] kthread: Include kthreadd to " Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 24/33] kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 25/33] sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 26/33] cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 27/33] sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 28/33] kthread: Honour kthreads preferred affinity after cpuset changes Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 29/33] kthread: Comment on the purpose and placement of kthread_affine_node() call Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 30/33] kthread: Add API to update preferred affinity on kthread runtime Frederic Weisbecker
2025-10-14 12:35 ` Simon Horman
2025-10-13 20:31 ` [PATCH 31/33] kthread: Document kthread_affine_preferred() Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 32/33] genirq: Correctly handle preferred kthreads affinity Frederic Weisbecker
2025-10-13 20:31 ` [PATCH 33/33] doc: Add housekeeping documentation Frederic Weisbecker