* [RFC PATCH v3 0/3] genirq/cpuhotplug: Adjust managed interrupts according to change of housekeeping cpumask
@ 2024-10-29 12:05 Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 1/3] sched/isolation: Add infrastructure for dynamic CPU isolation Costa Shulyupin
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Costa Shulyupin @ 2024-10-29 12:05 UTC (permalink / raw)
To: longman, ming.lei, pauld, juri.lelli, vschneid, Jens Axboe,
Thomas Gleixner, Peter Zijlstra, Zefan Li, Tejun Heo,
Johannes Weiner, Michal Koutný, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Costa Shulyupin, linux-block, linux-kernel, cgroups
The housekeeping CPU masks, set up by the "isolcpus" and "nohz_full"
boot command line options, are used at boot time to exclude selected
CPUs from running some kernel housekeeping subsystems, minimizing
disturbance to latency-sensitive userspace applications such as DPDK.
These options can only be changed with a reboot. This is a problem for
containerized workloads running on OpenShift/Kubernetes, where a
mix of low-latency and "normal" workloads can be created and destroyed
dynamically and the number of CPUs allocated to each workload is often
not known at boot time.
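For illustration, the static configuration is fixed on the kernel
command line today, e.g. (CPU range chosen arbitrarily):

    isolcpus=managed_irq,domain,2-5 nohz_full=2-5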
Theoretically, complete CPU offlining/onlining could be used for
housekeeping adjustments, but this approach is not practical.
Telco companies use Linux to run DPDK in OpenShift/Kubernetes
containers, and DPDK requires isolated CPUs to run its real-time
processes. Kubernetes manages the allocation of resources for
containers, but unfortunately it does not support dynamic CPU
offlining/onlining:
https://github.com/kubernetes/kubernetes/issues/67500
and is not planning to support it.
Addressing this issue at the application level appears to be even
less straightforward than addressing it at the kernel level.
This patch series is based on the series
"isolation: Exclude dynamically isolated CPUs from housekeeping masks":
https://lore.kernel.org/lkml/20240821142312.236970-1-longman@redhat.com/
Its purpose is to exclude dynamically isolated CPUs from some
housekeeping masks so that subsystems that check the housekeeping masks
at run time will not use those isolated CPUs.
However, some subsystems may keep using stale housekeeping CPU masks.
Therefore, to prevent the use of these isolated CPUs, it is necessary to
explicitly propagate changes of the housekeeping masks to all subsystems
that depend on them.
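In the parent series, dynamic isolation is driven from cpuset isolated
partitions; a minimal illustration with cgroup v2 (CPU number arbitrary,
same steps as the test script in patch 2):

    cd /sys/fs/cgroup/
    echo +cpuset > cgroup.subtree_control
    mkdir -p test
    echo isolated > test/cpuset.cpus.partition
    echo 2 > test/cpuset.cpus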
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
Changes in v3:
- Address the comments by Thomas Gleixner.
Changes in v2:
- Focus this patch series on managed interrupts only.
- https://lore.kernel.org/lkml/20240916122044.3056787-1-costa.shul@redhat.com/
Changes in v1:
- https://lore.kernel.org/lkml/20240516190437.3545310-1-costa.shul@redhat.com/
References:
- Linux Kernel Dynamic CPU Isolation: https://pretalx.com/devconf-us-2024/talk/AZBQLE/
Costa Shulyupin (3):
sched/isolation: Add infrastructure for dynamic CPU isolation
DO NOT MERGE: test for managed irqs adjustment
genirq/cpuhotplug: Adjust managed irqs according to change of
housekeeping CPU
block/blk-mq.c | 19 +++++++
include/linux/blk-mq.h | 2 +
include/linux/cpu.h | 4 ++
include/linux/irq.h | 2 +
kernel/cgroup/cpuset.c | 1 +
kernel/cpu.c | 2 +-
kernel/irq/cpuhotplug.c | 99 +++++++++++++++++++++++++++++++++
kernel/sched/isolation.c | 51 +++++++++++++++--
tests/managed_irq.sh | 116 +++++++++++++++++++++++++++++++++++++++
9 files changed, 291 insertions(+), 5 deletions(-)
create mode 100755 tests/managed_irq.sh
--
2.47.0
* [RFC PATCH v3 1/3] sched/isolation: Add infrastructure for dynamic CPU isolation
2024-10-29 12:05 [RFC PATCH v3 0/3] genirq/cpuhotplug: Adjust managed interrupts according to change of housekeeping cpumask Costa Shulyupin
@ 2024-10-29 12:05 ` Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 2/3] DO NOT MERGE: test for managed irqs adjustment Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU Costa Shulyupin
2 siblings, 0 replies; 5+ messages in thread
From: Costa Shulyupin @ 2024-10-29 12:05 UTC (permalink / raw)
To: longman, ming.lei, pauld, juri.lelli, vschneid, Jens Axboe,
Thomas Gleixner, Peter Zijlstra, Zefan Li, Tejun Heo,
Johannes Weiner, Michal Koutný, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Costa Shulyupin, linux-block, linux-kernel, cgroups
Introduce the infrastructure function housekeeping_update()
to modify a housekeeping cpumask at runtime; the following patches
use it to update the configuration of dependent subsystems
accordingly.
Parent patch:
sched/isolation: Exclude dynamically isolated CPUs from housekeeping masks
https://lore.kernel.org/lkml/20240821142312.236970-1-longman@redhat.com/
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
Note: In theory, housekeeping.flags could be updated only once.
However, that would make this and the following code less clear
and harder to follow and review.
Changes in v3:
- Remove redundant WRITE_ONCE. The first WRITE_ONCE is located
in housekeeping_update() because this function will
update dependent subsystems with changes of the housekeeping masks.
Changes in v2:
- remove unnecessary `err` variable
- add for_each_clear_bit... to clear isolated CPUs
- Address Gleixner's comments:
- use WRITE_ONCE to change housekeeping.flags
- use `struct cpumask *update` in signature of housekeeping_update
v1:
- https://lore.kernel.org/lkml/20240516190437.3545310-2-costa.shul@redhat.com/
---
kernel/sched/isolation.c | 43 ++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 2ee47bc25aea..ebbb215505e8 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -127,6 +127,40 @@ static void __init housekeeping_setup_type(enum hk_type type,
housekeeping_staging);
}
+/*
+ * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
+ * change.
+ */
+static int housekeeping_update(enum hk_type type, const struct cpumask *update)
+{
+ struct {
+ struct cpumask changed;
+ struct cpumask enable;
+ struct cpumask disable;
+ } *masks;
+
+ masks = kmalloc(sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ return -ENOMEM;
+
+ lockdep_assert_cpus_held();
+ cpumask_xor(&masks->changed, housekeeping_cpumask(type), update);
+ cpumask_and(&masks->enable, &masks->changed, update);
+ cpumask_andnot(&masks->disable, &masks->changed, update);
+ cpumask_copy(housekeeping.cpumasks[type], update);
+ WRITE_ONCE(housekeeping.flags, housekeeping.flags | BIT(type));
+ if (!static_branch_unlikely(&housekeeping_overridden))
+ static_key_enable_cpuslocked(&housekeeping_overridden.key);
+
+ /* Add here code to update dependent subsystems with
+ * changes of the housekeeping masks.
+ */
+
+ kfree(masks);
+
+ return 0;
+}
+
static int __init housekeeping_setup(char *str, unsigned long flags)
{
cpumask_var_t non_housekeeping_mask, housekeeping_staging;
@@ -330,10 +364,12 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
/*
* Reset housekeeping to bootup default
*/
+
+ for_each_clear_bit(type, &boot_hk_flags, HK_TYPE_MAX)
+ housekeeping_update(type, cpu_possible_mask);
for_each_set_bit(type, &boot_hk_flags, HK_TYPE_MAX)
- cpumask_copy(housekeeping.cpumasks[type], boot_hk_cpumask);
+ housekeeping_update(type, boot_hk_cpumask);
- WRITE_ONCE(housekeeping.flags, boot_hk_flags);
if (!boot_hk_flags && static_key_enabled(&housekeeping_overridden))
static_key_disable_cpuslocked(&housekeeping_overridden.key);
return 0;
@@ -358,9 +394,8 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
cpumask_andnot(tmp_mask, src_mask, isolcpus);
if (!cpumask_intersects(tmp_mask, cpu_online_mask))
return -EINVAL; /* Invalid isolated CPUs */
- cpumask_copy(housekeeping.cpumasks[type], tmp_mask);
+ housekeeping_update(type, tmp_mask);
}
- WRITE_ONCE(housekeeping.flags, boot_hk_flags | flags);
excluded = true;
if (!static_key_enabled(&housekeeping_overridden))
static_key_enable_cpuslocked(&housekeeping_overridden.key);
--
2.47.0
* [RFC PATCH v3 2/3] DO NOT MERGE: test for managed irqs adjustment
2024-10-29 12:05 [RFC PATCH v3 0/3] genirq/cpuhotplug: Adjust managed interrupts according to change of housekeeping cpumask Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 1/3] sched/isolation: Add infrastructure for dynamic CPU isolation Costa Shulyupin
@ 2024-10-29 12:05 ` Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU Costa Shulyupin
2 siblings, 0 replies; 5+ messages in thread
From: Costa Shulyupin @ 2024-10-29 12:05 UTC (permalink / raw)
To: longman, ming.lei, pauld, juri.lelli, vschneid, Jens Axboe,
Thomas Gleixner, Peter Zijlstra, Zefan Li, Tejun Heo,
Johannes Weiner, Michal Koutný, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Costa Shulyupin, linux-block, linux-kernel, cgroups
Shell script for testing managed interrupt status adjustment.
Targets: managed_irq_affinity_adjust(),
irq_restore_affinity_of_irq(), managed_irq_isolate()
Managed interrupts can be created in various ways. One of them:
qemu-img create -f qcow2 test.qcow2 100M
virtme-ng -v --cpus 4 --rw --user root \
--qemu-opts '\-drive id=d1,if=none,file=test.qcow2 \
\-device nvme,id=i1,drive=d1,serial=1,bootindex=2'
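The test can then be run inside the guest, e.g.:

    zsh tests/managed_irq.sh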
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
Changes in v2:
- use shell script only
v1:
- https://lore.kernel.org/lkml/20240516190437.3545310-8-costa.shul@redhat.com/
---
tests/managed_irq.sh | 116 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 116 insertions(+)
create mode 100755 tests/managed_irq.sh
diff --git a/tests/managed_irq.sh b/tests/managed_irq.sh
new file mode 100755
index 000000000000..14bc2f4b13c5
--- /dev/null
+++ b/tests/managed_irq.sh
@@ -0,0 +1,116 @@
+#!/bin/zsh
+
+# shell script for testing IRQ status adjustment.
+# Targets: managed_irq_affinity_adjust(),
+# irq_restore_affinity_of_irq(), managed_irq_isolate()
+
+# cpu# to isolate
+#
+isolate=1
+
+managed_affined=$(
+ cd /sys/kernel/debug/irq/irqs/;
+ grep -l -e "affinity: $isolate$" /dev/null $(grep -l IRQD_AFFINITY_MANAGED *) |
+ head -n1
+)
+test_irq=${managed_affined%% *}
+
+[ -z "$test_irq" ] && { echo "No managed IRQs found"; exit 1; }
+
+rm -rf 0.irqs
+cp -R /sys/kernel/debug/irq/irqs 0.irqs
+
+cd /sys/fs/cgroup/
+echo +cpuset > cgroup.subtree_control
+mkdir -p test
+echo isolated > test/cpuset.cpus.partition
+
+effective_affinity=/proc/irq/$test_irq/effective_affinity
+test_irq_debug=/sys/kernel/debug/irq/irqs/$test_irq
+
+errors=0
+
+check()
+{
+ local _status=$?
+ if [[ $_status == 0 ]]
+ then
+ echo PASS
+ else
+ let errors+=1
+ echo FAIL:
+ cat $test_irq_debug
+ fi
+ return $_status
+}
+
+check_activated()
+{
+ echo "Check normal irq affinity"
+ test 0 -ne $((0x$(cat $effective_affinity) & 1 << $isolate))
+ check
+ grep -q IRQD_ACTIVATED $test_irq_debug
+ check
+ grep -q IRQD_IRQ_STARTED $test_irq_debug
+ check
+ ! grep -q IRQD_IRQ_DISABLED $test_irq_debug
+ check
+ ! grep -q IRQD_IRQ_MASKED $test_irq_debug
+ check
+ ! grep -q IRQD_MANAGED_SHUTDOWN $test_irq_debug
+ check
+}
+
+check_shutdown()
+{
+ echo "Check that irq affinity doesn't contain isolated cpu."
+ test 0 -eq $((0x$(cat $effective_affinity) & 1 << $isolate))
+ check
+ ! grep -q IRQD_ACTIVATED $test_irq_debug
+ check
+ ! grep -q IRQD_IRQ_STARTED $test_irq_debug
+ check
+ grep -q IRQD_IRQ_DISABLED $test_irq_debug
+ check
+ grep -q IRQD_IRQ_MASKED $test_irq_debug
+ check
+ grep -q IRQD_MANAGED_SHUTDOWN $test_irq_debug
+ check
+}
+
+echo "Isolating CPU #$isolate"
+echo $isolate > test/cpuset.cpus
+
+check_shutdown
+
+echo Reset cpuset
+echo "" > test/cpuset.cpus
+
+check_activated
+
+echo "Isolating CPU #$isolate again"
+echo $isolate > test/cpuset.cpus
+
+check_shutdown
+
+echo "Isolating CPU #3 and restore CPU #$isolate"
+echo 3 > test/cpuset.cpus
+
+check_activated
+
+echo Reset cpuset
+echo "" > test/cpuset.cpus
+
+rmdir test
+cd -
+
+rm -rf final.irqs
+cp -R /sys/kernel/debug/irq/irqs final.irqs
+
+if ! diff -r --ignore-matching-lines=Vector: 0.irqs final.irqs; then
+ echo diff failed;
+ let errors+=1
+fi
+
+echo errors=$errors
+(return $errors)
--
2.47.0
* [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU
2024-10-29 12:05 [RFC PATCH v3 0/3] genirq/cpuhotplug: Adjust managed interrupts according to change of housekeeping cpumask Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 1/3] sched/isolation: Add infrastructure for dynamic CPU isolation Costa Shulyupin
2024-10-29 12:05 ` [RFC PATCH v3 2/3] DO NOT MERGE: test for managed irqs adjustment Costa Shulyupin
@ 2024-10-29 12:05 ` Costa Shulyupin
2024-10-29 18:54 ` Thomas Gleixner
2 siblings, 1 reply; 5+ messages in thread
From: Costa Shulyupin @ 2024-10-29 12:05 UTC (permalink / raw)
To: longman, ming.lei, pauld, juri.lelli, vschneid, Jens Axboe,
Thomas Gleixner, Peter Zijlstra, Zefan Li, Tejun Heo,
Johannes Weiner, Michal Koutný, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Costa Shulyupin, linux-block, linux-kernel, cgroups
Interrupts disturb real-time tasks on the CPUs they are affine to.
To ensure CPU isolation for real-time tasks, interrupt handling must
be adjusted accordingly.
Non-managed interrupts can be configured from userspace,
while managed interrupts require adjustments in kernel space.
Adjust the status of managed interrupts according to changes of the
housekeeping CPU mask to support dynamic CPU isolation.
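For comparison, non-managed interrupt affinity is adjustable from
userspace via procfs, e.g. (IRQ number and CPU mask are illustrative):

    echo 2 > /proc/irq/42/smp_affinity        # hex CPU mask (CPU 1)
    echo 1 > /proc/irq/42/smp_affinity_list   # or a CPU list (CPU 1)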
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
The following code is a proof of concept to validate the approach
and review its correctness.
C++-style comments denote temporary notes.
v3:
- rename `int managed_irq_affinity_adjust()`
to `void managed_irq_adjust_activity()`
- Addresses Thomas Gleixner's comments:
- add locking to managed_irq_adjust_activity()
- add blk_mq_flush_on_cpu() to flush queues associated
with isolated interrupts.
v2:
- refactor irq_affinity_adjust():
- add more comments
- add managed_irq_isolate() derived from migrate_one_irq(),
irq_needs_fixup() and irq_fixup_move_pending()
- use irq_set_affinity() instead of irq_set_affinity_locked
- Addressed Gleixner's comments:
- use `struct cpumask *` instead of `cpumask_var_t` in function signature
- remove locking in irq_affinity_adjust()
v1:
- https://lore.kernel.org/lkml/20240516190437.3545310-5-costa.shul@redhat.com/
---
block/blk-mq.c | 19 ++++++++
include/linux/blk-mq.h | 2 +
include/linux/cpu.h | 4 ++
include/linux/irq.h | 2 +
kernel/cgroup/cpuset.c | 1 +
kernel/cpu.c | 2 +-
kernel/irq/cpuhotplug.c | 99 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/isolation.c | 14 ++++--
8 files changed, 139 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2909c51a13bd..484b6488739b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3756,6 +3756,25 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
return 0;
}
+/*
+ * Draft function to flush data in
+ * block multi-queues to perform isolation
+ * of the specified CPU from managed interrupts.
+ */
+void blk_mq_flush_on_cpu(int cpu)
+{
+ // TODO: Thoroughly test this code with high test coverage.
+ /* Calling:
+ * - blk_mq_hctx_notify_offline()
+ * - blk_mq_hctx_notify_dead()
+ * - bio_cpu_dead()
+ */
+ cpuhp_invoke_callback(cpu, CPUHP_AP_BLK_MQ_ONLINE, false, NULL, NULL);
+ cpuhp_invoke_callback(cpu, CPUHP_BLK_MQ_DEAD, false, NULL, NULL);
+ cpuhp_invoke_callback(cpu, CPUHP_BIO_DEAD, false, NULL, NULL);
+ blk_softirq_cpu_dead(cpu);
+}
+
static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
{
if (!(hctx->flags & BLK_MQ_F_STACKING))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2035fad3131f..286cd2a0ba84 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -929,6 +929,8 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q);
unsigned int blk_mq_rq_cpu(struct request *rq);
+void blk_mq_flush_on_cpu(int cpu);
+
bool __blk_should_fake_timeout(struct request_queue *q);
static inline bool blk_should_fake_timeout(struct request_queue *q)
{
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index bdcec1732445..f4504b3c129a 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -203,4 +203,8 @@ static inline bool cpu_mitigations_auto_nosmt(void)
}
#endif
+int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
+ bool bringup, struct hlist_node *node,
+ struct hlist_node **lastp);
+
#endif /* _LINUX_CPU_H_ */
diff --git a/include/linux/irq.h b/include/linux/irq.h
index d73bbb7797d0..3974f95c1783 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -619,6 +619,8 @@ extern int irq_affinity_online_cpu(unsigned int cpu);
# define irq_affinity_online_cpu NULL
#endif
+void managed_irq_adjust_activity(struct cpumask *enable_mask);
+
#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ)
void __irq_move_irq(struct irq_data *data);
static inline void irq_move_irq(struct irq_data *data)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 5066397899c9..8364369976e4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -95,6 +95,7 @@ static struct list_head remote_children;
#define HOUSEKEEPING_FLAGS (BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU) |\
BIT(HK_TYPE_SCHED) | BIT(HK_TYPE_MISC) |\
BIT(HK_TYPE_DOMAIN) | BIT(HK_TYPE_WQ) |\
+ BIT(HK_TYPE_MANAGED_IRQ) |\
BIT(HK_TYPE_KTHREAD))
/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index afc920116d42..44c7da0e1b8d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -171,7 +171,7 @@ static bool cpuhp_step_empty(bool bringup, struct cpuhp_step *step)
*
* Return: %0 on success or a negative errno code
*/
-static int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
+int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
bool bringup, struct hlist_node *node,
struct hlist_node **lastp)
{
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index b92cf6a76351..9233694d541a 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -279,3 +279,102 @@ int irq_affinity_online_cpu(unsigned int cpu)
return 0;
}
+
+/*
+ * managed_irq_isolate() - Deactivate managed interrupts if necessary
+ */
+// derived from migrate_one_irq(), irq_needs_fixup(), irq_fixup_move_pending().
+// Finally, this function can be considered for
+// merging back with migrate_one_irq().
+static int managed_irq_isolate(struct irq_desc *desc)
+{
+ struct irq_data *d = irq_desc_get_irq_data(desc);
+ struct irq_chip *chip = irq_data_get_irq_chip(d);
+ const struct cpumask *a;
+ bool maskchip;
+ int err;
+
+ /*
+ * Deactivate if:
+ * - Interrupt is managed
+ * - Interrupt is not per cpu
+ * - Interrupt is started
+ * - Effective affinity mask includes isolated CPUs
+ */
+ if (!irqd_affinity_is_managed(d) || irqd_is_per_cpu(d) || !irqd_is_started(d)
+ || cpumask_subset(irq_data_get_effective_affinity_mask(d),
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
+ return 0;
+ // TBD: is it required?
+ /*
+ * Complete an eventually pending irq move cleanup. If this
+ * interrupt was moved in hard irq context, then the vectors need
+ * to be cleaned up. It can't wait until this interrupt actually
+ * happens and this CPU was involved.
+ */
+ irq_force_complete_move(desc);
+
+ if (irqd_is_setaffinity_pending(d)) {
+ irqd_clr_move_pending(d);
+ if (cpumask_intersects(desc->pending_mask,
+ housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)))
+ a = irq_desc_get_pending_mask(desc);
+ } else {
+ a = irq_data_get_affinity_mask(d);
+ }
+
+ maskchip = chip->irq_mask && !irq_can_move_pcntxt(d) && !irqd_irq_masked(d);
+ if (maskchip)
+ chip->irq_mask(d);
+
+ if (!cpumask_intersects(a, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ))) {
+ /*
+ * Shut managed interrupt down and leave the affinity untouched.
+ * The effective affinity is reset to the first online CPU.
+ */
+ irqd_set_managed_shutdown(d);
+ irq_shutdown_and_deactivate(desc);
+ return 0;
+ }
+
+ /*
+ * Do not set the force argument of irq_do_set_affinity() as this
+ * disables the masking of offline CPUs from the supplied affinity
+ * mask and therefore might keep/reassign the irq to the isolated
+ * CPU.
+ */
+ err = irq_do_set_affinity(d, a, false);
+ if (err)
+ pr_warn_ratelimited("IRQ%u: set affinity failed(%d).\n",
+ d->irq, err);
+
+ if (maskchip)
+ chip->irq_unmask(d);
+
+ return err;
+}
+
+/** managed_irq_adjust_activity() - Deactivate or restore managed interrupts
+ * according to change of housekeeping cpumask.
+ *
+ * @enable_mask: CPUs for which interrupts should be restored
+ */
+void managed_irq_adjust_activity(struct cpumask *enable_mask)
+{
+ unsigned int irq;
+
+ for_each_active_irq(irq) {
+ struct irq_desc *desc = irq_to_desc(irq);
+ unsigned long flags;
+ unsigned int cpu;
+
+ if (!desc)
+ continue;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ for_each_cpu(cpu, enable_mask)
+ irq_restore_affinity_of_irq(desc, cpu);
+ managed_irq_isolate(desc);
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ }
+}
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ebbb215505e8..d1a0f1b104da 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,8 @@
*
*/
+#include <linux/blk-mq.h>
+
#undef pr_fmt
#define pr_fmt(fmt) "%s:%d: %s " fmt, __FILE__, __LINE__, __func__
@@ -152,10 +154,16 @@ static int housekeeping_update(enum hk_type type, const struct cpumask *update)
if (!static_branch_unlikely(&housekeeping_overridden))
static_key_enable_cpuslocked(&housekeeping_overridden.key);
- /* Add here code to update dependent subsystems with
- * changes of the housekeeping masks.
- */
+ switch (type) {
+ case HK_TYPE_MANAGED_IRQ:
+ int cpu;
+ for_each_cpu(cpu, &masks->disable)
+ blk_mq_flush_on_cpu(cpu);
+ managed_irq_adjust_activity(&masks->enable);
+ break;
+ default:
+ }
kfree(masks);
return 0;
--
2.47.0
* Re: [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU
2024-10-29 12:05 ` [RFC PATCH v3 3/3] genirq/cpuhotplug: Adjust managed irqs according to change of housekeeping CPU Costa Shulyupin
@ 2024-10-29 18:54 ` Thomas Gleixner
0 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2024-10-29 18:54 UTC (permalink / raw)
To: Costa Shulyupin, longman, ming.lei, pauld, juri.lelli, vschneid,
Jens Axboe, Peter Zijlstra, Zefan Li, Tejun Heo, Johannes Weiner,
Michal Koutný, Ingo Molnar, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Costa Shulyupin, linux-block, linux-kernel, cgroups
On Tue, Oct 29 2024 at 14:05, Costa Shulyupin wrote:
> index afc920116d42..44c7da0e1b8d 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -171,7 +171,7 @@ static bool cpuhp_step_empty(bool bringup, struct cpuhp_step *step)
> *
> * Return: %0 on success or a negative errno code
> */
> -static int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
> +int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state,
> bool bringup, struct hlist_node *node,
> struct hlist_node **lastp)
This is deep internal functionality of cpu hotplug and only valid when
the hotplug lock is write held or if it is read held _and_ the state
mutex is held.
Otherwise it is completely unprotected against a concurrent state or
instance insertion/removal and concurrent invocations of this function.
And no, we are not going to expose the state mutex just because. CPU
hotplug is complex enough already and we really don't need more side
channels into it.
There is another issue with this approach in general:
1) The 3 block states are just the tip of the iceberg. You are going
to play a whack-a-mole game to add other subsystems/drivers as
well.
2) The whole logic has ordering constraints. The states have strict
ordering for a reason. So what guarantees that e.g. BLK_MQ_ONLINE
has no dependencies on non-BLK-related states to be invoked before
that? I'm failing to see the analysis of correctness here.
Just because it did not explode right away does not make it
correct. We've had enough subtle problems with ordering and
dependencies in the past. No need to introduce new ones.
CPU hotplug solves this problem without any hackery. Take a CPU offline,
change the mask of that CPU and bring it online again. Repeat until all
CPU changes are done.
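A minimal sketch of that sequence via sysfs (CPU number illustrative;
the mask/cpuset adjustment step depends on the workload manager):

    echo 0 > /sys/devices/system/cpu/cpu2/online   # take CPU 2 offline
    # ...adjust isolation/affinity configuration for CPU 2 here...
    echo 1 > /sys/devices/system/cpu/cpu2/online   # bring CPU 2 back online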
If some user space component cannot deal with that, then fix that
instead of inflicting fragile and unmaintainable complexity on the
kernel. That kubernetes problem is known since 2018 and nobody has
actually sat down and solved it. Now we waste another 6 years to make it
"work" in the kernel magically.
This needs userspace awareness anyway. If you isolate a CPU then tasks
or containers which are assigned to that CPU need to move away and the
container has to exclude that CPU. If you remove the isolation then what
is opening the CPU for existing containers magically?
I'm not buying any of this "will just work and nobody notices"
handwaving.
Thanks,
tglx