cgroups.vger.kernel.org archive mirror
* [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask
@ 2024-05-16 19:04 Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation Costa Shulyupin
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

The housekeeping CPU masks, set up by the "isolcpus" and "nohz_full"
boot command line options, are used at boot time to exclude selected
CPUs from running some kernel housekeeping facilities to minimize
disturbance to latency sensitive userspace applications such as DPDK.
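
For example, a typical setup reserves CPUs 2-7 for such applications at
boot time with options like (illustrative values):

	isolcpus=nohz,domain,managed_irq,2-7 nohz_full=2-7 rcu_nocbs=2-7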

However, these options can have negative consequences for "normal"
workloads. Both nohz_full and rcu_nocbs can be applied to a subset of
the CPUs on a server (so as to not impact the "normal" workloads), but
they can only be changed with a reboot. This is a problem for
containerized workloads running on OpenShift (i.e. Kubernetes) where a
mix of low latency and "normal" workloads can be created/destroyed
dynamically and the number of CPUs allocated to each workload is often
not known at boot time.

This series of patches is based on series
"isolation: Exclude dynamically isolated CPUs from housekeeping masks"
https://lore.kernel.org/lkml/20240229021414.508972-1-longman@redhat.com/
Its purpose is to exclude dynamically isolated CPUs from some
housekeeping masks so that subsystems that check the housekeeping masks
at run time will not use those isolated CPUs.

However, some subsystems consult the housekeeping CPU masks only once
(for example at boot time) and would otherwise keep acting on stale
values. Therefore, to prevent further use of newly isolated CPUs, it is
necessary to explicitly propagate changes of the housekeeping masks to
all subsystems that depend on them.
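
A simplified sketch of housekeeping_update(), the propagation point
added by patch 1 and filled in by patches 2-4 (allocation, locking,
static-key handling and error handling are omitted here for brevity):

static int housekeeping_update(enum hk_type type, cpumask_var_t update)
{
	struct cpumask changed, enable, disable;

	cpumask_xor(&changed, housekeeping_cpumask(type), update);
	cpumask_and(&enable, &changed, update);     /* CPUs gaining housekeeping duty */
	cpumask_andnot(&disable, &changed, update); /* CPUs becoming isolated */
	cpumask_copy(housekeeping.cpumasks[type], update);

	switch (type) {
	case HK_TYPE_TIMER:
		/* patches 2-3: resettle timers/hrtimers, update watchdog_cpumask */
		break;
	case HK_TYPE_MANAGED_IRQ:
		/* patch 4: move managed irqs off the newly isolated CPUs */
		break;
	default:
		break;
	}

	return 0;
}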

Costa Shulyupin (7):
  sched/isolation: Add infrastructure to adjust affinity for dynamic CPU
    isolation
  sched/isolation: Adjust affinity of timers according to change of
    housekeeping cpumask
  sched/isolation: Adjust affinity of hrtimers according to change of
    housekeeping cpumask
  sched/isolation: Adjust affinity of managed irqs according to change
    of housekeeping cpumask
  [NOT-FOR-MERGE] test timers affinity adjustment
  [NOT-FOR-MERGE] test timers and hrtimers affinity adjustment
  [NOT-FOR-MERGE] test managed irqs affinity adjustment

 include/linux/hrtimer.h  |   2 +
 include/linux/timer.h    |   2 +
 init/Kconfig             |   1 +
 kernel/cgroup/cpuset.c   |   3 +-
 kernel/sched/isolation.c | 119 +++++++++++++++++++++++++++++++++++++--
 kernel/time/hrtimer.c    |  81 ++++++++++++++++++++++++++
 kernel/time/timer.c      |  64 +++++++++++++++++++++
 tests/managed_irq.c      |  71 +++++++++++++++++++++++
 tests/timers.c           |  58 +++++++++++++++++++
 9 files changed, 395 insertions(+), 6 deletions(-)
 create mode 100644 tests/managed_irq.c
 create mode 100644 tests/timers.c

-- 
2.45.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-17 21:37   ` Thomas Gleixner
  2024-05-16 19:04 ` [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask Costa Shulyupin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Introduce the infrastructure function housekeeping_update() to change
a housekeeping cpumask at runtime and adjust the affinities of dependent
subsystems.

Affinity adjustments of subsystems follow in subsequent patches.

Parent patch:
"sched/isolation: Exclude dynamically isolated CPUs from housekeeping masks"
https://lore.kernel.org/lkml/20240229021414.508972-2-longman@redhat.com/

Test example for cgroup2:

cd /sys/fs/cgroup/
echo +cpuset > cgroup.subtree_control
mkdir test
echo isolated > test/cpuset.cpus.partition
echo $isolate > test/cpuset.cpus
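
To revert, the partition can presumably be switched back to a regular
member (letting cpuset drop the CPUs from the dynamically isolated set,
which resets the housekeeping masks to their boot-time values) and
removed:

echo member > test/cpuset.cpus.partition
rmdir test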

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 kernel/sched/isolation.c | 48 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 948b9ee0dc2cc..036e48f0e7d1b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -116,6 +116,39 @@ static void __init housekeeping_setup_type(enum hk_type type,
 		     housekeeping_staging);
 }
 
+/*
+ * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
+ * change.
+ *
+ * Assuming cpuset_mutex is held in sched_partition_write or
+ * cpuset_write_resmask.
+ */
+static int housekeeping_update(enum hk_type type, cpumask_var_t update)
+{
+	struct {
+		struct cpumask changed;
+		struct cpumask enable;
+		struct cpumask disable;
+	} *masks;
+
+	masks = kmalloc(sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		return -ENOMEM;
+
+	lockdep_assert_cpus_held();
+	cpumask_xor(&masks->changed, housekeeping_cpumask(type), update);
+	cpumask_and(&masks->enable, &masks->changed, update);
+	cpumask_andnot(&masks->disable, &masks->changed, update);
+	cpumask_copy(housekeeping.cpumasks[type], update);
+	housekeeping.flags |= BIT(type);
+	if (!static_branch_unlikely(&housekeeping_overridden))
+		static_key_enable_cpuslocked(&housekeeping_overridden.key);
+
+	kfree(masks);
+
+	return 0;
+}
+
 static int __init housekeeping_setup(char *str, unsigned long flags)
 {
 	cpumask_var_t non_housekeeping_mask, housekeeping_staging;
@@ -314,9 +347,12 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
 		/*
 		 * Reset housekeeping to bootup default
 		 */
-		for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX)
-			cpumask_copy(housekeeping.cpumasks[type],
-				     housekeeping_boot.cpumasks[type]);
+		for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX) {
+			int err = housekeeping_update(type, housekeeping_boot.cpumasks[type]);
+
+			if (err)
+				return err;
+		}
 
 		WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags);
 		if (!housekeeping_boot.flags &&
@@ -344,9 +380,11 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
 		cpumask_andnot(tmp_mask, src_mask, isolcpus);
 		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
 			return -EINVAL;	/* Invalid isolated CPUs */
-		cpumask_copy(housekeeping.cpumasks[type], tmp_mask);
+		int err = housekeeping_update(type, tmp_mask);
+
+		if (err)
+			return err;
 	}
-	WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags | flags);
 	excluded = true;
 	if (!static_branch_unlikely(&housekeeping_overridden))
 		static_key_enable_cpuslocked(&housekeeping_overridden.key);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-17 22:39   ` Thomas Gleixner
  2024-05-16 19:04 ` [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers " Costa Shulyupin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Adjust the affinity of timers and watchdog_cpumask according to changes
of housekeeping.cpumasks[HK_TYPE_TIMER] at runtime.

watchdog_cpumask is initialized during boot in lockup_detector_init()
from housekeeping_cpumask(HK_TYPE_TIMER).

lockup_detector_reconfigure() utilizes updated watchdog_cpumask
via __lockup_detector_reconfigure().

timers_resettle_from_cpu() is blindly prototyped from timers_dead_cpu().
local_irq_disable() is used because cpuhp_thread_fun() disables
interrupts before calling cpuhp_invoke_callback().

Core test snippets without infrastructure:

1. Create timer on specific cpu with:

	timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
        test_timer.expires = KTIME_MAX;
        add_timer_on(&test_timer, test_cpu);

2. Call housekeeping_update()

3. Ensure that there are no timers on the specified CPU at the end
of timers_resettle_from_cpu() with:

static int count_timers(int cpu)
{
	struct timer_base *base;
	int b, v, count = 0;

	for (b = 0; b < NR_BASES; b++) {
		base = per_cpu_ptr(&timer_bases[b], cpu);
		raw_spin_lock_irq(&base->lock);

		for (v = 0; v < WHEEL_SIZE; v++) {
			struct hlist_node *c;

			hlist_for_each(c, base->vectors + v)
				count++;
		}
		raw_spin_unlock_irq(&base->lock);
	}

	return count;
}

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 include/linux/timer.h    |  2 ++
 init/Kconfig             |  1 +
 kernel/sched/isolation.c | 27 ++++++++++++++++++++++++++
 kernel/time/timer.c      | 42 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index e67ecd1cbc97d..a09266abdb18a 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -219,9 +219,11 @@ unsigned long round_jiffies_up_relative(unsigned long j);
 #ifdef CONFIG_HOTPLUG_CPU
 int timers_prepare_cpu(unsigned int cpu);
 int timers_dead_cpu(unsigned int cpu);
+void timers_resettle_from_cpu(unsigned int cpu);
 #else
 #define timers_prepare_cpu	NULL
 #define timers_dead_cpu		NULL
+static inline void timers_resettle_from_cpu(unsigned int cpu) { }
 #endif
 
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 72404c1f21577..fac49c6bb965a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -682,6 +682,7 @@ config CPU_ISOLATION
 	bool "CPU isolation"
 	depends on SMP || COMPILE_TEST
 	default y
+	select HOTPLUG_CPU
 	help
 	  Make sure that CPUs running critical tasks are not disturbed by
 	  any source of "noise" such as unbound workqueues, timers, kthreads...
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 036e48f0e7d1b..3b63f0212887e 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,6 +8,10 @@
  *
  */
 
+#ifdef CONFIG_LOCKUP_DETECTOR
+#include <linux/nmi.h>
+#endif
+
 enum hk_flags {
 	HK_FLAG_TIMER		= BIT(HK_TYPE_TIMER),
 	HK_FLAG_RCU		= BIT(HK_TYPE_RCU),
@@ -116,6 +120,19 @@ static void __init housekeeping_setup_type(enum hk_type type,
 		     housekeeping_staging);
 }
 
+static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable_mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, enable_mask)	{
+		timers_prepare_cpu(cpu);
+	}
+
+	for_each_cpu(cpu, disable_mask) {
+		timers_resettle_from_cpu(cpu);
+	}
+}
+
 /*
  * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
  * change.
@@ -144,6 +161,16 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
 	if (!static_branch_unlikely(&housekeeping_overridden))
 		static_key_enable_cpuslocked(&housekeeping_overridden.key);
 
+	switch (type) {
+	case HK_TYPE_TIMER:
+		resettle_all_timers(&masks->enable, &masks->disable);
+#ifdef CONFIG_LOCKUP_DETECTOR
+		cpumask_copy(&watchdog_cpumask, housekeeping_cpumask(HK_TYPE_TIMER));
+		lockup_detector_reconfigure();
+#endif
+		break;
+	default:
+	}
 	kfree(masks);
 
 	return 0;
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 48288dd4a102f..2d15c0e7b0550 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -51,6 +51,7 @@
 #include <asm/div64.h>
 #include <asm/timex.h>
 #include <asm/io.h>
+#include <linux/sched/isolation.h>
 
 #include "tick-internal.h"
 #include "timer_migration.h"
@@ -2657,6 +2658,47 @@ int timers_prepare_cpu(unsigned int cpu)
 	return 0;
 }
 
+/**
+ * timers_resettle_from_cpu - resettles timers from
+ * specified cpu to housekeeping cpus.
+ */
+void timers_resettle_from_cpu(unsigned int cpu)
+{
+	struct timer_base *old_base;
+	struct timer_base *new_base;
+	int b, i;
+
+	local_irq_disable();
+	for (b = 0; b < NR_BASES; b++) {
+		old_base = per_cpu_ptr(&timer_bases[b], cpu);
+		new_base = per_cpu_ptr(&timer_bases[b],
+				cpumask_any_and(cpu_active_mask,
+					housekeeping_cpumask(HK_TYPE_TIMER)));
+		/*
+		 * The caller is globally serialized and nobody else
+		 * takes two locks at once, deadlock is not possible.
+		 */
+		raw_spin_lock_irq(&new_base->lock);
+		raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
+
+		/*
+		 * The current CPUs base clock might be stale. Update it
+		 * before moving the timers over.
+		 */
+		forward_timer_base(new_base);
+
+		WARN_ON_ONCE(old_base->running_timer);
+		old_base->running_timer = NULL;
+
+		for (i = 0; i < WHEEL_SIZE; i++)
+			migrate_timer_list(new_base, old_base->vectors + i);
+
+		raw_spin_unlock(&old_base->lock);
+		raw_spin_unlock_irq(&new_base->lock);
+	}
+	local_irq_enable();
+}
+
 int timers_dead_cpu(unsigned int cpu)
 {
 	struct timer_base *old_base;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers according to change of housekeeping cpumask
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-17 23:42   ` Thomas Gleixner
  2024-05-16 19:04 ` [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs " Costa Shulyupin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Adjust the affinity of watchdog_cpumask and hrtimers according to
changes of housekeeping.cpumasks[HK_TYPE_TIMER].

migrate_hrtimer_list_except() is prototyped from
migrate_hrtimer_list() and is more generic.

Potentially it could be used instead of migrate_hrtimer_list().

hrtimers_resettle_from_cpu() is blindly prototyped from
hrtimers_cpu_dying(). local_irq_disable() is used because
cpuhp_thread_fun() disables interrupts before calling cpuhp_invoke_callback().

Core test snippets without infrastructure:

1. Create hrtimer on specific cpu with:

        set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
        hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        test_hrtimer.function = test_hrtimer_cb;
        hrtimer_start(&test_hrtimer, -1,  HRTIMER_MODE_REL);

2. Call housekeeping_update()

3. Ensure that only the tick_nohz_handler timer remains on the specified
CPU, by checking /proc/timer_list manually or with the script:

grep -E 'cpu| #[0-9]' /proc/timer_list | \
	awk "/cpu:/{y=0};/cpu: $test_cpu\$/{y=1};y"

An alternative solution to migrate hrtimers:
1. Use cpuhp to take sched_timer offline
2. Resettle all hrtimers as migrate_hrtimer_list() does
3. Use cpuhp to bring sched_timer back online

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 include/linux/hrtimer.h  |  2 +
 kernel/sched/isolation.c |  2 +
 kernel/time/hrtimer.c    | 81 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index aa1e65ccb6158..004632fc7d643 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -381,8 +381,10 @@ extern void sysrq_timer_list_show(void);
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_cpu_dying(unsigned int cpu);
+void hrtimers_resettle_from_cpu(unsigned int cpu);
 #else
 #define hrtimers_cpu_dying	NULL
+static inline void hrtimers_resettle_from_cpu(unsigned int cpu) { }
 #endif
 
 #endif
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 3b63f0212887e..85a17d39d8bb0 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -126,10 +126,12 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
 
 	for_each_cpu(cpu, enable_mask)	{
 		timers_prepare_cpu(cpu);
+		hrtimers_prepare_cpu(cpu);
 	}
 
 	for_each_cpu(cpu, disable_mask) {
 		timers_resettle_from_cpu(cpu);
+		hrtimers_resettle_from_cpu(cpu);
 	}
 }
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 492c14aac642b..7e71ebbb72348 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2201,6 +2201,87 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 	}
 }
 
+/*
+ * migrate_hrtimer_list_except - migrates hrtimers from one base to another,
+ * except specified one.
+ */
+static void migrate_hrtimer_list_except(struct hrtimer_clock_base *old_base,
+				struct hrtimer_clock_base *new_base, struct hrtimer *except)
+{
+	struct hrtimer *timer;
+	struct timerqueue_node *node;
+
+	node = timerqueue_getnext(&old_base->active);
+	while (node) {
+		timer = container_of(node, struct hrtimer, node);
+		node = timerqueue_iterate_next(node);
+		if (timer == except)
+			continue;
+
+		BUG_ON(hrtimer_callback_running(timer));
+		debug_deactivate(timer);
+
+		/*
+		 * Mark it as ENQUEUED not INACTIVE otherwise the
+		 * timer could be seen as !active and just vanish away
+		 * under us on another CPU
+		 */
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
+		timer->base = new_base;
+		/*
+		 * Enqueue the timers on the new cpu. This does not
+		 * reprogram the event device in case the timer
+		 * expires before the earliest on this CPU, but we run
+		 * hrtimer_interrupt after we migrated everything to
+		 * sort out already expired timers and reprogram the
+		 * event device.
+		 */
+		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
+	}
+}
+
+/**
+ * hrtimers_resettle_from_cpu - resettles hrtimers from
+ * specified cpu to housekeeping cpus.
+ */
+void hrtimers_resettle_from_cpu(unsigned int isol_cpu)
+{
+	int ncpu, i;
+	struct tick_sched *ts = tick_get_tick_sched(isol_cpu);
+	struct hrtimer_cpu_base *old_base, *new_base;
+
+	local_irq_disable();
+	ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
+
+	old_base = &per_cpu(hrtimer_bases, isol_cpu);
+	new_base = &per_cpu(hrtimer_bases, ncpu);
+
+	/*
+	 * The caller is globally serialized and nobody else
+	 * takes two locks at once, deadlock is not possible.
+	 */
+	raw_spin_lock(&old_base->lock);
+	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		migrate_hrtimer_list_except(&old_base->clock_base[i],
+				     &new_base->clock_base[i],
+				     &ts->sched_timer);
+	}
+
+	/*
+	 * The migration might have changed the first expiring softirq
+	 * timer on this CPU. Update it.
+	 */
+	__hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
+
+	raw_spin_unlock(&new_base->lock);
+	raw_spin_unlock(&old_base->lock);
+	local_irq_enable();
+
+	/* Tell the other CPU to retrigger the next event */
+	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
+}
+
 int hrtimers_cpu_dying(unsigned int dying_cpu)
 {
 	int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
                   ` (2 preceding siblings ...)
  2024-05-16 19:04 ` [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers " Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-18  1:17   ` Thomas Gleixner
  2024-05-16 19:04 ` [PATCH v1 5/7] [NOT-FOR-MERGE] test timers affinity adjustment Costa Shulyupin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
and irq_restore_affinity_of_irq().

Core test snippets without infrastructure:

1. Create managed IRQ on specific cpu with:

static int test_set_affinity(struct irq_data *data,
			     const struct cpumask *m, bool f)
{
	irq_data_update_effective_affinity(data, m);
	return 0;
}

static int make_test_irq(void)
{
	struct irq_affinity_desc a = { mask: *cpumask_of(test_cpu),
				       is_managed: true };
	static struct irq_chip test_chip = { .irq_set_affinity = test_set_affinity };
	int test_irq = __irq_alloc_descs(-1, 1, 1, 0, THIS_MODULE, &a);

	irq_set_chip(test_irq, &test_chip);
	irq_set_status_flags(test_irq, IRQ_MOVE_PCNTXT);
	request_irq(test_irq, test_irq_cb, 0, "test_affinity", 0);

	return test_irq;
}

2. Isolate the specified CPU.

3. Ensure that test_irq is no longer affine to test_cpu:

cat /proc/irq/$test_irq/smp_affinity_list
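
This can also be checked non-interactively, as the test module in
patch 7 does (assuming the isolated CPU number is in $isolate):

read affinity < /proc/irq/$test_irq/smp_affinity
test 0 -ne $((0x$affinity & 1 << $isolate)) && echo FAIL || echo PASS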

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 kernel/cgroup/cpuset.c   |  3 ++-
 kernel/sched/isolation.c | 44 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9d01e8e0a3ed9..2e59a2983eb2e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,7 +225,8 @@ static struct list_head remote_children;
 /*
  * The set of housekeeping flags to be updated for CPU isolation
  */
-#define	HOUSEKEEPING_FLAGS	(BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU))
+#define	HOUSEKEEPING_FLAGS	(BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU) \
+		| BIT(HK_TYPE_MANAGED_IRQ))
 
 /*
  * Partition root states:
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 85a17d39d8bb0..b0503ed362fce 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -135,6 +135,43 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
 	}
 }
 
+static int irq_affinity_adjust(cpumask_var_t disable_mask)
+{
+	unsigned int irq;
+	cpumask_var_t mask;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	irq_lock_sparse();
+	for_each_active_irq(irq) {
+		struct irq_desc *desc = irq_to_desc(irq);
+
+		raw_spin_lock_irq(&desc->lock);
+		struct irq_data *data = irq_desc_get_irq_data(desc);
+
+		if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
+			irq_data_get_affinity_mask(data))) {
+
+			cpumask_and(mask, cpu_online_mask, irq_default_affinity);
+			cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));
+			irq_set_affinity_locked(data, mask, true);
+			WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
+						disable_mask));
+			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+						cpu_online_mask));
+			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
+						housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));
+		}
+		raw_spin_unlock_irq(&desc->lock);
+	}
+	irq_unlock_sparse();
+
+	free_cpumask_var(mask);
+
+	return 0;
+}
+
 /*
  * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
  * change.
@@ -144,6 +181,8 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
  */
 static int housekeeping_update(enum hk_type type, cpumask_var_t update)
 {
+	int err = 0;
+
 	struct {
 		struct cpumask changed;
 		struct cpumask enable;
@@ -171,11 +210,14 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
 		lockup_detector_reconfigure();
 #endif
 		break;
+	case HK_TYPE_MANAGED_IRQ:
+		err = irq_affinity_adjust(&masks->disable);
+		break;
 	default:
 	}
 	kfree(masks);
 
-	return 0;
+	return err;
 }
 
 static int __init housekeeping_setup(char *str, unsigned long flags)
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 5/7] [NOT-FOR-MERGE] test timers affinity adjustment
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
                   ` (3 preceding siblings ...)
  2024-05-16 19:04 ` [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs " Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 6/7] [NOT-FOR-MERGE] test timers and hrtimers " Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 7/7] [NOT-FOR-MERGE] test managed irqs " Costa Shulyupin
  6 siblings, 0 replies; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

There must be no timers on the specified CPU
at the end of timers_resettle_from_cpu().

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 kernel/time/timer.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d15c0e7b0550..b7d253d230453 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2658,6 +2658,27 @@ int timers_prepare_cpu(unsigned int cpu)
 	return 0;
 }
 
+static int count_timers(int cpu)
+{
+	struct timer_base *base;
+	int b, v, count = 0;
+
+	for (b = 0; b < NR_BASES; b++) {
+		base = per_cpu_ptr(&timer_bases[b], cpu);
+		raw_spin_lock_irq(&base->lock);
+
+		for (v = 0; v < WHEEL_SIZE; v++) {
+			struct hlist_node *c;
+
+			hlist_for_each(c, base->vectors + v)
+				count++;
+		}
+		raw_spin_unlock_irq(&base->lock);
+	}
+
+	return count;
+}
+
 /**
  * timers_resettle_from_cpu - resettles timers from
  * specified cpu to housekeeping cpus.
@@ -2697,6 +2718,7 @@ void timers_resettle_from_cpu(unsigned int cpu)
 		raw_spin_unlock_irq(&new_base->lock);
 	}
 	local_irq_enable();
+	WARN_ON(count_timers(cpu));
 }
 
 int timers_dead_cpu(unsigned int cpu)
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 6/7] [NOT-FOR-MERGE] test timers and hrtimers affinity adjustment
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
                   ` (4 preceding siblings ...)
  2024-05-16 19:04 ` [PATCH v1 5/7] [NOT-FOR-MERGE] test timers affinity adjustment Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  2024-05-16 19:04 ` [PATCH v1 7/7] [NOT-FOR-MERGE] test managed irqs " Costa Shulyupin
  6 siblings, 0 replies; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Add a kernel module that generates timers and hrtimers for the test,
along with shell commands to run it.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 tests/timers.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100644 tests/timers.c

diff --git a/tests/timers.c b/tests/timers.c
new file mode 100644
index 0000000000000..bf6ef3244bc05
--- /dev/null
+++ b/tests/timers.c
@@ -0,0 +1,58 @@
+#include <linux/timer.h>
+#include <linux/hrtimer.h>
+#include <linux/module.h>
+
+/*
+ * Testing instructions:
+ *
+ * isolate=1
+ * insmod timers.ko test_cpu=$isolate
+ * cd /sys/fs/cgroup/
+ * echo +cpuset > cgroup.subtree_control
+ * mkdir test
+ * echo isolated > test/cpuset.cpus.partition
+ * echo $isolate > test/cpuset.cpus
+ *
+ * awk "/cpu:/{y=0};/cpu: $isolate\$/{y=1};/ #[0-9]/ && y;" /proc/timer_list \
+ * 	| grep -q test_hrtimer_cb && echo FAIL || echo PASS
+ *
+ * Ensure that there are no timers on the isolated CPU.
+ */
+
+static void test_timer_cb(struct timer_list *unused) { }
+
+static struct timer_list test_timer;
+
+static struct hrtimer test_hrtimer;
+
+static enum hrtimer_restart test_hrtimer_cb(struct hrtimer *hr_timer)
+{
+	return HRTIMER_NORESTART;
+}
+
+static int test_cpu = 1;
+module_param(test_cpu, int, 0444);
+
+static int timers_init(void)
+{
+	set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
+	timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
+	test_timer.expires = KTIME_MAX;
+	add_timer(&test_timer);
+
+	hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	test_hrtimer.function = test_hrtimer_cb;
+	hrtimer_start(&test_hrtimer, -1, HRTIMER_MODE_REL);
+	return 0;
+}
+
+static void timers_exit(void)
+{
+	del_timer(&test_timer);
+	hrtimer_cancel(&test_hrtimer);
+}
+
+module_init(timers_init);
+module_exit(timers_exit);
+
+MODULE_LICENSE("GPL");
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 7/7] [NOT-FOR-MERGE] test managed irqs affinity adjustment
  2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
                   ` (5 preceding siblings ...)
  2024-05-16 19:04 ` [PATCH v1 6/7] [NOT-FOR-MERGE] test timers and hrtimers " Costa Shulyupin
@ 2024-05-16 19:04 ` Costa Shulyupin
  6 siblings, 0 replies; 14+ messages in thread
From: Costa Shulyupin @ 2024-05-16 19:04 UTC (permalink / raw)
  To: longman, pauld, juri.lelli, prarit, vschneid, Anna-Maria Behnsen,
	Frederic Weisbecker, Thomas Gleixner, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Add a kernel module that generates managed IRQs for the test,
along with shell commands to run it.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
---
 tests/managed_irq.c | 71 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 tests/managed_irq.c

diff --git a/tests/managed_irq.c b/tests/managed_irq.c
new file mode 100644
index 0000000000000..fd4e91e44e59d
--- /dev/null
+++ b/tests/managed_irq.c
@@ -0,0 +1,71 @@
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/module.h>
+
+/*
+ * Testing instructions:
+ *
+ * isolate=1
+ * insmod managed_irq.ko test_cpu=$isolate
+ * cd /sys/fs/cgroup/
+ * echo +cpuset > cgroup.subtree_control
+ * mkdir test
+ * echo isolated > test/cpuset.cpus.partition
+ * echo $isolate > test/cpuset.cpus
+ * read test_irq < /sys/module/managed_irq/parameters/test_irq
+ *
+ * cat /proc/irq/$test_irq/smp_affinity_list
+ *
+ * read affinity < /proc/irq/$test_irq/smp_affinity
+ * test 0 -ne $((0x$affinity & 1 << $isolate)) && echo FAIL || echo PASS
+ *
+ * Ensure that the irq affinity does not contain the isolated CPU.
+ */
+
+static int test_cpu = 1;
+module_param(test_cpu, int, 0444);
+
+static int test_irq;
+module_param(test_irq, int, 0444);
+
+static irqreturn_t test_irq_cb(int irq, void *dev_id)
+{
+	return IRQ_HANDLED;
+}
+
+static int test_set_affinity(struct irq_data *d, const struct cpumask *m, bool f)
+{
+	irq_data_update_effective_affinity(d, m);
+	return 0;
+}
+
+static int make_test_irq(void)
+{
+	struct irq_affinity_desc a = {
+		mask: *cpumask_of(test_cpu),
+		is_managed: true
+	};
+	int test_irq = __irq_alloc_descs(-1, 1, 1, 0, THIS_MODULE, &a);
+	static struct irq_chip test_chip = { .irq_set_affinity =
+						     test_set_affinity };
+
+	irq_set_chip(test_irq, &test_chip);
+	irq_set_status_flags(test_irq, IRQ_MOVE_PCNTXT);
+	pr_debug("test_irq=%d\n", test_irq);
+	int err = request_irq(test_irq, test_irq_cb, 0, "test_affinity", 0);
+
+	if (err)
+		pr_err("%d\n", err);
+
+	return test_irq;
+}
+
+static int managed_irq_init(void)
+{
+	test_irq = make_test_irq();
+	return 0;
+}
+
+module_init(managed_irq_init);
+
+MODULE_LICENSE("GPL");
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation
  2024-05-16 19:04 ` [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation Costa Shulyupin
@ 2024-05-17 21:37   ` Thomas Gleixner
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-17 21:37 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Introduce infrastructure function housekeeping_update() to change
> housekeeping_cpumask during runtime and adjust affinities of depended
> subsystems.
>
> Affinity adjustments of subsystems follow in subsequent patches.
>
> Parent patch:
> "sched/isolation: Exclude dynamically isolated CPUs from housekeeping masks"
> https://lore.kernel.org/lkml/20240229021414.508972-2-longman@redhat.com/
>
> Test example for cgroup2:
>
> cd /sys/fs/cgroup/
> echo +cpuset > cgroup.subtree_control
> mkdir test
> echo isolated > test/cpuset.cpus.partition
> echo $isolate > test/cpuset.cpus

This changelog is not telling me anything. Please see
Documentation/process/ what changelogs should contain.

> +/*
> + * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
> + * change.
> + *
> + * Assuming cpuset_mutex is held in sched_partition_write or
> + * cpuset_write_resmask.

Locking cannot be assumed. lockdep_assert_held() is there to document
and enforce such requirements.

> + */
> +static int housekeeping_update(enum hk_type type, cpumask_var_t update)

Please use 'struct cpumask *update' as it makes it clear what this is
about. cpumask_var_t is a hack to make onstack and embedded cpumask and
their allocated counterparts possible without #ifdeffery in the code.

But any function which is not related to alloc/free of cpumask_var_t
should simply use 'struct cpumask *' as argument type.

> +	housekeeping.flags |= BIT(type);

The existing code uses WRITE_ONCE() probably for a reason. Why is that
no longer required here?

>  static int __init housekeeping_setup(char *str, unsigned long flags)
>  {
>  	cpumask_var_t non_housekeeping_mask, housekeeping_staging;
> @@ -314,9 +347,12 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
>  		/*
>  		 * Reset housekeeping to bootup default
>  		 */
> -		for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX)
> -			cpumask_copy(housekeeping.cpumasks[type],
> -				     housekeeping_boot.cpumasks[type]);
> +		for_each_set_bit(type, &housekeeping_boot.flags, HK_TYPE_MAX) {
> +			int err = housekeeping_update(type, housekeeping_boot.cpumasks[type]);
> +
> +			if (err)
> +				return err;
> +		}
>  
>  		WRITE_ONCE(housekeeping.flags, housekeeping_boot.flags);
>  		if (!housekeeping_boot.flags &&
> @@ -344,9 +380,11 @@ int housekeeping_exlude_isolcpus(const struct cpumask *isolcpus, unsigned long f
>  		cpumask_andnot(tmp_mask, src_mask, isolcpus);
>  		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
>  			return -EINVAL;	/* Invalid isolated CPUs */
> -		cpumask_copy(housekeeping.cpumasks[type], tmp_mask);
> +		int err = housekeeping_update(type, tmp_mask);
> +
> +		if (err)
> +			return err;

Do we really need two places to define 'int err' or might it be possible
to have one instance defined at function scope?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask
  2024-05-16 19:04 ` [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask Costa Shulyupin
@ 2024-05-17 22:39   ` Thomas Gleixner
  2024-05-17 22:52     ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-17 22:39 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Adjust affinity timers and watchdog_cpumask according to
> change of housekeeping.cpumasks[HK_TYPE_TIMER] during runtime.

Timers and watchdog are two different things. It's clearly documented
that a patch should change one thing and not combine random stuff.

> watchdog_cpumask is initialized during boot in lockup_detector_init()
> from housekeeping_cpumask(HK_TYPE_TIMER).
>
> lockup_detector_reconfigure() utilizes updated watchdog_cpumask
> via __lockup_detector_reconfigure().

You describe WHAT the patch is doing, but there is no WHY and zero
rationale why this is correct.

> timers_resettle_from_cpu() is blindly prototyped from timers_dead_cpu().
> local_irq_disable is used because cpuhp_thread_fun uses it before
> cpuhp_invoke_callback() call.

I have no idea what this word salad is trying to tell me.

The local_irq_disable() in cpuhp_thread_fun() has absolutely nothing to
do with timers_dead_cpu().

> Core test snippets without infrastructure:
>
> 1. Create timer on specific cpu with:
>
> 	timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
>         test_timer.expires = KTIME_MAX;
>         add_timer_on(&test_timer, test_cpu);
>
> 2. Call housekeeping_update()
>
> 3. Assure that there is no timers on specified cpu at the end
> of timers_resettle_from_cpu() with:
>
> static int count_timers(int cpu)
> {
> 	struct timer_base *base;
> 	int b, v, count = 0;
>
> 	for (b = 0; b < NR_BASES; b++) {
> 		base = per_cpu_ptr(&timer_bases[b], cpu);
> 		raw_spin_lock_irq(&base->lock);
>
> 		for (v = 0; v < WHEEL_SIZE; v++) {
> 			struct hlist_node *c;
>
> 			hlist_for_each(c, base->vectors + v)
> 				count++;
> 		}
> 		raw_spin_unlock_irq(&base->lock);
> 	}
>
> 	return count;
> }

And that snippet in the change log is magically providing a unit test or what?

> diff --git a/init/Kconfig b/init/Kconfig
> index 72404c1f21577..fac49c6bb965a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -682,6 +682,7 @@ config CPU_ISOLATION
>  	bool "CPU isolation"
>  	depends on SMP || COMPILE_TEST
>  	default y
> +	select HOTPLUG_CPU

Why?

> +#ifdef CONFIG_LOCKUP_DETECTOR

That ifdef is required because?

> +#include <linux/nmi.h>
> +#endif
> +
>  enum hk_flags {
>  	HK_FLAG_TIMER		= BIT(HK_TYPE_TIMER),
>  	HK_FLAG_RCU		= BIT(HK_TYPE_RCU),
> @@ -116,6 +120,19 @@ static void __init housekeeping_setup_type(enum hk_type type,
>  		     housekeeping_staging);
>  }
>  
> +static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable_mask)
> +{
> +	unsigned int cpu;
> +
> +	for_each_cpu(cpu, enable_mask)	{

Pointless bracket

> +		timers_prepare_cpu(cpu);

Seriously? You reset the timer base of an online CPU?

What's the point of this exercise? The timer base is initialized and in
consistent state. The timer base of an isolated CPU can have active
pinned timers on it.

So going there and resetting state without locking is definitely a
brilliant idea.

> +	for_each_cpu(cpu, disable_mask) {
> +		timers_resettle_from_cpu(cpu);
> +	}
> +}
> +
>  /*
>   * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
>   * change.
> @@ -144,6 +161,16 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
>  	if (!static_branch_unlikely(&housekeeping_overridden))
>  		static_key_enable_cpuslocked(&housekeeping_overridden.key);
>  
> +	switch (type) {
> +	case HK_TYPE_TIMER:
> +		resettle_all_timers(&masks->enable, &masks->disable);
> +#ifdef CONFIG_LOCKUP_DETECTOR
> +		cpumask_copy(&watchdog_cpumask, housekeeping_cpumask(HK_TYPE_TIMER));
> +		lockup_detector_reconfigure();
> +#endif

What's wrong with adding a function

void lockup_detector_update_mask(const struct cpumask *msk);

and having an empty stub for it when CONFIG_LOCKUP_DETECTOR=n?

That spares all the ugly ifdeffery and the nmi.h include in one go.
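
i.e. something along the lines of this untested sketch, assuming the
helper simply copies the mask and reuses lockup_detector_reconfigure()
as the hunk above does:

/* <linux/nmi.h> */
#ifdef CONFIG_LOCKUP_DETECTOR
void lockup_detector_update_mask(const struct cpumask *msk);
#else
static inline void lockup_detector_update_mask(const struct cpumask *msk) { }
#endif

/* kernel/watchdog.c */
void lockup_detector_update_mask(const struct cpumask *msk)
{
	cpumask_copy(&watchdog_cpumask, msk);
	lockup_detector_reconfigure();
}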

> +		break;
> +	default:
> +	}
>  	kfree(masks);
>  
>  	return 0;
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 48288dd4a102f..2d15c0e7b0550 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -51,6 +51,7 @@
>  #include <asm/div64.h>
>  #include <asm/timex.h>
>  #include <asm/io.h>
> +#include <linux/sched/isolation.h>

What needs this include?
  
>  #include "tick-internal.h"
>  #include "timer_migration.h"
> @@ -2657,6 +2658,47 @@ int timers_prepare_cpu(unsigned int cpu)
>  	return 0;
>  }
>  
> +/**
> + * timers_resettle_from_cpu - resettles timers from
> + * specified cpu to housekeeping cpus.
> + */
> +void timers_resettle_from_cpu(unsigned int cpu)
> +{
> +	struct timer_base *old_base;
> +	struct timer_base *new_base;
> +	int b, i;
> +
> +	local_irq_disable();

What for?

> +	for (b = 0; b < NR_BASES; b++) {

You cannot blindly move pinned timers away from a CPU. That's the last
resort which is used in the CPU hotplug case, but the CPU is not going
out in the dynamic change case and the pinned timer might be there for a
reason.

> +		old_base = per_cpu_ptr(&timer_bases[b], cpu);
> +		new_base = per_cpu_ptr(&timer_bases[b],
> +				cpumask_any_and(cpu_active_mask,
> +					housekeeping_cpumask(HK_TYPE_TIMER)));

The cpumask needs to be reevaluated for every base because you blindly
copied the code from timers_dead_cpu(), right?

> +		/*
> +		 * The caller is globally serialized and nobody else
> +		 * takes two locks at once, deadlock is not possible.
> +		 */
> +		raw_spin_lock_irq(&new_base->lock);

Here you disable interrupts again and further down you enable them again
when dropping the lock. So on loop exit this does an imbalanced
local_irq_enable(), no? What's the point of this local_irq_dis/enable()
pair around the loop?



> +		raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
> +
> +		/*
> +		 * The current CPUs base clock might be stale. Update it

What guarantees that this is the current CPUs timer base? Nothing...

> +		 * before moving the timers over.
> +		 */
> +		forward_timer_base(new_base);
> +
> +		WARN_ON_ONCE(old_base->running_timer);
> +		old_base->running_timer = NULL;
> +
> +		for (i = 0; i < WHEEL_SIZE; i++)
> +			migrate_timer_list(new_base, old_base->vectors + i);
> +
> +		raw_spin_unlock(&old_base->lock);
> +		raw_spin_unlock_irq(&new_base->lock);
> +	}
> +	local_irq_enable();
> +}

It's absolutely not rocket science to reuse timers_dead_cpu() for this.

The only requirement timers_dead_cpu() has is that the CPU to which the
timers are migrated is not going away. That's already given due to the
hotplug context.

The reason why it uses get_cpu_ptr(), which disables preemption and
therefore migration, is that it wants to avoid moving the timers to a
remote CPU as that has implications vs. NOHZ because it might end up
kicking the remote CPU out of idle.

timers_dead_cpu() could be simply modified to:

void timers_dead_cpu(unsigned int cpu)
{
    migrate_disable();
    timers_migrate_from_cpu(cpu, BASE_LOCAL);
    migrate_enable();
}

and have

#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_ISOLATION)
static void timers_migrate_from_cpu(unsigned int cpu, unsigned int base)
{
        lockdep_assert(migration_disabled());

        for (; base < NR_BASES; base++) {
		old_base = per_cpu_ptr(&timer_bases[b], cpu);
		new_base = this_cpu_ptr(&timer_bases[b]);
        	....
	}
}
#endif

Now that isolation muck just has to do:

    1) Ensure that the CPU it runs on is a housekeeping CPU

    2) Invoke timers_migrate_to_hkcpu(cpu) which is in timer.c

       #ifdef CONFIG_ISOLATION
       void timers_migrate_to_hkcpu(unsigned int cpu)
       {
       		timers_migrate_from_cpu(cpu, BASE_GLOBAL);
       }
       #endif

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask
  2024-05-17 22:39   ` Thomas Gleixner
@ 2024-05-17 22:52     ` Thomas Gleixner
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-17 22:52 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

On Sat, May 18 2024 at 00:39, Thomas Gleixner wrote:
> On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
>> Adjust affinity timers and watchdog_cpumask according to
>> change of housekeeping.cpumasks[HK_TYPE_TIMER] during runtime.
>
> Timers and watchdog are two different things. It's clearly documented
> that a patch should change one thing and not combine random stuff.
>
>> watchdog_cpumask is initialized during boot in lockup_detector_init()
>> from housekeeping_cpumask(HK_TYPE_TIMER).
>>
>> lockup_detector_reconfigure() utilizes updated watchdog_cpumask
>> via __lockup_detector_reconfigure().
>
> You describe WHAT the patch is doing, but there is no WHY and zero
> rationale why this is correct.
>
>> timers_resettle_from_cpu() is blindly prototyped from timers_dead_cpu().
>> local_irq_disable is used because cpuhp_thread_fun uses it before
>> cpuhp_invoke_callback() call.
>
> I have no idea what this word salad is trying to tell me.
>
> The local_irq_disable() in cpuhp_thread_fun() has absolutely nothing to
> do with timers_dead_cpu().
>
>> Core test snippets without infrastructure:
>>
>> 1. Create timer on specific cpu with:
>>
>> 	timer_setup(&test_timer, test_timer_cb, TIMER_PINNED);
>>         test_timer.expires = KTIME_MAX;
>>         add_timer_on(&test_timer, test_cpu);
>>
>> 2. Call housekeeping_update()
>>
>> 3. Assure that there is no timers on specified cpu at the end
>> of timers_resettle_from_cpu() with:
>>
>> static int count_timers(int cpu)
>> {
>> 	struct timer_base *base;
>> 	int b, v, count = 0;
>>
>> 	for (b = 0; b < NR_BASES; b++) {
>> 		base = per_cpu_ptr(&timer_bases[b], cpu);
>> 		raw_spin_lock_irq(&base->lock);
>>
>> 		for (v = 0; v < WHEEL_SIZE; v++) {
>> 			struct hlist_node *c;
>>
>> 			hlist_for_each(c, base->vectors + v)
>> 				count++;
>> 		}
>> 		raw_spin_unlock_irq(&base->lock);
>> 	}
>>
>> 	return count;
>> }
>
> And that snippet in the change log is magically providing a unit test or what?
>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 72404c1f21577..fac49c6bb965a 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -682,6 +682,7 @@ config CPU_ISOLATION
>>  	bool "CPU isolation"
>>  	depends on SMP || COMPILE_TEST
>>  	default y
>> +	select HOTPLUG_CPU
>
> Why?
>
>> +#ifdef CONFIG_LOCKUP_DETECTOR
>
> That ifdef is required because?
>
>> +#include <linux/nmi.h>
>> +#endif
>> +
>>  enum hk_flags {
>>  	HK_FLAG_TIMER		= BIT(HK_TYPE_TIMER),
>>  	HK_FLAG_RCU		= BIT(HK_TYPE_RCU),
>> @@ -116,6 +120,19 @@ static void __init housekeeping_setup_type(enum hk_type type,
>>  		     housekeeping_staging);
>>  }
>>  
>> +static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable_mask)
>> +{
>> +	unsigned int cpu;
>> +
>> +	for_each_cpu(cpu, enable_mask)	{
>
> Pointless bracket
>
>> +		timers_prepare_cpu(cpu);
>
> Seriously? You reset the timer base of an online CPU?
>
> What's the point of this exercise? The timer base is initialized and in
> consistent state. The timer base of an isolated CPU can have active
> pinned timers on it.
>
> So going there and resetting state without locking is definitely a
> brilliant idea.
>
>> +	for_each_cpu(cpu, disable_mask) {
>> +		timers_resettle_from_cpu(cpu);
>> +	}
>> +}
>> +
>>  /*
>>   * housekeeping_update - change housekeeping.cpumasks[type] and propagate the
>>   * change.
>> @@ -144,6 +161,16 @@ static int housekeeping_update(enum hk_type type, cpumask_var_t update)
>>  	if (!static_branch_unlikely(&housekeeping_overridden))
>>  		static_key_enable_cpuslocked(&housekeeping_overridden.key);
>>  
>> +	switch (type) {
>> +	case HK_TYPE_TIMER:
>> +		resettle_all_timers(&masks->enable, &masks->disable);
>> +#ifdef CONFIG_LOCKUP_DETECTOR
>> +		cpumask_copy(&watchdog_cpumask, housekeeping_cpumask(HK_TYPE_TIMER));
>> +		lockup_detector_reconfigure();
>> +#endif
>
> What's wrong with adding a function
>
> void lockup_detector_update_mask(const struct cpumask *msk);
>
> and having an empty stub for it when CONFIG_LOCKUP_DETECTOR=n?
>
> That spares all the ugly ifdeffery and the nmi.h include in one go.
>
>> +		break;
>> +	default:
>> +	}
>>  	kfree(masks);
>>  
>>  	return 0;
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 48288dd4a102f..2d15c0e7b0550 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -51,6 +51,7 @@
>>  #include <asm/div64.h>
>>  #include <asm/timex.h>
>>  #include <asm/io.h>
>> +#include <linux/sched/isolation.h>
>
> What needs this include?
>   
>>  #include "tick-internal.h"
>>  #include "timer_migration.h"
>> @@ -2657,6 +2658,47 @@ int timers_prepare_cpu(unsigned int cpu)
>>  	return 0;
>>  }
>>  
>> +/**
>> + * timers_resettle_from_cpu - resettles timers from
>> + * specified cpu to housekeeping cpus.
>> + */
>> +void timers_resettle_from_cpu(unsigned int cpu)
>> +{
>> +	struct timer_base *old_base;
>> +	struct timer_base *new_base;
>> +	int b, i;
>> +
>> +	local_irq_disable();
>
> What for?
>
>> +	for (b = 0; b < NR_BASES; b++) {
>
> You cannot blindly move pinned timers away from a CPU. That's the last
> resort which is used in the CPU hotplug case, but the CPU is not going
> out in the dynamic change case and the pinned timer might be there for a
> reason.
>
>> +		old_base = per_cpu_ptr(&timer_bases[b], cpu);
>> +		new_base = per_cpu_ptr(&timer_bases[b],
>> +				cpumask_any_and(cpu_active_mask,
>> +					housekeeping_cpumask(HK_TYPE_TIMER)));
>
> The cpumask needs to be reevaluated for every base because you blindly
> copied the code from timers_dead_cpu(), right?
>
>> +		/*
>> +		 * The caller is globally serialized and nobody else
>> +		 * takes two locks at once, deadlock is not possible.
>> +		 */
>> +		raw_spin_lock_irq(&new_base->lock);
>
> Here you disable interrupts again and further down you enable them again
> when dropping the lock. So on loop exit this does an imbalanced
> local_irq_enable(), no? What's the point of this local_irq_dis/enable()
> pair around the loop?
>
>
>
>> +		raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
>> +
>> +		/*
>> +		 * The current CPUs base clock might be stale. Update it
>
> What guarantees that this is the current CPUs timer base? Nothing...
>
>> +		 * before moving the timers over.
>> +		 */
>> +		forward_timer_base(new_base);
>> +
>> +		WARN_ON_ONCE(old_base->running_timer);
>> +		old_base->running_timer = NULL;
>> +
>> +		for (i = 0; i < WHEEL_SIZE; i++)
>> +			migrate_timer_list(new_base, old_base->vectors + i);
>> +
>> +		raw_spin_unlock(&old_base->lock);
>> +		raw_spin_unlock_irq(&new_base->lock);
>> +	}
>> +	local_irq_enable();
>> +}
>
> It's absolutely not rocket science to reuse timers_dead_cpu() for this.
>
> The only requirement timers_dead_cpu() has is that the CPU to which the
> timers are migrated is not going away. That's already given due to the
> hotplug context.
>
> The reason why it uses get_cpu_ptr(), which disables preemption and
> therefore migration, is that it wants to avoid moving the timers to a
> remote CPU as that has implications vs. NOHZ because it might end up
> kicking the remote CPU out of idle.
>
> timers_dead_cpu() could be simply modified to:
>
> void timers_dead_cpu(unsigned int cpu)
> {
>     migrate_disable();
>     timers_migrate_from_cpu(cpu, BASE_LOCAL);
>     migrate_enable();
> }
>
> and have
>
> #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_ISOLATION)
> static void timers_migrate_from_cpu(unsigned int cpu, unsigned int base)
> {
>         lockdep_assert(migration_disabled());
>
>         for (; base < NR_BASES; base++) {
> 		old_base = per_cpu_ptr(&timer_bases[b], cpu);
> 		new_base = this_cpu_ptr(&timer_bases[b]);
>         	....
> 	}
> }
> #endif
>
> Now that isolation muck just has to do:
>
>     1) Ensure that the CPU it runs on is a housekeeping CPU
>
>     2) Invoke timers_migrate_to_hkcpu(cpu) which is in timer.c
>
>        #ifdef CONFIG_ISOLATION
>        void timers_migrate_to_hkcpu(unsigned int cpu)
>        {
>        	timers_migrate_from_cpu(cpu, BASE_GLOBAL);
>        }
>        #endif
>
> No?

So if you don't care about the remote CPU queueing aspect, then this can
simply do:

void timers_dead_cpu(unsigned int cpu)
{
    migrate_disable();
    timers_migrate_from_cpu(cpu, smp_processor_id(), BASE_LOCAL);
    migrate_enable();
}

and have

#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_ISOLATION)
static void timers_migrate_from_cpu(unsigned int from_cpu, unsigned int to_cpu, unsigned int base)
{

        for (; base < NR_BASES; base++) {
		old_base = per_cpu_ptr(&timer_bases[b], from_cpu);
		new_base = per_cpu_ptr(&timer_bases[b], to_cpu);
        	....
	}
}
#endif

Now that isolation muck just has to do:

    Invoke timers_migrate_to_hkcpu(cpu) which is in timer.c

       #ifdef CONFIG_ISOLATION
       void timers_migrate_to_hkcpu(unsigned int from)
       {
              unsigned int to = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER)));

              timers_migrate_from_cpu(from, to, BASE_GLOBAL);
       }
       #endif

See?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers according to change of housekeeping cpumask
  2024-05-16 19:04 ` [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers " Costa Shulyupin
@ 2024-05-17 23:42   ` Thomas Gleixner
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-17 23:42 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Adjust affinity of watchdog_cpumask, hrtimers according to
> change of housekeeping.cpumasks[HK_TYPE_TIMER].
>
> Function migrate_hrtimer_list_except() is prototyped from
> migrate_hrtimer_list() and is more generic.
>
> Potentially it can be used instead of migrate_hrtimer_list.
>
> Function hrtimers_resettle_from_cpu() is blindly prototyped
> from hrtimers_cpu_dying(). local_irq_disable() is used because
> cpuhp_thread_fun() uses it before cpuhp_invoke_callback().

I'm again impressed by the depth of analysis.

Q: What the heck has cpuhp_thread_fun() to do with hrtimers_cpu_dying()?

A: Absolutely nothing.

> Core test snippets without infrastructure:
>
> 1. Create hrtimer on specific cpu with:
>
>         set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
>         hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>         test_hrtimer.function = test_hrtimer_cb;
>         hrtimer_start(&test_hrtimer, -1,  HRTIMER_MODE_REL);
>
> 2. Call housekeeping_update()
>
> 3. Assure that there is only tick_nohz_handler on specified cpu
> in /proc/timer_list manually or with script:
>
> grep -E 'cpu| #[0-9]' /proc/timer_list | \
> 	awk "/cpu:/{y=0};/cpu: $test_cpu\$/{y=1};y"
>
> Another alternative solution to migrate hrtimers:
> 1. Use cpuhp to set sched_timer offline
> 2. Resettle all hrtimers likewise migrate_hrtimer_list
> 3. Use cpuhp to set sched_timer online

Another pile of incomprehensible word salad which pretends to be a
change log.

> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -126,10 +126,12 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
>  
>  	for_each_cpu(cpu, enable_mask)	{
>  		timers_prepare_cpu(cpu);
> +		hrtimers_prepare_cpu(cpu);

And yet another lockless re-initialization of an active in use data
structure ...

> +/*
> + * migrate_hrtimer_list_except - migrates hrtimers from one base to another,
> + * except specified one.
> + */
> +static void migrate_hrtimer_list_except(struct hrtimer_clock_base *old_base,
> +				struct hrtimer_clock_base *new_base, struct hrtimer *except)
> +{
> +	struct hrtimer *timer;
> +	struct timerqueue_node *node;

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
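
The rule the link points at is the reverse fir tree ordering of local
variable declarations, i.e. for the two locals above (purely an
illustration of the requested layout, not new code):

	struct timerqueue_node *node;
	struct hrtimer *timer;

Longest declaration line first, shortest last.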

> +	node = timerqueue_getnext(&old_base->active);
> +	while (node) {
> +		timer = container_of(node, struct hrtimer, node);
> +		node = timerqueue_iterate_next(node);
> +		if (timer == except)
> +			continue;

What's the rationale that there is only a single timer to except from
the migration?

> +		BUG_ON(hrtimer_callback_running(timer));

Q: What prevents a hrtimer callback from running on the CPU which was
   not isolated before?

A: Nothing. Ergo this is prone to kill a perfectly correct system just
   because of blindly copying something.

   At least your attempt of a changelog is clear about that...

> +		debug_deactivate(timer);
> +
> +		/*
> +		 * Mark it as ENQUEUED not INACTIVE otherwise the
> +		 * timer could be seen as !active and just vanish away
> +		 * under us on another CPU
> +		 */
> +		__remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
> +		timer->base = new_base;
> +		/*
> +		 * Enqueue the timers on the new cpu. This does not
> +		 * reprogram the event device in case the timer
> +		 * expires before the earliest on this CPU, but we run
> +		 * hrtimer_interrupt after we migrated everything to
> +		 * sort out already expired timers and reprogram the
> +		 * event device.
> +		 */
> +		enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
> +	}
> +}
> +
> +/**
> + * hrtimers_resettle_from_cpu - resettles hrtimers from
> + * specified cpu to housekeeping cpus.
> + */
> +void hrtimers_resettle_from_cpu(unsigned int isol_cpu)
> +{
> +	int ncpu, i;
> +	struct tick_sched *ts = tick_get_tick_sched(isol_cpu);
> +	struct hrtimer_cpu_base *old_base, *new_base;
> +
> +	local_irq_disable();
> +	ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
> +
> +	old_base = &per_cpu(hrtimer_bases, isol_cpu);
> +	new_base = &per_cpu(hrtimer_bases, ncpu);
> +
> +	/*
> +	 * The caller is globally serialized and nobody else
> +	 * takes two locks at once, deadlock is not possible.
> +	 */
> +	raw_spin_lock(&old_base->lock);
> +	raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
> +	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> +		migrate_hrtimer_list_except(&old_base->clock_base[i],
> +				     &new_base->clock_base[i],
> +				     &ts->sched_timer);
> +	}
> +
> +	/*
> +	 * The migration might have changed the first expiring softirq
> +	 * timer on this CPU. Update it.
> +	 */
> +	__hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
> +
> +	raw_spin_unlock(&new_base->lock);
> +	raw_spin_unlock(&old_base->lock);
> +	local_irq_enable();
> +
> +	/* Tell the other CPU to retrigger the next event */
> +	smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
> +}

We clearly need another copy of hrtimers_cpu_dying() and
migrate_hrtimer_list() to add a local_irq_dis/enable() pair and a
completely ill defined exclusion mechanism which assumes that the tick
hrtimer is the only one which has to be excluded from migration.

Even if the exclusion mechanism were correct, there is ZERO
justification for this copy and pasta orgy, even if you marked this
series RFC. RFC != POC hackery.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
  2024-05-16 19:04 ` [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs " Costa Shulyupin
@ 2024-05-18  1:17   ` Thomas Gleixner
  2024-05-18  1:25     ` Thomas Gleixner
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-18  1:17 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> irq_affinity_adjust() is prototyped from irq_affinity_online_cpu()
> and irq_restore_affinity_of_irq().

I'm used to this "prototyped" phrase by now. It still does not justify
exposing me to this POC hackery.

My previous comments about change logs still apply.

> +static int irq_affinity_adjust(cpumask_var_t disable_mask)
> +{
> +	unsigned int irq;
> +	cpumask_var_t mask;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	irq_lock_sparse();
> +	for_each_active_irq(irq) {
> +		struct irq_desc *desc = irq_to_desc(irq);
> +
> +		raw_spin_lock_irq(&desc->lock);

That's simply broken. This is not CPU hotplug on an outgoing CPU. Why
are you assuming that your isolation change code can rely on the
implicit guarantees of CPU hot(un)plug?

Also there is a reason why interrupt-related code is in kernel/irq/* and
not in some random other location. Even if C allows you to fiddle with
everything, that does not mean that hiding random hacks in other places
is correct in any way.

> +		struct irq_data *data = irq_desc_get_irq_data(desc);
> +
> +		if (irqd_affinity_is_managed(data) && cpumask_weight_and(disable_mask,
> +			irq_data_get_affinity_mask(data))) {

Interrupt target isolation is only relevant for managed interrupts and
non-managed interrupts clearly are going to migrate themselves away
magically, right?

> +
> +			cpumask_and(mask, cpu_online_mask, irq_default_affinity);
> +			cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ));

There are clearly a lot of comments explaining what this is doing and
why it is correct, as there is a guarantee that these masks overlap by
definition.
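
To spell it out: nothing guarantees that this intersection is non-empty
and nothing here handles the case where it is. A hedged sketch of the
missing check (illustrative only, not code from this patch):

		cpumask_and(mask, cpu_online_mask, irq_default_affinity);
		if (!cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_MANAGED_IRQ))) {
			/* No online housekeeping CPU left to target */
			...
		}

cpumask_and() returns false when the resulting mask is empty, so the
error path at least becomes explicit instead of silently handing an
empty mask to irq_set_affinity_locked().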

> +			irq_set_affinity_locked(data, mask, true);

Plus the extensive explanation why using 'force=true' is even remotely
correct here.

I concede that the documentation of that function and its arguments is
close to non-existent, but if you follow the call chain of that function
there are enough hints down the road, no?

> +			WARN_ON(cpumask_weight_and(irq_data_get_effective_affinity_mask(data),
> +						disable_mask));
> +			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> +						cpu_online_mask));
> +			WARN_ON(!cpumask_subset(irq_data_get_effective_affinity_mask(data),
> +						housekeeping_cpumask(HK_TYPE_MANAGED_IRQ)));

These warnings are required and useful within the spinlock-held,
interrupt-disabled section because of what?

 - Because the resulting stack trace provides a well known call chain

 - Because the resulting warnings do not tell anything about the
   affected interrupt number

 - Because the resulting warnings do not tell anything about the CPU
   masks which cause the problem

 - Because the aggregate information of the above is utterly useless

Impressive...
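
For contrast, a diagnostic that actually carries the relevant pieces of
information would look roughly like this (message text and exact
placement are illustrative only):

		pr_warn("irq %u: effective affinity %*pbl still intersects disabled set %*pbl\n",
			irq,
			cpumask_pr_args(irq_data_get_effective_affinity_mask(data)),
			cpumask_pr_args(disable_mask));

i.e. emit the interrupt number and the two cpumasks involved, outside of
the lock-held section if possible.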

Thanks,

       tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs according to change of housekeeping cpumask
  2024-05-18  1:17   ` Thomas Gleixner
@ 2024-05-18  1:25     ` Thomas Gleixner
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Gleixner @ 2024-05-18  1:25 UTC (permalink / raw)
  To: Costa Shulyupin, longman, pauld, juri.lelli, prarit, vschneid,
	Anna-Maria Behnsen, Frederic Weisbecker, Zefan Li, Tejun Heo,
	Johannes Weiner, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Petr Mladek, Andrew Morton,
	Masahiro Yamada, Randy Dunlap, Yoann Congal, Gustavo A. R. Silva,
	Nhat Pham, Costa Shulyupin, linux-kernel, cgroups

Costa!

On Sat, May 18 2024 at 03:17, Thomas Gleixner wrote:
> Impressive...

Now let's take a step back because none of this makes any sense at all
conceptually.

Reconfiguring the housekeeping CPUs on a live system is expensive and a
slow-path operation no matter what.

So why inflict all of this nonsense on the kernel instead of cleverly
(ab)using CPU hotplug for it in user space:

          for_each_cpu(cpu, new_house_keeping_mask) {
          	if (cpu_ishk(cpu))
                	continue;
                cpu_offline(cpu);
                set_cpu_in_hkmask(cpu);
                cpu_online(cpu);
          }

          for_each_cpu(cpu, new_isolated_mask) {
          	if (!cpu_ishk(cpu))
                	continue;
                cpu_offline(cpu);
                clear_cpu_in_hkmask(cpu);
                cpu_online(cpu);
          }

Or something like that. You get the idea, right?

IOW, the only kernel change which is required to achieve your goal is to
ensure that changing the housekeeping/isolated property of a CPU at
runtime is only possible when the CPU is "offline".
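
The kernel-side guard for that is a one-liner in whatever function ends
up flipping the property (the function name below is made up, nothing
like it exists today):

	static int housekeeping_set_cpu_isolated(unsigned int cpu, bool isolated)
	{
		/* The property may only change while the CPU is offline */
		if (cpu_online(cpu))
			return -EBUSY;
		...
	}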

Then all of the magic things you try to solve just work out of the box
because the existing and well-exercised hotplug code takes care of it
already, no?

I might be missing something obvious as always, so feel free to educate
me on it. 

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, last message: 2024-05-18  1:25 UTC

Thread overview: 14+ messages:
2024-05-16 19:04 [PATCH v1 0/7] sched: Adjust affinity according to change of housekeeping cpumask Costa Shulyupin
2024-05-16 19:04 ` [PATCH v1 1/7] sched/isolation: Add infrastructure to adjust affinity for dynamic CPU isolation Costa Shulyupin
2024-05-17 21:37   ` Thomas Gleixner
2024-05-16 19:04 ` [PATCH v1 2/7] sched/isolation: Adjust affinity of timers according to change of housekeeping cpumask Costa Shulyupin
2024-05-17 22:39   ` Thomas Gleixner
2024-05-17 22:52     ` Thomas Gleixner
2024-05-16 19:04 ` [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers " Costa Shulyupin
2024-05-17 23:42   ` Thomas Gleixner
2024-05-16 19:04 ` [PATCH v1 4/7] sched/isolation: Adjust affinity of managed irqs " Costa Shulyupin
2024-05-18  1:17   ` Thomas Gleixner
2024-05-18  1:25     ` Thomas Gleixner
2024-05-16 19:04 ` [PATCH v1 5/7] [NOT-FOR-MERGE] test timers affinity adjustment Costa Shulyupin
2024-05-16 19:04 ` [PATCH v1 6/7] [NOT-FOR-MERGE] test timers and hrtimers " Costa Shulyupin
2024-05-16 19:04 ` [PATCH v1 7/7] [NOT-FOR-MERGE] test managed irqs " Costa Shulyupin
