* [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
@ 2024-11-05 21:48 Tejun Heo
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Tejun Heo @ 2024-11-05 21:48 UTC (permalink / raw)
To: David Vernet
Cc: linux-kernel, kernel-team, sched-ext, Andrea Righi, Changwoo Min
A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
banging on the same DSQ on a large NUMA system to the point where switching
to the bypass mode can take a long time. Turning on the bypass mode requires
dequeueing and re-enqueueing currently runnable tasks; if the DSQs that they
are on are live-locked, this can take tens of seconds, cascading into other
failures. This was observed on 2 x Intel Sapphire Rapids machines with 224
logical CPUs.
Inject artificial delays while the bypass mode is switching to guarantee
timely completion.
While at it, move __scx_ops_bypass_lock into scx_ops_bypass() and rename it
to bypass_lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Valentin Andrei <vandrei@meta.com>
Reported-by: Patrick Lu <patlu@meta.com>
---
kernel/sched/ext.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -867,8 +867,8 @@ static DEFINE_MUTEX(scx_ops_enable_mutex
DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static atomic_t scx_ops_breather_depth = ATOMIC_INIT(0);
static int scx_ops_bypass_depth;
-static DEFINE_RAW_SPINLOCK(__scx_ops_bypass_lock);
static bool scx_ops_init_task_enabled;
static bool scx_switching_all;
DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
@@ -2474,11 +2474,48 @@ static struct rq *move_task_between_dsqs
return dst_rq;
}
+/*
+ * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
+ * banging on the same DSQ on a large NUMA system to the point where switching
+ * to the bypass mode can take a long time. Inject artifical delays while the
+ * bypass mode is switching to guarantee timely completion.
+ */
+static void scx_ops_breather(struct rq *rq)
+{
+ u64 until;
+
+ lockdep_assert_rq_held(rq);
+
+ if (likely(!atomic_read(&scx_ops_breather_depth)))
+ return;
+
+ raw_spin_rq_unlock(rq);
+
+ until = ktime_get_ns() + NSEC_PER_MSEC;
+
+ do {
+ int cnt = 1024;
+ while (atomic_read(&scx_ops_breather_depth) && --cnt)
+ cpu_relax();
+ } while (atomic_read(&scx_ops_breather_depth) &&
+ time_before64(ktime_get_ns(), until));
+
+ raw_spin_rq_lock(rq);
+}
+
static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
{
struct task_struct *p;
retry:
/*
+ * This retry loop can repeatedly race against scx_ops_bypass()
+ * dequeueing tasks from @dsq trying to put the system into the bypass
+ * mode. On some multi-socket machines (e.g. 2x Intel 8480c), this can
+ * live-lock the machine into soft lockups. Give a breather.
+ */
+ scx_ops_breather(rq);
+
+ /*
* The caller can't expect to successfully consume a task if the task's
* addition to @dsq isn't guaranteed to be visible somehow. Test
* @dsq->list without locking and skip if it seems empty.
@@ -4550,10 +4587,11 @@ bool task_should_scx(struct task_struct
*/
static void scx_ops_bypass(bool bypass)
{
+ static DEFINE_RAW_SPINLOCK(bypass_lock);
int cpu;
unsigned long flags;
- raw_spin_lock_irqsave(&__scx_ops_bypass_lock, flags);
+ raw_spin_lock_irqsave(&bypass_lock, flags);
if (bypass) {
scx_ops_bypass_depth++;
WARN_ON_ONCE(scx_ops_bypass_depth <= 0);
@@ -4566,6 +4604,8 @@ static void scx_ops_bypass(bool bypass)
goto unlock;
}
+ atomic_inc(&scx_ops_breather_depth);
+
/*
* No task property is changing. We just need to make sure all currently
* queued tasks are re-queued according to the new scx_rq_bypassing()
@@ -4621,8 +4661,10 @@ static void scx_ops_bypass(bool bypass)
/* resched to restore ticks and idle state */
resched_cpu(cpu);
}
+
+ atomic_dec(&scx_ops_breather_depth);
unlock:
- raw_spin_unlock_irqrestore(&__scx_ops_bypass_lock, flags);
+ raw_spin_unlock_irqrestore(&bypass_lock, flags);
}
static void free_exit_info(struct scx_exit_info *ei)
@@ -6275,6 +6317,13 @@ static bool scx_dispatch_from_dsq(struct
raw_spin_rq_lock(src_rq);
}
+ /*
+ * If the BPF scheduler keeps calling this function repeatedly, it can
+ * cause similar live-lock conditions as consume_dispatch_q(). Insert a
+ * breather if necessary.
+ */
+ scx_ops_breather(src_rq);
+
locked_rq = src_rq;
raw_spin_lock(&src_dsq->lock);
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-05 21:48 [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching Tejun Heo
@ 2024-11-05 21:49 ` Tejun Heo
2024-11-06 21:32 ` Doug Anderson
2024-11-08 20:38 ` Tejun Heo
2024-11-05 22:03 ` [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching David Vernet
2024-11-05 23:57 ` Andrea Righi
2 siblings, 2 replies; 14+ messages in thread
From: Tejun Heo @ 2024-11-05 21:49 UTC (permalink / raw)
To: David Vernet
Cc: linux-kernel, kernel-team, sched-ext, Andrea Righi, Changwoo Min,
Douglas Anderson, Andrew Morton
On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
behaving BPF scheduler can live-lock the system by making multiple CPUs bang
on the same DSQ to the point where soft-lockup detection triggers before
SCX's own watchdog can take action. It also seems possible that the machine
can be live-locked enough to prevent scx_ops_helper, which is an RT task,
from running in a timely manner.
Implement scx_softlockup() which is called when three quarters of
soft-lockup threshold has passed. The function immediately enables the ops
breather and triggers an ops error to initiate ejection of the BPF
scheduler.
The previous and this patch combined enable the kernel to reliably recover
the system from live-lock conditions that can be triggered by a poorly
behaving BPF scheduler on Intel dual socket systems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/sched/ext.h | 2 +
kernel/sched/ext.c | 45 ++++++++++++++++++++++++++++++++++++++
kernel/watchdog.c | 8 ++++++
tools/sched_ext/scx_show_state.py | 2 +
4 files changed, 57 insertions(+)
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -205,11 +205,13 @@ struct sched_ext_entity {
void sched_ext_free(struct task_struct *p);
void print_scx_info(const char *log_lvl, struct task_struct *p);
+void scx_softlockup(u32 dur_s);
#else /* !CONFIG_SCHED_CLASS_EXT */
static inline void sched_ext_free(struct task_struct *p) {}
static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
+static inline void scx_softlockup(u32 dur_s) {}
#endif /* CONFIG_SCHED_CLASS_EXT */
#endif /* _LINUX_SCHED_EXT_H */
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -867,6 +867,7 @@ static DEFINE_MUTEX(scx_ops_enable_mutex
DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
+static unsigned long scx_in_softlockup;
static atomic_t scx_ops_breather_depth = ATOMIC_INIT(0);
static int scx_ops_bypass_depth;
static bool scx_ops_init_task_enabled;
@@ -4556,6 +4557,49 @@ bool task_should_scx(struct task_struct
}
/**
+ * scx_softlockup - sched_ext softlockup handler
+ *
+ * On some multi-socket setups (e.g. 2x Intel 8480c), the BPF scheduler can
+ * live-lock the system by making many CPUs target the same DSQ to the point
+ * where soft-lockup detection triggers. This function is called from
+ * soft-lockup watchdog when the triggering point is close and tries to unjam
+ * the system by enabling the breather and aborting the BPF scheduler.
+ */
+void scx_softlockup(u32 dur_s)
+{
+ switch (scx_ops_enable_state()) {
+ case SCX_OPS_ENABLING:
+ case SCX_OPS_ENABLED:
+ break;
+ default:
+ return;
+ }
+
+ /* allow only one instance, cleared at the end of scx_ops_bypass() */
+ if (test_and_set_bit(0, &scx_in_softlockup))
+ return;
+
+ printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
+ smp_processor_id(), dur_s, scx_ops.name);
+
+	/*
+	 * Some CPUs may be trapped in the dispatch paths. Enable breather
+	 * immediately; otherwise, we might not even be able to get to
+	 * scx_ops_bypass().
+	 */
+ atomic_inc(&scx_ops_breather_depth);
+
+ scx_ops_error("soft lockup - CPU#%d stuck for %us",
+ smp_processor_id(), dur_s);
+}
+
+static void scx_clear_softlockup(void)
+{
+ if (test_and_clear_bit(0, &scx_in_softlockup))
+ atomic_dec(&scx_ops_breather_depth);
+}
+
+/**
* scx_ops_bypass - [Un]bypass scx_ops and guarantee forward progress
*
* Bypassing guarantees that all runnable tasks make forward progress without
@@ -4665,6 +4709,7 @@ static void scx_ops_bypass(bool bypass)
atomic_dec(&scx_ops_breather_depth);
unlock:
raw_spin_unlock_irqrestore(&bypass_lock, flags);
+ scx_clear_softlockup();
}
static void free_exit_info(struct scx_exit_info *ei)
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -644,6 +644,14 @@ static int is_softlockup(unsigned long t
need_counting_irqs())
start_counting_irqs();
+ /*
+ * A poorly behaving BPF scheduler can live-lock the system into
+ * soft lockups. Tell sched_ext to try ejecting the BPF
+ * scheduler when close to a soft lockup.
+ */
+ if (time_after_eq(now, period_ts + get_softlockup_thresh() * 3 / 4))
+ scx_softlockup(now - touch_ts);
+
/* Warn about unreasonable delays. */
if (time_after(now, period_ts + get_softlockup_thresh()))
return now - touch_ts;
--- a/tools/sched_ext/scx_show_state.py
+++ b/tools/sched_ext/scx_show_state.py
@@ -35,6 +35,8 @@ print(f'enabled : {read_static_key
print(f'switching_all : {read_int("scx_switching_all")}')
print(f'switched_all : {read_static_key("__scx_switched_all")}')
print(f'enable_state : {ops_state_str(enable_state)} ({enable_state})')
+print(f'in_softlockup : {prog["scx_in_softlockup"].value_()}')
+print(f'breather_depth: {read_atomic("scx_ops_breather_depth")}')
print(f'bypass_depth : {prog["scx_ops_bypass_depth"].value_()}')
print(f'nr_rejected : {read_atomic("scx_nr_rejected")}')
print(f'enable_seq : {read_atomic("scx_enable_seq")}')
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
2024-11-05 21:48 [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching Tejun Heo
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
@ 2024-11-05 22:03 ` David Vernet
2024-11-05 23:02 ` Tejun Heo
2024-11-05 23:57 ` Andrea Righi
2 siblings, 1 reply; 14+ messages in thread
From: David Vernet @ 2024-11-05 22:03 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, kernel-team, sched-ext, Andrea Righi, Changwoo Min
[-- Attachment #1: Type: text/plain, Size: 2444 bytes --]
On Tue, Nov 05, 2024 at 11:48:11AM -1000, Tejun Heo wrote:
[...]
> static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> {
> struct task_struct *p;
> retry:
> /*
> + * This retry loop can repeatedly race against scx_ops_bypass()
> + * dequeueing tasks from @dsq trying to put the system into the bypass
> + * mode. On some multi-socket machines (e.g. 2x Intel 8480c), this can
> + * live-lock the machine into soft lockups. Give a breather.
> + */
> + scx_ops_breather(rq);
Should we move this to after the list_empty() check? Or before the goto retry
below so we can avoid having to do the atomic read on the typical hotpath?
> +
> + /*
> * The caller can't expect to successfully consume a task if the task's
> * addition to @dsq isn't guaranteed to be visible somehow. Test
> * @dsq->list without locking and skip if it seems empty.
> @@ -4550,10 +4587,11 @@ bool task_should_scx(struct task_struct
> */
> static void scx_ops_bypass(bool bypass)
> {
> + static DEFINE_RAW_SPINLOCK(bypass_lock);
> int cpu;
> unsigned long flags;
>
> - raw_spin_lock_irqsave(&__scx_ops_bypass_lock, flags);
> + raw_spin_lock_irqsave(&bypass_lock, flags);
> if (bypass) {
> scx_ops_bypass_depth++;
> WARN_ON_ONCE(scx_ops_bypass_depth <= 0);
> @@ -4566,6 +4604,8 @@ static void scx_ops_bypass(bool bypass)
> goto unlock;
> }
>
> + atomic_inc(&scx_ops_breather_depth);
> +
> /*
> * No task property is changing. We just need to make sure all currently
> * queued tasks are re-queued according to the new scx_rq_bypassing()
> @@ -4621,8 +4661,10 @@ static void scx_ops_bypass(bool bypass)
> /* resched to restore ticks and idle state */
> resched_cpu(cpu);
> }
> +
> + atomic_dec(&scx_ops_breather_depth);
> unlock:
> - raw_spin_unlock_irqrestore(&__scx_ops_bypass_lock, flags);
> + raw_spin_unlock_irqrestore(&bypass_lock, flags);
> }
>
> static void free_exit_info(struct scx_exit_info *ei)
> @@ -6275,6 +6317,13 @@ static bool scx_dispatch_from_dsq(struct
> raw_spin_rq_lock(src_rq);
> }
>
> + /*
> + * If the BPF scheduler keeps calling this function repeatedly, it can
> + * cause similar live-lock conditions as consume_dispatch_q(). Insert a
> + * breather if necessary.
> + */
> + scx_ops_breather(src_rq);
> +
> locked_rq = src_rq;
> raw_spin_lock(&src_dsq->lock);
>
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
2024-11-05 22:03 ` [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching David Vernet
@ 2024-11-05 23:02 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-11-05 23:02 UTC (permalink / raw)
To: David Vernet
Cc: linux-kernel, kernel-team, sched-ext, Andrea Righi, Changwoo Min
Hello,
On Tue, Nov 05, 2024 at 04:03:46PM -0600, David Vernet wrote:
> On Tue, Nov 05, 2024 at 11:48:11AM -1000, Tejun Heo wrote:
>
> [...]
>
> > static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> > {
> > struct task_struct *p;
> > retry:
> > /*
> > + * This retry loop can repeatedly race against scx_ops_bypass()
> > + * dequeueing tasks from @dsq trying to put the system into the bypass
> > + * mode. On some multi-socket machines (e.g. 2x Intel 8480c), this can
> > + * live-lock the machine into soft lockups. Give a breather.
> > + */
> > + scx_ops_breather(rq);
>
> Should we move this to after the list_empty() check? Or before the goto retry
> below so we can avoid having to do the atomic read on the typical hotpath?
I don't think there's going to be a measurable difference in terms of
overhead. It's going to be a cached unlikely jump in most cases and there's
value in catching the CPUs in the breather as soon as possible, even if the
DSQ currently targeted is empty, as it's difficult to reliably predict
the different lockup scenarios.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
2024-11-05 21:48 [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching Tejun Heo
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
2024-11-05 22:03 ` [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching David Vernet
@ 2024-11-05 23:57 ` Andrea Righi
2024-11-06 0:26 ` Tejun Heo
2 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2024-11-05 23:57 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Changwoo Min
Hi Tejun,
On Tue, Nov 05, 2024 at 11:48:11AM -1000, Tejun Heo wrote:
...
> +/*
> + * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
> + * banging on the same DSQ on a large NUMA system to the point where switching
> + * to the bypass mode can take a long time. Inject artifical delays while the
> + * bypass mode is switching to guarantee timely completion.
> + */
> +static void scx_ops_breather(struct rq *rq)
> +{
> + u64 until;
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (likely(!atomic_read(&scx_ops_breather_depth)))
> + return;
> +
> + raw_spin_rq_unlock(rq);
> +
> + until = ktime_get_ns() + NSEC_PER_MSEC;
> +
> + do {
> + int cnt = 1024;
> + while (atomic_read(&scx_ops_breather_depth) && --cnt)
> + cpu_relax();
> + } while (atomic_read(&scx_ops_breather_depth) &&
> + time_before64(ktime_get_ns(), until));
Do you think there's any benefit using the idle injection framework here
instead of this cpu_relax() loop? At the end we're trying to throttle
the scx scheduler from hammering a DSQ until the scheduler is kicked
out, so we may just inject real idle cycles?
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
2024-11-05 23:57 ` Andrea Righi
@ 2024-11-06 0:26 ` Tejun Heo
2024-11-06 0:33 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-11-06 0:26 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Changwoo Min
Hello,
On Wed, Nov 06, 2024 at 12:57:42AM +0100, Andrea Righi wrote:
...
> Do you think there's any benefit using the idle injection framework here
> instead of this cpu_relax() loop? At the end we're trying to throttle
> the scx scheduler from hammering a DSQ until the scheduler is kicked
> out, so we may just inject real idle cycles?
That involves switching to the dedicated task and so on, right? When this is
needed, we can't even trust whether the system is going to make forward
progress within the scheduler. I don't think it'd be a good idea to call out
to something more complicated. Also, from forward-progress-guaranteeing
point of view, cpu_relax() is as good as anything else and this shouldn't be
active long enough for power consumption to be a factor.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching
2024-11-06 0:26 ` Tejun Heo
@ 2024-11-06 0:33 ` Andrea Righi
0 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2024-11-06 0:33 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Changwoo Min
On Tue, Nov 05, 2024 at 02:26:38PM -1000, Tejun Heo wrote:
> External email: Use caution opening links or attachments
>
>
> Hello,
>
> On Wed, Nov 06, 2024 at 12:57:42AM +0100, Andrea Righi wrote:
> ...
> > Do you think there's any benefit using the idle injection framework here
> > instead of this cpu_relax() loop? At the end we're trying to throttle
> > the scx scheduler from hammering a DSQ until the scheduler is kicked
> > out, so we may just inject real idle cycles?
>
> That involves switching to the dedicated task and so on, right? When this is
> needed, we can't even trust whether the system is going to make forward
> progress within the scheduler. I don't think it'd be a good idea to call out
> to something more complicated. Also, from forward-progress-guaranteeing
> point of view, cpu_relax() is as good as anything else and this shouldn't be
> active long enough for power consumption to be a factor.
Ok, I see, we want to keep it simple, because the CPUs might be
congested (like even from a hardware perspective), so in that case
cpu_relax() makes more sense probably.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
@ 2024-11-06 21:32 ` Doug Anderson
2024-11-06 22:08 ` Tejun Heo
2024-11-08 20:38 ` Tejun Heo
1 sibling, 1 reply; 14+ messages in thread
From: Doug Anderson @ 2024-11-06 21:32 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hi,
On Tue, Nov 5, 2024 at 1:49 PM Tejun Heo <tj@kernel.org> wrote:
>
> On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
> behaving BPF scheduler can live-lock the system by making multiple CPUs bang
> on the same DSQ to the point where soft-lockup detection triggers before
> SCX's own watchdog can take action. It also seems possible that the machine
> can be live-locked enough to prevent scx_ops_helper, which is an RT task,
> from running in a timely manner.
>
> Implement scx_softlockup() which is called when three quarters of
> soft-lockup threshold has passed. The function immediately enables the ops
> breather and triggers an ops error to initiate ejection of the BPF
> scheduler.
>
> The previous and this patch combined enable the kernel to reliably recover
> the system from live-lock conditions that can be triggered by a poorly
> behaving BPF scheduler on Intel dual socket systems.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> include/linux/sched/ext.h | 2 +
> kernel/sched/ext.c | 45 ++++++++++++++++++++++++++++++++++++++
> kernel/watchdog.c | 8 ++++++
> tools/sched_ext/scx_show_state.py | 2 +
> 4 files changed, 57 insertions(+)
If someone more senior wants to override me then that's fine, but to
me this feels a bit too ugly/hacky to land. Specifically:
1. It doesn't feel right to add knowledge of "sched-ext" to the
softlockup detector. You're calling from a generic part of the kernel
to a specific part and it just feels unexpected, like there should be
some better boundaries between the two.
2. You're relying on a debug feature to enforce correctness. The
softlockup detector isn't designed to _fix_ softlockups. It's designed
to detect and report softlockups and then possibly reboot the machine.
Someone would not expect that turning on this debug feature would
cause the system to take the action of kicking out a BPF scheduler.
It feels like sched-ext should fix its own watchdog so it detects and
fixes the problem before the softlockup detector does.
-Doug
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-06 21:32 ` Doug Anderson
@ 2024-11-06 22:08 ` Tejun Heo
2024-11-06 23:02 ` Doug Anderson
0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-11-06 22:08 UTC (permalink / raw)
To: Doug Anderson
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hello, Doug.
On Wed, Nov 06, 2024 at 01:32:40PM -0800, Doug Anderson wrote:
...
> 1. It doesn't feel right to add knowledge of "sched-ext" to the
> softlockup detector. You're calling from a generic part of the kernel
> to a specific part and it just feels unexpected, like there should be
> some better boundaries between the two.
I suppose we can create a notifier like infrastructure if directly calling
is what's bothersome but it's likely an overkill at this point. The second
point probably is more important to discuss.
> 2. You're relying on a debug feature to enforce correctness. The
> softlockup detector isn't designed to _fix_ softlockups. It's designed
> to detect and report softlockups and then possibly reboot the machine.
> Someone would not expect that turning on this debug feature would
> cause the system to take the action of kicking out a BPF scheduler.
Softlockups can trigger panic and thus system reset, which is arguably also
a remediative action.
> It feels like sched-ext should fix its own watchdog so it detects and
> fixes the problem before the softlockup detector does.
sched_ext has its own watchdog with configurable timeout and softlockups
would eventually trigger that too. However, that's looking at the time
between tasks waking up and running to detect stalls and the (configurable)
time duration is usually longer than softlockup detection threshold, which
makes sense given what the different failure modes they're looking at.
If sched_ext is to expand its watchdog to monitor softlockup like
conditions, it would essentially look just like softirq watchdog and we
would still have the same problem of coordinating detection thresholds.
Having a notification mechanism which triggers when watchdog is about to
trigger which can take a drastic action doesn't sound too odd to me. If I
make it use a notification chain so that the mechanism is more generic,
would that make it more acceptable to you?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-06 22:08 ` Tejun Heo
@ 2024-11-06 23:02 ` Doug Anderson
2024-11-06 23:07 ` Tejun Heo
0 siblings, 1 reply; 14+ messages in thread
From: Doug Anderson @ 2024-11-06 23:02 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hi,
On Wed, Nov 6, 2024 at 2:08 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Doug.
>
> On Wed, Nov 06, 2024 at 01:32:40PM -0800, Doug Anderson wrote:
> ...
> > 1. It doesn't feel right to add knowledge of "sched-ext" to the
> > softlockup detector. You're calling from a generic part of the kernel
> > to a specific part and it just feels unexpected, like there should be
> > some better boundaries between the two.
>
> I suppose we can create a notifier like infrastructure if directly calling
> is what's bothersome but it's likely an overkill at this point. The second
> point probably is more important to discuss.
>
> > 2. You're relying on a debug feature to enforce correctness. The
> > softlockup detector isn't designed to _fix_ softlockups. It's designed
> > to detect and report softlockups and then possibly reboot the machine.
> > Someone would not expect that turning on this debug feature would
> > cause the system to take the action of kicking out a BPF scheduler.
>
> Softlockups can trigger panic and thus system reset, which is arguably also
> a remediative action.
Sort of, though it doesn't feel to me like quite the same thing.
> > It feels like sched-ext should fix its own watchdog so it detects and
> > fixes the problem before the softlockup detector does.
>
> sched_ext has its own watchdog with configurable timeout and softlockups
> would eventually trigger that too. However, that's looking at the time
> between tasks waking up and running to detect stalls and the (configurable)
> time duration is usually longer than softlockup detection threshold, which
> makes sense given what the different failure modes they're looking at.
>
> If sched_ext is to expand its watchdog to monitor softlockup like
> conditions, it would essentially look just like softirq watchdog and we
> would still have the same problem of coordinating detection thresholds.
>
> Having a notification mechanism which triggers when watchdog is about to
> trigger which can take a drastic action doesn't sound too odd to me. If I
> make it use a notification chain so that the mechanism is more generic,
> would that make it more acceptable to you?
Honestly, it would feel better to me if the soft lockup timer didn't
tell schedext to kill things but instead we just make some special
exception for "schedext" tasks and exclude them from the softlockup
detector because they're already being watched by their own watchdog.
Would that be possible? Then tweaking the "softlockup" timeouts
doesn't implicitly change how long schedext things can run.
-Doug
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-06 23:02 ` Doug Anderson
@ 2024-11-06 23:07 ` Tejun Heo
2024-11-06 23:20 ` Doug Anderson
0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2024-11-06 23:07 UTC (permalink / raw)
To: Doug Anderson
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hello,
On Wed, Nov 06, 2024 at 03:02:35PM -0800, Doug Anderson wrote:
...
> Honestly, it would feel better to me if the soft lockup timer didn't
> tell schedext to kill things but instead we just make some special
> exception for "schedext" tasks and exclude them from the softlockup
> detector because they're already being watched by their own watchdog.
> Would that be possible? Then tweaking the "softlockup" timeouts
> doesn't implicitly change how long schedext things can run.
Some systems can get into full blown live-lock condition where CPUs are
barely making forward progress through the scheduler and all normal (!RT &&
!DEADLINE) tasks are on sched_ext, so the only reasonable way to exclude
sched_ext would be disabling softlockup detection while sched_ext is
enabled which doesn't feel like a sound trade-off.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-06 23:07 ` Tejun Heo
@ 2024-11-06 23:20 ` Doug Anderson
2024-11-07 19:31 ` Tejun Heo
0 siblings, 1 reply; 14+ messages in thread
From: Doug Anderson @ 2024-11-06 23:20 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hi,
On Wed, Nov 6, 2024 at 3:07 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Nov 06, 2024 at 03:02:35PM -0800, Doug Anderson wrote:
> ...
> > Honestly, it would feel better to me if the soft lockup timer didn't
> > tell schedext to kill things but instead we just make some special
> > exception for "schedext" tasks and exclude them from the softlockup
> > detector because they're already being watched by their own watchdog.
> > Would that be possible? Then tweaking the "softlockup" timeouts
> > doesn't implicitly change how long schedext things can run.
>
> Some systems can get into full blown live-lock condition where CPUs are
> barely making forward progress through the scheduler and all normal (!RT &&
> !DEADLINE) tasks are on sched_ext, so the only reasonable way to exclude
> sched_ext would be disabling softlockup detection while sched_ext is
> enabled which doesn't feel like a sound trade-off.
Hmmm, I see.
It still feels wrong to me that the softlockup detector duration is
affecting how long schedext tasks are running. It feels like a
fundamentally separate knob to be adjusting. You might want to stop
misbehaving schedext tasks really quickly but otherwise leave the
softlockup detector to be longer. Tying the two just seems weird.
If we're trying to avoid duplicating code / avoid spinning up extra
timers then it feels like separating out some common code makes sense
and then that common code could be used by both the softlockup
detector and the schedext watchdog. This would allow both to be
configured separately. Yes, you could configure the schedext watchdog
to be effectively "useless" by setting it to be too big, but that's
true of lots of other watchdog-like things that are in the system. You
have to set the timeouts sensibly. Certainly you could make the
default something sensible, at least.
In any case, I'm not actually a maintainer here even if I've touched a
lot of this code recently. As I said, if someone more senior wants to
step in and say "Doug, you're wrong and everything looks great" then I
won't be offended.
-Doug
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-06 23:20 ` Doug Anderson
@ 2024-11-07 19:31 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-11-07 19:31 UTC (permalink / raw)
To: Doug Anderson
Cc: David Vernet, linux-kernel, kernel-team, sched-ext, Andrea Righi,
Changwoo Min, Andrew Morton
Hello, Doug.
On Wed, Nov 06, 2024 at 03:20:17PM -0800, Doug Anderson wrote:
> It still feels wrong to me that the softlockup detector duration is
> affecting how long schedext tasks are running. It feels like a
> fundamentally separate knob to be adjusting. You might want to stop
> misbehaving schedext tasks really quickly but otherwise leave the
> softlockup detector to be longer. Tying the two just seems weird.
The tying happens because softlockup can take a really drastic action of
resetting the whole machine.
> If we're trying to avoid duplicating code / avoid spinning up extra
> timers then it feels like separating out some common code makes sense
> and then that common code could be used by both the softlockup
> detector and the schedext watchdog. This would allow both to be
> configured separately. Yes, you could configure the schedext watchdog
> to be effectively "useless" by setting it to be too big, but that's
> true of lots of other watchdog-like things that are in the system. You
> have to set the timeouts sensibly. Certainly you could make the
> default something sensible, at least.
I don't really get the argument. It's just adding a simple notification to
tell another part of the kernel — one which can have an effect on the condition
being detected — that the threshold is imminent, because it can resolve the
situation in a much more amicable way. I don't see what the big design problem is.
Sure, if we keep adding those notifications, we'd want to make the mechanism
more generic but that's not a difficult thing to do.
I don't see the balance in your argument. Softlockup can already take a
remedial action, a pretty drastic one at that, and one that has practical
implications. In this case, it can be pretty easily dealt with. It solves a
practical problem. Even if we refactor everything so that sched-ext can
do softlockup detection on its own (why? what's the benefit?), we still have
a coordination problem which is just brushed away. On the other side of the
scale is three lines of notification code. The trade-off seems pretty clear.
> In any case, I'm not actually a maintainer here even if I've touched a
> lot of this code recently. As I said, if someone more senior wants to
> step in and say "Doug, you're wrong and everything looks great" then I
> won't be offended.
Andrew, if you don't object, I'll route the patches through the sched-ext
tree.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
2024-11-06 21:32 ` Doug Anderson
@ 2024-11-08 20:38 ` Tejun Heo
1 sibling, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2024-11-08 20:38 UTC (permalink / raw)
To: David Vernet
Cc: linux-kernel, kernel-team, sched-ext, Andrea Righi, Changwoo Min,
Douglas Anderson, Andrew Morton
On Tue, Nov 05, 2024 at 11:49:04AM -1000, Tejun Heo wrote:
> On 2 x Intel Sapphire Rapids machines with 224 logical CPUs, a poorly
> behaving BPF scheduler can live-lock the system by making multiple CPUs bang
> on the same DSQ to the point where soft-lockup detection triggers before
> SCX's own watchdog can take action. It also seems possible that the machine
> can be live-locked enough to prevent scx_ops_helper, which is an RT task,
> from running in a timely manner.
>
> Implement scx_softlockup() which is called when three quarters of
> soft-lockup threshold has passed. The function immediately enables the ops
> breather and triggers an ops error to initiate ejection of the BPF
> scheduler.
>
> The previous and this patch combined enable the kernel to reliably recover
> the system from live-lock conditions that can be triggered by a poorly
> behaving BPF scheduler on Intel dual socket systems.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
Applying 1-2 to sched_ext/for-6.13.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-11-08 20:38 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-05 21:48 [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching Tejun Heo
2024-11-05 21:49 ` [PATCH sched_ext/for-6.13 2/2] sched_ext: Enable the ops breather and eject BPF scheduler on softlockup Tejun Heo
2024-11-06 21:32 ` Doug Anderson
2024-11-06 22:08 ` Tejun Heo
2024-11-06 23:02 ` Doug Anderson
2024-11-06 23:07 ` Tejun Heo
2024-11-06 23:20 ` Doug Anderson
2024-11-07 19:31 ` Tejun Heo
2024-11-08 20:38 ` Tejun Heo
2024-11-05 22:03 ` [PATCH sched_ext/for-6.13 1/2] sched_ext: Avoid live-locking bypass mode switching David Vernet
2024-11-05 23:02 ` Tejun Heo
2024-11-05 23:57 ` Andrea Righi
2024-11-06 0:26 ` Tejun Heo
2024-11-06 0:33 ` Andrea Righi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox