Re: [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Vernet <void@manifault.com>
To: Tejun Heo <tj@kernel.org>
Cc: kernel-team@meta.com, linux-kernel@vger.kernel.org, sched-ext@meta.com
Subject: Re: [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long
Date: Thu, 10 Oct 2024 14:12:37 -0500	[thread overview]
Message-ID: <20241010191237.GF28209@maniforge> (raw)
In-Reply-To: <20241009214411.681233-7-tj@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5097 bytes --]

On Wed, Oct 09, 2024 at 11:41:02AM -1000, Tejun Heo wrote:
> While enabling and disabling a BPF scheduler, every task is iterated a
> couple times by walking scx_tasks. Except for one, all iterations keep
> holding scx_tasks_lock. On multi-socket systems under heavy rq lock
> contention and high number of threads, this can can lead to RCU and other
> stalls.
> 
> The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
> running `stress-ng --workload 150 --workload-threads 10` with >400k idle
> threads and RCU stall period reduced to 5s:
> 
>   rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>   rcu:     91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
>   rcu:     186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
>   rcu:     (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
>   Sending NMI from CPU 80 to CPUs 91:
>   NMI backtrace for cpu 91
>   CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
>   Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
>   Sched_ext: simple (disabling+all)
>   RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
>   Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
>   RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
>   RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
>   RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
>   RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
>   R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
>   R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
>   FS:  0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
>   Call Trace:
>    <NMI>
>    </NMI>
>    <TASK>
>    do_raw_spin_lock+0x9c/0xb0
>    task_rq_lock+0x50/0x190
>    scx_task_iter_next_locked+0x157/0x170
>    scx_ops_disable_workfn+0x2c2/0xbf0
>    kthread_worker_fn+0x108/0x2a0
>    kthread+0xeb/0x110
>    ret_from_fork+0x36/0x40
>    ret_from_fork_asm+0x1a/0x30
>    </TASK>
>   Sending NMI from CPU 80 to CPUs 186:
>   NMI backtrace for cpu 186
>   CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
> 
> scx_task_iter can safely drop locks while iterating. Make
> scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
> stalls.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

LG, just had one question below.

Acked-by: David Vernet <void@manifault.com>

> ---
>  kernel/sched/ext.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index d53c7a365fec..b44946198ea5 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -18,6 +18,12 @@ enum scx_consts {
>  	SCX_EXIT_DUMP_DFL_LEN		= 32768,
>  
>  	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
> +
> +	/*
> +	 * Iterating all tasks may take a while. Periodically drop
> +	 * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
> +	 */
> +	SCX_OPS_TASK_ITER_BATCH		= 32,
>  };
>  
>  enum scx_exit_kind {
> @@ -1273,6 +1279,7 @@ struct scx_task_iter {
>  	struct task_struct		*locked;
>  	struct rq			*rq;
>  	struct rq_flags			rf;
> +	u32				cnt;
>  };
>  
>  /**
> @@ -1301,6 +1308,7 @@ static void scx_task_iter_start(struct scx_task_iter *iter)
>  	iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
>  	list_add(&iter->cursor.tasks_node, &scx_tasks);
>  	iter->locked = NULL;
> +	iter->cnt = 0;
>  }
>  
>  static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
> @@ -1355,14 +1363,21 @@ static void scx_task_iter_stop(struct scx_task_iter *iter)
>   * scx_task_iter_next - Next task
>   * @iter: iterator to walk
>   *
> - * Visit the next task. See scx_task_iter_start() for details.
> + * Visit the next task. See scx_task_iter_start() for details. Locks are dropped
> + * and re-acquired every %SCX_OPS_TASK_ITER_BATCH iterations to avoid causing
> + * stalls by holding scx_tasks_lock for too long.
>   */
>  static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
>  {
>  	struct list_head *cursor = &iter->cursor.tasks_node;
>  	struct sched_ext_entity *pos;
>  
> -	lockdep_assert_held(&scx_tasks_lock);
> +	if (!(++iter->cnt % SCX_OPS_TASK_ITER_BATCH)) {
> +		scx_task_iter_unlock(iter);
> +		cpu_relax();

Could you explain why we need this cpu_relax()? I thought it was only
necessary for busy-wait loops.

> +		cond_resched();
> +		scx_task_iter_relock(iter);
> +	}
>  
>  	list_for_each_entry(pos, cursor, tasks_node) {
>  		if (&pos->tasks_node == &scx_tasks)
> -- 
> 2.46.2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

next prev parent reply	other threads:[~2024-10-10 19:12 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-09 21:40 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix RCU and other stalls while iterating tasks during enable/disable Tejun Heo
2024-10-09 21:40 ` [PATCH 1/6] Revert "sched_ext: Use shorter slice while bypassing" Tejun Heo
2024-10-10 17:59   ` David Vernet
2024-10-09 21:40 ` [PATCH 2/6] sched_ext: Start schedulers with consistent p->scx.slice values Tejun Heo
2024-10-10 18:00   ` David Vernet
2024-10-09 21:40 ` [PATCH 3/6] sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl() Tejun Heo
2024-10-09 21:41 ` [PATCH 4/6] sched_ext: bypass mode shouldn't depend on ops.select_cpu() Tejun Heo
2024-10-10 18:15   ` David Vernet
2024-10-10 18:26     ` Tejun Heo
2024-10-10 18:31       ` David Vernet
2024-10-09 21:41 ` [PATCH 5/6] sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers Tejun Heo
2024-10-10 18:36   ` David Vernet
2024-10-09 21:41 ` [PATCH 6/6] sched_ext: Don't hold scx_tasks_lock for too long Tejun Heo
2024-10-10 19:12   ` David Vernet [this message]
2024-10-10 21:38     ` Tejun Heo
2024-10-10 23:38       ` Waiman Long
2024-10-10 21:43 ` [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix RCU and other stalls while iterating tasks during enable/disable Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241010191237.GF28209@maniforge \
    --to=void@manifault.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@meta.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.