Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: David Vernet <void@manifault.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, tj@kernel.org,
	roman.gushchin@linux.dev, gautham.shenoy@amd.com,
	aaron.lu@intel.com, wuyun.abel@bytedance.com,
	kernel-team@meta.com
Subject: Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
Date: Thu, 31 Aug 2023 14:11:03 -0500	[thread overview]
Message-ID: <20230831191103.GC531917@maniforge> (raw)
In-Reply-To: <20230831104508.7619-4-kprateek.nayak@amd.com>

On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:

Hi Prateek,

> Even with the two patches, I still observe the following lock
> contention when profiling the tbench 128-clients run with IBS:
> 
>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>      - 10.94% native_queued_spin_lock_slowpath
>         - 10.73% _raw_spin_lock
>            - 9.57% __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
>            - 0.82% task_rq_lock
>                 newidle_balance
>                 pick_next_task_fair
>                 __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
> 
> Since David mentioned rq->avg_idle check is probably not the right step
> towards the solution, this experiment introduces a per-shard
> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> notifies of the possibility of one or more rq covered in the shard's
> domain having a queued task. shard's overload flag is set at the same
> time as "rq->rd->overload", and is cleared when shard's list is found
> to be empty.

I think this is an interesting idea, but I feel that it's still working
against the core proposition of SHARED_RUNQ, which is to enable work
conservation.

> With these changes, following are the results for tbench 128-clients:

Just to make sure I understand, this is to address the contention we're
observing on tbench with 64 - 256 clients, right?  That's my
understanding from Gautham's reply in [0].

[0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/

If so, are we sure this change won't regress other workloads that would
have benefited from the work conservation?

Also, I assume that you don't see the improved contention without this,
even if you include your fix to the newidle_balance() that has us skip
over the <= LLC domain?

Thanks,
David

P.S. Taking off on vacation now, so any replies will be very delayed.
Thanks again for working on this!

> 
> tip				: 1.00 (var: 1.00%)
> tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
> tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c  | 13 +++++++++++--
>  kernel/sched/sched.h | 17 +++++++++++++++++
>  2 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 446ffdad49e1..31fe109fdaf0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -186,6 +186,7 @@ static void shared_runq_reassign_domains(void)
>  		rq->cfs.shared_runq = shared_runq;
>  		rq->cfs.shard = &shared_runq->shards[shard_idx];
>  		rq_unlock(rq, &rf);
> +		WRITE_ONCE(rq->cfs.shard->overload, 0);
>  	}
>  }
>  
> @@ -202,6 +203,7 @@ static void __shared_runq_drain(struct shared_runq *shared_runq)
>  		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
>  			list_del_init(&p->shared_runq_node);
>  		raw_spin_unlock(&shard->lock);
> +		WRITE_ONCE(shard->overload, 0);
>  	}
>  }
>  
> @@ -258,13 +260,20 @@ shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>  {
>  	struct task_struct *p;
>  
> -	if (list_empty(&shard->list))
> +	if (!READ_ONCE(shard->overload))
>  		return NULL;
>  
> +	if (list_empty(&shard->list)) {
> +		WRITE_ONCE(shard->overload, 0);
> +		return NULL;
> +	}
> +
>  	raw_spin_lock(&shard->lock);
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +	if (!p)
> +		WRITE_ONCE(shard->overload, 0);
> +	else if (is_cpu_allowed(p, target))
>  		list_del_init(&p->shared_runq_node);
>  	else
>  		p = NULL;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f50176f720b1..e8d4d948f742 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -601,6 +601,20 @@ do {									\
>  struct shared_runq_shard {
>  	struct list_head list;
>  	raw_spinlock_t lock;
> +	/*
> +	 * shared_runq_shard can contain running tasks.
> +	 * In such cases where all the tasks are running,
> +	 * it is futile to attempt to pull tasks from the
> +	 * list. Overload flag is used to indicate case
> +	 * where one or more rq in the shard domain may
> +	 * have a queued task. If the flag is 0, it is
> +	 * very likely that all tasks in the shard are
> +	 * running and cannot be migrated. This is not
> +	 * guarded by the shard lock, and since it may
> +	 * be updated often, it is placed into its own
> +	 * cacheline.
> +	 */
> +	int overload ____cacheline_aligned;
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> @@ -2585,6 +2599,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  	if (prev_nr < 2 && rq->nr_running >= 2) {
>  		if (!READ_ONCE(rq->rd->overload))
>  			WRITE_ONCE(rq->rd->overload, 1);
> +
> +		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
> +			WRITE_ONCE(rq->cfs.shard->overload, 1);
>  	}
>  #endif
>  
> -- 
> 2.34.1
>

next prev parent reply	other threads:[~2023-08-31 19:11 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
2023-08-09 22:12 ` [PATCH v3 1/7] sched: Expose move_queued_task() from core.c David Vernet
2023-08-09 22:12 ` [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
2023-08-09 22:12 ` [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
2023-08-09 22:12 ` [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable David Vernet
2023-08-09 22:12 ` [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
2023-08-10  7:11   ` kernel test robot
2023-08-10  7:41   ` kernel test robot
2023-08-30  6:46   ` K Prateek Nayak
2023-08-31  1:34     ` David Vernet
2023-08-31  3:47       ` K Prateek Nayak
2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
2023-08-09 23:46   ` kernel test robot
2023-08-10  0:12     ` David Vernet
2023-08-10  7:11   ` kernel test robot
2023-08-30  6:17   ` Chen Yu
2023-08-31  0:01     ` David Vernet
2023-08-31 10:45       ` Chen Yu
2023-08-31 19:14         ` David Vernet
2023-09-23  6:35           ` Chen Yu
2023-08-17  8:42 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Gautham R. Shenoy
2023-08-18  5:03   ` David Vernet
2023-08-18  8:49     ` Gautham R. Shenoy
2023-08-24 11:14       ` Gautham R. Shenoy
2023-08-24 22:51         ` David Vernet
2023-08-30  9:56           ` K Prateek Nayak
2023-08-31  2:32             ` David Vernet
2023-08-31  4:21               ` K Prateek Nayak
2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimantal diff K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance K Prateek Nayak
2023-08-31 18:45                 ` David Vernet
2023-08-31 19:47                   ` K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag K Prateek Nayak
2023-08-31 19:11                 ` David Vernet [this message]
2023-08-31 20:23                   ` K Prateek Nayak
2023-09-29 17:01                     ` David Vernet
2023-10-04  4:21                       ` K Prateek Nayak
2023-10-04 17:20                         ` David Vernet
2023-10-05  3:50                           ` K Prateek Nayak
2023-09-27  4:23                   ` K Prateek Nayak
2023-09-27  6:59                     ` Chen Yu
2023-09-27  8:36                       ` K Prateek Nayak
2023-09-28  8:41                         ` Chen Yu
2023-10-03 21:05                       ` David Vernet
2023-10-07  2:10                         ` Chen Yu
2023-09-27 13:08                     ` David Vernet
2023-11-27  8:28 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Aboorva Devarajan
2023-11-27 19:49   ` David Vernet
2023-12-07  6:00     ` Aboorva Devarajan
2023-12-04 19:30 ` David Vernet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230831191103.GC531917@maniforge \
    --to=void@manifault.com \
    --cc=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@meta.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=roman.gushchin@linux.dev \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=wuyun.abel@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox