From: Mel Gorman <mgorman@suse.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Kees Cook <keescook@chromium.org>,
Iurii Zaikin <yzaikin@google.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
David Rientjes <rientjes@google.com>,
Oleg Rombakh <olegrom@google.com>,
linux-doc@vger.kernel.org, Paul Turner <pjt@google.com>
Subject: Re: [PATCH v2] sched: Warn on long periods of pending need_resched
Date: Wed, 24 Mar 2021 13:39:16 +0000 [thread overview]
Message-ID: <20210324133916.GQ15768@suse.de> (raw)
In-Reply-To: <YFssoD5NDl6dFfg/@hirez.programming.kicks-ass.net>
On Wed, Mar 24, 2021 at 01:12:16PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 24, 2021 at 11:42:24AM +0000, Mel Gorman wrote:
> > On Wed, Mar 24, 2021 at 11:54:24AM +0100, Peter Zijlstra wrote:
> > > On Wed, Mar 24, 2021 at 10:37:43AM +0100, Peter Zijlstra wrote:
> > > > Should we perhaps take out all SCHED_DEBUG sysctls and move them to
> > > > /debug/sched/ ? (along with the existing /debug/sched_{debug,features,preemp}
> > > > files)
> > > >
> > > > Having all that in sysctl and documented gives them far too much sheen
> > > > of ABI.
> > >
> > > ... a little something like this ...
> > >
> >
> > I did not read this particularly carefully or boot it to check but some
> > of the sysctls moved are expected to exist and should never should have
> > been under SCHED_DEBUG.
> >
> > For example, I'm surprised that numa_balancing is under the SCHED_DEBUG
> > sysctl because there are legimiate reasons to disable that at runtime.
> > For example, HPC clusters running various workloads may disable NUMA
> > balancing globally for particular jobs without wanting to reboot and
> > reenable it when finished.
>
> Yeah, lets say I was pleasantly surprised to find it there :-)
>
Minimally, lets move that out before it gets kicked out. Patch below.
> > Moving something like sched_min_granularity_ns will break a number of
> > tuning guides as well as the "tuned" tool which ships by default with
> > some distros and I believe some of the default profiles used for tuned
> > tweak kernel.sched_min_granularity_ns
>
> Yeah, can't say I care. I suppose some people with PREEMPT=n kernels
> increase that to make their server workloads 'go fast'. But I'll
> absolutely suck rock on anything desktop.
>
Broadly speaking yes and despite the lack of documentation, enough people
think of that parameter when tuning for throughput vs latency depending on
the expected use of the machine. kernel.sched_wakeup_granularity_ns might
get tuned if preemption is causing overscheduling. Same potentially with
kernel.sched_min_granularity_ns and kernel.sched_latency_ns. That said, I'm
struggling to think of an instance where I've seen tuning recommendations
properly quantified other than the impact on microbenchmarks but I
think there will be complaining if they disappear. I suspect that some
recommended tuning is based on "I tried a number of different values and
this seemed to work reasonably well".
kernel.sched_schedstats probably should not depend in SCHED_DEBUG because
it has value for workload analysis which is not necessarily about debugging
per-se. It might simply be informing whether another variable should be
tuned or useful for debugging applications rather than the kernel.
The others I'm less concerned with. kernel.sched_tunable_scaling is very
specific. sysctl_sched_migration_cost is subtle because it affects lots
of things including whether tasks are cache hot and load balancing and
is best left alone. I wonder how many people can accurately predict how
workloads will behave when that is tuned? sched_nr_migrate is also a hard
one to tune in a sensible fashion.
As an aside, I wonder how often SCHED_DEBUG has been enabled simply
because LATENCYTOP selects it -- no idea offhand why LATENCYTOP even
needs SCHED_DEBUG.
> These knobs really shouldn't have been as widely available as they are.
>
Probably not. Worse, some of the tuning is probably based on "this worked
for workload X 10 years ago so I'll just keep doing that"
> And guides, well, the writes have to earn a living too, right.
>
For most of the guides I've seen they either specify values without
explaining why or just describe roughly what the parameter does and it's
not always that accurate a description.
> > Whether there are legimiate reasons to modify those values or not,
> > removing them may generate fun bug reports.
>
> Which I'll close with -EDONTCARE, userspace has to cope with
> SCHED_DEBUG=n in any case.
True but removing the throughput vs latency parameters is likely to
generate a lot of noise even if the reasons for tuning are bad ones.
Some definitely should not be depending on SCHED_DEBUG, others may
need to be moved to debugfs one patch at a time so they can be reverted
individually if complaining is excessive and there is a legiminate reason
why it should be tuned. It's possible that complaining will be based on
a workload regression that really depended on tuned changing parameters.
Anyway, I definitely want to save kernel.numa_balancing from the firing
line so....
--8<--
sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
From: Mel Gorman <mgorman@suse.de>
The ability to enable/disable NUMA balancing is not a debugging feature
and should not depend on CONFIG_SCHED_DEBUG. For example, machines within
a HPC cluster may disable NUMA balancing temporarily for some jobs and
re-enable it for other jobs without needing to reboot.
This patch removes the dependency on CONFIG_SCHED_DEBUG for
kernel.numa_balancing sysctl. The other numa balancing related sysctls
are left as-is because if they need to be tuned then it is more likely
that NUMA balancing needs to be fixed instead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
kernel/sysctl.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62fbd09b5dc1..8042098ae080 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1753,6 +1753,9 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
+#ifdef CONFIG_NUMA_BALANCING
{
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
@@ -1763,7 +1766,6 @@ static struct ctl_table kern_table[] = {
.extra2 = SYSCTL_ONE,
},
#endif /* CONFIG_NUMA_BALANCING */
-#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
next prev parent reply other threads:[~2021-03-24 13:40 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-23 3:57 [PATCH v2] sched: Warn on long periods of pending need_resched Josh Don
2021-03-24 9:37 ` Peter Zijlstra
2021-03-24 10:54 ` Peter Zijlstra
2021-03-24 10:55 ` Peter Zijlstra
2021-03-24 11:42 ` Mel Gorman
2021-03-24 12:12 ` Peter Zijlstra
2021-03-24 13:39 ` Mel Gorman [this message]
2021-03-24 14:36 ` Peter Zijlstra
2021-03-24 15:52 ` Mel Gorman
2021-03-25 21:58 ` Josh Don
2021-03-26 8:58 ` Peter Zijlstra
2021-03-24 11:27 ` Mel Gorman
2021-03-25 21:50 ` Josh Don
2021-03-30 22:44 ` Josh Don
2021-04-16 15:04 ` Peter Zijlstra
2021-04-16 21:33 ` Josh Don
2021-04-19 7:52 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210324133916.GQ15768@suse.de \
--to=mgorman@suse.de \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=keescook@chromium.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=mingo@redhat.com \
--cc=olegrom@google.com \
--cc=peterz@infradead.org \
--cc=pjt@google.com \
--cc=rientjes@google.com \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=yzaikin@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).