From: Ingo Molnar <mingo@kernel.org>
To: kernel test robot <oliver.sang@intel.com>,
Mel Gorman <mgorman@techsingularity.net>,
Peter Zijlstra <peterz@infradead.org>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
linux-kernel@vger.kernel.org, ying.huang@intel.com,
feng.tang@intel.com, fengwei.yin@intel.com,
aubrey.li@linux.intel.com, yu.c.chen@intel.com,
Mike Galbraith <efault@gmx.de>,
K Prateek Nayak <kprateek.nayak@amd.com>,
"Peter Zijlstra (Intel)" <peterz@infradead.org>,
linux-tip-commits@vger.kernel.org, x86@kernel.org,
Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks
Date: Mon, 25 Sep 2023 13:07:08 +0200
Message-ID: <ZRFp3EO2JUXtK6XB@gmail.com>
In-Reply-To: <202309221758.d655aa5b-oliver.sang@intel.com>

* kernel test robot <oliver.sang@intel.com> wrote:
> Hello,
>
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:

Thanks for the testing, this is useful!

So I've tabulated the results into a much easier-to-read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement

Testing results notes:

- The '+' denotes an inverted improvement: the darktable metrics are
  measured in seconds, where lower is better. The mixing of signs in the
  output of the kernel test robot is arguably confusing.

- Any hope of getting a similar summary format by default? It's much more
  informative than just picking the biggest regression, which wasn't even
  done correctly AFAICT: the headline -19.0% is not the largest regression
  in the list.
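
For illustration, the sorting above can be reproduced mechanically once time-based metrics are sign-normalized. This is a minimal sketch: the 'struct result' layout and the function names are hypothetical (not part of the robot's tooling), and it assumes that any metric whose name ends in ".seconds" is lower-is-better, as with the darktable runs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One row of a test report (hypothetical layout). */
struct result {
    const char *metric; /* e.g. "stress-ng.sigrt.ops_per_sec" */
    double change_pct;  /* change as reported, e.g. -19.0 */
};

/* Assumption: metrics ending in ".seconds" are lower-is-better. */
int lower_is_better(const char *metric)
{
    const char *suffix = ".seconds";
    size_t len = strlen(metric);
    size_t slen = strlen(suffix);

    return len >= slen && strcmp(metric + len - slen, suffix) == 0;
}

/* Flip the sign of lower-is-better metrics so that '+' always means better. */
double normalized_change(const struct result *r)
{
    return lower_is_better(r->metric) ? -r->change_pct : r->change_pct;
}

static int cmp_results(const void *a, const void *b)
{
    double da = normalized_change(a);
    double db = normalized_change(b);

    return (da > db) - (da < db);
}

/* Sort worst regression first, biggest improvement last. */
void sort_results(struct result *rows, size_t n)
{
    if (n < 2)
        return;
    assert(rows != NULL);
    qsort(rows, n, sizeof(rows[0]), cmp_results);
}

/* Print one line per row, using the normalized sign. */
void print_results(const struct result *rows, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double c = normalized_change(&rows[i]);

        printf("%-60s %+7.1f%% %s\n", rows[i].metric, c,
               c < 0 ? "regression" : "improvement");
    }
}
```

Fed with all the rows above, this produces a worst-first ordering, with the two darktable runs showing up as +3.5% and +5.3% improvements.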

Summary:

While there are a lot of improvements, it is primarily the nature of the
performance regressions that dictates the way forward:

- stress-ng.sigsuspend.ops_per_sec regressions, -93.9% and -82.1%:

  Signal delivery performance clearly suffers from delayed preemption, but
  that should be straightforward to resolve, if we are willing to commit
  to adding a high-prio insta-wakeup variant API ...

- stress-ng.exec.ops_per_sec -34.6% regression:

  Likewise, this possibly indicates that it's better to reschedule
  immediately during exec(), but maybe there's more to it and it reflects
  some unfavorable migration, as suggested by the NUMA locality figures:

                %change           %stddev
                   |                 \
    79317172     -34.2%     52217838 ± 3%  numa-numastat.node0.local_node
    79360983     -34.2%     52240348 ± 3%  numa-numastat.node0.numa_hit
    77971050     -33.2%     52068168 ± 3%  numa-numastat.node1.local_node
    78009071     -33.2%     52089987 ± 3%  numa-numastat.node1.numa_hit
       88287     -45.7%        47970 ± 2%  vmstat.system.cs

- 'blogbench' regressions of -59.3% and -35.9%:

  The -35.9% run, too, shows a very large reduction in context switches:

          %stddev    %change           %stddev
             \          |                 \
       30035          -49.7%        15097 ± 3%  vmstat.system.cs
     2243545 ± 2%      -4.1%      2152228       blogbench.read_score
    52412617          -28.3%     37571769       blogbench.time.file_system_outputs
     2682930          -74.1%       694136       blogbench.time.involuntary_context_switches
     2369329          -50.0%      1184098 ± 5%  blogbench.time.voluntary_context_switches
        5851          -35.9%         3752 ± 2%  blogbench.write_score

  It's unclear to me from these stats alone what's happening with this
  one, but it's the write_score that hurts most (read_score is down only
  4.1%).

- 'stress-ng.filename.ops_per_sec' regression of -19.0%:

  This test suffered from an *increase* in context switching (voluntary
  context switches roughly tripled, while involuntary ones dropped by
  about 60%), and a large increase in cpuidle usage:

          %stddev    %change           %stddev
             \          |                 \
     4641666          +19.5%      5545394 ± 2%  cpuidle..usage
       90589 ± 2%     +70.5%       154471 ± 2%  vmstat.system.cs
      628439          -19.2%       507711       stress-ng.filename.ops
       10317          -19.0%         8355       stress-ng.filename.ops_per_sec
      171981          -59.7%        69333 ± 3%  stress-ng.time.involuntary_context_switches
      770691 ± 3%    +200.9%      2319214       stress-ng.time.voluntary_context_switches

Anyway, it's clear from these results that while many workloads are hurt
by our notion of wakeup preemption, there are several that benefit from
it. Among the ones that are hurt are generic workloads like
phoronix-test-suite, which have no good way to turn off wakeup preemption
(SCHED_BATCH might help, though).
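
A minimal sketch of the SCHED_BATCH route: as far as I can tell, the fair class skips wakeup preemption when the woken task's policy is not SCHED_NORMAL, so a workload can opt out per thread with sched_setscheduler(2). The helper name is hypothetical; switching from SCHED_OTHER to SCHED_BATCH does not normally require privileges.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc only exposes SCHED_BATCH with _GNU_SOURCE */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3 /* Linux UAPI value, in case the libc hides it */
#endif

/*
 * Hypothetical helper: switch the calling thread from SCHED_OTHER to
 * SCHED_BATCH. Returns 0 on success or a negative errno value on failure.
 */
int use_sched_batch(void)
{
    struct sched_param param;

    memset(&param, 0, sizeof(param));
    param.sched_priority = 0; /* SCHED_BATCH requires a static priority of 0 */

    if (sched_setscheduler(0, SCHED_BATCH, &param) == -1)
        return -errno;

    assert(sched_getscheduler(0) == SCHED_BATCH);
    return 0;
}
```

For an unmodified benchmark, util-linux's 'chrt --batch 0 <command>' should have the same effect without code changes.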

One way to approach this would be, instead of always doing wakeup
preemption (our current default), to turn it around and only use it when
it is clearly beneficial, such as for signal delivery or exec().

The canonical way to solve this would be to give *userspace* a way to
signal that it's beneficial to preempt immediately, i.e. yield(), but
right now that interface hurts tasks that only want to give other tasks
a chance to run without necessarily giving up their own right to run,
because yielding pushes the task's own virtual deadline out by a full
slice:

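	/*
	 * yield_task_fair(): push the yielding entity's virtual deadline
	 * out by one slice, scaled by its weight into virtual time.
	 */
	se->deadline += calc_delta_fair(se->slice, se);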
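
To see why this penalizes the yielder, here is a toy userspace model of the deadline update on the yield path together with a simplified EEVDF pick (earliest virtual deadline among eligible entities). All names are hypothetical, weights are unscaled (nice-0 = 1024), and eligibility compares absolute vruntimes against the weighted average rather than using the kernel's min_vruntime-relative arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024ULL /* models NICE_0_LOAD, unscaled */

struct toy_se {
    uint64_t weight;   /* load weight, 1024 for nice 0 */
    uint64_t vruntime; /* virtual runtime */
    uint64_t slice;    /* requested slice, wall-clock ns */
    uint64_t deadline; /* virtual deadline */
};

/* Model of calc_delta_fair(): scale wall-clock time into virtual time. */
uint64_t toy_calc_delta_fair(uint64_t delta, const struct toy_se *se)
{
    assert(se->weight != 0);
    if (se->weight == TOY_NICE_0_WEIGHT)
        return delta;
    return delta * TOY_NICE_0_WEIGHT / se->weight;
}

/* Model of the yield path: push our own deadline out by one slice. */
void toy_yield(struct toy_se *se)
{
    se->deadline += toy_calc_delta_fair(se->slice, se);
}

/*
 * Simplified EEVDF pick: among eligible entities (vruntime <= weighted
 * average vruntime), choose the earliest virtual deadline.
 * Returns the index of the picked entity, or -1 if none is eligible.
 */
int toy_pick_eevdf(const struct toy_se *se, size_t n)
{
    uint64_t wsum = 0, wvsum = 0;
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        wsum += se[i].weight;
        wvsum += se[i].weight * se[i].vruntime;
    }

    for (size_t i = 0; i < n; i++) {
        /* eligible iff vruntime <= wvsum / wsum, compared without division */
        if (se[i].vruntime * wsum > wvsum)
            continue;
        if (best < 0 || se[i].deadline < se[best].deadline)
            best = (int)i;
    }
    return best;
}
```

With three nice-0 tasks at equal vruntime whose deadlines are 2, 3 and 4 ms of virtual time, the first task is picked; after it yields, its deadline moves to 4 ms and the second task is picked instead, i.e. the yielder now sits behind a task it was previously ahead of.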

Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.

Thanks,

	Ingo