Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks

From: Ingo Molnar <mingo@kernel.org>
To: kernel test robot <oliver.sang@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org, ying.huang@intel.com,
	feng.tang@intel.com, fengwei.yin@intel.com,
	aubrey.li@linux.intel.com, yu.c.chen@intel.com,
	Mike Galbraith <efault@gmx.de>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	linux-tip-commits@vger.kernel.org, x86@kernel.org,
	Gautham Shenoy <gautham.shenoy@amd.com>
Subject: Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks
Date: Mon, 25 Sep 2023 13:07:08 +0200
Message-ID: <ZRFp3EO2JUXtK6XB@gmail.com>
In-Reply-To: <202309221758.d655aa5b-oliver.sang@intel.com>


* kernel test robot <oliver.sang@intel.com> wrote:

> Hello,
> 
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:

Thanks for the testing, this is useful!

So I've tabulated the results into a much easier to read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec                                      -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec                                        -6.0% regression 
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec                                          17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds    -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec                              11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds        -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec                                         100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec                                           59.4% improvement
> | testcase: change | blogbench: blogbench.write_score                                               -35.9% regression
> | testcase: change | hackbench: hackbench.throughput                                                 -4.8% regression
> | testcase: change | blogbench: blogbench.write_score                                               -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec                                          -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps                                                60.6% improvement
> | testcase: change | hackbench: hackbench.throughput                                                 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec                                       -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -82.1% regression
> | testcase: change | blogbench: blogbench.write_score                                               -59.3% regression
> | testcase: change | blogbench: blogbench.write_score                                               -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec                                          -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec                                      -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec                                       -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec                                        -6.0% regression
> | testcase: change | hackbench: hackbench.throughput                                                 -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds    +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds        +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec                              11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec                                          17.6% improvement
> | testcase: change | hackbench: hackbench.throughput                                                 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec                                           59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps                                                60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec                                         100.2% improvement

Testing results notes:

    - The '+' denotes an inverted improvement: the darktable results are measured in
      seconds, so the robot reports the drop as "-5.3% improvement". The mixing of
      signs in the output of the ktest robot is arguably confusing.

    - Any hope of getting a similar summary format by default? It's much more informative than
      just picking out the biggest regression, which wasn't even done correctly AFAICT.
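
For illustration, the tabulate-and-sort step above could be scripted roughly
as follows. This is only a sketch: the regular expression is inferred from
the summary lines quoted in this mail, not from any documented robot output
format, and the names parse_line() and sort_summary() are made up here.

```python
import re

# Matches one summary line as quoted above, e.g.
# "> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec   100.2% improvement"
# (assumed layout, inferred from the lines in this mail)
LINE_RE = re.compile(
    r"\|\s*testcase: change\s*\|\s*(?P<suite>[^:]+):\s*(?P<metric>\S+)\s+"
    r"(?P<pct>[+-]?\d+(?:\.\d+)?)%\s+(?P<verdict>regression|improvement)"
)


def parse_line(line):
    """Return (signed_pct, suite, metric), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    magnitude = abs(float(m.group("pct")))
    # Normalize the sign from the verdict word, so that an inverted metric
    # reported as "-5.3% improvement" still sorts as a +5.3 improvement.
    signed = -magnitude if m.group("verdict") == "regression" else magnitude
    return (signed, m.group("suite").strip(), m.group("metric"))


def sort_summary(lines):
    """Sort matching lines from worst regression to best improvement."""
    rows = [row for row in map(parse_line, lines) if row is not None]
    return sorted(rows)
```

Taking the sign from the verdict word rather than from the printed number
is what keeps the "seconds" metrics from landing among the regressions.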

Summary:

While there are a lot of improvements, it is primarily the nature of the performance
regressions that dictates the way forward:

 - stress-ng.sigsuspend.ops_per_sec regressions, -93%:

    Clearly signal delivery performance suffers from delayed preemption, but
    that should be straightforward to resolve, if we are willing to commit
    to adding a high-prio insta-wakeup variant API ...

 - stress-ng.exec.ops_per_sec -34% regression:

    Likewise this possibly expresses that it's better to immediately reschedule
    during exec() - but maybe there's more to it and it reflects some unfavorable
    migration, as suggested by the NUMA locality figures:

                     %change         %stddev
                        |                \                                         
  79317172           -34.2%   52217838 ±  3%  numa-numastat.node0.local_node
  79360983           -34.2%   52240348 ±  3%  numa-numastat.node0.numa_hit                            
  77971050           -33.2%   52068168 ±  3%  numa-numastat.node1.local_node
  78009071           -33.2%   52089987 ±  3%  numa-numastat.node1.numa_hit
     88287           -45.7%      47970 ±  2%  vmstat.system.cs

 - 'blogbench' regressions of -59% and -36% (the stats below are from the -36% run):

    It too has a very large reduction in context switches:

         %stddev     %change         %stddev
             \          |                \  
     30035           -49.7%      15097 ±  3%  vmstat.system.cs
   2243545 ±  2%      -4.1%    2152228        blogbench.read_score
  52412617           -28.3%   37571769        blogbench.time.file_system_outputs
   2682930           -74.1%     694136        blogbench.time.involuntary_context_switches
   2369329           -50.0%    1184098 ±  5%  blogbench.time.voluntary_context_switches
      5851           -35.9%       3752 ±  2%  blogbench.write_score

    It's unclear to me what's happening with this one, just from these stats,
    but it's "write_score" that hurts most.
 
 - 'stress-ng.filename.ops_per_sec' regression of -19%:

    This test suffered from an *increase* in context-switching, and a large
    increase in CPU-idle:

         %stddev     %change         %stddev
             \          |                \  
   4641666           +19.5%    5545394 ±  2%  cpuidle..usage
     90589 ±  2%     +70.5%     154471 ±  2%  vmstat.system.cs
    628439           -19.2%     507711        stress-ng.filename.ops
     10317           -19.0%       8355        stress-ng.filename.ops_per_sec

    171981           -59.7%      69333 ±  3%  stress-ng.time.involuntary_context_switches
    770691 ±  3%    +200.9%    2319214        stress-ng.time.voluntary_context_switches
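
As a cross-check of the stress-ng.filename figures above, the %change column
can be recomputed from the two value columns, and the two context-switch
counters can be added up to see how the mix shifted. This is a small sketch:
the numbers are copied from the table above, while pct_change() and the
variable names are introduced here for illustration only.

```python
def pct_change(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


# (base kernel, patched kernel) pairs copied from the table above
ops_per_sec = (10317, 8355)
involuntary = (171981, 69333)
voluntary = (770691, 2319214)

# Sum of both stress-ng.time.* context-switch counters per kernel
total = (involuntary[0] + voluntary[0], involuntary[1] + voluntary[1])

print(f"ops_per_sec:  {pct_change(*ops_per_sec):+.1f}%")
print(f"involuntary:  {pct_change(*involuntary):+.1f}%")
print(f"voluntary:    {pct_change(*voluntary):+.1f}%")
print(f"total:        {pct_change(*total):+.1f}%")
print(f"voluntary share: {voluntary[0] / total[0]:.1%} -> {voluntary[1] / total[1]:.1%}")
```

On these numbers the drop in involuntary switches is far outweighed by the
rise in voluntary ones, which fits the higher cpuidle usage: the tasks
appear to block and wake up more often rather than getting preempted.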

Anyway, it's clear from these results that while many workloads are hurt
by our notion of wakeup-preemption, there are several that benefit
from it, especially generic ones like phoronix-test-suite - which have
no good way to turn off wakeup preemption (SCHED_BATCH might help though).
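
To make the SCHED_BATCH remark concrete: a task can move itself to that
policy without special privileges, which is one way an unmodified workload
could be run under a batch policy for comparison. The sketch below uses
Python's os.sched_setscheduler() wrapper (Linux only); the helper name
switch_to_sched_batch() is made up here, and whether SCHED_BATCH actually
helps any of the workloads above is not something this mail has measured.

```python
import os


def switch_to_sched_batch(pid=0):
    """Try to move `pid` (0 = the caller) to SCHED_BATCH; True on success."""
    if not hasattr(os, "SCHED_BATCH"):
        # Not Linux, or a Python build without the scheduler API.
        return False
    try:
        # SCHED_BATCH is a non-realtime policy, so the priority must be 0.
        os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
    except OSError:
        # E.g. blocked by a sandbox or seccomp filter.
        return False
    return os.sched_getscheduler(pid) == os.SCHED_BATCH


if __name__ == "__main__":
    print("running as SCHED_BATCH:", switch_to_sched_batch())
```

The policy is inherited across fork() and exec(), so a small wrapper that
calls this and then executes the benchmark would cover the whole run.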

One way to approach this: instead of always doing wakeup-preemption
(our current default), we could turn it around and only use it when
it is clearly beneficial - such as signal delivery, or exec().

The canonical way to solve this would be to give *userspace* a way to
signal that it's beneficial to preempt immediately, i.e. yield(),
but right now that interface is hurting tasks that only want to
give other tasks a chance to run, without necessarily giving up
their own right to run:

        se->deadline += calc_delta_fair(se->slice, se);
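
To illustrate why that line costs the yielding task so much, here is a toy
model of it. It is a deliberately simplified sketch, not the kernel code:
the Entity class, pick_next() and the example numbers are made up, task
eligibility is ignored, and calc_delta_fair() is approximated as a plain
integer scaling by NICE_0_LOAD / weight.

```python
from dataclasses import dataclass

NICE_0_LOAD = 1024  # load weight of a nice-0 task in this toy model


@dataclass
class Entity:
    name: str
    weight: int    # load weight (1024 = nice 0)
    slice_ns: int  # requested slice, wall-clock nanoseconds
    deadline: int  # virtual deadline, virtual-time nanoseconds


def calc_delta_fair(delta_ns, se):
    """Toy version: scale a wall-clock delta into virtual time by weight."""
    return delta_ns * NICE_0_LOAD // se.weight


def yield_task(se):
    """Model of the quoted line: push the deadline out by one full slice."""
    se.deadline += calc_delta_fair(se.slice_ns, se)


def pick_next(entities):
    """Earliest virtual deadline first (eligibility check omitted)."""
    return min(entities, key=lambda e: e.deadline)


a = Entity("a", NICE_0_LOAD, 3_000_000, deadline=1_000_000)
b = Entity("b", NICE_0_LOAD, 3_000_000, deadline=2_000_000)

print("before yield:", pick_next([a, b]).name)
yield_task(a)
print("after yield: ", pick_next([a, b]).name, "- a.deadline is now", a.deadline)
```

In this model the yielding task does not merely step aside once: its
deadline moves a whole slice into the future, so every task with an earlier
deadline gets picked ahead of it until virtual time catches up.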

Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.

Thanks,

	Ingo
