From: Brian Twichell <tbrian@us.ibm.com>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: David Lang <david.lang@digitalinsight.com>,
	linux-kernel@vger.kernel.org, mbligh@mbligh.org,
	slpratt@us.ibm.com, anton@samba.org
Subject: Re: Database regression due to scheduler changes ?
Date: Tue, 08 Nov 2005 23:03:48 -0600
Message-ID: <43718334.2090905@us.ibm.com>
In-Reply-To: <436FF6A6.1040708@yahoo.com.au>

Nick Piggin wrote:

>
> I think you are right that the NUMA domain is probably being too
> constrictive of task balancing, and that is where the regression
> is coming from.
>
> For some workloads it is definitely important to have the NUMA
> domain, because it helps spread load over memory controllers as
> well as CPUs - so I guess eliminating that domain is not a good
> long term solution.
>
> I would look at changing parameters of SD_NODE_INIT in
> include/asm-powerpc/topology.h so they are closer to SD_CPU_INIT
> parameters (i.e. more aggressive).

I ran with the following:

--- topology.h.orig     2005-11-08 13:11:57.000000000 -0600
+++ topology.h  2005-11-08 13:17:15.000000000 -0600
@@ -43,11 +43,11 @@ static inline int node_to_first_cpu(int
        .span                   = CPU_MASK_NONE,        \
        .parent                 = NULL,                 \
        .groups                 = NULL,                 \
-       .min_interval           = 8,                    \
-       .max_interval           = 32,                   \
-       .busy_factor            = 32,                   \
+       .min_interval           = 1,                    \
+       .max_interval           = 4,                    \
+       .busy_factor            = 64,                   \
        .imbalance_pct          = 125,                  \
-       .cache_hot_time         = (10*1000000),         \
+       .cache_hot_time         = (5*1000000/2),        \
        .cache_nice_tries       = 1,                    \
        .per_cpu_gain           = 100,                  \
        .flags                  = SD_LOAD_BALANCE       \
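
A rough model of how these SD_NODE_INIT fields are used may help in
reading the patch. As I understand the 2.6-era balancer, a domain's
balance_interval is reset to min_interval whenever load_balance() finds
an imbalance, doubles after each pass that finds the domain balanced
(while it is still below max_interval), and is multiplied by busy_factor
when the CPU is not idle. cache_hot_time is in nanoseconds; a task that
ran more recently than that is treated as cache-hot by task_hot(). The
sketch is a user-space approximation under those assumptions, not kernel
code: node_params, idle_interval(), busy_interval() and task_is_hot()
are hypothetical names, and it assumes HZ=1000 so that milliseconds
equal jiffies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical subset of the sched_domain fields changed by the patch.
 * Intervals are in ms (assumed equal to jiffies, i.e. HZ=1000);
 * cache_hot_time is in ns.
 */
struct node_params {
	unsigned int min_interval;
	unsigned int max_interval;
	unsigned int busy_factor;
	int64_t cache_hot_time;
};

static const struct node_params node_orig    = { 8, 32, 32, 10 * 1000000LL };
static const struct node_params node_patched = { 1,  4, 64, 5 * 1000000LL / 2 };

/*
 * balance_interval after `balanced_passes` consecutive passes that found
 * the domain balanced: it starts at min_interval (the value it is reset
 * to after an imbalance) and doubles on each such pass while it is still
 * below max_interval.
 */
static unsigned int idle_interval(const struct node_params *p,
				  unsigned int balanced_passes)
{
	unsigned int interval = p->min_interval;

	assert(p->min_interval >= 1 && p->min_interval <= p->max_interval);
	for (unsigned int i = 0; i < balanced_passes && interval < p->max_interval; i++)
		interval *= 2;
	return interval;
}

/* A busy (non-idle) CPU waits busy_factor times longer between passes. */
static unsigned int busy_interval(const struct node_params *p,
				  unsigned int balanced_passes)
{
	return idle_interval(p, balanced_passes) * p->busy_factor;
}

/*
 * Simplified task_hot(): a task that last ran less than cache_hot_time ns
 * ago is treated as cache-hot and is normally not migrated.
 */
static int task_is_hot(const struct node_params *p, int64_t now_ns,
		       int64_t last_ran_ns)
{
	return now_ns - last_ran_ns < p->cache_hot_time;
}
```

Under this model the patched values let a busy CPU rebalance the node
domain every 64 to 256 ms instead of every 256 to 1024 ms
(busy_interval(&node_patched, 10) is 256 versus 1024 for node_orig), and
a task that last ran 5 ms ago counts as cache-hot with the original
10 ms cache_hot_time but not with the patched 2.5 ms.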

There was no improvement in performance.  The schedstats from this run 
follow:

       2516          sys_sched_yield()
          0(  0.00%) found (only) active queue empty on current cpu
          0(  0.00%) found (only) expired queue empty on current cpu
         46(  1.83%) found both queues empty on current cpu
       2470( 98.17%) found neither queue empty on current cpu


    22969106          schedule()
     694922          goes idle
          3(  0.00%) switched active and expired queues
          0(  0.00%) used existing active queue

          0          active_load_balance()
          0          sched_balance_exec()

      0.19/1.28      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1153606          load_balance()
      82580(  7.16%) called while idle
                         488(  0.59%) tried but failed to move any tasks
                       63876( 77.35%) found no busier group
                       18216( 22.06%) succeeded in moving at least one task
                                      (average imbalance:   1.526)
     317610( 27.53%) called while busy
                          15(  0.00%) tried but failed to move any tasks
                      220139( 69.31%) found no busier group
                       97456( 30.68%) succeeded in moving at least one task
                                      (average imbalance:   1.752)
     753416( 65.31%) called when newly idle
                         487(  0.06%) tried but failed to move any tasks
                      624132( 82.84%) found no busier group
                      128797( 17.10%) succeeded in moving at least one task
                                      (average imbalance:   1.531)

          0          sched_balance_exec() tried to push a task

[scheduler domain #1]
     715638          load_balance()
      68533(  9.58%) called while idle
                        3140(  4.58%) tried but failed to move any tasks
                       60357( 88.07%) found no busier group
                        5036(  7.35%) succeeded in moving at least one task
                                      (average imbalance:   1.251)
      22486(  3.14%) called while busy
                          64(  0.28%) tried but failed to move any tasks
                       21352( 94.96%) found no busier group
                        1070(  4.76%) succeeded in moving at least one task
                                      (average imbalance:   1.922)
     624619( 87.28%) called when newly idle
                        5218(  0.84%) tried but failed to move any tasks
                      591970( 94.77%) found no busier group
                       27431(  4.39%) succeeded in moving at least one task
                                      (average imbalance:   1.382)

          0          sched_balance_exec() tried to push a task

[scheduler domain #2]
     685164          load_balance()
      63247(  9.23%) called while idle
                        7280( 11.51%) tried but failed to move any tasks
                       52200( 82.53%) found no busier group
                        3767(  5.96%) succeeded in moving at least one task
                                      (average imbalance:   1.361)
      24729(  3.61%) called while busy
                         418(  1.69%) tried but failed to move any tasks
                       21025( 85.02%) found no busier group
                        3286( 13.29%) succeeded in moving at least one task
                                      (average imbalance:   3.579)
     597188( 87.16%) called when newly idle
                       67577( 11.32%) tried but failed to move any tasks
                      371377( 62.19%) found no busier group
                      158234( 26.50%) succeeded in moving at least one task
                                      (average imbalance:   2.146)

          0          sched_balance_exec() tried to push a task
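
Each percentage in this output is an outcome count divided by the number
of load_balance() calls of that type, and in every group above the three
outcomes add up exactly to the call count. A small helper (hypothetical,
not part of any schedstats tool) makes that easy to check and to compare
domains; for instance, newly-idle balancing in domain #2 tried but failed
to move a task about 11.3% of the time, versus under 0.1% in domain #0.

```c
/* Outcome counters for one load_balance() call type in one domain. */
struct lb_outcomes {
	unsigned long calls;
	unsigned long failed;     /* "tried but failed to move any tasks" */
	unsigned long no_busier;  /* "found no busier group" */
	unsigned long moved;      /* "succeeded in moving at least one task" */
};

/* Domain #2, "called when newly idle", copied from the output above. */
static const struct lb_outcomes dom2_newly_idle = { 597188, 67577, 371377, 158234 };
/* Domain #0, "called when newly idle", for comparison. */
static const struct lb_outcomes dom0_newly_idle = { 753416, 487, 624132, 128797 };

/* Share of `part` in `total`, as a percentage (0 if total is 0). */
static double pct(unsigned long part, unsigned long total)
{
	return total ? 100.0 * (double)part / (double)total : 0.0;
}

/* In this run the three outcomes sum exactly to the call count. */
static int outcomes_add_up(const struct lb_outcomes *o)
{
	return o->failed + o->no_busier + o->moved == o->calls;
}
```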

>
> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

Independent from the run above, I ran with the following:

--- topology.h.orig     2005-11-08 19:32:19.000000000 -0600
+++ topology.h  2005-11-08 19:34:25.000000000 -0600
@@ -53,7 +53,6 @@ static inline int node_to_first_cpu(int
        .flags                  = SD_LOAD_BALANCE       \
                                | SD_BALANCE_EXEC       \
                                | SD_BALANCE_NEWIDLE    \
-                               | SD_WAKE_IDLE          \
                                | SD_WAKE_BALANCE,      \
        .last_balance           = jiffies,              \
        .balance_interval       = 1,                    \

There was no improvement in performance. 

I didn't expect any change in performance this time, because I
don't think the SD_WAKE_IDLE flag is effective in the NUMA
domain, due to the following code in wake_idle:

        for_each_domain(cpu, sd) {
                if (sd->flags & SD_WAKE_IDLE) {
                        cpus_and(tmp, sd->span, p->cpus_allowed);
                        for_each_cpu_mask(i, tmp) {
                                if (idle_cpu(i))
                                        return i;
                        }
                }
                else
                        break;
        }
 
If I read that loop correctly, it stops at the first domain
which doesn't have SD_WAKE_IDLE set, which is the CPU domain
(see SD_CPU_INIT), and thus it never gets to the NUMA domain.
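
The early exit can be checked with a small user-space model of that
loop: walk the domains from the innermost outward, scan each one that
has SD_WAKE_IDLE for an idle allowed CPU, and stop at the first domain
that lacks the flag. MODEL_SD_WAKE_IDLE, struct model_domain,
find_idle_cpu() and the 8-CPU layout are hypothetical stand-ins for the
kernel's flags, for_each_domain() and cpumask machinery. The layout
assumes SD_WAKE_IDLE is set on the innermost (SMT sibling) domain,
cleared on the CPU domain as in SD_CPU_INIT, and still set on the NUMA
domain as in the unpatched SD_NODE_INIT.

```c
#include <stdint.h>

#define MODEL_SD_WAKE_IDLE 0x1u   /* hypothetical flag value for this model */

/* Hypothetical stand-in for a sched_domain: its flags and the CPUs it spans. */
struct model_domain {
	unsigned int flags;
	uint64_t span;            /* bit i set => CPU i is in the domain */
};

/* Assumed layout: SMT pair {0,1}, CPU domain {0-3}, NUMA domain {0-7}. */
static const struct model_domain layout[] = {
	{ MODEL_SD_WAKE_IDLE, 0x03 },   /* SMT siblings: flag set (assumed)  */
	{ 0,                  0x0f },   /* CPU domain: no SD_WAKE_IDLE       */
	{ MODEL_SD_WAKE_IDLE, 0xff },   /* NUMA domain: flag still set       */
};

/*
 * Mirrors the loop quoted from wake_idle(): scan domains innermost first
 * and return the first idle CPU the task may run on, but stop at the
 * first domain without the flag. Returns -1 if none was found (the
 * kernel would then keep the original CPU).
 */
static int find_idle_cpu(const struct model_domain *sd, int nr_domains,
			 uint64_t cpus_allowed, uint64_t idle_mask)
{
	for (int d = 0; d < nr_domains; d++) {
		if (!(sd[d].flags & MODEL_SD_WAKE_IDLE))
			break;                  /* same early exit as the quoted code */
		uint64_t tmp = sd[d].span & cpus_allowed;
		for (int i = 0; i < 64; i++)
			if (((tmp >> i) & 1) && ((idle_mask >> i) & 1))
				return i;
	}
	return -1;
}
```

With only CPU 6 idle, find_idle_cpu(layout, 3, 0xff, 0x40) returns -1:
the walk breaks at the CPU domain and never scans the NUMA domain, even
though scanning the NUMA domain alone (find_idle_cpu(layout + 2, 1,
0xff, 0x40)) would find CPU 6. In this model, clearing the flag in the
NUMA domain therefore cannot change the result.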

Thanks for the suggestions, Nick.  Andrew raises some
good questions that I will address tomorrow.

Cheers,
Brian
