From: Micah Dowty <micah@vmware.com>
To: linux-kernel@vger.kernel.org
Subject: High priority tasks break SMP balancer?
Date: Fri, 9 Nov 2007 14:34:17 -0800
Message-ID: <20071109223417.GB16250@vmware.com>

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]

I've been investigating a problem recently, in which N runnable
CPU-bound tasks on an N-way machine run on only N-1 CPUs. The
remaining CPU is almost 100% idle. I have seen it occur with both the
CFS and O(1) schedulers.

I've traced this down to what seems to be a quirk in the SMP balancer,
whereby a high-priority thread which spends most of its time sleeping
can artificially inflate the CPU load average calculated for one
processor. Most of the time this CPU is idle (nr_running==0) yet its
CPU load average is much higher than that of any other CPU.

Please find attached a sample program which demonstrates this
behaviour on a 2-way SMP machine. It creates three threads: two are
CPU bound and run at the default priority, the third spends most of
its time sleeping and runs at an elevated priority. It wakes up
frequently (using /dev/rtc) and randomly generates some CPU load.

On my machine (2-way Opteron with a vanilla 2.6.23.1 kernel) this test
program will reliably put the scheduler into a state where one CPU has
both of the busy-looping processes in its runqueue, and the other CPU
is usually idle. The usually-idle CPU will have a very high cpu_load,
as reported by /proc/sched_debug.

Your mileage may vary. On some machines, this test program will only
enter the "bad" state for a few seconds. Sometimes we bounce back and
forth between good and bad states every few seconds. In all cases,
removing the priority elevation fixes the balancing problem.

Is this a behaviour any of the scheduler developers are aware of? I
would be very grateful if anyone could shed some light on the root
cause behind the inflated cpu_load average. If this turns out to be a
real bug, I would be happy to work on a patch.

Thanks in advance,
Micah Dowty

[-- Attachment #2: priosched.c --]
[-- Type: text/plain, Size: 4075 bytes --]

/*
 * This is a demonstration of unwanted SMP scheduler side-effects
 * caused by high-priority threads.
 *
 * In this demo, we have three threads:
 *
 *   1. A busy-loop (CPU bound) at nice level 0.
 *
 *   2. Another busy-loop at nice level 0.
 *
 *   3. A high-priority thread (nice -10 here; the
 *      MAINTHREAD_PRIORITY knob below defaults to -15) which
 *      spends most of its time sleeping, but it wakes up
 *      frequently and spends a little bit of CPU each time.
 *
 * This is meant to model three of our threads in the vmware-vmx: a
 * VCPU thread, the MKS thread, and the main VMX thread. Ideally, the
 * VCPU and MKS would spend most of their time running on separate
 * CPUs on an SMP system. The VMX thread would wake up frequently,
 * interrupt an arbitrary cpu-bound thread, then go back to sleep.
 *
 * The actual behaviour I see on Linux 2.6 is that the system
 * oscillates between two states, with a period of a few seconds. In
 * the "good" state, the two busy-loop threads run on separate CPUs. In
 * the "bad" state, both of the busy-loop threads run on the same
 * physical CPU and the other CPU sits idle.
 *
 * Taking a closer look at the kernel's scheduler debug output
 * (/proc/sched_debug and /proc/schedstat) the problem becomes
 * clearer: Even though the VMX thread spends very little time
 * running, its runtime is given extra weight according to its
 * priority. The "load" calculated by the scheduler for its CPU is
 * low, but the load average calculated via delta_exec and delta_fair
 * can become quite high. The result is that the physical CPU where
 * the VMX was running gets stuck with a high load average even after
 * the VMX thread is sleeping again. This high load average causes all
 * running tasks to be rebalanced onto the other CPU until the high
 * load subsides.
 *
 * This example requires a machine with exactly 2 CPUs.
 *
 * Usage:
 *   cc -o priosched priosched.c -lpthread
 *   sudo ./priosched
 *
 * Now observe the load on both CPUs. In the "good" state, both CPUs
 * will be busy. In the "bad" state, both of the busyThreads will be
 * stuck on the same CPU and the other CPU will be idle. 
 *
 * If you have a kernel with scheduler debugging compiled in, try "cat
 * /proc/sched_debug". In the "bad" state, one CPU will have an empty
 * runnable task list and a list of cpu_load[] averages around 9000.
 *
 * -- Micah Dowty <micah@vmware.com>
 */

#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>
#include <time.h>

/*
 * Knobs.
 * You may have to tweak these to reproduce the problem on your machine.
 */
#define NUM_BUSY_THREADS          2
#define MAINTHREAD_PRIORITY     -15   // Nice level
#define MAINTHREAD_WAKE_HZ      256   // Frequency to wake up at
#define MAINTHREAD_LOAD_PERCENT   5   // Percent of time to wake up and generate load
#define MAINTHREAD_LOAD_CYCLES   10   // Consecutive clock ticks to generate load for

/* CPU-bound worker: spins forever and never returns. */
void *busyThreadFunc(void *arg)
{
   (void)arg;   /* unused */
   while (1);
}

int main(void)
{
   pthread_t busyThreads[NUM_BUSY_THREADS];
   int i, rtc;

   for (i = 0; i < NUM_BUSY_THREADS; i++) {
      if (pthread_create(&busyThreads[i], NULL, busyThreadFunc, NULL)) {
         perror("pthread_create");
         return 1;
      }
   }

   if (nice(MAINTHREAD_PRIORITY) == -1) {
      fprintf(stderr, "This program must be run as root.\n");
      return 1;
   }

   rtc = open("/dev/rtc", O_RDONLY);
   if (rtc == -1) {
      perror("/dev/rtc");
      return 1;
   }

   if (ioctl(rtc, RTC_IRQP_SET, MAINTHREAD_WAKE_HZ) ||
       ioctl(rtc, RTC_PIE_ON, 0)) {
      perror("ioctl");
      return 1;
   }

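   while (1) {
      unsigned long data;

      /* Sleep until the next periodic RTC interrupt. */
      if (read(rtc, &data, sizeof data) != sizeof data) {
         perror("read");
         return 1;
      }

      /* On about MAINTHREAD_LOAD_PERCENT% of wakeups, generate load. */
      if (random() % 100 < MAINTHREAD_LOAD_PERCENT) {
         for (i = 0; i < MAINTHREAD_LOAD_CYCLES; i++) {
            /* Busy-wait for one RTC tick by polling in non-blocking mode. */
            fcntl(rtc, F_SETFL, O_NONBLOCK);
            while (read(rtc, &data, sizeof data) < 0);
            fcntl(rtc, F_SETFL, 0);
         }
      }
   }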

   return 0;
}
