public inbox for linux-kernel@vger.kernel.org
From: Olaf Kirch <okir@suse.de>
To: linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, Andreas Gruenbacher <agruen@suse.de>
Subject: CFS Performance Issues
Date: Thu, 28 May 2009 15:02:22 +0200	[thread overview]
Message-ID: <200905281502.22487.okir@suse.de> (raw)

[-- Attachment #1: Type: text/plain, Size: 3671 bytes --]


Hi Ingo,

As you probably know, we've been chasing a variety of performance issues
on our SLE11 kernel, and one of the suspects has been CFS for quite a
while. The benchmarks that pointed to CFS include AIM7, dbench, and a few
others, but the picture of what is actually going wrong has remained hazy.

Now IBM recently told us they had played around with some scheduler
tunables and found that by turning off NEW_FAIR_SLEEPERS, they
could make the regression on a compute benchmark go away completely.
We're currently rerunning other benchmarks with NEW_FAIR_SLEEPERS
turned off to see whether it has an impact on those as well.
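For reference (this is my note, not part of IBM's setup): on kernels of
that era the scheduler feature flags live in debugfs, so the tunable can
be flipped at runtime roughly like this, assuming debugfs is mounted at
/sys/kernel/debug:

```shell
# Show the current feature flags (NEW_FAIR_SLEEPERS vs. NO_NEW_FAIR_SLEEPERS)
cat /sys/kernel/debug/sched_features

# Disable new fair sleepers (prefix the flag name with NO_)
echo NO_NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

# Re-enable them
echo NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features
```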

Of course, the first question we asked ourselves was, how can NEW_FAIR_SLEEPERS
affect a benchmark that rarely sleeps, or not at all?

The answer was, it's not affecting the benchmark processes themselves, but some
noise going on in the background. When I was first able to reproduce this on my
workstation, it was knotify4 running in the background - using hardly any CPU,
but getting woken up ~1000 times a second. Don't ask me what it's doing :-)

So I sat down and reproduced this; the most recent iteration of the test program
is courtesy of Andreas Gruenbacher (see below).

This program spawns a number of processes that just spin in a loop. It also spawns
a single process that wakes up 1000 times a second. Every second, it computes the
average time slice per process (utime / number of involuntary context switches),
and prints out the overall average time slice and average utime.

While running this program, you can conveniently enable or disable fair sleepers.
When I do this on my test machine (no desktop in the background this time :-),
I see this:

./slice 16
    avg slice:  1.12 utime: 216263.187500
    avg slice:  0.25 utime: 125507.687500
    avg slice:  0.31 utime: 125257.937500
    avg slice:  0.31 utime: 125507.812500
    avg slice:  0.12 utime: 124507.875000
    avg slice:  0.38 utime: 124757.687500
    avg slice:  0.31 utime: 125508.000000
    avg slice:  0.44 utime: 125757.750000
    avg slice:  2.00 utime: 128258.000000
 ------ here I turned off new_fair_sleepers ----
    avg slice: 10.25 utime: 137008.500000
    avg slice:  9.31 utime: 139008.875000
    avg slice: 10.50 utime: 141508.687500
    avg slice:  9.44 utime: 139258.750000
    avg slice: 10.31 utime: 140008.687500
    avg slice:  9.19 utime: 139008.625000
    avg slice: 10.00 utime: 137258.625000
    avg slice: 10.06 utime: 135258.562500
    avg slice:  9.62 utime: 138758.562500

As you can see, the average time slice is *extremely* low with new fair
sleepers enabled. Turning it off, we get ~10ms time slices, and a
performance that is roughly 10% higher. It looks like this kind of
"silly time slice syndrome" is what is really eating performance here.

After staring at place_entity and watching the processes' vruntime for a
while, I think what's happening is this.

With fair sleepers turned off, a process that just got woken up will
get the vruntime of the process that's leftmost in the rbtree, and will
thus be placed to the right of the current task.

However, with fair_sleepers enabled, a newly woken up process
will retain its old vruntime as long as it's less than sched_latency
in the past, and thus it will be placed to the very left in the rbtree.
Since a task that is mostly sleeping will never accrue vruntime at
the same rate a cpu-bound task does, it will always preempt any
running task almost immediately after it's scheduled.

Does this make sense?

Any insight you can offer here is greatly appreciated!

Thanks,
Olaf
-- 
Neo didn't bring down the Matrix. SOA did.
				--soafacts.com


[-- Attachment #2: slice.c --]
[-- Type: text/x-csrc, Size: 2106 bytes --]

/*
 * from agruen - 2009 05 28
 *
 * Test time slices given to each process
 */
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/stat.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#undef WITH_PER_PROCESS_SLICES

struct msg {
	long mtype;
	long nivcsw;
	long utime;
};

int msqid;

/* SIGALRM handler: report the utime and involuntary context switch
 * deltas since the last report to the parent via the message queue.
 * (msgsnd() is not async-signal-safe, but it's good enough here.) */
void report_to_parent(int dummy) {
	static long old_nivcsw, old_utime;
	long utime;
	struct rusage usage;
	struct msg msg;

	getrusage(RUSAGE_SELF, &usage);

	utime = usage.ru_utime.tv_sec * 1000000 + usage.ru_utime.tv_usec;

	msg.mtype = 1;
	msg.nivcsw = usage.ru_nivcsw - old_nivcsw;
	msg.utime = utime - old_utime;
	msgsnd(msqid, &msg, sizeof(msg) - sizeof(long), 0);

	old_nivcsw = usage.ru_nivcsw;
	old_utime = utime;
}

void worker(void) {
	struct sigaction sa;

	sa.sa_handler = report_to_parent;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGALRM, &sa, NULL);

	while (1)
		/* do nothing */ ;
}

void sleeper(void) {

	while (1) {
		usleep(1000);
	}
}

int main(int argc, char *argv[])
{
	int n, nproc;
	pid_t *pid;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <number-of-processes>\n", argv[0]);
		return 1;
	}
	
	msqid = msgget(IPC_PRIVATE, S_IRUSR | S_IWUSR);

	nproc = atoi(argv[1]);
	pid = malloc(nproc * sizeof(pid_t));
	for (n = 0; n < nproc; n++) {
		pid[n] = fork();
		if (pid[n] == 0)
			worker();
	}

	/* Fork sleeper(s) */
	for (n = 0; n < (nproc + 7)/8; n++)
		if (fork() == 0)
			sleeper();
	for(;;) {
		double avg_slice = 0;
		long avg_utime = 0;

		sleep(1);
		for (n = 0; n < nproc; n++)
			kill(pid[n], SIGALRM);
		for (n = 0; n < nproc; n++) {
			struct msg msg;
			double slice;

			msgrcv(msqid, &msg, sizeof(msg) - sizeof(long), 0, 0);
			/* average time slice in milliseconds */
			slice = 0.001 * msg.utime /
				        (msg.nivcsw ? msg.nivcsw : 1);
#ifdef WITH_PER_PROCESS_SLICES
			printf("%6.1f ", slice);
#endif

			avg_slice += slice;
			avg_utime += msg.utime;
		}

		printf("    avg slice: %5.2f utime: %f\n",
				(double) avg_slice / nproc,
				(double) avg_utime / nproc);
	}

	return 0;
}

Thread overview: 6+ messages
2009-05-28 13:02 Olaf Kirch [this message]
2009-05-28 15:49 ` CFS Performance Issues David Newall
2009-05-28 18:20   ` Olaf Kirch
2009-05-28 18:43     ` David Newall
2009-05-28 20:31 ` Peter Zijlstra
2009-05-30 11:18   ` Ingo Molnar
