* [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
@ 2016-01-27 14:52 Jan Stancek
2016-01-28 15:55 ` Jan Stancek
0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2016-01-27 14:52 UTC (permalink / raw)
To: alex.shi, guz.fnst, peterz, mingo, jolsa, riel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1490 bytes --]
Hello,
pthread_cond_wait_1/2 [1] is rarely failing for me on 4.5.0-rc1,
on x86_64 KVM guest with 2 CPUs.
This test [1]:
- spawns 2 SCHED_RR threads
- first thread with higher priority sets alarm for 2 seconds and blocks on condition
- second thread with lower priority is busy looping for 5 seconds
- after 2 seconds alarm signal arrives and handler signals condition
- high priority thread should resume running
But rarely I see that high priority thread doesn't resume running until
low priority thread completes its 5 second busy loop.
Looking at traces (short version attached, long version at [2]),
I see that after 2 seconds scheduler tries to wake up main thread, but it
appears to do that on same CPU where SCHED_RR low prio thread is running,
so nothing happens. Then scheduler makes numerous balance attempts,
but main thread is not balanced to idle CPU.
My guess is this started with following commit, which changed weighted_cpuload():
commit b92486cbf2aa230d00f160664858495c81d2b37b
Author: Alex Shi <alex.shi@intel.com>
Date: Thu Jun 20 10:18:50 2013 +0800
sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task
I could reproduce it with HEAD set at above commit, I couldn't reproduce it
with 3.10 kernel so far.
Regards,
Jan
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/open_posix_testsuite/functional/threads/condvar/pthread_cond_wait_1.c
[2] http://jan.stancek.eu/tmp/pthread_cond_wait_failure/sched-trace1.tar.bz2
[-- Attachment #2: sched-trace1-short-v4.5-rc1 --]
[-- Type: text/plain, Size: 5410 bytes --]
# test starts running
bash-5626 [001] .... 294.609021: sched_process_fork: comm=bash pid=5626 child_comm=bash child_pid=10005
bash-5626 [001] d... 294.609023: sched_migrate_task: comm=bash pid=10005 prio=120 orig_cpu=1 dest_cpu=0
bash-5626 [001] d... 294.609029: sched_wakeup_new: comm=bash pid=10005 prio=120 target_cpu=000
<idle>-0 [000] d... 294.609073: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=bash next_pid=10005 next_prio=120
condvar_pthread-10005 [000] d.h. 294.609355: sched_stat_runtime: comm=bash pid=10005 runtime=330015 [ns] vruntime=937586860 [ns]
condvar_pthread-10005 [000] d... 294.610222: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=866662 [ns] vruntime=938453522 [ns]
condvar_pthread-10005 [000] d... 294.610224: sched_migrate_task: comm=condvar_pthread pid=10005 prio=120 orig_cpu=0 dest_cpu=0
condvar_pthread-10005 [000] .... 294.610231: sched_process_fork: comm=condvar_pthread pid=10005 child_comm=condvar_pthread child_pid=10006
# high prio thread (10006) is started
condvar_pthread-10005 [000] d... 294.610232: sched_migrate_task: comm=condvar_pthread pid=10006 prio=120 orig_cpu=0 dest_cpu=0
condvar_pthread-10005 [000] d... 294.610233: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=11929 [ns] vruntime=938465451 [ns]
condvar_pthread-10005 [000] d... 294.610238: sched_wakeup_new: comm=condvar_pthread pid=10006 prio=120 target_cpu=000
condvar_pthread-10005 [000] d... 294.610267: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=34190 [ns] vruntime=938499641 [ns]
condvar_pthread-10005 [000] d... 294.610268: sched_migrate_task: comm=condvar_pthread pid=10005 prio=120 orig_cpu=0 dest_cpu=0
condvar_pthread-10005 [000] .... 294.610270: sched_process_fork: comm=condvar_pthread pid=10005 child_comm=condvar_pthread child_pid=10007
# low prio thread (10007) is started
condvar_pthread-10005 [000] d... 294.610271: sched_migrate_task: comm=condvar_pthread pid=10007 prio=120 orig_cpu=0 dest_cpu=1
condvar_pthread-10005 [000] d... 294.610277: sched_wakeup_new: comm=condvar_pthread pid=10007 prio=120 target_cpu=001
condvar_pthread-10005 [000] d... 294.610298: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=31353 [ns] vruntime=938530994 [ns]
condvar_pthread-10005 [000] d... 294.610302: sched_switch: prev_comm=condvar_pthread prev_pid=10005 prev_prio=120 prev_state=S ==> next_comm=condvar_pthread next_pid=10006 next_prio=120
condvar_pthread-10006 [000] d... 294.610311: sched_stat_runtime: comm=condvar_pthread pid=10006 runtime=12355 [ns] vruntime=944465872 [ns]
<idle>-0 [001] d... 294.610318: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=condvar_pthread next_pid=10007 next_prio=120
condvar_pthread-10007 [001] d... 294.610332: sched_stat_runtime: comm=condvar_pthread pid=10007 runtime=59910 [ns] vruntime=581696119 [ns]
condvar_pthread-10006 [000] d... 294.610340: sched_switch: prev_comm=condvar_pthread prev_pid=10006 prev_prio=89 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
condvar_pthread-10007 [001] d... 296.571389: sched_switch: prev_comm=condvar_pthread prev_pid=10007 prev_prio=94 prev_state=R ==> next_comm=watchdog/1 next_pid=13 next_prio=0
watchdog/1-13 [001] d... 296.571393: sched_migrate_task: comm=condvar_pthread pid=10007 prio=94 orig_cpu=1 dest_cpu=0
<idle>-0 [000] d... 296.571413: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=condvar_pthread next_pid=10007 next_prio=94
# after 2 seconds, alarm signal tries to wake up main thread on same cpu where low prio thread is running
condvar_pthread-10007 [000] d.h. 296.610352: sched_wakeup: comm=condvar_pthread pid=10005 prio=120 target_cpu=000
condvar_pthread-10007 [000] d... 299.610446: sched_switch: prev_comm=condvar_pthread prev_pid=10007 prev_prio=94 prev_state=x ==> next_comm=tuned next_pid=1183 next_prio=120
xfsaild/dm-0-424 [000] d... 299.610582: sched_switch: prev_comm=xfsaild/dm-0 prev_pid=424 prev_prio=120 prev_state=S ==> next_comm=condvar_pthread next_pid=10005 next_prio=120
# main thread runs only after low prio thread is done
condvar_pthread-10005 [000] dN.. 299.610596: sched_wakeup: comm=condvar_pthread pid=10006 prio=89 target_cpu=000
condvar_pthread-10005 [000] dN.. 299.610597: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=13944 [ns] vruntime=938544938 [ns]
condvar_pthread-10005 [000] d... 299.610598: sched_switch: prev_comm=condvar_pthread prev_pid=10005 prev_prio=120 prev_state=R ==> next_comm=condvar_pthread next_pid=10006 next_prio=89
condvar_pthread-10006 [000] d... 299.610614: sched_switch: prev_comm=condvar_pthread prev_pid=10006 prev_prio=89 prev_state=x ==> next_comm=condvar_pthread next_pid=10005 next_prio=120
condvar_pthread-10005 [000] d.h. 299.611591: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=976745 [ns] vruntime=939521683 [ns]
condvar_pthread-10005 [000] d... 299.612029: sched_stat_runtime: comm=condvar_pthread pid=10005 runtime=438310 [ns] vruntime=939959993 [ns]
condvar_pthread-10005 [000] d... 299.612031: sched_switch: prev_comm=condvar_pthread prev_pid=10005 prev_prio=120 prev_state=x ==> next_comm=rcuos/0 next_pid=9 next_prio=120
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-27 14:52 [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds Jan Stancek
@ 2016-01-28 15:55 ` Jan Stancek
2016-01-28 17:49 ` Peter Zijlstra
0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2016-01-28 15:55 UTC (permalink / raw)
To: alex.shi, guz.fnst, peterz, mingo, jolsa, riel, linux-kernel; +Cc: jstancek
[-- Attachment #1: Type: text/plain, Size: 2927 bytes --]
On 01/27/2016 03:52 PM, Jan Stancek wrote:
> Hello,
>
> pthread_cond_wait_1/2 [1] is rarely failing for me on 4.5.0-rc1,
> on x86_64 KVM guest with 2 CPUs.
>
> This test [1]:
> - spawns 2 SCHED_RR threads
> - first thread with higher priority sets alarm for 2 seconds and blocks on condition
> - second thread with lower priority is busy looping for 5 seconds
> - after 2 seconds alarm signal arrives and handler signals condition
> - high priority thread should resume running
I have slightly modified testcase, so it will finish immediately when high prio
thread is done. And also to allow it to compile outside of openposix testsuite.
Testcase is attached. I'm running it in following way:
gcc -O2 -pthread pthread_cond_wait_1.c
while [ True ]; do
time ./a.out
sleep 1
done
for couple thousand iterations. About half of those are
on system booted with init=/bin/bash.
>
> But rarely I see that high priority thread doesn't resume running until
> low priority thread completes its 5 second busy loop.
>
> Looking at traces (short version attached, long version at [2]),
> I see that after 2 seconds scheduler tries to wake up main thread, but it
> appears to do that on same CPU where SCHED_RR low prio thread is running,
> so nothing happens. Then scheduler makes numerous balance attempts,
> but main thread is not balanced to idle CPU.
>
> My guess is this started with following commit, which changed weighted_cpuload():
> commit b92486cbf2aa230d00f160664858495c81d2b37b
> Author: Alex Shi <alex.shi@intel.com>
> Date: Thu Jun 20 10:18:50 2013 +0800
> sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task
Here are some numbers gathered from kernels with HEAD at b92486c and
previous commit 83dfd52. System is 2 CPU KVM guest.
Each iteration measures how long it took for testcase to finish.
Ideally it should take about 2 seconds.
1. HEAD at 83dfd52 sched: Update cpu load after task_tick
finish time [s] | iterations
----------------------------------
[ 2, 2.2] | 3134
[ 2.2, 2.5] | 18
[ 2.5, 3] | 0
[ 3, 4] | 0
[ 4, 5] | 0
[ 5, 999] | 0
2. HEAD at b92486c sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task
finish time [s] | iterations
----------------------------------
[ 2, 2.2] | 1617
[ 2.2, 2.5] | 38
[ 2.5, 3] | 727
[ 3, 4] | 399
[ 4, 5] | 17
[ 5, 999] | 11
Regards,
Jan
>
> I could reproduce it with HEAD set at above commit, I couldn't reproduce it
> with 3.10 kernel so far.
>
> Regards,
> Jan
>
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/open_posix_testsuite/functional/threads/condvar/pthread_cond_wait_1.c
> [2] http://jan.stancek.eu/tmp/pthread_cond_wait_failure/sched-trace1.tar.bz2
>
[-- Attachment #2: pthread_cond_wait_1.c --]
[-- Type: text/plain, Size: 6064 bytes --]
/*
* Copyright (c) 2004, QUALCOMM Inc. All rights reserved.
* Created by: abisain REMOVE-THIS AT qualcomm DOT com
* This file is licensed under the GPL license. For the full content
* of this license, see the COPYING file at the top level of this
* source tree.
* Test that pthread_cond_signal()
* shall wakeup a high priority thread even when a low priority thread
* is running
* Steps:
* 1. Create a condition variable
* 2. Create a high priority thread and make it wait on the cond
* 3. Create a low priority thread and let it busy-loop
* 4. Signal the cond in a signal handler and check that high
* priority thread got woken up
*
*/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <time.h>
#define TEST "5-1"
#define AREA "scheduler"
#define ERROR_PREFIX "unexpected error: " AREA " " TEST ": "
#define HIGH_PRIORITY 10
#define LOW_PRIORITY 5
#define RUNTIME 5
#define POLICY SCHED_RR
#define PTS_PASS 0
#define PTS_FAIL 1
#define PTS_UNRESOLVED 2
/* mutex required by the cond variable */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
/* condition variable that threads block on*/
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
/* Flags that the threads use to indicate events */
volatile int woken_up = 0;
volatile int low_done = 0;
/* Signal handler that handle the ALRM and wakes up
* the high priority thread
*/
void signal_handler(int sig)
{
(void) sig;
if (pthread_cond_signal(&cond) != 0) {
printf(ERROR_PREFIX "pthread_cond_signal\n");
exit(PTS_UNRESOLVED);
}
}
/* Utility function to find difference between two time values */
float timediff(struct timespec t2, struct timespec t1)
{
float diff = t2.tv_sec - t1.tv_sec;
diff += (t2.tv_nsec - t1.tv_nsec) / 1000000000.0;
return diff;
}
void *hi_priority_thread(void *tmp)
{
struct sched_param param;
int policy;
int rc = 0;
(void) tmp;
param.sched_priority = HIGH_PRIORITY;
rc = pthread_setschedparam(pthread_self(), POLICY, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_getschedparam(pthread_self(), &policy, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_getschedparam\n");
exit(PTS_UNRESOLVED);
}
if ((policy != POLICY) || (param.sched_priority != HIGH_PRIORITY)) {
printf("Error: the policy or priority not correct\n");
exit(PTS_UNRESOLVED);
}
/* Install a signal handler for ALRM */
if (signal(SIGALRM, signal_handler) != 0) {
perror(ERROR_PREFIX "signal:");
exit(PTS_UNRESOLVED);
}
/* acquire the mutex */
rc = pthread_mutex_lock(&mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_mutex_lock\n");
exit(PTS_UNRESOLVED);
}
/* Setup an alarm to go off in 2 seconds */
alarm(2);
/* Block, to be woken up by the signal handler */
rc = pthread_cond_wait(&cond, &mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_cond_wait\n");
exit(PTS_UNRESOLVED);
}
/* This variable is unprotected because the scheduling removes
* the contention
*/
if (low_done != 1)
woken_up = 1;
rc = pthread_mutex_unlock(&mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_mutex_unlock\n");
exit(PTS_UNRESOLVED);
}
return NULL;
}
void *low_priority_thread(void *tmp)
{
struct timespec start_time, current_time;
struct sched_param param;
int policy;
int rc = 0;
(void) tmp;
param.sched_priority = LOW_PRIORITY;
rc = pthread_setschedparam(pthread_self(), POLICY, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_getschedparam(pthread_self(), &policy, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_getschedparam\n");
exit(PTS_UNRESOLVED);
}
if ((policy != POLICY) || (param.sched_priority != LOW_PRIORITY)) {
printf("Error: the policy or priority not correct\n");
exit(PTS_UNRESOLVED);
}
/* grab the start time and busy loop for 5 seconds */
clock_gettime(CLOCK_REALTIME, &start_time);
while (1 && !woken_up) {
clock_gettime(CLOCK_REALTIME, ¤t_time);
if (timediff(current_time, start_time) > RUNTIME)
break;
}
low_done = 1;
return NULL;
}
int main()
{
pthread_t high_id, low_id;
pthread_attr_t high_attr, low_attr;
struct sched_param param;
int rc = 0;
/* Create the higher priority thread */
rc = pthread_attr_init(&high_attr);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_init\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_attr_setschedpolicy(&high_attr, POLICY);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_setschedpolicy\n");
exit(PTS_UNRESOLVED);
}
param.sched_priority = HIGH_PRIORITY;
rc = pthread_attr_setschedparam(&high_attr, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_create(&high_id, &high_attr, hi_priority_thread, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_create\n");
exit(PTS_UNRESOLVED);
}
/* Create the low priority thread */
rc = pthread_attr_init(&low_attr);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_init\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_attr_setschedpolicy(&low_attr, POLICY);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_setschedpolicy\n");
exit(PTS_UNRESOLVED);
}
param.sched_priority = LOW_PRIORITY;
rc = pthread_attr_setschedparam(&low_attr, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_create\n");
exit(PTS_UNRESOLVED);
}
/* Wait for the threads to exit */
rc = pthread_join(high_id, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_join\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_join(low_id, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_join\n");
exit(PTS_UNRESOLVED);
}
/* Check the result */
if (woken_up == 0) {
printf("Test FAILED: high priority was not woken up\\n");
exit(PTS_FAIL);
}
printf("Test PASSED\n");
exit(PTS_PASS);
}
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-28 15:55 ` Jan Stancek
@ 2016-01-28 17:49 ` Peter Zijlstra
2016-01-28 18:43 ` Jan Stancek
0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2016-01-28 17:49 UTC (permalink / raw)
To: Jan Stancek; +Cc: alex.shi, guz.fnst, mingo, jolsa, riel, linux-kernel
On Thu, Jan 28, 2016 at 04:55:02PM +0100, Jan Stancek wrote:
> On 01/27/2016 03:52 PM, Jan Stancek wrote:
> > Hello,
> >
> > pthread_cond_wait_1/2 [1] is rarely failing for me on 4.5.0-rc1,
> > on x86_64 KVM guest with 2 CPUs.
> >
> > This test [1]:
> > - spawns 2 SCHED_RR threads
> > - first thread with higher priority sets alarm for 2 seconds and blocks on condition
> > - second thread with lower priority is busy looping for 5 seconds
> > - after 2 seconds alarm signal arrives and handler signals condition
> > - high priority thread should resume running
>
> I have slightly modified testcase, so it will finish immediately when high prio
> thread is done. And also to allow it to compile outside of openposix testsuite.
Yeah, I 'fixed' the testcase too.
So I've had it run for almost 2 hours without a single fail. I've
hot-plugged my cpu count down to 2.
How long should I have to wait for a fail?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-28 17:49 ` Peter Zijlstra
@ 2016-01-28 18:43 ` Jan Stancek
2016-01-29 10:15 ` Peter Zijlstra
0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2016-01-28 18:43 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: alex shi, guz fnst, mingo, jolsa, riel, linux-kernel
----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Jan Stancek" <jstancek@redhat.com>
> Cc: "alex shi" <alex.shi@intel.com>, "guz fnst" <guz.fnst@cn.fujitsu.com>, mingo@redhat.com, jolsa@redhat.com,
> riel@redhat.com, linux-kernel@vger.kernel.org
> Sent: Thursday, 28 January, 2016 6:49:03 PM
> Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
>
> On Thu, Jan 28, 2016 at 04:55:02PM +0100, Jan Stancek wrote:
> > On 01/27/2016 03:52 PM, Jan Stancek wrote:
> > > Hello,
> > >
> > > pthread_cond_wait_1/2 [1] is rarely failing for me on 4.5.0-rc1,
> > > on x86_64 KVM guest with 2 CPUs.
> > >
> > > This test [1]:
> > > - spawns 2 SCHED_RR threads
> > > - first thread with higher priority sets alarm for 2 seconds and blocks
> > > on condition
> > > - second thread with lower priority is busy looping for 5 seconds
> > > - after 2 seconds alarm signal arrives and handler signals condition
> > > - high priority thread should resume running
> >
> > I have slightly modified testcase, so it will finish immediately when high
> > prio
> > thread is done. And also to allow it to compile outside of openposix
> > testsuite.
>
> Yeah, I 'fixed' the testcase too.
>
> So I've had it run for almost 2 hours without a single fail. I've
> hot-plugged my cpu count down to 2.
I can try that too. I'm mostly seeing this on s390 and x86_64 KVM guests,
both have 2 CPUs.
Have you noticed if iteration times vary or if they stay consitently
at ~2 seconds?
>
> How long should I have to wait for a fail?
It's about 1000-2000 iterations for me, which I think you covered
by now in those 2 hours.
Regards,
Jan
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-28 18:43 ` Jan Stancek
@ 2016-01-29 10:15 ` Peter Zijlstra
2016-01-29 10:33 ` Jan Stancek
0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2016-01-29 10:15 UTC (permalink / raw)
To: Jan Stancek; +Cc: alex shi, guz fnst, mingo, jolsa, riel, linux-kernel
On Thu, Jan 28, 2016 at 01:43:13PM -0500, Jan Stancek wrote:
> > How long should I have to wait for a fail?
>
> It's about 1000-2000 iterations for me, which I think you covered
> by now in those 2 hours.
So I've been running:
while ! ./pthread_cond_wait_1 ; do sleep 1; done
overnight on the machine, and have yet to hit a wobbly -- that is, its
still running.
Also note that I don't think failing this test is a bug per se.
Undesirable maybe, but within spec, since SIGALRM is process wide, so it
being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
no timeliness guarantees.
That said; if I could reliably reproduce I'd have a go at fixing this, I
suspect there's a 'fun' problem at the bottom of this.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-29 10:15 ` Peter Zijlstra
@ 2016-01-29 10:33 ` Jan Stancek
2016-02-08 13:40 ` Jan Stancek
0 siblings, 1 reply; 7+ messages in thread
From: Jan Stancek @ 2016-01-29 10:33 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: alex shi, guz fnst, mingo, jolsa, riel, linux-kernel
----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Jan Stancek" <jstancek@redhat.com>
> Cc: "alex shi" <alex.shi@intel.com>, "guz fnst" <guz.fnst@cn.fujitsu.com>, mingo@redhat.com, jolsa@redhat.com,
> riel@redhat.com, linux-kernel@vger.kernel.org
> Sent: Friday, 29 January, 2016 11:15:22 AM
> Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
>
> On Thu, Jan 28, 2016 at 01:43:13PM -0500, Jan Stancek wrote:
> > > How long should I have to wait for a fail?
> >
> > It's about 1000-2000 iterations for me, which I think you covered
> > by now in those 2 hours.
>
> So I've been running:
>
> while ! ./pthread_cond_wait_1 ; do sleep 1; done
>
> overnight on the machine, and have yet to hit a wobbly -- that is, its
> still running.
I have seen similar result.
Then, instead of turning CPUs off, I spawned more low prio threads to scale
with number of CPUs on system:
@@ -213,10 +213,14 @@
printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
exit(PTS_UNRESOLVED);
}
- rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
- if (rc != 0) {
- printf(ERROR_PREFIX "pthread_create\n");
- exit(PTS_UNRESOLVED);
+
+ int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+ for (i = 0; i < ncpus - 1; i++) {
+ rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
+ if (rc != 0) {
+ printf(ERROR_PREFIX "pthread_create\n");
+ exit(PTS_UNRESOLVED);
+ }
and let this ran on 3 bare metal x86 systems over night (v4.5-rc1). It
failed on 2 systems (12 and 24 CPUs) with 1:1000 chance, it never failed
on 3rd one (4 CPUs).
>
> Also note that I don't think failing this test is a bug per se.
> Undesirable maybe, but within spec, since SIGALRM is process wide, so it
> being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
> no timeliness guarantees.
>
> That said; if I could reliably reproduce I'd have a go at fixing this, I
> suspect there's a 'fun' problem at the bottom of this.
Thanks for trying, I'll see if I can find some more reliable way.
Regards,
Jan
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
2016-01-29 10:33 ` Jan Stancek
@ 2016-02-08 13:40 ` Jan Stancek
0 siblings, 0 replies; 7+ messages in thread
From: Jan Stancek @ 2016-02-08 13:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: alex shi, guz fnst, mingo, jolsa, riel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2600 bytes --]
On 01/29/2016 11:33 AM, Jan Stancek wrote:
>>
>> Also note that I don't think failing this test is a bug per se.
>> Undesirable maybe, but within spec, since SIGALRM is process wide, so it
>> being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
>> no timeliness guarantees.
>>
>> That said; if I could reliably reproduce I'd have a go at fixing this, I
>> suspect there's a 'fun' problem at the bottom of this.
>
> Thanks for trying, I'll see if I can find some more reliable way.
I think I have found a more reliably way, however it requires an older
stable kernel: 3.12.53 up to 4.1.17.
Consider following scenario:
- all tasks on system have RT sched class
- main thread of reproducer becomes the only SCHED_OTHER task on system
- when alarm(2) expires, main thread is woken up on cpu that is occupied by
busy looping RT thread (low_priority_thread)
- because main thread was sleeping for 2 seconds, its load has decayed to 0
- the only chance for main thread to run is if it gets balanced to idle CPU
- task_tick_fair() doesn't run, there is RT task running on this CPU
- main thread is on cfs run queue but its load stays 0
- load balancer never sees this CPU (group) as busy
Attached is reproducer and script, which tries to trigger scenario above.
I can reproduce it with 4.1.17 on baremetal 4 CPU x86_64 with about 1:50 chance.
In this setup failure state persists for a long time, perhaps indefinitely.
I tried extending RUNTIME to 10 minutes, main thread still wouldn't run.
One more clue: I could work around this issue if I forced an update_entity_load_avg()
on sched_entities that have not been updated for some time, as part of
periodic rebalance_domains() call.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7c1d28..1b5fe80 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5264,6 +5264,7 @@ static void update_blocked_averages(int cpu)
struct rq *rq = cpu_rq(cpu);
struct cfs_rq *cfs_rq;
unsigned long flags;
+ struct rb_node *rb;
raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
@@ -5281,6 +5282,19 @@ static void update_blocked_averages(int cpu)
}
raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ cfs_rq = &(cpu_rq(cpu)->cfs);
+ for (rb = rb_first_postorder(&cfs_rq->tasks_timeline); rb; rb = rb_next_postorder(rb)) {
+ struct sched_entity *se = rb_entry(rb, struct sched_entity, run_node);
+
+ // Task on rq has not been updated for 500ms :-(
+ if ((cfs_rq_clock_task(cfs_rq) - se->avg.last_runnable_update) > 500L * (1 << 20))
+ update_entity_load_avg(se, 1);
+ }
}
/*
Regards,
Jan
[-- Attachment #2: pthread_cond_wait_1_v3.c --]
[-- Type: text/plain, Size: 5201 bytes --]
/*
* reproducer v3 for:
* [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
*
* Based on LTP's pthread_cond_wait_1.c
*
*/
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <inttypes.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>
#define ERROR_PREFIX "unexpected error: "
#define HIGH_PRIORITY 10
#define LOW_PRIORITY 5
#define RUNTIME 5
#define POLICY SCHED_RR
#define PTS_PASS 0
#define PTS_FAIL 1
#define PTS_UNRESOLVED 2
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
/* Flags that the threads use to indicate events */
volatile int woken_up = 0;
volatile int low_done = 0;
/* Signal handler that handle the ALRM and wakes up
* the high priority thread
*/
void signal_handler(int sig)
{
(void) sig;
if (pthread_cond_signal(&cond) != 0) {
printf(ERROR_PREFIX "pthread_cond_signal\n");
exit(PTS_UNRESOLVED);
}
}
/* Utility function to find difference between two time values */
float timediff(struct timespec t2, struct timespec t1)
{
float diff = t2.tv_sec - t1.tv_sec;
diff += (t2.tv_nsec - t1.tv_nsec) / 1000000000.0;
return diff;
}
void *hi_priority_thread(void *tmp)
{
struct sched_param param;
int policy;
int rc = 0;
(void) tmp;
param.sched_priority = HIGH_PRIORITY;
rc = pthread_setschedparam(pthread_self(), POLICY, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_getschedparam(pthread_self(), &policy, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_getschedparam\n");
exit(PTS_UNRESOLVED);
}
if ((policy != POLICY) || (param.sched_priority != HIGH_PRIORITY)) {
printf("Error: the policy or priority not correct\n");
exit(PTS_UNRESOLVED);
}
/* Install a signal handler for ALRM */
if (signal(SIGALRM, signal_handler) != 0) {
perror(ERROR_PREFIX "signal:");
exit(PTS_UNRESOLVED);
}
/* acquire the mutex */
rc = pthread_mutex_lock(&mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_mutex_lock\n");
exit(PTS_UNRESOLVED);
}
/* Setup an alarm to go off in 2 seconds */
alarm(2);
/* Block, to be woken up by the signal handler */
rc = pthread_cond_wait(&cond, &mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_cond_wait\n");
exit(PTS_UNRESOLVED);
}
/* This variable is unprotected because the scheduling removes
* the contention
*/
if (low_done != 1)
woken_up = 1;
rc = pthread_mutex_unlock(&mutex);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_mutex_unlock\n");
exit(PTS_UNRESOLVED);
}
return NULL;
}
void *low_priority_thread(void *tmp)
{
struct timespec start_time, current_time;
struct sched_param param;
int policy;
cpu_set_t cpuset;
int rc = 0, slept_times = 0;
float slept_for = 0;
uintptr_t tnum = (uintptr_t)tmp;
param.sched_priority = LOW_PRIORITY;
rc = pthread_setschedparam(pthread_self(), POLICY, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setschedparam\n");
exit(PTS_UNRESOLVED);
}
rc = pthread_getschedparam(pthread_self(), &policy, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_getschedparam\n");
exit(PTS_UNRESOLVED);
}
if ((policy != POLICY) || (param.sched_priority != LOW_PRIORITY)) {
printf("Error: the policy or priority not correct\n");
exit(PTS_UNRESOLVED);
}
CPU_ZERO(&cpuset);
CPU_SET(tnum, &cpuset);
rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setaffinity_np\n");
exit(PTS_UNRESOLVED);
}
/* grab the start time and busy loop for 5 seconds */
clock_gettime(CLOCK_REALTIME, &start_time);
while (!woken_up && !low_done) {
clock_gettime(CLOCK_REALTIME, ¤t_time);
if (timediff(current_time, start_time) > RUNTIME)
break;
}
low_done = 1;
return NULL;
}
int main()
{
pthread_t high_id, *low_id, paused_id;
struct sched_param param;
int rc = 0;
int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
low_id = malloc(ncpus * sizeof(pthread_t));
/* high prio thread */
rc = pthread_create(&high_id, NULL, hi_priority_thread, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_create\n");
exit(PTS_UNRESOLVED);
}
/* low prio thread on each cpu except last one */
for (i = 0; i < ncpus - 1; i++) {
uintptr_t tnum = i;
rc = pthread_create(&low_id[i], NULL, low_priority_thread, (void *)tnum);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_create\n");
exit(PTS_UNRESOLVED);
}
}
param.sched_priority = 0;
rc = pthread_setschedparam(pthread_self(), SCHED_OTHER, ¶m);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_setschedparam\n");
exit(PTS_UNRESOLVED);
}
/* Wait for the threads to exit */
rc = pthread_join(high_id, NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_join\n");
exit(PTS_UNRESOLVED);
}
for (i = 0; i < ncpus - 1; i++) {
rc = pthread_join(low_id[i], NULL);
if (rc != 0) {
printf(ERROR_PREFIX "pthread_join\n");
exit(PTS_UNRESOLVED);
}
}
if (woken_up == 0) {
printf("Test FAILED: high priority was not woken up\n");
exit(PTS_FAIL);
}
printf("Test PASSED\n");
exit(PTS_PASS);
}
[-- Attachment #3: reproduce_v3.sh --]
[-- Type: application/x-shellscript, Size: 260 bytes --]
^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2016-02-08 13:40 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-27 14:52 [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds Jan Stancek
2016-01-28 15:55 ` Jan Stancek
2016-01-28 17:49 ` Peter Zijlstra
2016-01-28 18:43 ` Jan Stancek
2016-01-29 10:15 ` Peter Zijlstra
2016-01-29 10:33 ` Jan Stancek
2016-02-08 13:40 ` Jan Stancek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).