* [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
@ 2018-03-23 15:09 joe.korty
  2018-03-23 16:59 ` Julia Cartwright
  0 siblings, 1 reply; 7+ messages in thread
From: joe.korty @ 2018-03-23 15:09 UTC (permalink / raw)
  To: bigeasy; +Cc: mingo, tglx, rostedt, linux-rt-users, linux-kernel

I see the below kernel splat in 4.9-rt when I run a test program that
continually changes the affinity of some set of running pids:

   do not call blocking ops when !TASK_RUNNING; state=2 set at ...
      ...
      stop_one_cpu+0x60/0x80
      migrate_enable+0x21f/0x3e0
      rt_spin_unlock+0x2f/0x40
      prepare_to_wait+0x5c/0x80
      ...

The reason is that spin_unlock, write_unlock, and read_unlock call
migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers
that a migration is in order.  But sleeping in the unlock services is not
expected by most kernel developers, and where that counts most is in code
sequences like the following:

  set_current_state(TASK_UNINTERRUPTIBLE);
  spin_unlock(&s);
  schedule();
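
That sequence is the heart of the classic wait-loop pattern (a minimal
sketch with a hypothetical condition; real callers typically go through
prepare_to_wait()/finish_wait(), as in the splat above):

   spin_lock(&s);
   while (!cond) {
           /* mark ourselves sleeping before dropping the lock, so a
            * waker that sets cond and calls wake_up cannot get lost */
           set_current_state(TASK_UNINTERRUPTIBLE);
           spin_unlock(&s);
           schedule();             /* wakeup restores TASK_RUNNING */
           spin_lock(&s);
   }
   spin_unlock(&s);

If spin_unlock() itself blocks, a blocking operation runs while task
state is still TASK_UNINTERRUPTIBLE, which is exactly what the splat
above complains about.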

Rather than try to fix up such expectations, this patch merely restores
the non-sleeping nature of unlocking by the simple expedient of
deferring any actual migration-with-sleep operation to a future
migrate_enable, for as long as migrate_enable sees task state !=
TASK_RUNNING.  This works well because the kernel is littered with
lock/unlock sequences, so if the current unlock cannot migrate, another
unlock will come along shortly and do the deferred migration for us.

With this patch and a debug harness applied to 4.9.84-rt62, I get the
following results when a set of stress(1) threads have their affinities
randomly changed as fast as possible:

   [  111.331805] 6229(stress): migrate_enable() deferred.
   [  111.332112] 6229(stress): migrate_enable() deferral APPLIED.
   [  111.977399] 6231(stress): migrate_enable() deferred.
   [  111.977833] 6231(stress): migrate_enable() deferral APPLIED.
   [  135.240704] 6231(stress): migrate_enable() deferred.
   [  135.241164] 6231(stress): migrate_enable() deferral APPLIED.
   [  139.734616] 6229(stress): migrate_enable() deferred.
   [  139.736553] 6229(stress): migrate_enable() deferral APPLIED.
   [  156.441660] 6229(stress): migrate_enable() deferred.
   [  156.441709] 6229(stress): migrate_enable() deferral APPLIED.
   [  163.600191] 6231(stress): migrate_enable() deferred.
   [  163.600561] 6231(stress): migrate_enable() deferral APPLIED.
   [  181.593035] 6231(stress): migrate_enable() deferred.
   [  181.593136] 6231(stress): migrate_enable() deferral APPLIED.
   [  198.376387] 6229(stress): migrate_enable() deferred.
   [  198.376591] 6229(stress): migrate_enable() deferral APPLIED.
   [  202.355940] 6231(stress): migrate_enable() deferred.
   [  202.356939] 6231(stress): migrate_enable() deferral APPLIED.
   [  203.773164] 6229(stress): migrate_enable() deferred.
   [  203.773243] 6229(stress): migrate_enable() deferral APPLIED.
   ...

The test sequence used:

   stress --cpu 8 --io 1 --vm 2 &
   ./rotate $(ps -eLotid,comm | fgrep stress | awk '{print $1}')

The test program used:

cat <<EOF >rotate.c
/*
 * rotate.c
 * Randomly and rapidly change the affinity of some set of processes.
 * Does not return.
 * Limitations: Hard coded to use cpus 0-7.  May not work on systems
 * with more than 64 cpus.
 *
 * Usage: rotate pid pid ...
 *
 * Algorithm: In a loop, for each pid given, pick one random cpu from
 * the 8, and half the time OR in a second random pick.  Two distinct
 * cpus thus end up selected about 44% of the time (the second pick
 * collides with the first 1/8 of the time).
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>

int pids[100000];
int npids;
int ncpus = 8;

int main(int argc, char **argv)
{
	int pid, cpu, status;
	unsigned long mask = 0;
	int i;

	setbuf(stdout, NULL);

	argc--,argv++;
	npids = argc;
	for(i=0; i<npids; i++) {
		pids[i] = atoi(argv[i]);
		if (!pids[i]) {
			fprintf(stderr, "tid #%d is bad\n", i);
			return 1;
		}
	}

	for(;;) {
		for(i=0; i<npids; i++) {
			/* map rand() uniformly onto 0..ncpus-1; dividing by
			 * RAND_MAX+1 keeps cpu < ncpus when rand() == RAND_MAX */
			cpu = (unsigned long long)rand() * ncpus / ((unsigned long long)RAND_MAX + 1);
			mask = 1UL << cpu;
			if (rand() & 0x400) {	/* ~half the time, OR in a 2nd cpu */
				cpu = (unsigned long long)rand() * ncpus / ((unsigned long long)RAND_MAX + 1);
				mask |= 1UL << cpu;
			}
			pid = pids[i];
			status = sched_setaffinity(pid, sizeof(mask), (cpu_set_t *)&mask);
			if (status == -1) {
				fprintf(stderr, "sched_setaffinity(%d, 8, %016lx) failed\n", pid, mask);
				perror("sched_setaffinity");
				return 1;
			}
		}
	}
	return 0;
}
EOF

The debug patch used:

cat <<EOF >patches/debug
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1558,6 +1558,7 @@ struct task_struct {
 #ifdef CONFIG_PREEMPT_RT_FULL
 	int migrate_disable;
 	int migrate_disable_update;
+	int migrate_enable_deferred;
 # ifdef CONFIG_SCHED_DEBUG
 	int migrate_disable_atomic;
 # endif
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3468,6 +3468,12 @@ void migrate_enable(void)
 		struct rq *rq;
 		struct rq_flags rf;
 
+		if (p->migrate_enable_deferred) {
+			p->migrate_enable_deferred = 0;
+			pr_info("%d(%s): migrate_enable() deferral APPLIED.\n",
+				p->pid, p->comm);
+		}
+
 		rq = task_rq_lock(p, &rf);
 		update_rq_clock(rq);
 
@@ -3499,6 +3505,15 @@ void migrate_enable(void)
 			tlb_migrate_finish(p->mm);
 			return;
 		}
+	} else if (p->migrate_disable_update && p->state != TASK_RUNNING) {
+		if (p->migrate_enable_deferred)
+			pr_info("%d(%s): migrate_enable() deferred (again).\n",
+				p->pid, p->comm);
+		else {
+			pr_info("%d(%s): migrate_enable() deferred.\n",
+				p->pid, p->comm);
+			p->migrate_enable_deferred = 1;
+		}
 	}
 
 	unpin_current_cpu();
EOF

The rt patch sched-migrate-disable-handle-updated-task-mask-mg-di.patch
appears to have introduced this issue, around the 4.4-rt timeframe.

Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>

Index: b/kernel/sched/core.c
===================================================================
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3457,7 +3457,14 @@ void migrate_enable(void)
 	 */
 	p->migrate_disable = 0;
 
-	if (p->migrate_disable_update) {
+	/*
+	 * Do not apply the affinity update on this migrate_enable if the
+	 * task is preparing to go to sleep for some other reason (e.g.,
+	 * task state == TASK_INTERRUPTIBLE).  Instead defer the update to
+	 * a future migrate_enable that is called when task state is again
+	 * == TASK_RUNNING.
+	 */
+	if (p->migrate_disable_update && p->state == TASK_RUNNING) {
 		struct rq *rq;
 		struct rq_flags rf;
 

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-23 15:09 [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING joe.korty
@ 2018-03-23 16:59 ` Julia Cartwright
  2018-03-23 17:21   ` joe.korty
  0 siblings, 1 reply; 7+ messages in thread
From: Julia Cartwright @ 2018-03-23 16:59 UTC (permalink / raw)
  To: Joe Korty; +Cc: bigeasy, mingo, tglx, rostedt, linux-rt-users, linux-kernel

Hey Joe-

Thanks for the writeup.

On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.korty@concurrent-rt.com wrote:
> I see the below kernel splat in 4.9-rt when I run a test program that
> continually changes the affinity of some set of running pids:
> 
>    do not call blocking ops when !TASK_RUNNING; state=2 set at ...
>       ...
>       stop_one_cpu+0x60/0x80
>       migrate_enable+0x21f/0x3e0
>       rt_spin_unlock+0x2f/0x40
>       prepare_to_wait+0x5c/0x80
>       ...

This is clearly a problem.

> The reason is that spin_unlock, write_unlock, and read_unlock call
> migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers
> that a migration is in order.  But sleeping in the unlock services is not
> expected by most kernel developers,

I don't buy this, see below:

> and where that counts most is in code sequences like the following:
>
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   spin_unlock(&s);
>   schedule();

The analog in mainline is CONFIG_PREEMPT and the implicit
preempt_enable() in spin_unlock().  In this configuration, a kernel
developer should _absolutely_ expect their task to be suspended (and
potentially migrated), _regardless of the task state_ if there is a
preemption event on the CPU on which this task is executing.

Similarly, there is nothing _conceptually_ wrong on RT with migrating
on migrate_enable(), regardless of task state, if there is a
pending migration event.
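
In outline, the !RT CONFIG_PREEMPT path looks like this (a simplified
sketch, not the verbatim kernel source):

   static inline void spin_unlock(spinlock_t *lock)
   {
           do_raw_spin_unlock(lock);       /* release the lock */
           preempt_enable();               /* may reschedule right here */
   }

   #define preempt_enable()                                        \
   do {                                                            \
           if (unlikely(preempt_count_dec_and_test()))             \
                   __preempt_schedule();   /* task state preserved */ \
   } while (0)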

It's clear, however, that the mechanism used here is broken ...

   Julia

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-23 16:59 ` Julia Cartwright
@ 2018-03-23 17:21   ` joe.korty
  2018-03-23 17:42     ` Julia Cartwright
  2018-03-26 15:35     ` Steven Rostedt
  0 siblings, 2 replies; 7+ messages in thread
From: joe.korty @ 2018-03-23 17:21 UTC (permalink / raw)
  To: Julia Cartwright
  Cc: bigeasy, mingo, tglx, rostedt, linux-rt-users, linux-kernel

Hi Julia,
Thanks for the quick response!

On Fri, Mar 23, 2018 at 11:59:21AM -0500, Julia Cartwright wrote:
> Hey Joe-
> 
> Thanks for the writeup.
> 
> On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.korty@concurrent-rt.com wrote:
> > I see the below kernel splat in 4.9-rt when I run a test program that
> > continually changes the affinity of some set of running pids:
> > 
> >    do not call blocking ops when !TASK_RUNNING; state=2 set at ...
> >       ...
> >       stop_one_cpu+0x60/0x80
> >       migrate_enable+0x21f/0x3e0
> >       rt_spin_unlock+0x2f/0x40
> >       prepare_to_wait+0x5c/0x80
> >       ...
> 
> This is clearly a problem.
> 
> > The reason is that spin_unlock, write_unlock, and read_unlock call
> > migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers
> > that a migration is in order.  But sleeping in the unlock services is not
> > expected by most kernel developers,
> 
> I don't buy this, see below:
> 
> > and where that counts most is in code sequences like the following:
> >
> >   set_current_state(TASK_UNINTERRUPTIBLE);
> >   spin_unlock(&s);
> >   schedule();
> 
> The analog in mainline is CONFIG_PREEMPT and the implicit
> preempt_enable() in spin_unlock().  In this configuration, a kernel
> developer should _absolutely_ expect their task to be suspended (and
> potentially migrated), _regardless of the task state_ if there is a
> preemption event on the CPU on which this task is executing.
> 
> Similarly, there is nothing _conceptually_ wrong on RT with migrating
> on migrate_enable(), regardless of task state, if there is a
> pending migration event.

My understanding is that, in standard Linux and in rt, setting
task state to anything other than TASK_RUNNING in and of itself
blocks preemption.  A preemption is not really needed here,
as it is expected that there is a schedule() written in that
will shortly be executed.  And if an 'involuntary schedule'
(i.e., preemption) were allowed to occur between the task
state set and the schedule(), that would change the task
state back to TASK_RUNNING, which would cause the schedule()
to NOP.  Thus we risk not having paused long enough here
for the condition we were waiting for to become true.

> 
> It's clear, however, that the mechanism used here is broken ...
> 
>    Julia

Thanks,
Joe

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-23 17:21   ` joe.korty
@ 2018-03-23 17:42     ` Julia Cartwright
  2018-03-26 15:35     ` Steven Rostedt
  1 sibling, 0 replies; 7+ messages in thread
From: Julia Cartwright @ 2018-03-23 17:42 UTC (permalink / raw)
  To: Joe Korty; +Cc: bigeasy, mingo, tglx, rostedt, linux-rt-users, linux-kernel

On Fri, Mar 23, 2018 at 01:21:31PM -0400, joe.korty@concurrent-rt.com wrote:
> On Fri, Mar 23, 2018 at 11:59:21AM -0500, Julia Cartwright wrote:
> > On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.korty@concurrent-rt.com wrote:
> > > I see the below kernel splat in 4.9-rt when I run a test program that
> > > continually changes the affinity of some set of running pids:
> > >
> > >    do not call blocking ops when !TASK_RUNNING; state=2 set at ...
> > >       ...
> > >       stop_one_cpu+0x60/0x80
> > >       migrate_enable+0x21f/0x3e0
> > >       rt_spin_unlock+0x2f/0x40
> > >       prepare_to_wait+0x5c/0x80
> > >       ...
> >
> > This is clearly a problem.
> >
> > > The reason is that spin_unlock, write_unlock, and read_unlock call
> > > migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers
> > > that a migration is in order.  But sleeping in the unlock services is not
> > > expected by most kernel developers,
> >
> > I don't buy this, see below:
> >
> > > and where that counts most is in code sequences like the following:
> > >
> > >   set_current_state(TASK_UNINTERRUPTIBLE);
> > >   spin_unlock(&s);
> > >   schedule();
> >
> > The analog in mainline is CONFIG_PREEMPT and the implicit
> > preempt_enable() in spin_unlock().  In this configuration, a kernel
> > developer should _absolutely_ expect their task to be suspended (and
> > potentially migrated), _regardless of the task state_ if there is a
> > preemption event on the CPU on which this task is executing.
> >
> > Similarly, there is nothing _conceptually_ wrong on RT with migrating
> > on migrate_enable(), regardless of task state, if there is a
> > pending migration event.
>
> My understanding is that, in standard Linux and in rt, setting
> task state to anything other than TASK_RUNNING in and of itself
> blocks preemption.

I'm assuming you're referring to the window of time between a task
setting its state to !TASK_RUNNING and schedule()?  The task remains
preemptible in this window.

> A preemption is not really needed here, as it is expected that there
> is a schedule() written in that will shortly be executed.
>
> And if an 'involuntary schedule' (i.e., preemption) were allowed to
> occur between the task state set and the schedule(), that would change
> the task state back to TASK_RUNNING;

This isn't the case.  A preempted task preserves its state.
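
Roughly, in __schedule() (a paraphrase of the 4.9-era scheduler core,
not the verbatim source):

   static void __schedule(bool preempt)
   {
           ...
           if (!preempt && prev->state) {
                   /* voluntary schedule() with a sleeping task state:
                    * dequeue the task until a wakeup arrives */
                   deactivate_task(rq, prev, DEQUEUE_SLEEP);
           }
           /*
            * On preemption (preempt == true) the task stays on the
            * runqueue and prev->state is left untouched, so when the
            * task runs again its own schedule() still sees the
            * sleeping state and behaves as its caller intended.
            */
           ...
   }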

   Julia

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-23 17:21   ` joe.korty
  2018-03-23 17:42     ` Julia Cartwright
@ 2018-03-26 15:35     ` Steven Rostedt
  2018-03-26 15:54       ` joe.korty
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2018-03-26 15:35 UTC (permalink / raw)
  To: joe.korty
  Cc: Julia Cartwright, bigeasy, mingo, tglx, linux-rt-users,
	linux-kernel

On Fri, 23 Mar 2018 13:21:31 -0400
joe.korty@concurrent-rt.com wrote:

> My understanding is that, in standard Linux and in rt, setting
> task state to anything other than TASK_RUNNING in and of itself
> blocks preemption.

That is clearly false. The only thing that blocks preemption with a
CONFIG_PREEMPT kernel is preempt_disable() and local_irq*() disabling.

(Note that spin_locks call preempt_disable() in non-RT.)

Otherwise, nothing will stop preemption.
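
For example (a minimal sketch):

	preempt_disable();	/* no involuntary switch from here ... */
	/* ... preemption-sensitive work ... */
	preempt_enable();	/* ... to here; may reschedule at once */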

>  A preemption is not really needed here,
> as it is expected that there is a schedule() written in that
> will shortly be executed.  And if an 'involuntary schedule'
> (i.e., preemption) were allowed to occur between the task
> state set and the schedule(), that would change the task
> state back to TASK_RUNNING, which would cause the schedule()
> to NOP.  Thus we risk not having paused long enough here
> for the condition we were waiting for to become true.

That is also incorrect. As Julia mentioned, a preemption keeps the
state of the task.

-- Steve

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-26 15:35     ` Steven Rostedt
@ 2018-03-26 15:54       ` joe.korty
  2018-03-26 15:59         ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread
From: joe.korty @ 2018-03-26 15:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Julia Cartwright, bigeasy, mingo, tglx, linux-rt-users,
	linux-kernel

Oh well.  Makes me wonder why might_sleep is testing for
!TASK_RUNNING though.

Thanks for the correction,
Joe


On Mon, Mar 26, 2018 at 11:35:15AM -0400, Steven Rostedt wrote:
> On Fri, 23 Mar 2018 13:21:31 -0400
> joe.korty@concurrent-rt.com wrote:
> 
> > My understanding is that, in standard Linux and in rt, setting
> > task state to anything other than TASK_RUNNING in and of itself
> > blocks preemption.
> 
> That is clearly false. The only thing that blocks preemption with a
> CONFIG_PREEMPT kernel is preempt_disable() and local_irq*() disabling.
> 
> (Note that spin_locks call preempt_disable() in non-RT.)
> 
> Otherwise, nothing will stop preemption.
> 
> >  A preemption is not really needed here,
> > as it is expected that there is a schedule() written in that
> > will shortly be executed.  And if an 'involuntary schedule'
> > (i.e., preemption) were allowed to occur between the task
> > state set and the schedule(), that would change the task
> > state back to TASK_RUNNING, which would cause the schedule()
> > to NOP.  Thus we risk not having paused long enough here
> > for the condition we were waiting for to become true.
> 
> That is also incorrect. As Julia mentioned, a preemption keeps the
> state of the task.

* Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING
  2018-03-26 15:54       ` joe.korty
@ 2018-03-26 15:59         ` Steven Rostedt
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2018-03-26 15:59 UTC (permalink / raw)
  To: joe.korty
  Cc: Julia Cartwright, bigeasy, mingo, tglx, linux-rt-users,
	linux-kernel

On Mon, 26 Mar 2018 11:54:51 -0400
joe.korty@concurrent-rt.com wrote:

> Oh well.  Makes me wonder why might_sleep is testing for
> !TASK_RUNNING though.

Because might_sleep() is used when the function might call schedule()
directly. And schedule() *will* change the task state to TASK_RUNNING.
might_sleep() has nothing to do with preemption.
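
The check that fires is, paraphrasing __might_sleep():

	/*
	 * Blocking primitives exit with TASK_RUNNING, destroying any
	 * state the caller set; warn if we enter with anything else.
	 */
	WARN_ONCE(current->state != TASK_RUNNING && current->task_state_change,
		  "do not call blocking ops when !TASK_RUNNING; "
		  "state=%lx set at [<%p>] %pS\n", ...);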

-- Steve
