* Considerations on sched APIs under RT patch
@ 2010-04-19 20:48 Primiano Tucci
  2010-04-20  9:20 ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-19 20:48 UTC (permalink / raw)
To: linux-kernel

Hi all,

I am an Italian researcher and I am working on a (cross-platform) Real Time scheduling infrastructure. I am currently using Linux kernel 2.6.29.6-rt24-smp (PREEMPT-RT patch) running on an Intel Q9550 CPU.

Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).
(link: http://article.gmane.org/gmane.linux.kernel.api/1550)

I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch. So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.
I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.

The issue can easily be reproduced starting from this scenario.
I have 4 Real Time threads (SCHED_FIFO) distributed as follows:

T0 : CPU 0, Priority 2 (HIGH)
T1 : CPU 1, Priority 2 (HIGH)
T3 : CPU 0, Priority 1 (LOW)
T4 : CPU 1, Priority 1 (LOW)

So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.

What do you think about this situation? Should the sched APIs be revised?

Thanks in advance,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
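For reference, the manipulation the scenario above performs boils down to two pthread calls. A minimal user-space sketch (the helper names are mine, not from the report; pinning a thread works unprivileged, while SCHED_FIFO priorities require root or CAP_SYS_NICE):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single CPU, as T0 does when it migrates T4
 * from CPU #1 to CPU #0; returns 0 on success. */
static int pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* Give a thread a SCHED_FIFO priority (2 = HIGH, 1 = LOW in the
 * scenario above); requires root or CAP_SYS_NICE. */
static int set_fifo_prio(pthread_t t, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    return pthread_setschedparam(t, SCHED_FIFO, &sp);
}
```

With four busy-looping threads set up this way, the observation in the report is that the pin_to_cpu() call issued from T0 briefly lets the lower-priority T3 run on CPU #0.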
* Re: Considerations on sched APIs under RT patch
  2010-04-19 20:48 Considerations on sched APIs under RT patch Primiano Tucci
@ 2010-04-20  9:20 ` Peter Zijlstra
  2010-04-20 21:56   ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-20 9:20 UTC (permalink / raw)
To: Primiano Tucci; +Cc: linux-kernel, tglx, rostedt

On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote:

> Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).

> I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch.

It does? where?

sys_sched_setaffinity()
  sched_setaffinity()
    set_cpus_allowed_ptr()

set_cpus_allowed_ptr() is the one that does the real work, and that takes the rq->lock and plays games with the migration thread, none of which should be able to cause any form of priority inversion.

> So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.

Well, suppose there was a rwlock_t, then for PREEMPT_RT=y that would be mapped to an rt_mutex, which is PI aware.

> I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.

It does, rq->lock is a non-preemptible lock, and the migration thread runs at a priority higher than FIFO-99.
> The issue can easily be reproduced starting from this scenario.
> I have 4 Real Time threads (SCHED_FIFO) distributed as follows:
>
> T0 : CPU 0, Priority 2 (HIGH)
> T1 : CPU 1, Priority 2 (HIGH)
> T3 : CPU 0, Priority 1 (LOW)
> T4 : CPU 1, Priority 1 (LOW)
>
> So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
> Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
>
> In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
> It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.
>
> What do you think about this situation? Should the sched APIs be revised?

Not sure why you think the APIs should be changed. If this does indeed happen then there is a bug somewhere in the implementation, the trick will be finding it.

So you run these four RT tasks on CPUs 0,1 and then control them from another cpu, say 2?

Can you get a function trace that illustrates T3 getting scheduled, preferably while running the latest -rt kernel?

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-20  9:20 ` Peter Zijlstra
@ 2010-04-20 21:56   ` Primiano Tucci
  2010-04-20 23:00     ` Steven Rostedt
  ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread

From: Primiano Tucci @ 2010-04-20 21:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, tglx, rostedt

Hi Peter, thank you for your reply.

On Tue, Apr 20, 2010 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote:
>
>> Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).
>
>> I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch.
>
> It does? where?
>
> sys_sched_setaffinity()
>   sched_setaffinity()
>     set_cpus_allowed_ptr()

I see

long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
	cpumask_var_t cpus_allowed, new_mask;
	struct task_struct *p;
	int retval;

	get_online_cpus();
-->	read_lock(&tasklist_lock);

My question is: suppose that tasklist_lock is held by a writer. What happens to the calling thread? It can't take the lock, therefore it yields to the next ready task (which in my scenario has a lower priority).
In my view, this is not a Priority Inversion problem. The problem is that sched_setaffinity is unexpectedly "suspensive" and yields to the lower-priority thread.

Thank you for your support,
Primiano

> set_cpus_allowed_ptr() is the one that does the real work, and that takes the rq->lock and plays games with the migration thread, none of which should be able to cause any form of priority inversion.
>
>> So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.
> Well, suppose there was a rwlock_t, then for PREEMPT_RT=y that would be mapped to an rt_mutex, which is PI aware.
>
>> I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.
>
> It does, rq->lock is a non-preemptible lock, and the migration thread runs at a priority higher than FIFO-99.
>
>> The issue can easily be reproduced starting from this scenario.
>> I have 4 Real Time threads (SCHED_FIFO) distributed as follows:
>>
>> T0 : CPU 0, Priority 2 (HIGH)
>> T1 : CPU 1, Priority 2 (HIGH)
>> T3 : CPU 0, Priority 1 (LOW)
>> T4 : CPU 1, Priority 1 (LOW)
>>
>> So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
>> Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
>>
>> In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
>> It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.
>>
>> What do you think about this situation? Should the sched APIs be revised?
>
> Not sure why you think the APIs should be changed. If this does indeed happen then there is a bug somewhere in the implementation, the trick will be finding it.
>
> So you run these four RT tasks on CPUs 0,1 and then control them from another cpu, say 2?
>
> Can you get a function trace that illustrates T3 getting scheduled, preferably while running the latest -rt kernel?
--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-20 21:56 ` Primiano Tucci
@ 2010-04-20 23:00   ` Steven Rostedt
  2010-04-21  5:16     ` Primiano Tucci
  2010-04-21 12:56   ` Peter Zijlstra
  2010-04-27 13:18   ` Thomas Gleixner
  2 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-20 23:00 UTC (permalink / raw)
To: Primiano Tucci; +Cc: Peter Zijlstra, linux-kernel, tglx

On Tue, 2010-04-20 at 23:56 +0200, Primiano Tucci wrote:
> Hi Peter,

> long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> {
> 	cpumask_var_t cpus_allowed, new_mask;
> 	struct task_struct *p;
> 	int retval;
>
> 	get_online_cpus();
> -->	read_lock(&tasklist_lock);
>
> My question is: suppose that tasklist_lock is held by a writer. What happens to the calling thread? It can't take the lock, therefore it yields to the next ready task (which in my scenario has a lower priority).
> In my view, this is not a Priority Inversion problem. The problem is that sched_setaffinity is unexpectedly "suspensive" and yields to the lower-priority thread.

read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).

If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.

The above is what is expected.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
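The priority-inheritance behavior Steven describes is also exposed to user space: a pthread mutex can be created with the PI protocol, which is the same mechanism PREEMPT_RT's rt_mutex gives kernel locks. A minimal sketch (the helper name is mine):

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Create a mutex whose holder temporarily inherits the priority of
 * the highest-priority waiter, mirroring what rt_mutex does for
 * kernel locks under PREEMPT_RT. Returns 0 on success. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;

    if ((err = pthread_mutexattr_init(&attr)) != 0)
        return err;
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}
```

With such a mutex, a low-priority holder blocking a SCHED_FIFO waiter is boosted until it unlocks, bounding the time the high-priority task waits.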
* Re: Considerations on sched APIs under RT patch
  2010-04-20 23:00 ` Steven Rostedt
@ 2010-04-21  5:16   ` Primiano Tucci
  2010-04-21  8:49     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 5:16 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

Hi Steve,

> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>
> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.

In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?

Thanks,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21  5:16 ` Primiano Tucci
@ 2010-04-21  8:49   ` Peter Zijlstra
  2010-04-21 12:46     ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 8:49 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 07:16 +0200, Primiano Tucci wrote:
> Hi Steve,
>
>> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>>
>> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.
>
> In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
> My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
> The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?

Yes. This is the reality of SMP systems, nothing much you can do about that. System resources are shared between all cpus, irrespective of task affinities.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21  8:49 ` Peter Zijlstra
@ 2010-04-21 12:46   ` Steven Rostedt
  2010-04-21 19:24     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-21 12:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Primiano Tucci, linux-kernel, tglx

On Wed, 2010-04-21 at 10:49 +0200, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 07:16 +0200, Primiano Tucci wrote:
>> Hi Steve,
>>
>>> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>>>
>>> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.
>>
>> In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
>> My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
>> The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?
>
> Yes. This is the reality of SMP systems, nothing much you can do about that. System resources are shared between all cpus, irrespective of task affinities.

Actually, we do better than that.
With adaptive locks, if the process on the other CPU is still running, the high-priority task will spin until the other process releases the lock or goes to sleep. If it goes to sleep, then the high-prio task will also sleep, otherwise it just spins and takes the lock when it is released.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
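The adaptive locks Steven describes can be approximated in user space as a spin-then-block mutex. A simplified sketch (the real -rt implementation spins only while the lock owner is currently running on another CPU, a condition user space cannot observe directly, so a bounded spin count stands in for it here):

```c
#include <pthread.h>

#define SPIN_TRIES 1000  /* stand-in for "owner still running" check */

/* Spin briefly before sleeping, in the spirit of -rt adaptive locks:
 * if the holder releases the lock quickly we never pay for a context
 * switch; otherwise we fall back to a normal blocking lock. */
int adaptive_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return 0;               /* got it while spinning */
    return pthread_mutex_lock(m);   /* give up spinning and block */
}
```

The design point is the one Steven makes: for short critical sections the waiter never sleeps, so the CPU is not handed to a lower-priority task in the meantime.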
* Re: Considerations on sched APIs under RT patch
  2010-04-21 12:46 ` Steven Rostedt
@ 2010-04-21 19:24   ` Primiano Tucci
  2010-04-21 19:57     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 19:24 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

> Actually, we do better than that. With adaptive locks, if the process on the other CPU is still running, the high-priority task will spin until the other process releases the lock or goes to sleep. If it goes to sleep, then the high-prio task will also sleep, otherwise it just spins and takes the lock when it is released.
>
> -- Steve

It sounds more reasonable now. I know that in a preemptible kernel even syscalls can be preempted. That is absolutely fair, except for those syscalls (such as setaffinity, setpriority) that control the scheduler.
Going back to my original problem, the real question is: is it guaranteed that calling a scheduler API won't induce a rescheduling of the caller process (e.g. as in the case of a lock held by another processor)? It would be very unpleasant if the scheduling APIs could induce rescheduling, making the realization of a Real Time scheduling infrastructure completely non-deterministic.
If I have understood your replies correctly, it seems that my problem is due to an *old* kernel version that still uses an rwlock in setaffinity! Is that right?

Thank you for your extremely valuable support.
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 19:24 ` Primiano Tucci
@ 2010-04-21 19:57   ` Peter Zijlstra
  2010-04-21 20:38     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 19:57 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 21:24 +0200, Primiano Tucci wrote:
> Is it guaranteed that calling a scheduler API won't induce a rescheduling of the caller process (e.g. as in the case of a lock held by another processor)? It would be very unpleasant if the scheduling APIs could induce rescheduling, making the realization of a Real Time scheduling infrastructure completely non-deterministic.

No, any syscall can end up blocking/scheduling; there are no exceptions. But blocking doesn't mean it's non-deterministic, esp. when coupled with things like PI.

But you do have to treat system resources as such, that is they can (and will) create cross-cpu dependencies; if you do not take that into account you will of course be surprised.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 19:57 ` Peter Zijlstra
@ 2010-04-21 20:38   ` Primiano Tucci
  2010-04-21 20:58     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 20:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx

> No, any syscall can end up blocking/scheduling; there are no exceptions. But blocking doesn't mean it's non-deterministic, esp. when coupled with things like PI.
>
> But you do have to treat system resources as such, that is they can (and will) create cross-cpu dependencies; if you do not take that into account you will of course be surprised.

I actually don't understand why you bring up PI so frequently; it seems to be your sole point of interest. Actually I do take care not to share cross-cpu resources, but I cannot take care of what the kernel does.
In my view it is unacceptable that the scheduler APIs can lead to a rescheduling. It voids any form of process control. If I lose control while controlling other processes, Quis custodiet ipsos custodes?

P.S. It actually does not happen in other RTOSes, e.g., VxWorks SMP.

Primiano,

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 20:38 ` Primiano Tucci
@ 2010-04-21 20:58   ` Peter Zijlstra
  2010-04-22 13:20     ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 20:58 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 22:38 +0200, Primiano Tucci wrote:
> I actually don't understand why you bring up PI so frequently; it seems to be your sole point of interest.

PI keeps preemptible locks working in an RT environment. Non-preemptible or preemptible+PI are both valid RT constructs that can be analyzed.

> Actually I do take care not to share cross-cpu resources, but I cannot take care of what the kernel does.

An SMP kernel must be treated as a cross-cpu resource. There's just no way around that. For instance, Unix allows two processes on different cpus to invoke sched_setscheduler/sched_setaffinity or any number of system calls on the same target process. Filesystems are shared, etc.

> In my view it is unacceptable that the scheduler APIs can lead to a rescheduling.

They can even lead to pagefaults and disk IO if you're not careful. I'm not sure if there are blocking locks left thereabout, but spinlocks and rt_mutexes both create cross-cpu dependencies that need to be analyzed; !preempt isn't magic in any way.

> It voids any form of process control. If I lose control while controlling other processes, Quis custodiet ipsos custodes?
>
> P.S.
> It actually does not happen in other RTOSes, e.g., VxWorks SMP.

I don't know any of those, but it's impossible to migrate tasks from one cpu to another without creating cross-cpu dependencies.

Whether locks are preemptible or not doesn't make them any less analyzable; if you use system calls in your RT program, their implementation needs to be considered.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 20:58 ` Peter Zijlstra
@ 2010-04-22 13:20   ` Steven Rostedt
  2010-04-22 13:50     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-22 13:20 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Primiano Tucci, linux-kernel, tglx

On Wed, 2010-04-21 at 22:58 +0200, Peter Zijlstra wrote:
>> P.S. It actually does not happen in other RTOSes, e.g., VxWorks SMP.
>
> I don't know any of those, but it's impossible to migrate tasks from one cpu to another without creating cross-cpu dependencies.
>
> Whether locks are preemptible or not doesn't make them any less analyzable; if you use system calls in your RT program, their implementation needs to be considered.

It's been a while since I've used SMP VxWorks, but back then what it did was to copy an image for every CPU separately. It was not really SMP but instead a separate OS for each CPU. Things may have changed since then (it was around 2002 when I saw this).

There are projects to do the same for Linux, and I feel it may give you the most control of the system. But the hardware is still shared, so the contention does not go away, it just gets moved to the hardware resources.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:20 ` Steven Rostedt
@ 2010-04-22 13:50   ` Primiano Tucci
  2010-04-22 13:57     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-22 13:50 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

> It's been a while since I've used SMP VxWorks, but back then what it did was to copy an image for every CPU separately. It was not really SMP but instead a separate OS for each CPU. Things may have changed since then (it was around 2002 when I saw this).
>
> There are projects to do the same for Linux, and I feel it may give you the most control of the system. But the hardware is still shared, so the contention does not go away, it just gets moved to the hardware resources.

I knew this kind of solution based on OS partitioning, but my group and I are currently working on a Global-EDF scheduler: a single scheduler (and therefore a single OS/kernel) that is able to migrate tasks between CPUs in order to maximize global CPU usage.
In order to do this we have a single "super"-process (a Meta-Scheduler) that needs to be able to control the priority and affinity of the managed tasks, without losing control while doing so.
We are able to make it work under VxWorks 6.7 SMP (which seems to be really SMP); now we are trying to port the same infrastructure to a Linux-RT kernel to compare performance, but we found the issue with the setaffinity syscall as described in my first mail. I will try to update to the latest kernel version and see if it works correctly.

Thanks,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:50 ` Primiano Tucci
@ 2010-04-22 13:57   ` Peter Zijlstra
  2010-04-22 15:40     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-22 13:57 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Thu, 2010-04-22 at 15:50 +0200, Primiano Tucci wrote:
> I knew this kind of solution based on OS partitioning, but my group and I are currently working on a Global-EDF scheduler: a single scheduler (and therefore a single OS/kernel) that is able to migrate tasks between CPUs in order to maximize global CPU usage.

I would hardly call a global-edf scheduler unique. It's a well studied algorithm and even available in commercial SMP operating systems (hopefully soon Linux too, see SCHED_DEADLINE, which will approximate global-edf, much like the current SCHED_FIFO approximates global-fifo).

> In order to do this we have a single "super"-process (a Meta-Scheduler) that needs to be able to control the priority and affinity of the managed tasks, without losing control while doing so.

Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.

^ permalink raw reply	[flat|nested] 20+ messages in thread
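Peter's prescription for the controlling process, in code: give it a full affinity mask and the highest user-space RT priority. A sketch under those assumptions (the function name is mine; sched_setscheduler itself requires root or CAP_SYS_NICE):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Configure the calling (meta-scheduler) process as suggested:
 * allow it on every online CPU and give it the highest user-space
 * SCHED_FIFO priority, so no managed task can preempt it.
 * Returns 0 on success, -1 on failure. */
int setup_meta_scheduler(void)
{
    cpu_set_t all;
    struct sched_param sp;
    int cpu, ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);

    CPU_ZERO(&all);
    for (cpu = 0; cpu < ncpus; cpu++)
        CPU_SET(cpu, &all);          /* global affinity mask */
    if (sched_setaffinity(0, sizeof(all), &all) != 0)
        return -1;

    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0 ? 0 : -1;
}
```

Even so, per the rest of the thread, this guards against preemption by managed tasks but not against blocking inside a syscall on a kernel-internal lock.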
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:57 ` Peter Zijlstra
@ 2010-04-22 15:40   ` Primiano Tucci
  2010-04-22 16:28     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-22 15:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx

> I would hardly call a global-edf scheduler unique.

I used the word "unique" to underline that it is not a partitioned scheduling scheme, but ONE global scheduler that manages tasks.

> It's a well studied algorithm and even available in commercial SMP operating systems

Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF? Alternatively, can you cite a uniprocessor system that supports EDF? To my knowledge, most commercial systems do not even have the concept of periodic tasks, release periods and deadlines (e.g., VxWorks, QNX Neutrino)... how can they support G-EDF?
The other commercial RT systems that, to my knowledge, support periodic tasks, such as the B&R Automation Runtime components and the Beckhoff TwinCAT system, only offer a Rate Monotonic scheduling scheme.

> (hopefully soon Linux too, see SCHED_DEADLINE, which will approximate global-edf, much like the current SCHED_FIFO approximates global-fifo).

> Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.

It is how it is implemented now, and how it works under VxWorks!

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 15:40 ` Primiano Tucci
@ 2010-04-22 16:28   ` Peter Zijlstra
  2010-04-22 17:48     ` Bjoern Brandenburg
  2010-04-22 19:33     ` Primiano Tucci
  1 sibling, 2 replies; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-22 16:28 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx, Bjoern Brandenburg

On Thu, 2010-04-22 at 17:40 +0200, Primiano Tucci wrote:
>> It's a well studied algorithm and even available in commercial SMP operating systems
>
> Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF?

From: http://www.cs.unc.edu/~anderson/papers/rtlws09.pdf

"Regarding the frequently voiced objections to G-EDF's viability in a "real" system, it should be noted that xnu, the kernel underlying Apple's multimedia-friendly OS X, has been relying on G-EDF to support real-time applications on multiprocessors for several years [5]."

"[5] Apple Inc. xnu source code. http://opensource.apple.com/source/xnu/."

>> Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.
>
> It is how it is implemented now, and how it works under VxWorks!

But that does not, and can not, provide proper isolation and bandwidth guarantees since there could be runnable tasks outside the scope of the middleware.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 16:28 ` Peter Zijlstra
@ 2010-04-22 17:48   ` Bjoern Brandenburg
  0 siblings, 0 replies; 20+ messages in thread

From: Bjoern Brandenburg @ 2010-04-22 17:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Primiano Tucci, Steven Rostedt, linux-kernel, tglx, James H. Anderson, Andrea Bastoni

On Apr 22, 2010, at 12:28 PM, Peter Zijlstra wrote:
> On Thu, 2010-04-22 at 17:40 +0200, Primiano Tucci wrote:
>
>>> It's a well studied algorithm and even available in commercial SMP operating systems
>>
>> Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF?
>
> From: http://www.cs.unc.edu/~anderson/papers/rtlws09.pdf
>
> "Regarding the frequently voiced objections to G-EDF's viability in a "real" system, it should be noted that xnu, the kernel underlying Apple's multimedia-friendly OS X, has been relying on G-EDF to support real-time applications on multiprocessors for several years [5]."
>
> "[5] Apple Inc. xnu source code. http://opensource.apple.com/source/xnu/."

Since that reference is a bit coarse-grained, let me clarify by pointing out the actual implementation. In particular, if you look at /usr/include/mach/thread_policy.h (on OS X, of course), you'll find:

> /*
>  * THREAD_TIME_CONSTRAINT_POLICY:
>  *
>  * This scheduling mode is for threads which have real time
>  * constraints on their execution.
>  *
>  * Parameters:
>  *
>  * period: This is the nominal amount of time between separate
>  * processing arrivals, specified in absolute time units. A
>  * value of 0 indicates that there is no inherent periodicity in
>  * the computation.
>  *
>  * computation: This is the nominal amount of computation
>  * time needed during a separate processing arrival, specified
>  * in absolute time units.
> * > * constraint: This is the maximum amount of real time that > * may elapse from the start of a separate processing arrival > * to the end of computation for logically correct functioning, > * specified in absolute time units. Must be (>= computation). > * Note that latency = (constraint - computation). > * > * preemptible: This indicates that the computation may be > * interrupted, subject to the constraint specified above. > */ > I.e., THREAD_TIME_CONSTRAINT_POLICY implements the sporadic task model. Global EDF is implemented in osfmk/kern/sched_prim.c (line numbers pertain to XNU 1504.3.12): In line 473: > /* > * Calculate deadline for real-time threads. > */ > if (thread->sched_mode & TH_MODE_REALTIME) { > thread->realtime.deadline = mach_absolute_time(); > thread->realtime.deadline += thread->realtime.constraint; > } Further, in choose_processor() starting at line 26: > if (thread->sched_pri >= BASEPRI_RTQUEUES) { > /* > * For an RT thread, iterate through active processors, first fit. > */ > processor = (processor_t)queue_first(&cset->active_queue); > while (!queue_end(&cset->active_queue, (queue_entry_t)processor)) { > if (thread->sched_pri > processor->current_pri || > thread->realtime.deadline < processor->deadline) > return (processor); And in thread_setrun() at line 2482: > if (thread->last_processor != PROCESSOR_NULL) { > /* > * Simple (last processor) affinity case. > */ > processor = thread->last_processor; > pset = processor->processor_set; > pset_lock(pset); > > /* > * Choose a different processor in certain cases. > */ > if (thread->sched_pri >= BASEPRI_RTQUEUES) { > /* > * If the processor is executing an RT thread with > * an earlier deadline, choose another. > */ > if (thread->sched_pri <= processor->current_pri || > thread->realtime.deadline >= processor->deadline) > processor = choose_processor(pset, PROCESSOR_NULL, thread); > } > This Global EDF implementation might have been inherited from RT Mach, but I'm not sure about that. 
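The first-fit selection quoted above boils down to: hand the incoming RT thread to the first active processor that is either running something lower-priority or serving a later deadline. A minimal userspace sketch of that comparison (hypothetical, simplified types — not the actual xnu structures):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-processor state xnu consults. */
struct processor_state {
    int      current_pri;  /* priority of the thread it is running */
    uint64_t deadline;     /* that thread's deadline; UINT64_MAX if none */
};

/*
 * First-fit G-EDF placement in the style of xnu's choose_processor():
 * return the index of the first processor whose current thread has a
 * lower priority or a later deadline than the incoming thread, or -1
 * if every processor is running something more urgent.
 */
int choose_processor_first_fit(const struct processor_state *set, size_t n,
                               int pri, uint64_t deadline)
{
    for (size_t i = 0; i < n; i++) {
        if (pri > set[i].current_pri || deadline < set[i].deadline)
            return (int)i;
    }
    return -1;
}
```

Note the scan is O(n) in the number of processors, which is exactly what the heap-based variant Björn describes next avoids.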
In LITMUS^RT [1], we implement Global EDF a bit differently. Instead of iterating over all processors, we keep the processors in a max heap (ordered by deadline, with no RT task == infinity). The xnu variant may be beneficial if you only expect a few RT tasks at any time, whereas ours is based on the assumption that most processors will be scheduling RT tasks most of the time. - Björn [1] http://www.cs.unc.edu/~anderson/litmus-rt (The posted patch is horribly out of date; we'll have something more recent based on 2.6.32 up soon.) --- Björn B. Brandenburg Ph.D. Candidate Dept. of Computer Science University of North Carolina at Chapel Hill http://www.cs.unc.edu/~bbb ^ permalink raw reply [flat|nested] 20+ messages in thread
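The heap-based approach described above can be illustrated as follows: keep processors in a max-heap keyed by the deadline of the job each one is serving, with "no RT task" represented as deadline infinity; the root is then always the preemption candidate, so the preemption decision is O(1). A toy sketch (hypothetical structures, not the LITMUS^RT code):

```c
#include <stdint.h>
#include <stddef.h>

#define NO_RT_TASK UINT64_MAX  /* processor with no RT task == deadline infinity */

/* Toy max-heap of per-processor deadlines (processor ids elided for brevity). */
struct cpu_heap {
    uint64_t dl[64];
    size_t   n;
};

void heap_push(struct cpu_heap *h, uint64_t d)
{
    size_t i = h->n++;
    h->dl[i] = d;
    while (i > 0 && h->dl[(i - 1) / 2] < h->dl[i]) {  /* sift up */
        uint64_t tmp = h->dl[i];
        h->dl[i] = h->dl[(i - 1) / 2];
        h->dl[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* The root holds the latest deadline currently being served. */
uint64_t heap_peek(const struct cpu_heap *h)
{
    return h->dl[0];
}

/* Under G-EDF a newly released job preempts iff its deadline is earlier
 * than the latest deadline in service — one comparison against the root,
 * versus the O(n) scan of the first-fit variant. */
int should_preempt(const struct cpu_heap *h, uint64_t new_deadline)
{
    return new_deadline < heap_peek(h);
}
```

An idle processor (pushed as `NO_RT_TASK`) naturally floats to the root, so new RT jobs prefer idle CPUs before preempting busy ones.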
* Re: Considerations on sched APIs under RT patch 2010-04-22 16:28 ` Peter Zijlstra 2010-04-22 17:48 ` Bjoern Brandenburg @ 2010-04-22 19:33 ` Primiano Tucci 1 sibling, 0 replies; 20+ messages in thread From: Primiano Tucci @ 2010-04-22 19:33 UTC (permalink / raw) To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx, Bjoern Brandenburg Hi Peter, It is not my intent to start a flame, but, as you pointed out yourself, EDF is not so common, particularly in commercial RTOSes (and xnu, indeed, does not fall into that category). It is not surprising that the citation you found is from Mr. Brandenburg and Mr. Anderson of the University of North Carolina; incidentally, I had a copy of their paper (On the Implementation of Global RT Schedulers) at hand when reading your message. I think they are among the few who have concretely and deeply investigated G-EDF. As you can read in Brandenburg and Anderson, there is no standard/reference implementation of Global EDF; rather, many "variants" are possible (e.g., event-driven vs. tick-driven promotion, dedicated vs. global interrupt handling, choice of data structures and locking methods, ...). My intent is not to hand out a "top ten kernels" award; all we are trying to do is investigate the various facilities offered by RT operating systems in order to determine how much we can rely on them. >> It is how it is implemented now, and how it works under VXWorks! > > But that does not, and can not, provide proper isolation and bandwidth > guarantees since there could be runnable tasks outside the scope of the > middleware. Actually, our VxWorks implementation is based on kernel-space tasks that have the maximum priority in the system, even higher than the VxWorks system tasks (which were shifted down after long and fruitful support discussions with Wind River), so no task (apart from interrupt service routines, which are carefully accounted for under VxWorks) can interfere with our middleware. 
Actually, it is controlling real-world (though not yet production-stage) manufacturing systems, so it is a bit more than a toy. We are trying to study and understand whether, and how, analogous things can be done on other real-time operating systems. We chose POSIX threads as a starting point on the Linux kernel to verify the feasibility of the same approach without intervening too deeply in the kernel. Thanks again for your replies! -- Primiano Tucci http://www.primianotucci.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt @ 2010-04-21 12:56 ` Peter Zijlstra 2010-04-27 13:18 ` Thomas Gleixner 2 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2010-04-21 12:56 UTC (permalink / raw) To: Primiano Tucci; +Cc: linux-kernel, tglx, rostedt On Tue, 2010-04-20 at 23:56 +0200, Primiano Tucci wrote: > > long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) { > cpumask_var_t cpus_allowed, new_mask; > struct task_struct *p; > int retval; > > get_online_cpus(); > --> read_lock(&tasklist_lock); FWIW recent kernels don't use tasklist_lock there anymore. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt 2010-04-21 12:56 ` Peter Zijlstra @ 2010-04-27 13:18 ` Thomas Gleixner 2 siblings, 0 replies; 20+ messages in thread From: Thomas Gleixner @ 2010-04-27 13:18 UTC (permalink / raw) To: Primiano Tucci; +Cc: Peter Zijlstra, linux-kernel, rostedt [-- Attachment #1: Type: TEXT/PLAIN, Size: 1030 bytes --] On Tue, 20 Apr 2010, Primiano Tucci wrote: > Hi Peter, > thank you for your reply. > > On Tue, Apr 20, 2010 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote: > > > >> Yesterday days I found a strange behavior of the scheduler API's using > >> the RT patch, in particular the pthread_setaffinity_np (that stands on > >> sched_setaffinity). > > > >> I think the main problem is that sched_setaffinity makes use of a > >> rwlock, but rwlocks are pre-emptible with the RT patch. > > > > It does? where? > > > > sys_sched_setaffinity() > > sched_setaffinity() > > set_cpus_allowed_ptr() > > > I see > > long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) { > cpumask_var_t cpus_allowed, new_mask; > struct task_struct *p; > int retval; > > get_online_cpus(); > --> read_lock(&tasklist_lock); You must be looking at some older version of -rt. The current 2.6.33-rt series does not take tasklist_lock in sched_setaffinity(). Thanks, tglx ^ permalink raw reply [flat|nested] 20+ messages in thread
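The syscall path quoted in the two replies above is entered from userspace through pthread_setaffinity_np. A minimal self-affinity sketch of that call (no SCHED_FIFO privileges needed; the original report's four-thread test additionally needs sched_setscheduler() and root, omitted here — build with -pthread on Linux):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU via pthread_setaffinity_np
 * (which lands in sched_setaffinity in the kernel), then read the
 * mask back to confirm. Returns the pinned CPU, or -1 on error. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t want, got;

    CPU_ZERO(&want);
    CPU_SET(cpu, &want);
    if (pthread_setaffinity_np(pthread_self(), sizeof(want), &want) != 0)
        return -1;
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) != 0)
        return -1;
    return CPU_ISSET(cpu, &got) ? cpu : -1;
}
```

In the scenario from the original report, the same call would be issued from T0 with T4's pthread_t instead of pthread_self(); the question in this thread is what locks that path takes — and whether they can sleep — between those two points.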
end of thread, other threads:[~2010-04-27 13:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-19 20:48 Considerations on sched APIs under RT patch Primiano Tucci 2010-04-20 9:20 ` Peter Zijlstra 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt 2010-04-21 5:16 ` Primiano Tucci 2010-04-21 8:49 ` Peter Zijlstra 2010-04-21 12:46 ` Steven Rostedt 2010-04-21 19:24 ` Primiano Tucci 2010-04-21 19:57 ` Peter Zijlstra 2010-04-21 20:38 ` Primiano Tucci 2010-04-21 20:58 ` Peter Zijlstra 2010-04-22 13:20 ` Steven Rostedt 2010-04-22 13:50 ` Primiano Tucci 2010-04-22 13:57 ` Peter Zijlstra 2010-04-22 15:40 ` Primiano Tucci 2010-04-22 16:28 ` Peter Zijlstra 2010-04-22 17:48 ` Bjoern Brandenburg 2010-04-22 19:33 ` Primiano Tucci 2010-04-21 12:56 ` Peter Zijlstra 2010-04-27 13:18 ` Thomas Gleixner