* [Xenomai-help] Sporadic PC freeze after rt_task_start
@ 2007-07-10  8:00 M. Koehrer
  2007-07-10  8:40 ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-10  8:00 UTC (permalink / raw)
  To: xenomai

Hi everybody,

I noticed a sporadic freeze of my PC using Xenomai 2.3.1 and kernel 2.6.20.4 on a Pentium D,
patched with adeos-ipipe-2.6.20-i386-1.8-01.patch.

The freeze happens sporadically on one of our systems; occasionally it took up to 6 hours to trigger it.
Using a PCI POST code board and writing POST codes to it, I was able to locate the code that was causing
the issue. Finally, I was able to reduce it to a very simple program that shows the same behaviour!

Here is my simple test program:
**************************************** BEGIN *****************
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>

#include <native/task.h>
#include <native/sem.h>


RT_TASK taska_desc;

void mytaska(void *cookie)
{
    int i;

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
    }
}

int main(void)
{
    int i;
    int j;
    mlockall(MCL_CURRENT|MCL_FUTURE);

    for (j=0; j < 100; j++)
        for (i=10; i < 15000; i++)
        {
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            rt_task_start(&taska_desc, &mytaska, NULL);
            usleep(1500);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i\n", i);
        }

    return 0;
}
*************************************** END ***********************************
It is important to know that I started the kernel with isolcpus=1, i.e. all non-realtime tasks
are running on CPU 0.
It seems to be related to the usleep() that follows rt_task_start().
usleep() is executed on CPU 0, while rt_task_start() starts a task on CPU 1...
Could it be that usleep() begins before the task has started but returns
after the task has already started? Could this cause a race condition?

I leave the program running for a while and at some point it freezes the PC (only a reset helps).
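
As a rough cross-check of the numbers above, the sketch below estimates how long one create/start/join cycle of the test program takes and how many cycles fit into six hours. It assumes that the Xenomai timer runs in one-shot mode, so that the rt_task_sleep() argument is in nanoseconds, and that each cycle is dominated by the five 5 ms sleeps in the task. The helper names are made up for this illustration.

```c
#include <stdint.h>

/* Assumption: one-shot timer mode, i.e. rt_task_sleep() takes nanoseconds. */
#define SLEEP_NS        5000000LL   /* rt_task_sleep(5000000) */
#define SLEEPS_PER_TASK 5LL         /* for (i = 0; i < 5; i++) */

/* Approximate duration of one create/start/join cycle in nanoseconds.
 * The usleep(1500) in main() overlaps with the task's sleeps, so the
 * join returns roughly when the task has finished. */
static int64_t cycle_ns(void)
{
    return SLEEP_NS * SLEEPS_PER_TASK;
}

/* Number of cycles executed by the nested loops of the test program. */
static int64_t total_cycles(int64_t outer, int64_t inner_from, int64_t inner_to)
{
    return outer * (inner_to - inner_from);
}

/* Approximate number of cycles that fit into the given number of hours. */
static int64_t cycles_in_hours(int64_t hours)
{
    return hours * 3600LL * 1000000000LL / cycle_ns();
}
```

Under these assumptions one cycle takes about 25 ms, the full program runs 1,499,000 cycles (roughly 10.4 hours), and a freeze after six hours corresponds to something on the order of 864,000 task starts.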

Any feedback on this is welcome!

Regards

Mathias






-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10  8:00 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
@ 2007-07-10  8:40 ` Jan Kiszka
  2007-07-10 12:29   ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-10  8:40 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 3122 bytes --]

M. Koehrer wrote:
> I leave the program running for a while and somehow it freezes the PC (only reset works).
> 
> Any feedback on this is welcome!

Maybe you are seeing the same bug that this test exposes:

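#define _GNU_SOURCE	/* for cpu_set_t, CPU_SET() and sched_setaffinity() */
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <native/task.h>

void func(void *arg)
{
	/* Wake up once per second, forever. */
	rt_task_set_periodic(NULL, TM_NOW, 1000000000LL);
	while(1) rt_task_wait_period(NULL);
}

int main(void)
{
	RT_TASK task;
	cpu_set_t set;

	mlockall(MCL_CURRENT|MCL_FUTURE);
	printf("rt_task_spawn=%d\n", rt_task_spawn(&task, "Receiver", 0,
	       10, 0, func, NULL));
	/* Move the calling (main) thread to CPU 1. */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	printf("sched_setaffinity=%d\n", sched_setaffinity(0,
	       sizeof(cpu_set_t), &set));
	sleep(1);
	/* Delete the periodic task from the main thread. */
	printf("rt_task_delete=%d\n", rt_task_delete(&task));
	return 0;
}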

This test doesn't hard-lock, though; it just stalls the process in some
zombie state.

This bug is already scheduled for closer examination, stay tuned.

In the meantime: Is it possible to check if
 a) my demo code happens to lock up hard for you?
 b) any behaviour changes with latest xeno-2.3.2/ipipe-1.8-05 and your
    test case?

Thanks for reporting,
Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10  8:40 ` Jan Kiszka
@ 2007-07-10 12:29   ` M. Koehrer
  2007-07-10 12:41     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-10 12:29 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I have compiled and started your test.
It works fine - no error or warning...
The output is:
rt_task_spawn=0
sched_setaffinity=0
rt_task_delete=0

Regards

Mathias



-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 12:29   ` M. Koehrer
@ 2007-07-10 12:41     ` Jan Kiszka
  2007-07-10 14:40       ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-10 12:41 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 433 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I have compiled and started your test.
> It works fine - no error or warning...
> The output is:
> rt_task_spawn=0
> sched_setaffinity=0
> rt_task_delete=0

So it simply terminates then? Interesting. Maybe some difference in
the .config, or maybe it is due to the timing I get under qemu (that's where I
noticed the lock-up). Let's go for a .config comparison first, mine is
attached.

Jan

[-- Attachment #1.2: config.bz2 --]
[-- Type: application/octet-stream, Size: 8061 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 12:41     ` Jan Kiszka
@ 2007-07-10 14:40       ` M. Koehrer
  2007-07-10 15:34         ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-10 14:40 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 1348 bytes --]

Hi Jan,

yes, it terminates nicely.
I have attached my config.

Regards

 
Mathias


-- 
Mathias Koehrer
mathias_koehrer@domain.hid



[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 8583 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 14:40       ` M. Koehrer
@ 2007-07-10 15:34         ` Jan Kiszka
  2007-07-11  6:43           ` M. Koehrer
  2007-07-11 14:47           ` Jan Kiszka
  0 siblings, 2 replies; 35+ messages in thread
From: Jan Kiszka @ 2007-07-10 15:34 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 566 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> yes, it terminates nicely.
> I have attached my config.

Nothing obvious. Leaves us with probable timing differences or the
different versions of our setups (I found this over 2.3.2 and trunk).

OK, further analysis on your side would be appreciated. E.g. trying the
latest release, switching on debug features in Xenomai like the NMI
watchdog or nucleus debugging. Also, nailing down what service call
precisely locks up (the join, the termination of the task, etc.) would
be good to reduce the search space.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 15:34         ` Jan Kiszka
@ 2007-07-11  6:43           ` M. Koehrer
  2007-07-11  7:32             ` Jan Kiszka
  2007-07-11 14:47           ` Jan Kiszka
  1 sibling, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-11  6:43 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

as I mentioned in my first mail on this topic, I extracted this example from a huge
real-time application in which the system sporadically freezes.
What I found there was that the freeze always happened when trying
to start a thread (in the original application, the tasks run fairly long).
I think it has to do with rt_task_start() being followed by usleep(), with the non-realtime stuff running
on CPU 0 and the realtime stuff running on CPU 1.
I never saw that issue on a single-core CPU (even when the same SMP kernel was used).

Regards

Mathias



-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11  6:43           ` M. Koehrer
@ 2007-07-11  7:32             ` Jan Kiszka
  2007-07-11 12:45               ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-11  7:32 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> as I mentioned in my first mail on this topic, I have extracted this example from a huge
> real time application where the system sporadically freezes.
> What I have found out there was, that the system freeze happened always when trying
> to start a thread (in the original application, the tasks run fairly long).

OK, so it's one of rt_task_create (unlikely), rt_task_start, usleep, or
some early code in the task function itself. Still, a lot of "or"...

Again, please consider my further debugging suggestions.

> I think it has to do with rt_task_start() followed by usleep() and the non-realtime stuff running
> on CPU 0 and the realtime stuff running on CPU 1.
> I never saw that issue on a single core CPU (even if the same SMP kernel was used).

Yeah, it must be a nice race that requires real parallelism.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11  7:32             ` Jan Kiszka
@ 2007-07-11 12:45               ` M. Koehrer
  0 siblings, 0 replies; 35+ messages in thread
From: M. Koehrer @ 2007-07-11 12:45 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I tried with Xenomai 2.3.2 and (still) kernel 2.6.20.4, with the same configuration as before.
I used the adeos patch for 2.6.20 that is part of Xenomai 2.3.2.
Same effect:
the system freezes...

Regards

Mathias



-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 15:34         ` Jan Kiszka
  2007-07-11  6:43           ` M. Koehrer
@ 2007-07-11 14:47           ` Jan Kiszka
  2007-07-13  7:27             ` M. Koehrer
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-11 14:47 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 1017 bytes --]

Jan Kiszka wrote:
> M. Koehrer wrote:
>> Hi Jan,
>>
>> yes, it terminates nicely.
>> I have attached my config.
> 
> Nothing obvious. Leaves us with probable timing differences or the
> different versions of our setups (I found this over 2.3.2 and trunk).

The issue I posted is a classic race between the native task self-terminating on
CPU1 and it being terminated remotely from CPU0. When the latter wins, things
fall apart. A solution needs more thought.

Anyway, this issue is most probably unrelated to your bug.
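
The kind of race described above can be pictured with a small stand-alone model. This is only an analogy in plain C11, not Xenomai code and not a proposed fix: two parties (the task ending itself, and a remote deleter) both try to tear down the same task, and an atomic compare-and-swap lets exactly one of them proceed. All names are invented for the illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum task_state { TASK_ALIVE = 0, TASK_DYING = 1 };

struct model_task {
    atomic_int state;      /* TASK_ALIVE until someone claims the teardown */
    int teardown_count;    /* how many times the teardown body actually ran */
};

static void model_task_init(struct model_task *t)
{
    atomic_init(&t->state, TASK_ALIVE);
    t->teardown_count = 0;
}

/* Called by both the self-terminating task and the remote deleter.
 * Returns true only for the caller that won the right to tear down. */
static bool model_task_terminate(struct model_task *t)
{
    int expected = TASK_ALIVE;

    if (!atomic_compare_exchange_strong(&t->state, &expected, TASK_DYING))
        return false;      /* the other side got there first: back off */

    t->teardown_count++;   /* only the winner touches the task data */
    return true;
}
```

Whichever side calls model_task_terminate() second gets false and must not touch the task again. Without some arbitration of this kind, both sides would run the teardown on the same object.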

> 
> OK, further analysis on your side would be appreciated. E.g. trying the
> latest release, switching on debug features in Xenomai like the NMI
> watchdog or nucleus debugging. Also, nailing down what service call
> precisely locks up (the join, the termination of the task, etc.) would
> be good to reduce the search space.
> 

As you posted in a different mail, recent versions make no difference.
Could you now switch on the watchdog and nucleus debugging?

Thanks,
Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11 14:47           ` Jan Kiszka
@ 2007-07-13  7:27             ` M. Koehrer
  2007-07-13  8:26               ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-13  7:27 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I did another test to pin down the freeze. I plugged a POST code (port 80) PCI board into the PC
and instrumented the code to write to port 0x80 to find out where the freeze actually happens.
rt_task_start seems not to return, as the last POST code written (see source code below) is 0x30.
I hope to find a time slot to modify the kernel for another test.
The bad thing is that it takes really long to get the freeze (up to a couple of hours).

Regards

Mathias

---------------------------------- BEGIN SOURCE CODE -----------------------
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>
#include <sys/io.h>     /* ioperm(), outb() */

#include <native/task.h>
#include <native/sem.h>


RT_TASK taska_desc;

void mytaska(void *cookie)
{
    int i;

    outb(0x80,0x80);

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
        outb(0x90,0x80);
        // printf("Task A\n");
    }
    outb(0xa0,0x80);

    // printf("End of task A\n");
}


int main(void)
{
    int i;
    int j;
    ioperm(0x80, 1, 1);
    mlockall(MCL_CURRENT|MCL_FUTURE);

    for (j=0; j < 10000; j++)
        for (i=10; i < 15000; i++)
        {
            outb(0x20, 0x80);
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            outb(0x30, 0x80);
            rt_task_start(&taska_desc, &mytaska, NULL);
            outb(0x40, 0x80);
            usleep(1500);
            outb(0x50, 0x80);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i %i\n", j,  i);
        }


    return 0;
}
--------------------------- END  -----------------
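
For reading the POST display, the breadcrumb values written by the program above can be mapped back to the step that wrote them. The following is a minimal sketch; the values are taken from the listing, while the helper name and the wording of the stage descriptions are made up.

```c
#include <stddef.h>

/* Breadcrumb values taken from the instrumented test program.
 * The helper itself is hypothetical. */
static const char *post_code_stage(unsigned char code)
{
    switch (code) {
    case 0x20: return "main: about to call rt_task_create";
    case 0x30: return "main: rt_task_create returned, about to call rt_task_start";
    case 0x40: return "main: rt_task_start returned, about to call usleep";
    case 0x50: return "main: usleep returned, about to call rt_task_join";
    case 0x80: return "task: entered mytaska";
    case 0x90: return "task: rt_task_sleep returned";
    case 0xa0: return "task: sleep loop finished";
    default:   return NULL;   /* not a code written by the test program */
    }
}
```

A display frozen at 0x30 therefore means that main() never reached the outb() after rt_task_start, and that the new task did not write 0x80 afterwards. Because both CPUs write to the same port, the display only shows whichever write came last.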

 

-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-13  7:27             ` M. Koehrer
@ 2007-07-13  8:26               ` Jan Kiszka
  2007-07-16  7:07                 ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-13  8:26 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I did another test to identify the freeze. I have plugged in a POST-CODE 80 PCI board into the PC
> and instrumented the code to write to port 80 to find out where the freeze actually happens.
> It seems not to return for rt_task_start as the last written POST code (see source code below)  is 0x30.
> I hope to find a time slot to modify the kernel to do another test.

Again: Please consider NMI watchdog and nucleus debugging support for
those tests as well. Maybe (I dare to say: likely on SMP) they catch
where the CPUs hang around instead of doing their work.

> The bad thing is that it takes really long to get the freeze (up to a couple of hours).

Your effort is appreciated even more!

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-13  8:26               ` Jan Kiszka
@ 2007-07-16  7:07                 ` M. Koehrer
  2007-07-16 22:42                   ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-16  7:07 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I left my PC running the whole weekend using the kernel command line parameter "nmi_watchdog=1".
However, using this option, the PC did not freeze at all...
This is really ugly.
The kernel configuration was the same I mailed a couple of days ago.

Also, I tried (a couple of hours) to run with Xeno debugging (Nucleus debugging and Watchdog support) enabled. However, this did not lead to a freeze either.

This seems to be a really nasty timing issue...

Any ideas on how to continue are welcome.

Regards

Mathias



-- 
Mathias Koehrer
mathias_koehrer@domain.hid




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-16  7:07                 ` M. Koehrer
@ 2007-07-16 22:42                   ` Jan Kiszka
  2007-07-19 10:58                     ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-16 22:42 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I left my PC running the whole weekend using the kernel command line parameter "nmi_watchdog=1".
> However, using this option, the PC did not freeze at all...
> This is really ugly.
> The kernel configuration was the same I mailed a couple of days ago.

Mpf. What about nmi_watchdog=2? It's said to tick at a lower rate, and thus
may not have such an "unwanted" side-effect.

> 
> Also, I tried (a couple of hours) to run with Xeno debugging (Nucleus debugging and Watchdog support) enabled. However, this did not lead to a freeze either.
> 
> This seems to be a really nasty timing issue...
> 
> Any ideas on how to continue are welcome.

No. I had no time to think further so far.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-16 22:42                   ` Jan Kiszka
@ 2007-07-19 10:58                     ` M. Koehrer
  2007-07-19 11:27                       ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-19 10:58 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 3141 bytes --]

Hi!

After a couple of overnight test runs, I finally got the sporadic freeze reported as a lockup by the NMI watchdog.
I started the system with the argument nmi_watchdog=1 (and also isolcpus=1).
See the code below. As I have no serial console connected, I have attached a screenshot as a JPG file,
in fairly bad quality... However, it is good enough to read everything.
The lockup is in the function rpi_pop [xeno_nucleus].
It is called from gatekeeper_thread and from default_wake_function.
See the attached JPG for details.
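
Since each reproduction attempt runs for hours, it can be worth confirming that the boot parameters really took effect before starting a run. The sketch below extracts a key=value parameter from a kernel command line string, such as the contents of /proc/cmdline. The helper name is made up for illustration, and reading the file itself is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Look up "key=value" in a whitespace-separated kernel command line and
 * copy the value into out (NUL-terminated on success). Returns false if
 * the key is absent or the value does not fit. Hypothetical helper. */
static bool cmdline_value(const char *cmdline, const char *key,
                          char *out, size_t outlen)
{
    size_t keylen = strlen(key);
    const char *p = cmdline;

    while (*p != '\0') {
        /* Skip separators, then measure the next token. */
        p += strspn(p, " \t\n");
        size_t toklen = strcspn(p, " \t\n");

        if (toklen > keylen && strncmp(p, key, keylen) == 0 &&
            p[keylen] == '=') {
            size_t vallen = toklen - keylen - 1;
            if (vallen >= outlen)
                return false;
            memcpy(out, p + keylen + 1, vallen);
            out[vallen] = '\0';
            return true;
        }
        p += toklen;
    }
    return false;
}
```

For a command line containing "nmi_watchdog=1 isolcpus=1", looking up "nmi_watchdog" and "isolcpus" both yield "1"; a missing key makes the function return false, which would be a reason not to start the overnight run.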

Perhaps that helps to identify the issue.

Regards

Mathias

--------- BEGIN CODE ---------------
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>
#include <sys/io.h>     /* ioperm(), outb() */

#include <native/task.h>
#include <native/sem.h>


RT_TASK taska_desc;

void mytaska(void *cookie)
{
    int i;

    outb(0x80,0x80);

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
        outb(0x90,0x80);
        // printf("Task A\n");
    }
    outb(0xa0,0x80);

    // printf("End of task A\n");
}


int main(void)
{
    int i;
    int j;
    ioperm(0x80, 1, 1);
    mlockall(MCL_CURRENT|MCL_FUTURE);

    for (j=0; j < 10000; j++)
        for (i=10; i < 15000; i++)
        {
            outb(0x20, 0x80);
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            //    outb(0x30, 0x80);
            rt_task_start(&taska_desc, &mytaska, NULL);
            outb(0x40, 0x80);
            usleep(1500);
            outb(0x50, 0x80);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i %i\n", j,  i);
        }


    return 0;
}

--------- END CODE ---------------

 



-- 
Mathias Koehrer
mathias_koehrer@domain.hid



[-- Attachment #2: XenoCrash.jpg --]
[-- Type: image/jpeg, Size: 84873 bytes --]


* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 10:58                     ` M. Koehrer
@ 2007-07-19 11:27                       ` Jan Kiszka
  2007-07-19 12:19                         ` Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-19 11:27 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-help, M. Koehrer

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

M. Koehrer wrote:
> Hi!
> 
> After a couple of over-night test runs, I finally got an NMI watchdog detected lockup with the sporadic freeze option.
> I started the system with the argument nmi_watchdog=1 (also isolcpus=1).
> See the code below. As I have not connected a serial console, I have attached a screen shot in a fairly
> bad quality as jpg file... However, it is good enough to be able to read everything... 
> The lockup is in function rpi_pop [xeno_nucleus].
> It is called from gatekeeper_thread and from default_wake_function.
> See the attached jpg for details.

Looks like we are stuck on rpilock, Philippe.

And when looking at the holders of rpilock, I think one issue could be
that we hold that lock while calling into xnpod_renice_root [1], i.e.
doing a potential context switch. Was this checked to be safe?
Furthermore, that code path reveals that we take nklock nested into
rpilock [2]. I haven't found a spot for the other way around (and I hope
there is none), but such nesting is already evil per se...

Mathias, already tried your test case with our old friend "priority
coupling" switched off? *If* this lock-up is actually due to rpilock
brokenness, switching the feature off should make it disappear.

Jan


[1]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#435
[2]http://www.rts.uni-hannover.de/xenomai/lxr/source/include/nucleus/pod.h?v=SVN-trunk#308


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]


* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 11:27                       ` Jan Kiszka
@ 2007-07-19 12:19                         ` Philippe Gerum
  2007-07-19 12:40                           ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Philippe Gerum @ 2007-07-19 12:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-help, M. Koehrer

On Thu, 2007-07-19 at 13:27 +0200, Jan Kiszka wrote:
> M. Koehrer wrote:
> > Hi!
> > 
> > After a couple of over-night test runs, I finally got an NMI watchdog detected lockup with the sporadic freeze option.
> > I started the system with the argument nmi_watchdog=1 (also isolcpus=1).
> > See the code below. As I have not connected a serial console, I have attached a screen shot in a fairly
> > bad quality as jpg file... However, it is good enough to be able to read everything... 
> > The lockup is in function rpi_pop [xeno_nucleus].
> > It is called from gatekeeper_thread and from default_wake_function.
> > See the attached jpg for details.
> 
> Looks like we are stuck on rpilock, Philippe.
> 

Seems likely, yes. Switching on the nucleus DEBUG option would engage the
lockup detector, and pull the brake whenever the nucleus fails to grab
the rpilock.

Mathias, I guess this test has not been run with the nucleus debug
option enabled. Any chance to get a disassembly of the rpi_pop routine
as compiled into your kernel, so that we could check if we are really
stuck on this lock, or rather on some infinite walk into a corrupted RPI
list?

> And when looking at the holders of rpilock, I think one issue could be
> that we hold that lock while calling into xnpod_renice_root [1], i.e.
> doing a potential context switch. Was this checked to be safe?

xnpod_renice_root() does not reschedule immediately, on purpose; we would
never have been able to run any SMP config for more than a couple of seconds
otherwise. (See the NOSWITCH bit).

> Furthermore, that code path reveals that we take nklock nested into
> rpilock [2]. I haven't found a spot for the other way around (and I hope
> there is none)

xnshadow_start().

> , but such nesting is already evil per se...

Well, nesting spinlocks only falls into evilness when you get a circular
graph, but since the rpilock is a rookie in the locking team, I'm going
to check this.

Ok, I'm tackling this lockup issue now. I first need to reproduce it.
More news later.

> 
> Mathias, already tried your test case with our old friend "priority
> coupling" switched off? *If* this lock-up is actually due to rpilock
> brokenness, switching the feature off should make it disappear.
> 

It would be nice to switch on the nucleus DEBUG feature, especially the
queue debugging one. I understand this may hide the bug due to the
alteration of timings, but still, it would be useful to know whether a
configuration without NMI but with such debug knob on would trigger the
alarm.

> Jan
> 
> 
> [1]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#435
> [2]http://www.rts.uni-hannover.de/xenomai/lxr/source/include/nucleus/pod.h?v=SVN-trunk#308
> 
-- 
Philippe.





* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 12:19                         ` Philippe Gerum
@ 2007-07-19 12:40                           ` Jan Kiszka
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2007-07-19 12:40 UTC (permalink / raw)
  To: rpm; +Cc: xenomai-help, M. Koehrer

[-- Attachment #1: Type: text/plain, Size: 1090 bytes --]

Philippe Gerum wrote:
>> And when looking at the holders of rpilock, I think one issue could be
>> that we hold that lock while calling into xnpod_renice_root [1], i.e.
>> doing a potential context switch. Was this checked to be safe?
> 
> xnpod_renice_root() does not reschedule immediately, on purpose; we would
> never have been able to run any SMP config for more than a couple of seconds
> otherwise. (See the NOSWITCH bit).

OK, then it's not the cause.

> 
>> Furthermore, that code path reveals that we take nklock nested into
>> rpilock [2]. I haven't found a spot for the other way around (and I hope
>> there is none)
> 
> xnshadow_start().

Nope, that one is not holding nklock. But I found an offender...

> 
>> , but such nesting is already evil per se...
> 
> Well, nesting spinlocks only falls into evilness when you get a circular
> graph, but since the rpilock is a rookie in the locking team, I'm going
> to check this.

Take this one: gatekeeper_thread calls into rpi_pop with nklock
acquired. So we have a classic ABBA locking bug. Bang!

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]


* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
@ 2007-07-19 13:27 M. Koehrer
  2007-07-19 13:42 ` Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-19 13:27 UTC (permalink / raw)
  To: rpm, jan.kiszka; +Cc: xenomai, mathias_koehrer

> 
> Mathias, I guess this test has not been run with the nucleus debug
> option enabled. Any chance to get a disassembly of the rpi_pop routine
> as compiled into your kernel, so that we could check if we are really
> stuck on this lock, or rather on some infinite walk into a corrupted RPI
> list?
I tried last week with debug on and I did not get the freeze...
However, I will try again. I will enable all Xeno debug options (CONFIG_XENO_OPT_DEBUG,
CONFIG_XENO_OPT_DEBUG_NUCLEUS, CONFIG_XENO_OPT_DEBUG_QUEUE, 
CONFIG_XENO_OPT_WATCHDOG) and run the test again...

For this test I leave CONFIG_XENO_OPT_RPIDISABLE unset (default).

I hope to have more information on the results tomorrow morning...

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 13:27 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
@ 2007-07-19 13:42 ` Philippe Gerum
  2007-07-19 13:52   ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Philippe Gerum @ 2007-07-19 13:42 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai, jan.kiszka

On Thu, 2007-07-19 at 15:27 +0200, M. Koehrer wrote:
> > 
> > Mathias, I guess this test has not been run with the nucleus debug
> > option enabled. Any chance to get a disassembly of the rpi_pop routine
> > as compiled into your kernel, so that we could check if we are really
> > stuck on this lock, or rather on some infinite walk into a corrupted RPI
> > list?
> I tried last week with debug on and I did not get the freeze...
> However, I will try again. I will enable all Xeno debug options (CONFIG_XENO_OPT_DEBUG,
> CONFIG_XENO_OPT_DEBUG_NUCLEUS, CONFIG_XENO_OPT_DEBUG_QUEUE, 
> CONFIG_XENO_OPT_WATCHDOG) and run the test again...
> 
> For this test I leave CONFIG_XENO_OPT_RPIDISABLE unset (default).
> 

Yes.

> I hope to have more information on the results tomorrow morning...
> 

Ok, thanks. Meanwhile, there is some more thought on our side to fix the
locking sequence. Catch-22 looks so close.

> Regards
> 
> Mathias
> 
-- 
Philippe.





* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 13:42 ` Philippe Gerum
@ 2007-07-19 13:52   ` M. Koehrer
  2007-07-20  0:14     ` [Xenomai-core] RPI is good for you Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-19 13:52 UTC (permalink / raw)
  To: rpm, mathias_koehrer; +Cc: xenomai, jan.kiszka

Hi all,

I got a freeze after 30 minutes of running...
Unfortunately, the PC froze without printing any Oops message,
so this did not help at all...
I will run my system again with nmi_watchdog=1 (and the kernel debug active) to see if 
I can catch a freeze...


Regards

Mathias
> > > Mathias, I guess this test has not been run with the nucleus debug
> > > option enabled. Any chance to get a disassembly of the rpi_pop routine
> > > as compiled into your kernel, so that we could check if we are really
> > > stuck on this lock, or rather on some infinite walk into a corrupted
> RPI
> > > list?
> > I tried last week with debug on and I did not get the freeze...
> > However, I will try again. I will enable all Xeno debug options
> (CONFIG_XENO_OPT_DEBUG,
> > CONFIG_XENO_OPT_DEBUG_NUCLEUS, CONFIG_XENO_OPT_DEBUG_QUEUE, 
> > CONFIG_XENO_OPT_WATCHDOG) and run the test again...
> > 
> > For this test I leave CONFIG_XENO_OPT_RPIDISABLE unset (default).
> > 
> 
> Yes.
> 
> > I hope to have more information on the results tomorrow morning...
> > 


-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* [Xenomai-core] RPI is good for you
  2007-07-19 13:52   ` M. Koehrer
@ 2007-07-20  0:14     ` Philippe Gerum
  2007-07-20  7:32       ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: Philippe Gerum @ 2007-07-20  0:14 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 485 bytes --]


Mathias,

Could you try applying the attached patch against v2.3.2, and run your
box using the failing configuration? This patch is a _preliminary_
attempt at fixing two major issues; it is not complete, and may not even
be fully correct since it does not address all the pending issues yet.
Still, I would be interested to know whether I'm on the right path, and
whether it changes anything about your problem, without making your box jump
out of the window, that is.

TIA,

-- 
Philippe.


[-- Attachment #2: fix-rpi-locking-gk.patch --]
[-- Type: text/x-patch, Size: 5770 bytes --]

Index: ksrc/skins/psos+/task.c
===================================================================
--- ksrc/skins/psos+/task.c	(revision 2765)
+++ ksrc/skins/psos+/task.c	(working copy)
@@ -288,13 +288,6 @@
 		goto unlock_and_exit;
 	}
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	if (xnthread_user_task(&task->threadbase) != NULL
-	    && !xnthread_test_state(&task->threadbase,XNDORMANT)
-	    && (!xnpod_primary_p() || task != psos_current_task()))
-		xnshadow_send_sig(&task->threadbase, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	xnpod_delete_thread(&task->threadbase);
 
       unlock_and_exit:
Index: ksrc/skins/vxworks/taskLib.c
===================================================================
--- ksrc/skins/vxworks/taskLib.c	(revision 2765)
+++ ksrc/skins/vxworks/taskLib.c	(working copy)
@@ -285,13 +285,6 @@
 		goto error;
 	}
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	if (xnthread_user_task(&task->threadbase) != NULL
-	    && !xnthread_test_state(&task->threadbase,XNDORMANT)
-	    && (!xnpod_primary_p() || task != wind_current_task()))
-		xnshadow_send_sig(&task->threadbase, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	xnpod_delete_thread(&task->threadbase);
 	xnlock_put_irqrestore(&nklock, s);
 
Index: ksrc/skins/native/task.c
===================================================================
--- ksrc/skins/native/task.c	(revision 2765)
+++ ksrc/skins/native/task.c	(working copy)
@@ -581,29 +581,6 @@
 	if (err)
 		goto unlock_and_exit;
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	/* rt_task_delete() might be called for cleaning up a just
-	   created shadow task which has not been successfully mapped,
-	   so make sure we have an associated Linux mate before trying
-	   to send it a signal. This will also prevent any action on
-	   kernel-based Xenomai threads for which the user TCB
-	   extension is always NULL.
-	   We don't send any signal to dormant threads because GDB
-	   (6.x) has some problems dealing with vanishing threads
-	   under some circumstances, likely when asynchronous
-	   cancellation is in effect. In most cases, this is a
-	   non-issue since pthread_cancel() is requested from the skin
-	   interface library in parallel on the target thread, but
-	   when calling rt_task_delete() from kernel space against a
-	   created but unstarted user-space task, the Linux thread
-	   mated to the Xenomai shadow might linger unexpectedly on
-	   the startup barrier. */
-	if (xnthread_user_task(&task->thread_base) != NULL
-	    && !xnthread_test_state(&task->thread_base,XNDORMANT)
-	    && (!xnpod_primary_p() || task != xeno_current_task()))
-		xnshadow_send_sig(&task->thread_base, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	/* Does not return if task is current. */
 	xnpod_delete_thread(&task->thread_base);
 
Index: ksrc/nucleus/pod.c
===================================================================
--- ksrc/nucleus/pod.c	(revision 2765)
+++ ksrc/nucleus/pod.c	(working copy)
@@ -1245,10 +1245,35 @@
 	if (xnthread_test_state(thread, XNZOMBIE))
 		goto unlock_and_exit;	/* No double-deletion. */
 
+	sched = thread->sched;
+
+#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
+	/* xnpod_delete_thread() might be called for cleaning up a
+	   just created shadow task which has not been successfully
+	   mapped, so make sure we have an associated Linux mate
+	   before trying to send it a signal. This will also prevent
+	   any action on kernel-based Xenomai threads for which the
+	   user TCB extension is always NULL.  We don't send any
+	   signal to dormant threads because GDB (6.x) has some
+	   problems dealing with vanishing threads under some
+	   circumstances, likely when asynchronous cancellation is in
+	   effect. In most cases, this is a non-issue since
+	   pthread_cancel() is requested from the skin interface
+	   library in parallel on the target thread, but when calling
+	   xnpod_delete_thread() from kernel space against a created
+	   but unstarted user-space task, the Linux thread mated to
+	   the Xenomai shadow might linger unexpectedly on the startup
+	   barrier. */
+	if (xnthread_user_task(thread) != NULL
+	    && !xnthread_test_state(thread, XNDORMANT)
+	    && (!xnpod_primary_p() || thread != sched->runthread)) {
+		xnshadow_send_sig(thread, SIGKILL, 1);
+		goto unlock_and_exit;
+	}
+#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
+
 	xnltt_log_event(xeno_ev_thrdelete, thread->name);
 
-	sched = thread->sched;
-
 	removeq(&nkpod->threadq, &thread->glink);
 	nkpod->threadq_rev++;
 
Index: ksrc/nucleus/shadow.c
===================================================================
--- ksrc/nucleus/shadow.c	(revision 2765)
+++ ksrc/nucleus/shadow.c	(working copy)
@@ -837,7 +837,8 @@
 		if (kthread_should_stop())
 			break;
 
-		xnlock_get_irqsave(&nklock, s);
+		/* Real-time shadow TCBs are always removed on behalf
+		   of the killed thread. */
 
 		thread = gk->thread;
 
@@ -846,6 +847,8 @@
 		   pending request, just ignore the latter. */
 
 		if (xnthread_user_task(thread)->state == TASK_INTERRUPTIBLE) {
+			rpi_pop(thread);
+			xnlock_get_irqsave(&nklock, s);
 #ifdef CONFIG_SMP
 			/* If the task changed its CPU while in secondary mode,
 			   change the CPU of the underlying Xenomai shadow too. We
@@ -855,14 +858,12 @@
 			thread->sched = xnpod_sched_slot(cpu);
 #endif /* CONFIG_SMP */
 			xnpod_resume_thread(thread, XNRELAX);
-			rpi_pop(thread);
 #ifdef CONFIG_XENO_OPT_ISHIELD
 			disengage_irq_shield();
 #endif /* CONFIG_XENO_OPT_ISHIELD */
 			xnpod_schedule();
+			xnlock_put_irqrestore(&nklock, s);
 		}
-
-		xnlock_put_irqrestore(&nklock, s);
 	}
 
 	return 0;


* Re: [Xenomai-core] RPI is good for you
  2007-07-20  0:14     ` [Xenomai-core] RPI is good for you Philippe Gerum
@ 2007-07-20  7:32       ` M. Koehrer
  2007-07-20 11:54         ` M. Koehrer
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-20  7:32 UTC (permalink / raw)
  To: rpm, mathias_koehrer; +Cc: xenomai

Hi Philippe,

I have done my tests on 2.3.1, so I applied the patch to 2.3.1, which worked 
(with some line offsets in the patch...).
I am now waiting for results.
So far everything is running fine. No issues.

More later.

Regards

Mathias
 


> 
> Mathias,
> 
> Could you try applying the attached patch against v2.3.2, and run your
> box using the failing configuration. This patch is a _preliminary_
> attempt at fixing two major issues, it is not complete, and may not even
> be fully correct since it does not address all the pending issues yet.
> Still, I would be interested to know whether I'm on the right path, and
> if it changes something to your problem, without making your box jump
> out of the window, that is.
> 
> TIA,
> 
> -- 
> Philippe.
> 
> 

-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* Re: [Xenomai-core] RPI is good for you
  2007-07-20  7:32       ` M. Koehrer
@ 2007-07-20 11:54         ` M. Koehrer
  2007-07-20 12:16           ` Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-20 11:54 UTC (permalink / raw)
  To: rpm, mathias_koehrer; +Cc: xenomai

Hi Philippe,
I left my test running for a couple of hours - no freeze so far... 

However, I need this machine for other work, so I have to stop the test now...

Regards

Mathias

> Hi Philippe,
> 
> I have done my tests on 2.3.1 and I tried to apply the patch on 2.3.1 which
> worked 
> (with some line offset in the patch...).
> I wait now for any result.
> So far everything is running fine. No issues.
> 
> More later.
> 
> Regards
> 
> Mathias
>  
> 
> 
> > 
> > Mathias,
> > 
> > Could you try applying the attached patch against v2.3.2, and run your
> > box using the failing configuration. This patch is a _preliminary_
> > attempt at fixing two major issues, it is not complete, and may not even
> > be fully correct since it does not address all the pending issues yet.
> > Still, I would be interested to know whether I'm on the right path, and
> > if it changes something to your problem, without making your box jump
> > out of the window, that is.
> > 
> > TIA,
> > 
> > -- 
> > Philippe.
> > 
> > 
> 
> -- 
> Mathias Koehrer
> mathias_koehrer@domain.hid
> 
> 
> 

-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* Re: [Xenomai-core] RPI is good for you
  2007-07-20 11:54         ` M. Koehrer
@ 2007-07-20 12:16           ` Philippe Gerum
  2007-07-23 15:01             ` Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: Philippe Gerum @ 2007-07-20 12:16 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
> Hi Philippe,
> I left my test running for a couple of hours - no freeze so far... 
> 
> However, I have to do some other stuff on this machine, I have to stop the test now...
> 

Ok, thanks for the feedback. I will send an extended patch later today,
so that you can test it over a longer period when you see fit.

TIA,

> Regards
> 
> Mathias
> 
> > Hi Philippe,
> > 
> > I have done my tests on 2.3.1 and I tried to apply the patch on 2.3.1 which
> > worked 
> > (with some line offset in the patch...).
> > I wait now for any result.
> > So far everything is running fine. No issues.
> > 
> > More later.
> > 
> > Regards
> > 
> > Mathias
> >  
> > 
> > 
> > > 
> > > Mathias,
> > > 
> > > Could you try applying the attached patch against v2.3.2, and run your
> > > box using the failing configuration. This patch is a _preliminary_
> > > attempt at fixing two major issues, it is not complete, and may not even
> > > be fully correct since it does not address all the pending issues yet.
> > > Still, I would be interested to know whether I'm on the right path, and
> > > if it changes something to your problem, without making your box jump
> > > out of the window, that is.
> > > 
> > > TIA,
> > > 
> > > -- 
> > > Philippe.
> > > 
> > > 
> > 
> > -- 
> > Mathias Koehrer
> > mathias_koehrer@domain.hid
> > 
> > 
> > 
> 
-- 
Philippe.





* Re: [Xenomai-core] RPI is good for you
  2007-07-20 12:16           ` Philippe Gerum
@ 2007-07-23 15:01             ` Philippe Gerum
  2007-07-23 17:15               ` Philippe Gerum
                                 ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-23 15:01 UTC (permalink / raw)
  To: M. Koehrer; +Cc: Jan Kiszka, xenomai

[-- Attachment #1: Type: text/plain, Size: 2067 bytes --]

On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
> On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
> > Hi Philippe,
> > I left my test running for a couple of hours - no freeze so far... 
> > 
> > However, I have to do some other stuff on this machine, I have to stop the test now...
> > 
> 
> Ok, thanks for the feedback. I will send an extended patch later today,
> so that you could test it on a longer period when you see fit.

It took me a bit longer than expected, but here is a patch which
addresses all the pending issues with RPI, hopefully (applies against
2.3.1 stock).

The good thing about Jan grumbling at me, is that this usually makes me
look at the big picture anew. And the RPI picture was not that nice,
that's a fact.

Besides the locking sequence issue, the ex-aequo #1 problem was that CPU
migration of Linux tasks causing an RPI boost had some very nasty
side-effects on RPI management, and would create all sorts of funky
situations I'm too ashamed to talk about, except under the generic term
of "horrendous mess".

Now, regarding the deadlock issue, suppressing the RPI-specific locking
entirely would have been the best solution, but unfortunately, the
migration scheme makes this out of reach, at least without resorting to
some hairy and likely unreliable implementation. Therefore, the solution
I came up with consists of making the RPI lock a per-CPU thing, so that
most RPI routines actually grab a _local_ lock wrt the current
CPU, and those routines are allowed to hold the nklock as they wish. When
some per-CPU RPI lock is accessed from a remote CPU, it is guaranteed
that _no nklock_ may be held nested. Actually, the remote case only
occurs once, in rpi_clear_remote(), and all its callers are guaranteed
to be nklock-free (a debug assertion even enforces that).

For the migration issue, the RPI transitions have been ironed out to
make sure we deal properly with all the subtleties of the Linux load
balancer.

Mathias, please let me know if the attached patch improves the situation
on your side.

-- 
Philippe.



[-- Attachment #2: fix-rpi-cpu-migration.patch --]
[-- Type: text/x-patch, Size: 32486 bytes --]

diff -uNrp xenomai-2.3.1/include/asm-generic/system.h xenomai-2.3.1-rpi/include/asm-generic/system.h
--- xenomai-2.3.1/include/asm-generic/system.h	2007-03-18 17:02:02.000000000 +0100
+++ xenomai-2.3.1-rpi/include/asm-generic/system.h	2007-07-23 14:05:54.000000000 +0200
@@ -374,6 +374,11 @@ static inline int xnarch_send_ipi (xnarc
     return rthal_send_ipi(RTHAL_SERVICE_IPI0, cpumask);
 }
 
+static inline int xnlock_is_owner(xnlock_t *lock)
+{
+	return atomic_read(&lock->owner) == xnarch_current_cpu();
+}
+
 #else /* !CONFIG_SMP */
 
 #define xnlock_init(lock)              do { } while(0)
@@ -383,6 +388,7 @@ static inline int xnarch_send_ipi (xnarc
 #define xnlock_put_irqrestore(lock,x)  rthal_local_irq_restore(x)
 #define xnlock_clear_irqoff(lock)      rthal_local_irq_disable()
 #define xnlock_clear_irqon(lock)       rthal_local_irq_enable()
+#define xnlock_is_owner(lock)	       1
 
 static inline int xnarch_send_ipi (xnarch_cpumask_t cpumask)
 {
diff -uNrp xenomai-2.3.1/include/nucleus/core.h xenomai-2.3.1-rpi/include/nucleus/core.h
--- xenomai-2.3.1/include/nucleus/core.h	2007-02-01 20:00:27.000000000 +0100
+++ xenomai-2.3.1-rpi/include/nucleus/core.h	2007-07-23 14:05:54.000000000 +0200
@@ -33,9 +33,11 @@
 /* Visible priority range supported by the core pod. */
 #define XNCORE_MIN_PRIO     0
 #define XNCORE_MAX_PRIO     257
+/* Base priority of the root thread for the core pod. */
+#define XNCORE_IDLE_PRIO    -1
 
 /* Total number of priority levels (including the hidden root one) */
-#define XNCORE_NR_PRIO      (XNCORE_MAX_PRIO - XNCORE_MIN_PRIO + 2)
+#define XNCORE_NR_PRIO      (XNCORE_MAX_PRIO - XNCORE_IDLE_PRIO + 1)
 
 /* Priority sub-range used by core APIs. */
 #define XNCORE_LOW_PRIO     0
@@ -44,9 +46,6 @@
 /* Priority of IRQ servers in user-space. */
 #define XNCORE_IRQ_PRIO     XNCORE_MAX_PRIO
 
-/* Base priority of the root thread for the core pod. */
-#define XNCORE_BASE_PRIO    -1
-
 #ifdef __KERNEL__
 
 #ifdef __cplusplus
diff -uNrp xenomai-2.3.1/include/nucleus/pod.h xenomai-2.3.1-rpi/include/nucleus/pod.h
--- xenomai-2.3.1/include/nucleus/pod.h	2007-03-18 17:02:02.000000000 +0100
+++ xenomai-2.3.1-rpi/include/nucleus/pod.h	2007-07-23 14:05:54.000000000 +0200
@@ -336,6 +336,11 @@ static inline void xnpod_renice_root(int
 	xnlock_put_irqrestore(&nklock, s);
 }
 
+static inline int xnpod_root_priority(void)
+{
+	return xnthread_current_priority(xnpod_current_root());
+}
+
 static inline int xnpod_get_qdir(xnpod_t *pod)
 {
 	/* Returns the queuing direction of threads for a given pod */
diff -uNrp xenomai-2.3.1/include/nucleus/queue.h xenomai-2.3.1-rpi/include/nucleus/queue.h
--- xenomai-2.3.1/include/nucleus/queue.h	2007-02-04 12:24:23.000000000 +0100
+++ xenomai-2.3.1-rpi/include/nucleus/queue.h	2007-07-23 14:09:20.000000000 +0200
@@ -104,7 +104,8 @@ do { \
     while (curr != &(__qslot)->head && nelems < (__qslot)->elems)	\
         curr = curr->last, nelems++; \
     if (curr != &(__qslot)->head || nelems != (__qslot)->elems)	  \
-        xnpod_fatal("corrupted queue, qslot->elems=%d, qslot=%p at %s:%d", \
+	xnpod_fatal("corrupted queue, qslot->elems=%d/%d, qslot=%p at %s:%d", \
+		    nelems,				\
                     (__qslot)->elems,				  \
                     __qslot,					  \
 		    __FILE__,__LINE__);				  \
@@ -746,14 +747,22 @@ static inline xnpholder_t *findmlqh(xnml
 
 static inline xnpholder_t *getheadmlq(xnmlqueue_t *mlqslot)
 {
+    xnpholder_t *holder;
     xnqueue_t *queue;
 
     if (emptymlq_p(mlqslot))
         return NULL;
 
     queue = &mlqslot->queue[ffsmlq(mlqslot)];
-    
-    return (xnpholder_t *)getheadq(queue);
+    holder = (xnpholder_t *)getheadq(queue);
+
+    XENO_ASSERT(QUEUES, holder,
+		xnpod_fatal
+		("corrupted multi-level queue, qslot=%p at %s:%d", mlqslot,
+		 __FILE__, __LINE__);
+		);
+
+    return holder;
 }
 
 static inline xnpholder_t *getmlq(xnmlqueue_t *mlqslot)
@@ -775,12 +784,11 @@ static inline xnpholder_t *getmlq(xnmlqu
 		    __FILE__,__LINE__);
         );
 
-    hi = idx / BITS_PER_LONG;
-    lo = idx % BITS_PER_LONG;
-
     mlqslot->elems--;    
 
     if (emptyq_p(queue)) {
+	hi = idx / BITS_PER_LONG;
+        lo = idx % BITS_PER_LONG;
         __clrbits(mlqslot->lomap[hi],1 << lo);
 	if (mlqslot->lomap[hi] == 0)
 	    __clrbits(mlqslot->himap,1 << hi);
diff -uNrp xenomai-2.3.1/include/nucleus/thread.h xenomai-2.3.1-rpi/include/nucleus/thread.h
--- xenomai-2.3.1/include/nucleus/thread.h	2007-03-15 15:10:30.000000000 +0100
+++ xenomai-2.3.1-rpi/include/nucleus/thread.h	2007-07-23 14:05:54.000000000 +0200
@@ -168,9 +168,7 @@ typedef struct xnthread {
 
     xnholder_t glink;		/* Thread holder in global queue */
 
-/* We don't want side-effects on laddr here! */
-#define link2thread(laddr,link) \
-((xnthread_t *)(((char *)laddr) - (int)(&((xnthread_t *)0)->link)))
+#define link2thread(ln, fld)	container_of(ln, xnthread_t, fld)
 
     xnpqueue_t claimq;		/* Owned resources claimed by others (PIP) */
 
diff -uNrp xenomai-2.3.1/ksrc/nucleus/pod.c xenomai-2.3.1-rpi/ksrc/nucleus/pod.c
--- xenomai-2.3.1/ksrc/nucleus/pod.c	2007-03-19 12:14:13.000000000 +0100
+++ xenomai-2.3.1-rpi/ksrc/nucleus/pod.c	2007-07-23 14:05:54.000000000 +0200
@@ -1242,10 +1242,52 @@ void xnpod_delete_thread(xnthread_t *thr
 	if (xnthread_test_state(thread, XNZOMBIE))
 		goto unlock_and_exit;	/* No double-deletion. */
 
-	xnltt_log_event(xeno_ev_thrdelete, thread->name);
-
 	sched = thread->sched;
 
+#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
+	/*
+	 * This block serves two purposes:
+	 *
+	 * 1) Make sure Linux counterparts of shadow threads do exit
+	 * upon deletion request from the nucleus through a call to
+	 * xnpod_delete_thread().
+	 *
+	 * 2) Make sure shadow threads are removed from the system on
+	 * behalf of their own context, by sending them a lethal
+	 * signal when it is not the case instead of wiping out their
+	 * TCB. In such a case, the deletion is asynchronous, and
+	 * killed thread will later enter xnpod_delete_thread() from
+	 * the exit notification handler (I-pipe).
+	 *
+	 * Sidenote: xnpod_delete_thread() might be called for
+	 * cleaning up a just created shadow task which has not been
+	 * successfully mapped, so we need to make sure that we have
+	 * an associated Linux mate before trying to send it a signal
+	 * (i.e. user_task extension != NULL). This will also prevent
+	 * any action on kernel-based Xenomai threads for which the
+	 * user TCB extension is always NULL.  We don't send any
+	 * signal to dormant threads because GDB (6.x) has some
+	 * problems dealing with vanishing threads under some
+	 * circumstances, likely when asynchronous cancellation is in
+	 * effect. In most cases, this is a non-issue since
+	 * pthread_cancel() is requested from the skin interface
+	 * library in parallel on the target thread. In the rare case
+	 * of calling xnpod_delete_thread() from kernel space against
+	 * a created but unstarted user-space task, the Linux thread
+	 * mated to the Xenomai shadow might linger unexpectedly on
+	 * the startup barrier.
+	 */
+
+	if (xnthread_user_task(thread) != NULL &&
+	    !xnthread_test_state(thread, XNDORMANT) &&
+	    thread != sched->runthread) {
+		xnshadow_send_sig(thread, SIGKILL, 1);
+		goto unlock_and_exit;
+	}
+#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
+
+	xnltt_log_event(xeno_ev_thrdelete, thread->name);
+
 	removeq(&nkpod->threadq, &thread->glink);
 	nkpod->threadq_rev++;
 
diff -uNrp xenomai-2.3.1/ksrc/nucleus/shadow.c xenomai-2.3.1-rpi/ksrc/nucleus/shadow.c
--- xenomai-2.3.1/ksrc/nucleus/shadow.c	2007-03-19 14:19:49.000000000 +0100
+++ xenomai-2.3.1-rpi/ksrc/nucleus/shadow.c	2007-07-23 14:10:29.000000000 +0200
@@ -75,6 +75,9 @@ static struct __gatekeeper {
 	struct linux_semaphore sync;
 	xnthread_t *thread;
 	struct xnrpi {
+#ifdef CONFIG_SMP
+		xnlock_t lock;
+#endif /* CONFIG_SMP */
 		xnsched_queue_t threadq;
 	} rpislot;
 
@@ -121,20 +124,13 @@ void xnpod_discard_iface_proc(struct xns
 
 #ifndef CONFIG_XENO_OPT_RPIDISABLE
 
-#ifdef CONFIG_SMP
-static xnlock_t rpilock = XNARCH_LOCK_UNLOCKED;
-#endif /* CONFIG_SMP */
-
 #define rpi_p(t)	((t)->rpi != NULL)
 
-static inline void rpi_init(void)
-{
-	xnlock_init(&rpilock);
-}
-
 static inline void rpi_init_gk(struct __gatekeeper *gk)
 {
-	sched_initpq(&gk->rpislot.threadq, xnqueue_down, XNCORE_NR_PRIO);
+ 	struct xnrpi *rpislot = &gk->rpislot;
+ 	xnlock_init(&rpislot->lock);
+ 	sched_initpq(&rpislot->threadq, xnqueue_down, XNCORE_NR_PRIO);
 }
 
 static inline void rpi_none(xnthread_t *thread)
@@ -144,12 +140,12 @@ static inline void rpi_none(xnthread_t *
 
 static void rpi_push(xnthread_t *thread)
 {
-	struct __gatekeeper *gk;
+	struct xnrpi *rpislot;
 	xnthread_t *top;
 	int prio;
 	spl_t s;
 
-	gk = &gatekeeper[rthal_processor_id()];
+	rpislot = &gatekeeper[rthal_processor_id()].rpislot;
 
 	/* non-RT shadows and RT shadows which disabled RPI cause the
 	   root priority to be lowered to its base level. The purpose
@@ -160,187 +156,241 @@ static void rpi_push(xnthread_t *thread)
 
 	if (likely(xnthread_user_task(thread)->policy == SCHED_FIFO &&
 		   !xnthread_test_state(thread, XNRPIOFF))) {
-		xnlock_get_irqsave(&rpilock, s);
+		xnlock_get_irqsave(&rpislot->lock, s);
 
 		if (XENO_DEBUG(NUCLEUS) && rpi_p(thread))
 			xnpod_fatal("re-enqueuing a relaxed thread in the RPI queue");
 
-		sched_insertpqf(&gk->rpislot.threadq, &thread->xlink, xnthread_current_priority(thread));
-		thread->rpi = &gk->rpislot;
+		sched_insertpqf(&rpislot->threadq, &thread->xlink, xnthread_current_priority(thread));
+		thread->rpi = rpislot;
 
-		top = link2thread(sched_getheadpq(&gk->rpislot.threadq), xlink);
+		top = link2thread(sched_getheadpq(&rpislot->threadq), xlink);
 		prio = xnthread_current_priority(top);
-		xnlock_put_irqrestore(&rpilock, s);
+		xnlock_put_irqrestore(&rpislot->lock, s);
 	} else
-		prio = XNCORE_BASE_PRIO;
+		prio = XNCORE_IDLE_PRIO;
 
-	xnpod_renice_root(prio);
+	if (xnpod_root_priority() != prio)
+		xnpod_renice_root(prio);
 }
 
 static void rpi_pop(xnthread_t *thread)
 {
-	struct __gatekeeper *gk;
+	struct xnrpi *rpislot;
 	int prio;
 	spl_t s;
 
-	gk = &gatekeeper[rthal_processor_id()];
+	rpislot = &gatekeeper[rthal_processor_id()].rpislot;
 
-	xnlock_get_irqsave(&rpilock, s);
+	xnlock_get_irqsave(&rpislot->lock, s);
 
 	/* Make sure we don't try to unlink a shadow which is not
-	   linked to the RPI queue. We must be known by the current
-	   RPI slot here, or not be linked at all to any RPI slot. */
-	if (unlikely(!rpi_p(thread))) {
-		xnlock_put_irqrestore(&rpilock, s);
+	   linked to the local RPI queue. This may happen in case a
+	   hardening thread is migrated by the kernel while in flight
+	   to the primary mode. */
+
+	if (likely(thread->rpi == rpislot)) {
+		sched_removepq(&rpislot->threadq, &thread->xlink);
+		rpi_none(thread);
+	} else if (!rpi_p(thread)) {
+		xnlock_put_irqrestore(&rpislot->lock, s);
 		return;
 	}
 
-	sched_removepq(&gk->rpislot.threadq, &thread->xlink);
-	rpi_none(thread);
-
-	if (likely(sched_emptypq_p(&gk->rpislot.threadq)))
-		prio = XNCORE_BASE_PRIO;
+	if (likely(sched_emptypq_p(&rpislot->threadq)))
+		prio = XNCORE_IDLE_PRIO;
 	else {
-		xnthread_t *top = link2thread(sched_getheadpq(&gk->rpislot.threadq), xlink);
+		xnpholder_t *pholder = sched_getheadpq(&rpislot->threadq);
+		xnthread_t *top = link2thread(pholder, xlink);
 		prio = xnthread_current_priority(top);
 	}
 
-	xnlock_put_irqrestore(&rpilock, s);
+	xnlock_put_irqrestore(&rpislot->lock, s);
 
-	xnpod_renice_root(prio);
+	if (xnpod_root_priority() != prio)
+		xnpod_renice_root(prio);
 }
 
-static inline void rpi_update(xnthread_t *thread)
+static void rpi_update(xnthread_t *thread)
 {
+	struct xnrpi *rpislot;
 	spl_t s;
 
-	/* This is the only place where we could unlink a thread from
-	 * a remote RPI slot (after a migration within the Linux
-	 * domain), so let's use the backlink pointer the thread
-	 * provides to fetch the actual slot it is supposed to be
-	 * linked to, _not_ the gatekeeper's RPI slot for the current
-	 * CPU. */
-	xnlock_get_irqsave(&rpilock, s);
+	rpislot = &gatekeeper[rthal_processor_id()].rpislot;
+	xnlock_get_irqsave(&rpislot->lock, s);
+	sched_removepq(&rpislot->threadq, &thread->xlink);
+	rpi_none(thread);
+	rpi_push(thread);
+	xnlock_put_irqrestore(&rpislot->lock, s);
+}
+
+#ifdef CONFIG_SMP
+
+static void rpi_clear_remote(xnthread_t *thread)
+{
+	struct xnrpi *rpislot;
+	int rcpu = -1;
+	spl_t s;
 
-	if (unlikely(!rpi_p(thread))) {
-		xnlock_put_irqrestore(&rpilock, s);
+	/*
+	 * This is the only place where we may touch a remote RPI slot
+	 * (after a migration within the Linux domain), so let's use
+	 * the backlink pointer the thread provides to fetch the
+	 * actual slot it is supposed to be linked to, _not_ the
+	 * gatekeeper's RPI slot for the current CPU.
+	 *
+	 * BIG FAT WARNING: The nklock must NOT be held when entering
+	 * this routine, otherwise a deadlock would be possible,
+	 * caused by conflicting locking sequences between the per-CPU
+	 * RPI lock and the nklock.
+	 */
+
+	if (XENO_DEBUG(NUCLEUS) && xnlock_is_owner(&nklock))
+		xnpod_fatal("nklock held while calling %s - this may deadlock!",
+			    __FUNCTION__);
+
+	rpislot = thread->rpi;
+
+	if (unlikely(rpislot == NULL))
 		return;
-	}
 
-	sched_removepq(&thread->rpi->threadq, &thread->xlink);
+	xnlock_get_irqsave(&rpislot->lock, s);
 
-#ifdef CONFIG_SMP
+	/* The RPI slot - if present - is always valid, and won't
+	 * change since the thread is resuming on this CPU and cannot
+	 * migrate under our feet. We may grab the remote slot lock
+	 * now. */
+
+	sched_removepq(&rpislot->threadq, &thread->xlink);
+	rpi_none(thread);
 
-	/* Ok, this one is not trivial. Unless a relaxed shadow has
+	if (sched_emptypq_p(&rpislot->threadq))
+		rcpu = container_of(rpislot, struct __gatekeeper, rpislot) - gatekeeper;
+
+	xnlock_put_irqrestore(&rpislot->lock, s);
+
+	/*
+	 * Ok, this one is not trivial. Unless a relaxed shadow has
 	 * forced its CPU affinity, it may migrate to another CPU as a
 	 * result of Linux's load balancing strategy. If the last
-	 * relaxed Xenomai thread is moved while in a blocked state
-	 * from a CPU (i.e. != TASK_RUNNING), there is no way for
-	 * rpi_switch() to lower the root thread priority, since
-	 * do_schedule_event() is only called for incoming/outgoing
-	 * Xenomai shadows, and not for regular Linux tasks. This
-	 * would leave the Xenomai root thread for the source CPU with
-	 * a boosted priority, inherited from the last migrated
-	 * shadow. To prevent this, we send an IPI to the source CPU
-	 * when a migration is detected from the destination CPU, so
-	 * that it could adjust its root thread priority whenever no
-	 * other relaxed shadow is undergoing a RPI boost. */
-
-	if (sched_emptypq_p(&thread->rpi->threadq)) {
-		int rcpu = container_of(thread->rpi, struct __gatekeeper, rpislot) - gatekeeper;
-		if (rcpu != rthal_processor_id()) {
-			xnsched_t *rsched = xnpod_sched_slot(rcpu);
-			if (!testbits(rsched->status, XNRPICK)) {
-				xnarch_cpumask_t cpumask;
-				setbits(rsched->status, XNRPICK);
-				xnarch_cpus_clear(cpumask);
-				xnarch_cpu_set(rcpu, cpumask);
-				xnarch_send_ipi(cpumask);
-			}
+	 * relaxed Xenomai thread migrates, there is no way for
+	 * rpi_switch() to lower the root thread priority on the
+	 * source CPU, since do_schedule_event() is only called for
+	 * incoming/outgoing Xenomai shadows. This would leave the
+	 * Xenomai root thread for the source CPU with a boosted
+	 * priority. To prevent this, we send an IPI from the
+	 * destination CPU to the source CPU when a migration is
+	 * detected, so that the latter could adjust its root thread
+	 * priority.
+	 */
+	if (rcpu != -1 && rcpu != rthal_processor_id()) {
+		xnsched_t *rsched = xnpod_sched_slot(rcpu);
+		if (!testbits(rsched->status, XNRPICK)) {
+			xnarch_cpumask_t cpumask;
+			setbits(rsched->status, XNRPICK);
+			xnarch_cpus_clear(cpumask);
+			xnarch_cpu_set(rcpu, cpumask);
+			xnarch_send_ipi(cpumask);
 		}
 	}
-#endif
+}
 
-	rpi_none(thread);
+static void rpi_migrate(xnthread_t *thread)
+{
+	rpi_clear_remote(thread);
 	rpi_push(thread);
-	xnlock_put_irqrestore(&rpilock, s);
 }
 
+#else  /* !CONFIG_SMP */
+#define rpi_clear_remote(t)	do { } while(0)
+#define rpi_migrate(t)		do { } while(0)
+#endif	/* !CONFIG_SMP */
+
 static inline void rpi_switch(struct task_struct *next)
 {
 	xnthread_t *threadin, *threadout;
-	struct __gatekeeper *gk;
+	struct xnrpi *rpislot;
 	int oldprio, newprio;
 	spl_t s;
 
 	threadout = xnshadow_thread(current);
 	threadin = xnshadow_thread(next);
-	gk = &gatekeeper[rthal_processor_id()];
-	oldprio = xnthread_current_priority(xnpod_current_root());
+	rpislot = &gatekeeper[rthal_processor_id()].rpislot;
+	oldprio = xnpod_root_priority();
 
 	if (threadout &&
 	    current->state != TASK_RUNNING &&
 	    !xnthread_test_info(threadout, XNATOMIC)) {
-		/* A blocked Linux task must be removed from the RPI
+		/*
+		 * A blocked Linux task must be removed from the RPI
 		 * list. Checking for XNATOMIC prevents from unlinking
 		 * a thread which is currently in flight to the
 		 * primary domain (see xnshadow_harden()); not doing
 		 * so would open a tiny window for priority
-		 * inversion. */
-		xnlock_get_irqsave(&rpilock, s);
-		if (threadout->rpi != NULL) {
-			sched_removepq(&gk->rpislot.threadq, &threadout->xlink);
+		 * inversion.
+		 *
+		 * BIG FAT WARNING: Do not consider a blocked thread
+		 * linked to another processor's RPI list for removal,
+		 * since this may happen if such thread immediately
+		 * resumes on the remote CPU.
+		 */
+		xnlock_get_irqsave(&rpislot->lock, s);
+		if (threadout->rpi == rpislot) {
+			sched_removepq(&rpislot->threadq, &threadout->xlink);
 			rpi_none(threadout);
 		}
-		xnlock_put_irqrestore(&rpilock, s);
+		xnlock_put_irqrestore(&rpislot->lock, s);
 	}
 
-	if (threadin != NULL &&
-	    next->policy == SCHED_FIFO &&
-	    !xnthread_test_state(threadin, XNRPIOFF)) {
-		newprio = xnthread_current_priority(threadin);
-
-		/* Be careful about two issues affecting a task's RPI
-		 * state here:
-		 *
-		 * 1) A relaxed shadow awakes (Linux-wise) after a
-		 * blocked state, which caused it to be removed from
-		 * the RPI list while it was sleeping; we have to link
-		 * it back again as it resumes.
-		 *
-		 * 2) A relaxed shadow has migrated from another CPU,
-		 * in that case, we end up having a thread linked to
-		 * an RPI slot which is _not_ the current gatekeeper's
-		 * one (keep in mind that we don't care about
-		 * migrations handled by Xenomai in primary mode,
-		 * since the shadow would not be linked to any RPI
-		 * queue in the first place).  Since a migration must
-		 * happen while the task is off the CPU Linux-wise,
-		 * rpi_switch() will be called upon resumption on the
-		 * target CPU by the Linux scheduler. At that point,
-		 * we just need to update the RPI information in case
-		 * the RPI queue backlink does not match the
-		 * gatekeeper's RPI slot for the current CPU. */
-
-		if (unlikely(threadin->rpi == NULL)) {
-			xnlock_get_irqsave(&rpilock, s);
-			sched_insertpqf(&gk->rpislot.threadq, &threadin->xlink, newprio);
-			threadin->rpi = &gk->rpislot;
-			xnlock_put_irqrestore(&rpilock, s);
-		} else if (unlikely(threadin->rpi != &gk->rpislot))
-			rpi_update(threadin);
-	} else {
-		xnlock_get_irqsave(&rpilock, s);
+	if (threadin == NULL ||
+	    next->policy != SCHED_FIFO ||
+	    xnthread_test_state(threadin, XNRPIOFF)) {
+		xnlock_get_irqsave(&rpislot->lock, s);
 
-		if (!sched_emptypq_p(&gk->rpislot.threadq)) {
-			xnthread_t *top = link2thread(sched_getheadpq(&gk->rpislot.threadq), xlink);
+		if (!sched_emptypq_p(&rpislot->threadq)) {
+			xnpholder_t *pholder = sched_getheadpq(&rpislot->threadq);
+			xnthread_t *top = link2thread(pholder, xlink);
 			newprio = xnthread_current_priority(top);
 		} else
-			newprio = XNCORE_BASE_PRIO;
+			newprio = XNCORE_IDLE_PRIO;
 
-		xnlock_put_irqrestore(&rpilock, s);
+		xnlock_put_irqrestore(&rpislot->lock, s);
+		goto boost_root;
 	}
 
+	newprio = xnthread_current_priority(threadin);
+
+	/* Be careful about two issues affecting a task's RPI state
+	 * here:
+	 *
+	 * 1) A relaxed shadow awakes (Linux-wise) after a blocked
+	 * state, which caused it to be removed from the RPI list
+	 * while it was sleeping; we have to link it back again as it
+	 * resumes.
+	 *
+	 * 2) A relaxed shadow has migrated from another CPU, in that
+	 * case, we end up having a thread linked to an RPI slot which
+	 * is _not_ the current gatekeeper's one [sidenote: we don't
+	 * care about migrations handled by Xenomai in primary mode,
+	 * since the shadow would not be linked to any RPI queue in
+	 * the first place].  Since a migration must happen while the
+	 * task is off the CPU Linux-wise, rpi_switch() will be called
+	 * upon resumption on the target CPU by the Linux
+	 * scheduler. At that point, we just need to update the RPI
+	 * information in case the RPI queue backlink does not match
+	 * the gatekeeper's RPI slot for the current CPU. */
+
+	if (unlikely(threadin->rpi == NULL)) {
+		xnlock_get_irqsave(&rpislot->lock, s);
+		sched_insertpqf(&rpislot->threadq, &threadin->xlink, newprio);
+		threadin->rpi = rpislot;
+		xnlock_put_irqrestore(&rpislot->lock, s);
+	} else if (unlikely(threadin->rpi != rpislot))
+		/* We hold no lock here. */
+		rpi_migrate(threadin);
+
+boost_root:
+
 	if (newprio == oldprio)
 		return;
 
@@ -356,38 +406,36 @@ static inline void rpi_switch(struct tas
 		xnpod_schedule();
 }
 
-static inline void rpi_clear(void)
+static inline void rpi_clear_local(xnthread_t *thread)
 {
-	if (!xnshadow_thread(current))
-		xnpod_renice_root(XNCORE_BASE_PRIO);
+	if (thread == NULL && xnpod_root_priority() != XNCORE_IDLE_PRIO)
+		xnpod_renice_root(XNCORE_IDLE_PRIO);
 }
 
 #ifdef CONFIG_SMP
 
 void xnshadow_rpi_check(void)
 {
-	struct __gatekeeper *gk;
+	struct xnrpi *rpislot = &gatekeeper[rthal_processor_id()].rpislot;
  	spl_t s;
  
- 	gk = &gatekeeper[rthal_processor_id()];
- 
- 	xnlock_get_irqsave(&rpilock, s);
+ 	xnlock_get_irqsave(&rpislot->lock, s);
  
- 	if (sched_emptypq_p(&gk->rpislot.threadq)) {
- 		if (xnthread_current_priority(xnpod_current_root()) != XNCORE_BASE_PRIO)
- 			xnpod_renice_root(XNCORE_BASE_PRIO);
+ 	if (sched_emptypq_p(&rpislot->threadq)) {
+ 		if (xnpod_root_priority() != XNCORE_IDLE_PRIO)
+ 			xnpod_renice_root(XNCORE_IDLE_PRIO);
  	}
  
- 	xnlock_put_irqrestore(&rpilock, s);
+ 	xnlock_put_irqrestore(&rpislot->lock, s);
 }
 
 #endif	/* CONFIG_SMP */
  
 #else
 
-#define rpi_init(gk)		do { } while(0)
-#define rpi_init_gk(gk)	do { } while(0)
-#define rpi_clear()		do { } while(0)
+#define rpi_init_gk(gk)		do { } while(0)
+#define rpi_clear_local(t)	do { } while(0)
+#define rpi_clear_remote(t)	do { } while(0)
 #define rpi_push(t)		do { } while(0)
 #define rpi_pop(t)		do { } while(0)
 #define rpi_update(t)		do { } while(0)
@@ -731,7 +779,7 @@ static void lostage_handler(void *cookie
 			   the relaxed shadow actually resumes in
 			   secondary mode. */
 	
-			rpi_clear();
+			rpi_clear_local(xnshadow_thread(current));
 
 		do_wakeup:
 
@@ -827,7 +875,8 @@ static int gatekeeper_thread(void *data)
 		if (kthread_should_stop())
 			break;
 
-		xnlock_get_irqsave(&nklock, s);
+		/* Real-time shadow TCBs are always removed on behalf
+		   of the killed thread. */
 
 		thread = gk->thread;
 
@@ -836,6 +885,8 @@ static int gatekeeper_thread(void *data)
 		   pending request, just ignore the latter. */
 
 		if (xnthread_user_task(thread)->state == TASK_INTERRUPTIBLE) {
+			rpi_pop(thread);
+			xnlock_get_irqsave(&nklock, s);
 #ifdef CONFIG_SMP
 			/* If the task changed its CPU while in secondary mode,
 			   change the CPU of the underlying Xenomai shadow too. We
@@ -845,14 +896,12 @@ static int gatekeeper_thread(void *data)
 			thread->sched = xnpod_sched_slot(cpu);
 #endif /* CONFIG_SMP */
 			xnpod_resume_thread(thread, XNRELAX);
-			rpi_pop(thread);
 #ifdef CONFIG_XENO_OPT_ISHIELD
 			disengage_irq_shield();
 #endif /* CONFIG_XENO_OPT_ISHIELD */
 			xnpod_schedule();
+			xnlock_put_irqrestore(&nklock, s);
 		}
-
-		xnlock_put_irqrestore(&nklock, s);
 	}
 
 	return 0;
@@ -881,21 +930,35 @@ static int gatekeeper_thread(void *data)
 int xnshadow_harden(void)
 {
 	struct task_struct *this_task = current;
-	/* Linux is not allowed to migrate shadow mates on its own, and
-	   shadows cannot be migrated by anyone but themselves, so the cpu
-	   number is constant in this context, despite the potential for
-	   preemption. */
-	struct __gatekeeper *gk = &gatekeeper[task_cpu(this_task)];
-	xnthread_t *thread = xnshadow_thread(this_task);
+	struct __gatekeeper *gk;
+	xnthread_t *thread;
+	int gk_cpu;
+
+redo:
+	gk_cpu = task_cpu(this_task);
+	thread = xnshadow_thread(this_task);
 
 	if (!thread)
 		return -EPERM;
 
+	gk = &gatekeeper[gk_cpu];
+
 	if (signal_pending(this_task) || down_interruptible(&gk->sync))
 		/* Grab the request token. */
 		return -ERESTARTSYS;
 
-	xnltt_log_event(xeno_ev_primarysw, this_task->comm);
+	preempt_disable();
+
+	/* Assume that we might have been migrated while waiting for
+	 * the token. Redo acquisition in such a case, so that we
+	 * don't mistakenly send the request to the wrong
+	 * gatekeeper. */
+
+	if (gk_cpu != task_cpu(this_task)) {
+		preempt_enable();
+		up(&gk->sync);
+		goto redo;
+	}
 
 	/* Set up the request to move "current" from the Linux domain to
 	   the Xenomai domain. This will cause the shadow thread to resume
@@ -907,12 +970,13 @@ int xnshadow_harden(void)
 	   preemption and using the TASK_ATOMICSWITCH cumulative state
 	   provided by Adeos to Linux tasks. */
 
+	xnltt_log_event(xeno_ev_primarysw, this_task->comm);
+
 	gk->thread = thread;
 	xnthread_set_info(thread, XNATOMIC);
-	preempt_disable();
 	set_current_state(TASK_INTERRUPTIBLE | TASK_ATOMICSWITCH);
 	wake_up_interruptible_sync(&gk->waitq);
-	schedule();
+	schedule();	/* Will preempt_enable() thanks to TASK_ATOMICSWITCH */
 	xnthread_clear_info(thread, XNATOMIC);
 
 	/* Rare case: we might have been awaken by a signal before the
@@ -943,6 +1007,16 @@ int xnshadow_harden(void)
 
 	xnlock_clear_irqon(&nklock);
 
+	/*
+	 * Normally, we should not be linked to any RPI list at this
+	 * point, except if Linux sent us to another CPU while in
+	 * flight to the primary domain, waiting to be resumed by the
+	 * gatekeeper; in such a case, we must unlink from the remote
+	 * CPU's RPI list now.
+	 */
+	if (rpi_p(thread))
+		rpi_clear_remote(thread);
+
 	xnltt_log_event(xeno_ev_primary, thread->name);
 
 	return 0;
@@ -1082,7 +1156,7 @@ void xnshadow_exit(void)
  *
  * This service can be called from:
  *
- * - Regular user-space process. 
+ * - Regular user-space process.
  *
  * Rescheduling: always.
  *
@@ -1177,7 +1251,7 @@ void xnshadow_unmap(xnthread_t *thread)
 	    !testbits(xnpod_current_sched()->status, XNKCOUT))
 		xnpod_fatal("xnshadow_unmap() called from invalid context");
 
-	p = xnthread_archtcb(thread)->user_task;	/* May be != current */
+	p = xnthread_archtcb(thread)->user_task;
 
 	magic = xnthread_get_magic(thread);
 
@@ -1206,6 +1280,11 @@ void xnshadow_unmap(xnthread_t *thread)
 	if (!p)
 		return;
 
+	XENO_ASSERT(NUCLEUS, p == current,
+		    xnpod_fatal("%s invoked for a non-current task (t=%s/p=%s)",
+				__FUNCTION__, thread->name, p->comm);
+		);
+
 	xnshadow_thrptd(p) = NULL;
 
 	if (p->state != TASK_RUNNING)
@@ -1277,12 +1356,15 @@ void xnshadow_renice(xnthread_t *thread)
 {
 	/* Called with nklock locked, Xenomai interrupts off. */
 	struct task_struct *p = xnthread_archtcb(thread)->user_task;
+
 	/* We need to bound the priority values in the [1..MAX_RT_PRIO-1]
 	   range, since the core pod's priority scale is a superset of
 	   Linux's priority scale. */
 	int prio = normalize_priority(thread->cprio);
 	schedule_linux_call(LO_RENICE_REQ, p, prio);
-	if (!xnthread_test_state(thread, XNDORMANT))
+
+	if (!xnthread_test_state(thread, XNDORMANT) &&
+	    xnthread_sched(thread) == xnpod_current_sched())
 		rpi_update(thread);
 }
 
@@ -2014,7 +2096,7 @@ static inline void do_sigwake_event(stru
 	xnthread_t *thread = xnshadow_thread(p);
 	spl_t s;
 
-	if (!thread || xnthread_test_state(thread, XNROOT))	/* Eh? root as shadow? */
+	if (!thread)
 		return;
 
 	xnlock_get_irqsave(&nklock, s);
@@ -2080,8 +2162,18 @@ static inline void do_setsched_event(str
 	if (xnthread_current_priority(thread) != priority) {
 		xnpod_renice_thread_inner(thread, priority, 0);
 		if (xnsched_resched_p()) {
-			if (current == p)
+			if (p == current)
+				/* Implies xnthread_sched(thread) == current_sched */
 				rpi_update(thread);
+			/*
+			 * rpi_switch() will fix things properly
+			 * otherwise.  This may delay the update if
+			 * the thread is running on the remote CPU
+			 * until it gets back into rpi_switch() as the
+			 * incoming thread anew, but this is
+			 * acceptable (i.e. strict ordering across
+			 * CPUs is not supported anyway).
+			 */
 			xnpod_schedule();
 		}
 	}
@@ -2272,8 +2364,6 @@ int xnshadow_mount(void)
 	lostage_apc =
 	    rthal_apc_alloc("lostage_handler", &lostage_handler, NULL);
 
-	rpi_init();
-
 	for_each_online_cpu(cpu) {
 		struct __gatekeeper *gk = &gatekeeper[cpu];
 		rpi_init_gk(gk);
diff -uNrp xenomai-2.3.1/ksrc/skins/native/task.c xenomai-2.3.1-rpi/ksrc/skins/native/task.c
--- xenomai-2.3.1/ksrc/skins/native/task.c	2007-02-28 18:25:52.000000000 +0100
+++ xenomai-2.3.1-rpi/ksrc/skins/native/task.c	2007-07-23 14:05:54.000000000 +0200
@@ -581,29 +581,6 @@ int rt_task_delete(RT_TASK *task)
 	if (err)
 		goto unlock_and_exit;
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	/* rt_task_delete() might be called for cleaning up a just
-	   created shadow task which has not been successfully mapped,
-	   so make sure we have an associated Linux mate before trying
-	   to send it a signal. This will also prevent any action on
-	   kernel-based Xenomai threads for which the user TCB
-	   extension is always NULL.
-	   We don't send any signal to dormant threads because GDB
-	   (6.x) has some problems dealing with vanishing threads
-	   under some circumstances, likely when asynchronous
-	   cancellation is in effect. In most cases, this is a
-	   non-issue since pthread_cancel() is requested from the skin
-	   interface library in parallel on the target thread, but
-	   when calling rt_task_delete() from kernel space against a
-	   created but unstarted user-space task, the Linux thread
-	   mated to the Xenomai shadow might linger unexpectedly on
-	   the startup barrier. */
-	if (xnthread_user_task(&task->thread_base) != NULL
-	    && !xnthread_test_state(&task->thread_base,XNDORMANT)
-	    && (!xnpod_primary_p() || task != xeno_current_task()))
-		xnshadow_send_sig(&task->thread_base, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	/* Does not return if task is current. */
 	xnpod_delete_thread(&task->thread_base);
 
diff -uNrp xenomai-2.3.1/ksrc/skins/psos+/task.c xenomai-2.3.1-rpi/ksrc/skins/psos+/task.c
--- xenomai-2.3.1/ksrc/skins/psos+/task.c	2006-12-26 19:39:04.000000000 +0100
+++ xenomai-2.3.1-rpi/ksrc/skins/psos+/task.c	2007-07-23 14:05:54.000000000 +0200
@@ -284,13 +284,6 @@ u_long t_delete(u_long tid)
 		goto unlock_and_exit;
 	}
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	if (xnthread_user_task(&task->threadbase) != NULL
-	    && !xnthread_test_state(&task->threadbase,XNDORMANT)
-	    && (!xnpod_primary_p() || task != psos_current_task()))
-		xnshadow_send_sig(&task->threadbase, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	xnpod_delete_thread(&task->threadbase);
 
       unlock_and_exit:
diff -uNrp xenomai-2.3.1/ksrc/skins/vxworks/taskLib.c xenomai-2.3.1-rpi/ksrc/skins/vxworks/taskLib.c
--- xenomai-2.3.1/ksrc/skins/vxworks/taskLib.c	2006-12-26 19:39:05.000000000 +0100
+++ xenomai-2.3.1-rpi/ksrc/skins/vxworks/taskLib.c	2007-07-23 14:05:54.000000000 +0200
@@ -285,13 +285,6 @@ STATUS taskDelete(TASK_ID task_id)
 		goto error;
 	}
 
-#if defined(__KERNEL__) && defined(CONFIG_XENO_OPT_PERVASIVE)
-	if (xnthread_user_task(&task->threadbase) != NULL
-	    && !xnthread_test_state(&task->threadbase,XNDORMANT)
-	    && (!xnpod_primary_p() || task != wind_current_task()))
-		xnshadow_send_sig(&task->threadbase, SIGKILL, 1);
-#endif /* __KERNEL__ && CONFIG_XENO_OPT_PERVASIVE */
-
 	xnpod_delete_thread(&task->threadbase);
 	xnlock_put_irqrestore(&nklock, s);
 

* Re: [Xenomai-core] RPI is good for you
  2007-07-23 15:01             ` Philippe Gerum
@ 2007-07-23 17:15               ` Philippe Gerum
  2007-07-23 17:45                 ` [Xenomai-core] r2773 mistake? x86_64 adeos patch Jeff Koftinoff
  2007-07-24  9:55                 ` [Xenomai-core] RPI is good for you Philippe Gerum
  2007-07-24 12:05               ` M. Koehrer
  2007-07-25 11:40               ` Jan Kiszka
  2 siblings, 2 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-23 17:15 UTC (permalink / raw)
  To: M. Koehrer; +Cc: Jan Kiszka, xenomai

On Mon, 2007-07-23 at 17:01 +0200, Philippe Gerum wrote:
> On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
> > On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
> > > Hi Philippe,
> > > I left my test running for a couple of hours - no freeze so far... 
> > > 
> > > However, I have to do some other stuff on this machine, I have to stop the test now...
> > > 
> > 
> > Ok, thanks for the feedback. I will send an extended patch later today,
> > so that you could test it on a longer period when you see fit.
> 
> It took me a bit longer than expected, but here is a patch which
> addresses all the pending issues with RPI, hopefully (applies against
> 2.3.1 stock).
> 

Here is a fix for a typo if you happen to use the scalable scheduler.

--- include/nucleus/queue.h~	2007-07-23 18:47:21.000000000 +0200
+++ include/nucleus/queue.h	2007-07-23 19:10:40.000000000 +0200
@@ -744,7 +744,7 @@
 
 static inline xnpholder_t *getheadmlq(xnmlqueue_t *mlqslot)
 {
-	xnpholder_t *pholder;
+	xnpholder_t *holder;
 	xnqueue_t *queue;
 
 	if (emptymlq_p(mlqslot))
-- 
Philippe.




* [Xenomai-core] r2773 mistake? x86_64 adeos patch
  2007-07-23 17:15               ` Philippe Gerum
@ 2007-07-23 17:45                 ` Jeff Koftinoff
  2007-07-23 18:04                   ` Philippe Gerum
  2007-07-24  9:55                 ` [Xenomai-core] RPI is good for you Philippe Gerum
  1 sibling, 1 reply; 35+ messages in thread
From: Jeff Koftinoff @ 2007-07-23 17:45 UTC (permalink / raw)
  To: xenomai-core

Hi!

We are trying to compile Xenomai for x86_64. Yesterday, revision
r2773 "Add Adeos 2.6.22/x86_64" added the file
adeos-ipipe-2.6.22-x86_64-1.1-00.patch.

However, it appears that the patch only touches files in arch/i386/
instead of arch/x86_64/...

Or am I doing something wrong?

Thanks

jeffk


* Re: [Xenomai-core] r2773 mistake? x86_64 adeos patch
  2007-07-23 17:45                 ` [Xenomai-core] r2773 mistake? x86_64 adeos patch Jeff Koftinoff
@ 2007-07-23 18:04                   ` Philippe Gerum
  0 siblings, 0 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-23 18:04 UTC (permalink / raw)
  To: Jeff Koftinoff; +Cc: xenomai-core

On Mon, 2007-07-23 at 10:45 -0700, Jeff Koftinoff wrote:
> Hi!
> 
> We are trying to compile Xenomai for x86_64. Yesterday, revision
> r2773 "Add Adeos 2.6.22/x86_64" added the file
> adeos-ipipe-2.6.22-x86_64-1.1-00.patch.
> 
> However, it appears that the patch only touches files in arch/i386/
> instead of arch/x86_64/...
> 

Ahem, Oops... Fixed now. Sorry.

> Or am I doing something wrong?
> 
> Thanks
> 
> jeffk
-- 
Philippe.




* Re: [Xenomai-core] RPI is good for you
  2007-07-23 17:15               ` Philippe Gerum
  2007-07-23 17:45                 ` [Xenomai-core] r2773 mistake? x86_64 adeos patch Jeff Koftinoff
@ 2007-07-24  9:55                 ` Philippe Gerum
  1 sibling, 0 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-24  9:55 UTC (permalink / raw)
  To: M. Koehrer; +Cc: Jan Kiszka, xenomai

On Mon, 2007-07-23 at 19:15 +0200, Philippe Gerum wrote:
> On Mon, 2007-07-23 at 17:01 +0200, Philippe Gerum wrote:
> > On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
> > > On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
> > > > Hi Philippe,
> > > > I left my test running for a couple of hours - no freeze so far... 
> > > > 
> > > > However, I have to do some other stuff on this machine, I have to stop the test now...
> > > > 
> > > 
> > > Ok, thanks for the feedback. I will send an extended patch later today,
> > > so that you could test it on a longer period when you see fit.
> > 
> > It took me a bit longer than expected, but here is a patch which
> > addresses all the pending issues with RPI, hopefully (applies against
> > 2.3.1 stock).
> > 
> 
> Here is a fix for a typo if you happen to use the scalable scheduler.
> 
> --- include/nucleus/queue.h~	2007-07-23 18:47:21.000000000 +0200
> +++ include/nucleus/queue.h	2007-07-23 19:10:40.000000000 +0200
> @@ -744,7 +744,7 @@
>  
>  static inline xnpholder_t *getheadmlq(xnmlqueue_t *mlqslot)
>  {
> -	xnpholder_t *pholder;
> +	xnpholder_t *holder;
>  	xnqueue_t *queue;
>  
>  	if (emptymlq_p(mlqslot))

This other fix is needed to cope with a tiny migration window which
exists while creating a thread, and which may affect RPI.

--- ksrc/nucleus/shadow.c	(revision 2782)
+++ ksrc/nucleus/shadow.c	(working copy)
@@ -386,10 +386,12 @@
 	 * the gatekeeper's RPI slot for the current CPU. */
 
 	if (unlikely(threadin->rpi == NULL)) {
-		xnlock_get_irqsave(&rpislot->lock, s);
-		sched_insertpqf(&rpislot->threadq, &threadin->xlink, newprio);
-		threadin->rpi = rpislot;
-		xnlock_put_irqrestore(&rpislot->lock, s);
+		if (!xnthread_test_state(threadin, XNDORMANT)) {
+			xnlock_get_irqsave(&rpislot->lock, s);
+			sched_insertpqf(&rpislot->threadq, &threadin->xlink, newprio);
+			threadin->rpi = rpislot;
+			xnlock_put_irqrestore(&rpislot->lock, s);
+		}
 	} else if (unlikely(threadin->rpi != rpislot))
 		/* We hold no lock here. */
 		rpi_migrate(threadin);
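
For readers who want to see the effect of the guard in isolation, here is a small userspace sketch. It is not the nucleus code: the struct layouts, the flag value and every sketch_* name are made up for illustration, and the priority queue is reduced to a counter. It only models the rule the hunk above adds, namely that a thread which is still dormant is not queued on the RPI slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the nucleus types; only the shape matters. */
#define SKETCH_DORMANT 0x1u        /* plays the role of XNDORMANT */

struct sketch_rpislot {
    int nr_queued;                 /* stands in for the threadq priority queue */
};

struct sketch_thread {
    unsigned int state;            /* state bits, e.g. SKETCH_DORMANT */
    struct sketch_rpislot *rpi;    /* NULL while not queued on any slot */
};

/*
 * Mirrors the patched branch: queue the thread on the local slot only
 * if it has no slot yet AND it has left the dormant state.
 * Returns 1 if the thread was queued, 0 otherwise.
 */
static int sketch_rpi_push(struct sketch_rpislot *slot, struct sketch_thread *t)
{
    if (t->rpi != NULL)
        return 0;                  /* already tracked; migration is a separate path */
    if (t->state & SKETCH_DORMANT)
        return 0;                  /* still being created: leave the slot alone */
    slot->nr_queued++;
    t->rpi = slot;
    return 1;
}
```

Calling sketch_rpi_push() on a thread whose state still has SKETCH_DORMANT set leaves the slot untouched; once the bit is cleared, the same call queues the thread exactly once.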

-- 
Philippe.





* Re: [Xenomai-core] RPI is good for you
  2007-07-23 15:01             ` Philippe Gerum
  2007-07-23 17:15               ` Philippe Gerum
@ 2007-07-24 12:05               ` M. Koehrer
  2007-07-25  6:54                 ` M. Koehrer
  2007-07-25 11:40               ` Jan Kiszka
  2 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-24 12:05 UTC (permalink / raw)
  To: rpm, mathias_koehrer; +Cc: jan.kiszka, xenomai

Hi Philippe,

I have applied this patch and tested it with my application. So far it looks really good.
However, I will leave my test running to be sure that it works.

Regards

Mathias 


> On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
> > On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
> > > Hi Philippe,
> > > I left my test running for a couple of hours - no freeze so far... 
> > > 
> > > However, I have to do some other stuff on this machine, I have to stop the test now...
> > > 
> > 
> > Ok, thanks for the feedback. I will send an extended patch later today,
> > so that you could test it on a longer period when you see fit.
> 
> It took me a bit longer than expected, but here is a patch which
> addresses all the pending issues with RPI, hopefully (applies against
> 2.3.1 stock).
> 
> The good thing about Jan grumbling at me, is that this usually makes me
> look at the big picture anew. And the RPI picture was not that nice,
> that's a fact.
> 
> Besides the locking sequence issue, the ex-aequo #1 problem was that CPU
> migration of Linux tasks causing a RPI boost had some very nasty
> side-effects on RPI management, and would create all sorts of funky
> situations I'm too ashamed to talk about, except under the generic term
> of "horrendous mess".
> 
> Now, regarding the deadlock issue, suppressing the RPI-specific locking
> entirely would have been the best solution, but unfortunately, the
> migration scheme makes this out of reach, at least without resorting to
> some hairy and likely unreliable implementation. Therefore, the solution
> I came up with consists of making the RPI lock a per-cpu thing, so that
> most RPI routines are actually grabbing a _local_ lock wrt the current
> CPU, those routines being allowed to hold the nklock as they wish. When
> some per-CPU RPI lock is accessed from a remote CPU, it is guaranteed
> that _no nklock_ may be held nested. Actually, the remote case only
> occurs once, in rpi_clear_remote(), and all its callers are guaranteed
> to be nklock-free (a debug assertion even enforces that).
> 
> For the migration issue, the RPI transitions have been ironed out to
> make sure we deal properly with all the subtleties of the Linux load
> balancer.
> 
> Mathias, please let me know if the attached patch improves the situation
> on your side.
> 
> -- 
> Philippe.
> 
> 
> 

-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* Re: [Xenomai-core] RPI is good for you
  2007-07-24 12:05               ` M. Koehrer
@ 2007-07-25  6:54                 ` M. Koehrer
  2007-07-25  7:06                   ` Philippe Gerum
  0 siblings, 1 reply; 35+ messages in thread
From: M. Koehrer @ 2007-07-25  6:54 UTC (permalink / raw)
  To: mathias_koehrer, rpm; +Cc: jan.kiszka, xenomai

Hi Philippe,

As I mentioned yesterday, I applied your (first) patch to Xenomai (I did not apply
the additional patches). And yes, my application ran fine without a freeze in an
overnight test: not only the tiny test application but also the complex real-time application that
originally exposed the problem.
That is really a great improvement. Will this fix end up in a maintenance version of Xenomai soon?
I would appreciate that, as this is a severe bug whose fix should be published as soon as possible.

Thanks a lot for the excellent support!

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid



* Re: [Xenomai-core] RPI is good for you
  2007-07-25  6:54                 ` M. Koehrer
@ 2007-07-25  7:06                   ` Philippe Gerum
  0 siblings, 0 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-25  7:06 UTC (permalink / raw)
  To: M. Koehrer; +Cc: jan.kiszka, xenomai

On Wed, 2007-07-25 at 08:54 +0200, M. Koehrer wrote:
> Hi Philippe,
> 
> As I mentioned yesterday, I applied your (first) patch to Xenomai (I did not apply
> the additional patches).

The third one would be needed to run the same code against the trunk,
due to some difference in CPU affinity management between v2.3.x and
-devel, but I had no problem running the lockup test for six hours
without it over v2.3.x.

> And yes, my application ran fine without a freeze in an
> overnight test: not only the tiny test application but also the complex real-time application that
> originally exposed the problem.
> That is really a great improvement. Will this fix end up in a maintenance version of Xenomai soon?
> I would appreciate that, as this is a severe bug whose fix should be published as soon as possible.
> 

Yes, it is already merged into the maintenance and -devel branches, and
v2.3.3 will be released shortly, since this is indeed a deadly bug.

> Thanks a lot for the excellent support!
> 

No problem. The lockup test you sent did make a huge difference and
actually allowed me to focus on solving the issue immediately, instead
of trying to find a way to reproduce it first. Thanks for this.

> Regards
> 
> Mathias
-- 
Philippe.





* Re: [Xenomai-core] RPI is good for you
  2007-07-23 15:01             ` Philippe Gerum
  2007-07-23 17:15               ` Philippe Gerum
  2007-07-24 12:05               ` M. Koehrer
@ 2007-07-25 11:40               ` Jan Kiszka
  2007-07-25 14:58                 ` Philippe Gerum
  2 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2007-07-25 11:40 UTC (permalink / raw)
  To: rpm; +Cc: M. Koehrer, xenomai

Philippe Gerum wrote:
> On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
>> On Fri, 2007-07-20 at 13:54 +0200, M. Koehrer wrote:
>>> Hi Philippe,
>>> I left my test running for a couple of hours - no freeze so far... 
>>>
>>> However, I have to do some other stuff on this machine, I have to stop the test now...
>>>
>> Ok, thanks for the feedback. I will send an extended patch later today,
>> so that you could test it on a longer period when you see fit.
> 
> It took me a bit longer than expected, but here is a patch which
> addresses all the pending issues with RPI, hopefully (applies against
> 2.3.1 stock).
> 
> The good thing about Jan grumbling at me, is that this usually makes me
> look at the big picture anew. And the RPI picture was not that nice,
> that's a fact.
> 
> Besides the locking sequence issue, the ex-aequo #1 problem was that CPU
> migration of Linux tasks causing a RPI boost had some very nasty
> side-effects on RPI management, and would create all sorts of funky
> situations I'm too ashamed to talk about, except under the generic term
> of "horrendous mess".
> 
> Now, regarding the deadlock issue, suppressing the RPI-specific locking
> entirely would have been the best solution, but unfortunately, the
> migration scheme makes this out of reach, at least without resorting to
> some hairy and likely unreliable implementation. Therefore, the solution
> I came up with consists of making the RPI lock a per-cpu thing, so that
> most RPI routines are actually grabbing a _local_ lock wrt the current
> CPU, those routines being allowed to hold the nklock as they wish. When
> some per-CPU RPI lock is accessed from a remote CPU, it is guaranteed
> that _no nklock_ may be held nested. Actually, the remote case only
> occurs once, in rpi_clear_remote(), and all its callers are guaranteed
> to be nklock-free (a debug assertion even enforces that).

Yeah, it is actually safe against deadlocks now. Still, I wonder why we 
can't design xnshadow_rpi_check like this:

	...
	int need_renice = 0;

	xnlock_get_irqsave(&rpislot->lock, s);

	if (sched_emptypq_p(&rpislot->threadq) &&
	    xnpod_root_priority() != XNCORE_IDLE_PRIO)
		need_renice = 1;

	xnlock_put_irqrestore(&rpislot->lock, s);

	if (need_renice)
		xnpod_renice_root(XNCORE_IDLE_PRIO);


If we can avoid nesting (even if it's safe), we should do so. Or does 
this pattern introduce a new, ugly race possibility?

Jan



* Re: [Xenomai-core] RPI is good for you
  2007-07-25 11:40               ` Jan Kiszka
@ 2007-07-25 14:58                 ` Philippe Gerum
  0 siblings, 0 replies; 35+ messages in thread
From: Philippe Gerum @ 2007-07-25 14:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: M. Koehrer, xenomai

On Wed, 2007-07-25 at 13:40 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Fri, 2007-07-20 at 14:16 +0200, Philippe Gerum wrote: 
> > 
> > Now, regarding the deadlock issue, suppressing the RPI-specific locking
> > entirely would have been the best solution, but unfortunately, the
> > migration scheme makes this out of reach, at least without resorting to
> > some hairy and likely unreliable implementation. Therefore, the solution
> > I came up with consists of making the RPI lock a per-cpu thing, so that
> > most RPI routines are actually grabbing a _local_ lock wrt the current
> > CPU, those routines being allowed to hold the nklock as they wish. When
> > some per-CPU RPI lock is accessed from a remote CPU, it is guaranteed
> > that _no nklock_ may be held nested. Actually, the remote case only
> > occurs once, in rpi_clear_remote(), and all its callers are guaranteed
> > to be nklock-free (a debug assertion even enforces that).
> 
> Yeah, it is actually safe against deadlocks now. Still, I wonder why we 
> can't design xnshadow_rpi_check like this:
> 
> 	...
> 	int need_renice = 0;
> 
> 	xnlock_get_irqsave(&rpislot->lock, s);
> 
> 	if (sched_emptypq_p(&rpislot->threadq) &&
> 	    xnpod_root_priority() != XNCORE_IDLE_PRIO)
> 		need_renice = 1;
> 
> 	xnlock_put_irqrestore(&rpislot->lock, s);
> 
> 	if (need_renice)
> 		xnpod_renice_root(XNCORE_IDLE_PRIO);
> 
> 
> If we can avoid nesting (even if it's safe), we should do so. Or does 
> this pattern introduce a new, ugly race possibility?

Yeah, this would work, because what we need actually is that
xnshadow_rpi_check() runs with local interrupts off, which is the case
since it is called on behalf of an IPI with hardware interrupts
disabled. Otherwise, we would open a short window for this kind of
scenario:

xnshadow_rpi_check
	empty-p(RPI list)?
		<IRQ> switch to some primary mode thread
			primary mode thread relaxes
				push other thread to RPI list
	(spuriously) downgrade root thread priority

This said, we have no interest ATM in letting hw interrupts flow while
handling this IPI, so we could indeed flatten the locking sequence as
you described, provided we also put a big fat warning into that piece
of code.
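
To make the window concrete, here is a toy userspace model of the flattened check. It is only a sketch under loud assumptions: no real locks or interrupts are involved, every sketch_* name is invented, the RPI list is reduced to a counter, and "root is boosted exactly when the RPI list is non-empty" is a simplified consistency rule chosen for the demonstration. The irqs_off field decides whether a simulated interrupt may sneak in between the emptiness test and the renice.

```c
#include <assert.h>

#define SKETCH_IDLE_PRIO 0     /* plays the role of XNCORE_IDLE_PRIO */

struct sketch_cpu {
    int rpi_queued;            /* number of relaxed threads on the RPI list */
    int root_prio;             /* current priority of the root thread */
    int irqs_off;              /* 1 if hardware interrupts are masked */
};

/* Simulated IRQ: a primary mode thread relaxes and lands on the RPI
 * list, boosting the root thread to its priority. */
static void sketch_irq_relax(struct sketch_cpu *cpu, int prio)
{
    cpu->rpi_queued++;
    cpu->root_prio = prio;
}

/*
 * Flattened check: decide under the (imaginary) slot lock, act after
 * dropping it. pending_prio is the priority of a thread that would
 * relax if an interrupt got through; 0 means nothing is pending.
 * WARNING: only correct if the caller runs with interrupts masked.
 */
static void sketch_rpi_check(struct sketch_cpu *cpu, int pending_prio)
{
    int need_renice = 0;

    /* slot lock would be taken here */
    if (cpu->rpi_queued == 0 && cpu->root_prio != SKETCH_IDLE_PRIO)
        need_renice = 1;
    /* slot lock would be dropped here */

    /* The window: an interrupt can only fire here if irqs are on. */
    if (!cpu->irqs_off && pending_prio > 0)
        sketch_irq_relax(cpu, pending_prio);

    if (need_renice)
        cpu->root_prio = SKETCH_IDLE_PRIO;
}

/* Simplified consistency rule: root is boosted iff the list is non-empty. */
static int sketch_consistent(const struct sketch_cpu *cpu)
{
    return (cpu->rpi_queued > 0) == (cpu->root_prio != SKETCH_IDLE_PRIO);
}
```

With irqs_off set, the stale boost is dropped and the state stays consistent. With irqs_off cleared and a pending relax, the root thread ends up at idle priority although a thread sits on the RPI list, which is the spurious downgrade from the scenario above.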

-- 
Philippe.





Thread overview: 35+ messages
2007-07-19 13:27 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
2007-07-19 13:42 ` Philippe Gerum
2007-07-19 13:52   ` M. Koehrer
2007-07-20  0:14     ` [Xenomai-core] RPI is good for you Philippe Gerum
2007-07-20  7:32       ` M. Koehrer
2007-07-20 11:54         ` M. Koehrer
2007-07-20 12:16           ` Philippe Gerum
2007-07-23 15:01             ` Philippe Gerum
2007-07-23 17:15               ` Philippe Gerum
2007-07-23 17:45                 ` [Xenomai-core] r2773 mistake? x86_64 adeos patch Jeff Koftinoff
2007-07-23 18:04                   ` Philippe Gerum
2007-07-24  9:55                 ` [Xenomai-core] RPI is good for you Philippe Gerum
2007-07-24 12:05               ` M. Koehrer
2007-07-25  6:54                 ` M. Koehrer
2007-07-25  7:06                   ` Philippe Gerum
2007-07-25 11:40               ` Jan Kiszka
2007-07-25 14:58                 ` Philippe Gerum
  -- strict thread matches above, loose matches on Subject: below --
2007-07-10  8:00 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
2007-07-10  8:40 ` Jan Kiszka
2007-07-10 12:29   ` M. Koehrer
2007-07-10 12:41     ` Jan Kiszka
2007-07-10 14:40       ` M. Koehrer
2007-07-10 15:34         ` Jan Kiszka
2007-07-11  6:43           ` M. Koehrer
2007-07-11  7:32             ` Jan Kiszka
2007-07-11 12:45               ` M. Koehrer
2007-07-11 14:47           ` Jan Kiszka
2007-07-13  7:27             ` M. Koehrer
2007-07-13  8:26               ` Jan Kiszka
2007-07-16  7:07                 ` M. Koehrer
2007-07-16 22:42                   ` Jan Kiszka
2007-07-19 10:58                     ` M. Koehrer
2007-07-19 11:27                       ` Jan Kiszka
2007-07-19 12:19                         ` Philippe Gerum
2007-07-19 12:40                           ` Jan Kiszka
