[Xenomai-help] Sporadic PC freeze after rt_task_start

* [Xenomai-help] Sporadic PC freeze after rt_task_start
@ 2007-07-10  8:00 M. Koehrer
  2007-07-10  8:40 ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-10  8:00 UTC (permalink / raw)
  To: xenomai

Hi everybody,

I noticed a sporadic freeze of my PC running Xenomai 2.3.1 and kernel 2.6.20.4 on a Pentium D,
with adeos-ipipe-2.6.20-i386-1.8-01.patch.

The freeze happens sporadically on one of our systems; sometimes it takes up to 6 hours to trigger.
Using a PCI POST code board and writing POST codes to it, I was able to locate the code that was causing
the issue. Finally, I was able to reduce it to a very simple program that shows the same behaviour!

Here is my simple test program:
**************************************** BEGIN *****************
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>

#include <native/task.h>
#include <native/sem.h>

RT_TASK taska_desc;

/* Short-lived real-time task: sleeps five times, then exits. */
void mytaska(void *cookie)
{
    int i;

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
    }
}

int main(void)
{
    int i;
    int j;
    mlockall(MCL_CURRENT|MCL_FUTURE);

    /* Create, start and join the task over and over until the freeze hits. */
    for (j=0; j < 100; j++)
        for (i=10; i < 15000; i++)
        {
            /* Priority 81, joinable, pinned to CPU 1. */
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            rt_task_start(&taska_desc, &mytaska, NULL);
            usleep(1500);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i\n", i);
        }

    return 0;
}
*************************************** END ***********************************
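The test program above ignores the return values of the rt_task_* calls. Checking them will not catch a hard freeze, but it can rule out a silent failure of task creation or start. Below is a minimal sketch, assuming the usual native-skin convention of 0 on success and a negative errno value on failure; check_rc() is a hypothetical helper and not part of Xenomai.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * check_rc() is a hypothetical helper, not part of the Xenomai API.
 * Assumption: the wrapped call returns 0 on success and a negative
 * errno value on failure.
 * Returns 0 if rc signals success, -1 (after logging) otherwise.
 */
static int check_rc(const char *what, int rc)
{
    if (rc < 0) {
        fprintf(stderr, "%s failed: %s (%d)\n", what, strerror(-rc), rc);
        return -1;
    }
    return 0;
}
```

In the loop, each call would be wrapped, for example check_rc("rt_task_create", rt_task_create(...)), and the loop left on the first failure.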
It is important to know that I started the kernel with isolcpus=1, i.e. all non-realtime tasks
are running on CPU 0.
Somehow it seems to be related to the usleep() that follows rt_task_start().
usleep() is executed on CPU 0 and rt_task_start() starts a task on CPU 1...
Could it be that usleep() begins before the task has started but ends after the task
has already started? Could this cause a race condition?

When I leave the program running for a while, it eventually freezes the PC (only a reset helps).
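To make the isolcpus=1 setting concrete: the parameter takes a CPU list, and the list "1" selects only CPU 1. Below is a minimal sketch of how such a list maps to a CPU bitmask; parse_cpu_list() is a hypothetical helper written for illustration, not a kernel or Xenomai function, and it only handles CPUs 0 to 63.

```c
#include <stdlib.h>

/*
 * parse_cpu_list() is a hypothetical helper, not a kernel or Xenomai API.
 * It turns a CPU list such as "1" or "0,2-3" into a bitmask in which
 * bit n set means CPU n is listed. Only CPUs 0..63 are supported.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_cpu_list(const char *s, unsigned long long *mask)
{
    unsigned long long m = 0;

    if (*s == '\0')
        return -1;
    while (*s) {
        char *end;
        long first, last, cpu;

        first = strtol(s, &end, 10);
        if (end == s || first < 0 || first > 63)
            return -1;
        last = first;
        if (*end == '-') {                 /* range "first-last" */
            s = end + 1;
            last = strtol(s, &end, 10);
            if (end == s || last < first || last > 63)
                return -1;
        }
        for (cpu = first; cpu <= last; cpu++)
            m |= 1ULL << cpu;
        if (*end == ',') {
            end++;
            if (*end == '\0')              /* trailing comma */
                return -1;
        } else if (*end != '\0') {
            return -1;
        }
        s = end;
    }
    *mask = m;
    return 0;
}
```

For "1" the resulting mask is 0x2, so on a dual-core machine only CPU 0 is left for ordinary Linux tasks, which matches the setup described above.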

Any feedback on this is welcome!

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid

A lot or a little? Fast or slow? Unlimited surfing + phone calls
with no time or volume limits? THE TOP OFFER NOW at Arcor: cheap
and fast with DSL - the all-inclusive package for clever double savers,
only 39.85 € incl. DSL and ISDN basic fee!
http://www.arcor.de/rd/emf-dsl-2

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10  8:00 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
@ 2007-07-10  8:40 ` Jan Kiszka
  2007-07-10 12:29   ` M. Koehrer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-10  8:40 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 3122 bytes --]

M. Koehrer wrote:
> Hi everybody,
> 
> I noticed a sporadic freeze of my PC using Xenomai 2.3.1 and kernel 2.6.20.4 on a Pentium D.
> adeos-ipipe-2.6.20-i386-1.8-01.patch.
> 
> The freeze happened sporadically on one of our systems, occasionally it took up to 6 hours  to get it.
> Using a PCI Post Code board and writing POST codes to it, I was able to locate the code that was causing
> the issue. And finally I was able to extract it to a very simple program that shows the same behaviour!!
> 
> Here is my simple test program:
> **************************************** BEGIN *****************
> #include <stdio.h>
> #include <sys/mman.h>
> 
> #include <native/task.h>
> #include <native/sem.h>
> 
> 
> RT_TASK taska_desc;
> 
> void mytaska(void *cookie)
> {
>     int i;
> 
>     for (i=0; i < 5; i++)
>     {
>         rt_task_sleep(5000000);
>     }
> }
> 
> int main(void)
> {
>     int i;
>     int j;
>     mlockall(MCL_CURRENT|MCL_FUTURE);
> 
>     for (j=0; j < 100; j++)
>         for (i=10; i < 15000; i++)
>         {
>             rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
>             rt_task_start(&taska_desc, &mytaska, NULL);
>             usleep(1500);
> 
>             rt_task_join(&taska_desc);
>             if ( i % 100 == 0)
>                 printf("Loop %i\n", i);
>         }
> 
>     return 0;
> }
> *************************************** END ***********************************
> It is important to know, that I started the kernel with isolcpus=1, i.e. all non-realtime tasks
> are running on CPU 0.
> Somehow it seems to have to do with the usleep() that is following the rt_task_start.
> usleep() is executed on CPU 0 and rt_task_start starts a task on CPU 1...
> Can this be as the begin of usleep() is executed before the task is started but the end of
> usleep() is when the task has already started. Could this be a cause for a race condition?
> 
> I leave the program running for a while and somehow it freezes the PC (only reset works).
> 
> Any feedback on this is welcome!

Maybe you are seeing the same bug that this test exposes:

#define _GNU_SOURCE	/* for CPU_ZERO/CPU_SET and sched_setaffinity */
#include <stdio.h>
#include <unistd.h>
#include <native/task.h>
#include <sched.h>
#include <sys/mman.h>

void func(void *arg)
{
	rt_task_set_periodic(NULL, TM_NOW, 1000000000LL);
	while(1) rt_task_wait_period(NULL);
}

int main(void)
{
	RT_TASK task;
	cpu_set_t set;

	mlockall(MCL_CURRENT|MCL_FUTURE);
	printf("rt_task_spawn=%d\n", rt_task_spawn(&task, "Receiver", 0,
	       10, 0, func, NULL));
	/* Move the calling (main) thread to CPU 1 before deleting the task. */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	printf("sched_setaffinity=%d\n", sched_setaffinity(0,
	       sizeof(cpu_set_t), &set));
	sleep(1);
	printf("rt_task_delete=%d\n", rt_task_delete(&task));
	return 0;
}

However, this test doesn't hard-lock; it just stalls the process in some
zombie state.

This bug is already scheduled for closer examination, stay tuned.

In the meantime: Is it possible to check if
 a) my demo code happens to lock up hard for you?
 b) any behaviour changes with latest xeno-2.3.2/ipipe-1.8-05 and your
    test case?

Thanks for reporting,
Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10  8:40 ` Jan Kiszka
@ 2007-07-10 12:29   ` M. Koehrer
  2007-07-10 12:41     ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-10 12:29 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I have compiled and started your test.
It works fine - no error or warning...
The output is:
rt_task_spawn=0
sched_setaffinity=0
rt_task_delete=0

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 12:29   ` M. Koehrer
@ 2007-07-10 12:41     ` Jan Kiszka
  2007-07-10 14:40       ` M. Koehrer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-10 12:41 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 433 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I have compiled and started your test.
> It works fine - no error or warning...
> The output is:
> rt_task_spawn=0
> sched_setaffinity=0
> rt_task_delete=0

So it simply terminates then? Interesting. Maybe there is some difference in
the .config, or maybe it is due to the timing I get under qemu (that's where I
noticed the lock-up). Let's go for a .config comparison first; mine is
attached.

Jan

[-- Attachment #1.2: config.bz2 --]
[-- Type: application/octet-stream, Size: 8061 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 12:41     ` Jan Kiszka
@ 2007-07-10 14:40       ` M. Koehrer
  2007-07-10 15:34         ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-10 14:40 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 1348 bytes --]

Hi Jan,

yes, it terminates nicely.
I have attached my config.

Regards

 
Mathias

----- Original Message -----
From:    Jan Kiszka <jan.kiszka@domain.hid>
To:      "M. Koehrer" <mathias_koehrer@domain.hid>
Date:    10.07.2007 14:41
Subject: Re: [Xenomai-help] Sporadic PC freeze after rt_task_start

> M. Koehrer wrote:
> > Hi Jan,
> > 
> > I have compiled and started your test.
> > It works fine - no error or warning...
> > The output is:
> > rt_task_spawn=0
> > sched_setaffinity=0
> > rt_task_delete=0
> 
> Means it simply terminates then? Interesting. Maybe some difference in
> the .config, maybe due to the timing I get under qemu (that's where I
> notices the lock-up). Let's go for a .config comparison first, mine is
> attached.
> 
> Jan
> 
> 
> --------------------------------
> 
> _______________________________________________
> Xenomai-help mailing list
> Xenomai-help@domain.hid
> https://mail.gna.org/listinfo/xenomai-help
> 

-- 
Mathias Koehrer
mathias_koehrer@domain.hid


[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 8583 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 14:40       ` M. Koehrer
@ 2007-07-10 15:34         ` Jan Kiszka
  2007-07-11  6:43           ` M. Koehrer
  2007-07-11 14:47           ` Jan Kiszka
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kiszka @ 2007-07-10 15:34 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 566 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> yes, it terminates nicely.
> I have attached my config.

Nothing obvious. That leaves us with probable timing differences or the
different versions of our setups (I found this on 2.3.2 and trunk).

OK, further analysis on your side would be appreciated. E.g. trying the
latest release, switching on debug features in Xenomai like the NMI
watchdog or nucleus debugging. Also, nailing down what service call
precisely locks up (the join, the termination of the task, etc.) would
be good to reduce the search space.

Jan

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 15:34         ` Jan Kiszka
@ 2007-07-11  6:43           ` M. Koehrer
  2007-07-11  7:32             ` Jan Kiszka
  2007-07-11 14:47           ` Jan Kiszka
  1 sibling, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-11  6:43 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

as I mentioned in my first mail on this topic, I extracted this example from a huge
real-time application in which the system sporadically freezes.
What I found there was that the freeze always happened when trying
to start a thread (in the original application, the tasks run fairly long).
I think it has to do with rt_task_start() followed by usleep(), with the non-realtime stuff running
on CPU 0 and the realtime stuff running on CPU 1.
I never saw the issue on a single-core CPU (even when the same SMP kernel was used).

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11  6:43           ` M. Koehrer
@ 2007-07-11  7:32             ` Jan Kiszka
  2007-07-11 12:45               ` M. Koehrer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-11  7:32 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> as I mentioned in my first mail on this topic, I have extracted this example from a huge
> real time application where the system sporadically freezes.
> What I have found out there was, that the system freeze happened always when trying
> to start a thread (in the original application, the tasks run fairly long).

OK, so it's one of rt_task_create (unlikely), rt_task_start, usleep, or
some early code in the task function itself. Still, a lot of "or"...

Again, please consider my further debugging suggestions.

> I think it has to do with rt_task_start() followed by usleep() and the non-realtime stuff running
> on CPU 0 and the realtime stuff running on CPU 1.
> I never saw that issue on a single core CPU (even if the same SMP kernel was used).

Yeah, it must be a nice race that requires real parallelism.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11  7:32             ` Jan Kiszka
@ 2007-07-11 12:45               ` M. Koehrer
  0 siblings, 0 replies; 33+ messages in thread
From: M. Koehrer @ 2007-07-11 12:45 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I tried with Xenomai 2.3.2 and (still) kernel 2.6.20.4, with the same configuration as before.
I used the adeos patch for 2.6.20 that is part of Xenomai 2.3.2.
Same effect.
The system freezes...

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-10 15:34         ` Jan Kiszka
  2007-07-11  6:43           ` M. Koehrer
@ 2007-07-11 14:47           ` Jan Kiszka
  2007-07-13  7:27             ` M. Koehrer
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-11 14:47 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 1017 bytes --]

Jan Kiszka wrote:
> M. Koehrer wrote:
>> Hi Jan,
>>
>> yes, it terminates nicely.
>> I have attached my config.
> 
> Nothing obvious. Leaves us with probable timing differences or the
> different versions of our setups (I found this over 2.3.2 and trunk).

The issue I posted is a classic race between the native task self-terminating on
CPU1 and being remote-terminated from CPU0. When the latter wins, things
fall apart. A solution needs more thought.
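
One generic way to make such a self-termination vs. remote-termination race harmless is to let both sides claim the teardown atomically, so that exactly one of them performs it. The following is a minimal sketch of that idea using C11 atomics. It is not how Xenomai handles (or later fixed) this case; struct demo_task and both functions are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical task record for illustration; not a Xenomai structure. */
struct demo_task {
    atomic_int claimed;      /* 0 = alive, 1 = teardown already claimed */
    int teardown_count;      /* how often teardown actually ran */
};

static void demo_task_init(struct demo_task *t)
{
    atomic_init(&t->claimed, 0);
    t->teardown_count = 0;
}

/*
 * Called both by the task itself (self-termination) and by another CPU
 * (remote termination). atomic_exchange() returns the previous value, so
 * only the first caller sees 0 and wins; every later caller backs off.
 * Returns true if this caller performed the teardown.
 */
static bool demo_task_try_teardown(struct demo_task *t)
{
    if (atomic_exchange(&t->claimed, 1) != 0)
        return false;        /* the other side got there first */
    t->teardown_count++;     /* stands in for the real cleanup work */
    return true;
}
```

Whichever side loses the exchange must not touch the task's resources afterwards; in this sketch it simply returns false.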

Anyway, this issue is most probably unrelated to your bug.

> 
> OK, further analysis on your side would be appreciated. E.g. trying the
> latest release, switching on debug features in Xenomai like the NMI
> watchdog or nucleus debugging. Also, nailing down what service call
> precisely locks up (the join, the termination of the task, etc.) would
> be good to reduce the search space.
> 

As you posted in a different mail, recent versions make no difference.
Could you now switch on the watchdog and nucleus debugging?

Thanks,
Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-11 14:47           ` Jan Kiszka
@ 2007-07-13  7:27             ` M. Koehrer
  2007-07-13  8:26               ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-13  7:27 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I did another test to identify the freeze. I plugged a POST-CODE 80 PCI board into the PC
and instrumented the code to write to port 0x80 to find out where the freeze actually happens.
rt_task_start() seems not to return, as the last POST code written (see source code below) is 0x30.
I hope to find a time slot to modify the kernel for another test.
The bad thing is that it takes really long to get the freeze (up to a couple of hours).

Regards

Mathias

---------------------------------- BEGIN SOURCE CODE -----------------------
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>
#include <sys/io.h>     /* ioperm(), outb() */

#include <native/task.h>
#include <native/sem.h>


RT_TASK taska_desc;

void mytaska(void *cookie)
{
    int i;

    outb(0x80,0x80);

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
        outb(0x90,0x80);
        // printf("Task A\n");
    }
    outb(0xa0,0x80);

    // printf("End of task A\n");
}


int main(void)
{
    int i;
    int j;
    ioperm(0x80, 1, 1);
    mlockall(MCL_CURRENT|MCL_FUTURE);

    for (j=0; j < 10000; j++)
        for (i=10; i < 15000; i++)
        {
            outb(0x20, 0x80);
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            outb(0x30, 0x80);
            rt_task_start(&taska_desc, &mytaska, NULL);
            outb(0x40, 0x80);
            usleep(1500);
            outb(0x50, 0x80);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i %i\n", j,  i);
        }


    return 0;
}
--------------------------- END  -----------------
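
The checkpoint values written by the program above can be read back as "what was in flight when this code was the last one seen". Below is a minimal sketch of that mapping. The code values are taken from the program; the descriptions and the helper describe_last_code() are interpretation added for illustration. Note that both the main thread (CPU 0) and the task (CPU 1) write to the same port, so a last code of 0x30 means that neither main's 0x40 nor the task's 0x80 was written afterwards.

```c
#include <stddef.h>

/*
 * Checkpoint values as written with outb(value, 0x80) in the test program.
 * The descriptions are an interpretation, not part of the original code.
 */
struct checkpoint {
    unsigned char code;
    const char *in_flight;  /* what the writer did right after this code */
};

static const struct checkpoint checkpoints[] = {
    { 0x20, "main: rt_task_create()" },
    { 0x30, "main: rt_task_start()" },
    { 0x40, "main: usleep()" },
    { 0x50, "main: rt_task_join() or printf()" },
    { 0x80, "task: first rt_task_sleep()" },
    { 0x90, "task: next rt_task_sleep() or end of loop" },
    { 0xa0, "task: return from mytaska()" },
};

/* Returns the description for a POST code, or NULL if the code is unknown. */
static const char *describe_last_code(unsigned char code)
{
    size_t i;

    for (i = 0; i < sizeof(checkpoints) / sizeof(checkpoints[0]); i++)
        if (checkpoints[i].code == code)
            return checkpoints[i].in_flight;
    return NULL;
}
```

Giving each thread its own code range, as the program already does (0x20 to 0x50 for main, 0x80 to 0xa0 for the task), makes it possible to tell which side wrote last.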

 
-- 
Mathias Koehrer
mathias_koehrer@domain.hid


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-13  7:27             ` M. Koehrer
@ 2007-07-13  8:26               ` Jan Kiszka
  2007-07-16  7:07                 ` M. Koehrer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-13  8:26 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I did another test to identify the freeze. I have plugged in a POST-CODE 80 PCI board into the PC
> and instrumented the code to write to port 80 to find out where the freeze actually happens.
> It seems not to return for rt_task_start as the last written POST code (see source code below)  is 0x30.
> I hope to find a time slot to modify the kernel to do another test.

Again: Please consider NMI watchdog and nucleus debugging support for
those tests as well. Maybe (I dare to say: likely on SMP) they catch
where the CPUs hang around instead of doing their work.

> The bad thing is that it takes really long to get the freeze (up to a couple of hours).

Your effort is appreciated even more!

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-13  8:26               ` Jan Kiszka
@ 2007-07-16  7:07                 ` M. Koehrer
  2007-07-16 22:42                   ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-16  7:07 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai

Hi Jan,

I left my PC running the whole weekend with the kernel command line parameter "nmi_watchdog=1".
However, with this option, the PC did not freeze at all...
This is really ugly.
The kernel configuration was the same one I mailed a couple of days ago.

I also ran for a couple of hours with Xeno debugging (Nucleus debugging and Watchdog support) enabled. However, this did not lead to a freeze either.

This seems to be a really nasty timing issue...

Any ideas on how to continue are welcome.

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-16  7:07                 ` M. Koehrer
@ 2007-07-16 22:42                   ` Jan Kiszka
  2007-07-19 10:58                     ` M. Koehrer
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-16 22:42 UTC (permalink / raw)
  To: M. Koehrer; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 738 bytes --]

M. Koehrer wrote:
> Hi Jan,
> 
> I left my PC running the whole weekend using the kernel command line parameter "nmi_watchdog=1".
> However, using this option, the PC did not freeze at all...
> This is really ugly.
> The kernel configuration was the same I mailed a couple of days ago.

Mpf. What about nmi_watchdog=2? It's said to tick at a lower rate, and thus
may not have such an "unwanted" side effect.

> 
> Also, I tried (a couple of hours) to run with Xeno debugging (Nucleus debugging and Watchdog support) enabled. However, this did not lead to a freeze either.
> 
> This seems to be a really nasty timing issue...
> 
> Any ideas on how to continue are welcome.

No. I had no time to think further so far.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-16 22:42                   ` Jan Kiszka
@ 2007-07-19 10:58                     ` M. Koehrer
  2007-07-19 11:27                       ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: M. Koehrer @ 2007-07-19 10:58 UTC (permalink / raw)
  To: jan.kiszka, mathias_koehrer; +Cc: xenomai


[-- Attachment #1.1: Type: text/plain, Size: 3141 bytes --]

Hi!

After a couple of overnight test runs, I finally got a lockup detected by the NMI watchdog with the sporadic freeze test.
I started the system with the argument nmi_watchdog=1 (and also isolcpus=1).
See the code below. As I have not connected a serial console, I have attached a fairly
poor-quality screenshot as a JPG file... However, it is good enough to read everything.
The lockup is in the function rpi_pop [xeno_nucleus].
It is called from gatekeeper_thread and from default_wake_function.
See the attached JPG for details.
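
Because the freeze only shows up with a particular combination of boot parameters, it can help to log which ones were really in effect for a given run. Below is a minimal sketch of extracting nmi_watchdog and isolcpus from a kernel command line string, such as the content of /proc/cmdline. cmdline_get() is a hypothetical helper, not a kernel or Xenomai function, and the command line used in the usage note is made up apart from those two parameters.

```c
#include <stddef.h>
#include <string.h>

/*
 * cmdline_get() is a hypothetical helper, not a kernel or Xenomai API.
 * It looks for "key=value" as a whole whitespace-separated word in
 * `cmdline` and copies the value into `out`.
 * Returns 0 if found, -1 if missing or if the value does not fit.
 */
static int cmdline_get(const char *cmdline, const char *key,
                       char *out, size_t outlen)
{
    size_t keylen = strlen(key);
    const char *p = cmdline;

    if (outlen == 0)
        return -1;
    while (*p) {
        size_t wordlen;

        p += strspn(p, " \t\n");         /* skip separators */
        wordlen = strcspn(p, " \t\n");   /* length of the current word */
        if (wordlen > keylen && strncmp(p, key, keylen) == 0 &&
            p[keylen] == '=') {
            size_t vallen = wordlen - keylen - 1;

            if (vallen >= outlen)
                return -1;               /* value does not fit */
            memcpy(out, p + keylen + 1, vallen);
            out[vallen] = '\0';
            return 0;
        }
        p += wordlen;
    }
    return -1;
}
```

With the string "root=/dev/sda1 isolcpus=1 nmi_watchdog=1", both lookups yield "1"; with a command line lacking the parameter, the function returns -1.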

Perhaps that helps to identify the issue.

Regards

Mathias

--------- BEGIN CODE ---------------
#include <stdio.h>
#include <unistd.h>     /* usleep() */
#include <sys/mman.h>
#include <sys/io.h>     /* ioperm(), outb() */

#include <native/task.h>
#include <native/sem.h>


RT_TASK taska_desc;

void mytaska(void *cookie)
{
    int i;

    outb(0x80,0x80);

    for (i=0; i < 5; i++)
    {
        rt_task_sleep(5000000);
        outb(0x90,0x80);
        // printf("Task A\n");
    }
    outb(0xa0,0x80);

    // printf("End of task A\n");
}


int main(void)
{
    int i;
    int j;
    ioperm(0x80, 1, 1);
    mlockall(MCL_CURRENT|MCL_FUTURE);

    for (j=0; j < 10000; j++)
        for (i=10; i < 15000; i++)
        {
            outb(0x20, 0x80);
            rt_task_create(&taska_desc, "mytaska", 0, 81, T_JOINABLE | T_FPU | T_CPU(1));
            //    outb(0x30, 0x80);
            rt_task_start(&taska_desc, &mytaska, NULL);
            outb(0x40, 0x80);
            usleep(1500);
            outb(0x50, 0x80);

            rt_task_join(&taska_desc);
            if ( i % 100 == 0)
                printf("Loop %i %i\n", j,  i);
        }


    return 0;
}

--------- END CODE ---------------

 


-- 
Mathias Koehrer
mathias_koehrer@domain.hid



[-- Attachment #2: XenoCrash.jpg --]
[-- Type: image/jpeg, Size: 84873 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 10:58                     ` M. Koehrer
@ 2007-07-19 11:27                       ` Jan Kiszka
  2007-07-19 12:19                         ` Philippe Gerum
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 11:27 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-help, M. Koehrer

[-- Attachment #1: Type: text/plain, Size: 1450 bytes --]

M. Koehrer wrote:
> Hi!
> 
> After a couple of over-night test runs, I finally got an NMI watchdog detected lockup with the sporadic freeze option.
> I started the system with the argument nmi_watchdog=1 (also isolcpus=1).
> See the code below. As I have not connected a serial console, I have attached a screen shot in a fairly
> bad quality as jpg file... However, it is good enough to be able to read everything... 
> The lockup is in function rpi_pop [xeno_nucleus].
> It is called from gatekeeper_thread and from default_wake_function.
> See the attached jpg for details.

Looks like we are stuck on rpilock, Philippe.

And when looking at the holders of rpilock, I think one issue could be
that we hold that lock while calling into xnpod_renice_root [1], i.e.
doing a potential context switch. Was this checked to be safe?
Furthermore, that code path reveals that we take nklock nested into
rpilock [2]. I haven't found a spot for the other way around (and I hope
there is none), but such nesting is already evil per se...

Mathias, already tried your test case with our old friend "priority
coupling" switched off? *If* this lock-up is actually due to rpilock
brokenness, switching the feature off should make it disappear.

Jan

[1]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#435
[2]http://www.rts.uni-hannover.de/xenomai/lxr/source/include/nucleus/pod.h?v=SVN-trunk#308

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 11:27                       ` Jan Kiszka
@ 2007-07-19 12:19                         ` Philippe Gerum
  2007-07-19 12:40                           ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 12:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-help, M. Koehrer

On Thu, 2007-07-19 at 13:27 +0200, Jan Kiszka wrote:
> M. Koehrer wrote:
> > Hi!
> > 
> > After a couple of over-night test runs, I finally got an NMI watchdog detected lockup with the sporadic freeze option.
> > I started the system with the argument nmi_watchdog=1 (also isolcpus=1).
> > See the code below. As I have not connected a serial console, I have attached a screen shot in a fairly
> > bad quality as jpg file... However, it is good enough to be able to read everything... 
> > The lockup is in function rpi_pop [xeno_nucleus].
> > It is called from gatekeeper_thread and from default_wake_function.
> > See the attached jpg for details.
> 
> Looks like we are stuck on rpilock, Philippe.
> 

Seems likely, yes. Switching on the nucleus DEBUG option would engage the
lockup detector, and pull the brake whenever the nucleus fails to grab
the rpilock.
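
To illustrate the idea behind such a lockup detector, here is a minimal
sketch in C11, assuming a simple spin budget: the acquire loop gives up
after max_spins attempts instead of spinning forever, so a stuck lock
becomes something the caller can report. The names demo_lock_t,
demo_lock_bounded and demo_unlock are hypothetical and are not the
nucleus API; the real detector may well work differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set spinlock; not the nucleus lock type. */
typedef struct {
    atomic_flag held;
} demo_lock_t;

#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/*
 * Try to take the lock, spinning at most max_spins times.
 * Returns true on success, false once the spin budget is used up,
 * which a debug build could treat as a suspected lockup.
 */
static bool demo_lock_bounded(demo_lock_t *lock, unsigned long max_spins)
{
    assert(max_spins > 0);
    for (unsigned long n = 0; n < max_spins; n++) {
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire))
            return true;
    }
    return false;
}

/* Release the lock so that another bounded acquire can succeed. */
static void demo_unlock(demo_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

A caller would check the return value and dump its state (or halt) on
false, rather than hanging silently as in the freeze reported here.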

Mathias, I guess this test has not been run with the nucleus debug
option enabled. Any chance to get a disassembly of the rpi_pop routine
as compiled into your kernel, so that we could check if we are really
stuck on this lock, or rather on some infinite walk into a corrupted RPI
list?

> And when looking at the holders of rpilock, I think one issue could be
> that we hold that lock while calling into xnpod_renice_root [1], ie.
> doing a potential context switch. Was this checked to be safe?

xnpod_renice_root() does not reschedule immediately, on purpose; we would
never have been able to run any SMP config for more than a couple of seconds
otherwise. (See the NOSWITCH bit.)

> Furthermore, that code path reveals that we take nklock nested into
> rpilock [2]. I haven't found a spot for the other way around (and I hope
> there is none)

xnshadow_start().

> , but such nesting is already evil per se...

Well, nesting spinlocks only falls into evilness when you get a circular
graph, but since the rpilock is a rookie in the locking team, I'm going
to check this.
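
The "circular graph" condition can be checked mechanically. Below is a
minimal sketch, assuming that locks are identified by small integers and
that every nesting is reported by hand: an edge outer -> inner is
recorded each time a lock is taken while another one is held, and a new
edge is refused if it would close a cycle. The lock ids and the function
names are hypothetical; this is neither lockdep nor nucleus code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* Hypothetical lock ids, for illustration only. */
enum { LOCK_RPI, LOCK_NK, LOCK_OTHER };

/* order[a][b] is true once lock b has been taken while holding lock a. */
static bool order[MAX_LOCKS][MAX_LOCKS];

/* Depth-first search: is 'to' reachable from 'from' via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_LOCKS; next++)
        if (order[from][next] && !seen[next] && reachable(next, to, seen))
            return true;
    return false;
}

/*
 * Record that 'inner' is acquired while 'outer' is held.
 * Returns false, and records nothing, if that would close a cycle.
 */
static bool record_nesting(int outer, int inner)
{
    bool seen[MAX_LOCKS];

    assert(outer >= 0 && outer < MAX_LOCKS);
    assert(inner >= 0 && inner < MAX_LOCKS);
    memset(seen, 0, sizeof(seen));
    /* An existing chain already takes 'outer' while 'inner' is held. */
    if (reachable(inner, outer, seen))
        return false;
    order[outer][inner] = true;
    return true;
}
```

With this, record_nesting(LOCK_RPI, LOCK_NK) is accepted, and a later
record_nesting(LOCK_NK, LOCK_RPI) is rejected as an ordering violation.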

Ok, I'm tackling this lockup issue now. I first need to reproduce it.
More news later.

> 
> Mathias, already tried your test case with our old friend "priority
> coupling" switched off? *If* this lock-up is actually due to rpilock
> brokenness, switching the feature off should make it disappear.
> 

It would be nice to switch on the nucleus DEBUG feature, especially the
queue debugging one. I understand this may hide the bug due to the
alteration of timings, but still, it would be useful to know whether a
configuration without NMI but with that debug knob on would trigger the
alarm.

> Jan
> 
> 
> [1]http://www.rts.uni-hannover.de/xenomai/lxr/source/ksrc/nucleus/shadow.c?v=SVN-trunk#435
> [2]http://www.rts.uni-hannover.de/xenomai/lxr/source/include/nucleus/pod.h?v=SVN-trunk#308
> 
-- 
Philippe.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 12:19                         ` Philippe Gerum
@ 2007-07-19 12:40                           ` Jan Kiszka
  2007-07-19 13:55                             ` [Xenomai-core] " Philippe Gerum
  2007-07-19 15:14                             ` Philippe Gerum
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 12:40 UTC (permalink / raw)
  To: rpm; +Cc: xenomai-help, M. Koehrer

[-- Attachment #1: Type: text/plain, Size: 1090 bytes --]

Philippe Gerum wrote:
>> And when looking at the holders of rpilock, I think one issue could be
>> that we hold that lock while calling into xnpod_renice_root [1], ie.
>> doing a potential context switch. Was this checked to be safe?
> 
> xnpod_renice_root() does no reschedule immediately on purpose, we would
> never have been able to run any SMP config more than a couple of seconds
> otherwise. (See the NOSWITCH bit).

OK, then it's not the cause.

> 
>> Furthermore, that code path reveals that we take nklock nested into
>> rpilock [2]. I haven't found a spot for the other way around (and I hope
>> there is none)
> 
> xnshadow_start().

Nope, that one is not holding nklock. But I found an offender...

> 
>> , but such nesting is already evil per se...
> 
> Well, nesting spinlocks only falls into evilness when you get a circular
> graph, but since the rpilock is a rookie in the locking team, I'm going
> to check this.

Take this one: gatekeeper_thread calls into rpi_pop with nklock
acquired. So we have a classic ABBA locking bug. Bang!
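
For readers who want to see the lock-order inversion (often called an
ABBA deadlock) spelled out, here is a minimal single-threaded replay in
C11. It assumes two plain test-and-set flags standing in for nklock and
rpilock, and it replays one interleaving by hand: one path takes nklock
and then wants rpilock (as gatekeeper_thread -> rpi_pop does), the other
takes rpilock and then wants nklock (as the xnpod_renice_root path
does). All names in the sketch are hypothetical, not nucleus code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for the two nucleus locks; plain test-and-set flags. */
static atomic_flag nklock_demo = ATOMIC_FLAG_INIT;
static atomic_flag rpilock_demo = ATOMIC_FLAG_INIT;

/* One single attempt at taking a spinlock: true if we got it. */
static bool flag_try_lock(atomic_flag *l)
{
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

static void flag_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Replay the interleaving step by step.
 * Returns true if both paths end up waiting for each other.
 */
static bool abba_interleaving_deadlocks(void)
{
    bool cpu0_has_nk, cpu1_has_rpi, cpu0_gets_rpi, cpu1_gets_nk;

    cpu0_has_nk = flag_try_lock(&nklock_demo);    /* path A: nklock first  */
    cpu1_has_rpi = flag_try_lock(&rpilock_demo);  /* path B: rpilock first */
    assert(cpu0_has_nk && cpu1_has_rpi);

    cpu0_gets_rpi = flag_try_lock(&rpilock_demo); /* path A wants rpilock  */
    cpu1_gets_nk = flag_try_lock(&nklock_demo);   /* path B wants nklock   */

    /* Release both locks so the replay can be run again. */
    flag_unlock(&rpilock_demo);
    flag_unlock(&nklock_demo);

    /* Neither path could take its second lock: real spinners would hang. */
    return !cpu0_gets_rpi && !cpu1_gets_nk;
}
```

With real spinlocks both CPUs would spin forever at the second step,
which fits a hard freeze that only an NMI watchdog can report.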

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 12:40                           ` Jan Kiszka
@ 2007-07-19 13:55                             ` Philippe Gerum
  2007-07-19 15:14                             ` Philippe Gerum
  1 sibling, 0 replies; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 13:55 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-help, M. Koehrer, xenomai

On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> >> And when looking at the holders of rpilock, I think one issue could be
> >> that we hold that lock while calling into xnpod_renice_root [1], ie.
> >> doing a potential context switch. Was this checked to be safe?
> > 
> > xnpod_renice_root() does no reschedule immediately on purpose, we would
> > never have been able to run any SMP config more than a couple of seconds
> > otherwise. (See the NOSWITCH bit).
> 
> OK, then it's not the cause.
> 
> > 
> >> Furthermore, that code path reveals that we take nklock nested into
> >> rpilock [2]. I haven't found a spot for the other way around (and I hope
> >> there is none)
> > 
> > xnshadow_start().
> 
> Nope, that one is not holding nklock.

Indeed, but this only works because those of its callers that may hold this
lock do not activate shadow threads so far. This looks so fragile... I'll add
some comment about this in the doc.

> But I found an offender...
> 
> > 
> >> , but such nesting is already evil per se...
> > 
> > Well, nesting spinlocks only falls into evilness when you get a circular
> > graph, but since the rpilock is a rookie in the locking team, I'm going
> > to check this.
> 
> Take this one: gatekeeper_thread calls into rpi_pop with nklock
> acquired. So we have a classic ABBA locking bug. Bang!
> 

Damnit.

The fix needs some thought and attention, we are running against the
deletion path here.

PS: Time to switch to -core.

> Jan
> 
-- 
Philippe.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 12:40                           ` Jan Kiszka
  2007-07-19 13:55                             ` [Xenomai-core] " Philippe Gerum
@ 2007-07-19 15:14                             ` Philippe Gerum
  2007-07-19 15:35                               ` Jan Kiszka
  2007-07-20  7:03                               ` M. Koehrer
  1 sibling, 2 replies; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 15:14 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-help, M. Koehrer, xenomai

On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> >> And when looking at the holders of rpilock, I think one issue could be
> >> that we hold that lock while calling into xnpod_renice_root [1], ie.
> >> doing a potential context switch. Was this checked to be safe?
> > 
> > xnpod_renice_root() does no reschedule immediately on purpose, we would
> > never have been able to run any SMP config more than a couple of seconds
> > otherwise. (See the NOSWITCH bit).
> 
> OK, then it's not the cause.
> 
> > 
> >> Furthermore, that code path reveals that we take nklock nested into
> >> rpilock [2]. I haven't found a spot for the other way around (and I hope
> >> there is none)
> > 
> > xnshadow_start().
> 
> Nope, that one is not holding nklock. But I found an offender...

Gasp. xnshadow_renice() kills us too.

-- 
Philippe.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 15:14                             ` Philippe Gerum
@ 2007-07-19 15:35                               ` Jan Kiszka
  2007-07-19 16:03                                 ` Philippe Gerum
  2007-07-20  7:03                               ` M. Koehrer
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 15:35 UTC (permalink / raw)
  To: rpm; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

Philippe Gerum wrote:
> On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>>> And when looking at the holders of rpilock, I think one issue could be
>>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
>>>> doing a potential context switch. Was this checked to be safe?
>>> xnpod_renice_root() does no reschedule immediately on purpose, we would
>>> never have been able to run any SMP config more than a couple of seconds
>>> otherwise. (See the NOSWITCH bit).
>> OK, then it's not the cause.
>>
>>>> Furthermore, that code path reveals that we take nklock nested into
>>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
>>>> there is none)
>>> xnshadow_start().
>> Nope, that one is not holding nklock. But I found an offender...
> 
> Gasp. xnshadow_renice() kills us too.

Looks like we are approaching mainline "qualities" here - but they have
at least lockdep (and still face nasty races regularly).

As long as you can't avoid nesting or the inner lock only protects
really, really trivial code (list manipulation etc.), I would say there
is one lock too many... Did I mention that I consider nesting to be
evil? :-> Besides correctness, there is also an increasing worst-case
behaviour issue with each additional nesting level.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 15:35                               ` Jan Kiszka
@ 2007-07-19 16:03                                 ` Philippe Gerum
  2007-07-19 17:18                                   ` Jan Kiszka
  2007-07-19 17:57                                   ` Jan Kiszka
  0 siblings, 2 replies; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 16:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai

On Thu, 2007-07-19 at 17:35 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>>> And when looking at the holders of rpilock, I think one issue could be
> >>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
> >>>> doing a potential context switch. Was this checked to be safe?
> >>> xnpod_renice_root() does no reschedule immediately on purpose, we would
> >>> never have been able to run any SMP config more than a couple of seconds
> >>> otherwise. (See the NOSWITCH bit).
> >> OK, then it's not the cause.
> >>
> >>>> Furthermore, that code path reveals that we take nklock nested into
> >>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
> >>>> there is none)
> >>> xnshadow_start().
> >> Nope, that one is not holding nklock. But I found an offender...
> > 
> > Gasp. xnshadow_renice() kills us too.
> 
> Looks like we are approaching mainline "qualities" here - but they have
> at least lockdep (and still face nasty races regularly).
> 

We only have a 2-level locking depth at most, that barely qualifies for
being compared to the situation with mainline. Most often, the more
radical the solution, the less relevant it is: simple nesting on very
few levels is not bad, a buggy nesting sequence is.

> As long as you can't avoid nesting or the inner lock only protects
> really, really trivial code (list manipulation etc.), I would say there
> is one lock too much... Did I mention that I consider nesting to be
> evil? :-> Besides correctness, there is also an increasing worst-case
> behaviour issue with each additional nesting level.
> 

In this case, we do not want the RPI manipulation to affect the
worst-case of all other threads by holding the nklock. This is
fundamentally a migration-related issue, which is a situation that must
not impact all other contexts relying on the nklock. Given this, you
need to protect the RPI list and prevent the scheduler data from being
altered at the same time; there is no cheap trick to avoid this.

We need to keep the rpilock, otherwise we would have significantly larger
latency penalties, especially when domain migrations are frequent, and
yes, we do need RPI, otherwise the sequence for emulated RTOS services
would be plain wrong (e.g. task creation).

Ok, the rpilock is local, the nesting level is bearable, let's focus on
putting this thingy straight.

> Jan
> 
-- 
Philippe.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 16:03                                 ` Philippe Gerum
@ 2007-07-19 17:18                                   ` Jan Kiszka
  2007-07-19 18:24                                     ` Philippe Gerum
  2007-07-19 17:57                                   ` Jan Kiszka
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 17:18 UTC (permalink / raw)
  To: rpm; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 3911 bytes --]

Philippe Gerum wrote:
> On Thu, 2007-07-19 at 17:35 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
>>>> Philippe Gerum wrote:
>>>>>> And when looking at the holders of rpilock, I think one issue could be
>>>>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
>>>>>> doing a potential context switch. Was this checked to be safe?
>>>>> xnpod_renice_root() does no reschedule immediately on purpose, we would
>>>>> never have been able to run any SMP config more than a couple of seconds
>>>>> otherwise. (See the NOSWITCH bit).
>>>> OK, then it's not the cause.
>>>>
>>>>>> Furthermore, that code path reveals that we take nklock nested into
>>>>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
>>>>>> there is none)
>>>>> xnshadow_start().
>>>> Nope, that one is not holding nklock. But I found an offender...
>>> Gasp. xnshadow_renice() kills us too.
>> Looks like we are approaching mainline "qualities" here - but they have
>> at least lockdep (and still face nasty races regularly).
>>
> 
> We only have a 2-level locking depth at most, that barely qualifies for
> being compared to the situation with mainline. Most often, the more
> radical the solution, the less relevant it is: simple nesting on very
> few levels is not bad, a buggy nesting sequence is.
> 
>> As long as you can't avoid nesting or the inner lock only protects
>> really, really trivial code (list manipulation etc.), I would say there
>> is one lock too much... Did I mention that I consider nesting to be
>> evil? :-> Besides correctness, there is also an increasing worst-case
>> behaviour issue with each additional nesting level.
>>
> 
> In this case, we do not want the RPI manipulation to affect the
> worst-case of all other threads by holding the nklock. This is
> fundamentally a migration-related issue, which is a situation that must
> not impact all other contexts relying on the nklock. Given this, you
> need to protect the RPI list and prevent the scheduler data to be
> altered at the same time, there is no cheap trick to avoid this.
> 
> We need to keep the rpilock, otherwise we would have significantly large
> latency penalties, especially when domain migration are frequent, and
> yes, we do need RPI, otherwise the sequence for emulated RTOS services
> would be plain wrong (e.g. task creation).

If rpilock is known to protect potentially costly code, you _must not_
hold other locks while taking it. Otherwise, you do not win a dime by
using two locks, rather make things worse (overhead of taking two locks
instead of just one). That all relates to the worst case, of course, the
one thing we are worried about most.

In that light, the nesting nklock->rpilock must go away, independently
of the ordering bug. The other way around might be a different thing,
though I'm not sure if there is actually so much difference between the
locks in the worst case.

What is the actual _combined_ lock holding time in the longest
nklock/rpilock nesting path? Is that one really larger than any other
pre-existing nklock path? Only in that case, it makes sense to think
about splitting, though you will still be left with precisely the same
(rather a few cycles more) CPU-local latency. Is there really no chance
to split the lock paths?

> Ok, the rpilock is local, the nesting level is bearable, let's focus on
> putting this thingy straight.

The whole RPI thing, though required for some scenarios, remains ugly
and error-prone (including worst-case latency issues). I can only
underline my recommendation to switch off complexity in Xenomai when one
doesn't need it - which often includes RPI. Sorry, Philippe, but I think
we have to be honest to the users here. RPI remains problematic, at
least /wrt your beloved latency.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 17:18                                   ` Jan Kiszka
@ 2007-07-19 18:24                                     ` Philippe Gerum
  2007-07-19 20:15                                       ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 18:24 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai

On Thu, 2007-07-19 at 19:18 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Thu, 2007-07-19 at 17:35 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
> >>>> Philippe Gerum wrote:
> >>>>>> And when looking at the holders of rpilock, I think one issue could be
> >>>>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
> >>>>>> doing a potential context switch. Was this checked to be safe?
> >>>>> xnpod_renice_root() does no reschedule immediately on purpose, we would
> >>>>> never have been able to run any SMP config more than a couple of seconds
> >>>>> otherwise. (See the NOSWITCH bit).
> >>>> OK, then it's not the cause.
> >>>>
> >>>>>> Furthermore, that code path reveals that we take nklock nested into
> >>>>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
> >>>>>> there is none)
> >>>>> xnshadow_start().
> >>>> Nope, that one is not holding nklock. But I found an offender...
> >>> Gasp. xnshadow_renice() kills us too.
> >> Looks like we are approaching mainline "qualities" here - but they have
> >> at least lockdep (and still face nasty races regularly).
> >>
> > 
> > > We only have a 2-level locking depth at most, that barely qualifies for
> > > being compared to the situation with mainline. Most often, the more
> > > radical the solution, the less relevant it is: simple nesting on very
> > > few levels is not bad, a buggy nesting sequence is.
> > 
> >> As long as you can't avoid nesting or the inner lock only protects
> >> really, really trivial code (list manipulation etc.), I would say there
> >> is one lock too much... Did I mention that I consider nesting to be
> >> evil? :-> Besides correctness, there is also an increasing worst-case
> >> behaviour issue with each additional nesting level.
> >>
> > 
> > In this case, we do not want the RPI manipulation to affect the
> > worst-case of all other threads by holding the nklock. This is
> > fundamentally a migration-related issue, which is a situation that must
> > not impact all other contexts relying on the nklock. Given this, you
> > need to protect the RPI list and prevent the scheduler data to be
> > altered at the same time, there is no cheap trick to avoid this.
> > 
> > We need to keep the rpilock, otherwise we would have significantly large
> > latency penalties, especially when domain migration are frequent, and
> > yes, we do need RPI, otherwise the sequence for emulated RTOS services
> > would be plain wrong (e.g. task creation).
> 
> If rpilock is known to protect potentially costly code, you _must not_
> hold other locks while taking it. Otherwise, you do not win a dime by
> using two locks, rather make things worse (overhead of taking two locks
> instead of just one).

I guess that by now you already understood that holding such an outer lock
is what should not be done, and what should be fixed, right? So let's
focus on the real issue here: holding two locks is not the problem,
holding them in the wrong sequence is.

>  That all relates to the worst case, of course, the
> one thing we are worried about most.
> 
> In that light, the nesting nklock->rpilock must go away, independently
> of the ordering bug. The other way around might be a different thing,
> though I'm not sure if there is actually so much difference between the
> locks in the worst case.
> 
> What is the actual _combined_ lock holding time in the longest
> nklock/rpilock nesting path?

It is short.

>  Is that one really larger than any other
> pre-existing nklock path?

Yes. Look, could you please assume for one second that I did not choose this
implementation randomly? :o)

>  Only in that case, it makes sense to think
> about splitting, though you will still be left with precisely the same
> (rather a few cycles more) CPU-local latency. Is there really no chance
> to split the lock paths?
> 

The answer to your question lies in the dynamics of migrating tasks
between domains, and how this relates to the overall dynamics of the
system. Migration needs priority tracking, priority tracking requires
almost the same amount of work as updating the scheduler data. Since
we can reduce the pressure on the nklock during migration, which is a
thread-local action additionally involving the root thread, it is _good_
to do so. Even if this costs a few brain cycles more.

> > Ok, the rpilock is local, the nesting level is bearable, let's focus on
> > putting this thingy straight.
> 
> The whole RPI thing, though required for some scenarios, remains ugly
> and error-prone (including worst-case latency issues).
>  I can only
> underline my recommendation to switch off complexity in Xenomai when one
> doesn't need it - which often includes RPI.
>  Sorry, Philippe, but I think
> we have to be honest to the users here. RPI remains problematic, at
> least /wrt your beloved latency.

The best way to be honest to users is to depict things as they are:

1) RPI is there because we currently rely on a co-kernel technology, and
we have to do our best to fix the consequences of having two
schedulers by at least coupling their priority scheme when applicable.
Otherwise, you just _cannot_ emulate common RTOS behaviour properly.
Additionally, although disabling RPI is perfectly fine and allows running
most applications the RTAI way, it is _utterly flawed_ at the logical
level, if you intend to integrate the two kernels. I do understand that
you might not care about such integration, that you might even find it
silly, and this is not even an issue for me. But the whole purpose of
Xenomai has never ever been to reel off the "yet-another-co-kernel"
mantra once again. I -very fundamentally- don't give a dime about
co-kernels per se, what I want is a framework which exhibits real-time
OS behaviours, with deep Linux integration, in order to build skins upon
it, and give users access to the regular programming model, and RPI does
help here. Period.

2) RPI is not perfect, has been rewritten a couple of times already, and
has suffered a handful of severe bugs. Would you throw away any software
only on this basis? I guess not, otherwise you would not run Linux,
especially not in SMP.

3) As time passes, RPI is stabilizing because it is now handled using
the right core logic, although it involves tricky situations. Besides, the
RPI bug we have been talking about is nothing compared to the issue
regarding the deletion path I'm currently fixing, which has much larger
implications, and is way more rotten. However, we are not going to
prevent people from deleting threads in order to solve the bug,
are we?

Let's keep the issue on the plain technical ground:
- is there a bug? You bet there is.
- is the issue fixable? I think so.
- is it worth investing some brain cycles to do so? Yes.

I don't see any reason for getting nervous here.

> 
> Jan
> 
-- 
Philippe.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 18:24                                     ` Philippe Gerum
@ 2007-07-19 20:15                                       ` Jan Kiszka
  2007-07-19 21:35                                         ` Philippe Gerum
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 20:15 UTC (permalink / raw)
  To: rpm; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 9390 bytes --]

Philippe Gerum wrote:
> On Thu, 2007-07-19 at 19:18 +0200, Jan Kiszka wrote:
>> Philippe Gerum wrote:
>>> On Thu, 2007-07-19 at 17:35 +0200, Jan Kiszka wrote:
>>>> Philippe Gerum wrote:
>>>>> On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
>>>>>> Philippe Gerum wrote:
>>>>>>>> And when looking at the holders of rpilock, I think one issue could be
>>>>>>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
>>>>>>>> doing a potential context switch. Was this checked to be safe?
>>>>>>> xnpod_renice_root() does no reschedule immediately on purpose, we would
>>>>>>> never have been able to run any SMP config more than a couple of seconds
>>>>>>> otherwise. (See the NOSWITCH bit).
>>>>>> OK, then it's not the cause.
>>>>>>
>>>>>>>> Furthermore, that code path reveals that we take nklock nested into
>>>>>>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
>>>>>>>> there is none)
>>>>>>> xnshadow_start().
>>>>>> Nope, that one is not holding nklock. But I found an offender...
>>>>> Gasp. xnshadow_renice() kills us too.
>>>> Looks like we are approaching mainline "qualities" here - but they have
>>>> at least lockdep (and still face nasty races regularly).
>>>>
>>> We only have a 2-level locking depth at most, that barely qualifies for
>>> being compared to the situation with mainline. Most often, the more
>>> radical the solution, the less relevant it is: simple nesting on very
>>> few levels is not bad, a buggy nesting sequence is.
>>>
>>>> As long as you can't avoid nesting or the inner lock only protects
>>>> really, really trivial code (list manipulation etc.), I would say there
>>>> is one lock too much... Did I mention that I consider nesting to be
>>>> evil? :-> Besides correctness, there is also an increasing worst-case
>>>> behaviour issue with each additional nesting level.
>>>>
>>> In this case, we do not want the RPI manipulation to affect the
>>> worst-case of all other threads by holding the nklock. This is
>>> fundamentally a migration-related issue, which is a situation that must
>>> not impact all other contexts relying on the nklock. Given this, you
>>> need to protect the RPI list and prevent the scheduler data to be
>>> altered at the same time, there is no cheap trick to avoid this.
>>>
>>> We need to keep the rpilock, otherwise we would have significantly large
>>> latency penalties, especially when domain migration are frequent, and
>>> yes, we do need RPI, otherwise the sequence for emulated RTOS services
>>> would be plain wrong (e.g. task creation).
>> If rpilock is known to protect potentially costly code, you _must not_
>> hold other locks while taking it. Otherwise, you do not win a dime by
>> using two locks, rather make things worse (overhead of taking two locks
>> instead of just one).
> 
> I guess that by now you already understood that holding such outer lock
> is what should not be done, and what should be fixed, right? So let's
> focus on the real issue here: holding two locks is not the problem,
> holding them in the wrong sequence, is.

Holding two locks in the right order can still be wrong /wrt latency
as I pointed out. If you can avoid holding both here, I would be much
happier immediately.

> 
>>  That all relates to the worst case, of course, the
>> one thing we are worried about most.
>>
>> In that light, the nesting nklock->rpilock must go away, independently
>> of the ordering bug. The other way around might be a different thing,
>> though I'm not sure if there is actually so much difference between the
>> locks in the worst case.
>>
>> What is the actual _combined_ lock holding time in the longest
>> nklock/rpilock nesting path?
> 
> It is short.
> 
>>  Is that one really larger than any other
>> pre-existing nklock path?
> 
> Yes. Look, could you please assume one second that I did not choose this
> implementation randomly? :o)

For sure not randomly, but I still don't understand the motivations
completely.

> 
>>  Only in that case, it makes sense to think
>> about splitting, though you will still be left with precisely the same
>> (rather a few cycles more) CPU-local latency. Is there really no chance
>> to split the lock paths?
>>
> 
> The answer to your question is into the dynamics of migrating tasks
> between domains, and how this relates to the overall dynamics of the
> system. Migration needs priority tracking, priority tracking requires
> almost the same amount of work than updating the scheduler data. Since
> we can reduce the pressure on the nklock during migration which is a
> thread-local action additionally involving the root thread, it is _good_
> to do so. Even if this costs a few brain cycles more.

So we are trading off average performance against worst-case spinning
time here?

> 
>>> Ok, the rpilock is local, the nesting level is bearable, let's focus on
>>> putting this thingy straight.
>> The whole RPI thing, though required for some scenarios, remains ugly
>> and error-prone (including worst-case latency issues).
>>  I can only
>> underline my recommendation to switch off complexity in Xenomai when one
>> doesn't need it - which often includes RPI.
>>  Sorry, Philippe, but I think
>> we have to be honest to the users here. RPI remains problematic, at
>> least /wrt your beloved latency.
> 
> The best way to be honest to users is to depict things as they are:
> 
> 1) RPI is there because we currently rely on a co-kernel technology, and
> we have to make our best to fix the consequences of having two
> schedulers by at least coupling their priority scheme when applicable.
> Otherwise, you just _cannot_ emulate common RTOS behaviour properly.
> Additionally, albeit disabling RPI is perfectly fine and allows to run
> most applications the RTAI way, it is _utterly flawed_ at the logical
> level, if you intend to integrate the two kernels. I do understand that
> you might not care about such integration, that you might even find it
> silly, and this is not even an issue for me. But the whole purpose of
> Xenomai has never ever been to reel off the "yet-another-co-kernel"
> mantra once again. I -very fundamentally- don't give a dime about
> co-kernels per se, what I want is a framework which exhibits real-time
> OS behaviours, with deep Linux integration, in order to build skins upon
> it, and give users access to the regular programming model, and RPI does
> help here. Period.
> 
> 2) RPI is not perfect, has been rewritten a couple of times already, and
> has suffered a handful of severe bugs. Would you throw away any software
> only on this basis? I guess not, otherwise you would not run Linux,
> especially not in SMP.

Linux code that broke (or still breaks) on concurrent execution on
multiple logical (PREEMPT[_RT]) or physical (SMP) CPUs underwent lots of
rewrites / disposals over the time because it is hard to get right and
efficient. For the same reasons, those features remained off whenever
the production scenario allowed it.

> 3) As time passes, RPI is stabilizing because it is now handled using
> the right core logic, albeit it involves tricky situations. Besides, the
> RPI bug we have been talking about is nothing compared to the issue
> regarding the deletion path I'm currently fixing, which has much large
> implications, and is way more rotten. However, we are not going to
> prevent people from deleting threads instead in order to solve the bug,
> are we?

No, we are redesigning the code to make it more robust. But we are also
avoiding certain code patterns in applications that are known to be
problematic (e.g. asynchronous rt_task_delete...). Still, I wouldn't
compare thread deletion to RPI /wrt its necessity.

> 
> Let's keep the issue on the plain technical ground:
> - is there a bug? You bet there is.
> - is the issue fixable? I think so.
> - is it worth investing some brain cycles to do so? Yes.
> 
> I don't see any reason for getting nervous here.

Well, I wouldn't grumble if I complained for the first time, or maybe
also the second. In contrast to other more special features of Xenomai,
this one was first always on, then selectable due to my begging, and is
now still default y while known to be the root of multiple severe and
_very_ subtle issues over the last 3 years. And there is a noticeable
complexity increment to the worst-case paths even when RPI will be
finally correct.

Users widely don't know this (that's my guess), users generally don't
need it (I'm still _strongly_ convinced of this), but users stumble over
it. Ironically those - like Mathias - who are interested in hard
real-time, not integrated soft RT. That's, well, still improvable.

Domain migration is one, if not THE neuralgic point of any co-kernel
approach. It's where RTAI broke countless times (dunno if it still
does, but they never audited code like we do), and it's where Xenomai
stumbled over and over again. I'm not arguing for the removal of RPI,
I'm only worried about those poor users who are not told what they are
running. Default-y features should have matured and provide a reasonable
gains/costs ratio. I was always sceptical about both points, and I'm
afraid I was right. Please prove me wrong, at least in the future.

Jan



* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 20:15                                       ` Jan Kiszka
@ 2007-07-19 21:35                                         ` Philippe Gerum
  2007-07-20 14:20                                           ` Jan Kiszka
  0 siblings, 1 reply; 33+ messages in thread
From: Philippe Gerum @ 2007-07-19 21:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: mathias_koehrer, xenomai

On Thu, 2007-07-19 at 22:15 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Thu, 2007-07-19 at 19:18 +0200, Jan Kiszka wrote:
> >> Philippe Gerum wrote:
> >>> On Thu, 2007-07-19 at 17:35 +0200, Jan Kiszka wrote:
> >>>> Philippe Gerum wrote:
> >>>>> On Thu, 2007-07-19 at 14:40 +0200, Jan Kiszka wrote:
> >>>>>> Philippe Gerum wrote:
> >>>>>>>> And when looking at the holders of rpilock, I think one issue could be
> >>>>>>>> that we hold that lock while calling into xnpod_renice_root [1], ie.
> >>>>>>>> doing a potential context switch. Was this checked to be save?
> >>>>>>> xnpod_renice_root() does no reschedule immediately on purpose, we would
> >>>>>>> never have been able to run any SMP config more than a couple of seconds
> >>>>>>> otherwise. (See the NOSWITCH bit).
> >>>>>> OK, then it's not the cause.
> >>>>>>
> >>>>>>>> Furthermore, that code path reveals that we take nklock nested into
> >>>>>>>> rpilock [2]. I haven't found a spot for the other way around (and I hope
> >>>>>>>> there is none)
> >>>>>>> xnshadow_start().
> >>>>>> Nope, that one is not holding nklock. But I found an offender...
> >>>>> Gasp. xnshadow_renice() kills us too.
> >>>> Looks like we are approaching mainline "qualities" here - but they have
> >>>> at least lockdep (and still face nasty races regularly).
> >>>>
> >>> We only have a 2-level locking depth at most, thare barely qualifies for
> >>> being compared to the situation with mainline. Most often, the more
> >>> radical the solution, the less relevant it is: simple nesting on very
> >>> few levels is not bad, bugous nesting sequence is.
> >>>
> >>>> As long as you can't avoid nesting or the inner lock only protects
> >>>> really, really trivial code (list manipulation etc.), I would say there
> >>>> is one lock too much... Did I mention that I consider nesting to be
> >>>> evil? :-> Besides correctness, there is also an increasing worst-case
> >>>> behaviour issue with each additional nesting level.
> >>>>
> >>> In this case, we do not want the RPI manipulation to affect the
> >>> worst-case of all other threads by holding the nklock. This is
> >>> fundamentally a migration-related issue, which is a situation that must
> >>> not impact all other contexts relying on the nklock. Given this, you
> >>> need to protect the RPI list and prevent the scheduler data to be
> >>> altered at the same time, there is no cheap trick to avoid this.
> >>>
> >>> We need to keep the rpilock, otherwise we would have significantly large
> >>> latency penalties, especially when domain migration are frequent, and
> >>> yes, we do need RPI, otherwise the sequence for emulated RTOS services
> >>> would be plain wrong (e.g. task creation).
> >> If rpilock is known to protect potentially costly code, you _must not_
> >> hold other locks while taking it. Otherwise, you do not win a dime by
> >> using two locks, rather make things worse (overhead of taking two locks
> >> instead of just one).
> > 
> > I guess that by now you already understood that holding such outer lock
> > is what should not be done, and what should be fixed, right? So let's
> > focus on the real issue here: holding two locks is not the problem,
> > holding them in the wrong sequence, is.
> 
> Holding two locks in the right order can still be wrong /wrt to latency
> as I pointed out. If you can avoid holding both here, I would be much
> happier immediately.
> 

The point is not about making you happier I'm afraid, but only to get
things right. If a nested lock has to be held for a short time, in order
to maintain consistency while an outer lock must be held for a longer
time, then it's ok, provided the locking sequence is correct.
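
As an illustration of that rule (a minimal sketch using POSIX mutexes
rather than the nucleus locks; outer_lock, inner_lock and both functions
are hypothetical names, not Xenomai code), every path that needs both
locks takes them in the same outer -> inner sequence, and the inner one
is held only for the short list update:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins: outer_lock guards sched_data, inner_lock guards list_len. */
static pthread_mutex_t outer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;
static int sched_data;
static int list_len;

/* Path needing both locks: always outer first, inner nested and held briefly. */
static void update_both(void)
{
    pthread_mutex_lock(&outer_lock);
    sched_data++;                      /* longer work under the outer lock */
    pthread_mutex_lock(&inner_lock);
    list_len++;                        /* short critical section */
    pthread_mutex_unlock(&inner_lock);
    pthread_mutex_unlock(&outer_lock);
}

/* Path needing only the inner lock: it never reaches for outer_lock while holding it. */
static void update_list_only(void)
{
    pthread_mutex_lock(&inner_lock);
    list_len++;
    pthread_mutex_unlock(&inner_lock);
}
```

Because no path ever takes outer_lock while holding inner_lock, these
two functions cannot deadlock against each other.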

> > 
> >>  That all relates to the worst case, of course, the
> >> one thing we are worried about most.
> >>
> >> In that light, the nesting nklock->rpilock must go away, independently
> >> of the ordering bug. The other way around might be a different thing,
> >> though I'm not sure if there is actually so much difference between the
> >> locks in the worst case.
> >>
> >> What is the actual _combined_ lock holding time in the longest
> >> nklock/rpilock nesting path?
> > 
> > It is short.
> > 
> >>  Is that one really larger than any other
> >> pre-existing nklock path?
> > 
> > Yes. Look, could you please assume one second that I did not choose this
> > implementation randomly? :o)
> 
> For sure not randomly, but I still don't understand the motivations
> completely.
> 

My description of why I want RPI to be available was clear though.

> > 
> >>  Only in that case, it makes sense to think
> >> about splitting, though you will still be left with precisely the same
> >> (rather a few cycles more) CPU-local latency. Is there really no chance
> >> to split the lock paths?
> >>
> > 
> > The answer to your question is into the dynamics of migrating tasks
> > between domains, and how this relates to the overall dynamics of the
> > system. Migration needs priority tracking, priority tracking requires
> > almost the same amount of work than updating the scheduler data. Since
> > we can reduce the pressure on the nklock during migration which is a
> > thread-local action additionally involving the root thread, it is _good_
> > to do so. Even if this costs a few brain cycles more.
> 
> So we are trading off average performance against worst-case spinning
> time here?
> 

RPI data structures need not be manipulated under nklock. What we
save is contention between normal nucleus operations, which all grab the
nklock for their entire execution, and possibly pathological migration
patterns in the worst case; we also get generally shorter latency on average.
Please let's move on, the code is explicit about this.

> > 
> >>> Ok, the rpilock is local, the nesting level is bearable, let's focus on
> >>> putting this thingy straight.
> >> The whole RPI thing, though required for some scenarios, remains ugly
> >> and error-prone (including worst-case latency issues).
> >>  I can only
> >> underline my recommendation to switch off complexity in Xenomai when one
> >> doesn't need it - which often includes RPI.
> >>  Sorry, Philippe, but I think
> >> we have to be honest to the users here. RPI remains problematic, at
> >> least /wrt your beloved latency.
> > 
> > The best way to be honest to users is to depict things as they are:
> > 
> > 1) RPI is there because we currently rely on a co-kernel technology, and
> > we have to make our best to fix the consequences of having two
> > schedulers by at least coupling their priority scheme when applicable.
> > Otherwise, you just _cannot_ emulate common RTOS behaviour properly.
> > Additionally, albeit disabling RPI is perfectly fine and allows to run
> > most applications the RTAI way, it is _utterly flawed_ at the logical
> > level, if you intend to integrate the two kernels. I do understand that
> > you might not care about such integration, that you might even find it
> > silly, and this is not even an issue for me. But the whole purpose of
> > Xenomai has never ever been to reel off the "yet-another-co-kernel"
> > mantra once again. I -very fundamentally- don't give a dime about
> > co-kernels per se, what I want is a framework which exhibits real-time
> > OS behaviours, with deep Linux integration, in order to build skins upon
> > it, and give users access to the regular programming model, and RPI does
> > help here. Period.
> > 
> > 2) RPI is not perfect, has been rewritten a couple of times already, and
> > has suffered a handful of severe bugs. Would you throw away any software
> > only on this basis? I guess not, otherwise you would not run Linux,
> > especially not in SMP.
> 
> Linux code that broke (or still breaks) on concurrent execution on
> multiple logical (PREEMPT[_RT]) or physical (SMP) CPUs underwent lots of
> rewrites / disposals over the time because it is hard to get right and
> efficient. For the same reasons, those features remained off whenever
> the production scenario allowed it.
> 

So, all this fuss is about the default setting of the RPI option? You
should have started by grumbling about this, instead of going down the path
of so-called latency worsening because of RPI. An argument must be fair to
be acceptable: let's compare latencies involved with different
implementations of the same functional goal, not between different
functionalities. You don't have the same system w/ or w/o RPI.

The point of switching RPI on by default is that failures in enforcing
RPI are way more easily detectable (I did not say "fixable") than buggy
application behaviour which may happen when you don't have RPI. I do
prefer a box that locks up loudly due to RPI than a pSOS, VxWorks or
whatever application that misbehaves silently because RPI is off.

> > 3) As time passes, RPI is stabilizing because it is now handled using
> > the right core logic, albeit it involves tricky situations. Besides, the
> > RPI bug we have been talking about is nothing compared to the issue
> > regarding the deletion path I'm currently fixing, which has much large
> > implications, and is way more rotten. However, we are not going to
> > prevent people from deleting threads instead in order to solve the bug,
> > are we?
> 
> No, we are redesigning the code to make it more robust. But we are also
> avoiding certain code patterns in application that are know to be
> problematic (e.g. asynchronous rt_task_delete...). Still, I wouldn't
> compare thread deletion to RPI /wrt its necessity.
> 

I understand your POV, and I also remember that we had tons of
theoretical discussions with lots of people during the last four years -
at the very least - about correctness wrt code patterns and so on.
Unfortunately, the reality is stubborn: support for asynchronous
deletion, and incidentally for other things that terminally piss you
off, is _required_ to provide proper emulation of legacy RTOSes. The good
point about Xenomai is that nobody claims that we should adopt them for
all skins, but only for the traditional RTOS APIs. For that, we need
support at nucleus level. Hey! it's not _my_ choice, it's a guy named M.
ReadySystems-Microtech-MentorGraphics-ISI-WindRiver-Chorus-et-al, who
chose to incorporate those patterns in his O/S...

> > 
> > Let's keep the issue on the plain technical ground:
> > - is there a bug? You bet there is.
> > - is the issue fixable? I think so.
> > - is it worth investing some brain cycles to do so? Yes.
> > 
> > I don't see any reason for getting nervous here.
> 
> Well, I wouldn't grumble if I complained for the first time, or maybe
> also the second.

Well, try a third one...

>  In contrast to other more special features of Xenomai,
> this one was first always on, then selectable due to my begging, and is
> now still default y while known to be the root of multiple severe and
> _very_ subtle issues over the last 3 years.

Look, the bugs involved were mostly SMP issues. It's not the first time
we have SMP issues, and we will probably keep having some from time
to time until it calms down, like any software which is exercised by a
growing number of people. The number of issues we have now is nothing,
really nothing compared to the storm of bugs we had to face with Gilles
when porting Xenomai over the Itanium architecture 4 years ago. Those
issues have been addressed, patiently. I see no reason to freak out
about the fact that some new code may break under pressure.

>  And there is a noticeable
> complexity increment to the worst-case paths even when RPI will be
> finally correct.

Sorry, but really, no. If you disable RPI, you have zero overhead due to
it. If you don't need it, disable it. If you enable it, you know that
you are trading some additional CPU cycles for correctness. And having
two locks instead of one helps keep the overhead low.

> 
> Users widely don't know this (that's my guess), users generally don't
> need it (I'm still _strongly_ convinced in this), but users stumble over
> it. Ironically those - like Mathias - who are interested in hard
> real-time, not integrated soft RT. That's, well, still improvable.

When people start using GDB over a real-time Xeno application, they are
more than happy to have integration. So let's not generalize, the
problem you see is RPI being enabled by default, not integration as a
design choice. Remember the fine co-kernel era when sending a signal to
a real-time task in user-space would either 1) be ignored, or 2) crash
your box?

Additionally, I'm not talking about soft RT. RPI helps us maintain
the correctness of the thread priority scheme during the phase when even
a co-kernel has to call into the regular Linux kernel to perform some
particular task, e.g. task creation and startup. You don't care about
this, because you don't require such correctness; some users may.

> 
> Domain migration is one, if not THE neuralgic point of any co-kernel
> approach. It's where RTAI broke countless times (dunno know if it still
> does, but they never audited code like we do), and it's where Xenomai
> stumbled over and over again. 

Domain migration must not be confused with RPI: RPI is complementary
support, but it is not necessary for migration to take place. What
happened with RTAI back then was quite different, the Linux/co-kernel
interface was unsafe there. Very fortunately, the Xenomai migration scheme
is stable.

> I'm not arguing for the removal of RPI,
> I'm only worried about those poor users who are not told what they are
> running. Default-y features should have matured and provide a reasonable
> gains/costs ratio. I was always sceptical about both points, and I'm
> afraid I was right. Please prove me wrong, at least in the future.
> 

Read my mail, without listening to your own grumble at the same time,
you should see that this is not a matter of being right or wrong, it is
a matter of who needs what, and how one will use Xenomai. Your grumble
does not prove anything unfortunately, otherwise everything would be
fixed since many moons.

What I'm suggesting now, so that you can't tell the rest of the world
that I'm such an old and deaf cranky meatball, is that we do place RPI
under strict observation until the latest 2.4-rc is out, and we would
decide at this point whether we should change the default value for the
skins for which it makes sense (both for v2.3.x and 2.4). Obviously,
this would only make sense if key users actually give hell to the 2.4
testing releases (Mathias, the world is watching you).

Basically, all traditional RTOS emulators want RPI and the default would
be on if one of them is selected, and since they most often never run in
SMP mode, all possibly pending SMP issues would not hurt.

The native one would go 'n' in case of doubt, and I leave to Gilles the
decision for the POSIX skin, since its high level of integration with
Linux (the skin's, not Gilles...) may involve a different perspective.

Okay???

> Jan
> 
-- 
Philippe.

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 21:35                                         ` Philippe Gerum
@ 2007-07-20 14:20                                           ` Jan Kiszka
  2007-07-20 18:33                                             ` Philippe Gerum
  2007-07-21  8:49                                             ` Philippe Gerum
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kiszka @ 2007-07-20 14:20 UTC (permalink / raw)
  To: rpm; +Cc: mathias_koehrer, xenomai

Philippe Gerum wrote:
...
> Read my mail, without listening to your own grumble at the same time,
> you should see that this is not a matter of being right or wrong, it is
> a matter of who needs what, and how one will use Xenomai. Your grumble
> does not prove anything unfortunately, otherwise everything would be
> fixed since many moons.

Why things are unfixed has something to do with their complexity. RPI is
a complex thing AND it is a mechanism separate from the core (that's why I
was suggesting to reuse PI code if possible - something that is already
integrated for many moons).

> What I'm suggesting now, so that you can't tell the rest of the world
> that I'm such an old and deaf cranky meatball, is that we do place RPI
> under strict observation until the latest 2.4-rc is out, and we would
> decide at this point whether we should change the default value for the
> skins for which it makes sense (both for v2.3.x and 2.4). Obviously,
> this would only make sense if key users actually give hell to the 2.4
> testing releases (Mathias, the world is watching you).

OK, let's go through this another time, this time under the motto "get
the locking right". As a start (and a help for myself), here comes an
overview of the scheme the final version may expose - as long as there
are separate locks:

gatekeeper_thread / xnshadow_relax:
	rpilock, followed by nklock
	(while xnshadow_relax puts both under irqsave...)

xnshadow_unmap:
	nklock, then rpilock nested

xnshadow_start:
	rpilock, followed by nklock

xnshadow_renice:
	nklock, then rpilock nested

schedule_event:
	only rpilock

setsched_event:
	nklock, followed by rpilock, followed by nklock again

And then there is xnshadow_rpi_check which has to be fixed to:
	nklock, followed by rpilock (here was our lock-up bug)

That's a scheme which /should/ be safe. Unfortunately, I see no way to
get rid of the remaining nestings.
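
The ordering rule behind such a table can be checked mechanically, in
the spirit of what lockdep does for mainline. A minimal sketch, assuming
only two locks and using made-up names (LOCK_NK, LOCK_RPI and
record_nesting are illustrative, not part of the nucleus): record every
"inner taken while outer is held" pair and flag the first path that
nests the other way around. Paths listed as "followed by" release the
first lock before taking the second, so they would not be recorded.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative lock identifiers; not the nucleus definitions. */
enum lock_id { LOCK_NK, LOCK_RPI, LOCK_COUNT };

/* order_seen[a][b] is true once some path took b while holding a. */
static bool order_seen[LOCK_COUNT][LOCK_COUNT];

/*
 * Record that 'inner' was acquired while 'outer' was held.
 * Returns false if the opposite nesting was already recorded,
 * i.e. two paths disagree about the order and may deadlock.
 */
static bool record_nesting(enum lock_id outer, enum lock_id inner)
{
    order_seen[outer][inner] = true;
    return !order_seen[inner][outer];
}
```

With this, two paths nesting nklock -> rpilock pass the check, while a
later path nesting rpilock -> nklock is reported as an inversion.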

And I still doubt we are gaining much by the lock split-up on SMP (it's
pointless for UP due to xnshadow_relax). In case there is heavy
migration activity on multiple cores/CPUs, we now regularly contend for
two locks in the hot paths instead of just the one everyone has to go
through anyway. And while we obviously don't win a dime for the worst
case, the average reduction of spinning times trades off against more
atomic (cache-line bouncing) operations. Were you able to measure some
improvement?

Jan


* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-20 14:20                                           ` Jan Kiszka
@ 2007-07-20 18:33                                             ` Philippe Gerum
  2007-07-21  8:49                                             ` Philippe Gerum
  1 sibling, 0 replies; 33+ messages in thread
From: Philippe Gerum @ 2007-07-20 18:33 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: mathias_koehrer, xenomai

On Fri, 2007-07-20 at 16:20 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> ...
> > Read my mail, without listening to your own grumble at the same time,
> > you should see that this is not a matter of being right or wrong, it is
> > a matter of who needs what, and how one will use Xenomai. Your grumble
> > does not prove anything unfortunately, otherwise everything would be
> > fixed since many moons.
> 
> Why things are unfixed has something to do with their complexity. RPI is
> a complex thing AND it is a separate mechanism to the core (that's why I
> was suggesting to reuse PI code if possible - something that is already
> integrated for many moons).
> 

I'm afraid RPI and PI are very different beasts. The purpose of RPI is
to track real-time priority for the _pseudo_ root thread, PI deals with
Linux tasks. Moreover, RPI does no priority propagation beyond the first
level (i.e. the root thread one), and only has to handle backtracking in
a trivial way. For this reason, the PI implementation is way more
complex, zillion times beyond RPI, so the effort would be absolutely
counter-productive.

I understand your POV, the whole RPI thing seems baroque to you, and I
can only agree with you here, it is. However, we still need RPI for
proper behaviour in a lot of cases, at least with a co-kernel technology
under our feet. So, I'm going to submit fixes for this issue, and agree
to change the default knob from enabled to disabled for the native and
POSIX skins if need be, if the observation period tells us so.

Now, within the RPI issue, there is the double locking one: I'm going to
be very pragmatic here. If it is logically possible to keep the double
locking, I will keep it. The point being that people running real-time
applications on SMP configs tend in fact to prefer asymmetry to symmetry
when building their design. I mean that separate CPUs are usually
dedicated to different application tasks; in such a common pattern, if
one of the CPUs is running a frequent (mode) switching task, it may put a
serious pressure on the nklock for all others (imagine a fast periodic
timeline on one CPU sending data to a secondary mode logger on a second
CPU, both being synchronized on a Xenomai synch). This is what I don't
want, if possible. If it is not possible to define a proper locking
scheme without resorting to 1) hairy and overly complex constructs, or
2) voodoo spells, then I will put everyone under the nklock, albeit I
think this is a sub-optimal solution.

Ok, let's move on. The main focus is -rc1, and beyond that 2.4 final. We
are damned late already.

-- 
Philippe.

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-20 14:20                                           ` Jan Kiszka
  2007-07-20 18:33                                             ` Philippe Gerum
@ 2007-07-21  8:49                                             ` Philippe Gerum
  2007-07-22 16:44                                               ` Jan Kiszka
  1 sibling, 1 reply; 33+ messages in thread
From: Philippe Gerum @ 2007-07-21  8:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: mathias_koehrer, xenomai

On Fri, 2007-07-20 at 16:20 +0200, Jan Kiszka wrote:

> OK, let's go through this another time, this time under the motto "get
> the locking right". As a start (and a help for myself), here comes an
> overview of the scheme the final version may expose - as long as there
> are separate locks:
> 
> gatekeeper_thread / xnshadow_relax:
> 	rpilock, followed by nklock
> 	(while xnshadow_relax puts both under irqsave...)
> 

The relaxing thread must not be preempted in primary mode before it
schedules out but after it has been linked to the RPI list, otherwise
the root thread would benefit from a spurious priority boost. This said,
in the UP case, we have no lock to contend for anyway, so the point of
discussing whether we should have the rpilock or not is moot here.

> xnshadow_unmap:
> 	nklock, then rpilock nested
> 

This one is the hardest to solve.

> xnshadow_start:
> 	rpilock, followed by nklock
> 
> xnshadow_renice:
> 	nklock, then rpilock nested
> 
> schedule_event:
> 	only rpilock
> 
> setsched_event:
> 	nklock, followed by rpilock, followed by nklock again
> 
> And then there is xnshadow_rpi_check which has to be fixed to:
> 	nklock, followed by rpilock (here was our lock-up bug)
> 

rpilock -> nklock in fact. The last lockup was rather likely due to the
gatekeeper's dangerous nesting of nklock -> rpilock -> nklock.

> That's a scheme which /should/ be safe. Unfortunately, I see no way to
> get rid of the remaining nestings.
> 

There is one, which consists of getting rid of the rpilock entirely. The
purpose of this lock is to protect the RPI list when fixing the
situation after a task migration in secondary mode triggered from the
Linux side. Addressing the latter issue differently may solve the
problem more elegantly than figuring out how to combine the two locks,
or hammering the hot path with the nklock. Will look at this.

-- 
Philippe.

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-21  8:49                                             ` Philippe Gerum
@ 2007-07-22 16:44                                               ` Jan Kiszka
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kiszka @ 2007-07-22 16:44 UTC (permalink / raw)
  To: rpm; +Cc: mathias_koehrer, xenomai

Philippe Gerum wrote:
> On Fri, 2007-07-20 at 16:20 +0200, Jan Kiszka wrote:
> 
>> OK, let's go through this another time, this time under the motto "get
>> the locking right". As a start (and a help for myself), here comes an
>> overview of the scheme the final version may expose - as long as there
>> are separate locks:
>>
>> gatekeeper_thread / xnshadow_relax:
>> 	rpilock, followed by nklock
>> 	(while xnshadow_relax puts both under irqsave...)
>>
> 
> The relaxing thread must not be preempted in primary mode before it
> schedules out but after it has been linked to the RPI list, otherwise
> the root thread would benefit from a spurious priority boost. This said,
> in the UP case, we have no lock to contend for anyway, so the point of
> discussing whether we should have the rpilock or not is moot here.
> 
>> xnshadow_unmap:
>> 	nklock, then rpilock nested
>>
> 
> This one is the hardest to solve.
> 
>> xnshadow_start:
>> 	rpilock, followed by nklock
>>
>> xnshadow_renice:
>> 	nklock, then rpilock nested
>>
>> schedule_event:
>> 	only rpilock
>>
>> setsched_event:
>> 	nklock, followed by rpilock, followed by nklock again
>>
>> And then there is xnshadow_rpi_check which has to be fixed to:
>> 	nklock, followed by rpilock (here was our lock-up bug)
>>
> 
> rpilock -> nklock in fact.

Yes, I meant it the other way around: the invocation of
xnpod_renice_root() must be moved out of nklock - which should be
trivial, correct?

> The last lockup was rather likely due to the
> gatekeeper's dangerous nesting of nklock -> rpilock -> nklock.

This path - as one of three with this ordering - surely triggered the
bug. But given that the other two nestings of this kind cannot be
resolved yet, while our reversely ordered nesting in xnshadow_rpi_check
can, it is clear that the latter is the weak point. So far we only have
a fix for Mathias' test case, which stresses just a subset of all
rpilock paths appropriately.
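
As a rough illustration of the ordering argument above, here is a minimal
user-space sketch of a lock-order check. It is not Xenomai code: the
pthread mutexes, the demo_* names and the ranking (nklock as outer lock,
rpilock as inner lock, matching the "nklock, then rpilock nested" entries
in the list) are assumptions made for the example only. Each thread
records which demo locks it holds and refuses any acquisition that would
nest them in the reverse order.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins: rank 0 is the outer lock, rank 1 the inner one. */
enum demo_lock { DEMO_NKLOCK = 0, DEMO_RPILOCK = 1, DEMO_NLOCKS };

static pthread_mutex_t demo_locks[DEMO_NLOCKS] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};

/* Per-thread bitmask of the demo locks currently held. */
static _Thread_local unsigned int demo_held;

/*
 * Take lock `l` only if this thread holds no lock of the same or a
 * higher rank. Returns false, without locking, on an ordering violation
 * (reverse nesting or recursion).
 */
bool demo_lock_ordered(enum demo_lock l)
{
    if (demo_held >> l)
        return false;
    pthread_mutex_lock(&demo_locks[l]);
    demo_held |= 1u << l;
    return true;
}

/* Release lock `l` and clear its bit in the per-thread mask. */
void demo_unlock(enum demo_lock l)
{
    demo_held &= ~(1u << l);
    pthread_mutex_unlock(&demo_locks[l]);
}
```

With this ranking, taking DEMO_NKLOCK and then DEMO_RPILOCK succeeds,
while a thread that already holds DEMO_RPILOCK gets false when it asks
for DEMO_NKLOCK. Such a path has to drop the inner lock first, which is
the "followed by" pattern rather than the "nested" one.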

> 
>> That's a scheme which /should/ be safe. Unfortunately, I see no way to
>> get rid of the remaining nestings.
>>
> 
> There is one, which consists of getting rid of the rpilock entirely. The
> purpose of such lock is to protect the RPI list when fixing the
> situation after a task migration in secondary mode triggered from the
> Linux side. Addressing the latter issue differently may solve the
> problem more elegantly than figuring out how to combine the two locks,
> or hammering the hot path with the nklock. Will look at this.

Even better! Looking forward to it.

Jan



* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 16:03                                 ` Philippe Gerum
  2007-07-19 17:18                                   ` Jan Kiszka
@ 2007-07-19 17:57                                   ` Jan Kiszka
  2007-07-21 20:15                                     ` Philippe Gerum
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kiszka @ 2007-07-19 17:57 UTC (permalink / raw)
  To: rpm; +Cc: xenomai

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

Philippe Gerum wrote:
> Ok, the rpilock is local, the nesting level is bearable, let's focus on
> putting this thingy straight.

Well, redesigning things may not necessarily improve the situation, but
reducing the amount of special RPI code might be worth a thought:

What is so special about RPI compared to standard prio inheritance? What
about [wild idea ahead!] modelling RPI as a virtual mutex that is
permanently held by the ROOT thread and which relaxed threads try to
acquire? They would never get it, rather drop the request (and thus the
inheritance) once they are to be hardened again or Linux starts to
schedule around.

*If* that is possible, we would
 A) reuse existing code heavily,
 B) lack any argument for separate locking,
 C) make things far easier to understand and review.

Sounds too beautiful to work, I'm afraid...
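
To make the idea a bit more concrete, here is a minimal sketch of the
bookkeeping such a virtual mutex would imply. It is plain C, not the
Xenomai PI code: struct demo_vmutex, the demo_* functions and the fixed
waiter table are all hypothetical. The owner (standing in for the ROOT
thread) runs at the highest priority among its own base priority and
the priorities of the pending requests, and a request is simply dropped
when the relaxed thread is hardened again.

```c
#include <stddef.h>

#define DEMO_MAX_WAITERS 8

/* Hypothetical virtual mutex that its owner never releases. */
struct demo_vmutex {
    int owner_base_prio;                /* owner priority without boost */
    int waiter_prio[DEMO_MAX_WAITERS];  /* priority of each pending request */
    int slot_used[DEMO_MAX_WAITERS];    /* non-zero if the slot is pending */
};

/* Priority the owner inherits: maximum over base and pending requests. */
int demo_effective_prio(const struct demo_vmutex *m)
{
    int prio = m->owner_base_prio;

    for (size_t i = 0; i < DEMO_MAX_WAITERS; i++)
        if (m->slot_used[i] && m->waiter_prio[i] > prio)
            prio = m->waiter_prio[i];
    return prio;
}

/* A relaxed thread files a request; returns its slot, or -1 if full. */
int demo_request(struct demo_vmutex *m, int prio)
{
    for (int i = 0; i < DEMO_MAX_WAITERS; i++) {
        if (!m->slot_used[i]) {
            m->slot_used[i] = 1;
            m->waiter_prio[i] = prio;
            return i;
        }
    }
    return -1;
}

/* The thread withdraws its request, e.g. when it is hardened again. */
void demo_drop(struct demo_vmutex *m, int slot)
{
    if (slot >= 0 && slot < DEMO_MAX_WAITERS)
        m->slot_used[slot] = 0;
}
```

Starting from a zero-initialised struct demo_vmutex, two requests at
priorities 50 and 80 give an effective priority of 80; dropping the
second one brings it back to 50, and dropping both restores the base
priority.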

Jan


* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 17:57                                   ` Jan Kiszka
@ 2007-07-21 20:15                                     ` Philippe Gerum
  0 siblings, 0 replies; 33+ messages in thread
From: Philippe Gerum @ 2007-07-21 20:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai

On Thu, 2007-07-19 at 19:57 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > Ok, the rpilock is local, the nesting level is bearable, let's focus on
> > putting this thingy straight.
> 

Sorry, I missed this one, which in fact explains that you were referring
to Xenomai PI and not PREEMPT_RT PI (yeah, I thought for a while that
you were nuts enough to ask me to model RPI after RT-PI... so I must be
nuts myself)

> Well, redesigning things may not necessarily improve the situation, but
> reducing the amount of special RPI code might be worth a thought:
> 
> What is so special about RPI compared to standard prio inheritance?

Basically, boost propagation and priority backtracking, as I previously
answered in the wrong context. That said, I still think that the
complexity of PI (the Xenomai one) is much higher than that of RPI in
its current form.

>  What
> about [wild idea ahead!] modelling RPI as a virtual mutex that is
> permanently held by the ROOT thread and which relaxed threads try to
> acquire? They would never get it, rather drop the request (and thus the
> inheritance) once they are to be hardened again or Linux starts to
> schedule around.
> 
> *If* that is possible, we would
>  A) reuse existing code heavily,
>  B) lack any argument for separate locking,
>  C) make things far easier to understand and review.
> 
> Sounds too beautiful to work, I'm afraid...
> 

It would be more elegant than RPI currently is, no question. This is
the way message passing works in the native API, for instance, in order
to implement the inheritance of the client priority by the server.

The main problem with PI is that it all starts from xnsynch_sleep_on.
Since we could not use this interface to activate PI, we would have to
craft another one. Additionally, some Linux activities may change the
RPI state (e.g. sched_setscheduler()), so we would have to create a
parallel path to fix this state without resorting to the normal PI
mechanism, which is meant to be used from a blockable context,
Xenomai-wise. That is a lot of changes for the sole purpose of recycling
the basics of a PI implementation.

> Jan
> 
-- 
Philippe.

* Re: [Xenomai-core] [Xenomai-help] Sporadic PC freeze after rt_task_start
  2007-07-19 15:14                             ` Philippe Gerum
  2007-07-19 15:35                               ` Jan Kiszka
@ 2007-07-20  7:03                               ` M. Koehrer
  1 sibling, 0 replies; 33+ messages in thread
From: M. Koehrer @ 2007-07-20  7:03 UTC (permalink / raw)
  To: rpm, jan.kiszka; +Cc: xenomai, mathias_koehrer, xenomai

[-- Attachment #1.1: Type: text/plain, Size: 808 bytes --]

Hi,

Here is the latest result of the overnight run. I enabled all Xenomai debug options and,
in addition, used nmi_watchdog=1.
After hours of running, my PC froze again.

I have again attached a camera screenshot of the LOCKUP message. Sorry for the
bad quality, but we do not have a really nice camera in the office...
(I think I will enable a serial console in the future, as this makes everything easier...)

Regards

Mathias

-- 
Mathias Koehrer
mathias_koehrer@domain.hid

[-- Attachment #2: XenoCrash2.jpg --]
[-- Type: image/jpeg, Size: 114564 bytes --]

end of thread, other threads:[~2007-07-22 16:44 UTC | newest]

Thread overview: 33+ messages
2007-07-10  8:00 [Xenomai-help] Sporadic PC freeze after rt_task_start M. Koehrer
2007-07-10  8:40 ` Jan Kiszka
2007-07-10 12:29   ` M. Koehrer
2007-07-10 12:41     ` Jan Kiszka
2007-07-10 14:40       ` M. Koehrer
2007-07-10 15:34         ` Jan Kiszka
2007-07-11  6:43           ` M. Koehrer
2007-07-11  7:32             ` Jan Kiszka
2007-07-11 12:45               ` M. Koehrer
2007-07-11 14:47           ` Jan Kiszka
2007-07-13  7:27             ` M. Koehrer
2007-07-13  8:26               ` Jan Kiszka
2007-07-16  7:07                 ` M. Koehrer
2007-07-16 22:42                   ` Jan Kiszka
2007-07-19 10:58                     ` M. Koehrer
2007-07-19 11:27                       ` Jan Kiszka
2007-07-19 12:19                         ` Philippe Gerum
2007-07-19 12:40                           ` Jan Kiszka
2007-07-19 13:55                             ` [Xenomai-core] " Philippe Gerum
2007-07-19 15:14                             ` Philippe Gerum
2007-07-19 15:35                               ` Jan Kiszka
2007-07-19 16:03                                 ` Philippe Gerum
2007-07-19 17:18                                   ` Jan Kiszka
2007-07-19 18:24                                     ` Philippe Gerum
2007-07-19 20:15                                       ` Jan Kiszka
2007-07-19 21:35                                         ` Philippe Gerum
2007-07-20 14:20                                           ` Jan Kiszka
2007-07-20 18:33                                             ` Philippe Gerum
2007-07-21  8:49                                             ` Philippe Gerum
2007-07-22 16:44                                               ` Jan Kiszka
2007-07-19 17:57                                   ` Jan Kiszka
2007-07-21 20:15                                     ` Philippe Gerum
2007-07-20  7:03                               ` M. Koehrer
