[Xenomai-core] Houston, we have a circular problem

* [Xenomai-core] Houston, we have a circular problem
@ 2008-05-05 15:44 Jan Kiszka
  2008-05-05 16:04 ` Jan Kiszka
  2008-05-05 16:08 ` Philippe Gerum
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kiszka @ 2008-05-05 15:44 UTC
  To: Xenomai-core

Hi,

after hacking away the barriers I-pipe erected in front of lockdep
(patches will follow on adeos-main), I was finally able to "visualize" a
bit more what our colleagues see in reality on SMP: some ugly, not yet
understood circular dependency when running some Xenomai app under gdb.
What lockdep tries to tell us remains unclear, unfortunately:

[  874.356703]
[  874.356957] =======================================================

Here it hangs because of this (caught via QEMU):

(gdb) bt
#0  __delay (loops=1) at arch/x86/lib/delay_64.c:34
#1  0xffffffff80372712 in _raw_spin_lock (lock=0xffff81000232c6c0) at lib/spinlock_debug.c:111
#2  0xffffffff80479d3d in _spin_lock (lock=0xffff81000232c6c0) at kernel/spinlock.c:182
#3  0xffffffff8022e546 in task_rq_lock (p=0xffff81002e792000, flags=0xffff81002f487910) at kernel/sched.c:615
#4  0xffffffff8022e6b6 in try_to_wake_up (p=0x1, state=<value optimized out>, sync=341) at kernel/sched.c:1562
#5  0xffffffff8022e9a5 in default_wake_function (curr=<value optimized out>, mode=0, sync=341, key=0xf48791c8) at kernel/sched.c:3840
#6  0xffffffff8024ae51 in autoremove_wake_function (wait=0x1, mode=0, sync=341, key=0xf48791c8) at kernel/wait.c:132
#7  0xffffffff8022bdc7 in __wake_up_common (q=<value optimized out>, mode=1, nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:3861
#8  0xffffffff8022df43 in __wake_up (q=0xffffffff805a6240, mode=1, nr_exclusive=1, key=0x0) at kernel/sched.c:3880
#9  0xffffffff80235838 in wake_up_klogd () at kernel/printk.c:1013
#10 0xffffffff80235a30 in release_console_sem () at kernel/printk.c:1059
#11 0xffffffff802360be in vprintk (fmt=0x12 <Address 0x12 out of bounds>, args=0xffff81002f487a72) at kernel/printk.c:807
#12 0xffffffff802361e5 in printk (fmt=0xffffffff8054c0fd "\n", '=' <repeats 55 times>, "\n") at kernel/printk.c:664
#13 0xffffffff80256268 in print_circular_bug_header (entry=0xffffffff809cb8c0, depth=2) at kernel/lockdep.c:902
#14 0xffffffff80256f84 in check_noncircular (source=<value optimized out>, depth=1) at kernel/lockdep.c:973
#15 0xffffffff80256f8e in check_noncircular (source=<value optimized out>, depth=0) at kernel/lockdep.c:975
#16 0xffffffff80257a45 in __lock_acquire (lock=0xffff81002e9ff960, subclass=0, trylock=0, read=0, check=2, hardirqs_off=1, ip=18446744071564715933) at kernel/lockdep.c:1324
#17 0xffffffff80258500 in lock_acquire (lock=0x1, subclass=0, trylock=-2140427232, read=-192441912, check=2, ip=<value optimized out>) at kernel/lockdep.c:2703
#18 0xffffffff80479d35 in _spin_lock (lock=0xffff81002e9ff948) at kernel/spinlock.c:181
#19 0xffffffff8028679d in schedule_event (event=<value optimized out>, ipd=0x0, data=0xffff81002e198000) at kernel/xenomai/nucleus/shadow.c:2197
#20 0xffffffff80274bfc in __ipipe_dispatch_event (event=33, data=0xffff81002e198000) at kernel/ipipe/core.c:828
#21 0xffffffff80477637 in schedule () at kernel/sched.c:1897
#22 0xffffffff80247598 in worker_thread (__cwq=<value optimized out>) at kernel/workqueue.c:314
#23 0xffffffff8024ad16 in kthread (_create=<value optimized out>) at kernel/kthread.c:78
#24 0xffffffff8020d238 in child_rip ()
#25 0x0000000000000000 in ?? ()

The lock in question should be task->sighand->siglock, but since we hit the
bug inside the scheduler, printk now deadlocks as well :(. I need to dig out a
patch by Steven Rostedt (IIRC) that may overcome the second deadlock.
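
To make the second deadlock more concrete, here is a minimal user-space
sketch, not kernel code. It assumes that the runqueue lock printk needs for
waking klogd is the one schedule() already holds. All toy_* names are
hypothetical, and a C11 atomic_flag stands in for the spinlock. A trylock is
used so that the sketch reports the problem instead of spinning forever.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy non-recursive spinlock; a hypothetical stand-in for the runqueue lock. */
struct toy_spinlock {
    atomic_flag locked;
};

void toy_spin_init(struct toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Returns 1 if the lock was taken, 0 if somebody already holds it. */
int toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/*
 * Models a reporting helper that must take the lock to wake a logger task
 * (the printk -> wake_up_klogd -> task_rq_lock path in the backtrace).
 * Returns 1 if the report went out, 0 if a blocking lock would have spun.
 */
int toy_report(struct toy_spinlock *rq_lock)
{
    if (!toy_spin_trylock(rq_lock))
        return 0;
    toy_spin_unlock(rq_lock);
    return 1;
}

/* Models the scheduler path: take the lock, then report from inside it. */
int toy_schedule_with_report(struct toy_spinlock *rq_lock)
{
    int reported;

    assert(rq_lock != NULL);
    if (!toy_spin_trylock(rq_lock))
        return 0;
    reported = toy_report(rq_lock);   /* nested attempt on a held lock */
    toy_spin_unlock(rq_lock);
    return reported;
}
```

Calling toy_report() on a free lock succeeds, while toy_schedule_with_report()
returns 0 because the nested attempt finds the lock held. With a blocking
spin_lock() in place of the trylock, the nested call would never return,
which is what frame #1 of the backtrace shows.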

But maybe this already rings a bell for someone. Any hint would be highly
appreciated, as gdb is effectively unusable here.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 15:44 [Xenomai-core] Houston, we have a circular problem Jan Kiszka
@ 2008-05-05 16:04 ` Jan Kiszka
  2008-05-05 16:08 ` Philippe Gerum
  1 sibling, 0 replies; 9+ messages in thread
From: Jan Kiszka @ 2008-05-05 16:04 UTC
  To: Xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> after hacking away the barriers I-pipe erected in front of lockdep
> (patches will follow on adeos-main), I was finally able to "visualize" a
> bit more what our colleagues see in reality on SMP: some ugly, not yet
> understood circular dependency when running some Xenomai app under gdb.
> What lockdep tries to tell us remains unclear, unfortunately:
> 
> [  874.356703]
> [  874.356957] =======================================================

Got it!

[    0.000000]
[    0.000000] =======================================================
[    0.000000] [ INFO: possible circular locking dependency detected ]
[    0.000000] 2.6.24.6-xeno_64 #313
[    0.000000] -------------------------------------------------------
[    0.000000] gdb/4385 is trying to acquire lock:
[    0.000000]  ((spinlock_t *)&sighand->siglock){....}, at: [<ffffffff802867c1>] schedule_event+0x7f/0x578
[    0.000000]
[    0.000000] but task is already holding lock:
[    0.000000]  (&rq->rq_lock_key){++..}, at: [<ffffffff80477235>] schedule+0x176/0x7ff
[    0.000000]
[    0.000000] which lock already depends on the new lock.
[    0.000000]
[    0.000000]
[    0.000000] the existing dependency chain (in reverse order) is:
[    0.000000]
[    0.000000] -> #2 (&rq->rq_lock_key){++..}:
[    0.000000]        [<ffffffff80257b70>] __lock_acquire+0xb91/0xd80
[    0.000000]        [<ffffffff80258524>] lock_acquire+0x9d/0xbc
[    0.000000]        [<ffffffff8022e546>] task_rq_lock+0x7f/0xb8
[    0.000000]        [<ffffffff80479d65>] _spin_lock+0x2a/0x36
[    0.000000]        [<ffffffff8022e546>] task_rq_lock+0x7f/0xb8
[    0.000000]        [<ffffffff8022e6b6>] try_to_wake_up+0x29/0x306
[    0.000000]        [<ffffffff8022e9a5>] default_wake_function+0x12/0x14
[    0.000000]        [<ffffffff8022bdc7>] __wake_up_common+0x4b/0x7a
[    0.000000]        [<ffffffff8022de9f>] complete+0x3d/0x51
[    0.000000]        [<ffffffff80231321>] migration_thread+0x0/0x22b
[    0.000000]        [<ffffffff8024ad18>] kthread+0x2c/0x7c
[    0.000000]        [<ffffffff8020d238>] child_rip+0xa/0x12
[    0.000000]        [<ffffffff8020c8e8>] restore_args+0x0/0x30
[    0.000000]        [<ffffffff8024acec>] kthread+0x0/0x7c
[    0.000000]        [<ffffffff8020d22e>] child_rip+0x0/0x12
[    0.000000]        [<ffffffffffffffff>] 0xffffffffffffffff
[    0.000000]
[    0.000000] -> #1 ((spinlock_t *)&q->lock){++..}:
[    0.000000]        [<ffffffff80257b70>] __lock_acquire+0xb91/0xd80
[    0.000000]        [<ffffffff80258524>] lock_acquire+0x9d/0xbc
[    0.000000]        [<ffffffff8022ded6>] __wake_up_sync+0x23/0x53
[    0.000000]        [<ffffffff8047a07b>] _spin_lock_irqsave+0x69/0x79
[    0.000000]        [<ffffffff8022ded6>] __wake_up_sync+0x23/0x53
[    0.000000]        [<ffffffff802419f1>] do_notify_parent+0x1ea/0x207
[    0.000000]        [<ffffffff802db7a6>] kmem_cache_free+0xc6/0xcf
[    0.000000]        [<ffffffff803727b6>] _raw_write_lock+0xe/0x90
[    0.000000]        [<ffffffff80239451>] do_exit+0x5fd/0x7e0
[    0.000000]        [<ffffffff8023955b>] do_exit+0x707/0x7e0
[    0.000000]        [<ffffffff80246135>] __call_usermodehelper+0x0/0x61
[    0.000000]        [<ffffffff802464fe>] request_module+0x0/0x166
[    0.000000]        [<ffffffff8020d238>] child_rip+0xa/0x12
[    0.000000]        [<ffffffff8020c8e8>] restore_args+0x0/0x30
[    0.000000]        [<ffffffff8024637b>] ____call_usermodehelper+0x0/0x183
[    0.000000]        [<ffffffff8020d22e>] child_rip+0x0/0x12
[    0.000000]        [<ffffffffffffffff>] 0xffffffffffffffff
[    0.000000]
[    0.000000] -> #0 ((spinlock_t *)&sighand->siglock){....}:
[    0.000000]        [<ffffffff802556db>] print_circular_bug_entry+0x4d/0x54
[    0.000000]        [<ffffffff80257a72>] __lock_acquire+0xa93/0xd80
[    0.000000]        [<ffffffff80258524>] lock_acquire+0x9d/0xbc
[    0.000000]        [<ffffffff802867c1>] schedule_event+0x7f/0x578
[    0.000000]        [<ffffffff80479d65>] _spin_lock+0x2a/0x36
[    0.000000]        [<ffffffff802867c1>] schedule_event+0x7f/0x578
[    0.000000]        [<ffffffff80274c20>] __ipipe_dispatch_event+0xe4/0x1db
[    0.000000]        [<ffffffff80477667>] schedule+0x5a8/0x7ff
[    0.000000]        [<ffffffff80238cd1>] do_wait+0xb5e/0xc4c
[    0.000000]        [<ffffffff80372466>] _raw_read_unlock+0xe/0x2d
[    0.000000]        [<ffffffff80238d00>] do_wait+0xb8d/0xc4c
[    0.000000]        [<ffffffff8022e993>] default_wake_function+0x0/0x14
[    0.000000]        [<ffffffff8022211c>] mcount+0x4c/0x72
[    0.000000]        [<ffffffff80238dec>] sys_wait4+0x2d/0x2f
[    0.000000]        [<ffffffff8020c1a2>] system_call+0x92/0x97
[    0.000000]        [<ffffffffffffffff>] 0xffffffffffffffff
[    0.000000]
[    0.000000] other info that might help us debug this:
[    0.000000]
[    0.000000] 1 lock held by gdb/4385:
[    0.000000]  #0:  (&rq->rq_lock_key){++..}, at: [<ffffffff80477235>] schedule+0x176/0x7ff
[    0.000000]
[    0.000000] stack backtrace:
[    0.000000] Pid: 4385, comm: gdb Not tainted 2.6.24.6-xeno_64 #313
[    0.000000]
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff80255f76>] print_circular_bug_tail+0x75/0x80
[    0.000000]  [<ffffffff802556db>] print_circular_bug_entry+0x4d/0x54
[    0.000000]  [<ffffffff80257a72>] __lock_acquire+0xa93/0xd80
[    0.000000]  [<ffffffff80258524>] lock_acquire+0x9d/0xbc
[    0.000000]  [<ffffffff802867c1>] schedule_event+0x7f/0x578
[    0.000000]  [<ffffffff80479d65>] _spin_lock+0x2a/0x36
[    0.000000]  [<ffffffff802867c1>] schedule_event+0x7f/0x578
[    0.000000]  [<ffffffff80274c20>] __ipipe_dispatch_event+0xe4/0x1db
[    0.000000]  [<ffffffff80477667>] schedule+0x5a8/0x7ff
[    0.000000]  [<ffffffff80238cd1>] do_wait+0xb5e/0xc4c
[    0.000000]  [<ffffffff80372466>] _raw_read_unlock+0xe/0x2d
[    0.000000]  [<ffffffff80238d00>] do_wait+0xb8d/0xc4c
[    0.000000]  [<ffffffff8022e993>] default_wake_function+0x0/0x14
[    0.000000]  [<ffffffff8022211c>] mcount+0x4c/0x72
[    0.000000]  [<ffffffff80238dec>] sys_wait4+0x2d/0x2f
[    0.000000]  [<ffffffff8020c1a2>] system_call+0x92/0x97
[    0.000000]


My quick translation is that we must not send signals from the
schedule_event callback, at least as that hook is currently placed. Any
ideas? Or better interpretations?
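
The report can be read as a cycle in a graph of lock classes. The following
user-space sketch only illustrates that reasoning; it is not lockdep's
implementation, and the enum names and helper functions are made up for this
example. The two recorded edges are one reading of chain entries #1 and #2:
siglock held while q->lock is taken in do_notify_parent(), and q->lock held
while the rq lock is taken in the wake-up path.

```c
#include <assert.h>
#include <string.h>

/* Lock classes named in the report; the enum itself is hypothetical. */
enum lock_class { RQ_LOCK, WAITQ_LOCK, SIGLOCK, NR_CLASSES };

/* dep[a][b] != 0 means "b was acquired while a was held" (edge a -> b). */
struct lock_graph {
    unsigned char dep[NR_CLASSES][NR_CLASSES];
};

void graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

void record_dependency(struct lock_graph *g, enum lock_class held,
                       enum lock_class acquired)
{
    g->dep[held][acquired] = 1;
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
int reaches(const struct lock_graph *g, int from, int to,
            unsigned char *visited)
{
    if (from == to)
        return 1;
    visited[from] = 1;
    for (int next = 0; next < NR_CLASSES; next++) {
        if (g->dep[from][next] && !visited[next] &&
            reaches(g, next, to, visited))
            return 1;
    }
    return 0;
}

/*
 * Taking 'acquired' while holding 'held' adds the edge held -> acquired.
 * That closes a cycle if 'held' is already reachable from 'acquired'.
 */
int would_create_cycle(const struct lock_graph *g, enum lock_class held,
                       enum lock_class acquired)
{
    unsigned char visited[NR_CLASSES] = { 0 };

    assert(held < NR_CLASSES && acquired < NR_CLASSES);
    return reaches(g, acquired, held, visited);
}

/* Existing chain as read from the report: siglock -> q->lock -> rq lock. */
void build_reported_graph(struct lock_graph *g)
{
    graph_init(g);
    record_dependency(g, SIGLOCK, WAITQ_LOCK);
    record_dependency(g, WAITQ_LOCK, RQ_LOCK);
}
```

On the graph from build_reported_graph(), would_create_cycle(&g, RQ_LOCK,
SIGLOCK) returns 1, which corresponds to schedule_event() taking the siglock
under the rq lock. The opposite order, would_create_cycle(&g, SIGLOCK,
RQ_LOCK), returns 0.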

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 15:44 [Xenomai-core] Houston, we have a circular problem Jan Kiszka
  2008-05-05 16:04 ` Jan Kiszka
@ 2008-05-05 16:08 ` Philippe Gerum
  2008-05-05 16:12   ` Gilles Chanteperdrix
  1 sibling, 1 reply; 9+ messages in thread
From: Philippe Gerum @ 2008-05-05 16:08 UTC
  To: Jan Kiszka; +Cc: Xenomai-core

Jan Kiszka wrote:
> Hi,
> 
> after hacking away the barriers I-pipe erected in front of lockdep
> (patches will follow on adeos-main), I was finally able to "visualize" a
> bit more what our colleagues see in reality on SMP: some ugly, not yet
> understood circular dependency when running some Xenomai app under gdb.
> What lockdep tries to tell us remains unclear, unfortunately:
> 
> [  874.356703]
> [  874.356957] =======================================================
> 
> Here it hangs because of this (caught via QEMU):
> 
> (gdb) bt
> #0  __delay (loops=1) at arch/x86/lib/delay_64.c:34
> #1  0xffffffff80372712 in _raw_spin_lock (lock=0xffff81000232c6c0) at lib/spinlock_debug.c:111
> #2  0xffffffff80479d3d in _spin_lock (lock=0xffff81000232c6c0) at kernel/spinlock.c:182
> #3  0xffffffff8022e546 in task_rq_lock (p=0xffff81002e792000, flags=0xffff81002f487910) at kernel/sched.c:615
> #4  0xffffffff8022e6b6 in try_to_wake_up (p=0x1, state=<value optimized out>, sync=341) at kernel/sched.c:1562
> #5  0xffffffff8022e9a5 in default_wake_function (curr=<value optimized out>, mode=0, sync=341, key=0xf48791c8) at kernel/sched.c:3840
> #6  0xffffffff8024ae51 in autoremove_wake_function (wait=0x1, mode=0, sync=341, key=0xf48791c8) at kernel/wait.c:132
> #7  0xffffffff8022bdc7 in __wake_up_common (q=<value optimized out>, mode=1, nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:3861
> #8  0xffffffff8022df43 in __wake_up (q=0xffffffff805a6240, mode=1, nr_exclusive=1, key=0x0) at kernel/sched.c:3880
> #9  0xffffffff80235838 in wake_up_klogd () at kernel/printk.c:1013
> #10 0xffffffff80235a30 in release_console_sem () at kernel/printk.c:1059
> #11 0xffffffff802360be in vprintk (fmt=0x12 <Address 0x12 out of bounds>, args=0xffff81002f487a72) at kernel/printk.c:807
> #12 0xffffffff802361e5 in printk (fmt=0xffffffff8054c0fd "\n", '=' <repeats 55 times>, "\n") at kernel/printk.c:664
> #13 0xffffffff80256268 in print_circular_bug_header (entry=0xffffffff809cb8c0, depth=2) at kernel/lockdep.c:902
> #14 0xffffffff80256f84 in check_noncircular (source=<value optimized out>, depth=1) at kernel/lockdep.c:973
> #15 0xffffffff80256f8e in check_noncircular (source=<value optimized out>, depth=0) at kernel/lockdep.c:975
> #16 0xffffffff80257a45 in __lock_acquire (lock=0xffff81002e9ff960, subclass=0, trylock=0, read=0, check=2, hardirqs_off=1, ip=18446744071564715933) at kernel/lockdep.c:1324
> #17 0xffffffff80258500 in lock_acquire (lock=0x1, subclass=0, trylock=-2140427232, read=-192441912, check=2, ip=<value optimized out>) at kernel/lockdep.c:2703
> #18 0xffffffff80479d35 in _spin_lock (lock=0xffff81002e9ff948) at kernel/spinlock.c:181
> #19 0xffffffff8028679d in schedule_event (event=<value optimized out>, ipd=0x0, data=0xffff81002e198000) at kernel/xenomai/nucleus/shadow.c:2197
> #20 0xffffffff80274bfc in __ipipe_dispatch_event (event=33, data=0xffff81002e198000) at kernel/ipipe/core.c:828
> #21 0xffffffff80477637 in schedule () at kernel/sched.c:1897
> #22 0xffffffff80247598 in worker_thread (__cwq=<value optimized out>) at kernel/workqueue.c:314
> #23 0xffffffff8024ad16 in kthread (_create=<value optimized out>) at kernel/kthread.c:78
> #24 0xffffffff8020d238 in child_rip ()
> #25 0x0000000000000000 in ?? ()
> 
> The lock in question should be task->sighand->siglock, but as we hit the
> bug inside the scheduler, printk deadlocks now :(. Need to dig out some
> patch of Steven Rostedt (IIRC) that may overcome the second deadlock.
> 
> But maybe someone already hears a bell ringing. Would be highly
> appreciated as gdb is effectively unusable here.
> 

do_schedule_event() is the culprit when it reads the pending signals on the
shared queue (XNDEBUG check for rearming the timers), but we really need to know
who the first locker is to fix that properly. Any chance that busting the
spinlocks and running with printk_sync() mode on the current domain would get us
the traces out?


-- 
Philippe.


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 16:08 ` Philippe Gerum
@ 2008-05-05 16:12   ` Gilles Chanteperdrix
  2008-05-05 16:23     ` Jan Kiszka
  0 siblings, 1 reply; 9+ messages in thread
From: Gilles Chanteperdrix @ 2008-05-05 16:12 UTC
  To: rpm; +Cc: Jan Kiszka, Xenomai-core

On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>  do_schedule_event() is the culprit when it reads the pending signals on the
>  shared queue (XNDEBUG check for rearming the timers),

A stupid suggestion: if we know that the spinlock is always locked
when calling do_schedule_event, maybe we can simply avoid taking the
lock there?

-- 
 Gilles


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 16:12   ` Gilles Chanteperdrix
@ 2008-05-05 16:23     ` Jan Kiszka
  2008-05-05 16:35       ` Philippe Gerum
  2008-05-05 16:52       ` Philippe Gerum
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kiszka @ 2008-05-05 16:23 UTC
  To: Gilles Chanteperdrix; +Cc: Xenomai-core

Gilles Chanteperdrix wrote:
> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>  shared queue (XNDEBUG check for rearming the timers),
> 
> A stupid suggestion: if we know that the spinlock is always locked
> when calling do_schedule_event, maybe we can simply avoid the lock
> there ?

That would be the best solution, but I don't think it works. After reading a
bit more into the lockdep output, I think the issue is that _some_other_
task may hold the siglock and then acquire our rq_lock, but not
necessarily along a code path similar to the one we took to acquire the
siglock now.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 16:23     ` Jan Kiszka
@ 2008-05-05 16:35       ` Philippe Gerum
  2008-05-05 16:52       ` Philippe Gerum
  1 sibling, 0 replies; 9+ messages in thread
From: Philippe Gerum @ 2008-05-05 16:35 UTC
  To: Jan Kiszka; +Cc: Xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>  shared queue (XNDEBUG check for rearming the timers),
>> A stupid suggestion: if we know that the spinlock is always locked
>> when calling do_schedule_event, maybe we can simply avoid the lock
>> there ?
> 
> Would be the best solution - but I don't think so. After reading a bit
> more into the lockdep output, I think the issue is that _some_other_
> task may hold the siglock and then acquire our rq_lock, but not
> necessarily along a similar code path we took to acquire the siglock now.
> 

Mm, indeed. Grabbing any further kernel lock while we hold the runqueue lock is
probably a very bad idea, since any normal locking sequence would go exactly the
opposite way...
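
One common way to express such an ordering rule is a lock hierarchy with
ranks. The sketch below is a user-space illustration under stated
assumptions: the rank values and all helper names are hypothetical, and the
order (siglock first, then q->lock, then the rq lock) is taken from the
dependency chain in the lockdep report rather than from any kernel
documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ranks: a lock may only be taken if its rank is higher than
 * the rank of the most recently acquired lock still held. */
enum lock_rank { RANK_SIGLOCK = 1, RANK_WAITQ = 2, RANK_RQ = 3 };

#define MAX_HELD 8

/* Per-context stack of currently held lock ranks. */
struct held_locks {
    enum lock_rank rank[MAX_HELD];
    size_t depth;
};

void held_init(struct held_locks *h)
{
    h->depth = 0;
}

/* Returns 1 and records the lock if the order is respected, 0 otherwise. */
int acquire_ranked(struct held_locks *h, enum lock_rank r)
{
    assert(h->depth < MAX_HELD);
    if (h->depth > 0 && h->rank[h->depth - 1] >= r)
        return 0;               /* would invert the hierarchy */
    h->rank[h->depth++] = r;
    return 1;
}

/* Drops the most recently acquired lock. */
void release_ranked(struct held_locks *h)
{
    assert(h->depth > 0);
    h->depth--;
}
```

With these ranks, taking siglock, q->lock and the rq lock in that order is
accepted, whereas acquire_ranked() with RANK_SIGLOCK fails once RANK_RQ is
held, which is the situation schedule_event() runs into.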

-- 
Philippe.


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 16:23     ` Jan Kiszka
  2008-05-05 16:35       ` Philippe Gerum
@ 2008-05-05 16:52       ` Philippe Gerum
  2008-05-05 17:43         ` Jan Kiszka
  1 sibling, 1 reply; 9+ messages in thread
From: Philippe Gerum @ 2008-05-05 16:52 UTC
  To: Jan Kiszka; +Cc: Xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>  shared queue (XNDEBUG check for rearming the timers),
>> A stupid suggestion: if we know that the spinlock is always locked
>> when calling do_schedule_event, maybe we can simply avoid the lock
>> there ?
> 
> Would be the best solution - but I don't think so. After reading a bit
> more into the lockdep output, I think the issue is that _some_other_
> task may hold the siglock and then acquire our rq_lock, but not
> necessarily along a similar code path we took to acquire the siglock now.
> 

Actually, this locking around the sigmask retrieval looks like overkill, since
we only address ptracing signals here, and those should go through the shared
pending set, not through the task's private one. I.e., there should be no way to
get fooled by any asynchronous update of that mask.

-- 
Philippe.


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 16:52       ` Philippe Gerum
@ 2008-05-05 17:43         ` Jan Kiszka
  2008-05-06  7:57           ` Philippe Gerum
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kiszka @ 2008-05-05 17:43 UTC
  To: rpm; +Cc: Jan Kiszka, Xenomai-core

Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>>  shared queue (XNDEBUG check for rearming the timers),
>>> A stupid suggestion: if we know that the spinlock is always locked
>>> when calling do_schedule_event, maybe we can simply avoid the lock
>>> there ?
>> Would be the best solution - but I don't think so. After reading a bit
>> more into the lockdep output, I think the issue is that _some_other_
>> task may hold the siglock and then acquire our rq_lock, but not
>> necessarily along a similar code path we took to acquire the siglock now.
>>
> 
> Actually, this locking around the sigmask retrieval looks overkill, since we
> only address ptracing signals here, and those should go through the shared
> pending set, not through the task's private one. I.e. There should be no way to
> get fooled by any asynchronous update of that mask.

This is a debug helper anyway, so we risk (if I got this right) at worst
a spurious unfreeze of the Xenomai timers. Does not really compare to
the current deadlock...

I will let my colleagues run the hunk below tomorrow (which works for me) -
let's see if they manage to crash it again :P (they are experts in this!).

Jan

Index: xenomai-2.4.x/ksrc/nucleus/shadow.c
===================================================================
--- xenomai-2.4.x/ksrc/nucleus/shadow.c	(revision 3734)
+++ xenomai-2.4.x/ksrc/nucleus/shadow.c	(working copy)
@@ -2194,9 +2194,7 @@ static inline void do_schedule_event(str
 			if (signal_pending(next)) {
 				sigset_t pending;
 
-				spin_lock(&wrap_sighand_lock(next));	/* Already interrupt-safe. */
 				wrap_get_sigpending(&pending, next);
-				spin_unlock(&wrap_sighand_lock(next));
 
 				if (sigismember(&pending, SIGSTOP) ||
 				    sigismember(&pending, SIGINT))


* Re: [Xenomai-core] Houston, we have a circular problem
  2008-05-05 17:43         ` Jan Kiszka
@ 2008-05-06  7:57           ` Philippe Gerum
  0 siblings, 0 replies; 9+ messages in thread
From: Philippe Gerum @ 2008-05-06  7:57 UTC
  To: Jan Kiszka; +Cc: Jan Kiszka, Xenomai-core

Jan Kiszka wrote:
> Philippe Gerum wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> On Mon, May 5, 2008 at 6:08 PM, Philippe Gerum <rpm@xenomai.org> wrote:
>>>>>  do_schedule_event() is the culprit when it reads the pending signals on the
>>>>>  shared queue (XNDEBUG check for rearming the timers),
>>>> A stupid suggestion: if we know that the spinlock is always locked
>>>> when calling do_schedule_event, maybe we can simply avoid the lock
>>>> there ?
>>> Would be the best solution - but I don't think so. After reading a bit
>>> more into the lockdep output, I think the issue is that _some_other_
>>> task may hold the siglock and then acquire our rq_lock, but not
>>> necessarily along a similar code path we took to acquire the siglock now.
>>>
>> Actually, this locking around the sigmask retrieval looks overkill, since we
>> only address ptracing signals here, and those should go through the shared
>> pending set, not through the task's private one. I.e. There should be no way to
>> get fooled by any asynchronous update of that mask.
> 
> This is a debug helper anyway, so we risk (if I got this right) at worst
> a spurious unfreeze of the Xenomai timers. Does not really compare to
> the current deadlock...
>

As a matter of fact, we don't test any condition under the protection of this
lock, so aside from the memory barrier induced on SMP, this lock does not buy us
anything. Except a deadlock, that is...
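
The shape of the lock-free check can be sketched in user space with the POSIX
sigset_t helpers. This is only an illustration: struct toy_task and both
functions are hypothetical, and looking at both the private and the shared
pending set is an assumption about what the snapshot covers, not a statement
about wrap_get_sigpending().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pending sets of a task. */
struct toy_task {
    sigset_t private_pending;
    sigset_t shared_pending;
};

void toy_task_init(struct toy_task *t)
{
    sigemptyset(&t->private_pending);
    sigemptyset(&t->shared_pending);
}

/*
 * Lock-free check in the spirit of the patched hunk: read the pending sets
 * as they are right now and test for SIGSTOP or SIGINT. Nothing else is
 * decided under a lock, so a stale read can at worst give a spurious answer.
 */
int debugger_signal_pending(const struct toy_task *t)
{
    assert(t != NULL);
    return sigismember(&t->private_pending, SIGSTOP) == 1 ||
           sigismember(&t->private_pending, SIGINT) == 1 ||
           sigismember(&t->shared_pending, SIGSTOP) == 1 ||
           sigismember(&t->shared_pending, SIGINT) == 1;
}
```

After toy_task_init(), debugger_signal_pending() returns 0; once SIGINT is
added to the shared set with sigaddset(), it returns 1.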

> I will let my colleagues run the hunk below tomorrow (which works for me) -
> let's see if they manage to crash it again :P (they are experts in this!).
> 
> Jan
> 
> Index: xenomai-2.4.x/ksrc/nucleus/shadow.c
> ===================================================================
> --- xenomai-2.4.x/ksrc/nucleus/shadow.c	(revision 3734)
> +++ xenomai-2.4.x/ksrc/nucleus/shadow.c	(working copy)
> @@ -2194,9 +2194,7 @@ static inline void do_schedule_event(str
>  			if (signal_pending(next)) {
>  				sigset_t pending;
>  
> -				spin_lock(&wrap_sighand_lock(next));	/* Already interrupt-safe. */
>  				wrap_get_sigpending(&pending, next);
> -				spin_unlock(&wrap_sighand_lock(next));
>  
>  				if (sigismember(&pending, SIGSTOP) ||
>  				    sigismember(&pending, SIGINT))
> 


-- 
Philippe.

