* [Adeos-main] Deadlock-prone ipipe_critical_enter
@ 2010-08-17 11:17 Jan Kiszka
2010-08-17 11:21 ` Gilles Chanteperdrix
2010-08-17 15:33 ` Philippe Gerum
0 siblings, 2 replies; 6+ messages in thread
From: Jan Kiszka @ 2010-08-17 11:17 UTC (permalink / raw)
To: adeos-main
Hi,
it turns out that ipipe_critical_enter is broken on SMP systems with more
than 2 CPUs: on one CPU, Linux may have acquired an rwlock for reading
when it is preempted by the critical IPI. On some other CPU, Linux may
have entered write_lock_irq[save] before the IPI arrived. The reader will
then be stuck in __ipipe_do_critical_sync and the writer in
__write_lock_failed, forever. First seen on real silicon (once every few
hundred boots), finally caught under KVM and nailed down.
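To make the interleaving concrete, here is a toy, single-threaded model of the three CPUs involved. It is only a sketch under simplifying assumptions and not I-pipe code: `cpu_model`, `master_gets_lock` and the step-based timing are hypothetical names and choices made for illustration. The reader is assumed to hold the lock for one step, and the master sends the critical IPI at step `ipi_step`.

```c
#include <assert.h>
#include <stdbool.h>

enum { MASTER, READER, WRITER, NR_CPUS };

struct cpu_model {
    bool stalled;      /* Linux domain stall bit (virtual IRQ mask) is set */
    bool ipi_pending;  /* critical IPI received but not yet dispatched */
    bool in_sync;      /* parked in the critical sync handler */
};

/*
 * Returns true if the master sees both other CPUs parked within
 * max_steps rounds, false if it would wait forever in this model.
 */
static bool master_gets_lock(int ipi_step, int max_steps)
{
    struct cpu_model cpu[NR_CPUS] = { 0 };
    int readers = 1;             /* READER already holds the rwlock */

    assert(ipi_step >= 0 && ipi_step < max_steps);

    /* WRITER sits in write_lock_irq(): domain stalled, waiting for readers. */
    cpu[WRITER].stalled = true;

    for (int step = 0; step < max_steps; step++) {
        if (step == ipi_step) {
            /* MASTER enters the critical section and signals the others. */
            cpu[READER].ipi_pending = true;
            cpu[WRITER].ipi_pending = true;
        }

        for (int i = READER; i < NR_CPUS; i++) {
            /* A pending IPI is only dispatched to an unstalled domain. */
            if (cpu[i].ipi_pending && !cpu[i].stalled) {
                cpu[i].ipi_pending = false;
                cpu[i].in_sync = true;
            }
        }

        if (cpu[READER].in_sync && cpu[WRITER].in_sync)
            return true;

        /* READER leaves its read-side section after one step, unless parked. */
        if (readers > 0 && step >= 1 && !cpu[READER].in_sync)
            readers--;

        /* WRITER gets the lock once no reader is left and later unstalls. */
        if (cpu[WRITER].stalled && readers == 0)
            cpu[WRITER].stalled = false;
    }
    return false;
}
```

In this model, `master_gets_lock(0, 1000)` never succeeds because the reader parks while still holding the lock, whereas `master_gets_lock(2, 1000)` succeeds because the reader has already dropped the lock when the IPI goes out. That timing dependence is consistent with the bug showing up only rarely.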
Two approaches to resolve this issue come to my mind so far. The first
one is to restart the whole ipipe_critical_enter after some (how many?)
cycles of futile waiting. The other is to accept the critical IPI even
if the top-most domain is stalled (as it sits in write_lock_irq), but
I'm not 100% sure that our optimistic IRQ mask will always allow this
when Linux is on top (I assume we can safely require other domains to
avoid such deadlocks by design).
Comments? Better ideas?
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
* Re: [Adeos-main] Deadlock-prone ipipe_critical_enter
From: Gilles Chanteperdrix @ 2010-08-17 11:21 UTC (permalink / raw)
To: Jan Kiszka; +Cc: adeos-main
Jan Kiszka wrote:
> Hi,
>
> it turned out ipipe_critical_enter is broken on SMP > 2 CPUs: On one
> CPU, Linux may have acquired an rwlock for reading when being preempted
> by the critical IPI. On some other CPU, Linux may have entered
> write_lock_irq[save] before the IPI arrived. The reader will be stuck in
> __ipipe_do_critical_sync, the writer in __write_lock_failed - forever.
> First seen on real silicon (once per "few" hundreds of boots), finally
> caught under KVM and nailed down.
>
> Two approaches to resolve this issue come to my mind so far. The first
> one is to restart the whole ipipe_critical_enter after some (how many?)
> cycles of futile waiting. The other is to accept the critical IPI even
> if the top-most domain is stalled (as it sits in write_lock_irq), but
> I'm not 100% that our optimistic IRQ mask will always allow this when
> Linux is on the top (I assume we can safely require other domains to
> avoid such deadlocks by design).
>
> Comments? Better ideas?
I guess the rwlocks are ipipe rwlocks, right?
I am not sure it is different from your second idea, but what about
spinning in write_lock_irq/save with irqs on?
--
Gilles.
* Re: [Adeos-main] Deadlock-prone ipipe_critical_enter
From: Jan Kiszka @ 2010-08-17 11:34 UTC (permalink / raw)
To: Gilles Chanteperdrix; +Cc: adeos-main
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Hi,
>>
>> it turned out ipipe_critical_enter is broken on SMP > 2 CPUs: On one
>> CPU, Linux may have acquired an rwlock for reading when being preempted
>> by the critical IPI. On some other CPU, Linux may have entered
>> write_lock_irq[save] before the IPI arrived. The reader will be stuck in
>> __ipipe_do_critical_sync, the writer in __write_lock_failed - forever.
>> First seen on real silicon (once per "few" hundreds of boots), finally
>> caught under KVM and nailed down.
>>
>> Two approaches to resolve this issue come to my mind so far. The first
>> one is to restart the whole ipipe_critical_enter after some (how many?)
>> cycles of futile waiting. The other is to accept the critical IPI even
>> if the top-most domain is stalled (as it sits in write_lock_irq), but
>> I'm not 100% that our optimistic IRQ mask will always allow this when
>> Linux is on the top (I assume we can safely require other domains to
>> avoid such deadlocks by design).
>>
>> Comments? Better ideas?
>
> I guess, the rwlocks are ipipe rwlocks, right?
Nope, plain Linux tasklist_lock. No Xenomai domain active at this point,
just Linux.
> I am not sure it is different from your second idea, but what about
> spinning in write_lock_irq/save with irqs on?
Hard-IRQs on (which is what my second idea would rely on) or Linux IRQs
on (which would involve patching Linux spinlock arch code and may have
side effects)?
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
* Re: [Adeos-main] Deadlock-prone ipipe_critical_enter
From: Gilles Chanteperdrix @ 2010-08-17 11:36 UTC (permalink / raw)
To: Jan Kiszka; +Cc: adeos-main
Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Hi,
>>>
>>> it turned out ipipe_critical_enter is broken on SMP > 2 CPUs: On one
>>> CPU, Linux may have acquired an rwlock for reading when being preempted
>>> by the critical IPI. On some other CPU, Linux may have entered
>>> write_lock_irq[save] before the IPI arrived. The reader will be stuck in
>>> __ipipe_do_critical_sync, the writer in __write_lock_failed - forever.
>>> First seen on real silicon (once per "few" hundreds of boots), finally
>>> caught under KVM and nailed down.
>>>
>>> Two approaches to resolve this issue come to my mind so far. The first
>>> one is to restart the whole ipipe_critical_enter after some (how many?)
>>> cycles of futile waiting. The other is to accept the critical IPI even
>>> if the top-most domain is stalled (as it sits in write_lock_irq), but
>>> I'm not 100% that our optimistic IRQ mask will always allow this when
>>> Linux is on the top (I assume we can safely require other domains to
>>> avoid such deadlocks by design).
>>>
>>> Comments? Better ideas?
>> I guess, the rwlocks are ipipe rwlocks, right?
>
> Nope, plain Linux tasklist_lock. No Xenomai domain active at this point,
> just Linux.
Then how could this happen? Is not the critical IPI always able to
preempt Linux?
--
Gilles.
* Re: [Adeos-main] Deadlock-prone ipipe_critical_enter
From: Jan Kiszka @ 2010-08-17 11:41 UTC (permalink / raw)
To: Gilles Chanteperdrix; +Cc: adeos-main
Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Hi,
>>>>
>>>> it turned out ipipe_critical_enter is broken on SMP > 2 CPUs: On one
>>>> CPU, Linux may have acquired an rwlock for reading when being preempted
>>>> by the critical IPI. On some other CPU, Linux may have entered
>>>> write_lock_irq[save] before the IPI arrived. The reader will be stuck in
>>>> __ipipe_do_critical_sync, the writer in __write_lock_failed - forever.
>>>> First seen on real silicon (once per "few" hundreds of boots), finally
>>>> caught under KVM and nailed down.
>>>>
>>>> Two approaches to resolve this issue come to my mind so far. The first
>>>> one is to restart the whole ipipe_critical_enter after some (how many?)
>>>> cycles of futile waiting. The other is to accept the critical IPI even
>>>> if the top-most domain is stalled (as it sits in write_lock_irq), but
>>>> I'm not 100% that our optimistic IRQ mask will always allow this when
>>>> Linux is on the top (I assume we can safely require other domains to
>>>> avoid such deadlocks by design).
>>>>
>>>> Comments? Better ideas?
>>> I guess, the rwlocks are ipipe rwlocks, right?
>> Nope, plain Linux tasklist_lock. No Xenomai domain active at this point,
>> just Linux.
>
> Then how could this happen? Is not the critical IPI always able to
> preempt Linux?
Obviously not if Linux is top-most. Hard IRQs are enabled on all CPUs,
but the Linux domain is stalled. So the critical IPI is not delivered by
ipipe.
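The effect of the stall bit can be sketched with a small model. This is only an illustration of the general "log while stalled, dispatch on unstall" idea, assuming a single domain and a bitmask of pending IRQs; `domain_model`, `irq_arrives` and `unstall` are hypothetical names, not the I-pipe API.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VIRQS 32

struct domain_model {
    bool stalled;            /* virtual IRQ mask of the domain */
    unsigned long pending;   /* IRQs logged while the domain was stalled */
    unsigned long handled;   /* IRQs whose handler has actually run */
};

/* Run the handlers of everything that was logged so far. */
static void sync_pending(struct domain_model *d)
{
    d->handled |= d->pending;
    d->pending = 0;
}

/* Hardware IRQ arrives: always logged, dispatched only if unstalled. */
static void irq_arrives(struct domain_model *d, unsigned int irq)
{
    assert(irq < NR_VIRQS);
    d->pending |= 1UL << irq;
    if (!d->stalled)
        sync_pending(d);
}

/* Clearing the stall bit replays whatever was logged in the meantime. */
static void unstall(struct domain_model *d)
{
    d->stalled = false;
    sync_pending(d);
}
```

With `stalled` set, a call to `irq_arrives()` leaves the IRQ in `pending` until `unstall()` runs. In the deadlock, the writer never reaches that point because it keeps spinning inside write_lock_irq.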
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
* Re: [Adeos-main] Deadlock-prone ipipe_critical_enter
From: Philippe Gerum @ 2010-08-17 15:33 UTC (permalink / raw)
To: Jan Kiszka; +Cc: adeos-main
On Tue, 2010-08-17 at 13:17 +0200, Jan Kiszka wrote:
> Hi,
>
> it turned out ipipe_critical_enter is broken on SMP > 2 CPUs: On one
> CPU, Linux may have acquired an rwlock for reading when being preempted
> by the critical IPI. On some other CPU, Linux may have entered
> write_lock_irq[save] before the IPI arrived. The reader will be stuck in
> __ipipe_do_critical_sync, the writer in __write_lock_failed - forever.
> First seen on real silicon (once per "few" hundreds of boots), finally
> caught under KVM and nailed down.
>
> Two approaches to resolve this issue come to my mind so far. The first
> one is to restart the whole ipipe_critical_enter after some (how many?)
> cycles of futile waiting. The other is to accept the critical IPI even
> if the top-most domain is stalled (as it sits in write_lock_irq), but
> I'm not 100% that our optimistic IRQ mask will always allow this when
> Linux is on the top (I assume we can safely require other domains to
> avoid such deadlocks by design).
>
> Comments? Better ideas?
No comment, except good catch. Freaking bug. This one dates back to
October 2002; vintage stuff.
Option #2 would introduce a problem, in the sense that the semantics of
ipipe_critical_enter dictate that all preempted CPUs should be able to
run a post-sync routine in a safe, concurrency-controlled context when
the "master" CPU releases the lock, and they do this over the critical
interrupt handler.
There is no restriction on what that routine may do over the Linux
domain, nor is there anything the master CPU may not do while holding
the super-lock, assuming that all other CPUs are running outside of any
interrupt-free section. Therefore, allowing slave CPUs to be preempted
regardless of the stall bit would introduce some risk there.
So, a check-and-retry approach seems required. Fortunately, we do expect
the slave CPUs to enter a quiescent state at some point, so that the
master CPU eventually gets the lock, unless something got really broken
otherwise.
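A check-and-retry loop could look roughly like the following toy model. It is a sketch under stated assumptions, not a patch: `sys_model`, `slaves_step` and `critical_enter_retry` are hypothetical names, the spin budget is arbitrary, and cancelling the pending IPI on a timeout stands in for whatever the real code would need so that a late IPI notices the aborted attempt.

```c
#include <assert.h>
#include <stdbool.h>

enum { READER, WRITER, NR_SLAVES };

struct slave_model {
    bool stalled;      /* Linux domain stall bit is set */
    bool ipi_pending;  /* critical IPI not yet dispatched */
    bool parked;       /* spinning in the critical sync handler */
};

struct sys_model {
    struct slave_model slave[NR_SLAVES];
    int readers;       /* read-side holders of the rwlock */
};

/* One round of progress on the slave CPUs. */
static void slaves_step(struct sys_model *s)
{
    for (int i = 0; i < NR_SLAVES; i++) {
        if (s->slave[i].ipi_pending && !s->slave[i].stalled) {
            s->slave[i].ipi_pending = false;
            s->slave[i].parked = true;
        }
    }
    /* The reader can only drop the lock while it is not parked. */
    if (s->readers > 0 && !s->slave[READER].parked)
        s->readers--;
    /* The writer gets the lock once readers are gone, then unstalls. */
    if (s->slave[WRITER].stalled && s->readers == 0)
        s->slave[WRITER].stalled = false;
}

static bool all_parked(const struct sys_model *s)
{
    return s->slave[READER].parked && s->slave[WRITER].parked;
}

/* Returns the attempt that succeeded (1-based), or 0 if all attempts failed. */
static int critical_enter_retry(struct sys_model *s, int spin_budget,
                                int max_attempts)
{
    assert(spin_budget > 0 && max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        for (int i = 0; i < NR_SLAVES; i++)
            s->slave[i].ipi_pending = true;      /* send the critical IPI */

        for (int spin = 0; spin < spin_budget; spin++) {
            slaves_step(s);
            if (all_parked(s))
                return attempt;                  /* master owns the lock */
        }

        /* Futile wait: release parked CPUs and cancel this attempt. */
        for (int i = 0; i < NR_SLAVES; i++) {
            s->slave[i].parked = false;
            s->slave[i].ipi_pending = false;
        }
        slaves_step(s);                          /* let the slaves run */
    }
    return 0;
}
```

Starting from the deadlocked configuration (one reader, writer stalled), the first attempt times out, the released reader drops the lock, the writer unstalls, and the second attempt finds both CPUs able to park. The open question of how many cycles to wait maps to `spin_budget` here.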
>
> Jan
>
--
Philippe.