netdev.vger.kernel.org archive mirror
* Deadlock in sungem/ip_auto_config/linkwatch
@ 2004-01-05 13:07 Michal Ostrowski
  2004-01-05 14:50 ` Stefan Rompf
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Ostrowski @ 2004-01-05 13:07 UTC (permalink / raw)
  To: netdev

I believe I've found a potential deadlock condition.

It occurs when we make the following sequence of calls:

ip_auto_config
ic_open_devs
dev_change_flags
dev_open
gem_open
flush_scheduled_work

ic_open_devs grabs rtnl_sem with an rtnl_shlock() call.

At some point the sungem driver calls gem_init_one(), which calls
netif_carrier_*(), which in turn calls schedule_work() to queue
linkwatch_event().

linkwatch_event() in turn needs rtnl_sem.

If we enter the call sequence above while linkwatch_event() is still
pending, we deadlock: flush_scheduled_work() waits for
linkwatch_event() to complete, but linkwatch_event() is blocked because
it cannot take rtnl_sem.
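
To make the interaction concrete, here is a rough sketch of the two
sides of the deadlock.  The function names other than the rtnl and
workqueue calls are made up for illustration; this is not the actual
sungem code:

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
#include <linux/workqueue.h>

/* Side 1: the work item queued earlier via netif_carrier_*().
 * It needs rtnl_sem before it can do anything. */
static void linkwatch_style_event(void *dummy)
{
	rtnl_shlock();		/* blocks: ic_open_devs() holds rtnl_sem */
	/* ... process link state changes ... */
	rtnl_shunlock();
}

/* Side 2: the open path, reached as
 * ip_auto_config -> ic_open_devs -> dev_change_flags -> dev_open -> open(),
 * with rtnl_sem already held. */
static int example_open(struct net_device *dev)
{
	/*
	 * Waits for every pending item on the shared keventd queue,
	 * including the linkwatch-style event above, which can never
	 * run because we are the ones holding rtnl_sem.  Deadlock.
	 */
	flush_scheduled_work();
	return 0;
}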

In general, when can one call flush_scheduled_work()?  It seems one
can't unless one knows that no caller up the stack is holding a lock
that a pending work item might need.


-- 
Michal Ostrowski <mostrows@watson.ibm.com>


* Re: Deadlock in sungem/ip_auto_config/linkwatch
  2004-01-05 13:07 Deadlock in sungem/ip_auto_config/linkwatch Michal Ostrowski
@ 2004-01-05 14:50 ` Stefan Rompf
  2004-01-05 16:19   ` Michal Ostrowski
  2004-01-05 19:02   ` David S. Miller
  0 siblings, 2 replies; 5+ messages in thread
From: Stefan Rompf @ 2004-01-05 14:50 UTC (permalink / raw)
  To: Michal Ostrowski, netdev

On Monday, 5 January 2004 14:07, Michal Ostrowski wrote:

> ic_open_devs grabs rtnl_sem with an rtnl_shlock() call.
>
> At some point the sungem driver calls gem_init_one(), which calls
> netif_carrier_*(), which in turn calls schedule_work() to queue
> linkwatch_event().
>
> linkwatch_event() in turn needs rtnl_sem.

Good catch! The sungem driver clearly shows that we need some way to
remove queued work without scheduling it and without waiting for other,
unrelated events.

This week I will change the linkwatch code to use rtnl_shlock_nowait()
and to back off and retry if it fails. Call it a workaround, but it
increases overall system stability.
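
Roughly what I have in mind, as a sketch of the idea only and not a
tested patch.  I'm assuming the existing linkwatch work struct (here
called linkwatch_work), and the retry delay is just an arbitrary
placeholder:

static void linkwatch_event(void *dummy)
{
	/*
	 * Don't block on rtnl_sem here: whoever holds it may be sitting
	 * in flush_scheduled_work() waiting for this very work item.
	 */
	if (rtnl_shlock_nowait()) {
		/* Semaphore busy, back off and try again later. */
		schedule_delayed_work(&linkwatch_work, HZ / 10);
		return;
	}

	linkwatch_run_queue();
	rtnl_shunlock();
}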

Btw, what is the planned difference between rtnl_shlock() and rtnl_exlock()?
Even though the latter is a null operation right now, I don't want to hold
more locks than needed in the linkwatch code.

Stefan

-- 
"doesn't work" is not a magic word to explain everything.


* Re: Deadlock in sungem/ip_auto_config/linkwatch
  2004-01-05 14:50 ` Stefan Rompf
@ 2004-01-05 16:19   ` Michal Ostrowski
  2004-01-05 16:50     ` Stefan Rompf
  2004-01-05 19:02   ` David S. Miller
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Ostrowski @ 2004-01-05 16:19 UTC (permalink / raw)
  To: Stefan Rompf; +Cc: netdev

This can get pretty hairy.

Suppose the linkwatch code backs off because rtnl_sem is legitimately
held by thread A.  Meanwhile, thread B calls flush_scheduled_work() in
order to wait for pending linkwatch events to complete.

With the proposed solution this results in incorrect behaviour:
flush_scheduled_work() returns even though the linkwatch work has not
actually been done.  (Admittedly, I'm not sure whether such a scenario
is really feasible.)

My initial thought was to use a separate work queue, untangled from
the global queue used by flush_scheduled_work().  This would allow
linkwatch events to be synchronized against explicitly.  For this
solution, though, it would be nice not to need a thread per CPU for the
linkwatch work queue.
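
Roughly what I had in mind, purely as a sketch with made-up names; the
thread-per-CPU cost of create_workqueue() is exactly the part I would
like to avoid:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* Dedicated queue for linkwatch events, untangled from keventd. */
static struct workqueue_struct *linkwatch_wq;

static int __init linkwatch_wq_init(void)
{
	/* Downside: create_workqueue() starts one worker thread per CPU. */
	linkwatch_wq = create_workqueue("linkwatch");
	return linkwatch_wq ? 0 : -ENOMEM;
}

/* Queue linkwatch work here instead of via schedule_work(). */
static void linkwatch_queue(struct work_struct *work)
{
	queue_work(linkwatch_wq, work);
}

/* Callers that must wait for link events flush only this queue, so
 * holding rtnl_sem while flushing the global keventd queue no longer
 * comes into it. */
static void linkwatch_flush(void)
{
	flush_workqueue(linkwatch_wq);
}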

On the other hand, ic_open_devs appears to be the only place where
rtnl_sem is held while going into a driver's open() function, and so
maybe the right rule is that rtnl_sem is not held when calling
dev->open().


-- 
Michal Ostrowski <mostrows@watson.ibm.com>



* Re: Deadlock in sungem/ip_auto_config/linkwatch
  2004-01-05 16:19   ` Michal Ostrowski
@ 2004-01-05 16:50     ` Stefan Rompf
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Rompf @ 2004-01-05 16:50 UTC (permalink / raw)
  To: Michal Ostrowski; +Cc: netdev

On Monday, 5 January 2004 17:19, Michal Ostrowski wrote:

> Suppose the linkwatch code backs off because rtnl_sem is legitimately
> held by thread A.  Meanwhile, thread B calls flush_scheduled_work() in
> order to wait for pending linkwatch events to complete.

This won't happen. If a pending linkwatch event needs to be handled
synchronously (e.g. when a device is unregistered), it is executed in
the context of the calling process, not inside the workqueue thread.
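
To illustrate "executed in the context of the calling process", here is
a simplified sketch of the unregister-time handling.  It is paraphrased
from memory, not a verbatim quote of the source, and the helper and
flag names are as I remember them:

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/* Sketch: at unregister time a still-pending linkwatch event is not
 * left for keventd; the queue is run directly by the caller, which
 * takes rtnl_sem itself, so nobody has to flush the workqueue. */
static void example_unregister_sync(struct net_device *dev)
{
	rtnl_shlock();
	if (test_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state))
		linkwatch_run_queue();	/* runs here, not in keventd */
	rtnl_shunlock();
}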

> My initial thought was to use a separate work queue, untangled from
> the global queue used by flush_scheduled_work(). This would allow
> linkwatch events to be synchronized against explicitly.

That's overkill, and if I understand flush_workqueue() correctly, it
doesn't wait for work that has been queued with a delay, so it wouldn't
even help here.  That's why I thought about a function to unregister
pending work.
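
Something like this is what I have in mind.  The interface is purely
hypothetical and the name is made up; nothing like it exists today:

#include <linux/workqueue.h>

/*
 * Hypothetical: remove a not-yet-running work item from the shared
 * keventd queue (including delayed work whose timer has not fired yet)
 * without running the queue and without waiting on unrelated work.
 *
 * Returns nonzero if the item was pending and has been removed, 0 if
 * it was never queued or is already executing; in the latter case the
 * caller still needs some other way to wait for it to finish.
 */
extern int cancel_queued_work(struct work_struct *work);

/* A driver could then drop its own pending work item in its open or
 * close path instead of calling flush_scheduled_work() while holding
 * rtnl_sem. */
static void example_cancel(struct work_struct *my_task)
{
	if (!cancel_queued_work(my_task)) {
		/* Already ran or is running right now; nothing left to
		 * remove, synchronize by other means if needed. */
	}
}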

Stefan
-- 
"doesn't work" is not a magic word to explain everything.


* Re: Deadlock in sungem/ip_auto_config/linkwatch
  2004-01-05 14:50 ` Stefan Rompf
  2004-01-05 16:19   ` Michal Ostrowski
@ 2004-01-05 19:02   ` David S. Miller
  1 sibling, 0 replies; 5+ messages in thread
From: David S. Miller @ 2004-01-05 19:02 UTC (permalink / raw)
  To: Stefan Rompf; +Cc: mostrows, netdev

On Mon, 5 Jan 2004 15:50:50 +0100
Stefan Rompf <srompf@isg.de> wrote:

> Btw, what is the planned difference between rtnl_shlock() and rtnl_exlock()?
> Even though the latter is a null operation right now, I don't want to hold
> more locks than needed in the linkwatch code.

The idea was originally to make the RTNL semaphore a read-write one,
but I doubt we'll ever make that happen and the shlock bits will just
disappear entirely.
