* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 18:28 ` David S. Miller
@ 2004-10-07 18:41 ` Matt Mackall
2004-10-07 20:00 ` Colin Leroy
2004-10-07 18:43 ` Andi Kleen
2004-10-07 20:44 ` Colin Leroy
2 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 18:41 UTC (permalink / raw)
To: David S. Miller; +Cc: Colin Leroy, akpm, netdev
On Thu, Oct 07, 2004 at 11:28:46AM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:05:32 +0200
> Colin Leroy <colin@colino.net> wrote:
>
> > First, my newbie question: is it possible to deadlock a spinlock on a
> > Uniprocessor kernel ? For example, there's something I find suspect in
> > netpoll/sungem interaction:
> >
>
> Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking,
> crap :(
>
> When a device has NETIF_F_LLTX set, it means that the driver's
> dev->hard_start_xmit() routine is what takes the xmit_lock, not
> the caller one level up.
>
> Andi Kleen didn't fix up netpoll when he did his LLTX changes, oops.
>
> So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> Basically:
>
> 1) If NETIF_F_LLTX is clear, same as before
> 2) If NETIF_F_LLTX is set:
> a) Do not take xmit_lock
> b) Check ->hard_start_xmit() return value,
> if it is NETDEV_TX_LOCKED, then
> spin_trylock(&dev->xmit_lock) failed
> in ->hard_start_xmit()
Colin, feeling adventurous enough to take a stab at this? It looks
pretty straightforward but I'm going to be even more useless than
usual for the next two weeks.
>
> The best example is in net/sched/sch_generic.c:qdisc_restart()
>
> unsigned nolock = (dev->features & NETIF_F_LLTX);
> /*
> * When the driver has LLTX set it does its own locking
> * in start_xmit. No need to add additional overhead by
> * locking again. These checks are worth it because
> * even uncongested locks can be quite expensive.
> * The driver can do trylock like here too, in case
> * of lock congestion it should return -1 and the packet
> * will be requeued.
> */
> if (!nolock) {
> if (!spin_trylock(&dev->xmit_lock)) {
> collision:
> /* So, someone grabbed the driver. */
>
> /* It may be transient configuration error,
> when hard_start_xmit() recurses. We detect
> it by checking xmit owner and drop the
> packet when deadloop is detected.
> */
> if (dev->xmit_lock_owner == smp_processor_id()) {
> kfree_skb(skb);
> if (net_ratelimit())
> printk(KERN_DEBUG "Dead loop on netdevice %s, fix it urgently!\n", dev->name);
> return -1;
> }
> __get_cpu_var(netdev_rx_stat).cpu_collision++;
> goto requeue;
> }
> /* Remember that the driver is grabbed by us. */
> dev->xmit_lock_owner = smp_processor_id();
> }
>
> {
> /* And release queue */
> spin_unlock(&dev->queue_lock);
>
> if (!netif_queue_stopped(dev)) {
> int ret;
> if (netdev_nit)
> dev_queue_xmit_nit(skb, dev);
>
> ret = dev->hard_start_xmit(skb, dev);
> if (ret == NETDEV_TX_OK) {
> if (!nolock) {
> dev->xmit_lock_owner = -1;
> spin_unlock(&dev->xmit_lock);
> }
> spin_lock(&dev->queue_lock);
> return -1;
> }
> if (ret == NETDEV_TX_LOCKED && nolock) {
> spin_lock(&dev->queue_lock);
> goto collision;
> }
> }
>
> /* NETDEV_TX_BUSY - we need to requeue */
> /* Release the driver */
> if (!nolock) {
> dev->xmit_lock_owner = -1;
> spin_unlock(&dev->xmit_lock);
> }
> spin_lock(&dev->queue_lock);
> q = dev->qdisc;
> }
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 18:41 ` Matt Mackall
@ 2004-10-07 20:00 ` Colin Leroy
0 siblings, 0 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-07 20:00 UTC (permalink / raw)
To: Matt Mackall; +Cc: David S. Miller, akpm, netdev
On 07 Oct 2004 at 13h10, Matt Mackall wrote:
Hi,
> Colin, feeling adventurous enough to take a stab at this? It looks
> pretty straightforward but I'm going to be even more useless than
> usual for the next two weeks.
I'll try :)
Thanks for all the input !
--
Colin
Recursion: see Recursion
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 18:28 ` David S. Miller
2004-10-07 18:41 ` Matt Mackall
@ 2004-10-07 18:43 ` Andi Kleen
2004-10-07 20:44 ` Colin Leroy
2 siblings, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 18:43 UTC (permalink / raw)
To: David S. Miller; +Cc: Colin Leroy, mpm, akpm, netdev
On Thu, Oct 07, 2004 at 11:28:46AM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:05:32 +0200
> Colin Leroy <colin@colino.net> wrote:
>
> > First, my newbie question: is it possible to deadlock a spinlock on a
> > Uniprocessor kernel ? For example, there's something I find suspect in
> > netpoll/sungem interaction:
> >
>
> Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking,
> crap :(
>
> When a device has NETIF_F_LLTX set, it means that the driver's
> dev->hard_start_xmit() routine is what takes the xmit_lock, not
> the caller one level up.
It takes an own lock, not xmit_lock.
It's fine to ignore it completely. In the worst case the poll
will not be retried, but netpoll has no way to do that anyways I think.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 18:28 ` David S. Miller
2004-10-07 18:41 ` Matt Mackall
2004-10-07 18:43 ` Andi Kleen
@ 2004-10-07 20:44 ` Colin Leroy
2004-10-07 21:45 ` Andi Kleen
2004-10-07 22:08 ` David S. Miller
2 siblings, 2 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-07 20:44 UTC (permalink / raw)
To: David S. Miller; +Cc: mpm, akpm, netdev
On 07 Oct 2004 at 11h10, David S. Miller wrote:
Hi again,
> So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
This patch should do that. It works OK for me, but I'd like it checked
before sent upstream...
However, it doesn't fix the hang. it looks like this hang is really
coming from sungem.
--- a/net/core/netpoll.c 2004-10-05 21:09:49.000000000 +0200
+++ b/net/core/netpoll.c 2004-10-07 22:48:58.000000000 +0200
@@ -181,6 +181,7 @@
void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
{
int status;
+ unsigned must_lock;
repeat:
if(!np || !np->dev || !netif_running(np->dev)) {
@@ -188,29 +189,51 @@
return;
}
- spin_lock(&np->dev->xmit_lock);
- np->dev->xmit_lock_owner = smp_processor_id();
+ must_lock = !(np->dev->features & NETIF_F_LLTX);
+
+ if (must_lock) {
+ if (!spin_trylock(&np->dev->xmit_lock)) {
+ if (np->dev->xmit_lock_owner ==
smp_processor_id()) {
+ __kfree_skb(skb);
+ return;
+ }
+ goto repeat;
+ }
+ np->dev->xmit_lock_owner = smp_processor_id();
+ }
+
/*
* network drivers do not expect to be called if the queue is
* stopped.
*/
if (netif_queue_stopped(np->dev)) {
np->dev->xmit_lock_owner = -1;
- spin_unlock(&np->dev->xmit_lock);
+
+ if (must_lock)
+ spin_unlock(&np->dev->xmit_lock);
netpoll_poll(np);
goto repeat;
}
status = np->dev->hard_start_xmit(skb, np->dev);
- np->dev->xmit_lock_owner = -1;
- spin_unlock(&np->dev->xmit_lock);
+
+ if (must_lock) {
+ np->dev->xmit_lock_owner = -1;
+ spin_unlock(&np->dev->xmit_lock);
+ }
/* transmit busy */
- if(status) {
+ if (status == NETDEV_TX_LOCKED) {
netpoll_poll(np);
goto repeat;
+ }
+
+ /* hard error */
+ if (status == NETDEV_TX_BUSY) {
+ __kfree_skb(skb);
+ return;
}
}
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 20:44 ` Colin Leroy
@ 2004-10-07 21:45 ` Andi Kleen
2004-10-07 21:50 ` Matt Mackall
2004-10-08 7:06 ` Colin Leroy
2004-10-07 22:08 ` David S. Miller
1 sibling, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 21:45 UTC (permalink / raw)
To: Colin Leroy; +Cc: David S. Miller, mpm, akpm, netdev
On Thu, Oct 07, 2004 at 10:44:22PM +0200, Colin Leroy wrote:
> On 07 Oct 2004 at 11h10, David S. Miller wrote:
>
> Hi again,
>
> > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
>
> This patch should do that. It works OK for me, but I'd like it checked
> before sent upstream...
>
> However, it doesn't fix the hang. it looks like this hang is really
> coming from sungem.
IMHO it's not needed. Taking xmit_lock is harmless even when
the NETIF_F_LLTX flag is set.
(or at least it was with my original patchkit. In theory it's
possible someone changed their driver to take xmit_lock in hard_start_xmit,
but if they did that I would just consider it a driver bug)
The only drawback is that there won't be a reply when the driver try
lock fails, but netpoll doesn't have a queue for that anyways. You could
probably poll then, but I'm not sure it's a good idea.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 21:45 ` Andi Kleen
@ 2004-10-07 21:50 ` Matt Mackall
2004-10-07 22:07 ` David S. Miller
2004-10-08 7:06 ` Colin Leroy
1 sibling, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 21:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: Colin Leroy, David S. Miller, akpm, netdev
On Thu, Oct 07, 2004 at 11:45:05PM +0200, Andi Kleen wrote:
> On Thu, Oct 07, 2004 at 10:44:22PM +0200, Colin Leroy wrote:
> > On 07 Oct 2004 at 11h10, David S. Miller wrote:
> >
> > Hi again,
> >
> > > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
> >
> > This patch should do that. It works OK for me, but I'd like it checked
> > before sent upstream...
> >
> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
>
> IMHO it's not needed. Taking xmit_lock is harmless even when
> the NETIF_F_LLTX flag is set.
>
> (or at least it was with my original patchkit. In theory it's
> possible someone changed their driver to take xmit_lock in hard_start_xmit,
> but if they did that I would just consider it a driver bug)
Ok, this part makes sense.
> The only drawback is that there won't be a reply when the driver try
> lock fails, but netpoll doesn't have a queue for that anyways. You could
> probably poll then, but I'm not sure it's a good idea.
But your meaning here is not entirely clear.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 21:50 ` Matt Mackall
@ 2004-10-07 22:07 ` David S. Miller
2004-10-07 23:43 ` Matt Mackall
0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-07 22:07 UTC (permalink / raw)
To: Matt Mackall; +Cc: ak, colin, akpm, netdev
On Thu, 7 Oct 2004 16:50:26 -0500
Matt Mackall <mpm@selenic.com> wrote:
> > The only drawback is that there won't be a reply when the driver try
> > lock fails, but netpoll doesn't have a queue for that anyways. You could
> > probably poll then, but I'm not sure it's a good idea.
>
> But your meaning here is not entirely clear.
If another thread on another cpu is in the dev->hard_start_xmit() routine,
then it will have it's tx device lock held, and netpoll will simply get an
immediate return from ->hard_start_xmit() with error NETDEV_TX_LOCKED.
The packet will thus not be sent, and because netpoll does not have a
backlog queue for tx packets of any kind the packet lost forever.
NETDEV_TX_LOCKED is a transient condition. It works for the rest of the
kernel because whoever holds the tx lock on the device, will recheck the
device packet transmit queue when it drops that lock and returns from
->hard_start_xmit().
Andi is merely noting how netpoll's design does not have such a model,
which is why the NETIF_F_LLTX semantics don't mesh very well.
It is unclear if it ise wise that netpoll_send_skb() currently spins
on ->hard_start_xmit() returning NETDEV_TX_LOCKED. That could
result in some kind of deadlocks.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 22:07 ` David S. Miller
@ 2004-10-07 23:43 ` Matt Mackall
2004-10-07 23:50 ` Andi Kleen
0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-07 23:43 UTC (permalink / raw)
To: David S. Miller; +Cc: ak, colin, akpm, netdev
On Thu, Oct 07, 2004 at 03:07:56PM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 16:50:26 -0500
> Matt Mackall <mpm@selenic.com> wrote:
>
> > > The only drawback is that there won't be a reply when the driver try
> > > lock fails, but netpoll doesn't have a queue for that anyways. You could
> > > probably poll then, but I'm not sure it's a good idea.
> >
> > But your meaning here is not entirely clear.
>
> If another thread on another cpu is in the dev->hard_start_xmit() routine,
> then it will have it's tx device lock held, and netpoll will simply get an
> immediate return from ->hard_start_xmit() with error NETDEV_TX_LOCKED.
>
> The packet will thus not be sent, and because netpoll does not have a
> backlog queue for tx packets of any kind the packet lost forever.
>
> NETDEV_TX_LOCKED is a transient condition. It works for the rest of the
> kernel because whoever holds the tx lock on the device, will recheck the
> device packet transmit queue when it drops that lock and returns from
> ->hard_start_xmit().
>
> Andi is merely noting how netpoll's design does not have such a model,
> which is why the NETIF_F_LLTX semantics don't mesh very well.
>
> It is unclear if it ise wise that netpoll_send_skb() currently spins
> on ->hard_start_xmit() returning NETDEV_TX_LOCKED. That could
> result in some kind of deadlocks.
Deadlocks from recursion, presumably? We could probably throw in a max
retry count, as ugly as that is..
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 23:43 ` Matt Mackall
@ 2004-10-07 23:50 ` Andi Kleen
2004-10-08 6:46 ` Colin Leroy
0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-07 23:50 UTC (permalink / raw)
To: Matt Mackall; +Cc: David S. Miller, ak, colin, akpm, netdev
> Deadlocks from recursion, presumably? We could probably throw in a max
> retry count, as ugly as that is..
There should not be any recursion, no.
The problem is that the poll is effectively a spinlock. But when
another CPU takes an long interrupt while holding the lock it
could take quite a long time to grab the lock.
For most network drivers this shouldn't occur though because
the net driver private lock is usually always taken with interrupts
off.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 23:50 ` Andi Kleen
@ 2004-10-08 6:46 ` Colin Leroy
2004-10-08 21:53 ` Matt Mackall
0 siblings, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-08 6:46 UTC (permalink / raw)
To: Andi Kleen; +Cc: Matt Mackall, David S. Miller, ak, akpm, netdev
On 08 Oct 2004 at 01h10, Andi Kleen wrote:
Hi,
> > Deadlocks from recursion, presumably? We could probably throw in a
> > max retry count, as ugly as that is..
>
> There should not be any recursion, no.
If printk() is synchronous, there could be, if there's a printk() in the
codepath taken by dev->hard_start_xmit()... But I don't if it is...
> The problem is that the poll is effectively a spinlock. But when
> another CPU takes an long interrupt while holding the lock it
> could take quite a long time to grab the lock.
>
> For most network drivers this shouldn't occur though because
> the net driver private lock is usually always taken with interrupts
> off.
Second newbie question: how are the interrupts disabled, is it via
local_irq_save()/local_irq_restore()? or is it something else ?
--
Colin
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-08 6:46 ` Colin Leroy
@ 2004-10-08 21:53 ` Matt Mackall
0 siblings, 0 replies; 34+ messages in thread
From: Matt Mackall @ 2004-10-08 21:53 UTC (permalink / raw)
To: Colin Leroy; +Cc: Andi Kleen, David S. Miller, akpm, netdev
On Fri, Oct 08, 2004 at 08:46:40AM +0200, Colin Leroy wrote:
> On 08 Oct 2004 at 01h10, Andi Kleen wrote:
>
> Hi,
>
> > > Deadlocks from recursion, presumably? We could probably throw in a
> > > max retry count, as ugly as that is..
> >
> > There should not be any recursion, no.
>
> If printk() is synchronous, there could be, if there's a printk() in the
> codepath taken by dev->hard_start_xmit()... But I don't if it is...
Have you looked at the code path that does carrier detect for possibly
recursing printks?
> > For most network drivers this shouldn't occur though because
> > the net driver private lock is usually always taken with interrupts
> > off.
>
> Second newbie question: how are the interrupts disabled, is it via
> local_irq_save()/local_irq_restore()? or is it something else ?
Often by virtue of being in an interrupt context already.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 21:45 ` Andi Kleen
2004-10-07 21:50 ` Matt Mackall
@ 2004-10-08 7:06 ` Colin Leroy
2004-10-08 22:00 ` Matt Mackall
1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-08 7:06 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, mpm, akpm, netdev
On 07 Oct 2004 at 23h10, Andi Kleen wrote:
Hi,
> > This patch should do that. It works OK for me, but I'd like it
> > checked before sent upstream...
> >
> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
>
> IMHO it's not needed. Taking xmit_lock is harmless even when
> the NETIF_F_LLTX flag is set.
Should that be completely dropped, or is it still ok ? (I think
differenciating action based on hard_start_xmit status, that is, don't
goto repeat undefinitely when NETDEV_TX_BUSY, could be a good idea).
I mean, should I rework that patch, forget about it or leave it as-is?
Concerning the hang, I see that Andrew has put my first patch, the one
checking for netif_carrier_ok(), in his tree. Is it an OK solution from
your (net dev hackers) point of view?
Thanks,
--
Colin
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-08 7:06 ` Colin Leroy
@ 2004-10-08 22:00 ` Matt Mackall
2004-10-08 22:18 ` Andrew Morton
0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-08 22:00 UTC (permalink / raw)
To: Colin Leroy; +Cc: Andi Kleen, David S. Miller, akpm, netdev
On Fri, Oct 08, 2004 at 09:06:10AM +0200, Colin Leroy wrote:
> On 07 Oct 2004 at 23h10, Andi Kleen wrote:
>
> Hi,
>
> > > This patch should do that. It works OK for me, but I'd like it
> > > checked before sent upstream...
> > >
> > > However, it doesn't fix the hang. it looks like this hang is really
> > > coming from sungem.
> >
> > IMHO it's not needed. Taking xmit_lock is harmless even when
> > the NETIF_F_LLTX flag is set.
>
> Should that be completely dropped, or is it still ok ? (I think
> differenciating action based on hard_start_xmit status, that is, don't
> goto repeat undefinitely when NETDEV_TX_BUSY, could be a good idea).
> I mean, should I rework that patch, forget about it or leave it as-is?
Well the purpose of the LLTX flag is to reduce serializing on the
xmit_lock. If we take the lock anyway, that should be harmless as Andi
says. So I'm afraid it looks to be a performance fix at best (which is
a low priority here). Let's back burner it for now.
> Concerning the hang, I see that Andrew has put my first patch, the one
> checking for netif_carrier_ok(), in his tree. Is it an OK solution from
> your (net dev hackers) point of view?
It seems to be papering over a driver bug of some sort, which is not
the way we like to fix things.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-08 22:00 ` Matt Mackall
@ 2004-10-08 22:18 ` Andrew Morton
2004-10-11 3:59 ` David S. Miller
0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2004-10-08 22:18 UTC (permalink / raw)
To: Matt Mackall; +Cc: colin, ak, davem, netdev
Matt Mackall <mpm@selenic.com> wrote:
>
> > Concerning the hang, I see that Andrew has put my first patch, the one
> > checking for netif_carrier_ok(), in his tree.
I dropped it again, waiting for the dust to settle on this one.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-08 22:18 ` Andrew Morton
@ 2004-10-11 3:59 ` David S. Miller
2004-10-11 15:40 ` Andi Kleen
0 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-11 3:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: mpm, colin, ak, netdev
Wait, I think I see the problem.
Sungem processes link status in it's ->poll() NAPI handler.
This occurs via calls to gem_pcs_interrupt(), for example.
Non-pcs sungem variants use a timer to poll link status.
When the link changes state, this link state processing
does printk()'s.
So perhaps that is why it deadlocks.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 3:59 ` David S. Miller
@ 2004-10-11 15:40 ` Andi Kleen
2004-10-11 16:22 ` Matt Mackall
0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 15:40 UTC (permalink / raw)
To: David S. Miller; +Cc: Andrew Morton, mpm, colin, ak, netdev
On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
>
> Wait, I think I see the problem.
>
> Sungem processes link status in it's ->poll() NAPI handler.
> This occurs via calls to gem_pcs_interrupt(), for example.
> Non-pcs sungem variants use a timer to poll link status.
>
> When the link changes state, this link state processing
> does printk()'s.
>
> So perhaps that is why it deadlocks.
printk handles recursion with the down_trylock on console_sem.
So it shouldn't deadlock.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 15:40 ` Andi Kleen
@ 2004-10-11 16:22 ` Matt Mackall
2004-10-11 16:32 ` Andi Kleen
0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:22 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev
On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> >
> > Wait, I think I see the problem.
> >
> > Sungem processes link status in it's ->poll() NAPI handler.
> > This occurs via calls to gem_pcs_interrupt(), for example.
> > Non-pcs sungem variants use a timer to poll link status.
> >
> > When the link changes state, this link state processing
> > does printk()'s.
> >
> > So perhaps that is why it deadlocks.
>
> printk handles recursion with the down_trylock on console_sem.
> So it shouldn't deadlock.
If we're in the ->poll() handler for non-netpoll reasons, and the link
state changes, causing a printk, we'll potentially reenter ->poll() via
netconsole.
This also explains the original fix quite nicely. Colin, can you try
commenting out all the printks in gem_pcs_interrupt and if that works,
we'll start thinking about a proper fix.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:22 ` Matt Mackall
@ 2004-10-11 16:32 ` Andi Kleen
2004-10-11 16:36 ` Matt Mackall
0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 16:32 UTC (permalink / raw)
To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev
On Mon, Oct 11, 2004 at 11:22:24AM -0500, Matt Mackall wrote:
> On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> > On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> > >
> > > Wait, I think I see the problem.
> > >
> > > Sungem processes link status in it's ->poll() NAPI handler.
> > > This occurs via calls to gem_pcs_interrupt(), for example.
> > > Non-pcs sungem variants use a timer to poll link status.
> > >
> > > When the link changes state, this link state processing
> > > does printk()'s.
> > >
> > > So perhaps that is why it deadlocks.
> >
> > printk handles recursion with the down_trylock on console_sem.
> > So it shouldn't deadlock.
>
> If we're in the ->poll() handler for non-netpoll reasons, and the link
> state changes, causing a printk, we'll potentially reenter ->poll() via
> netconsole.
It won't because printk catches the case.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:32 ` Andi Kleen
@ 2004-10-11 16:36 ` Matt Mackall
2004-10-11 16:43 ` Andi Kleen
0 siblings, 1 reply; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:36 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev
On Mon, Oct 11, 2004 at 06:32:26PM +0200, Andi Kleen wrote:
> On Mon, Oct 11, 2004 at 11:22:24AM -0500, Matt Mackall wrote:
> > On Mon, Oct 11, 2004 at 05:40:00PM +0200, Andi Kleen wrote:
> > > On Sun, Oct 10, 2004 at 08:59:28PM -0700, David S. Miller wrote:
> > > >
> > > > Wait, I think I see the problem.
> > > >
> > > > Sungem processes link status in it's ->poll() NAPI handler.
> > > > This occurs via calls to gem_pcs_interrupt(), for example.
> > > > Non-pcs sungem variants use a timer to poll link status.
> > > >
> > > > When the link changes state, this link state processing
> > > > does printk()'s.
> > > >
> > > > So perhaps that is why it deadlocks.
> > >
> > > printk handles recursion with the down_trylock on console_sem.
> > > So it shouldn't deadlock.
> >
> > If we're in the ->poll() handler for non-netpoll reasons, and the link
> > state changes, causing a printk, we'll potentially reenter ->poll() via
> > netconsole.
>
> It won't because printk catches the case.
It's not recursion on printk that's a problem, it's recursion on
->poll() and attempting to take whatever internal driver locks.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:36 ` Matt Mackall
@ 2004-10-11 16:43 ` Andi Kleen
2004-10-11 16:58 ` Matt Mackall
0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 16:43 UTC (permalink / raw)
To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev
> It's not recursion on printk that's a problem, it's recursion on
> ->poll() and attempting to take whatever internal driver locks.
There is no recursion on poll because printk will never call into
the low level console driver when the console sem is already taken.
-Andi
P.S.: I made the same mistake long ago, but akpm set me straight.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:43 ` Andi Kleen
@ 2004-10-11 16:58 ` Matt Mackall
2004-10-11 17:41 ` Andi Kleen
2004-10-11 20:45 ` Colin Leroy
0 siblings, 2 replies; 34+ messages in thread
From: Matt Mackall @ 2004-10-11 16:58 UTC (permalink / raw)
To: Andi Kleen; +Cc: David S. Miller, Andrew Morton, colin, netdev
On Mon, Oct 11, 2004 at 06:43:15PM +0200, Andi Kleen wrote:
> > It's not recursion on printk that's a problem, it's recursion on
> > ->poll() and attempting to take whatever internal driver locks.
>
> There is no recursion on poll because printk will never call into
> the low level console driver when the console sem is already taken.
Ergh, you deleted the context. Again, imagine we're originally in
->poll() for _a non-netpoll-related reason_. In other words, the
console sem is not taken, because we're just doing routine network
I/O. While in poll(), we take a private driver lock. Then for whatever
reason, we printk -> netconsole -> netpoll -> poll() again where we
attempt to retake the private driver lock.
> P.S.: I made the same mistake long ago, but akpm set me straight.
I'm pretty sure this case is different.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:58 ` Matt Mackall
@ 2004-10-11 17:41 ` Andi Kleen
2004-10-11 20:45 ` Colin Leroy
1 sibling, 0 replies; 34+ messages in thread
From: Andi Kleen @ 2004-10-11 17:41 UTC (permalink / raw)
To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, colin, netdev
On Mon, Oct 11, 2004 at 11:58:52AM -0500, Matt Mackall wrote:
> On Mon, Oct 11, 2004 at 06:43:15PM +0200, Andi Kleen wrote:
> > > It's not recursion on printk that's a problem, it's recursion on
> > > ->poll() and attempting to take whatever internal driver locks.
> >
> > There is no recursion on poll because printk will never call into
> > the low level console driver when the console sem is already taken.
>
> Ergh, you deleted the context. Again, imagine we're originally in
> ->poll() for _a non-netpoll-related reason_. In other words, the
> console sem is not taken, because we're just doing routine network
> I/O. While in poll(), we take a private driver lock. Then for whatever
> reason, we printk -> netconsole -> netpoll -> poll() again where we
> attempt to retake the private driver lock.
You're right, in that case you could get a deadlock.
-Andi
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-11 16:58 ` Matt Mackall
2004-10-11 17:41 ` Andi Kleen
@ 2004-10-11 20:45 ` Colin Leroy
[not found] ` <5cac192f0410181443303379e2@mail.gmail.com>
1 sibling, 1 reply; 34+ messages in thread
From: Colin Leroy @ 2004-10-11 20:45 UTC (permalink / raw)
To: Matt Mackall; +Cc: Andi Kleen, David S. Miller, Andrew Morton, netdev
On 11 Oct 2004 at 11h10, Matt Mackall wrote:
Hi,
> Ergh, you deleted the context. Again, imagine we're originally in
> ->poll() for _a non-netpoll-related reason_. In other words, the
> console sem is not taken, because we're just doing routine network
> I/O. While in poll(), we take a private driver lock. Then for whatever
> reason, we printk -> netconsole -> netpoll -> poll() again where we
> attempt to retake the private driver lock.
>
> > P.S.: I made the same mistake long ago, but akpm set me straight.
>
> I'm pretty sure this case is different.
I tried replacing spin_lock by spin_trylock in poll() and a few other
functions (and returning accordingly), but it didn't help. I'm beginning
to think I won't be able to debug this one! ;)
--
Colin
"When in doubt, do the right thing."
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 20:44 ` Colin Leroy
2004-10-07 21:45 ` Andi Kleen
@ 2004-10-07 22:08 ` David S. Miller
2004-10-08 6:54 ` Colin Leroy
1 sibling, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-10-07 22:08 UTC (permalink / raw)
To: Colin Leroy; +Cc: mpm, akpm, netdev
On Thu, 7 Oct 2004 22:44:22 +0200
Colin Leroy <colin@colino.net> wrote:
> On 07 Oct 2004 at 11h10, David S. Miller wrote:
>
> Hi again,
>
> > So, netpoll needs to have the NETIF_F_LLTX stuff added to it.
>
> This patch should do that. It works OK for me, but I'd like it checked
> before sent upstream...
>
> However, it doesn't fix the hang. it looks like this hang is really
> coming from sungem.
Is it hanging inside of the ->hard_start_xmit() call or somewhere
else? Do you have a way to determine this without adding printk()'s
and thus causing recursion as you mentioned earlier? :-)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] Prevent netpoll hanging when link is down
2004-10-07 22:08 ` David S. Miller
@ 2004-10-08 6:54 ` Colin Leroy
0 siblings, 0 replies; 34+ messages in thread
From: Colin Leroy @ 2004-10-08 6:54 UTC (permalink / raw)
To: David S. Miller; +Cc: mpm, akpm, netdev
On 07 Oct 2004 at 15h10, David S. Miller wrote:
Hi,
> > However, it doesn't fix the hang. it looks like this hang is really
> > coming from sungem.
>
> Is it hanging inside of the ->hard_start_xmit() call
I think so, but my way of discovering it may not be very good: I tested
by replacing
status = np->dev->hard_start_xmit(...);
by
status = NETDEV_TX_OK, then status = NETDEV_TX_BUSY, then status =
NETDEV_TX_LOCKED
in netpoll.c (avoiding to call hard_start_xmit()), and it didn't hang.
> or somewhere else? Do you have a way to determine this without adding
> printk()'s and thus causing recursion as you mentioned earlier? :-)
Well, that's my big problem :-) I can't use the spinlock debugging
neither, because I'm on uniprocessor and on PPC.
I tried removing printk()s from gem_start_xmit() codepath, but it didn't
help either, so I don't think the lock comes from a printk()
recursion...
(It's really hard to debug that kind of stuff! I'm learning quite a few
things :))
--
Colin
^ permalink raw reply [flat|nested] 34+ messages in thread