* netdev tx timeouts
@ 2006-09-13 2:25 Larry Finger
[not found] ` <45076C00.2000100-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Larry Finger @ 2006-09-13 2:25 UTC (permalink / raw)
To: Michael Buesch
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Stefano Brivio,
Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w
Michael,
I still have not gotten a network guru to answer any questions about
synchronize_net, but I have been testing the patch below:
Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
* be preemtible.
*/
mutex_lock(&bcm->mutex);
- netif_stop_queue(bcm->net_dev);
synchronize_net();
+ netif_stop_queue(bcm->net_dev);
spin_lock_irqsave(&bcm->irq_lock, flags);
bcm43xx_mac_suspend(bcm);
if (bcm43xx_using_pio(bcm))
With the synchronize_net call before the netif_stop_queue as shown, my device
has run since last Saturday with no netdev watchdog tx timeouts. Roughly two
days of that testing was done at the accelerated rate of 60X normal.
I still hope to get access to a guru, but if that doesn't happen soon, I'm going
to push this change so that it gets into 2.6.19.
Larry
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
[not found] ` <45076C00.2000100-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
@ 2006-09-13 12:30 ` Michael Buesch
[not found] ` <200609131430.53820.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Michael Buesch @ 2006-09-13 12:30 UTC (permalink / raw)
To: Larry Finger
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
Stefano Brivio
On Wednesday 13 September 2006 04:25, Larry Finger wrote:
> Michael,
>
> I still have not gotten a network guru to answer any questions about
> synchronize_net, but I have been testing the patch below:
I'd say this is racy.
Did you test this on SMP?
> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> ===================================================================
> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> @@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
> * be preemtible.
> */
> mutex_lock(&bcm->mutex);
> - netif_stop_queue(bcm->net_dev);
> synchronize_net();
A TX handler starts on another CPU.
> + netif_stop_queue(bcm->net_dev);
It's still running... boom.
> spin_lock_irqsave(&bcm->irq_lock, flags);
> bcm43xx_mac_suspend(bcm);
> if (bcm43xx_using_pio(bcm))
>
> With the synchronize_net call before the netif_stop_queue as shown, my device
> has run since last Saturday with no netdev watchdog tx timeouts. Roughly two
> days of that testing was done at the accelerated rate of 60X normal.
>
> I still hope to get access to a guru, but if that doesn't happen soon, I'm going
> to push this change so that it gets into 2.6.19.
>
> Larry
>
>
--
Greetings Michael.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
[not found] ` <200609131430.53820.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
@ 2006-09-13 13:25 ` Larry Finger
2006-09-13 13:49 ` Michael Buesch
0 siblings, 1 reply; 9+ messages in thread
From: Larry Finger @ 2006-09-13 13:25 UTC (permalink / raw)
To: Michael Buesch
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
Stefano Brivio
Michael Buesch wrote:
> On Wednesday 13 September 2006 04:25, Larry Finger wrote:
>> Michael,
>>
>> I still have not gotten a network guru to answer any questions about
>> synchronize_net, but I have been testing the patch below:
>
> I'd say this is racy.
> Did you test this on SMP?
No - I don't have the hardware.
>
>> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> ===================================================================
>> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> @@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
>> * be preemtible.
>> */
>> mutex_lock(&bcm->mutex);
>> - netif_stop_queue(bcm->net_dev);
>> synchronize_net();
>
> A TX handler starts on another CPU.
>
>> + netif_stop_queue(bcm->net_dev);
>
> It's still running... boom.
>
I see your point, but the current way breaks a UP system! What to do?
Larry
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
2006-09-13 13:25 ` Larry Finger
@ 2006-09-13 13:49 ` Michael Buesch
2006-09-13 14:12 ` Michael Buesch
2006-09-14 1:23 ` Stephen Hemminger
0 siblings, 2 replies; 9+ messages in thread
From: Michael Buesch @ 2006-09-13 13:49 UTC (permalink / raw)
To: Larry Finger; +Cc: bcm43xx-dev, netdev, Stefano Brivio
On Wednesday 13 September 2006 15:25, Larry Finger wrote:
> Michael Buesch wrote:
> > On Wednesday 13 September 2006 04:25, Larry Finger wrote:
> >> Michael,
> >>
> >> I still have not gotten a network guru to answer any questions about
> >> synchronize_net, but I have been testing the patch below:
> >
> > I'd say this is racy.
> > Did you test this on SMP?
>
> No - I don't have the hardware.
> >
> >> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> >> ===================================================================
> >> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> >> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> >> @@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
> >> * be preemtible.
> >> */
> >> mutex_lock(&bcm->mutex);
> >> - netif_stop_queue(bcm->net_dev);
> >> synchronize_net();
> >
> > A TX handler starts on another CPU.
> >
> >> + netif_stop_queue(bcm->net_dev);
> >
> > It's still running... boom.
> >
>
> I see your point, but the current way breaks a UP system! What to do?
Simple. Reading the code of synchronize_net() and
netif_stop_queue() and thinking about why it breaks, instead
of committing bugfixes that only substitute one bug by another. ;)
I'll take a look, too.
--
Greetings Michael.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
2006-09-13 13:49 ` Michael Buesch
@ 2006-09-13 14:12 ` Michael Buesch
2006-09-14 1:23 ` Stephen Hemminger
1 sibling, 0 replies; 9+ messages in thread
From: Michael Buesch @ 2006-09-13 14:12 UTC (permalink / raw)
To: Larry Finger; +Cc: bcm43xx-dev, netdev, Stefano Brivio
On Wednesday 13 September 2006 15:49, Michael Buesch wrote:
> On Wednesday 13 September 2006 15:25, Larry Finger wrote:
> > Michael Buesch wrote:
> > > On Wednesday 13 September 2006 04:25, Larry Finger wrote:
> > >> Michael,
> > >>
> > >> I still have not gotten a network guru to answer any questions about
> > >> synchronize_net, but I have been testing the patch below:
> > >
> > > I'd say this is racy.
> > > Did you test this on SMP?
> >
> > No - I don't have the hardware.
> > >
> > >> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> ===================================================================
> > >> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> @@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
> > >> * be preemtible.
> > >> */
> > >> mutex_lock(&bcm->mutex);
> > >> - netif_stop_queue(bcm->net_dev);
> > >> synchronize_net();
> > >
> > > A TX handler starts on another CPU.
> > >
> > >> + netif_stop_queue(bcm->net_dev);
> > >
> > > It's still running... boom.
> > >
> >
> > I see your point, but the current way breaks a UP system! What to do?
>
> Simple. Reading the code of synchronize_net() and
> netif_stop_queue() and thinking about why it breaks, instead
> of committing bugfixes that only substitute one bug by another. ;)
> I'll take a look, too.
Ok, I am pretty sure now we need the following patch to fix this.
Can you test?
If it does not crash anymore, please push upstream.
Signed-off-by: Michael Buesch <mb@bu3sch.de>
Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c 2006-09-13 16:09:39.000000000 +0200
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c 2006-09-13 16:10:43.000000000 +0200
@@ -3169,8 +3169,7 @@ static void bcm43xx_periodic_work_handle
* be preemtible.
*/
mutex_lock(&bcm->mutex);
- netif_stop_queue(bcm->net_dev);
- synchronize_net();
+ netif_tx_disable(bcm->net_dev);
spin_lock_irqsave(&bcm->irq_lock, flags);
bcm43xx_mac_suspend(bcm);
if (bcm43xx_using_pio(bcm))
--
Greetings Michael.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
2006-09-13 13:49 ` Michael Buesch
2006-09-13 14:12 ` Michael Buesch
@ 2006-09-14 1:23 ` Stephen Hemminger
[not found] ` <20060914102337.137d4591-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 9+ messages in thread
From: Stephen Hemminger @ 2006-09-14 1:23 UTC (permalink / raw)
To: Michael Buesch; +Cc: Larry Finger, bcm43xx-dev, netdev, Stefano Brivio
On Wed, 13 Sep 2006 15:49:23 +0200
Michael Buesch <mb@bu3sch.de> wrote:
> On Wednesday 13 September 2006 15:25, Larry Finger wrote:
> > Michael Buesch wrote:
> > > On Wednesday 13 September 2006 04:25, Larry Finger wrote:
> > >> Michael,
> > >>
> > >> I still have not gotten a network guru to answer any questions about
> > >> synchronize_net, but I have been testing the patch below:
> > >
> > > I'd say this is racy.
> > > Did you test this on SMP?
> >
> > No - I don't have the hardware.
> > >
> > >> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> ===================================================================
> > >> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> > >> @@ -3169,8 +3169,8 @@ static void bcm43xx_periodic_work_handle
> > >> * be preemtible.
> > >> */
> > >> mutex_lock(&bcm->mutex);
> > >> - netif_stop_queue(bcm->net_dev);
> > >> synchronize_net();
> > >
> > > A TX handler starts on another CPU.
> > >
> > >> + netif_stop_queue(bcm->net_dev);
> > >
> > > It's still running... boom.
> > >
> >
> > I see your point, but the current way breaks a UP system! What to do?
>
> Simple. Reading the code of synchronize_net() and
> netif_stop_queue() and thinking about why it breaks, instead
> of committing bugfixes that only substitute one bug by another. ;)
> I'll take a look, too.
Why are you doing the synchronize_net()? it is meant for RCU.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
[not found] ` <20060914102337.137d4591-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2006-09-14 2:04 ` Larry Finger
2006-09-14 2:21 ` Stephen Hemminger
0 siblings, 1 reply; 9+ messages in thread
From: Larry Finger @ 2006-09-14 2:04 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
Michael Buesch, Stefano Brivio
Stephen Hemminger wrote:
> On Wed, 13 Sep 2006 15:49:23 +0200
> Michael Buesch <mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org> wrote:
>> Simple. Reading the code of synchronize_net() and
>> netif_stop_queue() and thinking about why it breaks, instead
>> of committing bugfixes that only substitute one bug by another. ;)
>> I'll take a look, too.
>
> Why are you doing the synchronize_net()? it is meant for RCU.
We know and it no longer is in the code. We have known for a couple of days that
it was the synchronize_net() step that led to the netdev timeouts, but we were
afraid that a bare netif_stop_queue would not be SMP safe. The current structure has
mutex_lock
netif_tx_disable(dev) (equivalent to netif_tx_lock_bh(dev);
netif_stop_queue(dev);
netif_tx_unlock_bh(dev);
spin_lock_irqsafe
I see you listed as a maintainer in several network-related parts of the system,
so AFAIK, you are a network guru. Do you think this will work? I have tested
code with just a netif_stop_queue (without the lock_bh/unlock_bh parts) on a UP
system and have gotten no errors, but I do not have access to SMP hardware.
Thanks,
Larry
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
2006-09-14 2:04 ` Larry Finger
@ 2006-09-14 2:21 ` Stephen Hemminger
2006-09-14 2:35 ` Larry Finger
0 siblings, 1 reply; 9+ messages in thread
From: Stephen Hemminger @ 2006-09-14 2:21 UTC (permalink / raw)
To: Larry Finger; +Cc: Michael Buesch, bcm43xx-dev, netdev, Stefano Brivio
On Wed, 13 Sep 2006 21:04:02 -0500
Larry Finger <Larry.Finger@lwfinger.net> wrote:
> Stephen Hemminger wrote:
> > On Wed, 13 Sep 2006 15:49:23 +0200
> > Michael Buesch <mb@bu3sch.de> wrote:
> >> Simple. Reading the code of synchronize_net() and
> >> netif_stop_queue() and thinking about why it breaks, instead
> >> of committing bugfixes that only substitute one bug by another. ;)
> >> I'll take a look, too.
> >
> > Why are you doing the synchronize_net()? it is meant for RCU.
>
> We know and it no longer is in the code. We have known for a couple of days that
> it was the synchronize_net() step that led to the netdev timeouts, but we were
> afraid that a bare netif_stop_queue would not be SMP safe. The current structure has
>
> mutex_lock
> netif_tx_disable(dev) (equivalent to netif_tx_lock_bh(dev);
> netif_stop_queue(dev);
> netif_tx_unlock_bh(dev);
> spin_lock_irqsafe
>
> I see you listed as a maintainer in several network-related parts of the system,
> so AFAIK, you are a network guru. Do you think this will work? I have tested
> code with just a netif_stop_queue (without the lock_bh/unlock_bh parts) on a UP
> system and have gotten no errors, but I do not have access to SMP hardware.
>
> Thanks,
>
I haven't done a careful review of the broadcom driver. What you
are proposing looks fine. But most network devices just use spin_lock's
rather than mutexs because there is little need for holding the lock
for a long length of time.
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: netdev tx timeouts
2006-09-14 2:21 ` Stephen Hemminger
@ 2006-09-14 2:35 ` Larry Finger
0 siblings, 0 replies; 9+ messages in thread
From: Larry Finger @ 2006-09-14 2:35 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Michael Buesch, bcm43xx-dev, netdev, Stefano Brivio
Stephen Hemminger wrote:
> On Wed, 13 Sep 2006 21:04:02 -0500
> Larry Finger <Larry.Finger@lwfinger.net> wrote:
>
>> Stephen Hemminger wrote:
>>> On Wed, 13 Sep 2006 15:49:23 +0200
>>> Michael Buesch <mb@bu3sch.de> wrote:
>>>> Simple. Reading the code of synchronize_net() and
>>>> netif_stop_queue() and thinking about why it breaks, instead
>>>> of committing bugfixes that only substitute one bug by another. ;)
>>>> I'll take a look, too.
>>> Why are you doing the synchronize_net()? it is meant for RCU.
>> We know and it no longer is in the code. We have known for a couple of days that
>> it was the synchronize_net() step that led to the netdev timeouts, but we were
>> afraid that a bare netif_stop_queue would not be SMP safe. The current structure has
>>
>> mutex_lock
>> netif_tx_disable(dev) (equivalent to netif_tx_lock_bh(dev);
>> netif_stop_queue(dev);
>> netif_tx_unlock_bh(dev);
>> spin_lock_irqsafe
>>
>> I see you listed as a maintainer in several network-related parts of the system,
>> so AFAIK, you are a network guru. Do you think this will work? I have tested
>> code with just a netif_stop_queue (without the lock_bh/unlock_bh parts) on a UP
>> system and have gotten no errors, but I do not have access to SMP hardware.
>>
>> Thanks,
>>
>
> I haven't done a careful review of the broadcom driver. What you
> are proposing looks fine. But most network devices just use spin_lock's
> rather than mutexs because there is little need for holding the lock
> for a long length of time.
I didn't code this section, but I know that the processing that we are about to
do after the mutex can be quite lengthy - perhaps as long as 200 ms. Prior to
2.6.18, it was protected with a spin_lock; however, there were complaints about
the effect on latency, so the mutex approach was taken to allow this long
section to be preemptible. At appropriate points in the processing, the program
calls cond_resched() if the kernel has preemption configured. This change
resulted in the timeouts of the subject, which we are now trying to fix.
Larry
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2006-09-14 2:35 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-13 2:25 netdev tx timeouts Larry Finger
[not found] ` <45076C00.2000100-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
2006-09-13 12:30 ` Michael Buesch
[not found] ` <200609131430.53820.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
2006-09-13 13:25 ` Larry Finger
2006-09-13 13:49 ` Michael Buesch
2006-09-13 14:12 ` Michael Buesch
2006-09-14 1:23 ` Stephen Hemminger
[not found] ` <20060914102337.137d4591-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2006-09-14 2:04 ` Larry Finger
2006-09-14 2:21 ` Stephen Hemminger
2006-09-14 2:35 ` Larry Finger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).