[PATCH] bcm43xx: further fix for periodic work errors

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] bcm43xx: further fix for periodic work errors
@ 2006-09-23  4:08 Larry Finger
       [not found] ` <4514B322.mail1K91A36N8-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Larry Finger @ 2006-09-23  4:08 UTC (permalink / raw)
  To: John Linville
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
	Michael Buesch, Stefano Brivio

Recent changes in the setup for preemptible periodic work fixed most
of the problems with NETDEV watchdog timeouts; however, some variants
of the bcm43xx device still had the problem. These were fixed by setting
the parameter MAXIMUM_BADNESS to 0. By doing so, all the functionality
associated with calculating the 'badness' of the upcoming periodic work
is no longer needed; therefore it is removed.

Signed-off-by: Larry Finger <Larry.Finger-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
---

John,

This patch relies on "[PATCH] bcm43xx: fix netdev watchdog timeouts", which
was submitted on 9/14/06. It is important for this one, as well as those
already queued, to make the 2.6.19 cutoff.

Thanks,

Larry


Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -3136,67 +3136,32 @@ static void do_periodic_work(struct bcm4
 	schedule_delayed_work(&bcm->periodic_work, HZ * 15);
 }
 
-/* Estimate a "Badness" value based on the periodic work
- * state-machine state. "Badness" is worse (bigger), if the
- * periodic work will take longer.
- */
-static int estimate_periodic_work_badness(unsigned int state)
-{
-	int badness = 0;
-
-	if (state % 8 == 0) /* every 120 sec */
-		badness += 10;
-	if (state % 4 == 0) /* every 60 sec */
-		badness += 5;
-	if (state % 2 == 0) /* every 30 sec */
-		badness += 1;
-	if (state % 1 == 0) /* every 15 sec */
-		badness += 1;
-
-#define BADNESS_LIMIT	4
-	return badness;
-}
-
 static void bcm43xx_periodic_work_handler(void *d)
 {
 	struct bcm43xx_private *bcm = d;
 	unsigned long flags;
 	u32 savedirqs = 0;
-	int badness;
-
-	badness = estimate_periodic_work_badness(bcm->periodic_state);
-	if (badness > BADNESS_LIMIT) {
-		/* Periodic work will take a long time, so we want it to
-		 * be preemtible.
-		 */
-		mutex_lock(&bcm->mutex);
-		netif_tx_disable(bcm->net_dev);
-		spin_lock_irqsave(&bcm->irq_lock, flags);
-		bcm43xx_mac_suspend(bcm);
-		if (bcm43xx_using_pio(bcm))
-			bcm43xx_pio_freeze_txqueues(bcm);
-		savedirqs = bcm43xx_interrupt_disable(bcm, BCM43xx_IRQ_ALL);
-		spin_unlock_irqrestore(&bcm->irq_lock, flags);
-		bcm43xx_synchronize_irq(bcm);
-	} else {
-		/* Periodic work should take short time, so we want low
-		 * locking overhead.
-		 */
-		mutex_lock(&bcm->mutex);
-		spin_lock_irqsave(&bcm->irq_lock, flags);
-	}
 
+	/* Periodic work may take a long time, so we want it to
+	 * be preemtible. In any case, we need to disable transmits.
+	 */
+	mutex_lock(&bcm->mutex);
+	netif_tx_disable(bcm->net_dev);
+	spin_lock_irqsave(&bcm->irq_lock, flags);
+	bcm43xx_mac_suspend(bcm);
+	if (bcm43xx_using_pio(bcm))
+		bcm43xx_pio_freeze_txqueues(bcm);
+	savedirqs = bcm43xx_interrupt_disable(bcm, BCM43xx_IRQ_ALL);
+	spin_unlock_irqrestore(&bcm->irq_lock, flags);
+	bcm43xx_synchronize_irq(bcm);
 	do_periodic_work(bcm);
-
-	if (badness > BADNESS_LIMIT) {
-		spin_lock_irqsave(&bcm->irq_lock, flags);
-		tasklet_enable(&bcm->isr_tasklet);
-		bcm43xx_interrupt_enable(bcm, savedirqs);
-		if (bcm43xx_using_pio(bcm))
-			bcm43xx_pio_thaw_txqueues(bcm);
-		bcm43xx_mac_enable(bcm);
-		netif_wake_queue(bcm->net_dev);
-	}
+	spin_lock_irqsave(&bcm->irq_lock, flags);
+	tasklet_enable(&bcm->isr_tasklet);
+	bcm43xx_interrupt_enable(bcm, savedirqs);
+	if (bcm43xx_using_pio(bcm))
+		bcm43xx_pio_thaw_txqueues(bcm);
+	bcm43xx_mac_enable(bcm);
+	netif_wake_queue(bcm->net_dev);
 	mmiowb();
 	spin_unlock_irqrestore(&bcm->irq_lock, flags);
 	mutex_unlock(&bcm->mutex);

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bcm43xx: further fix for periodic work errors
       [not found] ` <4514B322.mail1K91A36N8-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
@ 2006-09-23  7:56   ` Michael Buesch
       [not found]     ` <200609230956.05475.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Buesch @ 2006-09-23  7:56 UTC (permalink / raw)
  To: Larry Finger
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
	John Linville, Stefano Brivio

On Saturday 23 September 2006 06:08, Larry Finger wrote:
> Recent changes in the setup for preemptible periodic work fixed most
> of the problems with NETDEV watchdog timeouts; however, some variants
> of the bcm43xx device still had the problem. These were fixed by setting
> the parameter MAXIMUM_BADNESS to 0. By doing so, all the functionality
> associated with calculating the 'badness' of the upcoming periodic work
> is no longer needed; therefore it is removed.

Uhm, no. Wait. _Why_ does the watchdog trigger.
All periodic work in the fastpath (which you remove with this patch)
is supposed to execute in a few microseconds.
I don't think we want to fix this my removing the fastpath and always
taking the _expensive_ slowpath periodic work.

So why does the watchdog trigger for the fast periodic work?
We need to find out.
Removing the fastpath is just bad for overall latency.

The two fastpath periodic works are 15 and 30, if executed
standalone. If the 15 and/or 30 is execiuted alongside with
a 60sec work, it's all slowpath, of course.

> Signed-off-by: Larry Finger <Larry.Finger-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
> ---
> 
> John,
> 
> This patch relies on "[PATCH] bcm43xx: fix netdev watchdog timeouts", which
> was submitted on 9/14/06. It is important for this one, as well as those
> already queued, to make the 2.6.19 cutoff.
> 
> Thanks,
> 
> Larry
> 
> 
> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> ===================================================================
> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
> @@ -3136,67 +3136,32 @@ static void do_periodic_work(struct bcm4
>  	schedule_delayed_work(&bcm->periodic_work, HZ * 15);
>  }
>  
> -/* Estimate a "Badness" value based on the periodic work
> - * state-machine state. "Badness" is worse (bigger), if the
> - * periodic work will take longer.
> - */
> -static int estimate_periodic_work_badness(unsigned int state)
> -{
> -	int badness = 0;
> -
> -	if (state % 8 == 0) /* every 120 sec */
> -		badness += 10;
> -	if (state % 4 == 0) /* every 60 sec */
> -		badness += 5;
> -	if (state % 2 == 0) /* every 30 sec */
> -		badness += 1;
> -	if (state % 1 == 0) /* every 15 sec */
> -		badness += 1;
> -
> -#define BADNESS_LIMIT	4
> -	return badness;
> -}
> -
>  static void bcm43xx_periodic_work_handler(void *d)
>  {
>  	struct bcm43xx_private *bcm = d;
>  	unsigned long flags;
>  	u32 savedirqs = 0;
> -	int badness;
> -
> -	badness = estimate_periodic_work_badness(bcm->periodic_state);
> -	if (badness > BADNESS_LIMIT) {
> -		/* Periodic work will take a long time, so we want it to
> -		 * be preemtible.
> -		 */
> -		mutex_lock(&bcm->mutex);
> -		netif_tx_disable(bcm->net_dev);
> -		spin_lock_irqsave(&bcm->irq_lock, flags);
> -		bcm43xx_mac_suspend(bcm);
> -		if (bcm43xx_using_pio(bcm))
> -			bcm43xx_pio_freeze_txqueues(bcm);
> -		savedirqs = bcm43xx_interrupt_disable(bcm, BCM43xx_IRQ_ALL);
> -		spin_unlock_irqrestore(&bcm->irq_lock, flags);
> -		bcm43xx_synchronize_irq(bcm);
> -	} else {
> -		/* Periodic work should take short time, so we want low
> -		 * locking overhead.
> -		 */
> -		mutex_lock(&bcm->mutex);
> -		spin_lock_irqsave(&bcm->irq_lock, flags);
> -	}
>  
> +	/* Periodic work may take a long time, so we want it to
> +	 * be preemtible. In any case, we need to disable transmits.
> +	 */
> +	mutex_lock(&bcm->mutex);
> +	netif_tx_disable(bcm->net_dev);
> +	spin_lock_irqsave(&bcm->irq_lock, flags);
> +	bcm43xx_mac_suspend(bcm);
> +	if (bcm43xx_using_pio(bcm))
> +		bcm43xx_pio_freeze_txqueues(bcm);
> +	savedirqs = bcm43xx_interrupt_disable(bcm, BCM43xx_IRQ_ALL);
> +	spin_unlock_irqrestore(&bcm->irq_lock, flags);
> +	bcm43xx_synchronize_irq(bcm);
>  	do_periodic_work(bcm);
> -
> -	if (badness > BADNESS_LIMIT) {
> -		spin_lock_irqsave(&bcm->irq_lock, flags);
> -		tasklet_enable(&bcm->isr_tasklet);
> -		bcm43xx_interrupt_enable(bcm, savedirqs);
> -		if (bcm43xx_using_pio(bcm))
> -			bcm43xx_pio_thaw_txqueues(bcm);
> -		bcm43xx_mac_enable(bcm);
> -		netif_wake_queue(bcm->net_dev);
> -	}
> +	spin_lock_irqsave(&bcm->irq_lock, flags);
> +	tasklet_enable(&bcm->isr_tasklet);
> +	bcm43xx_interrupt_enable(bcm, savedirqs);
> +	if (bcm43xx_using_pio(bcm))
> +		bcm43xx_pio_thaw_txqueues(bcm);
> +	bcm43xx_mac_enable(bcm);
> +	netif_wake_queue(bcm->net_dev);
>  	mmiowb();
>  	spin_unlock_irqrestore(&bcm->irq_lock, flags);
>  	mutex_unlock(&bcm->mutex);
> 

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bcm43xx: further fix for periodic work errors
       [not found]     ` <200609230956.05475.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
@ 2006-09-23 19:06       ` Larry Finger
       [not found]         ` <451585D1.3080102-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Larry Finger @ 2006-09-23 19:06 UTC (permalink / raw)
  To: Michael Buesch
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Stefano Brivio,
	Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w, John Linville

Michael Buesch wrote:
> On Saturday 23 September 2006 06:08, Larry Finger wrote:
>> Recent changes in the setup for preemptible periodic work fixed most
>> of the problems with NETDEV watchdog timeouts; however, some variants
>> of the bcm43xx device still had the problem. These were fixed by setting
>> the parameter MAXIMUM_BADNESS to 0. By doing so, all the functionality
>> associated with calculating the 'badness' of the upcoming periodic work
>> is no longer needed; therefore it is removed.
> 
> Uhm, no. Wait. _Why_ does the watchdog trigger.
> All periodic work in the fastpath (which you remove with this patch)
> is supposed to execute in a few microseconds.
> I don't think we want to fix this my removing the fastpath and always
> taking the _expensive_ slowpath periodic work.
> 
> So why does the watchdog trigger for the fast periodic work?
> We need to find out.
> Removing the fastpath is just bad for overall latency.
> 
> The two fastpath periodic works are 15 and 30, if executed
> standalone. If the 15 and/or 30 is execiuted alongside with
> a 60sec work, it's all slowpath, of course.

I was thinking that the 15 second periodic work called mac suspend, which is the most expensive part 
of the slowpath, but I see that is an unlikely condition. I'm now testing to see if moving the 
netif_tx_disable/netif_wake_queue pair into all paths fixes the errors. Those calls should be 
relatively inexpensive.

Larry

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bcm43xx: further fix for periodic work errors
       [not found]         ` <451585D1.3080102-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
@ 2006-09-23 19:41           ` Michael Buesch
  2006-09-23 20:05             ` Larry Finger
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Buesch @ 2006-09-23 19:41 UTC (permalink / raw)
  To: Larry Finger
  Cc: Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w, netdev-u79uwXL29TY76Z2rM5mHXA,
	bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w, John Linville, Stefano Brivio

On Saturday 23 September 2006 21:06, Larry Finger wrote:
> Michael Buesch wrote:
> > On Saturday 23 September 2006 06:08, Larry Finger wrote:
> >> Recent changes in the setup for preemptible periodic work fixed most
> >> of the problems with NETDEV watchdog timeouts; however, some variants
> >> of the bcm43xx device still had the problem. These were fixed by setting
> >> the parameter MAXIMUM_BADNESS to 0. By doing so, all the functionality
> >> associated with calculating the 'badness' of the upcoming periodic work
> >> is no longer needed; therefore it is removed.
> > 
> > Uhm, no. Wait. _Why_ does the watchdog trigger.
> > All periodic work in the fastpath (which you remove with this patch)
> > is supposed to execute in a few microseconds.
> > I don't think we want to fix this my removing the fastpath and always
> > taking the _expensive_ slowpath periodic work.
> > 
> > So why does the watchdog trigger for the fast periodic work?
> > We need to find out.
> > Removing the fastpath is just bad for overall latency.
> > 
> > The two fastpath periodic works are 15 and 30, if executed
> > standalone. If the 15 and/or 30 is execiuted alongside with
> > a 60sec work, it's all slowpath, of course.
> 
> I was thinking that the 15 second periodic work called mac suspend, which is the most expensive part 
> of the slowpath, but I see that is an unlikely condition. I'm now testing to see if moving the 
> netif_tx_disable/netif_wake_queue pair into all paths fixes the errors. Those calls should be 
> relatively inexpensive.

Well, even _if_ mac_suspend takes a few milliseconds (which it
does not), it would not trigger the watchdog.
I measured the time it takes to execute the various works
and based the badness selection on the results.

If the 15 or 30 second work is really able to trigger a watchdog
timeout, it's a _bug_ that needs to be fixed and not to be
papered over.
It won't trigger the watchdog, because it is running too long
uninterruptible (it won't run 5sec...). If it triggers, it's
triggered by something else (like the synchronize_net thingie
in the past).

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bcm43xx: further fix for periodic work errors
  2006-09-23 19:41           ` Michael Buesch
@ 2006-09-23 20:05             ` Larry Finger
       [not found]               ` <451593A0.8030104-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Larry Finger @ 2006-09-23 20:05 UTC (permalink / raw)
  To: Michael Buesch; +Cc: bcm43xx-dev, netdev, Stefano Brivio, John Linville

Michael Buesch wrote:
> 
> Well, even _if_ mac_suspend takes a few milliseconds (which it
> does not), it would not trigger the watchdog.
> I measured the time it takes to execute the various works
> and based the badness selection on the results.
> 
> If the 15 or 30 second work is really able to trigger a watchdog
> timeout, it's a _bug_ that needs to be fixed and not to be
> papered over.
> It won't trigger the watchdog, because it is running too long
> uninterruptible (it won't run 5sec...). If it triggers, it's
> triggered by something else (like the synchronize_net thingie
> in the past).

Even the synchronize_net problem wasn't taking 5 seconds to complete, it was messing up the transmit 
process.

I went back to check my logs again, and the actual error was "BCM43xx_IRQ_XMIT_ERROR", which is 
always preceded by a MAC suspend failed. These never happened all the time I was running with 
MAXIMUM_BADNESS of 0.

I think the _bug_ is letting the transmit process run while doing the periodic work, which is why 
I'm testing with the tx_disable before all periodic work. I'll let you know in 2 or 3 days if it 
fixes the problem. It takes that long to trigger.

Larry

Larry

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bcm43xx: further fix for periodic work errors
       [not found]               ` <451593A0.8030104-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
@ 2006-09-23 20:22                 ` Michael Buesch
  0 siblings, 0 replies; 6+ messages in thread
From: Michael Buesch @ 2006-09-23 20:22 UTC (permalink / raw)
  To: Larry Finger
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
	John Linville, Stefano Brivio

On Saturday 23 September 2006 22:05, Larry Finger wrote:
> Michael Buesch wrote:
> > 
> > Well, even _if_ mac_suspend takes a few milliseconds (which it
> > does not), it would not trigger the watchdog.
> > I measured the time it takes to execute the various works
> > and based the badness selection on the results.
> > 
> > If the 15 or 30 second work is really able to trigger a watchdog
> > timeout, it's a _bug_ that needs to be fixed and not to be
> > papered over.
> > It won't trigger the watchdog, because it is running too long
> > uninterruptible (it won't run 5sec...). If it triggers, it's
> > triggered by something else (like the synchronize_net thingie
> > in the past).
> 
> Even the synchronize_net problem wasn't taking 5 seconds to complete, it was messing up the transmit 
> process.

That's what I am saying. There must be another similiar bug.

> I went back to check my logs again, and the actual error was "BCM43xx_IRQ_XMIT_ERROR", which is 
> always preceded by a MAC suspend failed. These never happened all the time I was running with 
> MAXIMUM_BADNESS of 0.

We can debug with the recently spec'ed reason and error registers why
this is triggered. See v4 specs.

> I think the _bug_ is letting the transmit process run while doing the periodic work,

No. We don't let TX run while doing any periodic work (slow or fast).
Same for the IRQ handler.
We take the IRQ lock, which protects against IRQ and TX path (and everything else).
The _only_ difference between slowpath and fastpath periodic work is
that slowpath (long) periodic work is preemptible. This is gained
by not taking the IRQ lock, but protecting it otherwise (disabling IRQs and TX).

So what you are doing by your patch is: _never_ taking the lock.

> which is why  
> I'm testing with the tx_disable before all periodic work. I'll let you know in 2 or 3 days if it 

It is not needed. tx_disable is only needed for long periodic work.

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2006-09-23 20:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-23  4:08 [PATCH] bcm43xx: further fix for periodic work errors Larry Finger
     [not found] ` <4514B322.mail1K91A36N8-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
2006-09-23  7:56   ` Michael Buesch
     [not found]     ` <200609230956.05475.mb-fseUSCV1ubazQB+pC5nmwQ@public.gmane.org>
2006-09-23 19:06       ` Larry Finger
     [not found]         ` <451585D1.3080102-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
2006-09-23 19:41           ` Michael Buesch
2006-09-23 20:05             ` Larry Finger
     [not found]               ` <451593A0.8030104-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
2006-09-23 20:22                 ` Michael Buesch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).