netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] A change in periodic work scheduling in bcm43xx
@ 2006-09-05 17:58 Larry Finger
       [not found] ` <44FDBAA8.6090709-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
  0 siblings, 1 reply; 2+ messages in thread
From: Larry Finger @ 2006-09-05 17:58 UTC (permalink / raw)
  To: Michael Buesch
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
	Stefano Brivio

Michael,

Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, 
and the device just falling over do not happen when periodic work is not preemptible. These problems 
seem to affect BCM4306 rev 2 & 3 chips. Since I changed BADNESS_LIMIT to 20 to disable preemption 
during periodic work, my device has stayed up continuously for more than 18 hours. Previously, the 
longest time between failures was less than 6 hours, and sometimes as short as 10 minutes.

As you know, the present scheme for periodic work scheduling for bcm43xx in both wireless-2.6 and 
wireless-dev runs all 4 periodic tasks on certain ticks of the 15-second clock. Using your values of 
"badness" of 1, 1, 5, and 10 for the 15, 30, 60, and 120 second periodic tasks, respectively, the 
badness repeat cycle is ..., 1, 2, 1, 7, 1, 2, 1, 17, ...

I propose that we reduce the size of the spike in badness by shifting the 120 second task from a 
clock value of 8n to 8n+7, and the 60 second task from 4n to 4n+1. This way no more than 2 of the 
periodic tasks will be run in any clock period, and the badness repeat cycle becomes ..., 6, 2, 1, 
2, 6, 2, 11, 2, .... The tasks are run with the same periodicity as before, just a little more 
asynchronously. I recall that they were completely asynchronous in early versions of this driver.

Until we can locate and fix the problem that occurs during preemption, should we consider setting 
BADNESS_LIMIT to 20 in the wireless-2.6 kernels? For those of us whose cards have the problem, it 
certainly makes the device a lot more usable.

Larry

The patches to implement the scheduling change are as follows:

Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
===================================================================
--- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
+++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
@@ -3195,9 +3195,9 @@ static void do_periodic_work(struct bcm4
  	unsigned int state;

  	state = bcm->periodic_state;
-	if (state % 8 == 0)
+	if (state % 8 == 7)
  		bcm43xx_periodic_every120sec(bcm);
-	if (state % 4 == 0)
+	if (state % 4 == 1)
  		bcm43xx_periodic_every60sec(bcm);
  	if (state % 2 == 0)
  		bcm43xx_periodic_every30sec(bcm);
@@ -3216,8 +3216,8 @@ static int estimate_periodic_work_badnes
  {
  	int badness = 0;

-	if (state % 8 == 0) /* every 120 sec */
+	if (state % 8 == 7) /* every 120 sec */
  		badness += 10;
-	if (state % 4 == 0) /* every 60 sec */
+	if (state % 4 == 1) /* every 60 sec */
  		badness += 5;
  	if (state % 2 == 0) /* every 30 sec */

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC] A change in periodic work scheduling in bcm43xx
       [not found] ` <44FDBAA8.6090709-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
@ 2006-09-05 20:47   ` Michael Buesch
  0 siblings, 0 replies; 2+ messages in thread
From: Michael Buesch @ 2006-09-05 20:47 UTC (permalink / raw)
  To: Larry Finger
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Bcm43xx-dev-0fE9KPoRgkgATYTw5x5z8w,
	Stefano Brivio

On Tuesday 05 September 2006 19:58, Larry Finger wrote:
> Michael,
> 
> Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, 
> and the device just falling over do not happen when periodic work is not preemptible. These problems 
> seem to affect BCM4306 rev 2 & 3 chips. Since I changed BADNESS_LIMIT to 20 to disable preemption 
> during periodic work, my device has stayed up continuously for more than 18 hours. Previously, the 
> longest time between failures was less than 6 hours, and sometimes as short as 10 minutes.
> 
> As you know, the present scheme for periodic work scheduling for bcm43xx in both wireless-2.6 and 
> wireless-dev runs all 4 periodic tasks on certain ticks of the 15-second clock. Using your values of 
> "badness" of 1, 1, 5, and 10 for the 15, 30, 60, and 120 second periodic tasks, respectively, the 
> badness repeat cycle is ..., 1, 2, 1, 7, 1, 2, 1, 17, ...
> 
> I propose that we reduce the size of the spike in badness by shifting the 120 second task from a 
> clock value of 8n to 8n+7, and the 60 second task from 4n to 4n+1. This way no more than 2 of the 
> periodic tasks will be run in any clock period, and the badness repeat cycle becomes ..., 6, 2, 1, 
> 2, 6, 2, 11, 2, .... The tasks are run with the same periodicity as before, just a little more 
> asynchronously. I recall that they were completely asynchronous in early versions of this driver.
> 
> Until we can locate and fix the problem that occurs during preemption, should we consider setting 
> BADNESS_LIMIT to 20 in the wireless-2.6 kernels? For those of us whose cards have the problem, it 
> certainly makes the device a lot more usable.

Oh well...
And if we do this, it will take two weeks for the latency-people to
show up and request a revert of this again.

Well, I _really_ don't want to have a patch like this, because
it just papers over a real bug.
There are only two choices: Either we want preemption or we don't.
It's worthless to tune the badness limit to a point where it is least
likely for the bug to trigger. Sooner or later it _will_ trigger.

What we really want is:
1st: A relieable way to reproduce the bug in short time.
     Waiting 20hours isn't really a good way of debugging.
2nd: If we can reproduce it in reasonable time, we can track
     down what is actually causing the bug.

My thoughts on the bug:

When a preemptible work happens, we completely shutdown IRQ
handling and we suspend the MAC. We do this, because we must
not take the IRQ spinlock if we want to be preemptible.
By not taking the IRQ spinlock, we race against the DMA engine
(and other parts). So we must shutdown any data flow during
the periodic work to ensure the IRQ handler does not trigger.
The sad thing is: We don't know much about how the card and
the firmware works (yet). So the big question is:
How to suspend the card in an easy and _inexpensive_ way?
We currently mask all IRQs and suspend the MAC. I guess MAC
suspending is part of the problem. I _guess_ the card is
confused by suspending the MAC in the middle of possible
transmissions. It's all just a guess. That's why I want to
have a good way to reproduce the bug to do experiments.
We could suspend the DMA TX channel before we suspend the MAC,
for example. We could try other things as well. For example
don't suspend the MAC at all. Just mask IRQs.

We must be _careful_ here. The preemptible periodic work
is a damn fragile part of the whole driver and it is easily
possible to break it even more with a patch that looks
correct.

Short:
We don't need a patch to paper over the bug, but we need
_ideas_ of what is actually going on.

-- 
Greetings Michael.

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2006-09-05 20:47 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-05 17:58 [RFC] A change in periodic work scheduling in bcm43xx Larry Finger
     [not found] ` <44FDBAA8.6090709-tQ5ms3gMjBLk1uMJSBkQmQ@public.gmane.org>
2006-09-05 20:47   ` Michael Buesch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).