* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup [not found] <20060803075704.GC27835@thunk.org> @ 2006-08-03 10:00 ` Herbert Xu 2006-08-03 16:32 ` Theodore Tso 2006-08-03 18:36 ` Michael Chan 0 siblings, 2 replies; 38+ messages in thread From: Herbert Xu @ 2006-08-03 10:00 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-kernel, netdev, davem, mchan Theodore Tso <tytso@mit.edu> wrote: > > I'm sending this on mostly because it was a bit of a pain to track down, > and hopefully it will save time if anyone else hits this while playing > with the -rt kernel. It is NOT the right way to fix things, so please > don't even think of applying this patch (unless you need it, in your own > local tree :-). > > One of these days when we have time to breath we'll look into fixing > this the right way, if someone doesn't beat us to it first. :-) You probably should resend the patch to netdev and Michael Chan <mchan@broadcom.com>. He might have ideas on how this could be avoided. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 10:00 ` [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup Herbert Xu @ 2006-08-03 16:32 ` Theodore Tso 2006-08-03 16:46 ` Daniel Walker 2006-08-03 16:49 ` Randy.Dunlap 2006-08-03 18:36 ` Michael Chan 1 sibling, 2 replies; 38+ messages in thread From: Theodore Tso @ 2006-08-03 16:32 UTC (permalink / raw) To: Herbert Xu; +Cc: linux-kernel, netdev, davem, mchan On Thu, Aug 03, 2006 at 08:00:35PM +1000, Herbert Xu wrote: > Theodore Tso <tytso@mit.edu> wrote: > > > > I'm sending this on mostly because it was a bit of a pain to track down, > > and hopefully it will save time if anyone else hits this while playing > > with the -rt kernel. It is NOT the right way to fix things, so please > > don't even think of applying this patch (unless you need it, in your own > > local tree :-). > > > > One of these days when we have time to breath we'll look into fixing > > this the right way, if someone doesn't beat us to it first. :-) > > You probably should resend the patch to netdev and Michael Chan > <mchan@broadcom.com>. He might have ideas on how this could be > avoided. This only shows up with the real-time kernel where timer softirq's run in their own processes, and a high priority process preempts the timer softirq. I don't really consider this a networking bug, or even driver bug, although it does seem unfortunate that Broadcom hardware locks up and goes unresponsive if the OS doesn't tickle it every tenth of a second or so. (Definitely a bad idea if the tg3 gets used on any laptops, from a power usage perspective.) But that seems like a (lame) hardware bug, not a driver bug.... - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 16:32 ` Theodore Tso @ 2006-08-03 16:46 ` Daniel Walker 2006-08-03 17:17 ` Theodore Tso 2006-08-03 16:49 ` Randy.Dunlap 1 sibling, 1 reply; 38+ messages in thread From: Daniel Walker @ 2006-08-03 16:46 UTC (permalink / raw) To: Theodore Tso; +Cc: Herbert Xu, linux-kernel, netdev, davem, mchan On Thu, 2006-08-03 at 12:32 -0400, Theodore Tso wrote: > This only shows up with the real-time kernel where timer softirq's run > in their own processes, and a high priority process preempts the timer > softirq. I don't really consider this a networking bug, or even > driver bug, although it does seem unfortunate that Broadcom hardware > locks up and goes unresponsive if the OS doesn't tickle it every tenth > of a second or so. (Definitely a bad idea if the tg3 gets used on any > laptops, from a power usage perspective.) But that seems like a > (lame) hardware bug, not a driver bug.... There is some form of priority inheritance on the timer softirq. It said in the patch header that the right fix was for the timer softirq to change priorities. Which Real Time patch are you using? Or is the current system not sufficient ? Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 16:46 ` Daniel Walker @ 2006-08-03 17:17 ` Theodore Tso 2006-08-03 21:45 ` David Miller 0 siblings, 1 reply; 38+ messages in thread From: Theodore Tso @ 2006-08-03 17:17 UTC (permalink / raw) To: Daniel Walker; +Cc: Herbert Xu, linux-kernel, netdev, davem, mchan On Thu, Aug 03, 2006 at 09:46:37AM -0700, Daniel Walker wrote: > There is some form of priority inheritance on the timer softirq. It said > in the patch header that the right fix was for the timer softirq to > change priorities. Which Real Time patch are you using? Or is the > current system not sufficient ? We're using a someone older version of the CONFIG_PREEMPT_RT patches (2.6.16-rt22, with various bug fixes pulled up to what we are running.) There is priority inheritance on the hrtimers, but not on normal timers, and conversations with Thomas and Stephen at OLS indicated this is on the wishlist, but it has not happened yet. As I mentioned in the patch comments, I looked at hacking hrtimers into the tg3 driver code, but (a) the hrtimers code assume that the priority inheritance happens when a process is associated with the hrtimer and the process is a high priority process, and (b) the hrtimers code aren't exported for use by modules. So I went with a very quick hack, since we have a hard code freeze for a customer deliverable. In the long term, we're going to need something a bit more sophisticated than what we have in the hrtimers code, since not all code which requests timers are necessarily associated with a process. The tg3_timer() code, for example, is trigger by the device driver but isn't associated with a process for boosting purposes, and creating a process just so that tg3_timer() can be boosted seems like the Wrong Thing. In addition, the timer wheel code has a *large* number of timers that get added and then removed without ever getting expired by the TCP networking code, and I'm not at all convinced that the technique used for doing prio boosting for the hrtimers will scale to what is needed for normal timers. - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 17:17 ` Theodore Tso @ 2006-08-03 21:45 ` David Miller 0 siblings, 0 replies; 38+ messages in thread From: David Miller @ 2006-08-03 21:45 UTC (permalink / raw) To: tytso; +Cc: dwalker, herbert, linux-kernel, netdev, mchan From: Theodore Tso <tytso@mit.edu> Date: Thu, 3 Aug 2006 13:17:31 -0400 > The tg3_timer() code, for example, is trigger by the device driver but > isn't associated with a process for boosting purposes, and creating a > process just so that tg3_timer() can be boosted seems like the Wrong Ted please make sure the tg3 chips you have actually do need that periodic poking code that tg3_timer() has, most chips do not. You don't need the periodic poke unless TG3_FLAG_TAGGED_STATUS is cleared, and that is only the case for two chips 1) 5700 and 2) 5788. The only thing left is the link status and that is not so concerned about mild forms of latency in the timer firing. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 16:32 ` Theodore Tso 2006-08-03 16:46 ` Daniel Walker @ 2006-08-03 16:49 ` Randy.Dunlap 2006-08-03 17:04 ` Alistair John Strachan 2006-08-03 17:28 ` Theodore Tso 1 sibling, 2 replies; 38+ messages in thread From: Randy.Dunlap @ 2006-08-03 16:49 UTC (permalink / raw) To: Theodore Tso; +Cc: Herbert Xu, linux-kernel, netdev, davem, mchan On Thu, 3 Aug 2006 12:32:05 -0400 Theodore Tso wrote: > On Thu, Aug 03, 2006 at 08:00:35PM +1000, Herbert Xu wrote: > > Theodore Tso <tytso@mit.edu> wrote: > > > > > > I'm sending this on mostly because it was a bit of a pain to track down, > > > and hopefully it will save time if anyone else hits this while playing > > > with the -rt kernel. It is NOT the right way to fix things, so please > > > don't even think of applying this patch (unless you need it, in your own > > > local tree :-). > > > > > > One of these days when we have time to breath we'll look into fixing > > > this the right way, if someone doesn't beat us to it first. :-) > > > > You probably should resend the patch to netdev and Michael Chan > > <mchan@broadcom.com>. He might have ideas on how this could be > > avoided. > > This only shows up with the real-time kernel where timer softirq's run > in their own processes, and a high priority process preempts the timer > softirq. I don't really consider this a networking bug, or even > driver bug, although it does seem unfortunate that Broadcom hardware > locks up and goes unresponsive if the OS doesn't tickle it every tenth > of a second or so. (Definitely a bad idea if the tg3 gets used on any > laptops, from a power usage perspective.) But that seems like a > (lame) hardware bug, not a driver bug.... Interesting. On my Dell D610 notebook with tg3 and vpn, I have to ping a server on the vpn to keep it alive, otherwise it disappears soon and I have to restart the vpn. Of course, this could just be the vpn or some other software problem instead of a tg3 problem. --- ~Randy ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 16:49 ` Randy.Dunlap @ 2006-08-03 17:04 ` Alistair John Strachan 2006-08-03 21:43 ` David Miller 2006-08-03 17:28 ` Theodore Tso 1 sibling, 1 reply; 38+ messages in thread From: Alistair John Strachan @ 2006-08-03 17:04 UTC (permalink / raw) To: Randy.Dunlap; +Cc: Theodore Tso, Herbert Xu, linux-kernel, netdev, davem, mchan On Thursday 03 August 2006 17:49, Randy.Dunlap wrote: [snip] > > This only shows up with the real-time kernel where timer softirq's run > > in their own processes, and a high priority process preempts the timer > > softirq. I don't really consider this a networking bug, or even > > driver bug, although it does seem unfortunate that Broadcom hardware > > locks up and goes unresponsive if the OS doesn't tickle it every tenth > > of a second or so. (Definitely a bad idea if the tg3 gets used on any > > laptops, from a power usage perspective.) But that seems like a > > (lame) hardware bug, not a driver bug.... > > Interesting. On my Dell D610 notebook with tg3 and vpn, > I have to ping a server on the vpn to keep it alive, otherwise > it disappears soon and I have to restart the vpn. Of course, > this could just be the vpn or some other software problem > instead of a tg3 problem. Probably. I have an NC6000 with a tg3 and have never experienced link failure problems, even under -rt. -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 17:04 ` Alistair John Strachan @ 2006-08-03 21:43 ` David Miller 0 siblings, 0 replies; 38+ messages in thread From: David Miller @ 2006-08-03 21:43 UTC (permalink / raw) To: s0348365; +Cc: rdunlap, tytso, herbert, linux-kernel, netdev, mchan From: Alistair John Strachan <s0348365@sms.ed.ac.uk> Date: Thu, 3 Aug 2006 18:04:55 +0100 > Probably. I have an NC6000 with a tg3 and have never experienced > link failure problems, even under -rt. And note that the "poke the chip N times a second to avoid lockup" issue only matters on very very old tg3 chips. Therefore I think Ted fixed his problem by accident, as very few people in the world actually have tg3 chips old enough to need that periodic poking. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 16:49 ` Randy.Dunlap 2006-08-03 17:04 ` Alistair John Strachan @ 2006-08-03 17:28 ` Theodore Tso 1 sibling, 0 replies; 38+ messages in thread From: Theodore Tso @ 2006-08-03 17:28 UTC (permalink / raw) To: Randy.Dunlap; +Cc: Herbert Xu, linux-kernel, netdev, davem, mchan On Thu, Aug 03, 2006 at 09:49:17AM -0700, Randy.Dunlap wrote: > Interesting. On my Dell D610 notebook with tg3 and vpn, > I have to ping a server on the vpn to keep it alive, otherwise > it disappears soon and I have to restart the vpn. Of course, > this could just be the vpn or some other software problem > instead of a tg3 problem. That sounds almost certainly like a VPN problem. The tg3_timer() code wakes up every second or tenth of a second (depending on which mode you're in) and takes care of keeping the tg3 hardware mollified. On a standard kernel, this shouldn't ever be an issue. For the -rt kernel, this problem only shows up if you have enough tasks running at rtprio's above the rtprio of the softirq-timer for long enough that tg3 chip gets angry.... - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 10:00 ` [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup Herbert Xu 2006-08-03 16:32 ` Theodore Tso @ 2006-08-03 18:36 ` Michael Chan 2006-08-03 20:17 ` Theodore Tso 1 sibling, 1 reply; 38+ messages in thread From: Michael Chan @ 2006-08-03 18:36 UTC (permalink / raw) To: Herbert Xu; +Cc: tytso, linux-kernel, netdev, davem On Thu, 2006-08-03 at 20:00 +1000, Herbert Xu wrote: > Theodore Tso <tytso@mit.edu> wrote: > > > > I'm sending this on mostly because it was a bit of a pain to track down, > > and hopefully it will save time if anyone else hits this while playing > > with the -rt kernel. It is NOT the right way to fix things, so please > > don't even think of applying this patch (unless you need it, in your own > > local tree :-). > > > > One of these days when we have time to breath we'll look into fixing > > this the right way, if someone doesn't beat us to it first. :-) > Ted, what tg3 hardware is having this timer related problem? Can you send me the tg3 probing output? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 18:36 ` Michael Chan @ 2006-08-03 20:17 ` Theodore Tso 2006-08-03 21:48 ` David Miller 0 siblings, 1 reply; 38+ messages in thread From: Theodore Tso @ 2006-08-03 20:17 UTC (permalink / raw) To: Michael Chan; +Cc: Herbert Xu, linux-kernel, netdev, davem On Thu, Aug 03, 2006 at 11:36:47AM -0700, Michael Chan wrote: > On Thu, 2006-08-03 at 20:00 +1000, Herbert Xu wrote: > > Theodore Tso <tytso@mit.edu> wrote: > > > > > > I'm sending this on mostly because it was a bit of a pain to track down, > > > and hopefully it will save time if anyone else hits this while playing > > > with the -rt kernel. It is NOT the right way to fix things, so please > > > don't even think of applying this patch (unless you need it, in your own > > > local tree :-). > > > > > > One of these days when we have time to breath we'll look into fixing > > > this the right way, if someone doesn't beat us to it first. :-) > > > Ted, what tg3 hardware is having this timer related problem? Can you > send me the tg3 probing output? tg3.c:v3.49 (Feb 2, 2006) ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 17 eth0: Tigon3 [partno(BCM95704s) rev 2100 PHY(serdes)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:5e:86:44:24 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[0] TSOcap[0] eth0: dma_rwctrl[769f4000] dma_mask[64-bit] 02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 10) Subsystem: IBM: Unknown device 0301 Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 17 Memory at efff0000 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] PCI-X non-bridge device. Capabilities: [48] Power Management version 2 Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable- The other interesting bit of information is that after networking card goes dead and I do the "ifdown eth0; ifup eth0", the following printk shows up: tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff This is from an IBM LS-20 blade. Is this helpful? - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 20:17 ` Theodore Tso @ 2006-08-03 21:48 ` David Miller 2006-08-03 23:28 ` Michael Chan 2006-08-03 23:53 ` Theodore Tso 0 siblings, 2 replies; 38+ messages in thread From: David Miller @ 2006-08-03 21:48 UTC (permalink / raw) To: tytso; +Cc: mchan, herbert, linux-kernel, netdev From: Theodore Tso <tytso@mit.edu> Date: Thu, 3 Aug 2006 16:17:41 -0400 > eth0: Tigon3 [partno(BCM95704s) rev 2100 PHY(serdes)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:5e:86:44:24 The 5704 chip will set TG3_FLAG_TAGGED_STATUS, and therefore doesn't need the periodic poking done by tg3_timer(). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 21:48 ` David Miller @ 2006-08-03 23:28 ` Michael Chan 2006-08-03 23:43 ` David Miller 2006-08-04 3:23 ` Theodore Tso 2006-08-03 23:53 ` Theodore Tso 1 sibling, 2 replies; 38+ messages in thread From: Michael Chan @ 2006-08-03 23:28 UTC (permalink / raw) To: David Miller; +Cc: tytso, herbert, linux-kernel, netdev On Thu, 2006-08-03 at 14:48 -0700, David Miller wrote: > From: Theodore Tso <tytso@mit.edu> > Date: Thu, 3 Aug 2006 16:17:41 -0400 > > > eth0: Tigon3 [partno(BCM95704s) rev 2100 PHY(serdes)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:5e:86:44:24 > > The 5704 chip will set TG3_FLAG_TAGGED_STATUS, and therefore > doesn't need the periodic poking done by tg3_timer(). > True. But they also have ASF enabled which requires tg3_timer() to send the heartbeat periodically. If the heartbeat is late, ASF may reset the chip believing that the system has crashed. > eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[0] TSOcap[0] We'll see if we can do away with the timer-based heartbeat. That's probably the best solution. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:28 ` Michael Chan @ 2006-08-03 23:43 ` David Miller 2006-08-04 0:07 ` Theodore Tso 2006-08-04 3:23 ` Theodore Tso 1 sibling, 1 reply; 38+ messages in thread From: David Miller @ 2006-08-03 23:43 UTC (permalink / raw) To: mchan; +Cc: tytso, herbert, linux-kernel, netdev From: "Michael Chan" <mchan@broadcom.com> Date: Thu, 03 Aug 2006 16:28:19 -0700 > > eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[0] TSOcap[0] > > We'll see if we can do away with the timer-based heartbeat. That's > probably the best solution. The tg3 driver is not the only device in the world that requires a timer based "ping" to work. The watchdog drivers and the softlockup detector are other instances which require a timer to not be delayed an unreasonable amount of time. Therefore TG3 is not unique in this regard, and I thus don't think it's worthwhile to change tg3 just to accomodate this broken behavior of the RT patches. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:43 ` David Miller @ 2006-08-04 0:07 ` Theodore Tso 2006-08-04 0:20 ` David Miller 2006-08-04 0:25 ` Arjan van de Ven 0 siblings, 2 replies; 38+ messages in thread From: Theodore Tso @ 2006-08-04 0:07 UTC (permalink / raw) To: David Miller; +Cc: mchan, herbert, linux-kernel, netdev On Thu, Aug 03, 2006 at 04:43:11PM -0700, David Miller wrote: > From: "Michael Chan" <mchan@broadcom.com> > Date: Thu, 03 Aug 2006 16:28:19 -0700 > > > > eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[0] TSOcap[0] > > > > We'll see if we can do away with the timer-based heartbeat. That's > > probably the best solution. > > The tg3 driver is not the only device in the world that requires a > timer based "ping" to work. The watchdog drivers and the softlockup > detector are other instances which require a timer to not be delayed > an unreasonable amount of time. > > Therefore TG3 is not unique in this regard, and I thus don't think > it's worthwhile to change tg3 just to accomodate this broken behavior > of the RT patches. Removing the timer-based "ping" might be a good thing to do from the point of view of reducing power utilization of laptops (but hey, I don't have a tg3 in my laptop, so I won't worry about it a whole lot :-), but I agree that in general the RT patches need to be able to call functions such as tg3_timer() reliably even when under a high real-time process workload, without needing to use the blunt hammer of "chrt -f 95 `pidof softirq-timer`". (Since not all timer callbacks need to be run at rt prio 95.) - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-04 0:07 ` Theodore Tso @ 2006-08-04 0:20 ` David Miller 2006-08-04 0:25 ` Arjan van de Ven 1 sibling, 0 replies; 38+ messages in thread From: David Miller @ 2006-08-04 0:20 UTC (permalink / raw) To: tytso; +Cc: mchan, herbert, linux-kernel, netdev From: Theodore Tso <tytso@mit.edu> Date: Thu, 3 Aug 2006 20:07:07 -0400 > Removing the timer-based "ping" might be a good thing to do from the > point of view of reducing power utilization of laptops (but hey, I > don't have a tg3 in my laptop, so I won't worry about it a whole lot :-), You won't have "ASF" in your laptop tg3. ASF, as the second character it's acronym implies is for "Servers". ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-04 0:07 ` Theodore Tso 2006-08-04 0:20 ` David Miller @ 2006-08-04 0:25 ` Arjan van de Ven 1 sibling, 0 replies; 38+ messages in thread From: Arjan van de Ven @ 2006-08-04 0:25 UTC (permalink / raw) To: Theodore Tso, David Miller, mchan, herbert, linux-kernel, netdev Theodore Tso wrote: > Removing the timer-based "ping" might be a good thing to do from the > point of view of reducing power utilization of laptops (but hey, I > don't have a tg3 in my laptop, so I won't worry about it a whole lot :-), > but I agree that in general the RT patches need to be able to > call functions such as tg3_timer() reliably even when under a high > real-time process workload, without needing to use the blunt hammer of > "chrt -f 95 `pidof softirq-timer`". (Since not all timer callbacks > need to be run at rt prio 95.) > I suppose the timer subsystem needs a "I'd like to have this timer called at time X, but it's ok to call it later until time X+Y" option; that's useful for RT like stuff but also for power savings... (eg you can batch timer firings that way a lot better) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:28 ` Michael Chan 2006-08-03 23:43 ` David Miller @ 2006-08-04 3:23 ` Theodore Tso 2006-08-04 3:45 ` Michael Chan 1 sibling, 1 reply; 38+ messages in thread From: Theodore Tso @ 2006-08-04 3:23 UTC (permalink / raw) To: Michael Chan; +Cc: David Miller, herbert, linux-kernel, netdev On Thu, Aug 03, 2006 at 04:28:19PM -0700, Michael Chan wrote: > True. But they also have ASF enabled which requires tg3_timer() to send > the heartbeat periodically. If the heartbeat is late, ASF may reset the > chip believing that the system has crashed. Parden me for asking a dumb question, but what's being accomplished by resetting the chip if the system has crashed? Why not reset the chip when the system reboots and it sees the PCI bus reset? I guess I'm missing the purpose of the ASF heartbeat; why does the networking chip need a chip-specific watchdog? - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-04 3:23 ` Theodore Tso @ 2006-08-04 3:45 ` Michael Chan 2006-08-05 20:26 ` Theodore Tso 0 siblings, 1 reply; 38+ messages in thread From: Michael Chan @ 2006-08-04 3:45 UTC (permalink / raw) To: Theodore Tso; +Cc: David Miller, herbert, linux-kernel, netdev Theodore Tso wrote: > Parden me for asking a dumb question, but what's being accomplished by > resetting the chip if the system has crashed? Why not reset the chip > when the system reboots and it sees the PCI bus reset? I guess I'm > missing the purpose of the ASF heartbeat; why does the networking chip > need a chip-specific watchdog? > ASF is firmware that monitors the system and sends out alerts whenever certain events happen. So it needs to run before the OS boots or after it has crashed. When the driver is up and running, the driver and ASF run independently sending and receiving traffic on the same wire. Of course, the bandwidth that is used by ASF is a very tiny fraction of the host traffic. If the system crashes, the FIFO and other resources on the NIC will be backed up and ASF can no longer function without resetting the chip. As David explained, ASF is only used on servers. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-04 3:45 ` Michael Chan @ 2006-08-05 20:26 ` Theodore Tso 2006-08-08 6:36 ` Michael Chan 0 siblings, 1 reply; 38+ messages in thread From: Theodore Tso @ 2006-08-05 20:26 UTC (permalink / raw) To: Michael Chan; +Cc: David Miller, herbert, linux-kernel, netdev On Thu, Aug 03, 2006 at 08:45:31PM -0700, Michael Chan wrote: > ASF is firmware that monitors the system and sends out alerts whenever > certain events happen. So it needs to run before the OS boots or after > it has crashed. When the driver is up and running, the driver and ASF > run independently sending and receiving traffic on the same wire. Of > course, the bandwidth that is used by ASF is a very tiny fraction of the > host traffic. If the system crashes, the FIFO and other resources on > the NIC will be backed up and ASF can no longer function without > resetting the chip. Thanks, that description was very helpful. Would you accept a patch with adding a comment describing this? I couldn't figure it out from looking at the source and googling "ASF" turned up lots of other uses for that particular acronym. It appears that there is no way of disabling ASF; is that a true statement? Thanks, regards, - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-05 20:26 ` Theodore Tso @ 2006-08-08 6:36 ` Michael Chan 2006-08-25 22:33 ` Marc Bevand 0 siblings, 1 reply; 38+ messages in thread From: Michael Chan @ 2006-08-08 6:36 UTC (permalink / raw) To: Theodore Tso; +Cc: David Miller, herbert, linux-kernel, netdev Theodore Tso wrote: > Thanks, that description was very helpful. Would you accept a patch > with adding a comment describing this? I will put it on my queue to add some comments for ASF. > > It appears that there is no way of disabling ASF; is that a true > statement? > Turning off ASF is just a matter of changing some bits in NVRAM and recalculating the checksum. If you need the tool to do this, I'll have someone send it to you. Note that on some of the blade servers, I believe ASF is vital and should not be disabled. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-08 6:36 ` Michael Chan @ 2006-08-25 22:33 ` Marc Bevand 2006-08-25 22:55 ` Michael Chan 0 siblings, 1 reply; 38+ messages in thread From: Marc Bevand @ 2006-08-25 22:33 UTC (permalink / raw) To: netdev Michael Chan <mchan <at> broadcom.com> writes: > > Turning off ASF is just a matter of changing some bits in NVRAM > and recalculating the checksum. If you need the tool to do this, > I'll have someone send it to you. > > Note that on some of the blade servers, I believe ASF is vital > and should not be disabled. Still, it would be great if ASF could be disabled, because I have noticed that when ASF is enabled, the tg3 driver automatically disables TSO (TCP Segmentation Offloading). Here is a dmesg output from a server where I am seeing that behavior: eth0: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) \ 10/100/1000BaseT Ethernet 00:30:48:59:c4:94 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[0] [...] eth1: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) \ 10/100/1000BaseT Ethernet 00:30:48:59:c4:95 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] Both interfaces are fundamentally TSO-capable, but since ASF is enabled on eth0, tg3 disables TSO on this interface. Of course at this point it is not even possible to use ethtool to re-enable it because the driver considers eth0 as not TSO-capable at all. As far as I know, the tg3 driver has been doing that since one of your patches shipped with 2.6.11-rc2-bk3, Michael, see [1]. Here is the relevant code snippet (line numbers are for 2.6.16): 10835 if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) { 10836 tp->tg3_flags2 |= TG3_FLG2_TSO_CAPABLE; 10837 } 10838 else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5700 || 10839 GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 || 10840 tp->pci_chip_rev_id == CHIPREV_ID_5705_A0 || 10841 (tp->tg3_flags & TG3_FLAG_ENABLE_ASF) != 0) { 10842 tp->tg3_flags2 &= ~TG3_FLG2_TSO_CAPABLE; 10843 } else { 10844 tp->tg3_flags2 |= TG3_FLG2_TSO_CAPABLE; 10845 } The culprit is line 10841. Why is that done ? [1] ftp://ftp.us.kernel.org:/pub/linux/kernel/v2.6/snapshots/old/ patch-2.6.11-rc2-bk3.log, patch-2.6.11-rc2-bk3.bz2 -- Marc Bevand ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-25 22:33 ` Marc Bevand @ 2006-08-25 22:55 ` Michael Chan 2006-08-25 23:48 ` Marc Bevand 0 siblings, 1 reply; 38+ messages in thread From: Michael Chan @ 2006-08-25 22:55 UTC (permalink / raw) To: Marc Bevand; +Cc: netdev On Fri, 2006-08-25 at 22:33 +0000, Marc Bevand wrote: > Still, it would be great if ASF could be disabled, because I have > noticed that when ASF is enabled, the tg3 driver automatically disables > TSO (TCP Segmentation Offloading). Here is a dmesg output from a server > where I am seeing that behavior: > > eth0: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) \ > 10/100/1000BaseT Ethernet 00:30:48:59:c4:94 > eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[0] > [...] > eth1: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) \ > 10/100/1000BaseT Ethernet 00:30:48:59:c4:95 > eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] > > Both interfaces are fundamentally TSO-capable, but since ASF is enabled > on eth0, tg3 disables TSO on this interface. Of course at this point it > is not even possible to use ethtool to re-enable it because the driver > considers eth0 as not TSO-capable at all. > The reason is that TSO on 5704 and older chips is done by firmware. ASF is also implemented by firmware. If ASF is enabled, there is no room to do TSO and ASF at the same time. Firmware-based TSO is actually slower than no TSO. The only benefit is a little better CPU utilization. On the newer chips, TSO is done by hardware which solves both the ASF and throughput problems. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-25 22:55 ` Michael Chan @ 2006-08-25 23:48 ` Marc Bevand 2006-08-26 0:01 ` Michael Chan 0 siblings, 1 reply; 38+ messages in thread From: Marc Bevand @ 2006-08-25 23:48 UTC (permalink / raw) To: Michael Chan; +Cc: netdev On 8/25/06, Michael Chan <mchan@broadcom.com> wrote: > > The reason is that TSO on 5704 and older chips is done by firmware. ASF > is also implemented by firmware. If ASF is enabled, there is no room to > do TSO and ASF at the same time. Just for test purpose, I have applied the following patch to my tg3.c. I now seem to be able to enable/disable TSO, but I admit don't know whether ASF is still functional or not. else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5700 || GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 || - tp->pci_chip_rev_id == CHIPREV_ID_5705_A0 || - (tp->tg3_flags & TG3_FLAG_ENABLE_ASF) != 0) { + tp->pci_chip_rev_id == CHIPREV_ID_5705_A0) { tp->tg3_flags2 &= ~TG3_FLG2_TSO_CAPABLE; } else { Then tg3 considered my interface as TSO-capable ("TSOcap[1]" in dmesg). TSO was still disabled by default, which is normal because there is this other check a couple of lines below: /* TSO is on by default on chips that support hardware TSO. * Firmware TSO on older chips gives lower performance, so it * is off by default, but can be enabled using ethtool. */ if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) dev->features |= NETIF_F_TSO; But I was able to turn TSO on via ethtool -K. This is exactly the behavior I would like to see in tg3. So are you saying the patch I applied actually breaks ASF ? > Firmware-based TSO is actually slower than no TSO. The only benefit is > a little better CPU utilization. I know, in one of my test-cases, firmware TSO reduces the max achievable TCP bandwidth from 930 to 840 Mbit/s on a GigE network while reducing the CPU utilization from 44% to 22%. I think firmware TSO still makes sense in some cases. -- Marc Bevand ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-25 23:48 ` Marc Bevand @ 2006-08-26 0:01 ` Michael Chan 0 siblings, 0 replies; 38+ messages in thread From: Michael Chan @ 2006-08-26 0:01 UTC (permalink / raw) To: Marc Bevand; +Cc: netdev On Fri, 2006-08-25 at 16:48 -0700, Marc Bevand wrote: > > But I was able to turn TSO on via ethtool -K. This is exactly the behavior I > would like to see in tg3. So are you saying the patch I applied actually > breaks ASF ? > Yes, the TSO firmware code that tg3 downloads to the chip using your patch will overwrite the ASF code. The ASF code will get reloaded again at the next chip reset once your patch is reverted. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 21:48 ` David Miller 2006-08-03 23:28 ` Michael Chan @ 2006-08-03 23:53 ` Theodore Tso 2006-08-03 23:56 ` David Miller 2006-08-07 5:34 ` Steven Rostedt 1 sibling, 2 replies; 38+ messages in thread From: Theodore Tso @ 2006-08-03 23:53 UTC (permalink / raw) To: David Miller; +Cc: mchan, herbert, linux-kernel, netdev On Thu, Aug 03, 2006 at 02:48:45PM -0700, David Miller wrote: > > eth0: Tigon3 [partno(BCM95704s) rev 2100 PHY(serdes)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:5e:86:44:24 > > The 5704 chip will set TG3_FLAG_TAGGED_STATUS, and therefore > doesn't need the periodic poking done by tg3_timer(). Hmm.... all I can say is that I could reliably knock the box off the network by running a four processes that tied up all CPU's at high real-time priorities, and after I applied the horrible hack that guaranteed that tg3_timer() was run every 0.128 seconds, the system stayed on the network. I'm not sure why, but it did fix the problem. Any suggestions on how I could figure out what was really going on and what would be a better fix would be greatly appreciated. - Ted ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:53 ` Theodore Tso @ 2006-08-03 23:56 ` David Miller 2006-08-03 23:59 ` Herbert Xu 2006-08-04 0:03 ` Daniel Walker 2006-08-07 5:34 ` Steven Rostedt 1 sibling, 2 replies; 38+ messages in thread From: David Miller @ 2006-08-03 23:56 UTC (permalink / raw) To: tytso; +Cc: mchan, herbert, linux-kernel, netdev From: Theodore Tso <tytso@mit.edu> Date: Thu, 3 Aug 2006 19:53:26 -0400 > Any suggestions on how I could figure out what was really going on and > what would be a better fix would be greatly appreciated. As Michael explained, it's the ASF heartbeat sent by tg3_timer() that must be delivered to the chip within certain timing constraints. If you had any watchdog devices on this machine, they would likely trigger too and reset your machine :) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:56 ` David Miller @ 2006-08-03 23:59 ` Herbert Xu 2006-08-04 0:01 ` David Miller 2006-08-04 0:03 ` Daniel Walker 1 sibling, 1 reply; 38+ messages in thread From: Herbert Xu @ 2006-08-03 23:59 UTC (permalink / raw) To: David Miller; +Cc: tytso, mchan, linux-kernel, netdev On Thu, Aug 03, 2006 at 04:56:54PM -0700, David Miller wrote: > > As Michael explained, it's the ASF heartbeat sent by tg3_timer() that > must be delivered to the chip within certain timing constraints. > > If you had any watchdog devices on this machine, they would likely > trigger too and reset your machine :) Watchdogs usually require one heartbeat every 30 seconds or so. Does the ASF heartbeat need to be that frequent? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:59 ` Herbert Xu @ 2006-08-04 0:01 ` David Miller 2006-08-04 0:16 ` Michael Chan 0 siblings, 1 reply; 38+ messages in thread From: David Miller @ 2006-08-04 0:01 UTC (permalink / raw) To: herbert; +Cc: tytso, mchan, linux-kernel, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Fri, 4 Aug 2006 09:59:27 +1000 > Watchdogs usually require one heartbeat every 30 seconds or so. Does > the ASF heartbeat need to be that frequent? The ASF heartbeat needs to be sent every 2 seconds. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-04 0:01 ` David Miller @ 2006-08-04 0:16 ` Michael Chan 0 siblings, 0 replies; 38+ messages in thread From: Michael Chan @ 2006-08-04 0:16 UTC (permalink / raw) To: David Miller; +Cc: herbert, tytso, linux-kernel, netdev On Thu, 2006-08-03 at 17:01 -0700, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Fri, 4 Aug 2006 09:59:27 +1000 > > > Watchdogs usually require one heartbeat every 30 seconds or so. Does > > the ASF heartbeat need to be that frequent? > > The ASF heartbeat needs to be sent every 2 seconds. > Yep, we send it every 2 seconds and it will reset in 5 seconds after the last heartbeat. So the margin is 3 seconds. These numbers are somewhat arbitrary and the goal is to allow ASF to function properly without too much delay after the system has crashed. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:56 ` David Miller 2006-08-03 23:59 ` Herbert Xu @ 2006-08-04 0:03 ` Daniel Walker 1 sibling, 0 replies; 38+ messages in thread From: Daniel Walker @ 2006-08-04 0:03 UTC (permalink / raw) To: David Miller; +Cc: tytso, mchan, herbert, linux-kernel, netdev On Thu, 2006-08-03 at 16:56 -0700, David Miller wrote: > From: Theodore Tso <tytso@mit.edu> > Date: Thu, 3 Aug 2006 19:53:26 -0400 > > > Any suggestions on how I could figure out what was really going on and > > what would be a better fix would be greatly appreciated. > > As Michael explained, it's the ASF heartbeat sent by tg3_timer() that > must be delivered to the chip within certain timing constraints. > > If you had any watchdog devices on this machine, they would likely > trigger too and reset your machine :) That's not broken behavior in RT .. That's just plain old task priorities. Some high priority task (SCHED_FIFO prio 99) is sucking up a lot of the CPU. But that's 100% legal in SCHED_FIFO. Daniel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-03 23:53 ` Theodore Tso 2006-08-03 23:56 ` David Miller @ 2006-08-07 5:34 ` Steven Rostedt 2006-08-07 6:18 ` David Miller 1 sibling, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2006-08-07 5:34 UTC (permalink / raw) To: Theodore Tso Cc: David Miller, mchan, herbert, LKML, netdev, Thomas Gleixner, Ingo Molnar On Thu, 3 Aug 2006, Theodore Tso wrote: > On Thu, Aug 03, 2006 at 02:48:45PM -0700, David Miller wrote: > > > eth0: Tigon3 [partno(BCM95704s) rev 2100 PHY(serdes)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:5e:86:44:24 > > > > The 5704 chip will set TG3_FLAG_TAGGED_STATUS, and therefore > > doesn't need the periodic poking done by tg3_timer(). > > Hmm.... all I can say is that I could reliably knock the box off the > network by running a four processes that tied up all CPU's at high > real-time priorities, and after I applied the horrible hack that > guaranteed that tg3_timer() was run every 0.128 seconds, the system > stayed on the network. I'm not sure why, but it did fix the problem. > > Any suggestions on how I could figure out what was really going on and > what would be a better fix would be greatly appreciated. > Ted, I took a quick look at the tg3 driver and that timer interrupt as well have read this thread. My suggestion would be to separate that tg3_timer into 4 different timers, which is what it actually looks like. One I believe (the first one) is an actual timeout and the other three are timers that are expected to be triggered everytime (watchdog like: at 1 second, 2 seconds and 3 seconds) and thus should be converted into a hrtimers that goes off when expected then having some crazy accounting thing in one timer. I don't fully understand the ASF part here, and if it definitely needs to go off then set that timer with the highest prio. I've ask Thomas to add a way to add a hrtimer with a given prio instead of always taking the current->normal_prio. But until he does that, you can do a hack of changing the current prio to something very high (like 99) then start the timer, and then lower the prio back to what it was. This is a hack, but it might lead to a better solution in the future. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-07 5:34 ` Steven Rostedt @ 2006-08-07 6:18 ` David Miller 2006-08-08 12:24 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: David Miller @ 2006-08-07 6:18 UTC (permalink / raw) To: rostedt; +Cc: tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo From: Steven Rostedt <rostedt@goodmis.org> Date: Mon, 7 Aug 2006 01:34:56 -0400 (EDT) > My suggestion would be to separate that tg3_timer into 4 different > timers, which is what it actually looks like. Timers have non-trivial cost. It's cheaper to have one and vector off to the necessary operations each tick internalls. That's why it's implemented as one timer. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-07 6:18 ` David Miller @ 2006-08-08 12:24 ` Steven Rostedt 2006-08-08 13:13 ` Steven Rostedt 2006-08-08 22:00 ` David Miller 0 siblings, 2 replies; 38+ messages in thread From: Steven Rostedt @ 2006-08-08 12:24 UTC (permalink / raw) To: David Miller; +Cc: tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo On Sun, 6 Aug 2006, David Miller wrote: > From: Steven Rostedt <rostedt@goodmis.org> > Date: Mon, 7 Aug 2006 01:34:56 -0400 (EDT) > > > My suggestion would be to separate that tg3_timer into 4 different > > timers, which is what it actually looks like. > > Timers have non-trivial cost. It's cheaper to have one and > vector off to the necessary operations each tick internalls. > > That's why it's implemented as one timer. > hrtimers don't have the cost of a normal timer. And that's why I suggested to convert them. There's a much bigger cost in a single timer that always times out than 3 hrtimers. hrtimers are expected to timeout, but timers are not. Of the 4 timers, only one is a timeout. The other three expire every time, forcing the timer wheel into effect. Even though it's one timer implementing 4, it's expensive to use it as a watchdog. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-08 12:24 ` Steven Rostedt @ 2006-08-08 13:13 ` Steven Rostedt 2006-08-08 22:00 ` David Miller 1 sibling, 0 replies; 38+ messages in thread From: Steven Rostedt @ 2006-08-08 13:13 UTC (permalink / raw) To: David Miller; +Cc: tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo On Tue, 8 Aug 2006, Steven Rostedt wrote: > > On Sun, 6 Aug 2006, David Miller wrote: > > > From: Steven Rostedt <rostedt@goodmis.org> > > Date: Mon, 7 Aug 2006 01:34:56 -0400 (EDT) > > > > > My suggestion would be to separate that tg3_timer into 4 different > > > timers, which is what it actually looks like. > > > > Timers have non-trivial cost. It's cheaper to have one and > > vector off to the necessary operations each tick internalls. > > > > That's why it's implemented as one timer. > > > > hrtimers don't have the cost of a normal timer. And that's why I suggested > to convert them. There's a much bigger cost in a single timer that always > times out than 3 hrtimers. hrtimers are expected to timeout, but timers > are not. > > Of the 4 timers, only one is a timeout. The other three expire every time, > forcing the timer wheel into effect. Even though it's one timer > implementing 4, it's expensive to use it as a watchdog. I just got a chance to look a little more deeper at what the tg3 timer is doing, and I was wrong. The timeout is not a timeout but some messing around when the network card doesn't use tagged status (whatever that is). Which just pushes the point that this should _not_ be a timer, but a hrtimer (expected to expire). So you can keep this as one timer, but I would still switch it to a hrtimer regardless, since it is expected to timeout. (maybe separate out the ASF if that still needs to be special?). Ted, I don't know what the max latency of that timer is, (I'm sure it wouldn't be too hard to measuer, just add some timings around the timer handler, let it run for a while and keep account of the max time). But, since the user that opens this network card is the one that initializes the timer, if you simply switch the timer to be a hrtimer (that should also go in mainline) and then have a really high prio task start up the network, that timer would then run at the prio of the task that started the network. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-08 12:24 ` Steven Rostedt 2006-08-08 13:13 ` Steven Rostedt @ 2006-08-08 22:00 ` David Miller 2006-08-08 22:27 ` Steven Rostedt 1 sibling, 1 reply; 38+ messages in thread From: David Miller @ 2006-08-08 22:00 UTC (permalink / raw) To: rostedt; +Cc: tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo From: Steven Rostedt <rostedt@goodmis.org> Date: Tue, 8 Aug 2006 08:24:10 -0400 (EDT) > Of the 4 timers, only one is a timeout. The other three expire every time, > forcing the timer wheel into effect. Even though it's one timer > implementing 4, it's expensive to use it as a watchdog. It's not a watchdog, the timer continually fires. It is the on-chip ASF firmware that "times out" if it does not see the heartbeat message from the driver within 5 seconds. The driver timer in question runs every 2 seconds to write this heartbeat message to the chip. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-08 22:00 ` David Miller @ 2006-08-08 22:27 ` Steven Rostedt 2006-08-09 11:29 ` Roman Zippel 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2006-08-08 22:27 UTC (permalink / raw) To: David Miller; +Cc: tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo On Tue, 8 Aug 2006, David Miller wrote: > From: Steven Rostedt <rostedt@goodmis.org> > Date: Tue, 8 Aug 2006 08:24:10 -0400 (EDT) > > > Of the 4 timers, only one is a timeout. The other three expire every time, > > forcing the timer wheel into effect. Even though it's one timer > > implementing 4, it's expensive to use it as a watchdog. > > It's not a watchdog, the timer continually fires. Right, I should have used the term "heartbeat" instead :) > > It is the on-chip ASF firmware that "times out" if it does not > see the heartbeat message from the driver within 5 seconds. > > The driver timer in question runs every 2 seconds to write this > heartbeat message to the chip. > OK, but still, if it is expected to expire, which this one is, then it should be converted to a hrtimer, instead of a normal timer. The hrtimers were introduced to make it more efficient for expiring timers. Well, they really were introduced for high resolution, but in order to do that the expiring timer implementation needed to be improved. The timer wheel is most efficient for those timers that are not expected to time out (probably what a watchdog timer would actually do, so my terminology was really incorrect). The timer wheel is O(1) in removing and adding a timer, but can be O(n) if the timers cascade (which would happen when a timer moves to expire). The hrtimers are O(1) in expiring but O(log n) on adding and removing. So they are better to use when you know that the timer will expire. This also help's Ted's case for the -rt patch since the hrtimers there have a dynamic priority. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup 2006-08-08 22:27 ` Steven Rostedt @ 2006-08-09 11:29 ` Roman Zippel 0 siblings, 0 replies; 38+ messages in thread From: Roman Zippel @ 2006-08-09 11:29 UTC (permalink / raw) To: Steven Rostedt Cc: David Miller, tytso, mchan, herbert, linux-kernel, netdev, tglx, mingo Hi, On Tue, 8 Aug 2006, Steven Rostedt wrote: > OK, but still, if it is expected to expire, which this one is, then it > should be converted to a hrtimer, instead of a normal timer. The hrtimers > were introduced to make it more efficient for expiring timers. Well, they > really were introduced for high resolution, but in order to do that the > expiring timer implementation needed to be improved. The timer wheel is > most efficient for those timers that are not expected to time out > (probably what a watchdog timer would actually do, so my terminology was > really incorrect). > > The timer wheel is O(1) in removing and adding a timer, but can be O(n) > if the timers cascade (which would happen when a timer moves to expire). > > The hrtimers are O(1) in expiring but O(log n) on adding and removing. So > they are better to use when you know that the timer will expire. Can we please, please, please stop with this? While this is correct, it's only rarely relevant for most users. The cascading is really not as bad it's always represented here. You need a huge amount of timers pending to run into trouble and at this point hrtimers won't behave much better. In this case it's better to do what Dave suggested - use a single continous timer and vector off the needed work from there. The primary reason to use hrtimers should only be whether or not the high resolution is needed. hrtimer can have a significant higher fixed cost (due to a possibly expensive get_time() operation), so one should be a little careful with them. Possible problems in any of the timer systems should hardly be a concern for most users of it, at the point the timer system can't handle this many timers, we should rather look at the user and fix the problem there. bye, Roman ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2006-08-26 0:02 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20060803075704.GC27835@thunk.org>
2006-08-03 10:00 ` [PATCH -rt DO NOT APPLY] Fix for tg3 networking lockup Herbert Xu
2006-08-03 16:32 ` Theodore Tso
2006-08-03 16:46 ` Daniel Walker
2006-08-03 17:17 ` Theodore Tso
2006-08-03 21:45 ` David Miller
2006-08-03 16:49 ` Randy.Dunlap
2006-08-03 17:04 ` Alistair John Strachan
2006-08-03 21:43 ` David Miller
2006-08-03 17:28 ` Theodore Tso
2006-08-03 18:36 ` Michael Chan
2006-08-03 20:17 ` Theodore Tso
2006-08-03 21:48 ` David Miller
2006-08-03 23:28 ` Michael Chan
2006-08-03 23:43 ` David Miller
2006-08-04 0:07 ` Theodore Tso
2006-08-04 0:20 ` David Miller
2006-08-04 0:25 ` Arjan van de Ven
2006-08-04 3:23 ` Theodore Tso
2006-08-04 3:45 ` Michael Chan
2006-08-05 20:26 ` Theodore Tso
2006-08-08 6:36 ` Michael Chan
2006-08-25 22:33 ` Marc Bevand
2006-08-25 22:55 ` Michael Chan
2006-08-25 23:48 ` Marc Bevand
2006-08-26 0:01 ` Michael Chan
2006-08-03 23:53 ` Theodore Tso
2006-08-03 23:56 ` David Miller
2006-08-03 23:59 ` Herbert Xu
2006-08-04 0:01 ` David Miller
2006-08-04 0:16 ` Michael Chan
2006-08-04 0:03 ` Daniel Walker
2006-08-07 5:34 ` Steven Rostedt
2006-08-07 6:18 ` David Miller
2006-08-08 12:24 ` Steven Rostedt
2006-08-08 13:13 ` Steven Rostedt
2006-08-08 22:00 ` David Miller
2006-08-08 22:27 ` Steven Rostedt
2006-08-09 11:29 ` Roman Zippel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).