* NAPI note (was Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)) [not found] <3E4D66DF.3040800@colorfullife.com> @ 2003-02-14 23:58 ` Jeff Garzik 2003-02-18 2:57 ` NAPI note David S. Miller 0 siblings, 1 reply; 6+ messages in thread From: Jeff Garzik @ 2003-02-14 23:58 UTC (permalink / raw) To: Manfred Spraul; +Cc: Pete Zaitcev, James Bourne, davem, netdev Manfred Spraul wrote: > It seems to be a generic NAPI restriction: > The caller of netif_receive_skb() must not own a spinlock that is > acquired from an interrupt handler. Thanks much for noticing this, Manfred. tg3 is definitely buggy in this regard. I've CC'd netdev as an FYI... We should probably patch NAPI_HOWTO for this note. I note that David pointed this out as an area for improvement, so he was already thinking in this direction anyway :) Jeff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: NAPI note 2003-02-14 23:58 ` NAPI note (was Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)) Jeff Garzik @ 2003-02-18 2:57 ` David S. Miller 2003-02-18 16:31 ` Manfred Spraul 0 siblings, 1 reply; 6+ messages in thread From: David S. Miller @ 2003-02-18 2:57 UTC (permalink / raw) To: jgarzik; +Cc: manfred, zaitcev, jbourne, netdev From: Jeff Garzik <jgarzik@pobox.com> Date: Fri, 14 Feb 2003 18:58:13 -0500 Manfred Spraul wrote: > It seems to be a generic NAPI restriction: > The caller of netif_receive_skb() must not own a spinlock that is > acquired from an interrupt handler. Thanks much for noticing this, Manfred. I think this logic is buggy. In the example I've seen posted, only a NAPI implementation bug could cause the situation to occur. If cpu1 is in ->poll() for the driver, then by definition the device shall not cause interrupts. The device's interrupts are disabled before we enter the ->poll() handler, and as such the "cpu2 take device interrupt and takes driver->lock" cannot occur. If anything we've found a bug in interrupt disabling in the tg3 driver. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: NAPI note 2003-02-18 2:57 ` NAPI note David S. Miller @ 2003-02-18 16:31 ` Manfred Spraul 2003-02-19 2:53 ` jamal 0 siblings, 1 reply; 6+ messages in thread From: Manfred Spraul @ 2003-02-18 16:31 UTC (permalink / raw) To: David S. Miller; +Cc: jgarzik, zaitcev, jbourne, netdev David S. Miller wrote: > From: Jeff Garzik <jgarzik@pobox.com> > Date: Fri, 14 Feb 2003 18:58:13 -0500 > > Manfred Spraul wrote: > > It seems to be a generic NAPI restriction: > > The caller of netif_receive_skb() must not own a spinlock that is > > acquired from an interrupt handler. > > Thanks much for noticing this, Manfred. > >I think this logic is buggy. > >In the example I've seen posted, only a NAPI implementation bug >could cause the situation to occur. > >If cpu1 is in ->poll() for the driver, then by definition the >device shall not cause interrupts. The device's interrupts >are disabled before we enter the ->poll() handler, and as such >the "cpu2 take device interrupt and takes driver->lock" cannot >occur. > > No. I think the rule is that drivers that use the NAPI interface must not cause interrupts for packet receive and out-of-rx-buffers conditions. But what about media error interrupts, or tx interrupts? Or MIB counter overflow, etc. What about shared pci interrupts? All of them could occur, and if they take a spinlock that is held across netif_receive_skb(), then it can deadlock. OTHO if it's guaranteed that no interrupt occurs, then the nic should not take a spinlock at all and rely on the synchronization provided by NAPI. (->poll is single-threaded). -- Manfred ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: NAPI note 2003-02-18 16:31 ` Manfred Spraul @ 2003-02-19 2:53 ` jamal 2003-02-19 3:14 ` Jeff Garzik 2003-02-19 3:18 ` David S. Miller 0 siblings, 2 replies; 6+ messages in thread From: jamal @ 2003-02-19 2:53 UTC (permalink / raw) To: Manfred Spraul; +Cc: David S. Miller, Jeff Garzik, zaitcev, jbourne, netdev On Tue, 18 Feb 2003, Manfred Spraul wrote: > David S. Miller wrote: > > > From: Jeff Garzik <jgarzik@pobox.com> > > Date: Fri, 14 Feb 2003 18:58:13 -0500 > > > > Manfred Spraul wrote: > > > It seems to be a generic NAPI restriction: > > > The caller of netif_receive_skb() must not own a spinlock that is > > > acquired from an interrupt handler. > > > > Thanks much for noticing this, Manfred. > > > >I think this logic is buggy. > > > >In the example I've seen posted, only a NAPI implementation bug > >could cause the situation to occur. > > > >If cpu1 is in ->poll() for the driver, then by definition the > >device shall not cause interrupts. The device's interrupts > >are disabled before we enter the ->poll() handler, and as such > >the "cpu2 take device interrupt and takes driver->lock" cannot > >occur. > > > > > No. I think the rule is that drivers that use the NAPI interface must > not cause interrupts for packet receive and out-of-rx-buffers conditions. Ah, but that is only one of two rules. Theres other drivers which dont follow this rule and just shutdown all interupt sources. I know that the e1000 for example does this. I am not sure about the tg3. I think the doc says this but may not emphasize it as strongly. So if tg3 uses method 2 then its as Dave says - a bug. > But what about media error interrupts, or tx interrupts? Or MIB counter > overflow, etc. What about shared pci interrupts? Shared interupts should be interesting actually. However if you are in poll mode and you receive an interupt you should be able to quickly determine its not yours without much effect on shared locks, no? > All of them could occur, and if they take a spinlock that is held across > netif_receive_skb(), then it can deadlock. > yes this could happen with method 1 of programming the driver; however, tx, receive, link are essentially separate threads and would hardly share locks. > OTHO if it's guaranteed that no interrupt occurs, then the nic should > not take a spinlock at all and rely on the synchronization provided by > NAPI. (->poll is single-threaded). > i havent studied the e1000 theres a lot of this happening already. I dont think you need say to protect the tx ring for example from tx completion interupts vs regular softirq path. cheers, jamal ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: NAPI note 2003-02-19 2:53 ` jamal @ 2003-02-19 3:14 ` Jeff Garzik 2003-02-19 3:18 ` David S. Miller 1 sibling, 0 replies; 6+ messages in thread From: Jeff Garzik @ 2003-02-19 3:14 UTC (permalink / raw) To: jamal; +Cc: Manfred Spraul, David S. Miller, zaitcev, jbourne, netdev jamal wrote: > > On Tue, 18 Feb 2003, Manfred Spraul wrote: > > >>David S. Miller wrote: >> >> >>> From: Jeff Garzik <jgarzik@pobox.com> >>> Date: Fri, 14 Feb 2003 18:58:13 -0500 >>> >>> Manfred Spraul wrote: >>> > It seems to be a generic NAPI restriction: >>> > The caller of netif_receive_skb() must not own a spinlock that is >>> > acquired from an interrupt handler. >>> >>> Thanks much for noticing this, Manfred. >>> >>>I think this logic is buggy. >>> >>>In the example I've seen posted, only a NAPI implementation bug >>>could cause the situation to occur. >>> >>>If cpu1 is in ->poll() for the driver, then by definition the >>>device shall not cause interrupts. The device's interrupts >>>are disabled before we enter the ->poll() handler, and as such >>>the "cpu2 take device interrupt and takes driver->lock" cannot >>>occur. >>> >>> >> >>No. I think the rule is that drivers that use the NAPI interface must >>not cause interrupts for packet receive and out-of-rx-buffers conditions. > > > Ah, but that is only one of two rules. > Theres other drivers which dont follow this rule and just shutdown > all interupt sources. I know that the e1000 for example does this. > I am not sure about the tg3. I think the doc says this but may not > emphasize it as strongly. > So if tg3 uses method 2 then its as Dave says - a bug. tg3 shuts down all interrupt sources, and handles all interrupt events in dev->poll(). David and I hashed it out a bit on IRC. The problem is that deliver_to_old_ones() waits, and thus the deadlock that Manfred described. For 2.4.x, the solution is simply to avoid the deadlock in the driver. For 2.5.x, David hinted that deliver_to_old_ones() may be going away. >>But what about media error interrupts, or tx interrupts? Or MIB counter >>overflow, etc. What about shared pci interrupts? > > > Shared interupts should be interesting actually. > However if you are in poll mode and you receive an interupt you should be > able to quickly determine its not yours without much effect on shared > locks, no? Normally, yes. However tg3 grabs a lock just about anytime it does anything. ;-) A long term project of mine is to slowly remove these locks, but that must wait until the driver stabilizes, and is overall a long process. Most of the locks _are_ removeable, but we keep hit deadlock bugs like this, and hardware bugs which need workarounds, so those come first. >>All of them could occur, and if they take a spinlock that is held across >>netif_receive_skb(), then it can deadlock. >> > > > yes this could happen with method 1 of programming the driver; however, > tx, receive, link are essentially separate threads and would hardly share > locks. They do in tg3's case. The locks can be removed eventually, but such is the state of life right now. Jeff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: NAPI note 2003-02-19 2:53 ` jamal 2003-02-19 3:14 ` Jeff Garzik @ 2003-02-19 3:18 ` David S. Miller 1 sibling, 0 replies; 6+ messages in thread From: David S. Miller @ 2003-02-19 3:18 UTC (permalink / raw) To: hadi; +Cc: manfred, jgarzik, zaitcev, jbourne, netdev From: jamal <hadi@cyberus.ca> Date: Tue, 18 Feb 2003 21:53:12 -0500 (EST) Theres other drivers which dont follow this rule and just shutdown all interupt sources. I know that the e1000 for example does this. I am not sure about the tg3. I think the doc says this but may not emphasize it as strongly. So if tg3 uses method 2 then its as Dave says - a bug. Right, but I forgot that it's not a bug in the shared interrupt case where we need to grab a lock to access the hardware and fetch the interrupt status. Shared interupts should be interesting actually. However if you are in poll mode and you receive an interupt you should be able to quickly determine its not yours without much effect on shared locks, no? As Jeff has responded already, often you do need locks to do this sanely. In tg3, it's really complicated even though the chip writes the interrupt status to a piece of memory shared with the cpu. Any time you want to enable/disable tg3 chip interrupts you must flip and/or check a status bit in this piece of memory. So it really needs a lock. I personally think Jeff is overly optimistic about lock removal in the driver. :-) The last time I attempted to be clever here, we ended up with all sorts of deadlocks in tg3. It requires real brain time and heavy testing to make any kinds of changes in this area. I also don't want to accomplish this by splitting up into seperate lines of development of tg3, that's nuts as it would split up our testing and make both lines get less testing than a unified mainline driver would (which is what we happily do now). ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2003-02-19 3:18 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <3E4D66DF.3040800@colorfullife.com>
2003-02-14 23:58 ` NAPI note (was Re: lockups with 2.4.20 (tg3? net/core/dev.c|deliver_to_old_ones)) Jeff Garzik
2003-02-18 2:57 ` NAPI note David S. Miller
2003-02-18 16:31 ` Manfred Spraul
2003-02-19 2:53 ` jamal
2003-02-19 3:14 ` Jeff Garzik
2003-02-19 3:18 ` David S. Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).