netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* REGRESSION: panic on e1000 driver
@ 2007-05-30 21:22 Doug Chapman
  2007-05-30 22:57 ` Kok, Auke
  2007-05-31  1:08 ` Herbert Xu
  0 siblings, 2 replies; 17+ messages in thread
From: Doug Chapman @ 2007-05-30 21:22 UTC (permalink / raw)
  To: doug.chapman, herbert, auke-jan.h.kok, netdev, e1000-devel

All,

I reported this a few weeks ago and it was fixed but it appears the
offending code was again re-submitted.  This causes a panic on HP
Integrity servers and from what I hear many other platforms using e1000
as well.

My original report was via kernel.org BZ:
http://bugzilla.kernel.org/show_bug.cgi?id=8455

I am seeing the same panic with a kernel built from today's git pull.


It was fixed by this commit:
3e1657c8ef53e1cd541cc1e420f3230dc075949b


but once again broken just yesterday by the following commit.  I have
backed just this commit out and verified I no longer panic.


commit 47313054352b879a2bc65379d55b05f48a0af7ec
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Tue May 29 15:07:31 2007 -0700

    e1000: restore netif_poll_enable call but make sure IRQs are off

    This restores the previously removed netif_poll_enable call in e1000_open.
    It's needed on all but the first call to e1000_open for a NIC as
    e1000_close always calls netif_poll_disable.

    netif_poll_enable can only be called safely if no polls have been
    scheduled.  This should be the case as long as we don't enter our IRQ
    handler.

    In order to guarantee this we explicitly disable IRQs as early as possible
    when we're probing the NIC.

    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: "Kok, Auke" <auke-jan.h.kok@intel.com>
    Cc: Jeff Garzik <jeff@garzik.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-30 21:22 REGRESSION: panic on e1000 driver Doug Chapman
@ 2007-05-30 22:57 ` Kok, Auke
  2007-05-30 23:31   ` Doug Chapman
  2007-05-31  0:51   ` Herbert Xu
  2007-05-31  1:08 ` Herbert Xu
  1 sibling, 2 replies; 17+ messages in thread
From: Kok, Auke @ 2007-05-30 22:57 UTC (permalink / raw)
  To: Doug Chapman, herbert; +Cc: e1000-devel, netdev, auke-jan.h.kok

Doug Chapman wrote:
> All,
> 
> I reported this a few weeks ago and it was fixed but it appears the
> offending code was again re-submitted.  This causes a panic on HP
> Integrity servers and from what I hear many other platforms using e1000
> as well.
> 
> My original report was via kernel.org BZ:
> http://bugzilla.kernel.org/show_bug.cgi?id=8455
> 
> I am seeing the same panic with a kernel built from today's git pull.
> 
> 
> It was fixed by this commit:
> 3e1657c8ef53e1cd541cc1e420f3230dc075949b
> 
> 
> but once again broken just yesterday by the following commit.  I have
> backed just this commit out and verified I no longer panic.
> 
> 
> commit 47313054352b879a2bc65379d55b05f48a0af7ec
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Tue May 29 15:07:31 2007 -0700
> 
>     e1000: restore netif_poll_enable call but make sure IRQs are off

Hmm, we're making a mess of it.

Herbert, wouldn't it just have been a lot easier to do just add a 
netif_poll_disable in e1000_probe, so that any and all other poll 
enable/disables are symmetric ? Something like this?

Auke


PS: apologies for the whitespace damage... it's a cut'n'paste
---

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 49be393..a8a3c5f 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1149,6 +1149,7 @@ e1000_probe(struct pci_dev *pdev,
         /* tell the stack to leave us alone until e1000_open() is called */
         netif_carrier_off(netdev);
         netif_stop_queue(netdev);
+       netif_poll_disable(netdev);

         DPRINTK(PROBE, INFO, "Intel(R) PRO/1000 Network Connection\n");

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-30 22:57 ` Kok, Auke
@ 2007-05-30 23:31   ` Doug Chapman
  2007-05-31  0:51   ` Herbert Xu
  1 sibling, 0 replies; 17+ messages in thread
From: Doug Chapman @ 2007-05-30 23:31 UTC (permalink / raw)
  To: Kok, Auke; +Cc: e1000-devel, netdev, herbert, doug.chapman

On Wed, 2007-05-30 at 15:57 -0700, Kok, Auke wrote:
> Doug Chapman wrote:
> > All,
> > 
> > I reported this a few weeks ago and it was fixed but it appears the
> > offending code was again re-submitted.  This causes a panic on HP
> > Integrity servers and from what I hear many other platforms using e1000
> > as well.
> > 
> > My original report was via kernel.org BZ:
> > http://bugzilla.kernel.org/show_bug.cgi?id=8455
> > 
> > I am seeing the same panic with a kernel built from today's git pull.
> > 
> > 
> > It was fixed by this commit:
> > 3e1657c8ef53e1cd541cc1e420f3230dc075949b
> > 
> > 
> > but once again broken just yesterday by the following commit.  I have
> > backed just this commit out and verified I no longer panic.
> > 
> > 
> > commit 47313054352b879a2bc65379d55b05f48a0af7ec
> > Author: Herbert Xu <herbert@gondor.apana.org.au>
> > Date:   Tue May 29 15:07:31 2007 -0700
> > 
> >     e1000: restore netif_poll_enable call but make sure IRQs are off
> 
> Hmm, we're making a mess of it.
> 
> Herbert, wouldn't it just have been a lot easier to do just add a 
> netif_poll_disable in e1000_probe, so that any and all other poll 
> enable/disables are symmetric ? Something like this?
> 
> Auke
> 
> 
> PS: apologies for the whitespace damage... it's a cut'n'paste
> ---
> 
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> 
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 49be393..a8a3c5f 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -1149,6 +1149,7 @@ e1000_probe(struct pci_dev *pdev,
>          /* tell the stack to leave us alone until e1000_open() is called */
>          netif_carrier_off(netdev);
>          netif_stop_queue(netdev);
> +       netif_poll_disable(netdev);
> 
>          DPRINTK(PROBE, INFO, "Intel(R) PRO/1000 Network Connection\n");

Auke,

I tested this and it does indeed fix the panic on my systems.  I will
leave the decision as to if it is the "right" thing to do to those who
are more knowledgeable with the code.

- Doug



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-30 22:57 ` Kok, Auke
  2007-05-30 23:31   ` Doug Chapman
@ 2007-05-31  0:51   ` Herbert Xu
  2007-05-31  4:51     ` Kok, Auke
  1 sibling, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31  0:51 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Doug Chapman, netdev, e1000-devel

On Wed, May 30, 2007 at 03:57:13PM -0700, Kok, Auke wrote:
> 
> Hmm, we're making a mess of it.

Indeed :)

> Herbert, wouldn't it just have been a lot easier to do just add a 
> netif_poll_disable in e1000_probe, so that any and all other poll 
> enable/disables are symmetric ? Something like this?

I wish if it were as simple as that.  As soon as register_netdev
returns somebody else can invoke e1000_open so disabling poll
here can be undesirable.  In fact the existing netif_stop_queue
and netif_carrier_off calls are also bad for the same reason.

I'll keep looking to see if there's some other problem that we
haven't uncovered yet.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-30 21:22 REGRESSION: panic on e1000 driver Doug Chapman
  2007-05-30 22:57 ` Kok, Auke
@ 2007-05-31  1:08 ` Herbert Xu
  2007-05-31 15:16   ` Doug Chapman
  1 sibling, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31  1:08 UTC (permalink / raw)
  To: Doug Chapman; +Cc: auke-jan.h.kok, netdev, e1000-devel

On Wed, May 30, 2007 at 05:22:30PM -0400, Doug Chapman wrote:
> 
> but once again broken just yesterday by the following commit.  I have
> backed just this commit out and verified I no longer panic.

Hmm, the only way I can see this happening is if the hardware signals
an interrupt even though we've explicitly shut it off.  Does this patch
help? If it does help, does it produce the warning in dmesg?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 9ec35b7..c6cf82f 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -625,15 +625,15 @@ e1000_down(struct e1000_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
 
-	/* signal that we're down so the interrupt handler does not
-	 * reschedule our watchdog timer */
-	set_bit(__E1000_DOWN, &adapter->flags);
-
 #ifdef CONFIG_E1000_NAPI
 	netif_poll_disable(netdev);
 #endif
 	e1000_irq_disable(adapter);
 
+	/* signal that we're down so the interrupt handler does not
+	 * reschedule our watchdog timer */
+	set_bit(__E1000_DOWN, &adapter->flags);
+
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
@@ -1427,6 +1427,10 @@ e1000_open(struct net_device *netdev)
 	 * clean_rx handler before we do so.  */
 	e1000_configure(adapter);
 
+#ifdef CONFIG_E1000_NAPI
+	netif_poll_enable(netdev);
+#endif
+
 	err = e1000_request_irq(adapter);
 	if (err)
 		goto err_req_irq;
@@ -1434,10 +1438,6 @@ e1000_open(struct net_device *netdev)
 	/* From here on the code is the same as e1000_up() */
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-#ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(netdev);
-#endif
-
 	e1000_irq_enable(adapter);
 
 	/* fire a link status change interrupt to start the watchdog */
@@ -3756,6 +3756,11 @@ e1000_intr_msi(int irq, void *data)
 #endif
 	uint32_t icr = E1000_READ_REG(hw, ICR);
 
+	if (unlikely(test_bit(__E1000_DOWN, &adapter->flags) &&
+		     atomic_read(&adapter->irq_sem)))
+		printk(KERN_WARNING "e1000: Unexpected MSI icr=0x%x\n",
+		       icr);
+
 #ifdef CONFIG_E1000_NAPI
 	/* read ICR disables interrupts using IAM, so keep up with our
 	 * enable/disable accounting */
@@ -3823,6 +3828,11 @@ e1000_intr(int irq, void *data)
 	if (unlikely(!icr))
 		return IRQ_NONE;  /* Not our interrupt */
 
+	if (unlikely(test_bit(__E1000_DOWN, &adapter->flags) &&
+		     atomic_read(&adapter->irq_sem)))
+		printk(KERN_WARNING "e1000: Unexpected interrupt icr=0x%x\n",
+		       icr);
+
 #ifdef CONFIG_E1000_NAPI
 	/* IMS will not auto-mask if INT_ASSERTED is not set, and if it is
 	 * not set, then the adapter didn't send an interrupt */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31  0:51   ` Herbert Xu
@ 2007-05-31  4:51     ` Kok, Auke
  2007-05-31  5:06       ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Kok, Auke @ 2007-05-31  4:51 UTC (permalink / raw)
  To: Herbert Xu; +Cc: e1000-devel, netdev, Doug Chapman

Herbert Xu wrote:
> On Wed, May 30, 2007 at 03:57:13PM -0700, Kok, Auke wrote:
>> Hmm, we're making a mess of it.
> 
> Indeed :)
> 
>> Herbert, wouldn't it just have been a lot easier to do just add a 
>> netif_poll_disable in e1000_probe, so that any and all other poll 
>> enable/disables are symmetric ? Something like this?
> 
> I wish if it were as simple as that.  As soon as register_netdev
> returns somebody else can invoke e1000_open so disabling poll
> here can be undesirable.  In fact the existing netif_stop_queue
> and netif_carrier_off calls are also bad for the same reason.

this has been an age-old confusion that I never grasped either, so I perfectly 
understand why you added the explicit e1000_disable_irq call in the other patch 
(and think thats a great idea). But really, there should be a way for a driver 
to tell the stack that it should really keep it's hands off :)

BTW e1000 currently triggers a single irq manually in the watchdog as link goes 
up, so that might be the one that is giving problems now. In any case I can't 
reproduce any of it - perhaps my hardware is too fast. Time to whip out the pIII :o

Auke

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31  4:51     ` Kok, Auke
@ 2007-05-31  5:06       ` Herbert Xu
  2007-05-31 14:54         ` Kok, Auke
  0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31  5:06 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Doug Chapman, netdev, e1000-devel

On Wed, May 30, 2007 at 09:51:14PM -0700, Kok, Auke wrote:
> 
> this has been an age-old confusion that I never grasped either, so I 
> perfectly understand why you added the explicit e1000_disable_irq call in 
> the other patch (and think thats a great idea). But really, there should be 
> a way for a driver to tell the stack that it should really keep it's hands 
> off :)

Well yes, you can get the stack to keep away by not registering your
device :)

> BTW e1000 currently triggers a single irq manually in the watchdog as link 
> goes up, so that might be the one that is giving problems now. In any case 
> I can't reproduce any of it - perhaps my hardware is too fast. Time to whip 
> out the pIII :o

Hmm, if it's triggered by the watchdog then that means the watchdog has
been scheduled.  However, it seems that the only way to schedule it is
through an interrupt?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31  5:06       ` Herbert Xu
@ 2007-05-31 14:54         ` Kok, Auke
  2007-05-31 21:49           ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Kok, Auke @ 2007-05-31 14:54 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Doug Chapman, netdev, e1000-devel

Herbert Xu wrote:
> On Wed, May 30, 2007 at 09:51:14PM -0700, Kok, Auke wrote:
>> this has been an age-old confusion that I never grasped either, so I 
>> perfectly understand why you added the explicit e1000_disable_irq call in 
>> the other patch (and think thats a great idea). But really, there should be 
>> a way for a driver to tell the stack that it should really keep it's hands 
>> off :)
> 
> Well yes, you can get the stack to keep away by not registering your
> device :)

*blunt*

so how about calling netif_poll_disable() before we register the net_device?

>> BTW e1000 currently triggers a single irq manually in the watchdog as link 
>> goes up, so that might be the one that is giving problems now. In any case 
>> I can't reproduce any of it - perhaps my hardware is too fast. Time to whip 
>> out the pIII :o
> 
> Hmm, if it's triggered by the watchdog then that means the watchdog has
> been scheduled.  However, it seems that the only way to schedule it is
> through an interrupt?

well no, if we make the watchdog (this is something I've already implemented 
locally and -mm has it for instance) run as delayed work we can just schedule a 
watchdog run instead of firing an interrupt.

I'm just not sure that would relieve the situation

Auke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31  1:08 ` Herbert Xu
@ 2007-05-31 15:16   ` Doug Chapman
  2007-05-31 15:23     ` [E1000-devel] " Kok, Auke
  2007-05-31 22:10     ` Herbert Xu
  0 siblings, 2 replies; 17+ messages in thread
From: Doug Chapman @ 2007-05-31 15:16 UTC (permalink / raw)
  To: Herbert Xu; +Cc: e1000-devel, netdev, auke-jan.h.kok, doug.chapman

On Thu, 2007-05-31 at 11:08 +1000, Herbert Xu wrote:
> On Wed, May 30, 2007 at 05:22:30PM -0400, Doug Chapman wrote:
> > 
> > but once again broken just yesterday by the following commit.  I have
> > backed just this commit out and verified I no longer panic.
> 
> Hmm, the only way I can see this happening is if the hardware signals
> an interrupt even though we've explicitly shut it off.  Does this patch
> help? If it does help, does it produce the warning in dmesg?
> 
> Thanks,


The patch does fix the panic.  Here are all the e1000 messages from
dmesg after a bootup.  Note that only eth0 is actually configured and
connected:


# dmesg | grep e1000
e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: Unexpected interrupt icr=0x4
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: RX


- Doug



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [E1000-devel] REGRESSION: panic on e1000 driver
  2007-05-31 15:16   ` Doug Chapman
@ 2007-05-31 15:23     ` Kok, Auke
  2007-05-31 21:48       ` Herbert Xu
  2007-05-31 22:10     ` Herbert Xu
  1 sibling, 1 reply; 17+ messages in thread
From: Kok, Auke @ 2007-05-31 15:23 UTC (permalink / raw)
  To: Doug Chapman, Herbert Xu; +Cc: e1000-devel, netdev

Doug Chapman wrote:
> On Thu, 2007-05-31 at 11:08 +1000, Herbert Xu wrote:
>> On Wed, May 30, 2007 at 05:22:30PM -0400, Doug Chapman wrote:
>>> but once again broken just yesterday by the following commit.  I have
>>> backed just this commit out and verified I no longer panic.
>> Hmm, the only way I can see this happening is if the hardware signals
>> an interrupt even though we've explicitly shut it off.  Does this patch
>> help? If it does help, does it produce the warning in dmesg?
>>
>> Thanks,
> 
> 
> The patch does fix the panic.  Here are all the e1000 messages from
> dmesg after a bootup.  Note that only eth0 is actually configured and
> connected:
> 
> 
> # dmesg | grep e1000
> e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
> e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
> e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
> e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
> e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: Unexpected interrupt icr=0x4

that's indeed the link status change we fire manually to trigger the watchdog run:

#define E1000_ICR_LSC           0x00000004 /* Link Status Change */

Auke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [E1000-devel] REGRESSION: panic on e1000 driver
  2007-05-31 15:23     ` [E1000-devel] " Kok, Auke
@ 2007-05-31 21:48       ` Herbert Xu
  2007-05-31 22:10         ` Kok, Auke
  0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31 21:48 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Doug Chapman, e1000-devel, netdev

On Thu, May 31, 2007 at 08:23:24AM -0700, Kok, Auke wrote:
>
> >e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
> >e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
> >e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
> >e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> >e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
> >e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
> >e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
> >e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
> >e1000: Unexpected interrupt icr=0x4
> 
> that's indeed the link status change we fire manually to trigger the 
> watchdog run:
> 
> #define E1000_ICR_LSC           0x00000004 /* Link Status Change */

This still makes no sense.  The only triggers I can find for this occur
after e1000_irq_enable.  So unless we've got a problem with memory
barriers we shouldn't get the above printk.

Is there another trigger that happens earlier?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31 14:54         ` Kok, Auke
@ 2007-05-31 21:49           ` Herbert Xu
  2007-05-31 22:11             ` Kok, Auke
  0 siblings, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31 21:49 UTC (permalink / raw)
  To: Kok, Auke; +Cc: Doug Chapman, netdev, e1000-devel

On Thu, May 31, 2007 at 07:54:31AM -0700, Kok, Auke wrote:
> 
> so how about calling netif_poll_disable() before we register the net_device?

Yes that should work.  Let's move the other two netif_ calls while we're
at it.
 
> well no, if we make the watchdog (this is something I've already 
> implemented locally and -mm has it for instance) run as delayed work we can 
> just schedule a watchdog run instead of firing an interrupt.

OK.  The only thing I'm worried about is if we're uncovering some
underlying problem with the platform with respect to memory barriers
then we should get to the bottom of it first before making this problem
disappear.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31 15:16   ` Doug Chapman
  2007-05-31 15:23     ` [E1000-devel] " Kok, Auke
@ 2007-05-31 22:10     ` Herbert Xu
  2007-05-31 22:38       ` Doug Chapman
  1 sibling, 1 reply; 17+ messages in thread
From: Herbert Xu @ 2007-05-31 22:10 UTC (permalink / raw)
  To: Doug Chapman; +Cc: auke-jan.h.kok, netdev, e1000-devel

On Thu, May 31, 2007 at 11:16:09AM -0400, Doug Chapman wrote:
> 
> # dmesg | grep e1000
> e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
> e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
> e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
> e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
> e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
> e1000: Unexpected interrupt icr=0x4
> e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: RX

OK that's interesting.

Does this patch (please unapply the last one first) make the printk
go away or does it trigger a backtrace?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 9ec35b7..167de10 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -625,15 +625,15 @@ e1000_down(struct e1000_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
 
-	/* signal that we're down so the interrupt handler does not
-	 * reschedule our watchdog timer */
-	set_bit(__E1000_DOWN, &adapter->flags);
-
 #ifdef CONFIG_E1000_NAPI
 	netif_poll_disable(netdev);
 #endif
 	e1000_irq_disable(adapter);
 
+	/* signal that we're down so the interrupt handler does not
+	 * reschedule our watchdog timer */
+	set_bit(__E1000_DOWN, &adapter->flags);
+
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
@@ -1328,6 +1328,7 @@ e1000_sw_init(struct e1000_adapter *adapter)
 	/* Explicitly disable IRQ since the NIC can be in any state. */
 	atomic_set(&adapter->irq_sem, 0);
 	e1000_irq_disable(adapter);
+	WARN_ON(E1000_READ_REG(&adapter->hw, ICR));
 
 	spin_lock_init(&adapter->stats_lock);
 
@@ -1427,6 +1428,10 @@ e1000_open(struct net_device *netdev)
 	 * clean_rx handler before we do so.  */
 	e1000_configure(adapter);
 
+#ifdef CONFIG_E1000_NAPI
+	netif_poll_enable(netdev);
+#endif
+
 	err = e1000_request_irq(adapter);
 	if (err)
 		goto err_req_irq;
@@ -1434,9 +1439,7 @@ e1000_open(struct net_device *netdev)
 	/* From here on the code is the same as e1000_up() */
 	clear_bit(__E1000_DOWN, &adapter->flags);
 
-#ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(netdev);
-#endif
+	WARN_ON(E1000_READ_REG(&adapter->hw, ICR));
 
 	e1000_irq_enable(adapter);
 
@@ -3756,6 +3759,13 @@ e1000_intr_msi(int irq, void *data)
 #endif
 	uint32_t icr = E1000_READ_REG(hw, ICR);
 
+	if (unlikely(test_bit(__E1000_DOWN, &adapter->flags))) {
+		mb();
+		if (unlikely(atomic_read(&adapter->irq_sem)))
+			printk(KERN_WARNING "e1000: Unexpected MSI icr=0x%x\n",
+			       icr);
+	}
+
 #ifdef CONFIG_E1000_NAPI
 	/* read ICR disables interrupts using IAM, so keep up with our
 	 * enable/disable accounting */
@@ -3823,6 +3833,14 @@ e1000_intr(int irq, void *data)
 	if (unlikely(!icr))
 		return IRQ_NONE;  /* Not our interrupt */
 
+
+	if (unlikely(test_bit(__E1000_DOWN, &adapter->flags))) {
+		mb();
+		if (unlikely(atomic_read(&adapter->irq_sem)))
+			printk(KERN_WARNING
+			       "e1000: Unexpected interrupt icr=0x%x\n", icr);
+	}
+
 #ifdef CONFIG_E1000_NAPI
 	/* IMS will not auto-mask if INT_ASSERTED is not set, and if it is
 	 * not set, then the adapter didn't send an interrupt */

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [E1000-devel] REGRESSION: panic on e1000 driver
  2007-05-31 21:48       ` Herbert Xu
@ 2007-05-31 22:10         ` Kok, Auke
  0 siblings, 0 replies; 17+ messages in thread
From: Kok, Auke @ 2007-05-31 22:10 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Doug Chapman, e1000-devel, netdev

Herbert Xu wrote:
> On Thu, May 31, 2007 at 08:23:24AM -0700, Kok, Auke wrote:
>>> e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
>>> e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
>>> e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
>>> e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
>>> e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
>>> e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
>>> e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
>>> e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
>>> e1000: Unexpected interrupt icr=0x4
>> that's indeed the link status change we fire manually to trigger the 
>> watchdog run:
>>
>> #define E1000_ICR_LSC           0x00000004 /* Link Status Change */
> 
> This still makes no sense.  The only triggers I can find for this occur
> after e1000_irq_enable.  So unless we've got a problem with memory
> barriers we shouldn't get the above printk.
> 
> Is there another trigger that happens earlier?

none that I know....

Auke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31 21:49           ` Herbert Xu
@ 2007-05-31 22:11             ` Kok, Auke
  0 siblings, 0 replies; 17+ messages in thread
From: Kok, Auke @ 2007-05-31 22:11 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Doug Chapman, netdev, e1000-devel

Herbert Xu wrote:
> On Thu, May 31, 2007 at 07:54:31AM -0700, Kok, Auke wrote:
>> so how about calling netif_poll_disable() before we register the net_device?
> 
> Yes that should work.  Let's move the other two netif_ calls while we're
> at it.
>  
>> well no, if we make the watchdog (this is something I've already 
>> implemented locally and -mm has it for instance) run as delayed work we can 
>> just schedule a watchdog run instead of firing an interrupt.
> 
> OK.  The only thing I'm worried about is if we're uncovering some
> underlying problem with the platform with respect to memory barriers
> then we should get to the bottom of it first before making this problem
> disappear.

agreed. I will try to send a patch to do this later today.

Auke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31 22:10     ` Herbert Xu
@ 2007-05-31 22:38       ` Doug Chapman
  2007-06-01  0:34         ` Herbert Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Doug Chapman @ 2007-05-31 22:38 UTC (permalink / raw)
  To: Herbert Xu; +Cc: auke-jan.h.kok, netdev, e1000-devel

On Fri, 2007-06-01 at 08:10 +1000, Herbert Xu wrote:
> On Thu, May 31, 2007 at 11:16:09AM -0400, Doug Chapman wrote:
> > 
> > # dmesg | grep e1000
> > e1000: 0000:01:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f8
> > e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
> > e1000: 0000:01:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:15:60:04:d7:f9
> > e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
> > e1000: 0000:15:02.0: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c4
> > e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
> > e1000: 0000:15:02.1: e1000_probe: (PCI-X:66MHz:64-bit) 00:12:79:9e:b7:c5
> > e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
> > e1000: Unexpected interrupt icr=0x4
> > e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow
> > Control: RX
> 
> OK that's interesting.
> 
> Does this patch (please unapply the last one first) make the printk
> go away or does it trigger a backtrace?
> 
> Thanks,

I get a backtrace as it probes each e1000 device and I also still get
the unexpected interrupt message.


WARNING: at drivers/net/e1000/e1000_main.c:1331 e1000_sw_init()

Call Trace:
 [<a000000100013c20>] show_stack+0x40/0xa0
                                sp=e0000017e50c7b30 bsp=e0000017e50c1120
 [<a000000100013cb0>] dump_stack+0x30/0x60
                                sp=e0000017e50c7d00 bsp=e0000017e50c1108
 [<a000000236c47a40>] e1000_probe+0xce0/0x1ee0 [e1000]
                                sp=e0000017e50c7d00 bsp=e0000017e50c1098
 [<a0000001002d5540>] pci_device_probe+0x260/0x3a0
                                sp=e0000017e50c7d10 bsp=e0000017e50c1050
 [<a0000001003e2d10>] driver_probe_device+0x250/0x380
                                sp=e0000017e50c7de0 bsp=e0000017e50c1010
 [<a0000001003e3120>] __driver_attach+0x100/0x1a0
                                sp=e0000017e50c7de0 bsp=e0000017e50c0fd8
 [<a0000001003e10c0>] bus_for_each_dev+0x80/0x100
                                sp=e0000017e50c7de0 bsp=e0000017e50c0fa0
 [<a0000001003e28e0>] driver_attach+0x40/0x60
                                sp=e0000017e50c7e00 bsp=e0000017e50c0f80
 [<a0000001003e18b0>] bus_add_driver+0xf0/0x3c0
                                sp=e0000017e50c7e00 bsp=e0000017e50c0f40
 [<a0000001003e3680>] driver_register+0x140/0x160
                                sp=e0000017e50c7e00 bsp=e0000017e50c0f20
 [<a0000001002d5a10>] __pci_register_driver+0xd0/0x160
                                sp=e0000017e50c7e00 bsp=e0000017e50c0ee8
 [<a0000002363400b0>] e1000_init_module+0xb0/0x1b0 [e1000]
                                sp=e0000017e50c7e00 bsp=e0000017e50c0ec8
 [<a0000001000cd440>] sys_init_module+0x2dc0/0x31a0
                                sp=e0000017e50c7e00 bsp=e0000017e50c0d48
 [<a00000010000bae0>] ia64_ret_from_syscall+0x0/0x20
                                sp=e0000017e50c7e30 bsp=e0000017e50c0d48


- Doug



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: REGRESSION: panic on e1000 driver
  2007-05-31 22:38       ` Doug Chapman
@ 2007-06-01  0:34         ` Herbert Xu
  0 siblings, 0 replies; 17+ messages in thread
From: Herbert Xu @ 2007-06-01  0:34 UTC (permalink / raw)
  To: Doug Chapman; +Cc: e1000-devel, netdev, auke-jan.h.kok

On Thu, May 31, 2007 at 06:38:28PM -0400, Doug Chapman wrote:
> 
> I get a backtrace as it probes each e1000 device and I also still get
> the unexpected interrupt message.
> 
> 
> WARNING: at drivers/net/e1000/e1000_main.c:1331 e1000_sw_init()

Thanks for testing!

Although I still don't know what caused the interrupt in your case,
it is clear that we need to be able to deal with interrupts as soon
as the handler is registered since the cause register is not affected
by e1000_irq_disable and a shared interrupt can easily be mistaken as
our own.

So Auke's solution of doing netif_poll_disable should fix this problem.

In looking at this I've found a couple of other problems:

1) Race between IRQ handler and e1000_open:

A shared/spurious interrupt can cause this:

CPU0				CPU1
e1000_open
	request_irq
				spurious/shared IRQ
				e1000_interrupt
	e1000_irq_enable
		atomic_dec_*
					atomic_inc
					IMC <- ~0
		IMS <- MASK

So we end up with IRQs enabled when they shouldn't be.

2) Race between IRQ handler and e1000_clean (and other mgmt functions):

Again shared/spurious interrupts may cause problems:

CPU0				CPU1
e1000_clean
	do work
				spurious/shared IRQ
				e1000_interrupt
					clear ICR
					netif_rx_schedule_prep fails
					e1000_irq_enable
	netif_rx_complete
	e1000_irq_enable

At this point IRQs are on but we've lost an interrupt.

We can fix this by

1) Ignoring IRQs when irq_sem > 0.
2) Always generate an IRQ after e1000_irq_enable.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-06-01  0:34 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-30 21:22 REGRESSION: panic on e1000 driver Doug Chapman
2007-05-30 22:57 ` Kok, Auke
2007-05-30 23:31   ` Doug Chapman
2007-05-31  0:51   ` Herbert Xu
2007-05-31  4:51     ` Kok, Auke
2007-05-31  5:06       ` Herbert Xu
2007-05-31 14:54         ` Kok, Auke
2007-05-31 21:49           ` Herbert Xu
2007-05-31 22:11             ` Kok, Auke
2007-05-31  1:08 ` Herbert Xu
2007-05-31 15:16   ` Doug Chapman
2007-05-31 15:23     ` [E1000-devel] " Kok, Auke
2007-05-31 21:48       ` Herbert Xu
2007-05-31 22:10         ` Kok, Auke
2007-05-31 22:10     ` Herbert Xu
2007-05-31 22:38       ` Doug Chapman
2007-06-01  0:34         ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).