public inbox for linux-kernel@vger.kernel.org
* [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
@ 2005-06-28 23:58 Linas Vepstas
  2005-06-29  1:46 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Linas Vepstas @ 2005-06-28 23:58 UTC (permalink / raw)
  To: linux-kernel, Benjamin Herrenschmidt, long
  Cc: Hidetoshi Seto, Greg KH, ak, Paul Mackerras, linuxppc64-dev,
	linux-pci, johnrose

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]


pci-err-4-e100.patch

Adds PCI error recovery callbacks to the Intel E100 ethernet device driver.
Lightly tested on an E100 two-port card. 

Signed-off-by: Linas Vepstas <linas@linas.org>

[-- Attachment #2: pci-err-4-e100.patch --]
[-- Type: text/plain, Size: 3505 bytes --]

--- linux-2.6.12-git10/drivers/net/e100.c.linas-orig	2005-06-17 14:48:29.000000000 -0500
+++ linux-2.6.12-git10/drivers/net/e100.c	2005-06-22 17:18:26.000000000 -0500
@@ -2460,6 +2460,67 @@ static void e100_shutdown(struct device 
 #endif
 }
 
+#ifdef CONFIG_E100_EEH_RECOVERY
+
+/** e100_io_error_detected() is called when a PCI error is detected. */
+static int e100_io_error_detected(struct pci_dev *pdev, enum pci_channel_state state)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct nic *nic = netdev_priv(netdev);
+
+	mod_timer(&nic->watchdog, jiffies + 30*HZ);
+	e100_down(nic);
+
+	/* Request a slot reset. */
+	return PCIERR_RESULT_NEED_RESET;
+}
+
+/** e100_io_slot_reset() is called after the PCI bus has been reset.
+ *  Restart the card from scratch. */
+static int e100_io_slot_reset(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct nic *nic = netdev_priv(netdev);
+
+	if (pci_enable_device(pdev)) {
+		printk(KERN_ERR "e100: Cannot re-enable PCI device after reset.\n");
+		return PCIERR_RESULT_DISCONNECT;
+	}
+	pci_set_master(pdev);
+
+	/* Only function 0 of the card performs the hardware reset */
+	if (PCI_FUNC(pdev->devfn) != 0)
+		return PCIERR_RESULT_RECOVERED;
+
+	e100_hw_reset(nic);
+	e100_phy_init(nic);
+
+	if (e100_hw_init(nic)) {
+		DPRINTK(HW, ERR, "e100_hw_init failed\n");
+		return PCIERR_RESULT_DISCONNECT;
+	}
+
+	return PCIERR_RESULT_RECOVERED;
+}
+
+/** e100_io_resume() is called when the error recovery driver
+ *  tells us that it's OK to resume normal operation.
+ */
+static void e100_io_resume(struct pci_dev *pdev)
+{
+	struct net_device *netdev = pci_get_drvdata(pdev);
+	struct nic *nic = netdev_priv(netdev);
+
+	/* ack any pending wake events, disable PME */
+	pci_enable_wake(pdev, 0, 0);
+
+	netif_device_attach(netdev);
+	if (netif_running(netdev))
+		e100_open(netdev);
+
+	mod_timer(&nic->watchdog, jiffies);
+}
+#endif /* CONFIG_E100_EEH_RECOVERY */
 
 static struct pci_driver e100_driver = {
 	.name =         DRV_NAME,
@@ -2470,6 +2531,13 @@ static struct pci_driver e100_driver = {
 	.suspend =      e100_suspend,
 	.resume =       e100_resume,
 #endif
+#ifdef CONFIG_E100_EEH_RECOVERY
+	.err_handler = {
+		.error_detected = e100_io_error_detected,
+		.slot_reset = e100_io_slot_reset,
+		.resume = e100_io_resume,
+	},
+#endif /* CONFIG_E100_EEH_RECOVERY */
 
 	.driver = {
 		.shutdown = e100_shutdown,
--- linux-2.6.12-git10/drivers/net/Kconfig.linas-orig	2005-06-22 15:26:13.000000000 -0500
+++ linux-2.6.12-git10/drivers/net/Kconfig	2005-06-22 15:28:29.000000000 -0500
@@ -1392,6 +1392,14 @@ config E100
 	  <file:Documentation/networking/net-modules.txt>.  The module
 	  will be called e100.
 
+config E100_EEH_RECOVERY
+	bool "Enable PCI bus error recovery"
+	depends on E100 && PPC_PSERIES
+	help
+	  If you say Y here, the driver will be able to recover from
+	  PCI bus errors on many PowerPC platforms. IBM pSeries users
+	  should answer Y.
+
 config LNE390
 	tristate "Mylex EISA LNE390A/B support (EXPERIMENTAL)"
 	depends on NET_PCI && EISA && EXPERIMENTAL
--- linux-2.6.12-git10/arch/ppc64/configs/pSeries_defconfig.linas-orig	2005-06-17 14:48:29.000000000 -0500
+++ linux-2.6.12-git10/arch/ppc64/configs/pSeries_defconfig	2005-06-22 15:30:33.000000000 -0500
@@ -545,6 +545,7 @@ CONFIG_PCNET32=y
 # CONFIG_DGRS is not set
 # CONFIG_EEPRO100 is not set
 CONFIG_E100=y
+CONFIG_E100_EEH_RECOVERY=y
 # CONFIG_FEALNX is not set
 # CONFIG_NATSEMI is not set
 # CONFIG_NE2K_PCI is not set
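
The order in which the three callbacks in this patch are expected to run can be modelled in plain user-space C. This is only a sketch, not kernel code: every `model_*` and `MODEL_*` name is hypothetical, the result codes merely echo the PCIERR_RESULT_* names used in the patch, and the function-0 check for multi-port cards is left out. It assumes the recovery core calls error_detected first, then slot_reset if a reset was requested, then resume only if the reset reported success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Result codes named after those in the patch; the values are arbitrary. */
enum model_result {
	MODEL_RESULT_NEED_RESET,
	MODEL_RESULT_RECOVERED,
	MODEL_RESULT_DISCONNECT,
};

/* Hypothetical stand-in for the driver's per-device state. */
struct model_dev {
	bool enable_ok;		/* does re-enabling the device succeed? */
	bool hw_init_ok;	/* does hardware init succeed after reset? */
	bool running;		/* is the interface passing traffic? */
};

/* Hypothetical callback table, shaped like .err_handler in the patch. */
struct model_err_handler {
	enum model_result (*error_detected)(struct model_dev *dev);
	enum model_result (*slot_reset)(struct model_dev *dev);
	void (*resume)(struct model_dev *dev);
};

static enum model_result model_error_detected(struct model_dev *dev)
{
	dev->running = false;			/* quiesce the interface */
	return MODEL_RESULT_NEED_RESET;		/* ask for a slot reset */
}

static enum model_result model_slot_reset(struct model_dev *dev)
{
	/* Either failure means the device cannot be brought back. */
	if (!dev->enable_ok || !dev->hw_init_ok)
		return MODEL_RESULT_DISCONNECT;
	return MODEL_RESULT_RECOVERED;
}

static void model_resume(struct model_dev *dev)
{
	dev->running = true;
}

static const struct model_err_handler model_handler = {
	.error_detected = model_error_detected,
	.slot_reset = model_slot_reset,
	.resume = model_resume,
};

/* Hypothetical recovery core: drives the callbacks in order. */
static enum model_result model_recover(const struct model_err_handler *h,
				       struct model_dev *dev)
{
	enum model_result res;

	assert(h != NULL && dev != NULL);

	res = h->error_detected(dev);
	if (res == MODEL_RESULT_NEED_RESET)
		res = h->slot_reset(dev);	/* real hardware reset would precede this */
	if (res == MODEL_RESULT_RECOVERED)
		h->resume(dev);
	return res;
}
```

Calling model_recover(&model_handler, &dev) on a device whose re-enable or init step fails leaves it stopped and returns the disconnect code; otherwise the device ends up running again.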


* Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-28 23:58 [PATCH 4/13]: PCI Err: e100 ethernet driver recovery Linas Vepstas
@ 2005-06-29  1:46 ` Benjamin Herrenschmidt
  2005-06-29 15:59   ` Linas Vepstas
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-29  1:46 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: linux-kernel, long, Hidetoshi Seto, Greg KH, ak, Paul Mackerras,
	linuxppc64-dev, linux-pci, johnrose

On Tue, 2005-06-28 at 18:58 -0500, Linas Vepstas wrote:
> /** e100_io_error_detected() is called when PCI error is detected */
> +static int e100_io_error_detected (struct pci_dev *pdev, enum
> pci_channel_state state)
> +{
> +       struct net_device *netdev = pci_get_drvdata(pdev);
> +       struct nic *nic = netdev_priv(netdev);
> +
> +       mod_timer(&nic->watchdog, jiffies + 30*HZ);
> +       e100_down(nic);
> +
> +       /* Request a slot reset. */
> +       return PCIERR_RESULT_NEED_RESET;
> +}

I'm not sure just "pushing" the watchdog timer to 30sec in the future is
the way to go here. What about netif_stop_queue() or similar?

Ben.



* Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-29  1:46 ` Benjamin Herrenschmidt
@ 2005-06-29 15:59   ` Linas Vepstas
  2005-06-29 16:58     ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Linas Vepstas @ 2005-06-29 15:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, long, Hidetoshi Seto, Greg KH, ak, Paul Mackerras,
	linuxppc64-dev, linux-pci, johnrose

On Wed, Jun 29, 2005 at 11:46:58AM +1000, Benjamin Herrenschmidt was heard to remark:
> On Tue, 2005-06-28 at 18:58 -0500, Linas Vepstas wrote:
> > /** e100_io_error_detected() is called when PCI error is detected */
> > +static int e100_io_error_detected (struct pci_dev *pdev, enum
> > pci_channel_state state)
> > +{
> > +       struct net_device *netdev = pci_get_drvdata(pdev);
> > +       struct nic *nic = netdev_priv(netdev);
> > +
> > +       mod_timer(&nic->watchdog, jiffies + 30*HZ);
> > +       e100_down(nic);
> > +
> > +       /* Request a slot reset. */
> > +       return PCIERR_RESULT_NEED_RESET;
> > +}
> 
> I'm not sure just "pushing" the watchdog timer to 30sec in the future is
> the way to go here. What about netif_stop_queue() or similar?

Yep, OK. Pushing the timer would in fact break if the device was marked
perm disabled.

--linas


* Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-29 15:59   ` Linas Vepstas
@ 2005-06-29 16:58     ` Andi Kleen
  2005-06-29 23:40       ` Benjamin Herrenschmidt
  2005-06-30 20:39       ` PCI Power management (was: " Linas Vepstas
  0 siblings, 2 replies; 8+ messages in thread
From: Andi Kleen @ 2005-06-29 16:58 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Benjamin Herrenschmidt, linux-kernel, long, Hidetoshi Seto,
	Greg KH, Paul Mackerras, linuxppc64-dev, linux-pci, johnrose

> Yep, OK. Pushing the timer would in fact break if the device was marked
> perm disabled.

I think for network drivers you should just write a generic error handler
(perhaps in net/core/dev.c) that calls the watchdog handler. 
Then all drivers could be easily converted without much code duplication.

-Andi


* Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-29 16:58     ` Andi Kleen
@ 2005-06-29 23:40       ` Benjamin Herrenschmidt
  2005-06-30 20:39       ` PCI Power management (was: " Linas Vepstas
  1 sibling, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-29 23:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linas Vepstas, linux-kernel, long, Hidetoshi Seto, Greg KH,
	Paul Mackerras, linuxppc64-dev, linux-pci, johnrose

On Wed, 2005-06-29 at 18:58 +0200, Andi Kleen wrote:
> > Yep, OK. Pushing the timer would in fact break if the device was marked
> > perm disabled.
> 
> I think for network drivers you should just write a generic error handler
> (perhaps in net/core/dev.c) that calls the watchdog handler. 
> Then all drivers could be easily converted without much code duplication.

Provided the watchdog timer completely reconfigures the device from
reset since the slot will be reset...

Ben.




* PCI Power management (was: Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-29 16:58     ` Andi Kleen
  2005-06-29 23:40       ` Benjamin Herrenschmidt
@ 2005-06-30 20:39       ` Linas Vepstas
  2005-06-30 21:07         ` Linas Vepstas
  2005-06-30 23:32         ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 8+ messages in thread
From: Linas Vepstas @ 2005-06-30 20:39 UTC (permalink / raw)
  To: Andi Kleen, sfr
  Cc: Benjamin Herrenschmidt, linux-kernel, long, Hidetoshi Seto,
	Greg KH, Paul Mackerras, linuxppc64-dev, linux-pci, johnrose,
	linux-laptop, mochel, pavel

On Wed, Jun 29, 2005 at 06:58:29PM +0200, Andi Kleen was heard to remark:
> > Yep, OK. Pushing the timer would in fact break if the device was marked
> > perm disabled.
> 
> I think for network drivers you should just write a generic error handler
> (perhaps in net/core/dev.c) that calls the watchdog handler. 
> Then all drivers could be easily converted without much code duplication.

Well, there's no watchdog per se in "struct net_device" -- are you
suggesting I add one?

It looks like I can almost create generic handlers for net devices; 
looks like calling netdev->stop() is enough to handle the error
detection. 

However, a generic bringup would need to call pci_enable_device(), 
and net/core/dev.c does not include pci.h so I can't really do it 
there.  Other than that, a generic recovery routine looks like it might
be possible; I'll have to experiment; it's hard to tell by reading code.
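
As a rough user-space illustration of the generic-handler idea (not kernel code, and not a claim about how the net core is actually laid out): the sketch below assumes each device exposes open() and stop() hooks, and that a shared helper can quiesce the device through stop() on error and bring it back through open() afterwards. All `model_*` and `demo_*` names are hypothetical, and the bus-specific re-enable step mentioned above is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a network device. */
struct model_netdev {
	const char *name;
	bool up;	/* administratively up; the user wants it running */
	bool active;	/* hardware is currently started */
	int (*open)(struct model_netdev *dev);
	int (*stop)(struct model_netdev *dev);
};

/* Generic error-detected helper: quiesce the device through its stop() hook. */
static void model_generic_error_detected(struct model_netdev *dev)
{
	assert(dev != NULL);
	if (dev->active && dev->stop)
		dev->stop(dev);
}

/*
 * Generic resume helper: restart the device only if it was meant to be up.
 * Returns 0 on success or whatever error the driver's open() hook reports.
 */
static int model_generic_resume(struct model_netdev *dev)
{
	assert(dev != NULL);
	if (!dev->up || !dev->open)
		return 0;
	return dev->open(dev);
}

/* A trivial example driver that plugs into the helpers. */
static int demo_open(struct model_netdev *dev)
{
	dev->active = true;
	return 0;
}

static int demo_stop(struct model_netdev *dev)
{
	dev->active = false;
	return 0;
}
```

A driver would then need only its own open()/stop() pair; the helpers carry the shared policy, which is the code-duplication saving being discussed.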

This might be the wrong paradigm, though.  The pci error recovery 
routines are *almost identical* to the power-management suspend/resume
routines.  From what I can tell, the only real difference is that 
I want to not actually turn off/on the power. 

Thus, the right thing to do might be to split up the 
struct pci_dev->suspend() and pci_dev->resume() calls into

   suspend()
   poweroff()
   poweron()
   resume()

and then have the generic pci error recovery routines call
suspend/resume only, skipping the poweroff-on calls.  Does that 
sound good?

I'm not sure I can pull this off without having someone from 
the power-management world throw a brick at me.

--linas



* Re: PCI Power management (was: Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-30 20:39       ` PCI Power management (was: " Linas Vepstas
@ 2005-06-30 21:07         ` Linas Vepstas
  2005-06-30 23:32         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 8+ messages in thread
From: Linas Vepstas @ 2005-06-30 21:07 UTC (permalink / raw)
  To: Andi Kleen, sfr
  Cc: Benjamin Herrenschmidt, linux-kernel, long, Hidetoshi Seto,
	Greg KH, Paul Mackerras, linuxppc64-dev, linux-pci, johnrose,
	linux-laptop, mochel, pavel


Hm,

Scratch the idea I outlined below; it doesn't seem to be a good one.

I'm reading the e100, e1000 and the ixgb power management code, and they
go through all sorts of steps I don't need to do for PCI device reset.
There's no clear abstraction that would serve both needs.

On Thu, Jun 30, 2005 at 03:39:31PM -0500, Linas Vepstas was heard to remark:
> On Wed, Jun 29, 2005 at 06:58:29PM +0200, Andi Kleen was heard to remark:
> > > Yep, OK. Pushing the timer would in fact break if the device was marked
> > > perm disabled.
> > 
> > I think for network drivers you should just write a generic error handler
> > (perhaps in net/core/dev.c) that calls the watchdog handler. 
> > Then all drivers could be easily converted without much code duplication.
> 
> Well, there's no watchdog per se in "struct net_device" -- are you
> suggesting I add one?
> 
> It looks like I can almost create generic handlers for net devices; 
> looks like calling netdev->stop() is enough to handle the error
> detection. 
> 
> However, a generic bringup would need to call pci_enable_device(), 
> and net/core/dev.c does not include pci.h so I can't really do it 
> there.  Other than that, a generic recovery routine looks like it might
> be possible; I'll have to experiment; it's hard to tell by reading code.
> 
> This might be the wrong paradigm, though.  The pci error recovery 
> routines are *almost identical* to the power-management suspend/resume
> routines.  From what I can tell, the only real difference is that 
> I want to not actually turn off/on the power. 
> 
> Thus, the right thing to do might be to split up the 
> struct pci_dev->suspend() and pci_dev->resume() calls into
> 
>    suspend()
>    poweroff()
>    poweron()
>    resume()
> 
> and then have the generic pci error recovery routines call
> suspend/resume only, skipping the poweroff-on calls.  Does that 
> sound good?
> 
> I'm not sure I can pull this off without having someone from 
> the power-management world throw a brick at me.
> 
> --linas
> 
> 


* Re: PCI Power management (was: Re: [PATCH 4/13]: PCI Err: e100 ethernet driver recovery
  2005-06-30 20:39       ` PCI Power management (was: " Linas Vepstas
  2005-06-30 21:07         ` Linas Vepstas
@ 2005-06-30 23:32         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2005-06-30 23:32 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Andi Kleen, sfr, linux-kernel, long, Hidetoshi Seto, Greg KH,
	Paul Mackerras, linuxppc64-dev, linux-pci, johnrose, linux-laptop,
	mochel, pavel

On Thu, 2005-06-30 at 15:39 -0500, Linas Vepstas wrote:

> Thus, the right thing to do might be to split up the 
> struct pci_dev->suspend() and pci_dev->resume() calls into
> 
>    suspend()
>    poweroff()
>    poweron()
>    resume()

No. There are very good reasons not to do that split at the pci_dev
level.
 
> and then have the generic pci error recovery routines call
> suspend/resume only, skipping the poweroff-on calls.  Does that 
> sound good?
> 
> I'm not sure I can pull this off without having someone from 
> the power-management world throw a brick at me.

Just keep the error recovery callbacks for now, and we might be able to
provide a generic "helper" doing the watchdog thing (yes, there is a
watchdog in the net core).

Ben.



