Netdev List
 help / color / mirror / Atom feed
* [PATCH 23/27] sfc: Remove unnecessary tests of efx->membase
From: Ben Hutchings @ 2009-10-23 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>

These cleanup functions will never be called if the MMIO region could
not be mapped.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index c925801..8fc6a6e 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -2022,10 +2022,6 @@ static void efx_fini_struct(struct efx_nic *efx)
  */
 static void efx_pci_remove_main(struct efx_nic *efx)
 {
-	/* Skip everything if we never obtained a valid membase */
-	if (!efx->membase)
-		return;
-
 	falcon_fini_interrupt(efx);
 	efx_fini_channels(efx);
 	efx_fini_port(efx);
@@ -2056,9 +2052,6 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
 	/* Allow any queued efx_resets() to complete */
 	rtnl_unlock();
 
-	if (efx->membase == NULL)
-		goto out;
-
 	efx_unregister_netdev(efx);
 
 	efx_mtd_remove(efx);
@@ -2071,7 +2064,6 @@ static void efx_pci_remove(struct pci_dev *pci_dev)
 
 	efx_pci_remove_main(efx);
 
-out:
 	efx_fini_io(efx);
 	EFX_LOG(efx, "shutdown successful\n");
 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 24/27] sfc: Move MTD probe after netdev registration and name allocation
From: Ben Hutchings @ 2009-10-23 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>

The MTD partition is named based on the netdev name, which is set to
'eth%d' before registration.  Also, the MTD partition will currently
be left registered if netdev registration fails.

Fix both these problems by moving the MTD probe after netdev
registration.  Hold the RTNL to serialise this with the netdev
notifier that calls efx_mtd_rename().

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 8fc6a6e..0d0243b 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -2209,13 +2209,15 @@ static int __devinit efx_pci_probe(struct pci_dev *pci_dev,
 	 * MAC stats succeeds. */
 	efx->state = STATE_RUNNING;
 
-	efx_mtd_probe(efx); /* allowed to fail */
-
 	rc = efx_register_netdev(efx);
 	if (rc)
 		goto fail5;
 
 	EFX_LOG(efx, "initialisation successful\n");
+
+	rtnl_lock();
+	efx_mtd_probe(efx); /* allowed to fail */
+	rtnl_unlock();
 	return 0;
 
  fail5:

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 25/27] sfc: Merge efx_fc_resolve() into efx_mdio_get_pause()
From: Ben Hutchings @ 2009-10-23 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>

efx_fc_resolve() is specific to MDIO and is not used by any other
function.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mdio_10g.c   |   12 ++++++++----
 drivers/net/sfc/net_driver.h |   11 -----------
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/net/sfc/mdio_10g.c b/drivers/net/sfc/mdio_10g.c
index b355872..25fb20a 100644
--- a/drivers/net/sfc/mdio_10g.c
+++ b/drivers/net/sfc/mdio_10g.c
@@ -341,10 +341,14 @@ int efx_mdio_set_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
 
 enum efx_fc_type efx_mdio_get_pause(struct efx_nic *efx)
 {
-	int lpa;
+	BUILD_BUG_ON(EFX_FC_AUTO & (EFX_FC_RX | EFX_FC_TX));
 
-	if (!(efx->phy_op->mmds & MDIO_DEVS_AN))
+	if (!(efx->wanted_fc & EFX_FC_AUTO))
 		return efx->wanted_fc;
-	lpa = efx_mdio_read(efx, MDIO_MMD_AN, MDIO_AN_LPA);
-	return efx_fc_resolve(efx->wanted_fc, lpa);
+
+	WARN_ON(!(efx->mdio.mmds & MDIO_DEVS_AN));
+
+	return mii_resolve_flowctrl_fdx(
+		mii_advertise_flowctrl(efx->wanted_fc),
+		efx_mdio_read(efx, MDIO_MMD_AN, MDIO_AN_LPA));
 }
diff --git a/drivers/net/sfc/net_driver.h b/drivers/net/sfc/net_driver.h
index 13bc1b4..bb3d258 100644
--- a/drivers/net/sfc/net_driver.h
+++ b/drivers/net/sfc/net_driver.h
@@ -497,17 +497,6 @@ enum efx_mac_type {
 	EFX_XMAC = 2,
 };
 
-static inline enum efx_fc_type efx_fc_resolve(enum efx_fc_type wanted_fc,
-					      unsigned int lpa)
-{
-	BUILD_BUG_ON(EFX_FC_AUTO & (EFX_FC_RX | EFX_FC_TX));
-
-	if (!(wanted_fc & EFX_FC_AUTO))
-		return wanted_fc;
-
-	return mii_resolve_flowctrl_fdx(mii_advertise_flowctrl(wanted_fc), lpa);
-}
-
 /**
  * struct efx_mac_operations - Efx MAC operations table
  * @reconfigure: Reconfigure MAC. Serialised by the mac_lock

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 26/27] sfc: Remove unused code for non-autoneg speed/duplex switching
From: Ben Hutchings @ 2009-10-23 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>

The only multi-speed PHY driver using this is 10Xpress, and it does
not support non-autoneg operation.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/mdio_10g.c |  108 ++++++++++++++++----------------------------
 1 files changed, 39 insertions(+), 69 deletions(-)

diff --git a/drivers/net/sfc/mdio_10g.c b/drivers/net/sfc/mdio_10g.c
index 25fb20a..231e580 100644
--- a/drivers/net/sfc/mdio_10g.c
+++ b/drivers/net/sfc/mdio_10g.c
@@ -248,7 +248,7 @@ void efx_mdio_set_mmds_lpower(struct efx_nic *efx,
 int efx_mdio_set_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
 {
 	struct ethtool_cmd prev;
-	u32 required;
+	bool xnp;
 	int reg;
 
 	efx->phy_op->get_settings(efx, &prev);
@@ -265,76 +265,46 @@ int efx_mdio_set_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
 		return -EINVAL;
 
 	/* Check that PHY supports these settings */
-	if (ecmd->autoneg) {
-		required = SUPPORTED_Autoneg;
-	} else if (ecmd->duplex) {
-		switch (ecmd->speed) {
-		case SPEED_10:  required = SUPPORTED_10baseT_Full;  break;
-		case SPEED_100: required = SUPPORTED_100baseT_Full; break;
-		default:        return -EINVAL;
-		}
-	} else {
-		switch (ecmd->speed) {
-		case SPEED_10:  required = SUPPORTED_10baseT_Half;  break;
-		case SPEED_100: required = SUPPORTED_100baseT_Half; break;
-		default:        return -EINVAL;
-		}
-	}
-	required |= ecmd->advertising;
-	if (required & ~prev.supported)
+	if (!ecmd->autoneg ||
+	    (ecmd->advertising | SUPPORTED_Autoneg) & ~prev.supported)
 		return -EINVAL;
 
-	if (ecmd->autoneg) {
-		bool xnp = (ecmd->advertising & ADVERTISED_10000baseT_Full
-			    || EFX_WORKAROUND_13204(efx));
-
-		/* Set up the base page */
-		reg = ADVERTISE_CSMA;
-		if (ecmd->advertising & ADVERTISED_10baseT_Half)
-			reg |= ADVERTISE_10HALF;
-		if (ecmd->advertising & ADVERTISED_10baseT_Full)
-			reg |= ADVERTISE_10FULL;
-		if (ecmd->advertising & ADVERTISED_100baseT_Half)
-			reg |= ADVERTISE_100HALF;
-		if (ecmd->advertising & ADVERTISED_100baseT_Full)
-			reg |= ADVERTISE_100FULL;
-		if (xnp)
-			reg |= ADVERTISE_RESV;
-		else if (ecmd->advertising & (ADVERTISED_1000baseT_Half |
-					      ADVERTISED_1000baseT_Full))
-			reg |= ADVERTISE_NPAGE;
-		reg |= mii_advertise_flowctrl(efx->wanted_fc);
-		efx_mdio_write(efx, MDIO_MMD_AN, MDIO_AN_ADVERTISE, reg);
-
-		/* Set up the (extended) next page if necessary */
-		if (efx->phy_op->set_npage_adv)
-			efx->phy_op->set_npage_adv(efx, ecmd->advertising);
-
-		/* Enable and restart AN */
-		reg = efx_mdio_read(efx, MDIO_MMD_AN, MDIO_CTRL1);
-		reg |= MDIO_AN_CTRL1_ENABLE;
-		if (!(EFX_WORKAROUND_15195(efx) &&
-		      LOOPBACK_MASK(efx) & efx->phy_op->loopbacks))
-			reg |= MDIO_AN_CTRL1_RESTART;
-		if (xnp)
-			reg |= MDIO_AN_CTRL1_XNP;
-		else
-			reg &= ~MDIO_AN_CTRL1_XNP;
-		efx_mdio_write(efx, MDIO_MMD_AN, MDIO_CTRL1, reg);
-	} else {
-		/* Disable AN */
-		efx_mdio_set_flag(efx, MDIO_MMD_AN, MDIO_CTRL1,
-				  MDIO_AN_CTRL1_ENABLE, false);
-
-		/* Set the basic control bits */
-		reg = efx_mdio_read(efx, MDIO_MMD_PMAPMD, MDIO_CTRL1);
-		reg &= ~(MDIO_CTRL1_SPEEDSEL | MDIO_CTRL1_FULLDPLX);
-		if (ecmd->speed == SPEED_100)
-			reg |= MDIO_PMA_CTRL1_SPEED100;
-		if (ecmd->duplex)
-			reg |= MDIO_CTRL1_FULLDPLX;
-		efx_mdio_write(efx, MDIO_MMD_PMAPMD, MDIO_CTRL1, reg);
-	}
+	xnp = (ecmd->advertising & ADVERTISED_10000baseT_Full
+	       || EFX_WORKAROUND_13204(efx));
+
+	/* Set up the base page */
+	reg = ADVERTISE_CSMA;
+	if (ecmd->advertising & ADVERTISED_10baseT_Half)
+		reg |= ADVERTISE_10HALF;
+	if (ecmd->advertising & ADVERTISED_10baseT_Full)
+		reg |= ADVERTISE_10FULL;
+	if (ecmd->advertising & ADVERTISED_100baseT_Half)
+		reg |= ADVERTISE_100HALF;
+	if (ecmd->advertising & ADVERTISED_100baseT_Full)
+		reg |= ADVERTISE_100FULL;
+	if (xnp)
+		reg |= ADVERTISE_RESV;
+	else if (ecmd->advertising & (ADVERTISED_1000baseT_Half |
+				      ADVERTISED_1000baseT_Full))
+		reg |= ADVERTISE_NPAGE;
+	reg |= mii_advertise_flowctrl(efx->wanted_fc);
+	efx_mdio_write(efx, MDIO_MMD_AN, MDIO_AN_ADVERTISE, reg);
+
+	/* Set up the (extended) next page if necessary */
+	if (efx->phy_op->set_npage_adv)
+		efx->phy_op->set_npage_adv(efx, ecmd->advertising);
+
+	/* Enable and restart AN */
+	reg = efx_mdio_read(efx, MDIO_MMD_AN, MDIO_CTRL1);
+	reg |= MDIO_AN_CTRL1_ENABLE;
+	if (!(EFX_WORKAROUND_15195(efx) &&
+	      LOOPBACK_MASK(efx) & efx->phy_op->loopbacks))
+		reg |= MDIO_AN_CTRL1_RESTART;
+	if (xnp)
+		reg |= MDIO_AN_CTRL1_XNP;
+	else
+		reg &= ~MDIO_AN_CTRL1_XNP;
+	efx_mdio_write(efx, MDIO_MMD_AN, MDIO_CTRL1, reg);
 
 	return 0;
 }

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* [PATCH 27/27] sfc: Rename 'xfp' file and functions to reflect reality
From: Ben Hutchings @ 2009-10-23 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1256322441.2785.3.camel@achroite>

The 'XFP' driver is really a driver for the QT2022C2 and QT2025C PHYs,
covering both more and less than XFP.  Rename its functions and
constants to reflect reality and to reduce namespace pollution when
sfc is a built-in driver.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/Makefile                    |    2 +-
 drivers/net/sfc/falcon.c                    |    2 +-
 drivers/net/sfc/falcon_boards.c             |   26 ++++----
 drivers/net/sfc/phy.h                       |    6 +-
 drivers/net/sfc/{xfp_phy.c => qt202x_phy.c} |   80 +++++++++++++-------------
 5 files changed, 58 insertions(+), 58 deletions(-)
 rename drivers/net/sfc/{xfp_phy.c => qt202x_phy.c} (73%)

diff --git a/drivers/net/sfc/Makefile b/drivers/net/sfc/Makefile
index 9c98d06..7b52fe1 100644
--- a/drivers/net/sfc/Makefile
+++ b/drivers/net/sfc/Makefile
@@ -1,5 +1,5 @@
 sfc-y			+= efx.o falcon.o tx.o rx.o falcon_gmac.o \
-			   falcon_xmac.o selftest.o ethtool.o xfp_phy.o \
+			   falcon_xmac.o selftest.o ethtool.o qt202x_phy.o \
 			   mdio_10g.o tenxpress.o falcon_boards.o
 sfc-$(CONFIG_SFC_MTD)	+= mtd.o
 
diff --git a/drivers/net/sfc/falcon.c b/drivers/net/sfc/falcon.c
index d9ce21e..8776432 100644
--- a/drivers/net/sfc/falcon.c
+++ b/drivers/net/sfc/falcon.c
@@ -2239,7 +2239,7 @@ int falcon_probe_port(struct efx_nic *efx)
 		break;
 	case PHY_TYPE_QT2022C2:
 	case PHY_TYPE_QT2025C:
-		efx->phy_op = &falcon_xfp_phy_ops;
+		efx->phy_op = &falcon_qt202x_phy_ops;
 		break;
 	default:
 		EFX_ERR(efx, "Unknown PHY type %d\n",
diff --git a/drivers/net/sfc/falcon_boards.c b/drivers/net/sfc/falcon_boards.c
index f65738b..99f7372 100644
--- a/drivers/net/sfc/falcon_boards.c
+++ b/drivers/net/sfc/falcon_boards.c
@@ -612,17 +612,17 @@ static void sfe4002_init_leds(struct efx_nic *efx)
 {
 	/* Set the TX and RX LEDs to reflect status and activity, and the
 	 * fault LED off */
-	xfp_set_led(efx, SFE4002_TX_LED,
-		    QUAKE_LED_TXLINK | QUAKE_LED_LINK_ACTSTAT);
-	xfp_set_led(efx, SFE4002_RX_LED,
-		    QUAKE_LED_RXLINK | QUAKE_LED_LINK_ACTSTAT);
-	xfp_set_led(efx, SFE4002_FAULT_LED, QUAKE_LED_OFF);
+	falcon_qt202x_set_led(efx, SFE4002_TX_LED,
+			      QUAKE_LED_TXLINK | QUAKE_LED_LINK_ACTSTAT);
+	falcon_qt202x_set_led(efx, SFE4002_RX_LED,
+			      QUAKE_LED_RXLINK | QUAKE_LED_LINK_ACTSTAT);
+	falcon_qt202x_set_led(efx, SFE4002_FAULT_LED, QUAKE_LED_OFF);
 }
 
 static void sfe4002_set_id_led(struct efx_nic *efx, bool state)
 {
-	xfp_set_led(efx, SFE4002_FAULT_LED, state ? QUAKE_LED_ON :
-			QUAKE_LED_OFF);
+	falcon_qt202x_set_led(efx, SFE4002_FAULT_LED, state ? QUAKE_LED_ON :
+			      QUAKE_LED_OFF);
 }
 
 static int sfe4002_check_hw(struct efx_nic *efx)
@@ -677,16 +677,16 @@ static struct i2c_board_info sfn4112f_hwmon_info = {
 
 static void sfn4112f_init_leds(struct efx_nic *efx)
 {
-	xfp_set_led(efx, SFN4112F_ACT_LED,
-		    QUAKE_LED_RXLINK | QUAKE_LED_LINK_ACT);
-	xfp_set_led(efx, SFN4112F_LINK_LED,
-		    QUAKE_LED_RXLINK | QUAKE_LED_LINK_STAT);
+	falcon_qt202x_set_led(efx, SFN4112F_ACT_LED,
+			      QUAKE_LED_RXLINK | QUAKE_LED_LINK_ACT);
+	falcon_qt202x_set_led(efx, SFN4112F_LINK_LED,
+			      QUAKE_LED_RXLINK | QUAKE_LED_LINK_STAT);
 }
 
 static void sfn4112f_set_id_led(struct efx_nic *efx, bool state)
 {
-	xfp_set_led(efx, SFN4112F_LINK_LED,
-		    state ? QUAKE_LED_ON : QUAKE_LED_OFF);
+	falcon_qt202x_set_led(efx, SFN4112F_LINK_LED,
+			      state ? QUAKE_LED_ON : QUAKE_LED_OFF);
 }
 
 static int sfn4112f_check_hw(struct efx_nic *efx)
diff --git a/drivers/net/sfc/phy.h b/drivers/net/sfc/phy.h
index c1cff9c..b5150f3 100644
--- a/drivers/net/sfc/phy.h
+++ b/drivers/net/sfc/phy.h
@@ -23,9 +23,9 @@ extern void tenxpress_phy_blink(struct efx_nic *efx, bool blink);
 extern int sft9001_wait_boot(struct efx_nic *efx);
 
 /****************************************************************************
- * AMCC/Quake QT20xx PHYs
+ * AMCC/Quake QT202x PHYs
  */
-extern struct efx_phy_operations falcon_xfp_phy_ops;
+extern struct efx_phy_operations falcon_qt202x_phy_ops;
 
 /* These PHYs provide various H/W control states for LEDs */
 #define QUAKE_LED_LINK_INVAL	(0)
@@ -39,6 +39,6 @@ extern struct efx_phy_operations falcon_xfp_phy_ops;
 #define QUAKE_LED_TXLINK	(0)
 #define QUAKE_LED_RXLINK	(8)
 
-extern void xfp_set_led(struct efx_nic *p, int led, int state);
+extern void falcon_qt202x_set_led(struct efx_nic *p, int led, int state);
 
 #endif
diff --git a/drivers/net/sfc/xfp_phy.c b/drivers/net/sfc/qt202x_phy.c
similarity index 73%
rename from drivers/net/sfc/xfp_phy.c
rename to drivers/net/sfc/qt202x_phy.c
index e6b3d5e..560eb18 100644
--- a/drivers/net/sfc/xfp_phy.c
+++ b/drivers/net/sfc/qt202x_phy.c
@@ -7,8 +7,7 @@
  * by the Free Software Foundation, incorporated herein by reference.
  */
 /*
- * Driver for SFP+ and XFP optical PHYs plus some support specific to the
- * AMCC QT20xx adapters; see www.amcc.com for details
+ * Driver for AMCC QT202x SFP+ and XFP adapters; see www.amcc.com for details
  */
 
 #include <linux/timer.h>
@@ -18,13 +17,13 @@
 #include "phy.h"
 #include "falcon.h"
 
-#define XFP_REQUIRED_DEVS (MDIO_DEVS_PCS |	\
-			   MDIO_DEVS_PMAPMD |	\
-			   MDIO_DEVS_PHYXS)
+#define QT202X_REQUIRED_DEVS (MDIO_DEVS_PCS |		\
+			      MDIO_DEVS_PMAPMD |	\
+			      MDIO_DEVS_PHYXS)
 
-#define XFP_LOOPBACKS ((1 << LOOPBACK_PCS) |		\
-		       (1 << LOOPBACK_PMAPMD) |		\
-		       (1 << LOOPBACK_NETWORK))
+#define QT202X_LOOPBACKS ((1 << LOOPBACK_PCS) |		\
+			  (1 << LOOPBACK_PMAPMD) |	\
+			  (1 << LOOPBACK_NETWORK))
 
 /****************************************************************************/
 /* Quake-specific MDIO registers */
@@ -45,18 +44,18 @@
 #define PCS_VEND1_REG	   	0xc000
 #define PCS_VEND1_LBTXD_LBN	5
 
-void xfp_set_led(struct efx_nic *p, int led, int mode)
+void falcon_qt202x_set_led(struct efx_nic *p, int led, int mode)
 {
 	int addr = MDIO_QUAKE_LED0_REG + led;
 	efx_mdio_write(p, MDIO_MMD_PMAPMD, addr, mode);
 }
 
-struct xfp_phy_data {
+struct qt202x_phy_data {
 	enum efx_phy_mode phy_mode;
 };
 
-#define XFP_MAX_RESET_TIME 500
-#define XFP_RESET_WAIT 10
+#define QT2022C2_MAX_RESET_TIME 500
+#define QT2022C2_RESET_WAIT 10
 
 static int qt2025c_wait_reset(struct efx_nic *efx)
 {
@@ -97,7 +96,7 @@ static int qt2025c_wait_reset(struct efx_nic *efx)
 	return 0;
 }
 
-static int xfp_reset_phy(struct efx_nic *efx)
+static int qt202x_reset_phy(struct efx_nic *efx)
 {
 	int rc;
 
@@ -111,8 +110,9 @@ static int xfp_reset_phy(struct efx_nic *efx)
 		/* Reset the PHYXS MMD. This is documented as doing
 		 * a complete soft reset. */
 		rc = efx_mdio_reset_mmd(efx, MDIO_MMD_PHYXS,
-					XFP_MAX_RESET_TIME / XFP_RESET_WAIT,
-					XFP_RESET_WAIT);
+					QT2022C2_MAX_RESET_TIME /
+					QT2022C2_RESET_WAIT,
+					QT2022C2_RESET_WAIT);
 		if (rc < 0)
 			goto fail;
 	}
@@ -122,7 +122,7 @@ static int xfp_reset_phy(struct efx_nic *efx)
 
 	/* Check that all the MMDs we expect are present and responding. We
 	 * expect faults on some if the link is down, but not on the PHY XS */
-	rc = efx_mdio_check_mmds(efx, XFP_REQUIRED_DEVS, MDIO_DEVS_PHYXS);
+	rc = efx_mdio_check_mmds(efx, QT202X_REQUIRED_DEVS, MDIO_DEVS_PHYXS);
 	if (rc < 0)
 		goto fail;
 
@@ -135,13 +135,13 @@ static int xfp_reset_phy(struct efx_nic *efx)
 	return rc;
 }
 
-static int xfp_phy_init(struct efx_nic *efx)
+static int qt202x_phy_init(struct efx_nic *efx)
 {
-	struct xfp_phy_data *phy_data;
+	struct qt202x_phy_data *phy_data;
 	u32 devid = efx_mdio_read_id(efx, MDIO_MMD_PHYXS);
 	int rc;
 
-	phy_data = kzalloc(sizeof(struct xfp_phy_data), GFP_KERNEL);
+	phy_data = kzalloc(sizeof(struct qt202x_phy_data), GFP_KERNEL);
 	if (!phy_data)
 		return -ENOMEM;
 	efx->phy_data = phy_data;
@@ -152,7 +152,7 @@ static int xfp_phy_init(struct efx_nic *efx)
 
 	phy_data->phy_mode = efx->phy_mode;
 
-	rc = xfp_reset_phy(efx);
+	rc = qt202x_reset_phy(efx);
 
 	EFX_INFO(efx, "PHY init %s.\n",
 		 rc ? "failed" : "successful");
@@ -167,28 +167,28 @@ static int xfp_phy_init(struct efx_nic *efx)
 	return rc;
 }
 
-static void xfp_phy_clear_interrupt(struct efx_nic *efx)
+static void qt202x_phy_clear_interrupt(struct efx_nic *efx)
 {
 	/* Read to clear link status alarm */
 	efx_mdio_read(efx, MDIO_MMD_PMAPMD, MDIO_PMA_LASI_STAT);
 }
 
-static int xfp_link_ok(struct efx_nic *efx)
+static int qt202x_link_ok(struct efx_nic *efx)
 {
-	return efx_mdio_links_ok(efx, XFP_REQUIRED_DEVS);
+	return efx_mdio_links_ok(efx, QT202X_REQUIRED_DEVS);
 }
 
-static void xfp_phy_poll(struct efx_nic *efx)
+static void qt202x_phy_poll(struct efx_nic *efx)
 {
-	int link_up = xfp_link_ok(efx);
+	int link_up = qt202x_link_ok(efx);
 	/* Simulate a PHY event if link state has changed */
 	if (link_up != efx->link_up)
 		falcon_sim_phy_event(efx);
 }
 
-static void xfp_phy_reconfigure(struct efx_nic *efx)
+static void qt202x_phy_reconfigure(struct efx_nic *efx)
 {
-	struct xfp_phy_data *phy_data = efx->phy_data;
+	struct qt202x_phy_data *phy_data = efx->phy_data;
 
 	if (efx->phy_type == PHY_TYPE_QT2025C) {
 		/* There are several different register bits which can
@@ -207,7 +207,7 @@ static void xfp_phy_reconfigure(struct efx_nic *efx)
 		/* Reset the PHY when moving from tx off to tx on */
 		if (!(efx->phy_mode & PHY_MODE_TX_DISABLED) &&
 		    (phy_data->phy_mode & PHY_MODE_TX_DISABLED))
-			xfp_reset_phy(efx);
+			qt202x_reset_phy(efx);
 
 		efx_mdio_transmit_disable(efx);
 	}
@@ -215,18 +215,18 @@ static void xfp_phy_reconfigure(struct efx_nic *efx)
 	efx_mdio_phy_reconfigure(efx);
 
 	phy_data->phy_mode = efx->phy_mode;
-	efx->link_up = xfp_link_ok(efx);
+	efx->link_up = qt202x_link_ok(efx);
 	efx->link_speed = 10000;
 	efx->link_fd = true;
 	efx->link_fc = efx->wanted_fc;
 }
 
-static void xfp_phy_get_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
+static void qt202x_phy_get_settings(struct efx_nic *efx, struct ethtool_cmd *ecmd)
 {
 	mdio45_ethtool_gset(&efx->mdio, ecmd);
 }
 
-static void xfp_phy_fini(struct efx_nic *efx)
+static void qt202x_phy_fini(struct efx_nic *efx)
 {
 	/* Clobber the LED if it was blinking */
 	efx->board_info.blink(efx, false);
@@ -236,15 +236,15 @@ static void xfp_phy_fini(struct efx_nic *efx)
 	efx->phy_data = NULL;
 }
 
-struct efx_phy_operations falcon_xfp_phy_ops = {
+struct efx_phy_operations falcon_qt202x_phy_ops = {
 	.macs		 = EFX_XMAC,
-	.init            = xfp_phy_init,
-	.reconfigure     = xfp_phy_reconfigure,
-	.poll            = xfp_phy_poll,
-	.fini            = xfp_phy_fini,
-	.clear_interrupt = xfp_phy_clear_interrupt,
-	.get_settings    = xfp_phy_get_settings,
+	.init		 = qt202x_phy_init,
+	.reconfigure	 = qt202x_phy_reconfigure,
+	.poll	     	 = qt202x_phy_poll,
+	.fini	  	 = qt202x_phy_fini,
+	.clear_interrupt = qt202x_phy_clear_interrupt,
+	.get_settings	 = qt202x_phy_get_settings,
 	.set_settings	 = efx_mdio_set_settings,
-	.mmds            = XFP_REQUIRED_DEVS,
-	.loopbacks       = XFP_LOOPBACKS,
+	.mmds            = QT202X_REQUIRED_DEVS,
+	.loopbacks       = QT202X_LOOPBACKS,
 };

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Re: [PATCH 2/4] [RFC] Add c/r support for connected INET sockets
From: Oren Laadan @ 2009-10-23 19:37 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers, netdev, John Dykstra
In-Reply-To: <1256072803-3518-3-git-send-email-danms@us.ibm.com>



Dan Smith wrote:
> This patch adds basic support for C/R of open INET sockets.  I think that
> all the important bits of the TCP and ICSK socket structures is saved,
> but I think there is still some additional IPv6 stuff that needs to be
> handled.
> 
> With this patch applied, the following script can be used to demonstrate
> the functionality:
> 
>   https://lists.linux-foundation.org/pipermail/containers/2009-October/021239.html
> 
> It shows that this enables migration of a sendmail process with open
> connections from one machine to another without dropping.
> 
> We still need comments from the netdev people about what sort of sanity
> checking we need to do on the values in the ckpt_hdr_socket_inet
> structure on restart.
> 
> Note that this still doesn't address lingering sockets yet.
> 

[...]

> diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
> index fa57cdc..91c141b 100644
> --- a/include/linux/checkpoint_types.h
> +++ b/include/linux/checkpoint_types.h
> @@ -65,6 +65,8 @@ struct ckpt_ctx {
>  	struct list_head pgarr_list;	/* page array to dump VMA contents */
>  	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
>  
> +	struct list_head listen_sockets;/* listening parent sockets */
> +

Nit: maybe move under the comment "multi-process restart" ?

[...]

Otherwise (and pending comments from netdev people on sanity checks):

Acked-by: Oren Laadan <orenl@cs.columbia.edu>


^ permalink raw reply

* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Karol Lewandowski @ 2009-10-23 21:12 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Mel Gorman, Frans Pop, Jiri Kosina, Sven Geggus, Tobias Oetiker,
	Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
	David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
	John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
	Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
	netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <20091023165810.GA4588@bizet.domek.prywatny>

On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
> On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote:
> > Test 3: If you are getting allocation failures, try with the following patch
> > 
> >   3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit

> No, problem doesn't go away with these patches (1+2+3).  However, from
> my testing this particular patch makes it way, way harder to trigger
> allocation failures (but these are still present).
> 
> This bothers me - should I test following patches with or without
> above patch?  This patch makes bug harder to find, IMVHO it doesn't
> fix the real problem.
..

> Test 4: If you are still getting failures, apply the following
>  4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit

Ok, I've tested patches 1+2+4 and bug, while very hard to trigger, is
still present. I'll test complete 1-4 patchset as time permits.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: VLAN and ARP failure on tg3 drivers
From: Matt Carlson @ 2009-10-23 21:35 UTC (permalink / raw)
  To: Gertjan Hofman; +Cc: netdev@vger.kernel.org
In-Reply-To: <93821.71108.qm@web32607.mail.mud.yahoo.com>

On Thu, Oct 22, 2009 at 09:52:42PM -0700, Gertjan Hofman wrote:
> Dear Kernel developers,
> 
> A couple of weeks ago we tried to migrate from a 2.6.24? kernel to a 2.6.29 kernel and noticed our VLAN application no longer works.? The problem is easy to replicate:
> 
> 1. connect 2 PC's with a cross-over cable
> 2. set up a fixed IP address to both PC's? (say 192.168.0.[1,2])
> 3. create a vlan:? vconfig? add eth0 0.
> 4. set IP addresses on the VLAN devices? (say 192.168.1.[1,2])
> 5. try ping one machine from the other.
> 
> I tried to dig into the problem by using un-patched kernel.org kernels with Ubuntu .config files.? Kernels up to 2.6.26 work fine, kernels after and including 2.6.27 fail. The problem is that ARP messages are being dropped. If the ARP table is updated by hand on each machine, the communication across the VLAN works fine.
> 
> At first I thought the kernel VLAN code was the problem (we had an earlier issue with a regression in 2.6.24) but it looks like the problem is actually with the tg3 driver.? Our system uses Broadcom ethernet chips. I tried the same experiments with combination of boards that have Broadcom and none-Broadcom and the only time I see it fail is with the tg3? driver loaded.
> 
> Snooping with WireShark shows that a ARP request from the non-Broadcom machine is seen and even answered, but never appears back on the network. If the Broadcom machine orginates the ARP message, it never arrives at the destination. I tried lowering the size of the MTU to 1492 as well as giving each VLAN device a different MAC. No deal.
> 
> I tried to look at tg3 patch changes from 2.6.26 to 2.6.27 but I am not familiar enough with the Git system to extract the appropiate changes.? I am a bit surprised that I am not seeing any references to this on the web, the combination of >2.6.27 kernels, Broadcom and VLAN cant be that uncommon.
> 
> I would be happy to provide more information and to try tests if any one can suggest them.
> 
> Sincerely,
> 
> Gertjan

I don't see any reason why your setup should fail, but it doesn't hurt
to gather more info about the problem.

What device are you experiencing this problem with?  Is management
firmware enabled?  (`ethtool -i ethx`)


^ permalink raw reply

* Re: [RFC] net,socket: introduce build_sockaddr_check helper to catch overflow at build time
From: Cyrill Gorcunov @ 2009-10-23 21:43 UTC (permalink / raw)
  To: David Miller, netdev
In-Reply-To: <20091022135557.GA5162@lenovo>

[Cyrill Gorcunov - Thu, Oct 22, 2009 at 05:55:57PM +0400]
...
| 
| 2) All handlers set *len to some size explicitly so we may
|    introduce set_sockaddr_size() helper like
| 
| #define set_sockaddr_size(ptr, size)		\
| 	do {					\
| 		build_sockaddr_check(size);	\
| 		*ptr = size;			\
| 	} while (0)
| 
| Or you meant something completely different?
| 
| 	-- Cyrill

Or say it could be something like that

#define __sockaddr(type, src)	\
	({ build_sockaddr_check(sizeof(type)); (type *) src; })

and say in function af_inet.c:inet_getname instead of

	struct sockaddr_in *sin	= (struct sockaddr_in *)uaddr;

we may write like

	struct sockaddr_in *sin	= __sockaddr(struct sockaddr_in, uaddr);

which would check the size.

	-- Cyrill

^ permalink raw reply

* RE: [PATCH] e100: fix 10sec DHCP delay (regression for 82559ER)
From: Allan, Bruce W @ 2009-10-23 21:45 UTC (permalink / raw)
  To: Bernhard Kaindl, David S. Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4ADF6F15.5040301@gmx.net>

After further testing, this introduces a problem on 82552 parts where it no longer autonegs to 100Full.  I've created a different patch that reverts my code put in to 2.6.30 which isolates all PHY ids (which causes the problem discovered by Bernhard) yet still works around the problem on 82552.  It is currently queued up for testing here at Intel and will be pushed to netdev in the next few days.

>-----Original Message-----
>From: Bernhard Kaindl [mailto:bernhard.kaindl@gmx.net]
>Sent: Wednesday, October 21, 2009 1:29 PM
>To: David S. Miller
>Cc: Allan, Bruce W; netdev@vger.kernel.org
>Subject: [PATCH] e100: fix 10sec DHCP delay (regression for 82559ER)
>
>From: Bernhard Kaindl <bernhard.kaindl@gmx.net>
>
>With 2.6.30, we noticed a regression on our Intel 82559ER-equipped
>boards regarding the PHY initialization. The Intel 82559ER uses an
>internal PHY bus, so our PHY environment should be not be very special.
>
>The symptom which we observed on these boards was that the boot-time
>DHCP negotiation was stalled for ~5secs and went very slowly until
>it was finally completed after ~10secs after initial interface start.
>
>I reported our finding along with a proposal for a fix to netdev and
>Bruce Allan@Intel, whose patch to support the new Intel 82552 adapter
>included a workaround for the 82552 with this side effect for our env.
>
>Bruce worked on a way to have both environments working and tested
>it on as many 10/100 parts he could get his hands on and started
>a process which allows for more testing and patch submission from
>
>So pending this process which allows for more testing at Intel, I am
>submitting our common patch. For PHYs other than the 82552, it resembles
>mostly how 2.6.23-2.6.29 have been selecting and isolating the PHYs.
>
>This patch applies to 2.6.30.9, 2.6.31.4 and 2.6.32-rc5-git1 and is
>essentially what Bruce has been testing:
>
>Signed-off-by: Bernhard Kaindl <bernhard.kaindl@gmx.net>
>Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
>---
>
>  drivers/net/e100.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/e100.c b/drivers/net/e100.c
>index 5d2f48f..aed18a4 100644
>--- a/drivers/net/e100.c
>+++ b/drivers/net/e100.c
>@@ -1427,13 +1427,17 @@ static int e100_phy_init(struct nic *nic)
>  	} else
>  		DPRINTK(HW, DEBUG, "phy_addr = %d\n", nic->mii.phy_id);
>
>-	/* Isolate all the PHY ids */
>-	for (addr = 0; addr < 32; addr++)
>-		mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
>  	/* Select the discovered PHY */
>  	bmcr &= ~BMCR_ISOLATE;
>  	mdio_write(netdev, nic->mii.phy_id, MII_BMCR, bmcr);
>
>+	if (nic->phy != phy_82552_v) {
>+		/* Isolate the unused PHY ids */
>+		for (addr = 0; addr < 32; addr++)
>+			if (addr != nic->mii.phy_id)
>+				mdio_write(netdev, addr, MII_BMCR, BMCR_ISOLATE);
>+	}
>+
>  	/* Get phy ID */
>  	id_lo = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID1);
>  	id_hi = mdio_read(netdev, nic->mii.phy_id, MII_PHYSID2);

^ permalink raw reply

* Re: [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-23 22:40 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, netdev, atis, eric.dumazet, zenczykowski
In-Reply-To: <4AE1CD19.8090901@candelatech.com>

On Fri, 2009-10-23 at 08:34 -0700, Ben Greear wrote:

> I implemented something similar while allowing for virtual router like
> applications.  I had to add a mark very early in the pkt rx logic in dev.c,
> and had to add a 'skb_default_mark' member to the netdevice because
> the route lookup is done before the normal iptables logic ran.  

You dont need to add a new construct to netdev.
Here's how youd tag all packets coming on eth0 with mark 7:
----
tc filter add dev eth0 parent ffff: protocol ip \
pref 10 u32 match u32 0 0 flowid 1:17 \
action skbedit mark 7
---

Of course you could also be very flow specific, example:

----
tc filter add dev eth0 parent ffff: protocol ip \
pref 9 u32 match ip src 64.233.169.99/32 flowid 1:5 \
action skbedit mark 5
---

Or even use iptable marker
---
tc filter add dev eth2 parent 1:0 protocol ip \
prio 5 u32 match ip dst 10.0.0.90/32 flowid 1:12 \
action ipt -j mark --set-mark 2
----

You could even slice bread with this stuff. Example
you could use certain policy routing tables only
if a flow misbehaved (works well with routing not
local destined packets), example

---
tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 \
match ip src 10.0.0.90/32 flowid 1:10 \
action ipt -j mark --set-mark 1 \
action police rate 100kbit burst 90k pipe \
action ipt -j mark --set-mark 2 \
action police rate 50kbit burst 50k pipe \
action ipt -j mark --set-mark 3 \
action police rate 50kbit burst 50k drop
----

As a warning ipt could be shaky in some distros because
of the morphing iptables interface.

> Without
> this, if a flow already existed for pkts coming in eth1, if the packet came
> back in eth2, it would use eth1's flow.

True. Of course you can avoid that with the patch i posted 
meeting the conditions i described with RPF.

> I'll dig out the patch if anyone is interested...

If you can do overlapping IP addresses, it would be interesting
to see. Otherwise all is achievable with smart policy routing.

cheers,
jamal


^ permalink raw reply

* Re: Irq architecture for multi-core network driver.
From: Eric W. Biederman @ 2009-10-23 23:22 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: David Daney, Chris Friesen, netdev, Linux Kernel Mailing List,
	linux-mips
In-Reply-To: <4807377b0910231028g60b479cfycdbf3f4e25384c58@mail.gmail.com>

Jesse Brandeburg <jesse.brandeburg@gmail.com> writes:

> On Fri, Oct 23, 2009 at 12:59 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> David Daney <ddaney@caviumnetworks.com> writes:
>>> Certainly this is one mode of operation that should be supported, but I would
>>> also like to be able to go for raw throughput and have as many cores as possible
>>> reading from a single queue (like I currently have).
>>
>> I believe will detect false packet drops and ask for unnecessary
>> retransmits if you have multiple cores processing a single queue,
>> because you are processing the packets out of order.
>
> So, the way the default linux kernel configures today's many core
> server systems is to leave the affinity mask by default at 0xffffffff,
> and most current Intel hardware based on 5000 (older core cpus), or
> 5500 chipset (used with Core i7 processors) that I have seen will
> allow for round robin interrupts by default.  This kind of sucks for
> the above unless you run irqbalance or set smp_affinity by hand.

On x86 if you have > 8 cores the hardware does not support any form of
irq balancing.  You do have an interesting point.

How often and how much does irq balancing hurt us.

> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another)  

irqbalance is actually more likely to move irqs than the hardware.
I have heard promises it won't move network irqs but I have seen
the opposite behavior.

> The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?

Good question.  I would aim for the kernel to work better by default.
Ideally we should have a coupling between which sockets applications have
open, which cpus those applications run on, and which core the irqs arrive
at.

> I don't believe we should re-enable the kernel irq balancer, but
> should we consider only setting a single bit in each new interrupt's
> irq affinity?  Doing it with a random spread for the initial affinity
> would be better than setting them all to one.

Not a bad idea.  The practical problem is that we usually have the irqs
setup before we have the additional cpus.  But that isn't entirely true,
I'm thinking of mostly pre-acpi rules.  With ACPI we do some kind of
on-demand setup of the gsi in the device initialization.

How irq threads interact also ways in here.

Eric

^ permalink raw reply

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Paul E. McKenney @ 2009-10-23 21:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4ADF2B57.4030708@gmail.com>

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Octavian Purdila a écrit :
> > On Sunday 18 October 2009 21:21:44 you wrote:
> >>> The msleep(250) should be tuned first. Then if this is really necessary
> >>> to dismantle 100.000 netdevices per second, we might have to think a bit
> >>> more. 
> >>> Just try msleep(1 or 2), it should work quite well.
> >> My goal is tearing down 100,000 interfaces in a few seconds, which really
> >>  is  necessary.  Right now we're running about 40,000 interfaces on a not
> >>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
> >>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
> >>  system can handle that.
> >>
> > 
> > I would also like to see this patch in, we are running into scalability issues 
> > with creating/deleting lots of interfaces as well.
> 
> Ben patch only address interface deletion, and one part of the problem,
> maybe the more visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurently.
> 
> Before applying/examining his patch I suggest identifying all dev_put() spots than
> can be deleted and replaced by something more scalable. I began this job
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> Following patch doesnt slow down dev_put() users and real scalability
> problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 
> netdev_wait_allrefs() waits that all references to a device vanishes.
> 
> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> Some users report that no more than 4 devices can be dismantled per second,
> this is a pretty serious problem for extreme setups.
> 
> Most likely, references only wait for a rcu grace period that should come
> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Is this a place where synchronize_rcu_expedited() is appropriate?
(It went in to 2.6.32-rc1.)

							Thanx, Paul

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/core/dev.c |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 28b0b9e..fca2e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
> 
> -		msleep(250);
> +		schedule_timeout_uninterruptible(1);
> 
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] [PATCH] cxgb3: Set the rxq
From: Divy Le Ray @ 2009-10-24  0:29 UTC (permalink / raw)
  To: Krishna Kumar, davem; +Cc: netdev
In-Reply-To: <20091023111321.4726.29354.sendpatchset@localhost.localdomain>

On Fri, 23 Oct 2009 04:13:21 -0700, Krishna Kumar <krkumar2@in.ibm.com>  
wrote:

> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Set the rxq# for LRO when processing the last fragment of a
> frame. This helps in fast txq selection for routing workloads.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Acked-by: Divy Le Ray <divy@chelsio.com>

> ---
>
>  drivers/net/cxgb3/sge.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff -ruNp org/drivers/net/cxgb3/sge.c new/drivers/net/cxgb3/sge.c
> --- org/drivers/net/cxgb3/sge.c	2009-10-19 11:58:16.000000000 +0530
> +++ new/drivers/net/cxgb3/sge.c	2009-10-20 09:40:40.000000000 +0530
> @@ -2135,6 +2135,7 @@ static void lro_add_page(struct adapter
>  	if (!complete)
>  		return;
> +	skb_record_rx_queue(skb, qs - &adap->sge.qs[0]);
>  	skb->ip_summed = CHECKSUM_UNNECESSARY;
>  	cpl = qs->lro_va;
>


^ permalink raw reply

* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Christoph Lameter @ 2009-10-24  1:52 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
	Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Stephan von Krawczynski, Kernel Testers List, netdev,
	linux-kernel, "li
In-Reply-To: <84144f020910220747nba30d8bkc83c2569da79bd7c@mail.gmail.com>

On Thu, 22 Oct 2009, Pekka Enberg wrote:

> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.

Bug fixes go into main not linux-next. Lets make sure these fixes really
work and then merge.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* RE: [patch next 3/4] netxen: fix bonding support
From: Dhananjay Phadke @ 2009-10-24  1:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev@vger.kernel.org
In-Reply-To: <m1ljj2ps80.fsf@fess.ebiederm.org>

Yes, this was a bug until recent times.

It should be fixed in 2.6.32 development cycle by this commit -
db4cfd8a6149e778befb2ff6e6f91cdc6394cbe6 ("netxen: handle firmware load errors").

This added a check for adapter->is_up before touching txq lock.

-Dhananjay
________________________________________
From: Eric W. Biederman [ebiederm@xmission.com]
Sent: Thursday, October 22, 2009 10:57 PM
To: Dhananjay Phadke
Cc: netdev@vger.kernel.org
Subject: Re: [patch next 3/4] netxen: fix bonding support

Dhananjay Phadke <dhananjay@netxen.com> writes:

> o Pause traffic during mac addr change.
> o Enable setting mac address for NX3031.

In 2.6.31 when I try and use bonding with the netxen or
even simply set the mac address after the driver has
first loaded but before the nic is brought up I get:

As I read that oops. The problem is that netxen_send_cmd_descs
is being called before tx_ring->txq has been initialized.
tx_ring->txq happens in open, when the device is brought up.

Any chance you can look at the problem?

Eric


BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
PGD 1fe76067 PUD b4d05067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
CPU 2
Modules linked in: ipt_REDIRECT vfat fat loop appletalk ipx p8023 rose ax25 tulip xt_hl ipt_LOG xt_limit ipt_REJECT xt_state xt_tcpudp dummy iptable_filter veth macvlan nfsd lockd nfs_acl auth_rpcgss exportfs sco bridge stp bnep l2cap bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table ext4 jbd2 crc16 dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput bonding ipv6 kvm_intel kvm fuse xt_multiport iptable_nat ip_tables nf_nat x_tables nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 tun 8021q i2c_i801 igb i2c_core i5k_amb hwmon i5400_edac edac_core iTCO_wdt iTCO_vendor_support pcspkr shpchp myri10ge floppy netxen_nic sr_mod cdrom serio_raw sg ata_generic pata_acpi ata_piix libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 22313, comm: ip Not tainted 2.6.31.2-193574.2006Arora.fc11.x86_64 #1 X7DWU
RIP: 0010:[<ffffffffa0128942>]  [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
RSP: 0018:ffff88005753d638  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff88005753d688 RDI: ffff88012dc8f580
RBP: ffff88005753d678 R08: ffff88012dc8f580 R09: ffff88005753d688
R10: 00000000a388069b R11: 0000000000000246 R12: ffff88012dc8f580
R13: ffff88001fcb6c00 R14: ffff88005753d738 R15: 0000000000000020
FS:  00007f4360f656f0(0000) GS:ffff88002805a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 00000000b4de5000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ip (pid: 22313, threadinfo ffff88005753c000, task ffff88002fbdb700)
Stack:
 0000000000000000 00000000a388069b ffffea000136a578 00000000a388069b
<0> ffff88005753d778 ffff88012dc8f580 ffff88012e2cf910 ffff88001fcb6c00
<0> ffff88005753d6d8 ffffffffa0128ae5 000000000a000000 0000000000010001
Call Trace:
 [<ffffffffa0128ae5>] nx_p3_sre_macaddr_change+0x60/0x76 [netxen_nic]
 [<ffffffffa0128ca8>] nx_p3_nic_add_mac+0x129/0x14c [netxen_nic]
 [<ffffffffa0128d57>] netxen_p3_nic_set_multi+0x8c/0x169 [netxen_nic]
 [<ffffffffa0128e54>] netxen_p3_nic_set_mac_addr+0x20/0x38 [netxen_nic]
 [<ffffffffa0129ed2>] netxen_nic_set_mac+0x96/0xd6 [netxen_nic]
 [<ffffffff813128b1>] do_setlink+0x17c/0x360
 [<ffffffff81312e97>] rtnl_newlink+0x2e1/0x4e6
 [<ffffffff81312c53>] ? rtnl_newlink+0x9d/0x4e6
 [<ffffffff810ec697>] ? page_add_new_anon_rmap+0x4d/0x73
 [<ffffffff812f8920>] ? sock_rmalloc+0x39/0xa8
 [<ffffffff810fdc8c>] ? virt_to_head_page+0x1c/0x51
 [<ffffffff813122fe>] rtnetlink_rcv_msg+0x1d0/0x201
 [<ffffffff81327856>] ? netlink_sendmsg+0x18f/0x2ac
 [<ffffffff8131212e>] ? rtnetlink_rcv_msg+0x0/0x201
 [<ffffffff81327b7d>] netlink_rcv_skb+0x4d/0xb4
 [<ffffffff81312113>] rtnetlink_rcv+0x30/0x4b
 [<ffffffff8132764a>] netlink_unicast+0x12f/0x1ac
 [<ffffffff81327950>] netlink_sendmsg+0x289/0x2ac
 [<ffffffff812f4146>] __sock_sendmsg+0x6b/0x8a
 [<ffffffff812f4abf>] sock_sendmsg+0xd6/0x103
 [<ffffffff812f494b>] ? sock_recvmsg+0xd9/0x106
 [<ffffffff8106a257>] ? autoremove_wake_function+0x0/0x5a
 [<ffffffff812ff6b4>] ? verify_iovec+0x5b/0xaf
 [<ffffffff812f4d08>] sys_sendmsg+0x21c/0x2a0
 [<ffffffff810e3711>] ? handle_mm_fault+0x2d6/0x681
 [<ffffffff811ced89>] ? __up_read+0x9c/0xbb
 [<ffffffff8106e264>] ? up_read+0x1c/0x32
 [<ffffffff813a9c3f>] ? do_page_fault+0x260/0x2a4
 [<ffffffff8100bdc2>] system_call_fastpath+0x16/0x1b
Code: 5b 41 5c c9 c3 55 48 89 e5 41 55 41 54 49 89 fc 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 48 8b 9f 28 01 00 00 <4c> 8b 6b 20 48 89 75 c8 49 8d 7d 40 e8 5e e9 27 e1 65 8b 04 25
RIP  [<ffffffffa0128942>] netxen_send_cmd_descs.clone.2+0x26/0xfe [netxen_nic]
 RSP <ffff88005753d638>
CR2: 0000000000000020

^ permalink raw reply

* Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
From: Christoph Lameter @ 2009-10-24  2:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Stephan von Krawczynski, Frans Pop, Jiri Kosina, Sven Geggus,
	Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
	David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
	KOSAKI Motohiro, Mohamed Abbas, Jens Axboe, John W. Linville,
	Pekka Enberg, Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
	Kernel Testers List, netdev, linux-kernel
In-Reply-To: <20091022163752.GU11778@csn.ul.ie>

There are now rt dependencies in the page allocator that screw things up?

And an rt flag causes the page allocator to try harder meaning it adds
latency.

?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch next 3/4] netxen: fix bonding support
From: Eric W. Biederman @ 2009-10-24  2:13 UTC (permalink / raw)
  To: Dhananjay Phadke; +Cc: netdev@vger.kernel.org
In-Reply-To: <7608421F3572AB4292BB2532AE89D5658B0B6E4F72@AVEXMB1.qlogic.org>

Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:

> Yes, this was a bug until recent times.
>
> It should be fixed in 2.6.32 development cycle by this commit -
> db4cfd8a6149e778befb2ff6e6f91cdc6394cbe6 ("netxen: handle firmware load errors").
>
> This added a check for adapter->is_up before touching txq lock.

Yes.  That should prevent the null pointer deference.  Will it also
allow setting the mac address when the NIC is down?

Eric

^ permalink raw reply

* Re: [Patch] sctp: remove deprecated SCTP_GET_*_OLD stuffs
From: Cong Wang @ 2009-10-24  2:41 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Sam Ravnborg, linux-kernel, netdev, akpm
In-Reply-To: <4AE1D2AA.8060200@hp.com>

Vlad Yasevich wrote:
> 
> Sam Ravnborg wrote:
>> On Thu, Oct 22, 2009 at 04:53:30PM -0400, Vlad Yasevich wrote:
>>>> diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
>>>> index be2334a..0991f1b 100644
>>>> --- a/include/net/sctp/user.h
>>>> +++ b/include/net/sctp/user.h
>>>> @@ -131,14 +131,6 @@ enum sctp_optname {
>>>>  #define SCTP_SOCKOPT_BINDX_REM	SCTP_SOCKOPT_BINDX_REM
>>>>  	SCTP_SOCKOPT_PEELOFF, 	/* peel off association. */
>>>>  #define SCTP_SOCKOPT_PEELOFF	SCTP_SOCKOPT_PEELOFF
>>>> -	SCTP_GET_PEER_ADDRS_NUM_OLD, 	/* Get number of peer addresss. */
>>>> -#define SCTP_GET_PEER_ADDRS_NUM_OLD	SCTP_GET_PEER_ADDRS_NUM_OLD
>>>> -	SCTP_GET_PEER_ADDRS_OLD, 	/* Get all peer addresss. */
>>>> -#define SCTP_GET_PEER_ADDRS_OLD	SCTP_GET_PEER_ADDRS_OLD
>>>> -	SCTP_GET_LOCAL_ADDRS_NUM_OLD, 	/* Get number of local addresss. */
>>>> -#define SCTP_GET_LOCAL_ADDRS_NUM_OLD	SCTP_GET_LOCAL_ADDRS_NUM_OLD
>>>> -	SCTP_GET_LOCAL_ADDRS_OLD, 	/* Get all local addresss. */
>>>> -#define SCTP_GET_LOCAL_ADDRS_OLD	SCTP_GET_LOCAL_ADDRS_OLD
>>>>  	SCTP_SOCKOPT_CONNECTX_OLD, /* CONNECTX old requests. */
>>> After running the regression suite against this patch I find that we can't
>>> remove the enum values.  Removing the enums changes the value for the remainder
>>> of the definitions and breaks binary compatibility for applications that use
>>> those trailing options.
>>>
>>> You should be ok with removing the #defines and actual code that uses them,
>>> but not the enums.  You can even rename the enums, but we must preserve
>>> numeric ordering.
>> If we really depend on the actual value of an enum as in this case,
>> then e should assign them direct to better document this.
>>
>> 	Sam
>>
> 
> I agree.  I have a patch that converts the enum to just a #define section that
> I'll apply on top of this removal patch and document the deletion.

Hi, Vlad.

I was busy, sorry for joining late. Thanks for doing this!

^ permalink raw reply

* Re: Irq architecture for multi-core network driver.
From: David Miller @ 2009-10-24  3:19 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: ebiederm, ddaney, cfriesen, netdev, linux-kernel, linux-mips
In-Reply-To: <4807377b0910231028g60b479cfycdbf3f4e25384c58@mail.gmail.com>

From: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Date: Fri, 23 Oct 2009 10:28:10 -0700

> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another)  The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?

I think requiring irqbalanced for optimal behavior is more
than reasonable.

And since we explicitly took that policy logic out of the
kernel it makes absolutely no sense to put it back there.

It's policy, and policy is (largely) userspace.

^ permalink raw reply

* [PATCH 1/2 net-next-2.6] ipv6 sit: RCU conversion phase I
From: Eric Dumazet @ 2009-10-24  3:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki

SIT tunnels use one rwlock to protect their prl entries.

This first patch adds RCU locking for prl management,
with standard call_rcu() calls.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ipip.h |    1
 net/ipv6/sit.c     |   73 +++++++++++++++++++++++++++++--------------
 2 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/include/net/ipip.h b/include/net/ipip.h
index 86f1c8b..290effb 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -45,6 +45,7 @@ struct ip_tunnel_prl_entry
 	struct ip_tunnel_prl_entry	*next;
 	__be32				addr;
 	u16				flags;
+	struct rcu_head			rcu_head;
 };
 
 #define IPTUNNEL_XMIT() do {						\
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 510d31f..8cdcc2a 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -240,15 +240,22 @@ failed:
 	return NULL;
 }
 
+static DEFINE_SPINLOCK(ipip6_prl_lock);
+
+#define for_each_prl_rcu(start)			\
+	for (prl = rcu_dereference(start);	\
+	     prl;				\
+	     prl = rcu_dereference(prl->next))
+
 static struct ip_tunnel_prl_entry *
 __ipip6_tunnel_locate_prl(struct ip_tunnel *t, __be32 addr)
 {
-	struct ip_tunnel_prl_entry *p = (struct ip_tunnel_prl_entry *)NULL;
+	struct ip_tunnel_prl_entry *prl;
 
-	for (p = t->prl; p; p = p->next)
-		if (p->addr == addr)
+	for_each_prl_rcu(t->prl)
+		if (prl->addr == addr)
 			break;
-	return p;
+	return prl;
 
 }
 
@@ -273,7 +280,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 		kcalloc(cmax, sizeof(*kp), GFP_KERNEL) :
 		NULL;
 
-	read_lock(&ipip6_lock);
+	rcu_read_lock();
 
 	ca = t->prl_count < cmax ? t->prl_count : cmax;
 
@@ -291,7 +298,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 	}
 
 	c = 0;
-	for (prl = t->prl; prl; prl = prl->next) {
+	for_each_prl_rcu(t->prl) {
 		if (c >= cmax)
 			break;
 		if (kprl.addr != htonl(INADDR_ANY) && prl->addr != kprl.addr)
@@ -303,7 +310,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 			break;
 	}
 out:
-	read_unlock(&ipip6_lock);
+	rcu_read_unlock();
 
 	len = sizeof(*kp) * c;
 	ret = 0;
@@ -324,12 +331,14 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
 	if (a->addr == htonl(INADDR_ANY))
 		return -EINVAL;
 
-	write_lock(&ipip6_lock);
+	spin_lock(&ipip6_prl_lock);
 
 	for (p = t->prl; p; p = p->next) {
 		if (p->addr == a->addr) {
-			if (chg)
-				goto update;
+			if (chg) {
+				p->flags = a->flags;
+				goto out;
+			}
 			err = -EEXIST;
 			goto out;
 		}
@@ -346,46 +355,63 @@ ipip6_tunnel_add_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a, int chg)
 		goto out;
 	}
 
+	INIT_RCU_HEAD(&p->rcu_head);
 	p->next = t->prl;
-	t->prl = p;
-	t->prl_count++;
-update:
 	p->addr = a->addr;
 	p->flags = a->flags;
+	t->prl_count++;
+	rcu_assign_pointer(t->prl, p);
 out:
-	write_unlock(&ipip6_lock);
+	spin_unlock(&ipip6_prl_lock);
 	return err;
 }
 
+static void prl_entry_destroy_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_tunnel_prl_entry, rcu_head));
+}
+
+static void prl_list_destroy_rcu(struct rcu_head *head)
+{
+	struct ip_tunnel_prl_entry *p, *n;
+
+	p = container_of(head, struct ip_tunnel_prl_entry, rcu_head);
+	do {
+		n = p->next;
+		kfree(p);
+		p = n;
+	} while (p);
+}
+
 static int
 ipip6_tunnel_del_prl(struct ip_tunnel *t, struct ip_tunnel_prl *a)
 {
 	struct ip_tunnel_prl_entry *x, **p;
 	int err = 0;
 
-	write_lock(&ipip6_lock);
+	spin_lock(&ipip6_prl_lock);
 
 	if (a && a->addr != htonl(INADDR_ANY)) {
 		for (p = &t->prl; *p; p = &(*p)->next) {
 			if ((*p)->addr == a->addr) {
 				x = *p;
 				*p = x->next;
-				kfree(x);
+				call_rcu(&x->rcu_head, prl_entry_destroy_rcu);
 				t->prl_count--;
 				goto out;
 			}
 		}
 		err = -ENXIO;
 	} else {
-		while (t->prl) {
+		if (t->prl) {
+			t->prl_count = 0;
 			x = t->prl;
-			t->prl = t->prl->next;
-			kfree(x);
-			t->prl_count--;
+			call_rcu(&x->rcu_head, prl_list_destroy_rcu);
+			t->prl = NULL;
 		}
 	}
 out:
-	write_unlock(&ipip6_lock);
+	spin_unlock(&ipip6_prl_lock);
 	return err;
 }
 
@@ -395,7 +421,7 @@ isatap_chksrc(struct sk_buff *skb, struct iphdr *iph, struct ip_tunnel *t)
 	struct ip_tunnel_prl_entry *p;
 	int ok = 1;
 
-	read_lock(&ipip6_lock);
+	rcu_read_lock();
 	p = __ipip6_tunnel_locate_prl(t, iph->saddr);
 	if (p) {
 		if (p->flags & PRL_DEFAULT)
@@ -411,7 +437,7 @@ isatap_chksrc(struct sk_buff *skb, struct iphdr *iph, struct ip_tunnel *t)
 		else
 			ok = 0;
 	}
-	read_unlock(&ipip6_lock);
+	rcu_read_unlock();
 	return ok;
 }
 
@@ -1192,6 +1218,7 @@ static void __exit sit_cleanup(void)
 	xfrm4_tunnel_deregister(&sit_handler, AF_INET6);
 
 	unregister_pernet_gen_device(sit_net_id, &sit_net_ops);
+	rcu_barrier(); /* Wait for completion of call_rcu()'s */
 }
 
 static int __init sit_init(void)

^ permalink raw reply related

* [PATCH 2/2 net-next-2.6] ipv6 sit: RCU conversion phase II
From: Eric Dumazet @ 2009-10-24  3:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki

SIT tunnels use one rwlock to protect their hash tables.

This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/sit.c |   45 +++++++++++++++++++++++++++------------------
 1 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 8cdcc2a..b6b1626 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -77,8 +77,17 @@ struct sit_net {
 	struct net_device *fb_tunnel_dev;
 };
 
-static DEFINE_RWLOCK(ipip6_lock);
+/*
+ * Locking : hash tables are protected by RCU and a spinlock
+ */
+static DEFINE_SPINLOCK(ipip6_lock);
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
 
+/*
+ * Must be invoked with rcu_read_lock
+ */
 static struct ip_tunnel * ipip6_tunnel_lookup(struct net *net,
 		struct net_device *dev, __be32 remote, __be32 local)
 {
@@ -87,26 +96,26 @@ static struct ip_tunnel * ipip6_tunnel_lookup(struct net *net,
 	struct ip_tunnel *t;
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
-	for (t = sitn->tunnels_r_l[h0^h1]; t; t = t->next) {
+	for_each_ip_tunnel_rcu(sitn->tunnels_r_l[h0 ^ h1]) {
 		if (local == t->parms.iph.saddr &&
 		    remote == t->parms.iph.daddr &&
 		    (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
 		    (t->dev->flags & IFF_UP))
 			return t;
 	}
-	for (t = sitn->tunnels_r[h0]; t; t = t->next) {
+	for_each_ip_tunnel_rcu(sitn->tunnels_r[h0]) {
 		if (remote == t->parms.iph.daddr &&
 		    (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
 		    (t->dev->flags & IFF_UP))
 			return t;
 	}
-	for (t = sitn->tunnels_l[h1]; t; t = t->next) {
+	for_each_ip_tunnel_rcu(sitn->tunnels_l[h1]) {
 		if (local == t->parms.iph.saddr &&
 		    (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
 		    (t->dev->flags & IFF_UP))
 			return t;
 	}
-	t = sitn->tunnels_wc[0];
+	t = rcu_dereference(sitn->tunnels_wc[0]);
 	if ((t != NULL) && (t->dev->flags & IFF_UP))
 		return t;
 	return NULL;
@@ -143,9 +152,9 @@ static void ipip6_tunnel_unlink(struct sit_net *sitn, struct ip_tunnel *t)
 
 	for (tp = ipip6_bucket(sitn, t); *tp; tp = &(*tp)->next) {
 		if (t == *tp) {
-			write_lock_bh(&ipip6_lock);
+			spin_lock_bh(&ipip6_lock);
 			*tp = t->next;
-			write_unlock_bh(&ipip6_lock);
+			spin_unlock_bh(&ipip6_lock);
 			break;
 		}
 	}
@@ -155,10 +164,10 @@ static void ipip6_tunnel_link(struct sit_net *sitn, struct ip_tunnel *t)
 {
 	struct ip_tunnel **tp = ipip6_bucket(sitn, t);
 
+	spin_lock_bh(&ipip6_lock);
 	t->next = *tp;
-	write_lock_bh(&ipip6_lock);
-	*tp = t;
-	write_unlock_bh(&ipip6_lock);
+	rcu_assign_pointer(*tp, t);
+	spin_unlock_bh(&ipip6_lock);
 }
 
 static void ipip6_tunnel_clone_6rd(struct net_device *dev, struct sit_net *sitn)
@@ -447,9 +456,9 @@ static void ipip6_tunnel_uninit(struct net_device *dev)
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	if (dev == sitn->fb_tunnel_dev) {
-		write_lock_bh(&ipip6_lock);
+		spin_lock_bh(&ipip6_lock);
 		sitn->tunnels_wc[0] = NULL;
-		write_unlock_bh(&ipip6_lock);
+		spin_unlock_bh(&ipip6_lock);
 		dev_put(dev);
 	} else {
 		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
@@ -502,7 +511,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
 
 	err = -ENOENT;
 
-	read_lock(&ipip6_lock);
+	rcu_read_lock();
 	t = ipip6_tunnel_lookup(dev_net(skb->dev),
 				skb->dev,
 				iph->daddr,
@@ -520,7 +529,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
 		t->err_count = 1;
 	t->err_time = jiffies;
 out:
-	read_unlock(&ipip6_lock);
+	rcu_read_unlock();
 	return err;
 }
 
@@ -540,7 +549,7 @@ static int ipip6_rcv(struct sk_buff *skb)
 
 	iph = ip_hdr(skb);
 
-	read_lock(&ipip6_lock);
+	rcu_read_lock();
 	tunnel = ipip6_tunnel_lookup(dev_net(skb->dev), skb->dev,
 				     iph->saddr, iph->daddr);
 	if (tunnel != NULL) {
@@ -554,7 +563,7 @@ static int ipip6_rcv(struct sk_buff *skb)
 		if ((tunnel->dev->priv_flags & IFF_ISATAP) &&
 		    !isatap_chksrc(skb, iph, tunnel)) {
 			tunnel->dev->stats.rx_errors++;
-			read_unlock(&ipip6_lock);
+			rcu_read_unlock();
 			kfree_skb(skb);
 			return 0;
 		}
@@ -565,12 +574,12 @@ static int ipip6_rcv(struct sk_buff *skb)
 		nf_reset(skb);
 		ipip6_ecn_decapsulate(iph, skb);
 		netif_rx(skb);
-		read_unlock(&ipip6_lock);
+		rcu_read_unlock();
 		return 0;
 	}
 
 	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
-	read_unlock(&ipip6_lock);
+	rcu_read_unlock();
 out:
 	kfree_skb(skb);
 	return 0;

^ permalink raw reply related

* [PATCH net-next-2.6] xfrm6_tunnel: RCU conversion
From: Eric Dumazet @ 2009-10-24  4:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Yoshifuji Hideaki

xfrm6_tunnels use one rwlock to protect their hash tables.

Plain and straightforward conversion to RCU locking to permit better SMP
performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
--- 
 net/ipv6/xfrm6_tunnel.c |   47 ++++++++++++++++++++++----------------
 1 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index 81a95c0..438831d 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -23,7 +23,7 @@
  */
 #include <linux/module.h>
 #include <linux/xfrm.h>
-#include <linux/list.h>
+#include <linux/rculist.h>
 #include <net/ip.h>
 #include <net/xfrm.h>
 #include <net/ipv6.h>
@@ -36,14 +36,15 @@
  * per xfrm_address_t.
  */
 struct xfrm6_tunnel_spi {
-	struct hlist_node list_byaddr;
-	struct hlist_node list_byspi;
-	xfrm_address_t addr;
-	u32 spi;
-	atomic_t refcnt;
+	struct hlist_node	list_byaddr;
+	struct hlist_node	list_byspi;
+	xfrm_address_t		addr;
+	u32			spi;
+	atomic_t		refcnt;
+	struct rcu_head		rcu_head;
 };
 
-static DEFINE_RWLOCK(xfrm6_tunnel_spi_lock);
+static DEFINE_SPINLOCK(xfrm6_tunnel_spi_lock);
 
 static u32 xfrm6_tunnel_spi;
 
@@ -107,6 +108,7 @@ static void xfrm6_tunnel_spi_fini(void)
 		if (!hlist_empty(&xfrm6_tunnel_spi_byspi[i]))
 			return;
 	}
+	rcu_barrier();
 	kmem_cache_destroy(xfrm6_tunnel_spi_kmem);
 	xfrm6_tunnel_spi_kmem = NULL;
 }
@@ -116,7 +118,7 @@ static struct xfrm6_tunnel_spi *__xfrm6_tunnel_spi_lookup(xfrm_address_t *saddr)
 	struct xfrm6_tunnel_spi *x6spi;
 	struct hlist_node *pos;
 
-	hlist_for_each_entry(x6spi, pos,
+	hlist_for_each_entry_rcu(x6spi, pos,
 			     &xfrm6_tunnel_spi_byaddr[xfrm6_tunnel_spi_hash_byaddr(saddr)],
 			     list_byaddr) {
 		if (memcmp(&x6spi->addr, saddr, sizeof(x6spi->addr)) == 0)
@@ -131,10 +133,10 @@ __be32 xfrm6_tunnel_spi_lookup(xfrm_address_t *saddr)
 	struct xfrm6_tunnel_spi *x6spi;
 	u32 spi;
 
-	read_lock_bh(&xfrm6_tunnel_spi_lock);
+	rcu_read_lock_bh();
 	x6spi = __xfrm6_tunnel_spi_lookup(saddr);
 	spi = x6spi ? x6spi->spi : 0;
-	read_unlock_bh(&xfrm6_tunnel_spi_lock);
+	rcu_read_unlock_bh();
 	return htonl(spi);
 }
 
@@ -185,14 +187,15 @@ alloc_spi:
 	if (!x6spi)
 		goto out;
 
+	INIT_RCU_HEAD(&x6spi->rcu_head);
 	memcpy(&x6spi->addr, saddr, sizeof(x6spi->addr));
 	x6spi->spi = spi;
 	atomic_set(&x6spi->refcnt, 1);
 
-	hlist_add_head(&x6spi->list_byspi, &xfrm6_tunnel_spi_byspi[index]);
+	hlist_add_head_rcu(&x6spi->list_byspi, &xfrm6_tunnel_spi_byspi[index]);
 
 	index = xfrm6_tunnel_spi_hash_byaddr(saddr);
-	hlist_add_head(&x6spi->list_byaddr, &xfrm6_tunnel_spi_byaddr[index]);
+	hlist_add_head_rcu(&x6spi->list_byaddr, &xfrm6_tunnel_spi_byaddr[index]);
 out:
 	return spi;
 }
@@ -202,26 +205,32 @@ __be32 xfrm6_tunnel_alloc_spi(xfrm_address_t *saddr)
 	struct xfrm6_tunnel_spi *x6spi;
 	u32 spi;
 
-	write_lock_bh(&xfrm6_tunnel_spi_lock);
+	spin_lock_bh(&xfrm6_tunnel_spi_lock);
 	x6spi = __xfrm6_tunnel_spi_lookup(saddr);
 	if (x6spi) {
 		atomic_inc(&x6spi->refcnt);
 		spi = x6spi->spi;
 	} else
 		spi = __xfrm6_tunnel_alloc_spi(saddr);
-	write_unlock_bh(&xfrm6_tunnel_spi_lock);
+	spin_unlock_bh(&xfrm6_tunnel_spi_lock);
 
 	return htonl(spi);
 }
 
 EXPORT_SYMBOL(xfrm6_tunnel_alloc_spi);
 
+static void x6spi_destroy_rcu(struct rcu_head *head)
+{
+	kmem_cache_free(xfrm6_tunnel_spi_kmem,
+			container_of(head, struct xfrm6_tunnel_spi, rcu_head));
+}
+
 void xfrm6_tunnel_free_spi(xfrm_address_t *saddr)
 {
 	struct xfrm6_tunnel_spi *x6spi;
 	struct hlist_node *pos, *n;
 
-	write_lock_bh(&xfrm6_tunnel_spi_lock);
+	spin_lock_bh(&xfrm6_tunnel_spi_lock);
 
 	hlist_for_each_entry_safe(x6spi, pos, n,
 				  &xfrm6_tunnel_spi_byaddr[xfrm6_tunnel_spi_hash_byaddr(saddr)],
@@ -229,14 +238,14 @@ void xfrm6_tunnel_free_spi(xfrm_address_t *saddr)
 	{
 		if (memcmp(&x6spi->addr, saddr, sizeof(x6spi->addr)) == 0) {
 			if (atomic_dec_and_test(&x6spi->refcnt)) {
-				hlist_del(&x6spi->list_byaddr);
-				hlist_del(&x6spi->list_byspi);
-				kmem_cache_free(xfrm6_tunnel_spi_kmem, x6spi);
+				hlist_del_rcu(&x6spi->list_byaddr);
+				hlist_del_rcu(&x6spi->list_byspi);
+				call_rcu(&x6spi->rcu_head, x6spi_destroy_rcu);
 				break;
 			}
 		}
 	}
-	write_unlock_bh(&xfrm6_tunnel_spi_lock);
+	spin_unlock_bh(&xfrm6_tunnel_spi_lock);
 }
 
 EXPORT_SYMBOL(xfrm6_tunnel_free_spi);

^ permalink raw reply related

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-24  4:35 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <20091023211338.GA6145@linux.vnet.ibm.com>

Paul E. McKenney a écrit :
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> [PATCH] net: allow netdev_wait_allrefs() to run faster
>>
>> netdev_wait_allrefs() waits that all references to a device vanishes.
>>
>> It currently uses a _very_ pessimistic 250 ms delay between each probe.
>> Some users report that no more than 4 devices can be dismantled per second,
>> this is a pretty serious problem for extreme setups.
>>
>> Most likely, references only wait for a rcu grace period that should come
>> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
> 
> Is this a place where synchronize_rcu_expedited() is appropriate?
> (It went in to 2.6.32-rc1.)
> 

Thanks for the tip Paul

I believe netdev_wait_allrefs() is not a perfect candidate, because 
synchronize_sched_expedited() seems really expensive.

Maybe we could call it once only, if we had to call 1 times
the jiffie delay ?

diff --git a/net/core/dev.c b/net/core/dev.c
index fa88dcd..9b04b9a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
 static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	unsigned int count = 0;
 
 	rebroadcast_time = warning_time = jiffies;
 	while (atomic_read(&dev->refcnt) != 0) {
@@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		if (count++ == 1)
+			synchronize_rcu_expedited();
+		else
+			schedule_timeout_uninterruptible(1);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "


^ permalink raw reply related

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Paul E. McKenney @ 2009-10-24  5:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu
In-Reply-To: <4AE28429.6040608@gmail.com>

On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> >> [PATCH] net: allow netdev_wait_allrefs() to run faster
> >>
> >> netdev_wait_allrefs() waits that all references to a device vanishes.
> >>
> >> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> >> Some users report that no more than 4 devices can be dismantled per second,
> >> this is a pretty serious problem for extreme setups.
> >>
> >> Most likely, references only wait for a rcu grace period that should come
> >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
> > 
> > Is this a place where synchronize_rcu_expedited() is appropriate?
> > (It went in to 2.6.32-rc1.)
> 
> Thanks for the tip Paul
> 
> I believe netdev_wait_allrefs() is not a perfect candidate, because 
> synchronize_sched_expedited() seems really expensive.

It does indeed keep the CPUs quite busy for a bit.  ;-)

> Maybe we could call it once only, if we had to call 1 times
> the jiffie delay ?

This could be a very useful approach!

However, please keep in mind that although synchronize_rcu_expedited()
forces a grace period, it does nothing to speed the invocation of other
RCU callbacks.  In short, synchronize_rcu_expedited() is a faster version
of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
or call_rcu() invocations.

The reason I point this out is that it looks to me that the code below is
waiting for some other task which is in turn waiting on a grace period.
But I don't know this code, so could easily be confused.

						Thanx, paul

> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa88dcd..9b04b9a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
>  static void netdev_wait_allrefs(struct net_device *dev)
>  {
>  	unsigned long rebroadcast_time, warning_time;
> +	unsigned int count = 0;
> 
>  	rebroadcast_time = warning_time = jiffies;
>  	while (atomic_read(&dev->refcnt) != 0) {
> @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
> 
> -		msleep(250);
> +		if (count++ == 1)
> +			synchronize_rcu_expedited();
> +		else
> +			schedule_timeout_uninterruptible(1);
> 
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox