Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next-2.6 PATCH] igb: add support for reporting 5GT/s during probe on PCIe Gen2
From: Jeff Kirsher @ 2010-04-27 11:02 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Alexander Duyck, Jeff Kirsher

From: Alexander Duyck <alexander.h.duyck@intel.com>

This change corrects the fact that we were not reporting Gen2 link speeds
when we were in fact connected at Gen2 rates.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/igb/e1000_defines.h |    4 ----
 drivers/net/igb/e1000_mac.c     |   27 ++++++++++++++++++++-------
 drivers/net/igb/igb_main.c      |    1 +
 include/linux/pci_regs.h        |    3 +++
 4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index 31d24e0..24d9be6 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -610,11 +610,7 @@
 #define IGP_LED3_MODE           0x07000000
 
 /* PCI/PCI-X/PCI-EX Config space */
-#define PCIE_LINK_STATUS             0x12
 #define PCIE_DEVICE_CONTROL2         0x28
-
-#define PCIE_LINK_WIDTH_MASK         0x3F0
-#define PCIE_LINK_WIDTH_SHIFT        4
 #define PCIE_DEVICE_CONTROL2_16ms    0x0005
 
 #define PHY_REVISION_MASK      0xFFFFFFF0
diff --git a/drivers/net/igb/e1000_mac.c b/drivers/net/igb/e1000_mac.c
index be8d010..90c5e01 100644
--- a/drivers/net/igb/e1000_mac.c
+++ b/drivers/net/igb/e1000_mac.c
@@ -53,17 +53,30 @@ s32 igb_get_bus_info_pcie(struct e1000_hw *hw)
 	u16 pcie_link_status;
 
 	bus->type = e1000_bus_type_pci_express;
-	bus->speed = e1000_bus_speed_2500;
 
 	ret_val = igb_read_pcie_cap_reg(hw,
-					  PCIE_LINK_STATUS,
-					  &pcie_link_status);
-	if (ret_val)
+					PCI_EXP_LNKSTA,
+					&pcie_link_status);
+	if (ret_val) {
 		bus->width = e1000_bus_width_unknown;
-	else
+		bus->speed = e1000_bus_speed_unknown;
+	} else {
+		switch (pcie_link_status & PCI_EXP_LNKSTA_CLS) {
+		case PCI_EXP_LNKSTA_CLS_2_5GB:
+			bus->speed = e1000_bus_speed_2500;
+			break;
+		case PCI_EXP_LNKSTA_CLS_5_0GB:
+			bus->speed = e1000_bus_speed_5000;
+			break;
+		default:
+			bus->speed = e1000_bus_speed_unknown;
+			break;
+		}
+
 		bus->width = (enum e1000_bus_width)((pcie_link_status &
-						     PCIE_LINK_WIDTH_MASK) >>
-						     PCIE_LINK_WIDTH_SHIFT);
+						     PCI_EXP_LNKSTA_NLW) >>
+						     PCI_EXP_LNKSTA_NLW_SHIFT);
+	}
 
 	reg = rd32(E1000_STATUS);
 	bus->func = (reg & E1000_STATUS_FUNC_MASK) >> E1000_STATUS_FUNC_SHIFT;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index a14303a..919e363 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -1638,6 +1638,7 @@ static int __devinit igb_probe(struct pci_dev *pdev,
 	dev_info(&pdev->dev, "%s: (PCIe:%s:%s) %pM\n",
 		 netdev->name,
 		 ((hw->bus.speed == e1000_bus_speed_2500) ? "2.5Gb/s" :
+		  (hw->bus.speed == e1000_bus_speed_5000) ? "5.0Gb/s" :
 		                                            "unknown"),
 		 ((hw->bus.width == e1000_bus_width_pcie_x4) ? "Width x4" :
 		  (hw->bus.width == e1000_bus_width_pcie_x2) ? "Width x2" :
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index c8f3029..c4c3d68 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -442,7 +442,10 @@
 #define  PCI_EXP_LNKCTL_LABIE	0x0800	/* Lnk Autonomous Bandwidth Interrupt Enable */
 #define PCI_EXP_LNKSTA		18	/* Link Status */
 #define  PCI_EXP_LNKSTA_CLS	0x000f	/* Current Link Speed */
+#define  PCI_EXP_LNKSTA_CLS_2_5GB 0x01	/* Current Link Speed 2.5GT/s */
+#define  PCI_EXP_LNKSTA_CLS_5_0GB 0x02	/* Current Link Speed 5.0GT/s */
 #define  PCI_EXP_LNKSTA_NLW	0x03f0	/* Nogotiated Link Width */
+#define  PCI_EXP_LNKSTA_NLW_SHIFT 4	/* start of NLW mask in link status */
 #define  PCI_EXP_LNKSTA_LT	0x0800	/* Link Training */
 #define  PCI_EXP_LNKSTA_SLC	0x1000	/* Slot Clock Configuration */
 #define  PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */


^ permalink raw reply related

* [net-2.6 PATCH] ixgbe: cleanup ethtool autoneg input
From: Jeff Kirsher @ 2010-04-27 11:25 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Don Skidmore, Jeff Kirsher

From: Don Skidmore <donald.c.skidmore@intel.com>

The way we were setting autoneg via ethtool was inconstant with that
of our other drivers.  It will change the following:

If autoneg is off:
>ethtool -a eth0
Pause parameters for eth0:

Autonegotiate:  off
RX:             off
TX:             off

Before:
>ethtool -A eth0 autoneg on
>ethtool -a eth0
Pause parameters for eth0:

Autonegotiate:  off
RX:             off
TX:             off

Now:
>ethtool -A eth0 autoneg on
>ethtool -a eth0
Pause parameters for eth0:

Autonegotiate:  on
RX:             on
TX:             on

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_ethtool.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index 8f461d5..7b391f9 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -365,7 +365,7 @@ static int ixgbe_set_pauseparam(struct net_device *netdev,
 	else
 		fc.disable_fc_autoneg = false;
 
-	if (pause->rx_pause && pause->tx_pause)
+	if ((pause->rx_pause && pause->tx_pause) || pause->autoneg)
 		fc.requested_mode = ixgbe_fc_full;
 	else if (pause->rx_pause && !pause->tx_pause)
 		fc.requested_mode = ixgbe_fc_rx_pause;


^ permalink raw reply related

* [net-next-2.6 PATCH 1/2] ixgbe: enable extremely low latency
From: Jeff Kirsher @ 2010-04-27 11:37 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Jesse Brandeburg, Jeff Kirsher

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

82598/82599 can support EITR == 0, which allows for the
absolutely lowest latency setting in the hardware.  This disables
writeback batching and anything else that relies upon a delayed
interrupt. This patch enables the feature of "override" when a
user sets rx-usecs to zero, the driver will respect that setting
over using RSC, and automatically disable RSC.  If rx-usecs is
used to set the EITR value to 0, then the driver should disable
LRO (aka RSC) internally until EITR is set to non-zero again.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_ethtool.c |   87 +++++++++++++++++++++++++++++++++----
 drivers/net/ixgbe/ixgbe_main.c    |    9 ++++
 2 files changed, 87 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index 7b391f9..b72a997 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -2079,6 +2079,27 @@ static int ixgbe_get_coalesce(struct net_device *netdev,
 	return 0;
 }
 
+/*
+ * this function must be called before setting the new value of
+ * rx_itr_setting
+ */
+static void ixgbe_reenable_rsc(struct ixgbe_adapter *adapter,
+                               struct ethtool_coalesce *ec)
+{
+	/* check the old value and enable RSC if necessary */
+	if ((adapter->rx_itr_setting == 0) &&
+	    (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE)) {
+		adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
+		adapter->netdev->features |= NETIF_F_LRO;
+		DPRINTK(PROBE, INFO, "rx-usecs set to %d, re-enabling RSC\n",
+		        ec->rx_coalesce_usecs);
+		if (netif_running(adapter->netdev))
+			ixgbe_reinit_locked(adapter);
+		else
+			ixgbe_reset(adapter);
+	}
+}
+
 static int ixgbe_set_coalesce(struct net_device *netdev,
                               struct ethtool_coalesce *ec)
 {
@@ -2095,11 +2116,20 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 		adapter->tx_ring[0]->work_limit = ec->tx_max_coalesced_frames_irq;
 
 	if (ec->rx_coalesce_usecs > 1) {
+		u32 max_int;
+		if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED)
+			max_int = IXGBE_MAX_RSC_INT_RATE;
+		else
+			max_int = IXGBE_MAX_INT_RATE;
+
 		/* check the limits */
-		if ((1000000/ec->rx_coalesce_usecs > IXGBE_MAX_INT_RATE) ||
+		if ((1000000/ec->rx_coalesce_usecs > max_int) ||
 		    (1000000/ec->rx_coalesce_usecs < IXGBE_MIN_INT_RATE))
 			return -EINVAL;
 
+		/* check the old value and enable RSC if necessary */
+		ixgbe_reenable_rsc(adapter, ec);
+
 		/* store the value in ints/second */
 		adapter->rx_eitr_param = 1000000/ec->rx_coalesce_usecs;
 
@@ -2108,6 +2138,9 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 		/* clear the lower bit as its used for dynamic state */
 		adapter->rx_itr_setting &= ~1;
 	} else if (ec->rx_coalesce_usecs == 1) {
+		/* check the old value and enable RSC if necessary */
+		ixgbe_reenable_rsc(adapter, ec);
+
 		/* 1 means dynamic mode */
 		adapter->rx_eitr_param = 20000;
 		adapter->rx_itr_setting = 1;
@@ -2116,14 +2149,34 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 		 * any other value means disable eitr, which is best
 		 * served by setting the interrupt rate very high
 		 */
-		if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED)
-			adapter->rx_eitr_param = IXGBE_MAX_RSC_INT_RATE;
-		else
-			adapter->rx_eitr_param = IXGBE_MAX_INT_RATE;
+		adapter->rx_eitr_param = IXGBE_MAX_INT_RATE;
 		adapter->rx_itr_setting = 0;
+
+		/*
+		 * if hardware RSC is enabled, disable it when
+		 * setting low latency mode, to avoid errata, assuming
+		 * that when the user set low latency mode they want
+		 * it at the cost of anything else
+		 */
+		if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) {
+			adapter->flags2 &= ~IXGBE_FLAG2_RSC_ENABLED;
+			netdev->features &= ~NETIF_F_LRO;
+			DPRINTK(PROBE, INFO,
+			        "rx-usecs set to 0, disabling RSC\n");
+
+			if (netif_running(netdev))
+				ixgbe_reinit_locked(adapter);
+			else
+				ixgbe_reset(adapter);
+			return 0;
+		}
 	}
 
 	if (ec->tx_coalesce_usecs > 1) {
+		/*
+		 * don't have to worry about max_int as above because
+		 * tx vectors don't do hardware RSC (an rx function)
+		 */
 		/* check the limits */
 		if ((1000000/ec->tx_coalesce_usecs > IXGBE_MAX_INT_RATE) ||
 		    (1000000/ec->tx_coalesce_usecs < IXGBE_MIN_INT_RATE))
@@ -2178,10 +2231,26 @@ static int ixgbe_set_flags(struct net_device *netdev, u32 data)
 	ethtool_op_set_flags(netdev, data);
 
 	/* if state changes we need to update adapter->flags and reset */
-	if ((!!(data & ETH_FLAG_LRO)) != 
-	    (!!(adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED))) {
-		adapter->flags2 ^= IXGBE_FLAG2_RSC_ENABLED;
-		need_reset = true;
+	if (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE) {
+		/*
+		 * cast both to bool and verify if they are set the same
+		 * but only enable RSC if itr is non-zero, as
+		 * itr=0 and RSC are mutually exclusive
+		 */
+		if (((!!(data & ETH_FLAG_LRO)) !=
+		     (!!(adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED))) &&
+		    adapter->rx_itr_setting) {
+			adapter->flags2 ^= IXGBE_FLAG2_RSC_ENABLED;
+			switch (adapter->hw.mac.type) {
+			case ixgbe_mac_82599EB:
+				need_reset = true;
+				break;
+			default:
+				break;
+			}
+		} else if (!adapter->rx_itr_setting) {
+			netdev->features &= ~ETH_FLAG_LRO;
+		}
 	}
 
 	/*
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 32781b3..0c4ca68 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1190,6 +1190,15 @@ void ixgbe_write_eitr(struct ixgbe_q_vector *q_vector)
 		itr_reg |= (itr_reg << 16);
 	} else if (adapter->hw.mac.type == ixgbe_mac_82599EB) {
 		/*
+		 * 82599 can support a value of zero, so allow it for
+		 * max interrupt rate, but there is an errata where it can
+		 * not be zero with RSC
+		 */
+		if (itr_reg == 8 &&
+		    !(adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED))
+			itr_reg = 0;
+
+		/*
 		 * set the WDIS bit to not clear the timer bits and cause an
 		 * immediate assertion of the interrupt
 		 */


^ permalink raw reply related

* [net-next-2.6 PATCH 2/2] ixgbe: fix bug when EITR=0 causing no writebacks
From: Jeff Kirsher @ 2010-04-27 11:37 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Jesse Brandeburg, Jeff Kirsher
In-Reply-To: <20100427113651.24431.9221.stgit@localhost.localdomain>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

writebacks can be held indefinitely by hardware if EITR=0, when
combined with TXDCTL.WTHRESH=8.  When EITR=0, WTHRESH should be
set back to zero.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_ethtool.c |   31 +++++++++++++++++++------------
 drivers/net/ixgbe/ixgbe_main.c    |    9 +++++++--
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index b72a997..dfbfe35 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -2083,7 +2083,7 @@ static int ixgbe_get_coalesce(struct net_device *netdev,
  * this function must be called before setting the new value of
  * rx_itr_setting
  */
-static void ixgbe_reenable_rsc(struct ixgbe_adapter *adapter,
+static bool ixgbe_reenable_rsc(struct ixgbe_adapter *adapter,
                                struct ethtool_coalesce *ec)
 {
 	/* check the old value and enable RSC if necessary */
@@ -2093,11 +2093,9 @@ static void ixgbe_reenable_rsc(struct ixgbe_adapter *adapter,
 		adapter->netdev->features |= NETIF_F_LRO;
 		DPRINTK(PROBE, INFO, "rx-usecs set to %d, re-enabling RSC\n",
 		        ec->rx_coalesce_usecs);
-		if (netif_running(adapter->netdev))
-			ixgbe_reinit_locked(adapter);
-		else
-			ixgbe_reset(adapter);
+		return true;
 	}
+	return false;
 }
 
 static int ixgbe_set_coalesce(struct net_device *netdev,
@@ -2106,6 +2104,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 	struct ixgbe_adapter *adapter = netdev_priv(netdev);
 	struct ixgbe_q_vector *q_vector;
 	int i;
+	bool need_reset = false;
 
 	/* don't accept tx specific changes if we've got mixed RxTx vectors */
 	if (adapter->q_vector[0]->txr_count && adapter->q_vector[0]->rxr_count
@@ -2128,7 +2127,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 			return -EINVAL;
 
 		/* check the old value and enable RSC if necessary */
-		ixgbe_reenable_rsc(adapter, ec);
+		need_reset = ixgbe_reenable_rsc(adapter, ec);
 
 		/* store the value in ints/second */
 		adapter->rx_eitr_param = 1000000/ec->rx_coalesce_usecs;
@@ -2139,7 +2138,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 		adapter->rx_itr_setting &= ~1;
 	} else if (ec->rx_coalesce_usecs == 1) {
 		/* check the old value and enable RSC if necessary */
-		ixgbe_reenable_rsc(adapter, ec);
+		need_reset = ixgbe_reenable_rsc(adapter, ec);
 
 		/* 1 means dynamic mode */
 		adapter->rx_eitr_param = 20000;
@@ -2164,11 +2163,7 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 			DPRINTK(PROBE, INFO,
 			        "rx-usecs set to 0, disabling RSC\n");
 
-			if (netif_running(netdev))
-				ixgbe_reinit_locked(adapter);
-			else
-				ixgbe_reset(adapter);
-			return 0;
+			need_reset = true;
 		}
 	}
 
@@ -2220,6 +2215,18 @@ static int ixgbe_set_coalesce(struct net_device *netdev,
 		ixgbe_write_eitr(q_vector);
 	}
 
+	/*
+	 * do reset here at the end to make sure EITR==0 case is handled
+	 * correctly w.r.t stopping tx, and changing TXDCTL.WTHRESH settings
+	 * also locks in RSC enable/disable which requires reset
+	 */
+	if (need_reset) {
+		if (netif_running(netdev))
+			ixgbe_reinit_locked(adapter);
+		else
+			ixgbe_reset(adapter);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 0c4ca68..fe7e260 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2945,8 +2945,13 @@ static int ixgbe_up_complete(struct ixgbe_adapter *adapter)
 	for (i = 0; i < adapter->num_tx_queues; i++) {
 		j = adapter->tx_ring[i]->reg_idx;
 		txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(j));
-		/* enable WTHRESH=8 descriptors, to encourage burst writeback */
-		txdctl |= (8 << 16);
+		if (adapter->rx_itr_setting == 0) {
+			/* cannot set wthresh when itr==0 */
+			txdctl &= ~0x007F0000;
+		} else {
+			/* enable WTHRESH=8 descriptors, to encourage burst writeback */
+			txdctl |= (8 << 16);
+		}
 		IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(j), txdctl);
 	}
 


^ permalink raw reply related

* [net-next-2.6 PATCH] ixgbe: ixgbe_down needs to stop dev_watchdog
From: Jeff Kirsher @ 2010-04-27 12:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, John Fastabend, Peter P Waskiewicz Jr,
	Jeff Kirsher

From: John Fastabend <john.r.fastabend@intel.com>

There is a small race between when the tx queues are stopped
and when netif_carrier_off() is called in ixgbe_down.  If the
dev_watchdog() timer fires during this time it is possible for
a false tx timeout to occur.

This patch moves the netif_carrier_off() so that it is called before
the tx queues are stopped preventing the dev_watchdog timer from
detecting false tx timeouts.  The race is seen occosionally when
FCoE or DCB settings are being configured or changed.

Testing note, running ifconfig up/down will not reproduce this
issue because dev_open/dev_close call dev_deactivate() and then
dev_activate().

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_main.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index fe7e260..5258b3d 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3291,22 +3291,23 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	rxctrl = IXGBE_READ_REG(hw, IXGBE_RXCTRL);
 	IXGBE_WRITE_REG(hw, IXGBE_RXCTRL, rxctrl & ~IXGBE_RXCTRL_RXEN);
 
-	netif_tx_disable(netdev);
-
 	IXGBE_WRITE_FLUSH(hw);
 	msleep(10);
 
 	netif_tx_stop_all_queues(netdev);
 
-	ixgbe_irq_disable(adapter);
-
-	ixgbe_napi_disable_all(adapter);
-
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	cancel_work_sync(&adapter->watchdog_task);
 
+	netif_carrier_off(netdev);
+	netif_tx_disable(netdev);
+
+	ixgbe_irq_disable(adapter);
+
+	ixgbe_napi_disable_all(adapter);
+
 	if (adapter->flags & IXGBE_FLAG_FDIR_HASH_CAPABLE ||
 	    adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)
 		cancel_work_sync(&adapter->fdir_reinit_task);
@@ -3324,8 +3325,6 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 		                (IXGBE_READ_REG(hw, IXGBE_DMATXCTL) &
 		                 ~IXGBE_DMATXCTL_TE));
 
-	netif_carrier_off(netdev);
-
 	/* clear n-tuple filters that are cached */
 	ethtool_ntuple_flush(netdev);
 


^ permalink raw reply related

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Thomas Gleixner @ 2010-04-27 12:32 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: davem, arjan, netdev, linux-kernel
In-Reply-To: <20100419045741.30276.23233.stgit@ppwaskie-hc2.jf.intel.com>

On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
> +/**
> + * struct irqaffinityhint - per interrupt affinity helper
> + * @callback:	device driver callback function
> + * @dev:	reference for the affected device
> + * @irq:	interrupt number
> + */
> +struct irqaffinityhint {
> +	irq_affinity_hint_t callback;
> +	void *dev;
> +	int irq;
> +};

Why do you need that extra data structure ? The device and the irq
number are known, so all you need is the callback itself. So no need
for allocating memory ....

> +static ssize_t irq_affinity_hint_proc_write(struct file *file,
> +		const char __user *buffer, size_t count, loff_t *pos)
> +{
> +	/* affinity_hint is read-only from proc */
> +	return -EOPNOTSUPP;
> +}
> +

Why do you want a write function when the file is read only ?

Thanks,

	tglx

^ permalink raw reply

* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Arnd Bergmann @ 2010-04-27 12:35 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Rose, Gregory V, David Miller, netdev@vger.kernel.org,
	chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <C7FB77DB.2BC42%scofeldm@cisco.com>

On Tuesday 27 April 2010, Scott Feldman wrote:
> > Yes, I believe that's there today:
> > 
> >     NLA_PUT_U32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent));
> > 
> > The number of VFs is returned in RTM_GETLINK.  But, it's only returned if:
> > 
> >     if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent)
> > 
> > For my proposal, I'll need to return IFLA_NUM_VF unconditionally so callers
> > can get num VFs.
> 
> Hmmm...seems IFLA_NUM_VF assumes a PCI device supporting SR-IOV when it uses
> dev_num_vf().  I think a better option would have been to query the device
> for the number of VFs, without assuming SR-IOV or even PCI.
> 
> I see a ndo_get_num_vf() coming...

Shouldn't the number of registered port profiles be totally independent of
the number of virtual functions?

Any of the VFs could multiplex multiple guests using macvlan, which means you
need to register each guest separately, not each VF.

Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
at least when we want to extend this to adapters that don't do it in firmware.

	Arnd

^ permalink raw reply

* [PATCH net-next-2.6] rps: inet_rps_save_rxhash() argument is not const
From: Eric Dumazet @ 2010-04-27 12:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

const qualifier on sock argument is misleading, since we can modify rxhash.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index b487bc1..c1d4295 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -248,7 +248,7 @@ static inline void inet_rps_reset_flow(const struct sock *sk)
 #endif
 }
 
-static inline void inet_rps_save_rxhash(const struct sock *sk, u32 rxhash)
+static inline void inet_rps_save_rxhash(struct sock *sk, u32 rxhash)
 {
 #ifdef CONFIG_RPS
 	if (unlikely(inet_sk(sk)->rxhash != rxhash)) {



^ permalink raw reply related

* [PATCH 0/4] net: ipmr netlink interface for route dumping
From: Patrick, McHardy, kaber @ 2010-04-27 13:26 UTC (permalink / raw)
  To: davem; +Cc: netdev

These patches add support to ipmr to dump routes of all tables over
netlink. The /proc interface contains an unterminated list of oifs
at the end of a line, therefore it can't be extended to print the
routing table number without breaking backwards compatibility.

These patches fully decouple rtnetlink families from real address
families and add a new family RTNL_FAMILY_IPMR for ipmr as well as
a RTM_GETROUTE handler.

The first patch additionally marks the fib_rules_ops templates as
const __net_initdata. This is not directly related to the netlink
interface, but was already queued in my tree.

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master

Thanks!

^ permalink raw reply

* [PATCH 2/3] net: rtnetlink: decouple rtnetlink address families from real address families
From: Patrick, McHardy, kaber @ 2010-04-27 13:26 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

Decouple rtnetlink address families from real address families in socket.h to
be able to add rtnetlink interfaces to code that is not a real address family
without increasing AF_MAX/NPROTO.

This will be used to add support for multicast route dumping from all tables
as the proc interface can't be extended to support anything but the main table
without breaking compatibility.

This partialy undoes the patch to introduce independant families for routing
rules and converts ipmr routing rules to a new rtnetlink family. Similar to
that patch, values up to 127 are reserved for real address families, values
above that may be used arbitrarily.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/fib_rules.h |    8 --------
 include/linux/rtnetlink.h |    6 ++++++
 net/core/rtnetlink.c      |   14 +++++++-------
 net/decnet/dn_rules.c     |    2 +-
 net/ipv4/fib_rules.c      |    2 +-
 net/ipv4/ipmr.c           |    2 +-
 net/ipv6/fib6_rules.c     |    2 +-
 7 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/include/linux/fib_rules.h b/include/linux/fib_rules.h
index 04a3976..51da65b 100644
--- a/include/linux/fib_rules.h
+++ b/include/linux/fib_rules.h
@@ -15,14 +15,6 @@
 /* try to find source address in routing lookups */
 #define FIB_RULE_FIND_SADDR	0x00010000
 
-/* fib_rules families. values up to 127 are reserved for real address
- * families, values above 128 may be used arbitrarily.
- */
-#define FIB_RULES_IPV4		AF_INET
-#define FIB_RULES_IPV6		AF_INET6
-#define FIB_RULES_DECNET	AF_DECnet
-#define FIB_RULES_IPMR		128
-
 struct fib_rule_hdr {
 	__u8		family;
 	__u8		dst_len;
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index d1c7c90..5a42c36 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -7,6 +7,12 @@
 #include <linux/if_addr.h>
 #include <linux/neighbour.h>
 
+/* rtnetlink families. Values up to 127 are reserved for real address
+ * families, values above 128 may be used arbitrarily.
+ */
+#define RTNL_FAMILY_IPMR		128
+#define RTNL_FAMILY_MAX			128
+
 /****
  *		Routing/neighbour discovery messages.
  ****/
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 78c8598..fd781b6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -98,7 +98,7 @@ int lockdep_rtnl_is_held(void)
 EXPORT_SYMBOL(lockdep_rtnl_is_held);
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 
-static struct rtnl_link *rtnl_msg_handlers[NPROTO];
+static struct rtnl_link *rtnl_msg_handlers[RTNL_FAMILY_MAX + 1];
 
 static inline int rtm_msgindex(int msgtype)
 {
@@ -118,7 +118,7 @@ static rtnl_doit_func rtnl_get_doit(int protocol, int msgindex)
 {
 	struct rtnl_link *tab;
 
-	if (protocol < NPROTO)
+	if (protocol <= RTNL_FAMILY_MAX)
 		tab = rtnl_msg_handlers[protocol];
 	else
 		tab = NULL;
@@ -133,7 +133,7 @@ static rtnl_dumpit_func rtnl_get_dumpit(int protocol, int msgindex)
 {
 	struct rtnl_link *tab;
 
-	if (protocol < NPROTO)
+	if (protocol <= RTNL_FAMILY_MAX)
 		tab = rtnl_msg_handlers[protocol];
 	else
 		tab = NULL;
@@ -167,7 +167,7 @@ int __rtnl_register(int protocol, int msgtype,
 	struct rtnl_link *tab;
 	int msgindex;
 
-	BUG_ON(protocol < 0 || protocol >= NPROTO);
+	BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
 	msgindex = rtm_msgindex(msgtype);
 
 	tab = rtnl_msg_handlers[protocol];
@@ -219,7 +219,7 @@ int rtnl_unregister(int protocol, int msgtype)
 {
 	int msgindex;
 
-	BUG_ON(protocol < 0 || protocol >= NPROTO);
+	BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
 	msgindex = rtm_msgindex(msgtype);
 
 	if (rtnl_msg_handlers[protocol] == NULL)
@@ -241,7 +241,7 @@ EXPORT_SYMBOL_GPL(rtnl_unregister);
  */
 void rtnl_unregister_all(int protocol)
 {
-	BUG_ON(protocol < 0 || protocol >= NPROTO);
+	BUG_ON(protocol < 0 || protocol > RTNL_FAMILY_MAX);
 
 	kfree(rtnl_msg_handlers[protocol]);
 	rtnl_msg_handlers[protocol] = NULL;
@@ -1384,7 +1384,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 
 	if (s_idx == 0)
 		s_idx = 1;
-	for (idx = 1; idx < NPROTO; idx++) {
+	for (idx = 1; idx <= RTNL_FAMILY_MAX; idx++) {
 		int type = cb->nlh->nlmsg_type-RTM_BASE;
 		if (idx < s_idx || idx == PF_PACKET)
 			continue;
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 1226bca..48fdf10 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -217,7 +217,7 @@ static void dn_fib_rule_flush_cache(struct fib_rules_ops *ops)
 }
 
 static const struct fib_rules_ops __net_initdata dn_fib_rules_ops_template = {
-	.family		= FIB_RULES_DECNET,
+	.family		= AF_DECnet,
 	.rule_size	= sizeof(struct dn_fib_rule),
 	.addr_size	= sizeof(u16),
 	.action		= dn_fib_rule_action,
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 8ab62a5..76daeb5 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -246,7 +246,7 @@ static void fib4_rule_flush_cache(struct fib_rules_ops *ops)
 }
 
 static const struct fib_rules_ops __net_initdata fib4_rules_ops_template = {
-	.family		= FIB_RULES_IPV4,
+	.family		= AF_INET,
 	.rule_size	= sizeof(struct fib4_rule),
 	.addr_size	= sizeof(u32),
 	.action		= fib4_rule_action,
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 7d3e382..41e8fc0 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -217,7 +217,7 @@ static int ipmr_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 }
 
 static const struct fib_rules_ops __net_initdata ipmr_rules_ops_template = {
-	.family		= FIB_RULES_IPMR,
+	.family		= RTNL_FAMILY_IPMR,
 	.rule_size	= sizeof(struct ipmr_rule),
 	.addr_size	= sizeof(u32),
 	.action		= ipmr_rule_action,
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 35f6949..8e44f8f 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -238,7 +238,7 @@ static size_t fib6_rule_nlmsg_payload(struct fib_rule *rule)
 }
 
 static const struct fib_rules_ops __net_initdata fib6_rules_ops_template = {
-	.family			= FIB_RULES_IPV6,
+	.family			= AF_INET6,
 	.rule_size		= sizeof(struct fib6_rule),
 	.addr_size		= sizeof(struct in6_addr),
 	.action			= fib6_rule_action,
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 1/3] net: fib_rules: mark arguments to fib_rules_register const and __net_initdata
From: Patrick, McHardy, kaber @ 2010-04-27 13:26 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

fib_rules_register() duplicates the template passed to it without modification,
mark the argument as const. Additionally the templates are only needed when
instantiating a new namespace, so mark them as __net_initdata, which means
they can be discarded when CONFIG_NET_NS=n.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/net/fib_rules.h |    2 +-
 net/core/fib_rules.c    |    2 +-
 net/decnet/dn_rules.c   |    2 +-
 net/ipv4/fib_rules.c    |    2 +-
 net/ipv4/ipmr.c         |    2 +-
 net/ipv6/fib6_rules.c   |    2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 52bd9e6..e8923bc 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -104,7 +104,7 @@ static inline u32 frh_get_table(struct fib_rule_hdr *frh, struct nlattr **nla)
 	return frh->table;
 }
 
-extern struct fib_rules_ops *fib_rules_register(struct fib_rules_ops *, struct net *);
+extern struct fib_rules_ops *fib_rules_register(const struct fib_rules_ops *, struct net *);
 extern void fib_rules_unregister(struct fib_rules_ops *);
 extern void                     fib_rules_cleanup_ops(struct fib_rules_ops *);
 
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 1bc6659..42e84e0 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -122,7 +122,7 @@ errout:
 }
 
 struct fib_rules_ops *
-fib_rules_register(struct fib_rules_ops *tmpl, struct net *net)
+fib_rules_register(const struct fib_rules_ops *tmpl, struct net *net)
 {
 	struct fib_rules_ops *ops;
 	int err;
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index af28dcc..1226bca 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -216,7 +216,7 @@ static void dn_fib_rule_flush_cache(struct fib_rules_ops *ops)
 	dn_rt_cache_flush(-1);
 }
 
-static struct fib_rules_ops dn_fib_rules_ops_template = {
+static const struct fib_rules_ops __net_initdata dn_fib_rules_ops_template = {
 	.family		= FIB_RULES_DECNET,
 	.rule_size	= sizeof(struct dn_fib_rule),
 	.addr_size	= sizeof(u16),
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 3ec84fe..8ab62a5 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -245,7 +245,7 @@ static void fib4_rule_flush_cache(struct fib_rules_ops *ops)
 	rt_cache_flush(ops->fro_net, -1);
 }
 
-static struct fib_rules_ops fib4_rules_ops_template = {
+static const struct fib_rules_ops __net_initdata fib4_rules_ops_template = {
 	.family		= FIB_RULES_IPV4,
 	.rule_size	= sizeof(struct fib4_rule),
 	.addr_size	= sizeof(u32),
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index a2df501..7d3e382 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -216,7 +216,7 @@ static int ipmr_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 	return 0;
 }
 
-static struct fib_rules_ops ipmr_rules_ops_template = {
+static const struct fib_rules_ops __net_initdata ipmr_rules_ops_template = {
 	.family		= FIB_RULES_IPMR,
 	.rule_size	= sizeof(struct ipmr_rule),
 	.addr_size	= sizeof(u32),
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 8124f16..35f6949 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -237,7 +237,7 @@ static size_t fib6_rule_nlmsg_payload(struct fib_rule *rule)
 	       + nla_total_size(16); /* src */
 }
 
-static struct fib_rules_ops fib6_rules_ops_template = {
+static const struct fib_rules_ops __net_initdata fib6_rules_ops_template = {
 	.family			= FIB_RULES_IPV6,
 	.rule_size		= sizeof(struct fib6_rule),
 	.addr_size		= sizeof(struct in6_addr),
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 3/3] net: ipmr: add support for dumping routing tables over netlink
From: Patrick, McHardy, kaber @ 2010-04-27 13:26 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>

From: Patrick McHardy <kaber@trash.net>

The ipmr /proc interface (ip_mr_cache) can't be extended to dump routes
from any tables but the main table in a backwards compatible fashion since
the output format ends in a variable amount of output interfaces.

Introduce a new netlink interface to dump multicast routes from all tables,
similar to the netlink interface for regular routes.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |   96 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 89 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 41e8fc0..eddfd12 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -128,8 +128,8 @@ static int ip_mr_forward(struct net *net, struct mr_table *mrt,
 			 int local);
 static int ipmr_cache_report(struct mr_table *mrt,
 			     struct sk_buff *pkt, vifi_t vifi, int assert);
-static int ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
-			    struct mfc_cache *c, struct rtmsg *rtm);
+static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
+			      struct mfc_cache *c, struct rtmsg *rtm);
 static void ipmr_expire_process(unsigned long arg);
 
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
@@ -831,7 +831,7 @@ static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt,
 		if (ip_hdr(skb)->version == 0) {
 			struct nlmsghdr *nlh = (struct nlmsghdr *)skb_pull(skb, sizeof(struct iphdr));
 
-			if (ipmr_fill_mroute(mrt, skb, c, NLMSG_DATA(nlh)) > 0) {
+			if (__ipmr_fill_mroute(mrt, skb, c, NLMSG_DATA(nlh)) > 0) {
 				nlh->nlmsg_len = (skb_tail_pointer(skb) -
 						  (u8 *)nlh);
 			} else {
@@ -1904,9 +1904,8 @@ drop:
 }
 #endif
 
-static int
-ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, struct mfc_cache *c,
-		 struct rtmsg *rtm)
+static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
+			      struct mfc_cache *c, struct rtmsg *rtm)
 {
 	int ct;
 	struct rtnexthop *nhp;
@@ -1994,11 +1993,93 @@ int ipmr_get_route(struct net *net,
 
 	if (!nowait && (rtm->rtm_flags&RTM_F_NOTIFY))
 		cache->mfc_flags |= MFC_NOTIFY;
-	err = ipmr_fill_mroute(mrt, skb, cache, rtm);
+	err = __ipmr_fill_mroute(mrt, skb, cache, rtm);
 	read_unlock(&mrt_lock);
 	return err;
 }
 
+static int ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
+			    u32 pid, u32 seq, struct mfc_cache *c)
+{
+	struct nlmsghdr *nlh;
+	struct rtmsg *rtm;
+
+	nlh = nlmsg_put(skb, pid, seq, RTM_NEWROUTE, sizeof(*rtm), NLM_F_MULTI);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	rtm = nlmsg_data(nlh);
+	rtm->rtm_family   = RTNL_FAMILY_IPMR;
+	rtm->rtm_dst_len  = 32;
+	rtm->rtm_src_len  = 32;
+	rtm->rtm_tos      = 0;
+	rtm->rtm_table    = mrt->id;
+	NLA_PUT_U32(skb, RTA_TABLE, mrt->id);
+	rtm->rtm_type     = RTN_MULTICAST;
+	rtm->rtm_scope    = RT_SCOPE_UNIVERSE;
+	rtm->rtm_protocol = RTPROT_UNSPEC;
+	rtm->rtm_flags    = 0;
+
+	NLA_PUT_BE32(skb, RTA_SRC, c->mfc_origin);
+	NLA_PUT_BE32(skb, RTA_DST, c->mfc_mcastgrp);
+
+	if (__ipmr_fill_mroute(mrt, skb, c, rtm) < 0)
+		goto nla_put_failure;
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	struct mr_table *mrt;
+	struct mfc_cache *mfc;
+	unsigned int t = 0, s_t;
+	unsigned int h = 0, s_h;
+	unsigned int e = 0, s_e;
+
+	s_t = cb->args[0];
+	s_h = cb->args[1];
+	s_e = cb->args[2];
+
+	read_lock(&mrt_lock);
+	ipmr_for_each_table(mrt, net) {
+		if (t < s_t)
+			goto next_table;
+		if (t > s_t)
+			s_h = 0;
+		for (h = s_h; h < MFC_LINES; h++) {
+			list_for_each_entry(mfc, &mrt->mfc_cache_array[h], list) {
+				if (e < s_e)
+					goto next_entry;
+				if (ipmr_fill_mroute(mrt, skb,
+						     NETLINK_CB(cb->skb).pid,
+						     cb->nlh->nlmsg_seq,
+						     mfc) < 0)
+					goto done;
+next_entry:
+				e++;
+			}
+			e = s_e = 0;
+		}
+		s_h = 0;
+next_table:
+		t++;
+	}
+done:
+	read_unlock(&mrt_lock);
+
+	cb->args[2] = e;
+	cb->args[1] = h;
+	cb->args[0] = t;
+
+	return skb->len;
+}
+
 #ifdef CONFIG_PROC_FS
 /*
  *	The /proc interfaces to multicast routing /proc/ip_mr_cache /proc/ip_mr_vif
@@ -2355,6 +2436,7 @@ int __init ip_mr_init(void)
 		goto add_proto_fail;
 	}
 #endif
+	rtnl_register(RTNL_FAMILY_IPMR, RTM_GETROUTE, NULL, ipmr_rtm_dumproute);
 	return 0;
 
 #ifdef CONFIG_IP_PIMSM_V2
-- 
1.7.0.4


^ permalink raw reply related

* [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Jeff Kirsher @ 2010-04-27 13:33 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Matthew Garret, Bruce Allan, Jeff Kirsher

From: Bruce Allan <bruce.w.allan@intel.com>

Prompted by a previous patch submitted by Matthew Garret <mjg@redhat.com>,
further digging into errata documentation reveals the current enabling or
disabling of ASPM L0s and L1 states for certain parts supported by this
driver are incorrect.  82571 and 82572 should always disable L1.  For
standard frames, 82573/82574/82583 can enable L1 but L0s must be disabled,
and for jumbo frames 82573/82574 must disable L1.  This allows for some
parts to enable L1 in certain configurations leading to better power
savings.

Also according to the same errata, Early Receive (ERT) should be disabled
on 82573 when using jumbo frames.

Cc: Matthew Garret <mjg@redhat.com>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/e1000e/82571.c  |   20 ++++++------
 drivers/net/e1000e/e1000.h  |    5 ++-
 drivers/net/e1000e/netdev.c |   70 ++++++++++++++++++++++++++-----------------
 3 files changed, 55 insertions(+), 40 deletions(-)

diff --git a/drivers/net/e1000e/82571.c b/drivers/net/e1000e/82571.c
index 712ccc6..9015555 100644
--- a/drivers/net/e1000e/82571.c
+++ b/drivers/net/e1000e/82571.c
@@ -336,7 +336,6 @@ static s32 e1000_get_variants_82571(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	static int global_quad_port_a; /* global port a indication */
 	struct pci_dev *pdev = adapter->pdev;
-	u16 eeprom_data = 0;
 	int is_port_b = er32(STATUS) & E1000_STATUS_FUNC_1;
 	s32 rc;
 
@@ -387,16 +386,15 @@ static s32 e1000_get_variants_82571(struct e1000_adapter *adapter)
 		if (pdev->device == E1000_DEV_ID_82571EB_SERDES_QUAD)
 			adapter->flags &= ~FLAG_HAS_WOL;
 		break;
-
 	case e1000_82573:
+	case e1000_82574:
+	case e1000_82583:
+		/* Disable ASPM L0s due to hardware errata */
+		e1000e_disable_aspm(adapter->pdev, PCIE_LINK_STATE_L0S);
+
 		if (pdev->device == E1000_DEV_ID_82573L) {
-			if (e1000_read_nvm(&adapter->hw, NVM_INIT_3GIO_3, 1,
-				       &eeprom_data) < 0)
-				break;
-			if (!(eeprom_data & NVM_WORD1A_ASPM_MASK)) {
-				adapter->flags |= FLAG_HAS_JUMBO_FRAMES;
-				adapter->max_hw_frame_size = DEFAULT_JUMBO;
-			}
+			adapter->flags |= FLAG_HAS_JUMBO_FRAMES;
+			adapter->max_hw_frame_size = DEFAULT_JUMBO;
 		}
 		break;
 	default:
@@ -1792,6 +1790,7 @@ struct e1000_info e1000_82571_info = {
 				  | FLAG_RESET_OVERWRITES_LAA /* errata */
 				  | FLAG_TARC_SPEED_MODE_BIT /* errata */
 				  | FLAG_APME_CHECK_PORT_B,
+	.flags2			= FLAG2_DISABLE_ASPM_L1, /* errata 13 */
 	.pba			= 38,
 	.max_hw_frame_size	= DEFAULT_JUMBO,
 	.get_variants		= e1000_get_variants_82571,
@@ -1809,6 +1808,7 @@ struct e1000_info e1000_82572_info = {
 				  | FLAG_RX_CSUM_ENABLED
 				  | FLAG_HAS_CTRLEXT_ON_LOAD
 				  | FLAG_TARC_SPEED_MODE_BIT, /* errata */
+	.flags2			= FLAG2_DISABLE_ASPM_L1, /* errata 13 */
 	.pba			= 38,
 	.max_hw_frame_size	= DEFAULT_JUMBO,
 	.get_variants		= e1000_get_variants_82571,
@@ -1820,13 +1820,11 @@ struct e1000_info e1000_82572_info = {
 struct e1000_info e1000_82573_info = {
 	.mac			= e1000_82573,
 	.flags			= FLAG_HAS_HW_VLAN_FILTER
-				  | FLAG_HAS_JUMBO_FRAMES
 				  | FLAG_HAS_WOL
 				  | FLAG_APME_IN_CTRL3
 				  | FLAG_RX_CSUM_ENABLED
 				  | FLAG_HAS_SMART_POWER_DOWN
 				  | FLAG_HAS_AMT
-				  | FLAG_HAS_ERT
 				  | FLAG_HAS_SWSM_ON_LOAD,
 	.pba			= 20,
 	.max_hw_frame_size	= ETH_FRAME_LEN + ETH_FCS_LEN,
diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
index 118bdf4..ee32b9b 100644
--- a/drivers/net/e1000e/e1000.h
+++ b/drivers/net/e1000e/e1000.h
@@ -37,6 +37,7 @@
 #include <linux/io.h>
 #include <linux/netdevice.h>
 #include <linux/pci.h>
+#include <linux/pci-aspm.h>
 
 #include "hw.h"
 
@@ -374,7 +375,7 @@ struct e1000_adapter {
 struct e1000_info {
 	enum e1000_mac_type	mac;
 	unsigned int		flags;
-	unsigned int            flags2;
+	unsigned int		flags2;
 	u32			pba;
 	u32			max_hw_frame_size;
 	s32			(*get_variants)(struct e1000_adapter *);
@@ -421,6 +422,7 @@ struct e1000_info {
 #define FLAG2_CRC_STRIPPING               (1 << 0)
 #define FLAG2_HAS_PHY_WAKEUP              (1 << 1)
 #define FLAG2_IS_DISCARDING               (1 << 2)
+#define FLAG2_DISABLE_ASPM_L1             (1 << 3)
 
 #define E1000_RX_DESC_PS(R, i)	    \
 	(&(((union e1000_rx_desc_packet_split *)((R).desc))[i]))
@@ -461,6 +463,7 @@ extern void e1000e_update_stats(struct e1000_adapter *adapter);
 extern bool e1000e_has_link(struct e1000_adapter *adapter);
 extern void e1000e_set_interrupt_capability(struct e1000_adapter *adapter);
 extern void e1000e_reset_interrupt_capability(struct e1000_adapter *adapter);
+extern void e1000e_disable_aspm(struct pci_dev *pdev, u16 state);
 
 extern unsigned int copybreak;
 
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 73d43c5..fb8fc7d 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4283,6 +4283,14 @@ static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
 		return -EINVAL;
 	}
 
+	/* 82573 Errata 17 */
+	if (((adapter->hw.mac.type == e1000_82573) ||
+	     (adapter->hw.mac.type == e1000_82574)) &&
+	    (max_frame > ETH_FRAME_LEN + ETH_FCS_LEN)) {
+		adapter->flags2 |= FLAG2_DISABLE_ASPM_L1;
+		e1000e_disable_aspm(adapter->pdev, PCIE_LINK_STATE_L1);
+	}
+
 	while (test_and_set_bit(__E1000_RESETTING, &adapter->state))
 		msleep(1);
 	/* e1000e_down -> e1000e_reset dependent on max_frame_size & mtu */
@@ -4605,29 +4613,39 @@ static void e1000_complete_shutdown(struct pci_dev *pdev, bool sleep,
 	}
 }
 
-static void e1000e_disable_l1aspm(struct pci_dev *pdev)
+#ifdef CONFIG_PCIEASPM
+static void __e1000e_disable_aspm(struct pci_dev *pdev, u16 state)
+{
+	pci_disable_link_state(pdev, state);
+}
+#else
+static void __e1000e_disable_aspm(struct pci_dev *pdev, u16 state)
 {
 	int pos;
-	u16 val;
+	u16 reg16;
 
 	/*
-	 * 82573 workaround - disable L1 ASPM on mobile chipsets
-	 *
-	 * L1 ASPM on various mobile (ich7) chipsets do not behave properly
-	 * resulting in lost data or garbage information on the pci-e link
-	 * level. This could result in (false) bad EEPROM checksum errors,
-	 * long ping times (up to 2s) or even a system freeze/hang.
-	 *
-	 * Unfortunately this feature saves about 1W power consumption when
-	 * active.
+	 * Both device and parent should have the same ASPM setting.
+	 * Disable ASPM in downstream component first and then upstream.
 	 */
-	pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	pci_read_config_word(pdev, pos + PCI_EXP_LNKCTL, &val);
-	if (val & 0x2) {
-		dev_warn(&pdev->dev, "Disabling L1 ASPM\n");
-		val &= ~0x2;
-		pci_write_config_word(pdev, pos + PCI_EXP_LNKCTL, val);
-	}
+	pos = pci_pcie_cap(pdev);
+	pci_read_config_word(pdev, pos + PCI_EXP_LNKCTL, &reg16);
+	reg16 &= ~state;
+	pci_write_config_word(pdev, pos + PCI_EXP_LNKCTL, reg16);
+
+	pos = pci_pcie_cap(pdev->bus->self);
+	pci_read_config_word(pdev->bus->self, pos + PCI_EXP_LNKCTL, &reg16);
+	reg16 &= ~state;
+	pci_write_config_word(pdev->bus->self, pos + PCI_EXP_LNKCTL, reg16);
+}
+#endif
+void e1000e_disable_aspm(struct pci_dev *pdev, u16 state)
+{
+	dev_info(&pdev->dev, "Disabling ASPM %s %s\n",
+		 (state & PCIE_LINK_STATE_L0S) ? "L0s" : "",
+		 (state & PCIE_LINK_STATE_L1) ? "L1" : "");
+
+	__e1000e_disable_aspm(pdev, state);
 }
 
 #ifdef CONFIG_PM
@@ -4653,7 +4671,8 @@ static int e1000_resume(struct pci_dev *pdev)
 	pci_set_power_state(pdev, PCI_D0);
 	pci_restore_state(pdev);
 	pci_save_state(pdev);
-	e1000e_disable_l1aspm(pdev);
+	if (adapter->flags2 & FLAG2_DISABLE_ASPM_L1)
+		e1000e_disable_aspm(pdev, PCIE_LINK_STATE_L1);
 
 	err = pci_enable_device_mem(pdev);
 	if (err) {
@@ -4795,7 +4814,8 @@ static pci_ers_result_t e1000_io_slot_reset(struct pci_dev *pdev)
 	int err;
 	pci_ers_result_t result;
 
-	e1000e_disable_l1aspm(pdev);
+	if (adapter->flags2 & FLAG2_DISABLE_ASPM_L1)
+		e1000e_disable_aspm(pdev, PCIE_LINK_STATE_L1);
 	err = pci_enable_device_mem(pdev);
 	if (err) {
 		dev_err(&pdev->dev,
@@ -4889,13 +4909,6 @@ static void e1000_eeprom_checks(struct e1000_adapter *adapter)
 		dev_warn(&adapter->pdev->dev,
 			 "Warning: detected DSPD enabled in EEPROM\n");
 	}
-
-	ret_val = e1000_read_nvm(hw, NVM_INIT_3GIO_3, 1, &buf);
-	if (!ret_val && (le16_to_cpu(buf) & (3 << 2))) {
-		/* ASPM enable */
-		dev_warn(&adapter->pdev->dev,
-			 "Warning: detected ASPM enabled in EEPROM\n");
-	}
 }
 
 static const struct net_device_ops e1000e_netdev_ops = {
@@ -4944,7 +4957,8 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
 	u16 eeprom_data = 0;
 	u16 eeprom_apme_mask = E1000_EEPROM_APME;
 
-	e1000e_disable_l1aspm(pdev);
+	if (ei->flags2 & FLAG2_DISABLE_ASPM_L1)
+		e1000e_disable_aspm(pdev, PCIE_LINK_STATE_L1);
 
 	err = pci_enable_device_mem(pdev);
 	if (err)


^ permalink raw reply related

* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 13:37 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev, rick.jones2
In-Reply-To: <20100426.110432.104061817.davem@davemloft.net>

David Miller wrote:
> How damn hard is it to add two 16-bit ports to the hash regardless of
> protocol?
>   
Come to think of it, for UDP the hash must ignore
the srcport and srcaddr, because a single bound
socket is going to wildcard both those fields.

So the best we can hope for is for the hash to
include destport and destaddr? From looking at
my BCM5709's multiq behavior, I think broadcom's
hash includes the destaddr but not the destport.


^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-27 13:35 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100427102025.GA6471@riccoc20.at.omicron.at>

Hi Richard,

Richard Cochran wrote:
> BTW,
> 
> To show how the API works from the user space, I also patches For the
> popular ptpd program to that project.
> 
>    https://sourceforge.net/tracker/?func=detail&aid=2992845&group_id=139814&atid=744634

Cool stuff, I'm giving it a try on my MPC8313 PTP setup.

Thanks.

Wolfgang.

^ permalink raw reply

* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Matthew Garrett @ 2010-04-27 13:43 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, Bruce Allan
In-Reply-To: <20100427133232.25490.92973.stgit@localhost.localdomain>

On Tue, Apr 27, 2010 at 06:33:04AM -0700, Jeff Kirsher wrote:
> From: Bruce Allan <bruce.w.allan@intel.com>
> 
> Prompted by a previous patch submitted by Matthew Garret <mjg@redhat.com>,
> further digging into errata documentation reveals the current enabling or
> disabling of ASPM L0s and L1 states for certain parts supported by this
> driver are incorrect.  82571 and 82572 should always disable L1.  For
> standard frames, 82573/82574/82583 can enable L1 but L0s must be disabled,
> and for jumbo frames 82573/82574 must disable L1.  This allows for some
> parts to enable L1 in certain configurations leading to better power
> savings.
> 
> Also according to the same errata, Early Receive (ERT) should be disabled
> on 82573 when using jumbo frames.

Looks good. Thanks for digging into this!

-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 14:30 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <4BD6E887.3000804@athenacr.com>

Le mardi 27 avril 2010 à 09:37 -0400, Brian Bloniarz a écrit :
> David Miller wrote:
> > How damn hard is it to add two 16-bit ports to the hash regardless of
> > protocol?
> >   
> Come to think of it, for UDP the hash must ignore
> the srcport and srcaddr, because a single bound
> socket is going to wildcard both those fields.
> 

For your application maybe ;)

Here, I have thousand of RTP flows to big mediagateways, so the
(srcaddr, dstaddr) is shared by all these flows.

Tom Herbert also wants a threaded DNS server.

For UDP, we could have a bitmap (system level ?) to say if a particular
destination port wants multi-cpu (RPS) spreading or not, even if the NIC
decided to use a single queue to submit frames to the host (ie mask the
src_addr and src_port). In this case, RFS on non conected UDP sockets
could be activated as well.

Or just a global sysctl to be able to mask the src_addr and/or src_port
in our software rxhash.

> So the best we can hope for is for the hash to
> include destport and destaddr? From looking at
> my BCM5709's multiq behavior, I think broadcom's
> hash includes the destaddr but not the destport.

Toepliz hash for UDP is a hash(src_addr, dst_addr)

Both src port and dst port are ignored.

^ permalink raw reply

* Re: [patch v2] sctp: cleanup: remove duplicate assignment
From: Vlad Yasevich @ 2010-04-27 14:32 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Sridhar Samudrala, David S. Miller, Wei Yongjun, Chris Dischino,
	linux-sctp, netdev, kernel-janitors
In-Reply-To: <20100424171805.GO29093@bicker>



Dan Carpenter wrote:
> This assignment isn't needed because we did it earlier already.
> 
> Also another reason to delete the assignment is because it triggers a
> Smatch warning about checking for NULL pointers after a dereference.
> 
> Reported-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Thanks.  I'll take this one.

-vlad

> ---
> Thanks Vlad.  I came so close to seeing that myself if only I had openned
> my eyes a tiny bit more.  :P
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 17cb400..33aed1c 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -419,10 +419,17 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
>  	if (!retval)
>  		goto nomem_chunk;
>  
> -	/* Per the advice in RFC 2960 6.4, send this reply to
> -	 * the source of the INIT packet.
> +	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
> +	 *
> +	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
> +	 * HEARTBEAT ACK, * etc.) to the same destination transport
> +	 * address from which it received the DATA or control chunk
> +	 * to which it is replying.
> +	 *
> +	 * [INIT ACK back to where the INIT came from.]
>  	 */
>  	retval->transport = chunk->transport;
> +
>  	retval->subh.init_hdr =
>  		sctp_addto_chunk(retval, sizeof(initack), &initack);
>  	retval->param_hdr.v = sctp_addto_chunk(retval, addrs_len, addrs.v);
> @@ -461,18 +468,6 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
>  	/* We need to remove the const qualifier at this point.  */
>  	retval->asoc = (struct sctp_association *) asoc;
>  
> -	/* RFC 2960 6.4 Multi-homed SCTP Endpoints
> -	 *
> -	 * An endpoint SHOULD transmit reply chunks (e.g., SACK,
> -	 * HEARTBEAT ACK, * etc.) to the same destination transport
> -	 * address from which it received the DATA or control chunk
> -	 * to which it is replying.
> -	 *
> -	 * [INIT ACK back to where the INIT came from.]
> -	 */
> -	if (chunk)
> -		retval->transport = chunk->transport;
> -
>  nomem_chunk:
>  	kfree(cookie);
>  nomem_cookie:
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: linux-next: manual merge of the staging-next tree with the net tree
From: Greg KH @ 2010-04-27 15:13 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-next, linux-kernel, Jiri Pirko, David Miller, netdev,
	John Sheehan
In-Reply-To: <20100427142655.49673ff1.sfr@canb.auug.org.au>

On Tue, Apr 27, 2010 at 02:26:55PM +1000, Stephen Rothwell wrote:
> Hi Greg,
> 
> Today's linux-next merge of the staging-next tree got a conflict in
> drivers/staging/arlan/arlan-main.c between commit
> 22bedad3ce112d5ca1eaf043d4990fa2ed698c87 ("net: convert multicast list to
> list_head") from the net tree and commit
> 8db2022b08e232bb237b2162f03ff5f6e7c0c35e ("staging: arlan: fix errors
> reported by checkpatch.pl tool") from the staging-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary.

That looks fine, thanks.

greg k-h

^ permalink raw reply

* Re: linux-next: manual merge of the tty tree with the net tree
From: Greg KH @ 2010-04-27 15:14 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-next, linux-kernel, Sjur Braendeland, David Miller, netdev,
	Pavan Savoy
In-Reply-To: <20100427135706.259fa9c9.sfr@canb.auug.org.au>

On Tue, Apr 27, 2010 at 01:57:06PM +1000, Stephen Rothwell wrote:
> Hi Greg,
> 
> Today's linux-next merge of the tty tree got a conflict in
> include/linux/tty.h between commit
> 9b27105b4a44c54bf91ecd7d0315034ae75684f7 ("net-caif-driver: add CAIF
> serial driver (ldisc)") from the net tree and commit
> f3150d54db0e0caa1871f66b2424f463a34a7e7b ("serial: TTY: new ldiscs for
> staging") from the tty tree.
> 
> I fixed it up (they both changed NR_LDISCS - I used the higher number) and
> can carry the fix as necessary.

Thanks, the higher number should be what is needed here.

greg k-h

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Brian Bloniarz @ 2010-04-27 15:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <1272378659.2295.193.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 27 avril 2010 à 09:37 -0400, Brian Bloniarz a écrit :
>> David Miller wrote:
>>> How damn hard is it to add two 16-bit ports to the hash regardless of
>>> protocol?
>>>   
>> Come to think of it, for UDP the hash must ignore
>> the srcport and srcaddr, because a single bound
>> socket is going to wildcard both those fields.
>>
> 
> For your application maybe ;)
> 
> Here, I have thousand of RTP flows to big mediagateways, so the
> (srcaddr, dstaddr) is shared by all these flows.
> 
> Tom Herbert also wants a threaded DNS server.
> 
> For UDP, we could have a bitmap (system level ?) to say if a particular
> destination port wants multi-cpu (RPS) spreading or not, even if the NIC
> decided to use a single queue to submit frames to the host (ie mask the
> src_addr and src_port). In this case, RFS on non conected UDP sockets
> could be activated as well.
> 
> Or just a global sysctl to be able to mask the src_addr and/or src_port
> in our software rxhash.

This is all good to know. Here are 3 other alternatives/suggestions:

1) Leave things as they are, if it really hurts that badly
for our workload we can just disable RPS entirely.
2) Add a global sysctl to disable RPS for datagram-based protocols.
3) Pluggable SW rxhashes :)

I haven't benchmarked anything for our workloads yet, so
I can't say for sure it's even a big issue.

Has anyone benchmarked RPS + single-threaded DNS servers? It'd
be a pessimization in that case, right? Even the multi-threaded
DNS server wouldn't be workable unless there was something like
the soreuseport patch you & Tom had been been discussing.

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-27 15:50 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, herbert, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.185400.71096585.davem@davemloft.net>

On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.
> 
> Just doing dst_release() doesn't mark the 'dst' object as obsolete,
> and therefore it won't get force garbage collected.

Sure. So If I understand it correctly, there are two problems:

- the reference taken by dst_hold() just above the ip6_del_rt()
  is never dropped if ip6_del_rt() fails; so shouldn't the code
  be like this?:

	dst_hold(&ifp->rt->u.dst);
	if (ip6_del_rt(ifp->rt)) {
		dst_release(&ifp->rt->u.dst);
		dst_free(&ifp->rt->u.dst);
	}

- if ip6_del_rt() fails because it races with something else
  deleting the address, dst_free() will be called twice. This is
  what Herbert is fixing with additional locking. However -- even
  when he fixes that -- how can ip6_del_rt() fail with the
  ifp->rt still needing a dst_free()?
  AFAICS, it can fail by:
  	- __ip6_del_rt() finding that (rt ==
	  net->ipv6.ip6_null_entry); we don't want to call
	  dst_free() on net->ipv6.ip6_null_entry, do we?

	- fib6_del() returning -ENOENT for multiple reasons;
	  but doesn't that mean that something else has removed
	  the route already and called dst_free on it?

  In either case, the dst_free() looks like not being needed. And it only
  does no harm in most cases, because these events are rare and
  it usually finds (obsolete > 2) and does nothing.

  I think that what Herbert is doing is only going to enforce that
  the ip6_del_rt() is never going to fail, so the dst_free()
  won't ever be called anyway, right?

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-27 15:55 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, yoshfuji, netdev, shemminger
In-Reply-To: <20100427155034.GA11157@midget.suse.cz>

On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
>
>   I think that what Herbert is doing is only going to enforce that
>   the ip6_del_rt() is never going to fail, so the dst_free()
>   won't ever be called anyway, right?

Yes I think you're right.  But let's fix the bigger problem first,
and then we can audit this and possibly turn it into a WARN_ON.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-27 16:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004271425140.2951@localhost.localdomain>

On Tue, 27 Apr 2010, Thomas Gleixner wrote:

> On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
>> +/**
>> + * struct irqaffinityhint - per interrupt affinity helper
>> + * @callback:	device driver callback function
>> + * @dev:	reference for the affected device
>> + * @irq:	interrupt number
>> + */
>> +struct irqaffinityhint {
>> +	irq_affinity_hint_t callback;
>> +	void *dev;
>> +	int irq;
>> +};
>
> Why do you need that extra data structure ? The device and the irq
> number are known, so all you need is the callback itself. So no need
> for allocating memory ....

When I register the function callback with the interrupt layer, I need to 
know what device structures to reference back in the driver.  In other 
words, if I call into an underlying driver with just an interrupt number, 
then I have no way at getting at the dev structures (netdevice for me, 
plus my private adapter structures), unless I declare them globally 
(yuck).

I had a different approach before this one where I assumed the device from 
the irq handler callback was safe to use for the device in this new 
callback.  I didn't feel really great about that, since it's an implicit 
assumption that could cause things to go sideways really quickly.

Let me know what you think either way.  I'm certainly willing to make a 
change, I just don't know at this point what's the safest approach from 
what I currently have.

>
>> +static ssize_t irq_affinity_hint_proc_write(struct file *file,
>> +		const char __user *buffer, size_t count, loff_t *pos)
>> +{
>> +	/* affinity_hint is read-only from proc */
>> +	return -EOPNOTSUPP;
>> +}
>> +
>
> Why do you want a write function when the file is read only ?

It's leftover paranoia.  I put it in early on, then changed the mode 
later.  I can remove this function.  I'll re-send something once we agree 
on how the code in your first comment should look.

Thanks Thomas!

-PJ

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: Paul E. McKenney @ 2010-04-27 16:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1272350443.4861.9.camel@edumazet-laptop>

On Tue, Apr 27, 2010 at 08:40:43AM +0200, Eric Dumazet wrote:
> __sk_dst_set() might be called while no state can be integrated in a
> rcu_dereference_check() condition.
> 
> So use rcu_dereference_raw() to shutup lockdep warnings (if
> CONFIG_PROVE_RCU is set)

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 86a8ca1..94dbdbc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1236,8 +1236,11 @@ __sk_dst_set(struct sock *sk, struct dst_entry *dst)
>  	struct dst_entry *old_dst;
> 
>  	sk_tx_queue_clear(sk);
> -	old_dst = rcu_dereference_check(sk->sk_dst_cache,
> -					lockdep_is_held(&sk->sk_dst_lock));
> +	/*
> +	 * This can be called while sk is owned by the caller only,
> +	 * with no state that can be checked in a rcu_dereference_check() cond
> +	 */
> +	old_dst = rcu_dereference_raw(sk->sk_dst_cache);
>  	rcu_assign_pointer(sk->sk_dst_cache, dst);
>  	dst_release(old_dst);
>  }
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox