Netdev List

* [net-next 6/7] ixgbe: Make the bridge mode setting sticky
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Greg Rose, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Greg Rose <gregory.v.rose@intel.com>

The internal bridge mode setting needs to be sticky so that it can be
configured correctly after a device reset.  This change is required now
that the driver supports setting the bridge mode to VEB or VEPA.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <Sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 12 ++++++++----
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |  1 +
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 7ff4c4f..8e78676 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -483,6 +483,7 @@ struct ixgbe_adapter {
 #define IXGBE_FLAG2_RSS_FIELD_IPV6_UDP		(u32)(1 << 9)
 #define IXGBE_FLAG2_PTP_ENABLED			(u32)(1 << 10)
 #define IXGBE_FLAG2_PTP_PPS_ENABLED		(u32)(1 << 11)
+#define IXGBE_FLAG2_BRIDGE_MODE_VEB		(u32)(1 << 12)
 
 	/* Tx fast path data */
 	int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fc8cfad..fee0f8c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3247,6 +3247,8 @@ static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter)
 	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset ^ 1), reg_offset - 1);
 	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), (~0) << vf_shift);
 	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset ^ 1), reg_offset - 1);
+	if (adapter->flags2 & IXGBE_FLAG2_BRIDGE_MODE_VEB)
+		IXGBE_WRITE_REG(hw, IXGBE_PFDTXGSWC, IXGBE_PFDTXGSWC_VT_LBEN);
 
 	/* Map PF MAC address in RAR Entry 0 to first pool following VFs */
 	hw->mac.ops.set_vmdq(hw, 0, VMDQ_P(0));
@@ -7039,11 +7041,13 @@ static int ixgbe_ndo_bridge_setlink(struct net_device *dev,
 			continue;
 
 		mode = nla_get_u16(attr);
-		if (mode == BRIDGE_MODE_VEPA)
+		if (mode == BRIDGE_MODE_VEPA) {
 			reg = 0;
-		else if (mode == BRIDGE_MODE_VEB)
+			adapter->flags2 &= ~IXGBE_FLAG2_BRIDGE_MODE_VEB;
+		} else if (mode == BRIDGE_MODE_VEB) {
 			reg = IXGBE_PFDTXGSWC_VT_LBEN;
-		else
+			adapter->flags2 |= IXGBE_FLAG2_BRIDGE_MODE_VEB;
+		} else
 			return -EINVAL;
 
 		IXGBE_WRITE_REG(&adapter->hw, IXGBE_PFDTXGSWC, reg);
@@ -7064,7 +7068,7 @@ static int ixgbe_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 	if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
 		return 0;
 
-	if (IXGBE_READ_REG(&adapter->hw, IXGBE_PFDTXGSWC) & 1)
+	if (adapter->flags2 & IXGBE_FLAG2_BRIDGE_MODE_VEB)
 		mode = BRIDGE_MODE_VEB;
 	else
 		mode = BRIDGE_MODE_VEPA;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 4993642..85cddac 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -119,6 +119,7 @@ void ixgbe_enable_sriov(struct ixgbe_adapter *adapter,
 
 	/* Initialize default switching mode VEB */
 	IXGBE_WRITE_REG(hw, IXGBE_PFDTXGSWC, IXGBE_PFDTXGSWC_VT_LBEN);
+	adapter->flags2 |= IXGBE_FLAG2_BRIDGE_MODE_VEB;
 
 	/* If call to enable VFs succeeded then allocate memory
 	 * for per VF control structures.
-- 
1.7.11.7
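
A minimal userspace sketch of the "sticky" idea in the patch above, assuming a toy adapter in which a plain integer stands in for the PFDTXGSWC register. All toy_* and TOY_* names are hypothetical and are not part of the ixgbe driver; only the flag bit position is taken from the patch. The point is that the mode lives in software state, so it can be written back to the hardware after a reset clears the register.

```c
#include <stdint.h>

#define TOY_FLAG2_BRIDGE_MODE_VEB (1u << 12) /* same bit position as the patch */
#define TOY_VT_LBEN               0x1u       /* hypothetical loopback-enable bit */

enum toy_mode { TOY_MODE_VEB, TOY_MODE_VEPA };

struct toy_adapter {
	uint32_t flags2;    /* software state, survives a reset */
	uint32_t pfdtxgswc; /* stand-in for the hardware register */
};

/* Record the mode in software and program the register, as setlink does. */
static void toy_set_mode(struct toy_adapter *a, enum toy_mode mode)
{
	if (mode == TOY_MODE_VEB) {
		a->flags2 |= TOY_FLAG2_BRIDGE_MODE_VEB;
		a->pfdtxgswc = TOY_VT_LBEN;
	} else {
		a->flags2 &= ~TOY_FLAG2_BRIDGE_MODE_VEB;
		a->pfdtxgswc = 0;
	}
}

/* A device reset wipes the register but leaves the software flag alone. */
static void toy_reset(struct toy_adapter *a)
{
	a->pfdtxgswc = 0;
}

/* Reconfiguration after reset restores the register from the flag. */
static void toy_configure_virtualization(struct toy_adapter *a)
{
	if (a->flags2 & TOY_FLAG2_BRIDGE_MODE_VEB)
		a->pfdtxgswc = TOY_VT_LBEN;
}

/* Report the mode from the flag rather than from the register. */
static enum toy_mode toy_get_mode(const struct toy_adapter *a)
{
	return (a->flags2 & TOY_FLAG2_BRIDGE_MODE_VEB) ? TOY_MODE_VEB
						       : TOY_MODE_VEPA;
}
```

Reading the mode back from the flag, as the getlink change does, keeps the reported mode consistent even in the window between a reset and the reconfiguration.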

* [net-next 7/7] ixgbe: bump version number
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version string so that it better reflects how the driver's
functionality lines up with that of the out-of-tree driver.  Also, since
the MAJ, MIN and BUILD defines are no longer needed, remove them to clean
up the code.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fee0f8c..484bbed 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -63,11 +63,7 @@ char ixgbe_default_device_descr[] =
 static char ixgbe_default_device_descr[] =
 			      "Intel(R) 10 Gigabit Network Connection";
 #endif
-#define MAJ 3
-#define MIN 9
-#define BUILD 15
-#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
-	__stringify(BUILD) "-k"
+#define DRV_VERSION "3.11.33-k"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2012 Intel Corporation.";
-- 
1.7.11.7
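
For readers unfamiliar with the construct being removed, here is a small standalone sketch of how a version string can be assembled from numeric defines by two-level stringification. The TOY_* macros are hypothetical stand-ins written for this example; they imitate the style of the kernel's __stringify() but are not its definition.

```c
#include <string.h>

/* Two levels are needed so that the argument is macro-expanded first. */
#define TOY_STRINGIFY_1(x) #x
#define TOY_STRINGIFY(x)   TOY_STRINGIFY_1(x)

#define TOY_MAJ   3
#define TOY_MIN   9
#define TOY_BUILD 15

/* Adjacent string literals are concatenated by the compiler. */
#define TOY_OLD_VERSION TOY_STRINGIFY(TOY_MAJ) "." TOY_STRINGIFY(TOY_MIN) "." \
	TOY_STRINGIFY(TOY_BUILD) "-k"
#define TOY_NEW_VERSION "3.11.33-k"

static const char toy_old_version[] = TOY_OLD_VERSION;
static const char toy_new_version[] = TOY_NEW_VERSION;

/* Returns non-zero when the two version strings differ. */
static int toy_version_changed(void)
{
	return strcmp(toy_old_version, toy_new_version) != 0;
}
```

With nothing else consuming the numeric parts, a single literal is shorter and easier to grep for, which matches the cleanup described in the commit message.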

* [net-next 5/7] ixgbe: Fix incorrect disabling of Tx hang check in case of PFC
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Neerav Parikh, netdev, gospo, sassmann, Neerav Parikh,
	Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Neerav Parikh <neerav.parikh@intel.com>

The XOFF received statistics registers are per priority, not per traffic
class. The ixgbe driver incorrectly treated them as per traffic class and
disabled the "Tx hang" check for the queues that belonged to the traffic
class whose index matched the priority that had received PFC frames.

That logic worked when user priority and traffic class numbers matched,
e.g. priority 0 mapped to traffic class 0 and so on. But when multiple
user priorities are mapped to a single traffic class, or when user
priorities and traffic class numbers do not line up, the ixgbe driver may
disable the "Tx hang" check for queues belonging to a traffic class that
did not receive PFC frames and keep it enabled for the queues that did
receive them.

This patch fixes that by reading the statistics on a per-priority basis,
looking up the traffic class each user priority belongs to, and disabling
the "Tx hang" check for the queues that belong to that traffic class.

Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Marcus Dennis <marcusx.e.dennis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e6e1245..fc8cfad 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -703,6 +703,7 @@ static void ixgbe_update_xoff_received(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	struct ixgbe_hw_stats *hwstats = &adapter->stats;
 	u32 xoff[8] = {0};
+	u8 tc;
 	int i;
 	bool pfc_en = adapter->dcb_cfg.pfc_mode_enable;

@@ -716,21 +717,26 @@ static void ixgbe_update_xoff_received(struct ixgbe_adapter *adapter)

 	/* update stats for each tc, only valid with PFC enabled */
 	for (i = 0; i < MAX_TX_PACKET_BUFFERS; i++) {
+		u32 pxoffrxc;
+
 		switch (hw->mac.type) {
 		case ixgbe_mac_82598EB:
-			xoff[i] = IXGBE_READ_REG(hw, IXGBE_PXOFFRXC(i));
+			pxoffrxc = IXGBE_READ_REG(hw, IXGBE_PXOFFRXC(i));
 			break;
 		default:
-			xoff[i] = IXGBE_READ_REG(hw, IXGBE_PXOFFRXCNT(i));
+			pxoffrxc = IXGBE_READ_REG(hw, IXGBE_PXOFFRXCNT(i));
 		}
-		hwstats->pxoffrxc[i] += xoff[i];
+		hwstats->pxoffrxc[i] += pxoffrxc;
+		/* Get the TC for given UP */
+		tc = netdev_get_prio_tc_map(adapter->netdev, i);
+		xoff[tc] += pxoffrxc;
 	}

 	/* disarm tx queues that have received xoff frames */
 	for (i = 0; i < adapter->num_tx_queues; i++) {
 		struct ixgbe_ring *tx_ring = adapter->tx_ring[i];
-		u8 tc = tx_ring->dcb_tc;

+		tc = tx_ring->dcb_tc;
 		if (xoff[tc])
 			clear_bit(__IXGBE_HANG_CHECK_ARMED, &tx_ring->state);
 	}
-- 
1.7.11.7
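
The aggregation described in the commit message can be sketched in plain C as follows. This is a simplified model, not driver code: the priority-to-TC table is passed in as an array (the driver obtains it from netdev_get_prio_tc_map()), and the toy_* names and TOY_* constants are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_PRIO 8
#define TOY_NUM_TC   8

/*
 * Fold per-priority XOFF counts into per-traffic-class totals.
 * prio_to_tc[p] names the traffic class that user priority p maps to.
 */
static void toy_xoff_per_tc(const uint32_t prio_xoff[TOY_NUM_PRIO],
			    const uint8_t prio_to_tc[TOY_NUM_PRIO],
			    uint32_t tc_xoff[TOY_NUM_TC])
{
	int i;

	for (i = 0; i < TOY_NUM_TC; i++)
		tc_xoff[i] = 0;

	for (i = 0; i < TOY_NUM_PRIO; i++) {
		uint8_t tc = prio_to_tc[i];

		if (tc < TOY_NUM_TC) /* ignore out-of-range mappings */
			tc_xoff[tc] += prio_xoff[i];
	}
}

/* A queue keeps its hang check armed only if its TC saw no XOFF frames. */
static bool toy_hang_check_armed(const uint32_t tc_xoff[TOY_NUM_TC],
				 uint8_t queue_tc)
{
	return tc_xoff[queue_tc] == 0;
}
```

For instance, with priorities 0-3 mapped to TC 0 and priorities 4-7 mapped to TC 1, XOFF frames received on priority 5 are accounted to TC 1, so only queues of TC 1 are disarmed. Indexing the per-TC array directly by priority would have pointed at TC 5 instead.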

* [net-next 3/7] igb: Use a 32-bit mask when calculating the flow control watermarks
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Matthew Vick, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Matthew Vick <matthew.vick@intel.com>

For some devices, the result of the flow control high watermark gets
truncated when programming it into the registers because of the mask used.
Switch the mask to 32-bit to prevent this from happening.

Signed-off-by: Matthew Vick <matthew.vick@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0ce145e..b85b15a 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1589,8 +1589,7 @@ void igb_reset(struct igb_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	struct e1000_mac_info *mac = &hw->mac;
 	struct e1000_fc_info *fc = &hw->fc;
-	u32 pba = 0, tx_space, min_tx_space, min_rx_space;
-	u16 hwm;
+	u32 pba = 0, tx_space, min_tx_space, min_rx_space, hwm;
 
 	/* Repartition Pba for greater than 9k mtu
 	 * To take effect CTRL.RST is required.
@@ -1665,7 +1664,7 @@ void igb_reset(struct igb_adapter *adapter)
 	hwm = min(((pba << 10) * 9 / 10),
 			((pba << 10) - 2 * adapter->max_frame_size));
 
-	fc->high_water = hwm & 0xFFF0;	/* 16-byte granularity */
+	fc->high_water = hwm & 0xFFFFFFF0;	/* 16-byte granularity */
 	fc->low_water = fc->high_water - 16;
 	fc->pause_time = 0xFFFF;
 	fc->send_xon = 1;
-- 
1.7.11.7
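
To make the truncation concrete, here is a small standalone sketch of the watermark arithmetic with both the old 16-bit handling and the new 32-bit mask. The function names are hypothetical and the packet buffer sizes used with it are illustrative values chosen for the example, not figures taken from any particular device.

```c
#include <stdint.h>

/* High watermark in bytes: min(90% of the buffer, buffer - 2 max frames). */
static uint32_t toy_high_water(uint32_t pba_kb, uint32_t max_frame_size)
{
	uint32_t bytes = pba_kb << 10;
	uint32_t a = bytes * 9 / 10;
	uint32_t b = bytes - 2 * max_frame_size;

	return a < b ? a : b;
}

/* Old behaviour: the value is stored in a u16, then masked with 0xFFF0. */
static uint32_t toy_mask_16bit(uint32_t hwm)
{
	uint16_t truncated = (uint16_t)hwm; /* upper bits are lost here */

	return truncated & 0xFFF0u;
}

/* New behaviour: keep all 32 bits and only round down to 16 bytes. */
static uint32_t toy_mask_32bit(uint32_t hwm)
{
	return hwm & 0xFFFFFFF0u;
}
```

With an assumed 34 KB buffer and a 1522-byte maximum frame, both variants give 31328, because the result fits in 16 bits. With an assumed 80 KB buffer the watermark is 73728 (0x12000), which the 16-bit path cuts down to 8192 while the 32-bit mask preserves it.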

* [net-next 4/7] ixgbe: Drop RLPML configuration from x540 RXDCTL register configuration
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

Since we are doing a page based receive there is no point in setting a maximum
packet length on the x540 RXDCTL register.  As such we can drop the code from
the driver entirely.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Marcus Dennis <marcusx.e.dennis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 80e3cb7..e6e1245 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3170,14 +3170,6 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter,
 	ixgbe_configure_srrctl(adapter, ring);
 	ixgbe_configure_rscctl(adapter, ring);
 
-	/* If operating in IOV mode set RLPML for X540 */
-	if ((adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) &&
-	    hw->mac.type == ixgbe_mac_X540) {
-		rxdctl &= ~IXGBE_RXDCTL_RLPMLMASK;
-		rxdctl |= ((ring->netdev->mtu + ETH_HLEN +
-			    ETH_FCS_LEN + VLAN_HLEN) | IXGBE_RXDCTL_RLPML_EN);
-	}
-
 	if (hw->mac.type == ixgbe_mac_82598EB) {
 		/*
 		 * enable cache line friendly hardware writes:
-- 
1.7.11.7

* [net-next 2/7] igbvf: update version number
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Mitch A Williams, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mitch A Williams <mitch.a.williams@intel.com>

Update version number.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index b44b9d3..3d92ad8 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -47,7 +47,7 @@
 
 #include "igbvf.h"
 
-#define DRV_VERSION "2.0.1-k"
+#define DRV_VERSION "2.0.2-k"
 char igbvf_driver_name[] = "igbvf";
 const char igbvf_driver_version[] = DRV_VERSION;
 static const char igbvf_driver_string[] =
-- 
1.7.11.7

* [net-next 1/7] igbvf: work around i350 erratum
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Mitch A Williams, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1354107978-24731-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mitch A Williams <mitch.a.williams@intel.com>

On i350 VF devices, VLAN tags will be byte-swapped in the receive
descriptor only when received packets are looped back from other
VFs. Check for this condition and swab the tag if needed.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igbvf/defines.h |  1 +
 drivers/net/ethernet/intel/igbvf/igbvf.h   |  2 +-
 drivers/net/ethernet/intel/igbvf/netdev.c  | 15 +++++++++++++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/defines.h b/drivers/net/ethernet/intel/igbvf/defines.h
index 3e18045..d9fa999 100644
--- a/drivers/net/ethernet/intel/igbvf/defines.h
+++ b/drivers/net/ethernet/intel/igbvf/defines.h
@@ -46,6 +46,7 @@
 #define E1000_RXD_ERR_SE        0x02    /* Symbol Error */
 #define E1000_RXD_SPC_VLAN_MASK 0x0FFF  /* VLAN ID is in lower 12 bits */
 
+#define E1000_RXDEXT_STATERR_LB    0x00040000
 #define E1000_RXDEXT_STATERR_CE    0x01000000
 #define E1000_RXDEXT_STATERR_SE    0x02000000
 #define E1000_RXDEXT_STATERR_SEQ   0x04000000
diff --git a/drivers/net/ethernet/intel/igbvf/igbvf.h b/drivers/net/ethernet/intel/igbvf/igbvf.h
index a895e2f..fdca7b6 100644
--- a/drivers/net/ethernet/intel/igbvf/igbvf.h
+++ b/drivers/net/ethernet/intel/igbvf/igbvf.h
@@ -295,7 +295,7 @@ struct igbvf_info {
 
 /* hardware capability, feature, and workaround flags */
 #define IGBVF_FLAG_RX_CSUM_DISABLED             (1 << 0)
-
+#define IGBVF_FLAG_RX_LB_VLAN_BSWAP		(1 << 1)
 #define IGBVF_RX_DESC_ADV(R, i)     \
 	(&((((R).desc))[i].rx_desc))
 #define IGBVF_TX_DESC_ADV(R, i)     \
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4051ec4..b44b9d3 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -107,12 +107,19 @@ static void igbvf_receive_skb(struct igbvf_adapter *adapter,
                               struct sk_buff *skb,
                               u32 status, u16 vlan)
 {
+	u16 vid;
+
 	if (status & E1000_RXD_STAT_VP) {
-		u16 vid = le16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;
+		if ((adapter->flags & IGBVF_FLAG_RX_LB_VLAN_BSWAP) &&
+		    (status & E1000_RXDEXT_STATERR_LB))
+			vid = be16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;
+		else
+			vid = le16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;
 		if (test_bit(vid, adapter->active_vlans))
 			__vlan_hwaccel_put_tag(skb, vid);
 	}
-	netif_receive_skb(skb);
+
+	napi_gro_receive(&adapter->rx_ring->napi, skb);
 }
 
 static inline void igbvf_rx_checksum_adv(struct igbvf_adapter *adapter,
@@ -2767,6 +2774,10 @@ static int __devinit igbvf_probe(struct pci_dev *pdev,
 	/* reset the hardware with the new settings */
 	igbvf_reset(adapter);
 
+	/* set hardware-specific flags */
+	if (adapter->hw.mac.type == e1000_vfadapt_i350)
+		adapter->flags |= IGBVF_FLAG_RX_LB_VLAN_BSWAP;
+
 	strcpy(netdev->name, "eth%d");
 	err = register_netdev(netdev);
 	if (err)
-- 
1.7.11.7
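
The workaround can be modelled outside the kernel roughly as below. This is a sketch under simplifying assumptions: the tag is treated as a host-order 16-bit value and swapped explicitly, whereas the driver selects between be16_to_cpu() and le16_to_cpu(). The mask and loopback bit values are copied from the patch; the toy_* and TOY_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_MASK  0x0FFFu     /* VLAN ID is in the lower 12 bits */
#define TOY_STATERR_LB 0x00040000u /* descriptor bit: packet was looped back */

/* Swap the two bytes of a 16-bit value. */
static uint16_t toy_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Extract the VLAN ID from a receive descriptor tag. When the device has
 * the erratum (quirk) and the packet was looped back from another VF, the
 * tag arrives byte-swapped and has to be swapped back first.
 */
static uint16_t toy_rx_vlan_id(uint16_t tag, uint32_t status, bool quirk)
{
	if (quirk && (status & TOY_STATERR_LB))
		tag = toy_swab16(tag);

	return tag & TOY_VLAN_MASK;
}
```

For VLAN 100 (0x0064), a looped-back packet on an affected device would carry 0x6400 in the descriptor. With the quirk flag set the function recovers 100, and without it the masked result is 0x400, which would not match the active VLAN.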

* [net-next 0/7][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-11-28 13:06 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series contains updates to igb, igbvf and ixgbe.

The following are changes since commit 03f52a0a554210d5049eeed9f1bb29047dc807cb:
  ip6mr: Add sizeof verification to MRT6_ASSERT and MT6_PIM
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Alexander Duyck (1):
  ixgbe: Drop RLPML configuration from x540 RXDCTL register
    configuration

Don Skidmore (1):
  ixgbe: bump version number

Greg Rose (1):
  ixgbe: Make the bridge mode setting sticky

Matthew Vick (1):
  igb: Use a 32-bit mask when calculating the flow control watermarks

Mitch A Williams (2):
  igbvf: work around i350 erratum
  igbvf: update version number

Neerav Parikh (1):
  ixgbe: Fix incorrect disabling of Tx hang check in case of PFC

 drivers/net/ethernet/intel/igb/igb_main.c      |  5 ++--
 drivers/net/ethernet/intel/igbvf/defines.h     |  1 +
 drivers/net/ethernet/intel/igbvf/igbvf.h       |  2 +-
 drivers/net/ethernet/intel/igbvf/netdev.c      | 17 +++++++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 40 ++++++++++++--------------
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |  1 +
 7 files changed, 39 insertions(+), 28 deletions(-)

-- 
1.7.11.7

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-28 12:57 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: Alan Stern, netdev, linux-usb, David Miller
In-Reply-To: <CAKh2mn7VsNYMO-qYyyqR==A2sXjkZo6QXj_kOuem4o1Vu-LLmw@mail.gmail.com>

Steve Glendinning <steve@shawell.net> writes:

>> Looking at the different ethernet drivers, the normal way to do this
>> seems to be something like this in their .set_wol implementation:
>>
>>         device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
>>
>>
>> where "adapter" is a netdev_priv private struct, "pdev" is a pci device
>> and "wol" is an u32.  I don't see any problem doing the same for USB
>> network devices implementing ethtool "set_wol".
>
> Ahh, good spot.  I've implemented this (patch below for reference) and
> it still doesn't work:
>
>
> $ cat /sys/bus/usb/devices/2-1.2/power/wakeup
> disabled
>
> $ sudo ethtool -s eth2 wol p
>
> [ 1607.237767] smsc75xx 2-1.2:1.0 eth2: set_wol before device_can_wakeup=1 device_may_wakeup=0
> [ 1607.237772] smsc75xx 2-1.2:1.0 eth2: set_wol after device_can_wakeup=1 device_may_wakeup=1
>
> $ cat /sys/bus/usb/devices/2-1.2/power/wakeup
> disabled
>
>
> Huh?!  My debugging printk statements tell me I've successfully set it,
> but then the sysfs entry is disabled when I read it.  Is something
> else setting this back?
>
> My testing patch for reference (note this is against my tree with a
> few to-be submitted patches so won't apply cleanly!):

[..]

> @@ -674,8 +653,19 @@ static int smsc75xx_ethtool_set_wol(struct net_device *net,
>  {
>   struct usbnet *dev = netdev_priv(net);
>   struct smsc75xx_priv *pdata = (struct smsc75xx_priv *)(dev->data[0]);
> + int ret;
> +
> + netdev_info(dev->net, "set_wol before device_can_wakeup=%d device_may_wakeup=%d\n",
> + device_can_wakeup(&net->dev), device_may_wakeup(&net->dev));
>
>   pdata->wolopts = wolinfo->wolopts & SUPPORTED_WAKE;
> +
> + ret = device_set_wakeup_enable(&net->dev, pdata->wolopts);

You are touching the network device here.  That should have been the USB
device.  Try something like

 ret = device_set_wakeup_enable(&dev->udev->dev, pdata->wolopts);

instead.



Bjørn

* [PATCH v2 net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28 12:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ling Ma, Joe Perches, Ben Hutchings
In-Reply-To: <1354102061.14302.84.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

commit 68835aba4d9b (net: optimize INET input path further)
moved some fields used for tcp/udp sockets lookup in the first cache
line of struct sock_common.

This patch moves inet_dport/inet_num as well, filling a 32-bit hole
on 64-bit arches and reducing the number of cache line misses.

Also change INET_MATCH()/INET_TW_MATCH() to match the ports before
the addresses, as the port check is more discriminating.

Remove the hash check from the MATCH() macros because we don't need to
revalidate the hash value after taking a refcount on the socket, and
use likely/unlikely compiler hints, as the sk_hash/hash check
makes the following conditional tests 100% predicted by the CPU.

Introduce skc_addrpair/skc_portpair pair values to better
document the alignment requirements of the port/addr pairs
used in the various MATCH() macros, and remove some casts.

The namespace check can also be done last.

This slightly improves TCP/UDP lookup times.

With help from Ben Hutchings & Joe Perches.

The idea for this patch came from Ling Ma's proposal to move skc_hash
to the beginning of struct sock_common; it should allow him
to submit a final version of his patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ling Ma <ling.ma.program@gmail.com>
---
 include/linux/ipv6.h             |   32 ++++++++++---------
 include/net/inet_hashtables.h    |   48 +++++++++++++++--------------
 include/net/inet_sock.h          |    6 ++-
 include/net/inet_timewait_sock.h |    7 +++-
 include/net/sock.h               |   25 ++++++++++++---
 net/ipv4/inet_hashtables.c       |   36 +++++++++++++--------
 net/ipv6/inet6_hashtables.c      |   27 +++++++++++-----
 7 files changed, 114 insertions(+), 67 deletions(-)
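
The "pair" fields mentioned above rely on overlaying two 16-bit ports with one 32-bit value so that both can be compared in a single operation. A minimal userspace sketch of that layout follows, assuming hypothetical toy_* names; it is not the kernel's struct sock_common and it leaves byte-order conversion aside.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two 16-bit ports overlaid with one 32-bit value, as in skc_portpair. */
union toy_ports {
	uint32_t portpair;
	struct {
		uint16_t dport; /* remote port */
		uint16_t num;   /* local port */
	} pair;
};

/* Build the combined lookup key with the same layout the socket uses. */
static uint32_t toy_combined_ports(uint16_t dport, uint16_t num)
{
	union toy_ports key;

	key.pair.dport = dport;
	key.pair.num = num;
	return key.portpair;
}

/* One 32-bit compare checks both ports at once. */
static bool toy_ports_match(const union toy_ports *sk, uint32_t ports)
{
	return sk->portpair == ports;
}
```

Because the key is built through the same union as the socket field, the comparison does not depend on host endianness. The grouping and alignment requirement stated in the sock_common comment is what makes the single wide compare valid.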

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5e11905..12729e9 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -364,20 +364,22 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 #define inet_v6_ipv6only(__sk)		0
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash)) && sock_net((__sk)) == (__net)	&& \
-	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports)) && \
-	 ((__sk)->sk_family		== AF_INET6)		&& \
-	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&& \
-	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&& \
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-
-#define INET6_TW_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
-	(((__sk)->sk_hash == (__hash)) && sock_net((__sk)) == (__net)	&& \
-	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&& \
-	 ((__sk)->sk_family	       == PF_INET6)			&& \
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&& \
-	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) && \
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
+#define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif)	\
+	((inet_sk(__sk)->inet_portpair == (__ports))		&&	\
+	 ((__sk)->sk_family == AF_INET6)			&&	\
+	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&	\
+	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&	\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))) 		&&	\
+	 net_eq(sock_net(__sk), (__net)))
+
+#define INET6_TW_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif)	   \
+	((inet_twsk(__sk)->tw_portpair == (__ports))			&& \
+	 ((__sk)->sk_family == AF_INET6)				&& \
+	 ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr))	&& \
+	 ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr)) && \
+	 (!(__sk)->sk_bound_dev_if	||				   \
+	  ((__sk)->sk_bound_dev_if == (__dif)))				&& \
+	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _IPV6_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 54be028..5dc8521 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -299,30 +299,34 @@ typedef __u64 __bitwise __addrpair;
 				   (((__force __u64)(__be32)(__daddr)) << 32) | \
 				   ((__force __u64)(__be32)(__saddr)));
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash)) && net_eq(sock_net(__sk), (__net)) &&	\
-	 ((*((__addrpair *)&(inet_sk(__sk)->inet_daddr))) == (__cookie))  &&	\
-	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports))   &&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash)) && net_eq(sock_net(__sk), (__net)) &&	\
-	 ((*((__addrpair *)&(inet_twsk(__sk)->tw_daddr))) == (__cookie)) &&	\
-	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports)) &&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif)	\
+	((inet_sk(__sk)->inet_portpair == (__ports))		&&	\
+	 (inet_sk(__sk)->inet_addrpair == (__cookie))		&&	\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))) 		&& 	\
+	 net_eq(sock_net(__sk), (__net)))
+#define INET_TW_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif)\
+	((inet_twsk(__sk)->tw_portpair == (__ports))	&&		\
+	 (inet_twsk(__sk)->tw_addrpair == (__cookie))	&&		\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif)))	&&		\
+	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr)
-#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash)) && net_eq(sock_net(__sk), (__net))	&&	\
-	 (inet_sk(__sk)->inet_daddr	== (__saddr))		&&	\
-	 (inet_sk(__sk)->inet_rcv_saddr	== (__daddr))		&&	\
-	 ((*((__portpair *)&(inet_sk(__sk)->inet_dport))) == (__ports))	&&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __net, __hash,__cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash)) && net_eq(sock_net(__sk), (__net))	&&	\
-	 (inet_twsk(__sk)->tw_daddr	== (__saddr))		&&	\
-	 (inet_twsk(__sk)->tw_rcv_saddr	== (__daddr))		&&	\
-	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports)) &&	\
-	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
+#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+	((inet_sk(__sk)->inet_portpair == (__ports))	&&		\
+	 (inet_sk(__sk)->inet_daddr	== (__saddr))	&&		\
+	 (inet_sk(__sk)->inet_rcv_saddr	== (__daddr))	&&		\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))) 	&&		\
+	 net_eq(sock_net(__sk), (__net)))
+#define INET_TW_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif) \
+	((inet_twsk(__sk)->tw_portpair == (__ports))	&&		\
+	 (inet_twsk(__sk)->tw_daddr	== (__saddr))	&&		\
+	 (inet_twsk(__sk)->tw_rcv_saddr	== (__daddr))	&&		\
+	 (!(__sk)->sk_bound_dev_if	||				\
+	   ((__sk)->sk_bound_dev_if == (__dif))) 	&&		\
+	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 
 /*
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 256c1ed..ee5ddcd 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -144,9 +144,11 @@ struct inet_sock {
 	/* Socket demultiplex comparisons on incoming packets. */
 #define inet_daddr		sk.__sk_common.skc_daddr
 #define inet_rcv_saddr		sk.__sk_common.skc_rcv_saddr
+#define inet_addrpair		sk.__sk_common.skc_addrpair
+#define inet_dport		sk.__sk_common.skc_dport
+#define inet_num		sk.__sk_common.skc_num
+#define inet_portpair		sk.__sk_common.skc_portpair
 
-	__be16			inet_dport;
-	__u16			inet_num;
 	__be32			inet_saddr;
 	__s16			uc_ttl;
 	__u16			cmsg_flags;
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index ba52c83..7d658d5 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -112,6 +112,11 @@ struct inet_timewait_sock {
 #define tw_net			__tw_common.skc_net
 #define tw_daddr        	__tw_common.skc_daddr
 #define tw_rcv_saddr    	__tw_common.skc_rcv_saddr
+#define tw_addrpair		__tw_common.skc_addrpair
+#define tw_dport		__tw_common.skc_dport
+#define tw_num			__tw_common.skc_num
+#define tw_portpair		__tw_common.skc_portpair
+
 	int			tw_timeout;
 	volatile unsigned char	tw_substate;
 	unsigned char		tw_rcv_wscale;
@@ -119,8 +124,6 @@ struct inet_timewait_sock {
 	/* Socket demultiplex comparisons on incoming packets. */
 	/* these three are in inet_sock */
 	__be16			tw_sport;
-	__be16			tw_dport;
-	__u16			tw_num;
 	kmemcheck_bitfield_begin(flags);
 	/* And these are ours. */
 	unsigned int		tw_ipv6only     : 1,
diff --git a/include/net/sock.h b/include/net/sock.h
index c945fba..c4132c1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -132,6 +132,8 @@ struct net;
  *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
+ *	@skc_dport: placeholder for inet_dport/tw_dport
+ *	@skc_num: placeholder for inet_num/tw_num
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
@@ -149,16 +151,29 @@ struct net;
  *	for struct sock and struct inet_timewait_sock.
  */
 struct sock_common {
-	/* skc_daddr and skc_rcv_saddr must be grouped :
-	 * cf INET_MATCH() and INET_TW_MATCH()
+	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
 	 */
-	__be32			skc_daddr;
-	__be32			skc_rcv_saddr;
-
+	union {
+		unsigned long	skc_addrpair;
+		struct {
+			__be32	skc_daddr;
+			__be32	skc_rcv_saddr;
+		};
+	};
 	union  {
 		unsigned int	skc_hash;
 		__u16		skc_u16hashes[2];
 	};
+	/* skc_dport && skc_num must be grouped as well */
+	union {
+		u32		skc_portpair;
+		struct {
+			__be16	skc_dport;
+			__u16	skc_num;
+		};
+	};
+
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 7880af9..fa3ae81 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -237,12 +237,14 @@ struct sock *__inet_lookup_established(struct net *net,
 	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
-		if (INET_MATCH(sk, net, hash, acookie,
-					saddr, daddr, ports, dif)) {
+		if (sk->sk_hash != hash)
+			continue;
+		if (likely(INET_MATCH(sk, net, acookie,
+				      saddr, daddr, ports, dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
 				goto begintw;
-			if (unlikely(!INET_MATCH(sk, net, hash, acookie,
-				saddr, daddr, ports, dif))) {
+			if (unlikely(!INET_MATCH(sk, net, acookie,
+						 saddr, daddr, ports, dif))) {
 				sock_put(sk);
 				goto begin;
 			}
@@ -260,14 +262,18 @@ begin:
 begintw:
 	/* Must check for a TIME_WAIT'er before going to listener hash. */
 	sk_nulls_for_each_rcu(sk, node, &head->twchain) {
-		if (INET_TW_MATCH(sk, net, hash, acookie,
-					saddr, daddr, ports, dif)) {
+		if (sk->sk_hash != hash)
+			continue;
+		if (likely(INET_TW_MATCH(sk, net, acookie,
+					 saddr, daddr, ports,
+					 dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
 				sk = NULL;
 				goto out;
 			}
-			if (unlikely(!INET_TW_MATCH(sk, net, hash, acookie,
-				 saddr, daddr, ports, dif))) {
+			if (unlikely(!INET_TW_MATCH(sk, net, acookie,
+						    saddr, daddr, ports,
+						    dif))) {
 				sock_put(sk);
 				goto begintw;
 			}
@@ -314,10 +320,12 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 
 	/* Check TIME-WAIT sockets first. */
 	sk_nulls_for_each(sk2, node, &head->twchain) {
-		tw = inet_twsk(sk2);
+		if (sk2->sk_hash != hash)
+			continue;
 
-		if (INET_TW_MATCH(sk2, net, hash, acookie,
-					saddr, daddr, ports, dif)) {
+		if (likely(INET_TW_MATCH(sk2, net, acookie,
+					 saddr, daddr, ports, dif))) {
+			tw = inet_twsk(sk2);
 			if (twsk_unique(sk, sk2, twp))
 				goto unique;
 			else
@@ -328,8 +336,10 @@ static int __inet_check_established(struct inet_timewait_death_row *death_row,
 
 	/* And established part... */
 	sk_nulls_for_each(sk2, node, &head->chain) {
-		if (INET_MATCH(sk2, net, hash, acookie,
-					saddr, daddr, ports, dif))
+		if (sk2->sk_hash != hash)
+			continue;
+		if (likely(INET_MATCH(sk2, net, acookie,
+				      saddr, daddr, ports, dif)))
 			goto not_unique;
 	}
 
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 73f1a00..dea17fd 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -87,11 +87,13 @@ struct sock *__inet6_lookup_established(struct net *net,
 	rcu_read_lock();
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
-		/* For IPV6 do the cheaper port and family tests first. */
-		if (INET6_MATCH(sk, net, hash, saddr, daddr, ports, dif)) {
+		if (sk->sk_hash != hash)
+			continue;
+		if (likely(INET6_MATCH(sk, net, saddr, daddr, ports, dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
 				goto begintw;
-			if (!INET6_MATCH(sk, net, hash, saddr, daddr, ports, dif)) {
+			if (unlikely(!INET6_MATCH(sk, net, saddr, daddr,
+						  ports, dif))) {
 				sock_put(sk);
 				goto begin;
 			}
@@ -104,12 +106,16 @@ begin:
 begintw:
 	/* Must check for a TIME_WAIT'er before going to listener hash. */
 	sk_nulls_for_each_rcu(sk, node, &head->twchain) {
-		if (INET6_TW_MATCH(sk, net, hash, saddr, daddr, ports, dif)) {
+		if (sk->sk_hash != hash)
+			continue;
+		if (likely(INET6_TW_MATCH(sk, net, saddr, daddr,
+					  ports, dif))) {
 			if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
 				sk = NULL;
 				goto out;
 			}
-			if (!INET6_TW_MATCH(sk, net, hash, saddr, daddr, ports, dif)) {
+			if (unlikely(!INET6_TW_MATCH(sk, net, saddr, daddr,
+						     ports, dif))) {
 				sock_put(sk);
 				goto begintw;
 			}
@@ -236,9 +242,12 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 
 	/* Check TIME-WAIT sockets first. */
 	sk_nulls_for_each(sk2, node, &head->twchain) {
-		tw = inet_twsk(sk2);
+		if (sk2->sk_hash != hash)
+			continue;
 
-		if (INET6_TW_MATCH(sk2, net, hash, saddr, daddr, ports, dif)) {
+		if (likely(INET6_TW_MATCH(sk2, net, saddr, daddr,
+					  ports, dif))) {
+			tw = inet_twsk(sk2);
 			if (twsk_unique(sk, sk2, twp))
 				goto unique;
 			else
@@ -249,7 +258,9 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 
 	/* And established part... */
 	sk_nulls_for_each(sk2, node, &head->chain) {
-		if (INET6_MATCH(sk2, net, hash, saddr, daddr, ports, dif))
+		if (sk2->sk_hash != hash)
+			continue;
+		if (likely(INET6_MATCH(sk2, net, saddr, daddr, ports, dif)))
 			goto not_unique;
 	}
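
The lookup pattern in this patch (cheap hash test first, then one wide compare over the grouped address and port fields) can be sketched in user space as below. This is only an illustration under stated assumptions: `struct conn_key`, `struct conn_entry`, `conn_key_make()`, `conn_key_match()` and `conn_lookup()` are hypothetical names, `uint64_t` stands in for the `unsigned long` used on 64-bit arches, and the key is built through the narrow fields so that no byte-order handling is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for the address/port part of struct sock_common. */
struct conn_key {
	union {
		uint64_t addrpair;	/* both addresses viewed as one word */
		struct {
			uint32_t daddr;
			uint32_t rcv_saddr;
		};
	};
	union {
		uint32_t portpair;	/* both ports viewed as one word */
		struct {
			uint16_t dport;
			uint16_t num;
		};
	};
};

/* Hypothetical table entry: a precomputed hash plus the key. */
struct conn_entry {
	uint32_t hash;
	struct conn_key key;
};

/* Fill the narrow fields; the wide views then follow the host layout. */
static struct conn_key conn_key_make(uint32_t daddr, uint32_t rcv_saddr,
				     uint16_t dport, uint16_t num)
{
	struct conn_key k;

	memset(&k, 0, sizeof(k));
	k.daddr = daddr;
	k.rcv_saddr = rcv_saddr;
	k.dport = dport;
	k.num = num;
	return k;
}

/* Two wide compares instead of four narrow ones. */
static int conn_key_match(const struct conn_key *a, const struct conn_key *b)
{
	return a->addrpair == b->addrpair && a->portpair == b->portpair;
}

/* Skip entries whose hash differs before doing the full compare. */
static const struct conn_entry *conn_lookup(const struct conn_entry *tab,
					    size_t n, uint32_t hash,
					    const struct conn_key *want)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].hash != hash)
			continue;
		if (conn_key_match(&tab[i].key, want))
			return &tab[i];
	}
	return NULL;
}
```

The sketch leaves out the netns, family and bound-interface checks that the real INET_MATCH() macros also perform.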
 

^ permalink raw reply related

* Re: TCP and reordering
From: Eric Dumazet @ 2012-11-28 12:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354106362.21562.51.camel@shinybook.infradead.org>

On Wed, 2012-11-28 at 12:39 +0000, David Woodhouse wrote:

> 
> I'll go back to looking at TSQ, and BQL for PPP. If I have to use
> skb_orphan() and install a destructor of my own in order to do BQL for
> PPP, that'll upset TSQ a little. Is there a way we could *chain* the
> destructors... skb_clone() to put the skbs on the PPP channels' queues,
> perhaps, then free the original from the PPP destructor? Or is that too
> much overhead?
> 
> I've killed most of the channel queue for PPPoATM and PPPoE now, but
> L2TP still has a whole load of buffering all the way through the stack
> again before it really leaves the host.
> 
> (And PPPoE will still have the txqueuelen on the Ethernet device too).
> 

BQL is nice for high speed adapters.

For slow ones, you can always stop the queue for each packet given to
start_xmit()

And restart the queue at TX completion.

Some device drivers do that (because the hardware has a single slot, no
ring buffer, not because they wanted to fight bufferbloat ;) )
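
The stop-per-packet scheme described above can be modelled with a small user-space toy. It is a sketch only: `struct toy_dev`, `toy_start_xmit()` and `toy_tx_complete()` are hypothetical names, and the `queue_stopped` flag merely stands in for what a real driver would do with netif_stop_queue() and netif_wake_queue().

```c
#include <stdbool.h>

/* Toy model of a device with a single hardware TX slot (no ring buffer). */
struct toy_dev {
	bool queue_stopped;	/* stands in for the netdev queue state */
	bool slot_busy;		/* the one hardware slot */
	unsigned int sent;	/* packets completed so far */
};

/*
 * Models start_xmit(): accept one packet, then stop the queue.
 * Returns 0 on success, -1 if called while the queue is stopped.
 */
static int toy_start_xmit(struct toy_dev *dev)
{
	if (dev->queue_stopped || dev->slot_busy)
		return -1;
	dev->slot_busy = true;
	dev->queue_stopped = true;
	return 0;
}

/* Models TX completion: free the slot and restart the queue. */
static void toy_tx_complete(struct toy_dev *dev)
{
	if (!dev->slot_busy)
		return;
	dev->slot_busy = false;
	dev->sent++;
	dev->queue_stopped = false;
}
```

With this shape at most one packet is ever queued in the device, so any backlog stays in the qdisc above it.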

^ permalink raw reply

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Steve Glendinning @ 2012-11-28 12:43 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Alan Stern, netdev, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	David Miller
In-Reply-To: <87k3t62btw.fsf-lbf33ChDnrE/G1V5fR+Y7Q@public.gmane.org>

> Looking at the different ethernet drivers, the normal way to do this
> seems to be something like this in their .set_wol implementation:
>
>         device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
>
>
> where "adapter" is a netdev_priv private struct, "pdev" is a pci device
> and "wol" is an u32.  I don't see any problem doing the same for USB
> network devices implementing ethtool "set_wol".

Ahh, good spot.  I've implemented this (patch below for reference) and
it still doesn't work:


$ cat /sys/bus/usb/devices/2-1.2/power/wakeup
disabled

$ sudo ethtool -s eth2 wol p

[ 1607.237767] smsc75xx 2-1.2:1.0 eth2: set_wol before device_can_wakeup=1 device_may_wakeup=0
[ 1607.237772] smsc75xx 2-1.2:1.0 eth2: set_wol after device_can_wakeup=1 device_may_wakeup=1

$ cat /sys/bus/usb/devices/2-1.2/power/wakeup
disabled


Huh?!  My debugging printk statements tell me I've successfully set it,
but then the sysfs entry is disabled when I read it.  Is something
else setting this back?

My testing patch for reference (note this is against my tree with a
few to-be submitted patches so won't apply cleanly!):


diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index d8fa649..4c17849 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -61,7 +61,6 @@
 #define SUSPEND_SUSPEND1 (0x02)
 #define SUSPEND_SUSPEND2 (0x04)
 #define SUSPEND_SUSPEND3 (0x08)
-#define SUSPEND_REMOTEWAKE (0x10)
 #define SUSPEND_ALLMODES (SUSPEND_SUSPEND0 | SUSPEND_SUSPEND1 | \
  SUSPEND_SUSPEND2 | SUSPEND_SUSPEND3)

@@ -172,26 +171,6 @@ static int __must_check smsc75xx_write_reg(struct usbnet *dev, u32 index,
  return __smsc75xx_write_reg(dev, index, data, 0);
 }

-static int smsc75xx_set_feature(struct usbnet *dev, u32 feature)
-{
- if (WARN_ON_ONCE(!dev))
- return -EINVAL;
-
- return usbnet_write_cmd_nopm(dev, USB_REQ_SET_FEATURE,
-     USB_DIR_OUT | USB_RECIP_DEVICE,
-     feature, 0, NULL, 0);
-}
-
-static int smsc75xx_clear_feature(struct usbnet *dev, u32 feature)
-{
- if (WARN_ON_ONCE(!dev))
- return -EINVAL;
-
- return usbnet_write_cmd_nopm(dev, USB_REQ_CLEAR_FEATURE,
-     USB_DIR_OUT | USB_RECIP_DEVICE,
-     feature, 0, NULL, 0);
-}
-
 /* Loop until the read is completed with timeout
  * called with phy_mutex held */
 static __must_check int __smsc75xx_phy_wait_not_busy(struct usbnet *dev,
@@ -674,8 +653,19 @@ static int smsc75xx_ethtool_set_wol(struct net_device *net,
 {
  struct usbnet *dev = netdev_priv(net);
  struct smsc75xx_priv *pdata = (struct smsc75xx_priv *)(dev->data[0]);
+ int ret;
+
+ netdev_info(dev->net, "set_wol before device_can_wakeup=%d device_may_wakeup=%d\n",
+ device_can_wakeup(&net->dev), device_may_wakeup(&net->dev));

  pdata->wolopts = wolinfo->wolopts & SUPPORTED_WAKE;
+
+ ret = device_set_wakeup_enable(&net->dev, pdata->wolopts);
+ check_warn_return(ret, "device_set_wakeup_enable error %d\n", ret);
+
+ netdev_info(dev->net, "set_wol after device_can_wakeup=%d device_may_wakeup=%d\n",
+ device_can_wakeup(&net->dev), device_may_wakeup(&net->dev));
+
  return 0;
 }

@@ -1197,12 +1187,17 @@ static int smsc75xx_bind(struct usbnet *dev, struct usb_interface *intf)

  /* Init all registers */
  ret = smsc75xx_reset(dev);
+ check_warn_return(ret, "smsc75xx_reset error %d\n", ret);

  dev->net->netdev_ops = &smsc75xx_netdev_ops;
  dev->net->ethtool_ops = &smsc75xx_ethtool_ops;
  dev->net->flags |= IFF_MULTICAST;
  dev->net->hard_header_len += SMSC75XX_TX_OVERHEAD;
  dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
+
+ ret = device_init_wakeup(&dev->net->dev, 1);
+ check_warn_return(ret, "device_init_wakeup error %d\n", ret);
+
  return 0;
 }

@@ -1262,9 +1257,7 @@ static int smsc75xx_enter_suspend0(struct usbnet *dev)
  ret = smsc75xx_write_reg_nopm(dev, PMT_CTL, val);
  check_warn_return(ret, "Error writing PMT_CTL\n");

- smsc75xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
-
- pdata->suspend_flags |= SUSPEND_SUSPEND0 | SUSPEND_REMOTEWAKE;
+ pdata->suspend_flags |= SUSPEND_SUSPEND0;

  return 0;
 }
@@ -1291,9 +1284,7 @@ static int smsc75xx_enter_suspend1(struct usbnet *dev)
  ret = smsc75xx_write_reg_nopm(dev, PMT_CTL, val);
  check_warn_return(ret, "Error writing PMT_CTL\n");

- smsc75xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
-
- pdata->suspend_flags |= SUSPEND_SUSPEND1 | SUSPEND_REMOTEWAKE;
+ pdata->suspend_flags |= SUSPEND_SUSPEND1;

  return 0;
 }
@@ -1348,9 +1339,7 @@ static int smsc75xx_enter_suspend3(struct usbnet *dev)
  ret = smsc75xx_write_reg_nopm(dev, PMT_CTL, val);
  check_warn_return(ret, "Error writing PMT_CTL\n");

- smsc75xx_set_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
-
- pdata->suspend_flags |= SUSPEND_SUSPEND3 | SUSPEND_REMOTEWAKE;
+ pdata->suspend_flags |= SUSPEND_SUSPEND3;

  return 0;
  }
@@ -1650,11 +1639,6 @@ static int smsc75xx_resume(struct usb_interface *intf)
  /* do this first to ensure it's cleared even in error case */
  pdata->suspend_flags = 0;

- if (suspend_flags & SUSPEND_REMOTEWAKE) {
- ret = smsc75xx_clear_feature(dev, USB_DEVICE_REMOTE_WAKEUP);
- check_warn_return(ret, "Error disabling remote wakeup\n");
- }

^ permalink raw reply related

* Re: TCP and reordering
From: David Woodhouse @ 2012-11-28 12:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354105619.14302.89.camel@edumazet-glaptop>

On Wed, 2012-11-28 at 04:26 -0800, Eric Dumazet wrote:
> On Wed, 2012-11-28 at 11:49 +0000, David Woodhouse wrote:
> > On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > > > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > > > one SACK and one TS reorder. So my connection probably does even less
> > > > reordering than I thought, and thus isn't particularly relevant to this
> > > > conversation. I'll shut up now and go back to playing with ATM.
> > > 
> > > But you are the receiver. A receiver should not increase these counters.
> > 
> > I checked it on the sending side.
> > 
> 
> If you want to play with reordering effects on your bi-ADSL line,
> you could install a "netem delay 3ms" on the ingress side of one of the
> links.

For now I'm content to observe that I don't really get much reordering
at all, which is fine.

I'll go back to looking at TSQ, and BQL for PPP. If I have to use
skb_orphan() and install a destructor of my own in order to do BQL for
PPP, that'll upset TSQ a little. Is there a way we could *chain* the
destructors... skb_clone() to put the skbs on the PPP channels' queues,
perhaps, then free the original from the PPP destructor? Or is that too
much overhead?

I've killed most of the channel queue for PPPoATM and PPPoE now, but
L2TP still has a whole load of buffering all the way through the stack
again before it really leaves the host.

(And PPPoE will still have the txqueuelen on the Ethernet device too).

-- 
dwmw2


^ permalink raw reply

* Re: TCP and reordering
From: Eric Dumazet @ 2012-11-28 12:26 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354103355.21562.46.camel@shinybook.infradead.org>

On Wed, 2012-11-28 at 11:49 +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > > one SACK and one TS reorder. So my connection probably does even less
> > > reordering than I thought, and thus isn't particularly relevant to this
> > > conversation. I'll shut up now and go back to playing with ATM.
> > 
> > But you are the receiver. A receiver should not increase these counters.
> 
> I checked it on the sending side.
> 

If you want to play with reordering effects on your bi-ADSL line,
you could install a "netem delay 3ms" on the ingress side of one of the
links.

^ permalink raw reply

* Question about packet schedulers
From: Alexey Perevalov @ 2012-11-28 12:24 UTC (permalink / raw)
  To: netdev

Hello All,

I need a packet scheduler with very interesting behavior.

Such a packet scheduler should maximize idle time: it should send all
available packets within a tiny time frame, then wait and accumulate
packets over a huge time frame. Both time intervals should be tunable.

-- 
Best regards,
Alexey Perevalov,

^ permalink raw reply

* Re: Specific question about packet dropping
From: Shan Wei @ 2012-11-28 12:23 UTC (permalink / raw)
  To: Javier Domingo; +Cc: netdev
In-Reply-To: <CALZVapkZpRRh+ruf-QWbv_pFYtZgdx7US7Zc6x0kgMmmAsnAPg@mail.gmail.com>

Javier Domingo said, at 2012/11/28 19:50:
> Thank you very much, I (don't know why) thought the packets were being
> smashed in the dma memory.

Maybe your kernel/application handles packets more slowly than the
NIC receives them.

>> Best Regards
>> Shan Wei
>>
>>>
>>> Javier Domingo
>>>
>>
>>
> 

^ permalink raw reply

* Re: [RFC PATCH] tcp: introduce raw access to experimental options
From: Yuchung Cheng @ 2012-11-28 12:01 UTC (permalink / raw)
  To: elelueck; +Cc: netdev, frankbla, raspl, ubacher, samudrala, davem
In-Reply-To: <1353084898-42264-1-git-send-email-elelueck@linux.vnet.ibm.com>

On Sat, Nov 17, 2012 at 12:54 AM,  <elelueck@linux.vnet.ibm.com> wrote:
> From: Einar Lueck <elelueck@linux.vnet.ibm.com>
>
> This patch adds means for raw access to TCP experimental options
> 253 and 254. The intention of this is to enable user space
> applications to implement communication behaviour that depends
> on experimental options. For that, new (set|get)sockopts are

Could you elaborate on the use case? I am having a hard time
understanding that. If you need to use experimental options for your
applications, why not just use another magic number according to
draft-ietf-tcpm-experimental-options-02 (since you cite that too)?
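
For reference, the shared-use layout that the Fast Open parsing code in this patch already relies on (kind, length, then a 16-bit magic number in network byte order) can be sketched as follows. The helper names `exp_opt_build()` and `exp_opt_has_magic()` and the value `EXAMPLE_MAGIC` are hypothetical and exist only for illustration; a real experiment would pick its own magic number.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_EXP253	253	/* experimental option kinds, as in the patch */
#define TCPOPT_EXP254	254
#define EXAMPLE_MAGIC	0xBEEF	/* hypothetical magic number, illustration only */

/*
 * Write kind, length, 16-bit magic (big endian) and payload into buf.
 * Returns the option length, or 0 if the option does not fit.
 */
static size_t exp_opt_build(uint8_t *buf, size_t buflen, uint16_t magic,
			    const uint8_t *data, size_t datalen)
{
	size_t optlen;

	if (datalen > 255 - 4 || 4 + datalen > buflen)
		return 0;
	optlen = 4 + datalen;
	buf[0] = TCPOPT_EXP254;
	buf[1] = (uint8_t)optlen;
	buf[2] = (uint8_t)(magic >> 8);
	buf[3] = (uint8_t)(magic & 0xff);
	if (datalen)
		memcpy(buf + 4, data, datalen);
	return optlen;
}

/* Return 1 if opt is a kind 253/254 option carrying the given magic. */
static int exp_opt_has_magic(const uint8_t *opt, size_t len, uint16_t magic)
{
	if (len < 4 || opt[1] < 4 || opt[1] > len)
		return 0;
	if (opt[0] != TCPOPT_EXP253 && opt[0] != TCPOPT_EXP254)
		return 0;
	return (((uint16_t)opt[2] << 8) | opt[3]) == magic;
}
```

A receiver that checks the magic first can ignore experimental options belonging to other experiments that share the same kind.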

> introduced:
>
> TCP_EXPOPTS (get & set): TCP experimental options to be added to
>                          packets
> TCP_RECV_EXPOPTS (get):  experimental options received with last
>                          packet
> TCP_RECV_SYN_EXPOPTS (get): experimental options received with
>                          SYN packet
>
> TCP experimental options 253 and 254 configured via TCP_EXPOPTS on
> any TCP socket are appended to every packet that is sent as long
> as there is enough room left. If there is not enough room left they
> are silently dropped.
>
> Listening sockets reply to SYN packets with SYN ACK packets containing
> TCP experimental options 253 and 254 as configured via TCP_EXPOPTS, too.
> If a TCP connection gets established the configured experimental options
> are the defaults for the new socket, too. Thus, a getsockopt on the
> resulting accept socket for TCP_EXPOPTS returns the same stuff configured
> on the listening socket.
>
> As mentioned above, even after the 3whs is complete, experimental options
> are sent with every packet. To enable user space applications to distinguish
> between what has been advertized via SYN and what has been received with the
> last packet the aforementioned TCP_RECV_SYN_EXPOPTS and TCP_RECV_EXPOPTS are
> introduced.
>
> Today, experimental option 253 (COOKIE) and 254 (FASTOPEN) are already
> exploited. For co-existence the following approach has been taken:
>
> General remarks:
> * Interface to COOKIE and FASTOPEN stays the same
> Sender side:
> 1. COOKIE and FASTOPEN code adds own options first (if applicable)
> 2. Finally, if enough room is left, TCP_EXPOPTS experimental options are
>    appended
> Receiver side:
> 1. ALL 253 and 254 experimental options are made available via
>    TCP_RECV(_SYN)_EXPOPTS
> 2. COOKIE and FASTOPEN code check if there is any option relevant for them
>
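
The buffers exchanged through TCP_EXPOPTS hold options in "kind length data" form. A minimal user-space sketch of validating such a buffer is shown below, assuming the 40-byte limit from the patch; `exp_opts_valid()` is a hypothetical helper, not code from the patch. It rejects length bytes smaller than 2, because a zero length would otherwise keep the walk from ever advancing.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EXP253		253
#define TCPOPT_EXP254		254
#define TCP_EXPOP_MAXLEN	40

/* Return 1 if buf holds only well-formed kind 253/254 options, else 0. */
static int exp_opts_valid(const uint8_t *buf, size_t len)
{
	size_t i = 0;

	if (len > TCP_EXPOP_MAXLEN)
		return 0;
	while (i < len) {
		uint8_t optlen;

		if (buf[i] != TCPOPT_EXP253 && buf[i] != TCPOPT_EXP254)
			return 0;
		if (len - i < 2)
			return 0;	/* no room for the length byte */
		optlen = buf[i + 1];
		if (optlen < 2 || optlen > len - i)
			return 0;	/* would stall the walk or overrun buf */
		i += optlen;
	}
	return 1;
}
```

An empty buffer is treated as valid here, matching the patch's use of a zero optlen to clear the configured options.
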
> References:
> http://tools.ietf.org/html/draft-ietf-tcpm-experimental-options-02
>
> Signed-off-by: Einar Lueck <elelueck@linux.vnet.ibm.com>
> ---
>  include/linux/tcp.h      |  25 ++++++++++
>  include/net/tcp.h        |   3 ++
>  net/ipv4/tcp.c           | 110 +++++++++++++++++++++++++++++++++++++++++++
>  net/ipv4/tcp_input.c     | 119 +++++++++++++++++++++++++++++++----------------
>  net/ipv4/tcp_ipv4.c      |  14 ++++++
>  net/ipv4/tcp_minisocks.c |  17 +++++++
>  net/ipv4/tcp_output.c    |  37 ++++++++++++---
>  7 files changed, 279 insertions(+), 46 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index eb125a4..b2a6451 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -110,6 +110,10 @@ enum {
>  #define TCP_REPAIR_QUEUE       20
>  #define TCP_QUEUE_SEQ          21
>  #define TCP_REPAIR_OPTIONS     22
> +#define TCP_EXPOPTS            23      /* TCP exp. options (configured) */
> +#define TCP_RECV_EXPOPTS       24      /* TCP exp. options (received) */
> +#define TCP_RECV_SYN_EXPOPTS   25      /* TCP exp. options
> +                                          (received with syn)) */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> @@ -269,6 +273,8 @@ struct tcp_sack_block {
>  #define TCP_FACK_ENABLED  (1 << 1)   /*1 = FACK is enabled locally*/
>  #define TCP_DSACK_SEEN    (1 << 2)   /*1 = DSACK was received from peer*/
>
> +#define TCP_EXPOP_MAXLEN       40
> +
>  struct tcp_options_received {
>  /*     PAWS/RTTM data  */
>         long    ts_recent_stamp;/* Time we stored ts_recent (for aging) */
> @@ -288,6 +294,9 @@ struct tcp_options_received {
>         u8      num_sacks;      /* Number of SACK blocks                */
>         u16     user_mss;       /* mss requested by user in ioctl       */
>         u16     mss_clamp;      /* Maximal mss, negotiated at connection setup */
> +       u8      exp_opts_len;   /* length of buffer containing all exp
> +                                  options in format: kind length data */
> +       u8      exp_opts[TCP_EXPOP_MAXLEN];     /* experimental options */
>  };
>
>  static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
> @@ -295,6 +304,7 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
>         rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
>         rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
>         rx_opt->cookie_plus = 0;
> +       rx_opt->exp_opts_len = 0;
>  }
>
>  /* This is the max number of SACKS that we'll generate and process. It's safe
> @@ -315,6 +325,10 @@ struct tcp_request_sock {
>         u32                             rcv_isn;
>         u32                             snt_isn;
>         u32                             snt_synack; /* synack sent time */
> +
> +       u8 syn_expopts[TCP_EXPOP_MAXLEN];       /* experimental options
> +                                                  received with SYNACK */
> +       u8 syn_expopts_len;
>  };
>
>  static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
> @@ -406,6 +420,17 @@ struct tcp_sock {
>         u32     snd_up;         /* Urgent pointer               */
>
>         u8      keepalive_probes; /* num of allowed keep alive probes   */
> +
> +       /* for raw access to experimental options */
> +       struct {
> +               u8 *conf;       /* lazy allocation of TCP_EXPOP_MAXLEN bytes
> +                                  for raw access to experimental options */
> +               u8 conf_len;    /* bytes actually used for experimental opts */
> +               u8 *syn;        /* experimental options received with SYN,
> +                                  allocated only if received */
> +               u8 syn_len;     /* bytes of experimental options actually
> +                                  received with SYN */
> +       } exp_opts;
>  /*
>   *      Options received (usually on last packet, some only on SYN packets).
>   */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 1f000ff..b63d5c9 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -170,6 +170,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #define TCPOPT_TIMESTAMP       8       /* Better RTT estimations/PAWS */
>  #define TCPOPT_MD5SIG          19      /* MD5 Signature (RFC2385) */
>  #define TCPOPT_COOKIE          253     /* Cookie extension (experimental) */
> +#define TCPOPT_EXP253          253     /* TCP experimental option 253 */
> +#define TCPOPT_EXP254          254     /* TCP experimental option 254 */
>  #define TCPOPT_EXP             254     /* Experimental */
>  /* Magic number to be after the option value for sharing TCP
>   * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
> @@ -180,6 +182,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>   *     TCP option lengths
>   */
>
> +#define TCPOLEN_MAX_ANYEXP     40
>  #define TCPOLEN_MSS            4
>  #define TCPOLEN_WINDOW         3
>  #define TCPOLEN_SACK_PERM      2
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 5f64193..e7e4947 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -423,6 +423,12 @@ void tcp_init_sock(struct sock *sk)
>         sk->sk_sndbuf = sysctl_tcp_wmem[1];
>         sk->sk_rcvbuf = sysctl_tcp_rmem[1];
>
> +       /* memory for raw access to experimental options is allocated lazily */
> +       tp->exp_opts.conf = NULL;
> +       tp->exp_opts.conf_len = 0;
> +       tp->exp_opts.syn = NULL;
> +       tp->exp_opts.syn_len = 0;
> +
>         local_bh_disable();
>         sock_update_memcg(sk);
>         sk_sockets_allocated_inc(sk);
> @@ -2376,6 +2382,53 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>
>         /* These are data/string values, all the others are ints */
>         switch (optname) {
> +       case TCP_EXPOPTS: {
> +               u8 conf[TCP_EXPOP_MAXLEN];
> +
> +               if (optlen > TCP_EXPOP_MAXLEN || (optlen < 4 && optlen > 0) ||
> +                   (optlen % 4 > 0))
> +                       return -EINVAL;
> +               if (optlen > 0 && !optval)
> +                       return -EINVAL;
> +
> +               /* filter for raw access to supported options */
> +               if (optlen) {
> +                       u8 i;
> +
> +                       if (copy_from_user(conf, optval, optlen))
> +                               return -EFAULT;
> +
> +                       i = 0;
> +                       while (i < optlen) {
> +                               if (conf[i] != TCPOPT_EXP253 &&
> +                                   conf[i] != TCPOPT_EXP254)
> +                                       return -EINVAL;
> +
> +                               if (i + 1 < optlen) {
> +                                       i += conf[i+1];
> +                                       if (i > optlen)
> +                                               return -EINVAL;
> +                               } else {
> +                                       return -EINVAL;
> +                               }
> +                       }
> +               }
> +
> +               lock_sock(sk);
> +               if (!optlen) {
> +                       tp->exp_opts.conf_len = 0;
> +                       release_sock(sk);
> +                       return 0;
> +               }
> +               if (!tp->exp_opts.conf) {
> +                       tp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
> +                                                   sk->sk_allocation);
> +               }
> +               memcpy(tp->exp_opts.conf, conf, optlen);
> +               tp->exp_opts.conf_len = optlen;
> +               release_sock(sk);
> +               return err;
> +       }
>         case TCP_CONGESTION: {
>                 char name[TCP_CA_NAME_MAX];
>
> @@ -2947,6 +3000,63 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_USER_TIMEOUT:
>                 val = jiffies_to_msecs(icsk->icsk_user_timeout);
>                 break;
> +       case TCP_EXPOPTS: {
> +               u8 exp_opts_len;
> +
> +               if (get_user(len, optlen))
> +                       return -EFAULT;
> +               if (len < 0)
> +                       return -EINVAL;
> +
> +               exp_opts_len = tp->exp_opts.conf_len;
> +
> +               if (exp_opts_len > len)
> +                       return -EINVAL;
> +               if (put_user(exp_opts_len, optlen))
> +                       return -EFAULT;
> +               if (exp_opts_len && copy_to_user(optval, tp->exp_opts.conf,
> +                                                exp_opts_len))
> +                       return -EFAULT;
> +               return 0;
> +       }
> +       case TCP_RECV_EXPOPTS:
> +               if (get_user(len, optlen))
> +                       return -EFAULT;
> +               if (len < 0)
> +                       return -EINVAL;
> +
> +               if (len < tp->rx_opt.exp_opts_len)
> +                       return -EINVAL;
> +
> +               if (put_user(tp->rx_opt.exp_opts_len, optlen))
> +                       return -EFAULT;
> +               if (copy_to_user(optval, tp->rx_opt.exp_opts,
> +                                tp->rx_opt.exp_opts_len))
> +                       return -EFAULT;
> +               return 0;
> +       case TCP_RECV_SYN_EXPOPTS: {
> +               u8 exp_opts_len;
> +
> +               if (get_user(len, optlen))
> +                       return -EFAULT;
> +               if (len < 0)
> +                       return -EINVAL;
> +
> +               if (!tp->exp_opts.syn)
> +                       exp_opts_len = 0;
> +               else
> +                       exp_opts_len = tp->exp_opts.syn_len;
> +
> +               if (exp_opts_len > len)
> +                       return -EINVAL;
> +               if (put_user(exp_opts_len, optlen))
> +                       return -EFAULT;
> +               if (exp_opts_len && copy_to_user(optval, tp->exp_opts.syn,
> +                                                exp_opts_len)) {
> +                       return -EFAULT;
> +               }
> +               return 0;
> +       }
>         default:
>                 return -ENOPROTOOPT;
>         }
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d377f48..130d4f4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3726,11 +3726,32 @@ old_ack:
>         return 0;
>  }
>
> +static inline void tcp_parse_fastopen_cookie(int opcode,
> +               int opsize,
> +               const unsigned char *ptr,
> +               struct tcp_fastopen_cookie *foc,
> +               const struct tcphdr *th) {
> +       /* Fast Open option shares code 254 using a 16 bits magic number. It's
> +        * valid only in SYN or SYN-ACK with an even size.
> +        */
> +       if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
> +           get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC || foc == NULL ||
> +           !th->syn || (opsize & 1))
> +               return;
> +       foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
> +       if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
> +           foc->len <= TCP_FASTOPEN_COOKIE_MAX)
> +               memcpy(foc->val, ptr + 2, foc->len);
> +       else if (foc->len != 0)
> +               foc->len = -1;
> +}
> +
>  /* Look for tcp options. Normally only called on SYN and SYNACK packets.
>   * But, this can also be called on packets in the established flow when
>   * the fast version below fails.
>   */
> -void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *opt_rx,
> +void tcp_parse_options(const struct sk_buff *skb,
> +                      struct tcp_options_received *opt_rx,
>                        const u8 **hvpp, int estab,
>                        struct tcp_fastopen_cookie *foc)
>  {
> @@ -3740,6 +3761,7 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
>
>         ptr = (const unsigned char *)(th + 1);
>         opt_rx->saw_tstamp = 0;
> +       opt_rx->exp_opts_len = 0;
>
>         while (length > 0) {
>                 int opcode = *ptr++;
> @@ -3815,48 +3837,56 @@ void tcp_parse_options(const struct sk_buff *skb, struct tcp_options_received *o
>                                  */
>                                 break;
>  #endif
> -                       case TCPOPT_COOKIE:
> -                               /* This option is variable length.
> +                       case TCPOPT_EXP253:
> +                       case TCPOPT_EXP254:
> +                               /* First parse options into raw access area for
> +                                * experimental options. Then handle
> +                                * potential exploitations
>                                  */
> -                               switch (opsize) {
> -                               case TCPOLEN_COOKIE_BASE:
> -                                       /* not yet implemented */
> -                                       break;
> -                               case TCPOLEN_COOKIE_PAIR:
> -                                       /* not yet implemented */
> -                                       break;
> -                               case TCPOLEN_COOKIE_MIN+0:
> -                               case TCPOLEN_COOKIE_MIN+2:
> -                               case TCPOLEN_COOKIE_MIN+4:
> -                               case TCPOLEN_COOKIE_MIN+6:
> -                               case TCPOLEN_COOKIE_MAX:
> -                                       /* 16-bit multiple */
> -                                       opt_rx->cookie_plus = opsize;
> -                                       *hvpp = ptr;
> -                                       break;
> -                               default:
> -                                       /* ignore option */
> -                                       break;
> +                               if (opsize <= TCPOLEN_MAX_ANYEXP &&
> +                                   opsize >= 2 &&
> +                                   (opt_rx->exp_opts_len + opsize <=
> +                                    TCPOLEN_MAX_ANYEXP)) {
> +                                       opt_rx->exp_opts[
> +                                               opt_rx->exp_opts_len] = opcode;
> +                                       opt_rx->exp_opts[
> +                                               opt_rx->exp_opts_len + 1] =
> +                                               opsize;
> +                                       memcpy(opt_rx->exp_opts +
> +                                               opt_rx->exp_opts_len + 2, ptr,
> +                                               opsize - 2);
> +                                       opt_rx->exp_opts_len += opsize;
>                                 }
> -                               break;
>
> -                       case TCPOPT_EXP:
> -                               /* Fast Open option shares code 254 using a
> -                                * 16 bits magic number. It's valid only in
> -                                * SYN or SYN-ACK with an even size.
> -                                */
> -                               if (opsize < TCPOLEN_EXP_FASTOPEN_BASE ||
> -                                   get_unaligned_be16(ptr) != TCPOPT_FASTOPEN_MAGIC ||
> -                                   foc == NULL || !th->syn || (opsize & 1))
> -                                       break;
> -                               foc->len = opsize - TCPOLEN_EXP_FASTOPEN_BASE;
> -                               if (foc->len >= TCP_FASTOPEN_COOKIE_MIN &&
> -                                   foc->len <= TCP_FASTOPEN_COOKIE_MAX)
> -                                       memcpy(foc->val, ptr + 2, foc->len);
> -                               else if (foc->len != 0)
> -                                       foc->len = -1;
> +                               /* handle potential exploitations */
> +                               if (opcode == TCPOPT_COOKIE) {
> +                                       /* This option is variable length. */
> +                                       switch (opsize) {
> +                                       case TCPOLEN_COOKIE_BASE:
> +                                               /* not yet implemented */
> +                                               break;
> +                                       case TCPOLEN_COOKIE_PAIR:
> +                                               /* not yet implemented */
> +                                               break;
> +                                       case TCPOLEN_COOKIE_MIN+0:
> +                                       case TCPOLEN_COOKIE_MIN+2:
> +                                       case TCPOLEN_COOKIE_MIN+4:
> +                                       case TCPOLEN_COOKIE_MIN+6:
> +                                       case TCPOLEN_COOKIE_MAX:
> +                                               /* 16-bit multiple */
> +                                               opt_rx->cookie_plus = opsize;
> +                                               *hvpp = ptr;
> +                                               break;
> +                                       default:
> +                                               /* ignore option */
> +                                               break;
> +                                       }
> +                               } else {
> +                                       tcp_parse_fastopen_cookie(opcode,
> +                                                                 opsize, ptr,
> +                                                                 foc, th);
> +                               }
>                                 break;
> -
>                         }
>                         ptr += opsize-2;
>                         length -= opsize;
> @@ -3888,6 +3918,9 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
>                                    const struct tcphdr *th,
>                                    struct tcp_sock *tp, const u8 **hvpp)
>  {
> +       /* required if exp options are not used anymore by the counter part */
> +       tp->rx_opt.exp_opts_len = 0;
> +
>         /* In the spirit of fast parsing, compare doff directly to constant
>          * values.  Because equality is used, short doff can be ignored here.
>          */
> @@ -5806,6 +5839,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>                         }
>                 }
>
> +               if (unlikely(tp->rx_opt.exp_opts_len > 0)) {
> +                       tp->exp_opts.syn = kzalloc(tp->rx_opt.exp_opts_len,
> +                                                  sk->sk_allocation);
> +                       tp->exp_opts.syn_len = tp->rx_opt.exp_opts_len;
> +                       memcpy(tp->exp_opts.syn, &tp->rx_opt.exp_opts,
> +                              tp->rx_opt.exp_opts_len);
> +               }
> +
>                 smp_mb();
>
>                 tcp_finish_connect(sk, skb);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 00a748d..2f66bd5 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1321,6 +1321,16 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>         tmp_opt.user_mss  = tp->rx_opt.user_mss;
>         tcp_parse_options(skb, &tmp_opt, &hash_location, 0, NULL);
>
> +       /* for raw access to experimental options in SYN packet */
> +       tcp_rsk(req)->syn_expopts_len = tmp_opt.exp_opts_len;
> +       if (tcp_rsk(req)->syn_expopts_len) {
> +               /* transport experimental options via request socket to big
> +                * socket
> +                */
> +               memcpy(tcp_rsk(req)->syn_expopts, tmp_opt.exp_opts,
> +                      tcp_rsk(req)->syn_expopts_len);
> +       }
> +
>         if (tmp_opt.cookie_plus > 0 &&
>             tmp_opt.saw_tstamp &&
>             !tp->rx_opt.cookie_out_never &&
> @@ -1978,6 +1988,10 @@ void tcp_v4_destroy_sock(struct sock *sk)
>                 tp->cookie_values = NULL;
>         }
>
> +       /* buffers for raw access to experimental options */
> +       kfree(tp->exp_opts.conf);
> +       kfree(tp->exp_opts.syn);
> +
>         /* If socket is aborted during connect operation */
>         tcp_free_fastopen_req(tp);
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 6ff7f10..dc25875 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -466,6 +466,23 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
>
>                 newtp->urg_data = 0;
>
> +               if (tcp_rsk(req)->syn_expopts_len) {
> +                       newtp->exp_opts.syn_len =
> +                                       tcp_rsk(req)->syn_expopts_len;
> +                       newtp->exp_opts.syn = kzalloc(newtp->exp_opts.syn_len,
> +                                                     GFP_ATOMIC);
> +                       memcpy(newtp->exp_opts.syn, tcp_rsk(req)->syn_expopts,
> +                              newtp->exp_opts.syn_len);
> +               }
> +
> +               if (oldtp->exp_opts.conf_len > 0) {
> +                       newtp->exp_opts.conf_len = oldtp->exp_opts.conf_len;
> +                       newtp->exp_opts.conf = kzalloc(TCP_EXPOP_MAXLEN,
> +                                                      GFP_ATOMIC);
> +                       memcpy(newtp->exp_opts.conf, oldtp->exp_opts.conf,
> +                              oldtp->exp_opts.conf_len);
> +               }
> +
>                 if (sock_flag(newsk, SOCK_KEEPOPEN))
>                         inet_csk_reset_keepalive_timer(newsk,
>                                                        keepalive_time_when(newtp));
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d046326..8d7cf51 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -385,6 +385,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
>  #define OPTION_MD5             (1 << 2)
>  #define OPTION_WSCALE          (1 << 3)
>  #define OPTION_COOKIE_EXTENSION        (1 << 4)
> +#define OPTION_EXP             (1 << 5)
>  #define OPTION_FAST_OPEN_COOKIE        (1 << 8)
>
>  struct tcp_out_options {
> @@ -581,6 +582,12 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
>                 }
>                 ptr += (foc->len + 3) >> 2;
>         }
> +       if (unlikely(OPTION_EXP & options && tp->exp_opts.conf_len > 0)) {
> +               __u8 *p = (__u8 *) ptr;
> +               memcpy(ptr, tp->exp_opts.conf, tp->exp_opts.conf_len);
> +               p += tp->exp_opts.conf_len;
> +               ptr = (__be32 *) p;
> +       }
>  }
>
>  /* Compute TCP options for SYN packets. This is not the final
> @@ -693,6 +700,11 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
>                         remaining -= need;
>                 }
>         }
> +       if (unlikely(tp->exp_opts.conf_len > 0 &&
> +                    tp->exp_opts.conf_len <= remaining)) {
> +               opts->options |= OPTION_EXP;
> +               remaining -= tp->exp_opts.conf_len;
> +       }
>         return MAX_TCP_OPTION_SPACE - remaining;
>  }
>
> @@ -747,6 +759,11 @@ static unsigned int tcp_synack_options(struct sock *sk,
>                 if (unlikely(!ireq->tstamp_ok))
>                         remaining -= TCPOLEN_SACKPERM_ALIGNED;
>         }
> +       if (unlikely(tcp_sk(sk)->exp_opts.conf_len > 0 &&
> +                    tcp_sk(sk)->exp_opts.conf_len <= remaining)) {
> +               opts->options |= OPTION_EXP;
> +               remaining -= tcp_sk(sk)->exp_opts.conf_len;
> +       }
>
>         /* Similar rationale to tcp_syn_options() applies here, too.
>          * If the <SYN> options fit, the same options should fit now!
> @@ -782,38 +799,44 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
>  {
>         struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
>         struct tcp_sock *tp = tcp_sk(sk);
> -       unsigned int size = 0;
> +       unsigned remaining = MAX_TCP_OPTION_SPACE;
>         unsigned int eff_sacks;
>
>  #ifdef CONFIG_TCP_MD5SIG
>         *md5 = tp->af_specific->md5_lookup(sk, sk);
>         if (unlikely(*md5)) {
>                 opts->options |= OPTION_MD5;
> -               size += TCPOLEN_MD5SIG_ALIGNED;
> +               remaining -= TCPOLEN_MD5SIG_ALIGNED;
>         }
>  #else
>         *md5 = NULL;
>  #endif
>
> -       if (likely(tp->rx_opt.tstamp_ok)) {
> +       if (likely(tp->rx_opt.tstamp_ok &&
> +                  remaining >= TCPOLEN_TSTAMP_ALIGNED)) {
>                 opts->options |= OPTION_TS;
>                 opts->tsval = tcb ? tcb->when : 0;
>                 opts->tsecr = tp->rx_opt.ts_recent;
> -               size += TCPOLEN_TSTAMP_ALIGNED;
> +               remaining -= TCPOLEN_TSTAMP_ALIGNED;
>         }
>
>         eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
>         if (unlikely(eff_sacks)) {
> -               const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
>                 opts->num_sack_blocks =
>                         min_t(unsigned int, eff_sacks,
>                               (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
>                               TCPOLEN_SACK_PERBLOCK);
> -               size += TCPOLEN_SACK_BASE_ALIGNED +
> +               remaining -= TCPOLEN_SACK_BASE_ALIGNED +
>                         opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
>         }
>
> -       return size;
> +       if (unlikely(tp->exp_opts.conf_len > 0 &&
> +                    tp->exp_opts.conf_len <= remaining)) {
> +               opts->options |= OPTION_EXP;
> +               remaining -= tp->exp_opts.conf_len;
> +       }
> +
> +       return MAX_TCP_OPTION_SPACE - remaining;
>  }
>
>
> --
> 1.7.12.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: TCP and reordering
From: David Woodhouse @ 2012-11-28 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354100552.14302.78.camel@edumazet-glaptop>

On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > one SACK and one TS reorder. So my connection probably does even less
> > reordering than I thought, and thus isn't particularly relevant to this
> > conversation. I'll shut up now and go back to playing with ATM.
> 
> But you are the receiver. A receiver should not increase these counters.

I checked it on the sending side.

-- 
dwmw2


^ permalink raw reply

* Re: Specific question about packet dropping
From: Shan Wei @ 2012-11-28 11:45 UTC (permalink / raw)
  To: Javier Domingo; +Cc: netdev
In-Reply-To: <CALZVapnQrH9YvCOh8OnGFs1V28nPfX4p=GK=M0VnqHjZCU8=1Q@mail.gmail.com>

Javier Domingo said, at 2012/11/28 18:40:
> Hi,
> 
> Where are packets dropped?
> 
> I reading code I think just found:
> - In the NIC
> - In the netif_receive_skb
> 
> Is there any other place I can have missed?

There are many places; the following list is not exhaustive.

For the receiver:
1. NIC
2. __netif_receive_skb (unknown packet type)
3. ip_rcv (abnormal packet, routing reasons, ...)
4. tcp/udp (unknown port, receive buffer limit, ...)

For the sender:
1. tcp/udp (send buffer limit)
2. ip (routing reasons, ...)
3. arp (unresolved_discards)
4. qdisc (queue limit)

Best Regards
Shan Wei

> 
> Javier Domingo

^ permalink raw reply

* [PATCH net] net/mlx4_en: Can set maxrate only for TC0
From: Amir Vadai @ 2012-11-28 11:43 UTC (permalink / raw)
  To: David S. Miller; +Cc: Or Gerlitz, Oren Duer, Amir Vadai, netdev

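Fix a typo in memcpy: sizeof(*priv->maxrate) is the size of a single
element, so only the rate for TC0 was copied. Use sizeof(priv->maxrate)
to copy the whole array.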
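
The difference between the two sizeof expressions in this patch can be
sketched in userspace C. This is only an illustration: it assumes maxrate
is an array of eight 16-bit entries (one per traffic class), and
struct fake_priv, NUM_TCS, copy_buggy(), copy_fixed() and maxrate_demo()
are hypothetical names, not the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_TCS 8 /* assumed number of traffic classes */

/* Hypothetical stand-in for the driver's private struct. */
struct fake_priv {
	uint16_t maxrate[NUM_TCS];
};

/* Old form: sizeof(*priv->maxrate) is the size of ONE array element. */
static size_t copy_buggy(struct fake_priv *priv, const uint16_t *tmp)
{
	memcpy(priv->maxrate, tmp, sizeof(*priv->maxrate));
	return sizeof(*priv->maxrate);
}

/* New form: sizeof(priv->maxrate) is the size of the WHOLE array. */
static size_t copy_fixed(struct fake_priv *priv, const uint16_t *tmp)
{
	memcpy(priv->maxrate, tmp, sizeof(priv->maxrate));
	return sizeof(priv->maxrate);
}

/* Small self-check showing which entries each variant updates. */
static void maxrate_demo(void)
{
	const uint16_t tmp[NUM_TCS] = {10, 20, 30, 40, 50, 60, 70, 80};
	struct fake_priv a = {{0}};
	struct fake_priv b = {{0}};

	copy_buggy(&a, tmp);
	assert(a.maxrate[0] == 10); /* TC0 is updated */
	assert(a.maxrate[1] == 0);  /* TC1..TC7 are silently left alone */

	copy_fixed(&b, tmp);
	assert(b.maxrate[7] == 80); /* every traffic class is updated */
}
```

With the old expression only the first entry is written, which matches
the symptom in the subject line.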

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
index 5d367958..b799ab12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
@@ -237,7 +237,7 @@ static int mlx4_en_dcbnl_ieee_setmaxrate(struct net_device *dev,
 	if (err)
 		return err;
 
-	memcpy(priv->maxrate, tmp, sizeof(*priv->maxrate));
+	memcpy(priv->maxrate, tmp, sizeof(priv->maxrate));
 
 	return 0;
 }
-- 
1.7.8.2

^ permalink raw reply related

* Re: [PATCH 1/2] smsc75xx: refactor entering suspend modes
From: Bjørn Mork @ 2012-11-28 11:42 UTC (permalink / raw)
  To: Steve Glendinning; +Cc: Alan Stern, netdev, linux-usb, David Miller
In-Reply-To: <CAKh2mn5UGRN5U0w_bj25e4VnCwr9kK3f798nHSwmF4WE4dORMw@mail.gmail.com>

Steve Glendinning <steve@shawell.net> writes:

> Hi Bjorn,
>
> On 28 November 2012 09:31, Bjørn Mork <bjorn@mork.no> wrote:
>>
>> Remote wakeup will not be enabled on system suspend unless the user (or
>> a userspace program on the users behalf) has requested it.
>
> If a user types "ethtool -s eth2 wol p" they *are* explicitly
> requesting the ethernet device to bring the system out of suspend, so
> I think the ethernet driver should set the feature automatically.
>
> from drivers/base/power/wakeup.c:
>
>  * By default, most devices should leave wakeup disabled.  The exceptions are
>  * devices that everyone expects to be wakeup sources: keyboards, power buttons,
>  * possibly network interfaces, etc.

Right.  That seems logical.  But the ethtool setting should still
probably be reflected in the device attributes so that the user can see
them there?

As a simple test of what other ethernet drivers do, I tried
this on the e1000e adapter in my laptop.  Initially:

nemi:/tmp# grep .  /sys/bus/pci/devices/0000:00:19.0/power/*
/sys/bus/pci/devices/0000:00:19.0/power/async:enabled
grep: /sys/bus/pci/devices/0000:00:19.0/power/autosuspend_delay_ms: Input/output error
/sys/bus/pci/devices/0000:00:19.0/power/control:auto
/sys/bus/pci/devices/0000:00:19.0/power/runtime_active_kids:0
/sys/bus/pci/devices/0000:00:19.0/power/runtime_active_time:266378772
/sys/bus/pci/devices/0000:00:19.0/power/runtime_enabled:enabled
/sys/bus/pci/devices/0000:00:19.0/power/runtime_status:active
/sys/bus/pci/devices/0000:00:19.0/power/runtime_suspended_time:568333816
/sys/bus/pci/devices/0000:00:19.0/power/runtime_usage:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup:disabled
nemi:/tmp# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 2
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: pumbg
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes


Enabling WOL:

nemi:/tmp# ethtool -s eth0 wol p
nemi:/tmp# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 2
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: pumbg
        Wake-on: p
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
nemi:/tmp# grep .  /sys/bus/pci/devices/0000:00:19.0/power/*
/sys/bus/pci/devices/0000:00:19.0/power/async:enabled
grep: /sys/bus/pci/devices/0000:00:19.0/power/autosuspend_delay_ms: Input/output error
/sys/bus/pci/devices/0000:00:19.0/power/control:auto
/sys/bus/pci/devices/0000:00:19.0/power/runtime_active_kids:0
/sys/bus/pci/devices/0000:00:19.0/power/runtime_active_time:266414488
/sys/bus/pci/devices/0000:00:19.0/power/runtime_enabled:enabled
/sys/bus/pci/devices/0000:00:19.0/power/runtime_status:active
/sys/bus/pci/devices/0000:00:19.0/power/runtime_suspended_time:568333816
/sys/bus/pci/devices/0000:00:19.0/power/runtime_usage:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup:enabled
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_abort_count:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_active:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_active_count:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_count:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_expire_count:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_last_time_ms:834745779
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_max_time_ms:0
/sys/bus/pci/devices/0000:00:19.0/power/wakeup_total_time_ms:0


So this driver does set wakeup:enabled, and if this had been a USB
device then the USB core would have set the Remote Wakeup feature on
suspend without the driver having to do anything special in its
driver->suspend function.

Looking at the different ethernet drivers, the normal way to do this
seems to be something like this in their .set_wol implementation:

        device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);


where "adapter" is a netdev_priv private struct, "pdev" is a pci device
and "wol" is a u32.  I don't see any problem doing the same for USB
network devices implementing ethtool "set_wol".

Note that according to Documentation/power/devices.txt:

"Device drivers, however, are not supposed to call device_set_wakeup_enable()
 directly in any case."

which I guess really means that wakeup:enabled is supposed to be user
controlled and not driver controlled.  I assume the ethtool userspace
interface makes the above void, as drivers implementing the ethtool
interface will have to call device_set_wakeup_enable() to synchronize
ethtool and sysfs settings.

Does this make sense?


Bjørn

^ permalink raw reply

* Re: [PATCH net-next] net: move inet_dport/inet_num in sock_common
From: Eric Dumazet @ 2012-11-28 11:27 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Ling Ma
In-Reply-To: <1354075918.14302.77.camel@edumazet-glaptop>

On Tue, 2012-11-27 at 20:12 -0800, Eric Dumazet wrote:

> The point of having the cond jump on sk_hash/hash was that in one
> compare, we catch the yes/no status with 99.999999 % success rate.
> 
> All the following compares are predicted by the cpu and essentially are
> free. Adding the AND or OR will basically have the same cpu cost.
> 
> If we wanted to do a full test of all tuple fields and a single
> conditional jump, we would not have to include hash test at all.
> 
> (If the 4-tuple matches, then sk_hash/hash value _must_ be the same by
> definition)

What I am going to do is to remove the hash compare from the macros so
that we can use likely()/unlikely() to explicitly give hints to the
compiler.

The hash compare can be omitted in the validation done after the
atomic_inc_not_zero() [ done to make sure keys didn't change ]

begin:
        sk_nulls_for_each_rcu(sk, node, &head->chain) {
                if (sk->sk_hash != hash)
                        continue;
                if (likely(INET_MATCH(sk, net, acookie,
                                      saddr, daddr, ports, dif))) {
                        if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
                                goto begintw;
                        if (unlikely(!INET_MATCH(sk, net, acookie,
                                                 saddr, daddr, ports, dif))) {
                                sock_put(sk);
                                goto begin;
                        }
                        goto out;
                }
        }
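
The lookup pattern above (reject on the cached hash first, then compare
the full tuple) can be modeled in userspace. This is a minimal sketch
under assumptions: struct entry, tuple_hash(), tuple_match(), lookup()
and lookup_demo() are hypothetical names, the hash is a toy function, and
the RCU traversal and refcount revalidation of the kernel loop are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch hints; plain expressions on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical, simplified entry: a cached hash plus the 4-tuple. */
struct entry {
	uint32_t hash;
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	struct entry *next;
};

/* Toy hash for this sketch only; not the kernel's hash function. */
static uint32_t tuple_hash(uint32_t saddr, uint32_t daddr,
			   uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr * 2654435761u;

	h ^= daddr * 2246822519u;
	h ^= ((uint32_t)sport << 16) | dport;
	return h;
}

/* Full comparison of every tuple field. */
static int tuple_match(const struct entry *e, uint32_t saddr, uint32_t daddr,
		       uint16_t sport, uint16_t dport)
{
	return e->saddr == saddr && e->daddr == daddr &&
	       e->sport == sport && e->dport == dport;
}

static struct entry *lookup(struct entry *head, uint32_t saddr,
			    uint32_t daddr, uint16_t sport, uint16_t dport)
{
	uint32_t hash = tuple_hash(saddr, daddr, sport, dport);
	struct entry *e;

	for (e = head; e != NULL; e = e->next) {
		/* One cheap compare rejects almost every non-matching entry. */
		if (e->hash != hash)
			continue;
		/* Once the hash matched, the full compare usually succeeds. */
		if (likely(tuple_match(e, saddr, daddr, sport, dport)))
			return e;
	}
	return NULL;
}

/* A forced hash collision: the full compare must still reject it. */
static void lookup_demo(void)
{
	struct entry c = { 0, 1, 2, 3, 5, NULL };

	c.hash = tuple_hash(1, 2, 3, 4); /* same hash, different dport */
	assert(lookup(&c, 1, 2, 3, 4) == NULL);
	assert(unlikely(lookup(&c, 1, 2, 3, 5) == NULL)); /* hash differs */
}
```

The hash test is kept outside the tuple compare so that the hint on the
second branch can be given explicitly, as described above.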
 

^ permalink raw reply

* Re: [net-next RFC v2] net_cls: traffic counter based on classification control cgroup
From: Alexey Perevalov @ 2012-11-28 11:18 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Glauber Costa, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <50B5C6AB.6040208-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org>

On 11/28/2012 12:09 PM, Daniel Wagner wrote:
> Hi Alexey,
>
> On 28.11.2012 06:21, Alexey Perevalov wrote:
>>>> Daniel Wagner is working on something a lot similar.
>>> Yes, basically what I try to do is explained by this excellent article
>>>
>>> https://lwn.net/Articles/523058/
>> I read articles and agreed with aspects.
>> But problem of selecting preferred network for application can be solved
>> using netprio cgroup.
> Choosing the which network to connect to is job of a connection manager.
> I don't see how a cgroup controller can help you there. I guess I do not
> understand your statement. Can you rephrase please?
I meant choosing the preferred network interface for an application's
traffic. It can be done with a metric in the routing table or (as I
wrote) with the netprio cgroup.

>
>>> The second implementation is adding a new iptables matcher which matches
>>> on LSM contexts. Then you can do something like this:
>>>
>>> iptables -t mangle -A OUTPUT -m secmark --secctx
>>> unconfined_u:unconfined_r:foo_t:s0-s0:c0.c1023 -j MARK --set-mark 200
>> As I understand in LSM context it works for egress and ingress.
> Yes, I am using CONNMARK in conjunction with the the above LSM context
> matcher. I am still playing around, but it looks quite promising.
>
>>>> 2) When Daniel exposed his use case to me, it gave me the impression
>>>> that "counting traffic" is something that is totally doable by having a
>>>> dedicated interface in a separate namespace. Basically, we already count
>>>> traffic (rx and tx) for all interfaces anyway, so it suggests that it
>>>> could be an interesting way to see the problem.
>>> Moving applications into separate net namespaces is for sure a valid
>>> solution.
>>> Though there is a one drawback in this approach. The namespaces need
>>> to be
>>> attached to a bridge and then some NATting. That means every application
>>> would get it's own IP address. This might be okay for your certain use
>>> cases but I am still trying to work around this. Glauber and I had some
>>> discussion about this and he suggested to allow the physical networking
>>> device to be attached to several namespaces (e.g. via macvlan). Every
>>> namespace would get the same IP address. Unfortunately, this would
>>> result in
>>> the same mess as several physical devices on a network get the same
>>> IP address assigned.
>> Is I truly understand what to make statistics works we need to put
>> process to separate namespace?
> If a process lives in its own network namespace then you can
> count the packets/bytes on the network interface level. The side effect
> is that is that each namespace is obviously a new network and has to be
> treated as such.
>
>> Approach to keep counter in cgroup hasn't such side effects, but it has
>> another ).
> cgroups are not for free. Currently a lot of effort is put into getting
> a reasonable performance and behavior into cgroups. In this situation
> any new feature added to cgroups will need a pretty good justification
> why it is needed and why it cant be done with existing infrastructure.
I want to make sure I understand your proposed design:

    +------------------------------------------------+
    |network namespace1: pid1, pid2,...              |
    |                                                |        +---------------------------+
    |   network stack,          network iface        |        |                           |
    |       nf hooks                                 +------->| physical network          |
    +------------------------------------------------+        | interface                 |
                                                              |                           |
                                                              |                           |
    +------------------------------------------------+        |                           |
    |network namespace2: pid1, pid2,...              |        |                           |
    |                                                +------->|                           |
    |   network stack,          network iface        |        |                           |
    |       nf hooks                                 |        |                           |
    +------------------------------------------------+        |                           |
                                                              +---------------------------+
    ...                                                          ^
    +------------------------------------------------+           |
    |network namespace3: pid1, pid2,...              |           |
    |                                                |           |
    |   network stack,          network iface        +-----------+
    |       nf hooks                                 |
    +------------------------------------------------+


Question: when one physical networking device is connected to several
namespaces, is it possible to tune the network packet scheduler (qdisc
instance) for that one physical interface using the traffic control tool?
I have the same question about netfilter hooks. From reading the code,
it seems to me that nf hooks are now registered per network stack.

The cgroup framework has a notification mechanism based on eventfd. For
example, I can just send a notification to user space about network
activity. Is there a similar mechanism in the standard infrastructure to
notify user space apps about the activity of a monitored application
(maybe nf_queue)?
>
> Here is some background information on the state of cgroups:
>
> http://thread.gmane.org/gmane.linux.kernel.containers/23698
Thank you, I have read it.
It seems to me my patch has a technical defect with inherited groups.

>
> cheers,
> daniel
>


-- 
BR,
Alexey

^ permalink raw reply

* Re: TCP and reordering
From: Eric Dumazet @ 2012-11-28 11:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev
In-Reply-To: <1354093703.21562.23.camel@shinybook.infradead.org>

On Wed, 2012-11-28 at 09:08 +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 00:22 -0800, Vijay Subramanian wrote:
> > 
> > I don't believe reordering is tracked on the receiver side but on the
> > sender, there are SNMB_MIB items.
> > They can be tracked and can be viewed using nstat/netstat
> > 
> > # nstat -az | grep -i reorder
> > TcpExtTCPFACKReorder            0                  0.0
> > TcpExtTCPSACKReorder            0                  0.0
> > TcpExtTCPRenoReorder            0                  0.0
> > TcpExtTCPTSReorder              0                  0.0
> 
> Thanks. For me after a 64MiB download, I have an increase of one FACK,
> one SACK and one TS reorder. So my connection probably does even less
> reordering than I thought, and thus isn't particularly relevant to this
> conversation. I'll shut up now and go back to playing with ATM.

But you are the receiver. A receiver should not increase these counters.

^ permalink raw reply

* RE: [PATCH 1/3] net: stmmac: change GMAC control register for SGMII
From: Byungho An @ 2012-11-28 10:57 UTC (permalink / raw)
  To: 'Giuseppe CAVALLARO'
  Cc: davem, jeffrey.t.kirsher, netdev, kgene.kim, linux-kernel
In-Reply-To: <50B344FC.4070405@st.com>

On 11/26/2012 07:31 PM, Giuseppe CAVALLARO wrote:
> On 11/23/2012 10:04 AM, Byungho An wrote:
> >
> > This patch changes GMAC control register (TC(Transmit
> > Configuration) and PS(Port Selection) bit for SGMII.
> > In case of SGMII, TC bit is '1' and PS bit is 0.
> 
> IMO this new support that should be released for net-next and further 
> effort is actually needed.
>
OK, I see, but if possible I would like the new features included in
this patch to be supported from v3.8.
 
> The availability of the PCS registers is given by looking at the HW 
> feature register. In fact, these are optional registers.
> I don't want to break the compatibility with old chips.
> 
Does that mean old chips don't have this bit or this register? If so, how
about using a compatible string in the DT blob, such as snps,dwmac-3.70a,
and only reading this bit and this register in that case?

> I do not see why we have to use Kconfig macro to select ANE etc (as 
> you do in your patches).
OK. I agree with you.

> The driver could directly manage the phy device by itself if possible 
> and the stmmac_init_phy should be reworked.
> 
Could you explain in more detail? As I understand it, PHY auto-negotiation
can be enabled once the ANE bit is set on the MAC side. If I'm wrong, let
me know. Are you saying that both MAC and PHY auto-negotiation can be
managed in stmmac_init_phy?

> There are several things that need to be implemented. For example:
> 
> The ISR (e.g. priv->hw->mac->host_irq_status) should be able to manage 
> these new interrupts.
I think there would be two additional interrupts: "PCS Auto-Negotiation
Complete" and "PCS Link Status Changed". These two interrupts are added to
"stmmac_interrupt". In my opinion, no specific processing is needed for
these two IRQs. What do you think?

> The code has to be able to maintain the user interface.
> For example if you want to enable ANE or manage Advertisement caps.
> 
Do you mean the command line, other network commands (e.g. ifconfig), or
ioctl? I don't understand exactly which user interface you mean. Could you
recommend a method for the user interface?

> > Signed-off-by: Byungho An <bh74.an@samsung.com>
> > ---
> 
> [snip]
> 
> > +	if (priv->phydev->interface == PHY_INTERFACE_MODE_SGMII) {
> > +		value = readl(priv->ioaddr);
> > +		/* GMAC_CONTROL_TC : transmit config in RGMII/SGMII */
> > +		value |= 0x1000000;
> > +		/* GMAC_CONTROL_PS : Port Selection for GMII */
> > +		value &= ~(0x8000);
> > +		writel(value, priv->ioaddr);
> > +	}
> > +
> 
> 
> This parts of code have to be moved in 
> drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
> 
OK.

> Pls, do not use value |= 0x1000000 but provide the appropriate defines.
> 
OK.
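
Something like the following is what I understand by "appropriate
defines". It is only a sketch: the macro names and the helper
gmac_control_for_sgmii() are placeholders chosen for illustration, and
the values are simply the two literals from the hunk above (bit 24 for
TC, bit 15 for PS).

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder names; the values are the literals used in the hunk. */
#define GMAC_CONTROL_TC 0x01000000u /* Transmit Configuration in RGMII/SGMII */
#define GMAC_CONTROL_PS 0x00008000u /* Port Selection */

/* Compute the control register value for SGMII: TC set, PS cleared. */
static uint32_t gmac_control_for_sgmii(uint32_t value)
{
	value |= GMAC_CONTROL_TC;
	value &= ~GMAC_CONTROL_PS;
	return value;
}

/* Small self-check of the two bit operations. */
static void gmac_control_demo(void)
{
	/* Starting from PS set and TC clear, the bits end up swapped. */
	assert(gmac_control_for_sgmii(GMAC_CONTROL_PS) == GMAC_CONTROL_TC);
	/* Unrelated bits are preserved. */
	assert(gmac_control_for_sgmii(0x1u) == (GMAC_CONTROL_TC | 0x1u));
}
```

In the driver the result would be written back with writel(), as in the
original hunk.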

> >   	/* Request the IRQ lines */
> >   	ret = request_irq(dev->irq, stmmac_interrupt,
> >   			 IRQF_SHARED, dev->name, dev);
> >
Thank you.
Byungho An.

^ permalink raw reply
