Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next 10/15] i40e: ensure reset occurs when disabling VF
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

It is possible although rare that we may not reset when
i40e_vc_disable_vf() is called. This can lead to some weird
circumstances with some values not being properly set. Modify
i40e_reset_vf() to return a code indicating whether it reset or not.

Now, i40e_vc_disable_vf() can wait until a reset actually occurs. If it
fails to free up within a reasonable time frame we'll display a warning
message.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 42 +++++++++++++++++-----
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |  4 +--
 2 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 94ee243f110e..7742cf3d38d9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -156,12 +156,28 @@ void i40e_vc_notify_vf_reset(struct i40e_vf *vf)
  * i40e_vc_disable_vf
  * @vf: pointer to the VF info
  *
- * Disable the VF through a SW reset
+ * Disable the VF through a SW reset.
  **/
 static inline void i40e_vc_disable_vf(struct i40e_vf *vf)
 {
+	int i;
+
 	i40e_vc_notify_vf_reset(vf);
-	i40e_reset_vf(vf, false);
+
+	/* We want to ensure that an actual reset occurs initiated after this
+	 * function was called. However, we do not want to wait forever, so
+	 * we'll give a reasonable time and print a message if we failed to
+	 * ensure a reset.
+	 */
+	for (i = 0; i < 20; i++) {
+		if (i40e_reset_vf(vf, false))
+			return;
+		usleep_range(10000, 20000);
+	}
+
+	dev_warn(&vf->pf->pdev->dev,
+		 "Failed to initiate reset for VF %d after 200 milliseconds\n",
+		 vf->vf_id);
 }
 
 /**
@@ -1051,9 +1067,9 @@ static void i40e_cleanup_reset_vf(struct i40e_vf *vf)
  * @vf: pointer to the VF structure
  * @flr: VFLR was issued or not
  *
- * reset the VF
+ * Returns true if the VF is reset, false otherwise.
  **/
-void i40e_reset_vf(struct i40e_vf *vf, bool flr)
+bool i40e_reset_vf(struct i40e_vf *vf, bool flr)
 {
 	struct i40e_pf *pf = vf->pf;
 	struct i40e_hw *hw = &pf->hw;
@@ -1061,9 +1077,11 @@ void i40e_reset_vf(struct i40e_vf *vf, bool flr)
 	u32 reg;
 	int i;
 
-	/* If VFs have been disabled, there is no need to reset */
+	/* If the VFs have been disabled, this means something else is
+	 * resetting the VF, so we shouldn't continue.
+	 */
 	if (test_and_set_bit(__I40E_VF_DISABLE, pf->state))
-		return;
+		return false;
 
 	i40e_trigger_vf_reset(vf, flr);
 
@@ -1100,6 +1118,8 @@ void i40e_reset_vf(struct i40e_vf *vf, bool flr)
 
 	i40e_flush(hw);
 	clear_bit(__I40E_VF_DISABLE, pf->state);
+
+	return true;
 }
 
 /**
@@ -1111,8 +1131,10 @@ void i40e_reset_vf(struct i40e_vf *vf, bool flr)
  * VF, then do all the waiting in one chunk, and finally finish restoring each
  * VF after the wait. This is useful during PF routines which need to reset
  * all VFs, as otherwise it must perform these resets in a serialized fashion.
+ *
+ * Returns true if any VFs were reset, and false otherwise.
  **/
-void i40e_reset_all_vfs(struct i40e_pf *pf, bool flr)
+bool i40e_reset_all_vfs(struct i40e_pf *pf, bool flr)
 {
 	struct i40e_hw *hw = &pf->hw;
 	struct i40e_vf *vf;
@@ -1121,11 +1143,11 @@ void i40e_reset_all_vfs(struct i40e_pf *pf, bool flr)
 
 	/* If we don't have any VFs, then there is nothing to reset */
 	if (!pf->num_alloc_vfs)
-		return;
+		return false;
 
 	/* If VFs have been disabled, there is no need to reset */
 	if (test_and_set_bit(__I40E_VF_DISABLE, pf->state))
-		return;
+		return false;
 
 	/* Begin reset on all VFs at once */
 	for (v = 0; v < pf->num_alloc_vfs; v++)
@@ -1200,6 +1222,8 @@ void i40e_reset_all_vfs(struct i40e_pf *pf, bool flr)
 
 	i40e_flush(hw);
 	clear_bit(__I40E_VF_DISABLE, pf->state);
+
+	return true;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h
index 5111d05d5f2f..5ea42ad094bc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h
@@ -122,8 +122,8 @@ int i40e_alloc_vfs(struct i40e_pf *pf, u16 num_alloc_vfs);
 int i40e_vc_process_vf_msg(struct i40e_pf *pf, s16 vf_id, u32 v_opcode,
 			   u32 v_retval, u8 *msg, u16 msglen);
 int i40e_vc_process_vflr_event(struct i40e_pf *pf);
-void i40e_reset_vf(struct i40e_vf *vf, bool flr);
-void i40e_reset_all_vfs(struct i40e_pf *pf, bool flr);
+bool i40e_reset_vf(struct i40e_vf *vf, bool flr);
+bool i40e_reset_all_vfs(struct i40e_pf *pf, bool flr);
 void i40e_vc_notify_vf_reset(struct i40e_vf *vf);
 
 /* VF configuration related iplink handlers */
-- 
2.14.2

^ permalink raw reply related

* [net-next 04/15] i40e: Fix reporting of supported link modes
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Filip Sadowski, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Filip Sadowski <filip.sadowski@intel.com>

This patch fixes incorrect reporting of supported link modes on some NICs.

Signed-off-by: Filip Sadowski <filip.sadowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h   | 20 ++++++++++++++++++--
 drivers/net/ethernet/intel/i40e/i40e_common.c       | 11 ++++++++++-
 drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h | 20 ++++++++++++++++++--
 3 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index e2a9ec80a623..5d0291c1337e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -1734,6 +1734,8 @@ enum i40e_aq_phy_type {
 	I40E_PHY_TYPE_10GBASE_CR1_CU		= 0xB,
 	I40E_PHY_TYPE_10GBASE_AOC		= 0xC,
 	I40E_PHY_TYPE_40GBASE_AOC		= 0xD,
+	I40E_PHY_TYPE_UNRECOGNIZED		= 0xE,
+	I40E_PHY_TYPE_UNSUPPORTED		= 0xF,
 	I40E_PHY_TYPE_100BASE_TX		= 0x11,
 	I40E_PHY_TYPE_1000BASE_T		= 0x12,
 	I40E_PHY_TYPE_10GBASE_T			= 0x13,
@@ -1752,6 +1754,8 @@ enum i40e_aq_phy_type {
 	I40E_PHY_TYPE_25GBASE_CR		= 0x20,
 	I40E_PHY_TYPE_25GBASE_SR		= 0x21,
 	I40E_PHY_TYPE_25GBASE_LR		= 0x22,
+	I40E_PHY_TYPE_EMPTY			= 0xFE,
+	I40E_PHY_TYPE_DEFAULT			= 0xFF,
 	I40E_PHY_TYPE_MAX
 };
 
@@ -1942,19 +1946,31 @@ struct i40e_aqc_get_link_status {
 #define I40E_AQ_25G_SERDES_UCODE_ERR	0X04
 #define I40E_AQ_25G_NIMB_UCODE_ERR	0X05
 	u8	loopback; /* use defines from i40e_aqc_set_lb_mode */
+/* Since firmware API 1.7 loopback field keeps power class info as well */
+#define I40E_AQ_LOOPBACK_MASK		0x07
+#define I40E_AQ_PWR_CLASS_SHIFT_LB	6
+#define I40E_AQ_PWR_CLASS_MASK_LB	(0x03 << I40E_AQ_PWR_CLASS_SHIFT_LB)
 	__le16	max_frame_size;
 	u8	config;
 #define I40E_AQ_CONFIG_FEC_KR_ENA	0x01
 #define I40E_AQ_CONFIG_FEC_RS_ENA	0x02
 #define I40E_AQ_CONFIG_CRC_ENA		0x04
 #define I40E_AQ_CONFIG_PACING_MASK	0x78
-	u8	power_desc;
+	union {
+		struct {
+			u8	power_desc;
 #define I40E_AQ_LINK_POWER_CLASS_1	0x00
 #define I40E_AQ_LINK_POWER_CLASS_2	0x01
 #define I40E_AQ_LINK_POWER_CLASS_3	0x02
 #define I40E_AQ_LINK_POWER_CLASS_4	0x03
 #define I40E_AQ_PWR_CLASS_MASK		0x03
-	u8	reserved[4];
+			u8	reserved[4];
+		};
+		struct {
+			u8	link_type[4];
+			u8	link_type_ext;
+		};
+	};
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_get_link_status);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 7346d8850c8e..64c15f4c9d2b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1821,7 +1821,7 @@ i40e_status i40e_aq_get_link_info(struct i40e_hw *hw,
 	hw_link_info->fec_info = resp->config & (I40E_AQ_CONFIG_FEC_KR_ENA |
 						 I40E_AQ_CONFIG_FEC_RS_ENA);
 	hw_link_info->ext_info = resp->ext_info;
-	hw_link_info->loopback = resp->loopback;
+	hw_link_info->loopback = resp->loopback & I40E_AQ_LOOPBACK_MASK;
 	hw_link_info->max_frame_size = le16_to_cpu(resp->max_frame_size);
 	hw_link_info->pacing = resp->config & I40E_AQ_CONFIG_PACING_MASK;
 
@@ -1852,6 +1852,15 @@ i40e_status i40e_aq_get_link_info(struct i40e_hw *hw,
 	     hw->aq.fw_min_ver < 40)) && hw_link_info->phy_type == 0xE)
 		hw_link_info->phy_type = I40E_PHY_TYPE_10GBASE_SFPP_CU;
 
+	if (hw->aq.api_maj_ver == I40E_FW_API_VERSION_MAJOR &&
+	    hw->aq.api_min_ver >= 7) {
+		__le32 tmp;
+
+		memcpy(&tmp, resp->link_type, sizeof(tmp));
+		hw->phy.phy_types = le32_to_cpu(tmp);
+		hw->phy.phy_types |= ((u64)resp->link_type_ext << 32);
+	}
+
 	/* save link status information */
 	if (link)
 		*link = *hw_link_info;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
index f9f48d1900b0..709d114fc305 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
@@ -1730,6 +1730,8 @@ enum i40e_aq_phy_type {
 	I40E_PHY_TYPE_10GBASE_CR1_CU		= 0xB,
 	I40E_PHY_TYPE_10GBASE_AOC		= 0xC,
 	I40E_PHY_TYPE_40GBASE_AOC		= 0xD,
+	I40E_PHY_TYPE_UNRECOGNIZED		= 0xE,
+	I40E_PHY_TYPE_UNSUPPORTED		= 0xF,
 	I40E_PHY_TYPE_100BASE_TX		= 0x11,
 	I40E_PHY_TYPE_1000BASE_T		= 0x12,
 	I40E_PHY_TYPE_10GBASE_T			= 0x13,
@@ -1748,6 +1750,8 @@ enum i40e_aq_phy_type {
 	I40E_PHY_TYPE_25GBASE_CR		= 0x20,
 	I40E_PHY_TYPE_25GBASE_SR		= 0x21,
 	I40E_PHY_TYPE_25GBASE_LR		= 0x22,
+	I40E_PHY_TYPE_EMPTY			= 0xFE,
+	I40E_PHY_TYPE_DEFAULT			= 0xFF,
 	I40E_PHY_TYPE_MAX
 };
 
@@ -1938,19 +1942,31 @@ struct i40e_aqc_get_link_status {
 #define I40E_AQ_25G_SERDES_UCODE_ERR	0X04
 #define I40E_AQ_25G_NIMB_UCODE_ERR	0X05
 	u8	loopback; /* use defines from i40e_aqc_set_lb_mode */
+/* Since firmware API 1.7 loopback field keeps power class info as well */
+#define I40E_AQ_LOOPBACK_MASK		0x07
+#define I40E_AQ_PWR_CLASS_SHIFT_LB	6
+#define I40E_AQ_PWR_CLASS_MASK_LB	(0x03 << I40E_AQ_PWR_CLASS_SHIFT_LB)
 	__le16	max_frame_size;
 	u8	config;
 #define I40E_AQ_CONFIG_FEC_KR_ENA	0x01
 #define I40E_AQ_CONFIG_FEC_RS_ENA	0x02
 #define I40E_AQ_CONFIG_CRC_ENA		0x04
 #define I40E_AQ_CONFIG_PACING_MASK	0x78
-	u8	power_desc;
+	union {
+		struct {
+			u8	power_desc;
 #define I40E_AQ_LINK_POWER_CLASS_1	0x00
 #define I40E_AQ_LINK_POWER_CLASS_2	0x01
 #define I40E_AQ_LINK_POWER_CLASS_3	0x02
 #define I40E_AQ_LINK_POWER_CLASS_4	0x03
 #define I40E_AQ_PWR_CLASS_MASK		0x03
-	u8	reserved[4];
+			u8	reserved[4];
+		};
+		struct {
+			u8	link_type[4];
+			u8	link_type_ext;
+		};
+	};
 };
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_get_link_status);
-- 
2.14.2

^ permalink raw reply related

* [net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2017-10-02
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains updates to i40e and i40evf.

Shannon Nelson fixes an issue where when a machine has more CPUs than
queue pairs, the counting gets a "little funky" and turns off Flow
Director.  So to correct it, limit the number of LAN queues initially
allocated to be sure there are some left for Flow Director and other
features.

Lihong cleans up dead code by removing a condition check which cannot
ever be true.

Christophe Jaillet fixes a potential NULL pointer dereference, which
could happen if kzalloc() fails.

Filip corrects the reporting of supported link modes, which was incorrect
for some NICs.  Added support for 'ethtool -m' command, which displays
information about QSFP+ modules.

Mariusz adds functions to read/write the LED registers to control the
LEDS, instead of accessing the registers directly whenever the LEDs
need to be controlled.

Jake fixes a regression where we introduced a scheduling while atomic,
so introduce a separate helper function which will manage its own need
for the mac_filter_hash_lock.  Also cleaned up the "PF" parameter in
i40e_vc_disable_vf() since it is never used and is not needed.  Fixed
a rare case where it is possible that a reset does not occur when
i40e_vc_disable_vf() is called, so modify i40e_reset_vf() to return a
bool to indicate whether it reset or not so that i40e_vc_disable_vf()
can wait until a reset actually occurs.

Alan adds the ability for the VF to request more or less underlying
allocated queues from the PF.  Fixes the incorrect method for clearing
the vf_states variable with a NULL assignment, when we should be
using atomic bitops since we don't actually want to clear all the
flags.  Fixed a resource leak, where the PF driver fails to inform
clients of a VF reset because we were incorrectly checking the
I40E_VF_STATE_PRE_ENABLE bit.

Mitch converts i40evf_map_rings_to_vectors() to a void function since
it cannot fail and allows us to clean up the checks for the function
return value.

Scott enables the driver(s) to pass traffic with VLAN tags using the
802.1ad Ethernet protocol.

The following are changes since commit 1dd236fda0c500a21c54f2140dadc488cde9265b:
  Merge branch 'mlxsw-Fixlets'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alan Brady (3):
  i40evf: Enable VF to request an alternate queue allocation
  i40e: fix handling of vf_states variable
  i40e: fix client notify of VF reset

Christophe JAILLET (1):
  i40e: Fix a potential NULL pointer dereference

Filip Sadowski (2):
  i40e: Fix reporting of supported link modes
  i40e: Add support for 'ethtool -m'

Jacob Keller (4):
  i40e: don't hold spinlock while resetting VF
  i40e: drop i40e_pf *pf from i40e_vc_disable_vf()
  i40e: make use of i40e_vc_disable_vf
  i40e: ensure reset occurs when disabling VF

Lihong Yang (1):
  i40e: remove logically dead code

Mariusz Stachura (1):
  i40e: use admin queue for setting LEDs behavior

Mitch Williams (1):
  i40e: make i40evf_map_rings_to_vectors void

Scott Peterson (1):
  i40e: Stop dropping 802.1ad tags - eth proto 0x88a8

Shannon Nelson (1):
  i40e: limit lan queue count in large CPU count machine

 drivers/net/ethernet/intel/i40e/i40e_adminq.c      |  12 ++
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  55 +++++-
 drivers/net/ethernet/intel/i40e/i40e_common.c      | 201 ++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c     | 154 ++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_main.c        |  18 +-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   9 +
 drivers/net/ethernet/intel/i40e/i40e_type.h        |  19 ++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 102 ++++++++---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |   5 +-
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h    |  55 +++++-
 drivers/net/ethernet/intel/i40evf/i40e_common.c    |  69 +++++++
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |   9 +
 drivers/net/ethernet/intel/i40evf/i40e_type.h      |  20 ++
 drivers/net/ethernet/intel/i40evf/i40evf.h         |   4 +
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |  38 +++-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c    |  97 +++++++++-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c    |  44 ++++-
 17 files changed, 837 insertions(+), 74 deletions(-)

-- 
2.14.2

^ permalink raw reply

* [net-next 02/15] i40e: remove logically dead code
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Lihong Yang, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Lihong Yang <lihong.yang@intel.com>

This patch removes the !vf condition check that cannot be
true in i40e_ndo_set_vf_trust function

Detected by CoverityScan, CID 1397531 Logically dead code

Signed-off-by: Lihong Yang <lihong.yang@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index a75396c157d9..e6b95e1e1a33 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -3354,8 +3354,6 @@ int i40e_ndo_set_vf_trust(struct net_device *netdev, int vf_id, bool setting)
 
 	vf = &pf->vf[vf_id];
 
-	if (!vf)
-		return -EINVAL;
 	if (setting == vf->trusted)
 		goto out;
 
-- 
2.14.2

^ permalink raw reply related

* [net-next 05/15] i40e: Add support for 'ethtool -m'
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Filip Sadowski, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Filip Sadowski <filip.sadowski@intel.com>

This patch adds support for 'ethtool -m' command which displays
information about (Q)SFP+ module plugged into NIC's cage.

Signed-off-by: Filip Sadowski <filip.sadowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |  18 +++
 drivers/net/ethernet/intel/i40e/i40e_common.c      |  69 +++++++++
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c     | 154 +++++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   9 ++
 drivers/net/ethernet/intel/i40e/i40e_type.h        |  13 ++
 .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h    |  18 +++
 drivers/net/ethernet/intel/i40evf/i40e_common.c    |  69 +++++++++
 drivers/net/ethernet/intel/i40evf/i40e_prototype.h |   9 ++
 drivers/net/ethernet/intel/i40evf/i40e_type.h      |  12 ++
 9 files changed, 371 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
index 5d0291c1337e..ed7bbe14bc6e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
@@ -244,6 +244,8 @@ enum i40e_admin_queue_opc {
 	i40e_aqc_opc_set_phy_debug		= 0x0622,
 	i40e_aqc_opc_upload_ext_phy_fm		= 0x0625,
 	i40e_aqc_opc_run_phy_activity		= 0x0626,
+	i40e_aqc_opc_set_phy_register		= 0x0628,
+	i40e_aqc_opc_get_phy_register		= 0x0629,
 
 	/* NVM commands */
 	i40e_aqc_opc_nvm_read			= 0x0701,
@@ -2053,6 +2055,22 @@ struct i40e_aqc_run_phy_activity {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_run_phy_activity);
 
+/* Set PHY Register command (0x0628) */
+/* Get PHY Register command (0x0629) */
+struct i40e_aqc_phy_register_access {
+	u8	phy_interface;
+#define I40E_AQ_PHY_REG_ACCESS_INTERNAL	0
+#define I40E_AQ_PHY_REG_ACCESS_EXTERNAL	1
+#define I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE	2
+	u8	dev_address;
+	u8	reserved1[2];
+	__le32	reg_address;
+	__le32	reg_value;
+	u8	reserved2[4];
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_phy_register_access);
+
 /* NVM Read command (indirect 0x0701)
  * NVM Erase commands (direct 0x0702)
  * NVM Update commands (indirect 0x0703)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 64c15f4c9d2b..fada03799850 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -5062,6 +5062,75 @@ void i40e_write_rx_ctl(struct i40e_hw *hw, u32 reg_addr, u32 reg_val)
 		wr32(hw, reg_addr, reg_val);
 }
 
+/**
+ * i40e_aq_set_phy_register
+ * @hw: pointer to the hw struct
+ * @phy_select: select which phy should be accessed
+ * @dev_addr: PHY device address
+ * @reg_addr: PHY register address
+ * @reg_val: new register value
+ * @cmd_details: pointer to command details structure or NULL
+ *
+ * Write the external PHY register.
+ **/
+i40e_status i40e_aq_set_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 reg_val,
+				     struct i40e_asq_cmd_details *cmd_details)
+{
+	struct i40e_aq_desc desc;
+	struct i40e_aqc_phy_register_access *cmd =
+		(struct i40e_aqc_phy_register_access *)&desc.params.raw;
+	i40e_status status;
+
+	i40e_fill_default_direct_cmd_desc(&desc,
+					  i40e_aqc_opc_set_phy_register);
+
+	cmd->phy_interface = phy_select;
+	cmd->dev_address = dev_addr;
+	cmd->reg_address = cpu_to_le32(reg_addr);
+	cmd->reg_value = cpu_to_le32(reg_val);
+
+	status = i40e_asq_send_command(hw, &desc, NULL, 0, cmd_details);
+
+	return status;
+}
+
+/**
+ * i40e_aq_get_phy_register
+ * @hw: pointer to the hw struct
+ * @phy_select: select which phy should be accessed
+ * @dev_addr: PHY device address
+ * @reg_addr: PHY register address
+ * @reg_val: read register value
+ * @cmd_details: pointer to command details structure or NULL
+ *
+ * Read the external PHY register.
+ **/
+i40e_status i40e_aq_get_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 *reg_val,
+				     struct i40e_asq_cmd_details *cmd_details)
+{
+	struct i40e_aq_desc desc;
+	struct i40e_aqc_phy_register_access *cmd =
+		(struct i40e_aqc_phy_register_access *)&desc.params.raw;
+	i40e_status status;
+
+	i40e_fill_default_direct_cmd_desc(&desc,
+					  i40e_aqc_opc_get_phy_register);
+
+	cmd->phy_interface = phy_select;
+	cmd->dev_address = dev_addr;
+	cmd->reg_address = cpu_to_le32(reg_addr);
+
+	status = i40e_asq_send_command(hw, &desc, NULL, 0, cmd_details);
+	if (!status)
+		*reg_val = le32_to_cpu(cmd->reg_value);
+
+	return status;
+}
+
 /**
  * i40e_aq_write_ppp - Write pipeline personalization profile (ppp)
  * @hw: pointer to the hw struct
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 05e89864f781..1136d02e2e95 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -4196,6 +4196,158 @@ static int i40e_set_priv_flags(struct net_device *dev, u32 flags)
 	return 0;
 }
 
+/**
+ * i40e_get_module_info - get (Q)SFP+ module type info
+ * @netdev: network interface device structure
+ * @modinfo: module EEPROM size and layout information structure
+ **/
+static int i40e_get_module_info(struct net_device *netdev,
+				struct ethtool_modinfo *modinfo)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+	u32 sff8472_comp = 0;
+	u32 sff8472_swap = 0;
+	u32 sff8636_rev = 0;
+	i40e_status status;
+	u32 type = 0;
+
+	/* Check if firmware supports reading module EEPROM. */
+	if (!(hw->flags & I40E_HW_FLAG_AQ_PHY_ACCESS_CAPABLE)) {
+		netdev_err(vsi->netdev, "Module EEPROM memory read not supported. Please update the NVM image.\n");
+		return -EINVAL;
+	}
+
+	status = i40e_update_link_info(hw);
+	if (status)
+		return -EIO;
+
+	if (hw->phy.link_info.phy_type == I40E_PHY_TYPE_EMPTY) {
+		netdev_err(vsi->netdev, "Cannot read module EEPROM memory. No module connected.\n");
+		return -EINVAL;
+	}
+
+	type = hw->phy.link_info.module_type[0];
+
+	switch (type) {
+	case I40E_MODULE_TYPE_SFP:
+		status = i40e_aq_get_phy_register(hw,
+				I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE,
+				I40E_I2C_EEPROM_DEV_ADDR,
+				I40E_MODULE_SFF_8472_COMP,
+				&sff8472_comp, NULL);
+		if (status)
+			return -EIO;
+
+		status = i40e_aq_get_phy_register(hw,
+				I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE,
+				I40E_I2C_EEPROM_DEV_ADDR,
+				I40E_MODULE_SFF_8472_SWAP,
+				&sff8472_swap, NULL);
+		if (status)
+			return -EIO;
+
+		/* Check if the module requires address swap to access
+		 * the other EEPROM memory page.
+		 */
+		if (sff8472_swap & I40E_MODULE_SFF_ADDR_MODE) {
+			netdev_warn(vsi->netdev, "Module address swap to access page 0xA2 is not supported.\n");
+			modinfo->type = ETH_MODULE_SFF_8079;
+			modinfo->eeprom_len = ETH_MODULE_SFF_8079_LEN;
+		} else if (sff8472_comp == 0x00) {
+			/* Module is not SFF-8472 compliant */
+			modinfo->type = ETH_MODULE_SFF_8079;
+			modinfo->eeprom_len = ETH_MODULE_SFF_8079_LEN;
+		} else {
+			modinfo->type = ETH_MODULE_SFF_8472;
+			modinfo->eeprom_len = ETH_MODULE_SFF_8472_LEN;
+		}
+		break;
+	case I40E_MODULE_TYPE_QSFP_PLUS:
+		/* Read from memory page 0. */
+		status = i40e_aq_get_phy_register(hw,
+				I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE,
+				0,
+				I40E_MODULE_REVISION_ADDR,
+				&sff8636_rev, NULL);
+		if (status)
+			return -EIO;
+		/* Determine revision compliance byte */
+		if (sff8636_rev > 0x02) {
+			/* Module is SFF-8636 compliant */
+			modinfo->type = ETH_MODULE_SFF_8636;
+			modinfo->eeprom_len = I40E_MODULE_QSFP_MAX_LEN;
+		} else {
+			modinfo->type = ETH_MODULE_SFF_8436;
+			modinfo->eeprom_len = I40E_MODULE_QSFP_MAX_LEN;
+		}
+		break;
+	case I40E_MODULE_TYPE_QSFP28:
+		modinfo->type = ETH_MODULE_SFF_8636;
+		modinfo->eeprom_len = I40E_MODULE_QSFP_MAX_LEN;
+		break;
+	default:
+		netdev_err(vsi->netdev, "Module type unrecognized\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/**
+ * i40e_get_module_eeprom - fills buffer with (Q)SFP+ module memory contents
+ * @netdev: network interface device structure
+ * @ee: EEPROM dump request structure
+ * @data: buffer to be filled with EEPROM contents
+ **/
+static int i40e_get_module_eeprom(struct net_device *netdev,
+				  struct ethtool_eeprom *ee,
+				  u8 *data)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+	bool is_sfp = false;
+	i40e_status status;
+	u32 value = 0;
+	int i;
+
+	if (!ee || !ee->len || !data)
+		return -EINVAL;
+
+	if (hw->phy.link_info.module_type[0] == I40E_MODULE_TYPE_SFP)
+		is_sfp = true;
+
+	for (i = 0; i < ee->len; i++) {
+		u32 offset = i + ee->offset;
+		u32 addr = is_sfp ? I40E_I2C_EEPROM_DEV_ADDR : 0;
+
+		/* Check if we need to access the other memory page */
+		if (is_sfp) {
+			if (offset >= ETH_MODULE_SFF_8079_LEN) {
+				offset -= ETH_MODULE_SFF_8079_LEN;
+				addr = I40E_I2C_EEPROM_DEV_ADDR2;
+			}
+		} else {
+			while (offset >= ETH_MODULE_SFF_8436_LEN) {
+				/* Compute memory page number and offset. */
+				offset -= ETH_MODULE_SFF_8436_LEN / 2;
+				addr++;
+			}
+		}
+
+		status = i40e_aq_get_phy_register(hw,
+				I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE,
+				addr, offset, &value, NULL);
+		if (status)
+			return -EIO;
+		data[i] = value;
+	}
+	return 0;
+}
+
 static const struct ethtool_ops i40e_ethtool_ops = {
 	.get_drvinfo		= i40e_get_drvinfo,
 	.get_regs_len		= i40e_get_regs_len,
@@ -4228,6 +4380,8 @@ static const struct ethtool_ops i40e_ethtool_ops = {
 	.set_rxfh		= i40e_set_rxfh,
 	.get_channels		= i40e_get_channels,
 	.set_channels		= i40e_set_channels,
+	.get_module_info	= i40e_get_module_info,
+	.get_module_eeprom	= i40e_get_module_eeprom,
 	.get_ts_info		= i40e_get_ts_info,
 	.get_priv_flags		= i40e_get_priv_flags,
 	.set_priv_flags		= i40e_set_priv_flags,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
index a39b13197891..01502561035c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
@@ -360,6 +360,15 @@ i40e_status i40e_aq_rx_ctl_write_register(struct i40e_hw *hw,
 				u32 reg_addr, u32 reg_val,
 				struct i40e_asq_cmd_details *cmd_details);
 void i40e_write_rx_ctl(struct i40e_hw *hw, u32 reg_addr, u32 reg_val);
+i40e_status i40e_aq_set_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 reg_val,
+				     struct i40e_asq_cmd_details *cmd_details);
+i40e_status i40e_aq_get_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 *reg_val,
+				     struct i40e_asq_cmd_details *cmd_details);
+
 i40e_status i40e_read_phy_register_clause22(struct i40e_hw *hw,
 					    u16 reg, u8 phy_addr, u16 *value);
 i40e_status i40e_write_phy_register_clause22(struct i40e_hw *hw,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
index fd4bbdd88b57..8b0b9f826b7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -428,6 +428,18 @@ struct i40e_nvm_access {
 	u8 data[1];
 };
 
+/* (Q)SFP module access definitions */
+#define I40E_I2C_EEPROM_DEV_ADDR	0xA0
+#define I40E_I2C_EEPROM_DEV_ADDR2	0xA2
+#define I40E_MODULE_TYPE_ADDR		0x00
+#define I40E_MODULE_REVISION_ADDR	0x01
+#define I40E_MODULE_SFF_8472_COMP	0x5E
+#define I40E_MODULE_SFF_8472_SWAP	0x5C
+#define I40E_MODULE_SFF_ADDR_MODE	0x04
+#define I40E_MODULE_TYPE_QSFP_PLUS	0x0D
+#define I40E_MODULE_TYPE_QSFP28		0x11
+#define I40E_MODULE_QSFP_MAX_LEN	640
+
 /* PCI bus types */
 enum i40e_bus_type {
 	i40e_bus_type_unknown = 0,
@@ -598,6 +610,7 @@ struct i40e_hw {
 	struct i40e_dcbx_config desired_dcbx_config; /* CEE Desired Cfg */
 
 #define I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE BIT_ULL(0)
+#define I40E_HW_FLAG_AQ_PHY_ACCESS_CAPABLE  BIT_ULL(2)
 	u64 flags;
 
 	/* debug mask */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
index 709d114fc305..eee7ece42b39 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
@@ -244,6 +244,8 @@ enum i40e_admin_queue_opc {
 	i40e_aqc_opc_set_phy_debug		= 0x0622,
 	i40e_aqc_opc_upload_ext_phy_fm		= 0x0625,
 	i40e_aqc_opc_run_phy_activity		= 0x0626,
+	i40e_aqc_opc_set_phy_register		= 0x0628,
+	i40e_aqc_opc_get_phy_register		= 0x0629,
 
 	/* NVM commands */
 	i40e_aqc_opc_nvm_read			= 0x0701,
@@ -2046,6 +2048,22 @@ struct i40e_aqc_run_phy_activity {
 
 I40E_CHECK_CMD_LENGTH(i40e_aqc_run_phy_activity);
 
+/* Set PHY Register command (0x0628) */
+/* Get PHY Register command (0x0629) */
+struct i40e_aqc_phy_register_access {
+	u8	phy_interface;
+#define I40E_AQ_PHY_REG_ACCESS_INTERNAL	0
+#define I40E_AQ_PHY_REG_ACCESS_EXTERNAL	1
+#define I40E_AQ_PHY_REG_ACCESS_EXTERNAL_MODULE	2
+	u8	dev_address;
+	u8	reserved1[2];
+	__le32	reg_address;
+	__le32	reg_value;
+	u8	reserved2[4];
+};
+
+I40E_CHECK_CMD_LENGTH(i40e_aqc_phy_register_access);
+
 /* NVM Read command (indirect 0x0701)
  * NVM Erase commands (direct 0x0702)
  * NVM Update commands (indirect 0x0703)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c b/drivers/net/ethernet/intel/i40evf/i40e_common.c
index 8d3a2bfe186a..7d70bf69b249 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c
@@ -1041,6 +1041,75 @@ void i40evf_write_rx_ctl(struct i40e_hw *hw, u32 reg_addr, u32 reg_val)
 		wr32(hw, reg_addr, reg_val);
 }
 
+/**
+ * i40evf_aq_set_phy_register
+ * @hw: pointer to the hw struct
+ * @phy_select: select which phy should be accessed
+ * @dev_addr: PHY device address
+ * @reg_addr: PHY register address
+ * @reg_val: new register value
+ * @cmd_details: pointer to command details structure or NULL
+ *
+ * Reset the external PHY.
+ **/
+i40e_status i40evf_aq_set_phy_register(struct i40e_hw *hw,
+				       u8 phy_select, u8 dev_addr,
+				       u32 reg_addr, u32 reg_val,
+				       struct i40e_asq_cmd_details *cmd_details)
+{
+	struct i40e_aq_desc desc;
+	struct i40e_aqc_phy_register_access *cmd =
+		(struct i40e_aqc_phy_register_access *)&desc.params.raw;
+	i40e_status status;
+
+	i40evf_fill_default_direct_cmd_desc(&desc,
+					    i40e_aqc_opc_set_phy_register);
+
+	cmd->phy_interface = phy_select;
+	cmd->dev_address = dev_addr;
+	cmd->reg_address = cpu_to_le32(reg_addr);
+	cmd->reg_value = cpu_to_le32(reg_val);
+
+	status = i40evf_asq_send_command(hw, &desc, NULL, 0, cmd_details);
+
+	return status;
+}
+
+/**
+ * i40evf_aq_get_phy_register
+ * @hw: pointer to the hw struct
+ * @phy_select: select which phy should be accessed
+ * @dev_addr: PHY device address
+ * @reg_addr: PHY register address
+ * @reg_val: read register value
+ * @cmd_details: pointer to command details structure or NULL
+ *
+ * Reset the external PHY.
+ **/
+i40e_status i40evf_aq_get_phy_register(struct i40e_hw *hw,
+				       u8 phy_select, u8 dev_addr,
+				       u32 reg_addr, u32 *reg_val,
+				       struct i40e_asq_cmd_details *cmd_details)
+{
+	struct i40e_aq_desc desc;
+	struct i40e_aqc_phy_register_access *cmd =
+		(struct i40e_aqc_phy_register_access *)&desc.params.raw;
+	i40e_status status;
+
+	i40evf_fill_default_direct_cmd_desc(&desc,
+					    i40e_aqc_opc_get_phy_register);
+
+	cmd->phy_interface = phy_select;
+	cmd->dev_address = dev_addr;
+	cmd->reg_address = cpu_to_le32(reg_addr);
+
+	status = i40evf_asq_send_command(hw, &desc, NULL, 0, cmd_details);
+	if (!status)
+		*reg_val = le32_to_cpu(cmd->reg_value);
+
+	return status;
+}
+
 /**
  * i40e_aq_send_msg_to_pf
  * @hw: pointer to the hardware structure
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_prototype.h b/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
index c9836bba487d..b624b5994075 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_prototype.h
@@ -111,6 +111,15 @@ i40e_status i40evf_aq_rx_ctl_write_register(struct i40e_hw *hw,
 				u32 reg_addr, u32 reg_val,
 				struct i40e_asq_cmd_details *cmd_details);
 void i40evf_write_rx_ctl(struct i40e_hw *hw, u32 reg_addr, u32 reg_val);
+i40e_status i40e_aq_set_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 reg_val,
+				     struct i40e_asq_cmd_details *cmd_details);
+i40e_status i40e_aq_get_phy_register(struct i40e_hw *hw,
+				     u8 phy_select, u8 dev_addr,
+				     u32 reg_addr, u32 *reg_val,
+				     struct i40e_asq_cmd_details *cmd_details);
+
 i40e_status i40e_read_phy_register(struct i40e_hw *hw, u8 page,
 				   u16 reg, u8 phy_addr, u16 *value);
 i40e_status i40e_write_phy_register(struct i40e_hw *hw, u8 page,
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index 2ea919d9cdcf..b53584e3d580 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -401,6 +401,18 @@ struct i40e_nvm_access {
 	u8 data[1];
 };
 
+/* (Q)SFP module access definitions */
+#define I40E_I2C_EEPROM_DEV_ADDR	0xA0
+#define I40E_I2C_EEPROM_DEV_ADDR2	0xA2
+#define I40E_MODULE_TYPE_ADDR		0x00
+#define I40E_MODULE_REVISION_ADDR	0x01
+#define I40E_MODULE_SFF_8472_COMP	0x5E
+#define I40E_MODULE_SFF_8472_SWAP	0x5C
+#define I40E_MODULE_SFF_ADDR_MODE	0x04
+#define I40E_MODULE_TYPE_QSFP_PLUS	0x0D
+#define I40E_MODULE_TYPE_QSFP28		0x11
+#define I40E_MODULE_QSFP_MAX_LEN	640
+
 /* PCI bus types */
 enum i40e_bus_type {
 	i40e_bus_type_unknown = 0,
-- 
2.14.2

^ permalink raw reply related

* [net-next 01/15] i40e: limit lan queue count in large CPU count machine
From: Jeff Kirsher @ 2017-10-02 19:48 UTC (permalink / raw)
  To: davem; +Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@oracle.com>

When a machine has more CPUs than queue pairs, e.g. 512 cores, the
counting gets a little funky and turns off Flow Director with the
message:
  not enough queues for Flow Director. Flow Director feature is disabled

This patch limits the number of lan queues initially allocated to
be sure we have some left for FD and other features.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 47f71d7c3ae0..387f0863f794 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11093,6 +11093,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
 static void i40e_determine_queue_usage(struct i40e_pf *pf)
 {
 	int queues_left;
+	int q_max;
 
 	pf->num_lan_qps = 0;
 
@@ -11139,10 +11140,12 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
 					I40E_FLAG_DCB_ENABLED);
 			dev_info(&pf->pdev->dev, "not enough queues for DCB. DCB is disabled.\n");
 		}
-		pf->num_lan_qps = max_t(int, pf->rss_size_max,
-					num_online_cpus());
-		pf->num_lan_qps = min_t(int, pf->num_lan_qps,
-					pf->hw.func_caps.num_tx_qp);
+
+		/* limit lan qps to the smaller of qps, cpus or msix */
+		q_max = max_t(int, pf->rss_size_max, num_online_cpus());
+		q_max = min_t(int, q_max, pf->hw.func_caps.num_tx_qp);
+		q_max = min_t(int, q_max, pf->hw.func_caps.num_msix_vectors);
+		pf->num_lan_qps = q_max;
 
 		queues_left -= pf->num_lan_qps;
 	}
-- 
2.14.2

^ permalink raw reply related

* Re: [net-next V2 PATCH 5/5] samples/bpf: add cpumap sample program xdp_redirect_cpu
From: Alexei Starovoitov @ 2017-10-02 19:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, Jason Wang, mchan,
	John Fastabend, peter.waskiewicz.jr, Daniel Borkmann,
	Andy Gospodarek
In-Reply-To: <20171002140723.7667a9b4@redhat.com>

On Mon, Oct 02, 2017 at 02:07:23PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 29 Sep 2017 20:06:09 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > > +/*** Trace point code ***/
> > > +
> > > +/* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format
> > > + * Code in:                kernel/include/trace/events/xdp.h
> > > + */
> > > +struct xdp_redirect_ctx {
> > > +	unsigned short common_type;	//	offset:0;  size:2; signed:0;
> > > +	unsigned char common_flags;	//	offset:2;  size:1; signed:0;
> > > +	unsigned char common_preempt_count;//	offset:3;  size:1; signed:0;
> > > +	int common_pid;			//	offset:4;  size:4; signed:1;  
> > 
> > this part is not right. First 8 bytes are not accessible by bpf code.
> > Please use __u64 pad; or similar here.
> 
> I've corrected this in V3.
> 
> Can you explain why BPF cannot access these (first 8 bytes) struct members?

in the very first patch adding bpf to tracepoint we allowed the access,
but then people complained that common_* stuff is not abi and since it was
slow to populate anyway due to local_save_flags (which is never used by
scripts we had so far) and for pid there was a helper already, I was happy
to get rid of it and save these 8 bytes for future use. which ended up
being used by hidden(inaccessible) pt_regs pointer.
Layout description is in
commit 98b5c2c65c29 ("perf, bpf: allow bpf programs attach to tracepoints")
The work on dynamic bits of tracepoints is still tbd.

^ permalink raw reply

* Re: [PATCH net] socket, bpf: fix possible use after free
From: Daniel Borkmann @ 2017-10-02 19:42 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev, Alexei Starovoitov
In-Reply-To: <1506972051.8061.30.camel@edumazet-glaptop3.roam.corp.google.com>

On 10/02/2017 09:20 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Starting from linux-4.4, 3WHS no longer takes the listener lock.
>
> Since this time, we might hit a use-after-free in sk_filter_charge(),
> if the filter we got in the memcpy() of the listener content
> just happened to be replaced by a thread changing listener BPF filter.
>
> To fix this, we need to make sure the filter refcount is not already
> zero before incrementing it again.
>
> Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks, Eric!

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* RE: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Rodney Cummings @ 2017-10-02 19:40 UTC (permalink / raw)
  To: Levi Pearson
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran@gmail.com,
	jesus.sanchez-palencia@intel.com, andre.guedes@intel.com
In-Reply-To: <CAEYbN3STY+7ZOodQLaP4MfbvwCovWsav52PXBmjHND7QOO=srg@mail.gmail.com>

Thanks Levi,

> One particular device I am working with now provides all network
> access through a DSA switch chip with hardware Qbv support in addtion
> to hardware Qav support. The SoC attached to it has no hardware timed
> launch (SO_TXTIME) support. In this case, although the proposed
> interface for Qbv is not *sufficient* to make a working time-aware end
> station, it does provide a usable building block to provide one. As
> with the credit-based shaping system, Talkers must provide an
> additional level of per-stream shaping as well, but this is largely
> (absent the jitter calculations, which are sort of a middle-level
> concern) independent of what sort of hardware offload of the
> scheduling is provided.

It's a shame that someone built such hardware. Speaking as a manufacturer
of daisy-chainable products for industrial/automotive applications, I
wouldn't use that hardware in my products. The whole point of scheduling
is to obtain close-to-optimal network determinism. If the hardware
doesn't schedule properly, the product is probably better off using CBS.

It would be great if all of the user-level application code was scheduled
as tightly as the network, but the reality is that we aren't there yet.
Also, even if the end-station has a separately timed RT Linux thread for
each stream, the order of those threads can vary. Therefore, when the 
end-station (SoC) has more than one talker, the frames for each talker 
can be written into the Qbv queue at any time, in any order. 
There is no scheduling of frames (i.e. streams)... only the queue.

From the perspective of the CNC, that means that a Qbv-only end-station is
sloppy. In contrast, an end-station using per-stream scheduling 
(e.g. i210 with LaunchTime) is precise, because each frame has a known time.
It's true that P802.1Qcc supports both, so the CNC might support both.
Nevertheless, the CNC is likely to compute a network-wide schedule
that optimizes for the per-stream end-stations, and locates the windows
for the Qbv-only end-stations in a slower interval.

I realize that we are only doing CBS for now. I guess my main point is that
if/when we add scheduling, it cannot be limited to Qbv. Per-stream scheduling
is essential, because that is the assumption for most CNC designs that 
I'm aware of.

Rodney

> -----Original Message-----
> From: Levi Pearson [mailto:levipearson@gmail.com]
> Sent: Monday, October 2, 2017 1:46 PM
> To: Rodney Cummings <rodney.cummings@ni.com>
> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; Vinicius
> Costa Gomes <vinicius.gomes@intel.com>; Henrik Austad <henrik@austad.us>;
> richardcochran@gmail.com; jesus.sanchez-palencia@intel.com;
> andre.guedes@intel.com
> Subject: Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for
> traffic shapers
> 
> Hi Rodney,
> 
> Some archives seem to have threaded it, but I have CC'd the
> participants I saw in the original discussion thread since they may
> not otherwise notice it amongst the normal traffic.
> 
> On Fri, Sep 29, 2017 at 2:44 PM, Rodney Cummings <rodney.cummings@ni.com>
> wrote:
> > Hi,
> >
> > I am posting my reply to this thread after subscribing, so I apologize
> > if the archive happens to attach it to the wrong thread.
> >
> > First, I'd like to say that I strongly support this RFC.
> > We need Linux interfaces for IEEE 802.1 TSN features.
> >
> > Although I haven't looked in detail, the proposal for CBS looks good.
> > My questions/concerns are more related to future work, such for 802.1Qbv
> > (scheduled traffic).
> >
> > 1. Question: From an 802.1 perspective, is this RFC intended to support
> > end-station (e.g. NIC in host), bridges (i.e. DSA), or both?
> >
> > This is very important to clarify, because the usage of this interface
> > will be very different for one or the other.
> >
> > For a bridge, the user code typically represents a remote management
> > protocol (e.g. SNMP, NETCONF, RESTCONF), and this interface is
> > expected to align with the specifications of 802.1Q clause 12,
> > which serves as the information model for management. Historically,
> > a standard kernel interface for management hasn't been viewed as
> > essential, but I suppose it wouldn't hurt.
> 
> I don't think the proposal was meant to cover the case of non-local
> switch hardware, but in addition to dsa and switchdev switch ICs
> managed by embedded Linux-running SoCs, there are SoCs with embedded
> small port count switches or even plain multiple NICs with software
> bridging. Many of these embedded small port count switches have FQTSS
> hardware that could potentially be configured by the proposed cbs
> qdisc. This blurs the line somewhat between what is a "bridge" and
> what is an "end-station" in 802.1Q terminology, but nevertheless these
> devices exist, sometimes acting as an endpoint + a real bridge and
> sometimes as just a system with multiple network interfaces.
> 
> > For an end station, the user code can be an implementation of SRP
> > (802.1Q clause 35), or it can be an application-specific
> > protocol (e.g. industrial fieldbus) that exchanges data according
> > to P802.1Qcc clause 46. Either way, the top-level user interface
> > is designed for individual streams, not queues and shapers. That
> > implies some translation code between that top-level interface
> > and this sort of kernel interface.
> >
> > As a specific end-station example, for CBS, 802.1Q-2014 subclause
> > 34.6.1 requires "per-stream queues" in the Talker end-station.
> > I don't see 34.6.1 represented in the proposed RFC, but that's
> > okay... maybe per-stream queues are implemented in user code.
> > Nevertheless, if that is the assumption, I think we need to
> > clarify, especially in examples.
> 
> You're correct that the FQTSS credit-based shaping algorithm requires
> per-stream shaping by Talker endpoints as well, but this is in
> addition to the per-class shaping provided by most hardware shaping
> implementations that I'm aware of in endpoint network hardware. I
> agree that we need to document the need to provide this, but it can
> definitely be built on top of the current proposal.
> 
> I believe the per-stream shaping could be managed either by a user
> space application that manages all use of a streaming traffic class,
> or through an additional qdisc module that performs per-stream
> management on top of the proposed cbs qdisc, ensuring that the
> frames-per-observation interval aspect of each stream's reservation is
> obeyed. This becomes a fairly simple qdisc to implement on top of a
> per-traffic class shaper, and could even be implemented with the help
> of the timestamp that the SO_TXTIME proposal adds to skbuffs, but I
> think keeping the layers separate provides more flexibility to
> implementations and keeps management of various kinds of hardware
> offload support simpler as well.
> 
> > 2. Suggestion: Do not assume that a time-aware (i.e. scheduled)
> > end-station will always use 802.1Qbv.
> >
> > For those who are subscribed to the 802.1 mailing list,
> > I'd suggest a read of draft P802.1Qcc/D1.6, subclause U.1
> > of Annex U. Subclause U.1 assumes that bridges in the network use
> > 802.1Qbv, and then it poses the question of what an end-station
> > Talker should do. If the end-station also uses 802.1Qbv,
> > and that end-station transmits multiple streams, 802.1Qbv is
> > a bad implementation. The reason is that the scheduling
> > (i.e. order in time) of each stream cannot be controlled, which
> > in turn means that the CNC (network manager) cannot optimize
> > the 802.1Qbv schedules in bridges. The preferred technique
> > is to use "per-stream scheduling" in each Talker, so that
> > the CNC can create an optimal schedules (i.e. best determinism).
> >
> > I'm aware of a small number of proprietary CNC implementations for
> > 802.1Qbv in bridges, and they are generally assuming per-stream
> > scheduling in end-stations (Talkers).
> >
> > The i210 NIC's LaunchTime can be used to implement per-stream
> > scheduling. I haven't looked at SO_TXTIME in detail, but it sounds
> > like per-stream scheduling. If so, then we already have the
> > fundamental building blocks for a complete implementation
> > of a time-aware end-station.
> >
> > If we answer the preceding question #1 as "end-station only",
> > I would recommend avoiding 802.1Qbv in this interface. There
> > isn't really anything wrong with it per-se, but it would lead
> > developers down the wrong path.
> 
> In some situations, such as device nodes that each incorporate a small
> port count switch for the purpose of daisy-chaining a segment of the
> network, "end stations" must do a limited subset of local bridge
> management as well. I'm not sure how common this is going to be for
> industrial control applications, but I know there are audio and
> automotive applications built this way.
> 
> One particular device I am working with now provides all network
> access through a DSA switch chip with hardware Qbv support in addtion
> to hardware Qav support. The SoC attached to it has no hardware timed
> launch (SO_TXTIME) support. In this case, although the proposed
> interface for Qbv is not *sufficient* to make a working time-aware end
> station, it does provide a usable building block to provide one. As
> with the credit-based shaping system, Talkers must provide an
> additional level of per-stream shaping as well, but this is largely
> (absent the jitter calculations, which are sort of a middle-level
> concern) independent of what sort of hardware offload of the
> scheduling is provided.
> 
> Both Qbv windows and timed launch support do roughly the same thing;
> they *delay* the launch of a hardware-queued frame so it can egress at
> a precisely specified time, and at least with the i210 and Qbv, ensure
> that no other traffic will be in-progress when that time arrives. For
> either to be used effectively, the application still has to prepare
> the frame slightly ahead-of-time and thus must have the same level of
> time-awareness. This is, again, largely independent of what kind of
> hardware offloading support is provided and is also largely
> independent of the network stack itself. Neither queue window
> management nor SO_TXTIME help the application present its
> time-sensitive traffic at the right time; that's a matter to be worked
> out with the application taking advantage of PTP and the OS scheduler.
> Whether you rely on managed windows or hardware launch time to provide
> the precisely correct amount of delay beyond that is immaterial to the
> application. In the absence of SO_TXTIME offloading (or even with it,
> and in the presence of sufficient OS scheduling jitter), an additional
> layer may need to be provided to ensure different applications' frames
> are queued in the correct order for egress during the window. Again,
> this could be a purely user-space application multiplexer or a
> separate qdisc module.
> 
> I wholeheartedly agree with you and Richard that we ought to
> eventually provide application-level APIs that don't require users to
> have deep knowledge of various 802.1Q intricacies. But I believe that
> the hardware offloading capability being provided now, and the variety
> of the way things are hooked up in real hardware, suggests that we
> ought to also build the support for the underlying protocols in layers
> so that we don't create unnecessary mismatches between offloading
> capability (which can be essential to overall network performance) and
> APIs, such that one configuration of offload support is privileged
> above others even when comparable scheduling accuracy could be
> provided by either.
> 
> In any case, only the cbs qdisc has been included in the post-RFC
> patch cover page for its last couple of iterations, so there is plenty
> of time to discuss how time-aware shaping, preemption, etc. management
> should occur beyond the cbs and SO_TXTIME proposals.
> 
> 
> Levi

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: dissect tunnel info
From: Tom Herbert @ 2017-10-02 19:36 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jiri Pirko, Jamal Hadi Salim, Cong Wang,
	Linux Kernel Network Developers, oss-drivers
In-Reply-To: <1506933676-20121-3-git-send-email-simon.horman@netronome.com>

On Mon, Oct 2, 2017 at 1:41 AM, Simon Horman <simon.horman@netronome.com> wrote:
> Move dissection of tunnel info from the flower classifier to the flow
> dissector where all other dissection occurs.  This should not have any
> behavioural affect on other users of the flow dissector.
>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  net/core/flow_dissector.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/sched/cls_flower.c    |  25 ------------
>  2 files changed, 100 insertions(+), 25 deletions(-)
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 0a977373d003..1f5caafb4492 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -5,6 +5,7 @@
>  #include <linux/ipv6.h>
>  #include <linux/if_vlan.h>
>  #include <net/dsa.h>
> +#include <net/dst_metadata.h>
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>  #include <net/gre.h>
> @@ -115,6 +116,102 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
>  }
>  EXPORT_SYMBOL(__skb_flow_get_ports);
>
> +static void
> +skb_flow_dissect_set_enc_addr_type(enum flow_dissector_key_id type,
> +                                  struct flow_dissector *flow_dissector,
> +                                  void *target_container)
> +{
> +       struct flow_dissector_key_control *ctrl;
> +
> +       if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL))
> +               return;
> +
> +       ctrl = skb_flow_dissector_target(flow_dissector,
> +                                        FLOW_DISSECTOR_KEY_ENC_CONTROL,
> +                                        target_container);
> +       ctrl->addr_type = type;
> +}
> +
> +static void
> +__skb_flow_dissect_tunnel_info(const struct sk_buff *skb,
> +                              struct flow_dissector *flow_dissector,
> +                              void *target_container)
> +{
> +       struct ip_tunnel_info *info;
> +       struct ip_tunnel_key *key;
> +
> +       /* A quick check to see if there might be something to do. */
> +       if (!dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_KEYID) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_CONTROL) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_PORTS))
> +               return;

This complex conditional is now in the path of every call to flow
dissector regardless of whether a classifier is enabled or tunnels
are. What does the assembly show in terms of number of branches? Can
we at least get this down to one check (might be a use case for
FLOW_DISSECTOR_F_FLOWER ;-) ), or even better use the static key when
encap or is enabled?

> +
> +       info = skb_tunnel_info(skb);
> +       if (!info)
> +               return;
> +
> +       key = &info->key;
> +
> +       switch (ip_tunnel_info_af(info)) {
> +       case AF_INET:
> +               skb_flow_dissect_set_enc_addr_type(FLOW_DISSECTOR_KEY_IPV4_ADDRS,
> +                                                  flow_dissector,
> +                                                  target_container);
> +               if (dissector_uses_key(flow_dissector,
> +                                      FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS)) {
> +                       struct flow_dissector_key_ipv4_addrs *ipv4;
> +
> +                       ipv4 = skb_flow_dissector_target(flow_dissector,
> +                                                        FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS,
> +                                                        target_container);
> +                       ipv4->src = key->u.ipv4.src;
> +                       ipv4->dst = key->u.ipv4.dst;
> +               }
> +               break;
> +       case AF_INET6:
> +               skb_flow_dissect_set_enc_addr_type(FLOW_DISSECTOR_KEY_IPV6_ADDRS,
> +                                                  flow_dissector,
> +                                                  target_container);
> +               if (dissector_uses_key(flow_dissector,
> +                                      FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS)) {
> +                       struct flow_dissector_key_ipv6_addrs *ipv6;
> +
> +                       ipv6 = skb_flow_dissector_target(flow_dissector,
> +                                                        FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS,
> +                                                        target_container);
> +                       ipv6->src = key->u.ipv6.src;
> +                       ipv6->dst = key->u.ipv6.dst;
> +               }
> +               break;
> +       }
> +
> +       if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) {
> +               struct flow_dissector_key_keyid *keyid;
> +
> +               keyid = skb_flow_dissector_target(flow_dissector,
> +                                                 FLOW_DISSECTOR_KEY_ENC_KEYID,
> +                                                 target_container);
> +               keyid->keyid = tunnel_id_to_key32(key->tun_id);
> +       }
> +
> +       if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_PORTS)) {
> +               struct flow_dissector_key_ports *tp;
> +
> +               tp = skb_flow_dissector_target(flow_dissector,
> +                                              FLOW_DISSECTOR_KEY_ENC_PORTS,
> +                                              target_container);
> +               tp->src = key->tp_src;
> +               tp->dst = key->tp_dst;
> +       }
> +}
> +
>  static enum flow_dissect_ret
>  __skb_flow_dissect_mpls(const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
> @@ -478,6 +575,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>                                               FLOW_DISSECTOR_KEY_BASIC,
>                                               target_container);
>
> +       __skb_flow_dissect_tunnel_info(skb, flow_dissector,
> +                                      target_container);
> +
>         if (dissector_uses_key(flow_dissector,
>                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>                 struct ethhdr *eth = eth_hdr(skb);
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index d230cb4c8094..db831ac708f6 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -152,37 +152,12 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>         struct cls_fl_filter *f;
>         struct fl_flow_key skb_key;
>         struct fl_flow_key skb_mkey;
> -       struct ip_tunnel_info *info;
>
>         if (!atomic_read(&head->ht.nelems))
>                 return -1;
>
>         fl_clear_masked_range(&skb_key, &head->mask);
>
> -       info = skb_tunnel_info(skb);
> -       if (info) {
> -               struct ip_tunnel_key *key = &info->key;
> -
> -               switch (ip_tunnel_info_af(info)) {
> -               case AF_INET:
> -                       skb_key.enc_control.addr_type =
> -                               FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> -                       skb_key.enc_ipv4.src = key->u.ipv4.src;
> -                       skb_key.enc_ipv4.dst = key->u.ipv4.dst;
> -                       break;
> -               case AF_INET6:
> -                       skb_key.enc_control.addr_type =
> -                               FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> -                       skb_key.enc_ipv6.src = key->u.ipv6.src;
> -                       skb_key.enc_ipv6.dst = key->u.ipv6.dst;
> -                       break;
> -               }
> -
> -               skb_key.enc_key_id.keyid = tunnel_id_to_key32(key->tun_id);
> -               skb_key.enc_tp.src = key->tp_src;
> -               skb_key.enc_tp.dst = key->tp_dst;
> -       }
> -
>         skb_key.indev_ifindex = skb->skb_iif;
>         /* skb_flow_dissect() does not set n_proto in case an unknown protocol,
>          * so do it rather here.
> --
> 2.1.4
>

^ permalink raw reply

* Re: [PATCH net] socket, bpf: fix possible use after free
From: Alexei Starovoitov @ 2017-10-02 19:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <1506972051.8061.30.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Oct 02, 2017 at 12:20:51PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Starting from linux-4.4, 3WHS no longer takes the listener lock.
> 
> Since this time, we might hit a use-after-free in sk_filter_charge(),
> if the filter we got in the memcpy() of the listener content
> just happened to be replaced by a thread changing listener BPF filter.
> 
> To fix this, we need to make sure the filter refcount is not already
> zero before incrementing it again.
> 
> Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks for the fix!
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 00/18] use ARRAY_SIZE macro
From: J. Bruce Fields @ 2017-10-02 19:22 UTC (permalink / raw)
  To: Greg KH
  Cc: Jérémy Lefaure, Tobin C. Harding, alsa-devel, nouveau,
	dri-devel, dm-devel, brcm80211-dev-list, devel, linux-scsi,
	linux-rdma, amd-gfx, Jason Gunthorpe, linux-acpi, linux-video,
	intel-wired-lan, linux-media, intel-gfx, ecryptfs, linux-nfs,
	linux-raid, openipmi-developer, intel-gvt-dev, devel,
	brcm80211-dev-list.pdl
In-Reply-To: <20171002053554.GA28743@kroah.com>

On Mon, Oct 02, 2017 at 07:35:54AM +0200, Greg KH wrote:
> On Sun, Oct 01, 2017 at 08:52:20PM -0400, Jérémy Lefaure wrote:
> > On Mon, 2 Oct 2017 09:01:31 +1100
> > "Tobin C. Harding" <me@tobin.cc> wrote:
> > 
> > > > In order to reduce the size of the To: and Cc: lines, each patch of the
> > > > series is sent only to the maintainers and lists concerned by the patch.
> > > > This cover letter is sent to every list concerned by this series.  
> > > 
> > > Why don't you just send individual patches for each subsystem? I'm not a maintainer but I don't see
> > > how any one person is going to be able to apply this whole series, it is making it hard for
> > > maintainers if they have to pick patches out from among the series (if indeed any will bother
> > > doing that).
> > Yeah, maybe it would have been better to send individual patches.
> > 
> > From my point of view it's a series because the patches are related (I
> > did a git format-patch from my local branch). But for the maintainers
> > point of view, they are individual patches.
> 
> And the maintainers view is what matters here, if you wish to get your
> patches reviewed and accepted...

Mainly I'd just like to know which you're asking for.  Do you want me to
apply this, or to ACK it so someone else can?  If it's sent as a series
I tend to assume the latter.

But in this case I'm assuming it's the former, so I'll pick up the nfsd
one....

--b.

^ permalink raw reply

* [PATCH net] socket, bpf: fix possible use after free
From: Eric Dumazet @ 2017-10-02 19:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann

From: Eric Dumazet <edumazet@google.com>

Starting from linux-4.4, 3WHS no longer takes the listener lock.

Since this time, we might hit a use-after-free in sk_filter_charge(),
if the filter we got in the memcpy() of the listener content
just happened to be replaced by a thread changing listener BPF filter.

To fix this, we need to make sure the filter refcount is not already
zero before incrementing it again.

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/filter.c |   12 ++++++++----
 net/core/sock.c   |    5 ++++-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 82edad58d066857aeee562661098effa3b3e6961..74b8c91fb5f4461da58c73568976bc9834c4612b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -989,10 +989,14 @@ static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
 
 bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
 {
-	bool ret = __sk_filter_charge(sk, fp);
-	if (ret)
-		refcount_inc(&fp->refcnt);
-	return ret;
+	if (!refcount_inc_not_zero(&fp->refcnt))
+		return false;
+
+	if (!__sk_filter_charge(sk, fp)) {
+		sk_filter_release(fp);
+		return false;
+	}
+	return true;
 }
 
 static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7d55c05f449d306b43fd850a61ce8ffd44174f7f..23953b741a41fbcf4a6ffb0dd5bf05bd5266b99d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1684,13 +1684,16 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 
 		sock_reset_flag(newsk, SOCK_DONE);
 
-		filter = rcu_dereference_protected(newsk->sk_filter, 1);
+		rcu_read_lock();
+		filter = rcu_dereference(sk->sk_filter);
 		if (filter != NULL)
 			/* though it's an empty new sock, the charging may fail
 			 * if sysctl_optmem_max was changed between creation of
 			 * original socket and cloning
 			 */
 			is_charged = sk_filter_charge(newsk, filter);
+		RCU_INIT_POINTER(newsk->sk_filter, filter);
+		rcu_read_unlock();
 
 		if (unlikely(!is_charged || xfrm_sk_clone_policy(newsk, sk))) {
 			/* We need to make sure that we don't uncharge the new

^ permalink raw reply related

* Re: [PATCH net-next v11] openvswitch: enable NSH support
From: Jiri Benc @ 2017-10-02 19:13 UTC (permalink / raw)
  To: Yi Yang
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA, e,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1506668610-18505-1-git-send-email-yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Fri, 29 Sep 2017 15:03:30 +0800, Yi Yang wrote:
> --- a/include/net/nsh.h
> +++ b/include/net/nsh.h
> @@ -304,4 +304,7 @@ static inline void nsh_set_flags_ttl_len(struct nshhdr *nsh, u8 flags,
>  			NSH_FLAGS_MASK | NSH_TTL_MASK | NSH_LEN_MASK);
>  }
>  
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *nh);

[...]
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)

This is minor but since this patch will need a respin anyway, please
name the variables in the forward declaration and here the same.

> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	int err;
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);

Bad name of the variable, clashes with the nsh_hdr function. I pointed
that out already.

> +	size_t length;
> +	__be16 inner_proto;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				       sizeof(struct nshhdr));

You assume that the skb starts at the NSH header, thus the
skb_network_offset is completely unnecessary and introduces just
another assumption on the caller. Also, the sizeof(struct nshhdr) is
wrong: there's no guarantee that the header is not smaller or larger
than that.

More importantly though, why do you need skb_ensure_writable? You don't
write into the header. pkskb_may_pull is enough.

	if (!pskb_may_pull(skb, NSH_BASE_HDR_LEN))
		return -ENOMEM;
	length = nsh_hdr_len(nsh_hdr);
	if (!pskb_may_pull(skb, length))
		return -ENOMEM;

> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nh;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				       sizeof(struct nshhdr));

I missed this before: this is wrong, too. You need to use the real
header size, not sizeof(struct nshhdr). It should be computable from
the flow key.

> +		case OVS_ACTION_ATTR_PUSH_NSH: {
> +			u8 buffer[NSH_HDR_MAX_LEN];
> +			struct nshhdr *nh = (struct nshhdr *)buffer;
> +
> +			nsh_hdr_from_nlattr(nla_data(a), nh,
> +					    NSH_HDR_MAX_LEN);
> +			err = push_nsh(skb, key, (const struct nshhdr *)nh);

Is the cast to const really needed? It looks suspicious. If you added it
because a compiler complained, it's even more suspicious.

> +static int parse_nsh(struct sk_buff *skb, struct sw_flow_key *key)
> +{
> +	struct nshhdr *nh;
> +	unsigned int nh_ofs = skb_network_offset(skb);
> +	u8 version, length;
> +	int err;
> +
> +	err = check_header(skb, nh_ofs + NSH_BASE_HDR_LEN);
> +	if (unlikely(err))
> +		return err;
> +
> +	nh = nsh_hdr(skb);
> +	version = nsh_get_ver(nh);
> +	length = nsh_hdr_len(nh);
> +
> +	if (version != 0)
> +		return -EINVAL;
> +
> +	err = check_header(skb, nh_ofs + length);
> +	if (unlikely(err))
> +		return err;
> +
> +	nh = (struct nshhdr *)skb_network_header(skb);

I really really really hate this. This is the third time I'm telling
you to use the nsh_hdr function. Every time, you change only part of
the places. And this one I even explicitly pointed out in the previous
review.

I'm not supposed to look at my previous review to verify that you
addressed everything. That's your responsibility. Yet I need to do it
because every time, some of my comments remain unaddressed.

> +int nsh_hdr_from_nlattr(const struct nlattr *attr,
> +			struct nshhdr *nh, size_t size)
> +{
> +	struct nlattr *a;
> +	int rem;
> +	u8 flags = 0;
> +	u8 ttl = 0;
> +	int mdlen = 0;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base = nla_data(a);
> +
> +			flags = base->flags;
> +			ttl = base->ttl;
> +			nh->np = base->np;
> +			nh->mdtype = base->mdtype;
> +			nh->path_hdr = base->path_hdr;
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD1: {
> +			const struct ovs_nsh_key_md1 *md1 = nla_data(a);
> +
> +			mdlen = nla_len(a);
> +			memcpy(&nh->md1, md1, mdlen);
> +			break;

Looks better. Why not simplify it even more?

		case OVS_NSH_KEY_ATTR_MD1:
			mdlen = nla_len(a);
			memcpy(&nh->md1, nla_data(a), mdlen);
			break;

It's still perfectly readable this way and there's no need for the
braces.

> +		}
> +		case OVS_NSH_KEY_ATTR_MD2: {
> +			const struct u8 *md2 = nla_data(a);
> +
> +			mdlen = nla_len(a);
> +			memcpy(&nh->md2, md2, mdlen);

And here, too.

> +int nsh_key_from_nlattr(const struct nlattr *attr,
> +			struct ovs_key_nsh *nsh, struct ovs_key_nsh *nsh_mask)
> +{
> +	struct nlattr *a;
> +	int rem;
> +
> +	/* validate_nsh has check this, so we needn't do duplicate check here
> +	 */
> +	nla_for_each_nested(a, attr, rem) {
> +		int type = nla_type(a);
> +
> +		switch (type) {
> +		case OVS_NSH_KEY_ATTR_BASE: {
> +			const struct ovs_nsh_key_base *base = nla_data(a);
> +			const struct ovs_nsh_key_base *base_mask = base + 1;
> +
> +			nsh->base = *base;
> +			nsh_mask->base = *base_mask;
> +			break;
> +		}
> +		case OVS_NSH_KEY_ATTR_MD1: {
> +			const struct ovs_nsh_key_md1 *md1 =
> +				(struct ovs_nsh_key_md1 *)nla_data(a);

I'm speechless.

Yes, I don't like the line above. For a reason that I already pointed
out.

I expected more of this version.

 Jiri

^ permalink raw reply

* Re: [net-next 00/13][pull request] 100GbE Intel Wired LAN Driver Updates 2017-10-02
From: David Miller @ 2017-10-02 18:59 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20171002154236.84043-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon,  2 Oct 2017 08:42:23 -0700

> This series contains updates to fm10k only.
> 
> Jake provides all but one of the changes in this series.  Most are small
> fixes, starting with ensuring prompt transmission of messages queued up
> after each VF message is received and handled.  Fix a possible race
> condition between the watchdog task and the processing of mailbox
> messages by just checking whether the mailbox is still open.  Fix a
> couple of GCC v7 warnings, including misspelled "fall through" comments
> and warnings about possible truncation of calls to snprintf().  Cleaned
> up a convoluted bitshift and read for the PFVFLRE register.  Fixed a
> potential divide by zero when finding the proper r_idx.
> 
> Markus Elfring fixes an issue which was found using Coccinelle, where
> we should have been using seq_putc() instead of seq_puts().
> 
> The following are changes since commit 0929567a7a2dab8455a7313956973ff0d339709a:
>   samples/bpf: fix warnings in xdp_monitor_user
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Levi Pearson @ 2017-10-02 18:45 UTC (permalink / raw)
  To: Rodney Cummings
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran, jesus.sanchez-palencia,
	andre.guedes
In-Reply-To: <DM2PR0401MB1389A950EAC6FB54D4186B9C927E0@DM2PR0401MB1389.namprd04.prod.outlook.com>

Hi Rodney,

Some archives seem to have threaded it, but I have CC'd the
participants I saw in the original discussion thread since they may
not otherwise notice it amongst the normal traffic.

On Fri, Sep 29, 2017 at 2:44 PM, Rodney Cummings <rodney.cummings@ni.com> wrote:
> Hi,
>
> I am posting my reply to this thread after subscribing, so I apologize
> if the archive happens to attach it to the wrong thread.
>
> First, I'd like to say that I strongly support this RFC.
> We need Linux interfaces for IEEE 802.1 TSN features.
>
> Although I haven't looked in detail, the proposal for CBS looks good.
> My questions/concerns are more related to future work, such for 802.1Qbv
> (scheduled traffic).
>
> 1. Question: From an 802.1 perspective, is this RFC intended to support
> end-station (e.g. NIC in host), bridges (i.e. DSA), or both?
>
> This is very important to clarify, because the usage of this interface
> will be very different for one or the other.
>
> For a bridge, the user code typically represents a remote management
> protocol (e.g. SNMP, NETCONF, RESTCONF), and this interface is
> expected to align with the specifications of 802.1Q clause 12,
> which serves as the information model for management. Historically,
> a standard kernel interface for management hasn't been viewed as
> essential, but I suppose it wouldn't hurt.

I don't think the proposal was meant to cover the case of non-local
switch hardware, but in addition to dsa and switchdev switch ICs
managed by embedded Linux-running SoCs, there are SoCs with embedded
small port count switches or even plain multiple NICs with software
bridging. Many of these embedded small port count switches have FQTSS
hardware that could potentially be configured by the proposed cbs
qdisc. This blurs the line somewhat between what is a "bridge" and
what is an "end-station" in 802.1Q terminology, but nevertheless these
devices exist, sometimes acting as an endpoint + a real bridge and
sometimes as just a system with multiple network interfaces.

> For an end station, the user code can be an implementation of SRP
> (802.1Q clause 35), or it can be an application-specific
> protocol (e.g. industrial fieldbus) that exchanges data according
> to P802.1Qcc clause 46. Either way, the top-level user interface
> is designed for individual streams, not queues and shapers. That
> implies some translation code between that top-level interface
> and this sort of kernel interface.
>
> As a specific end-station example, for CBS, 802.1Q-2014 subclause
> 34.6.1 requires "per-stream queues" in the Talker end-station.
> I don't see 34.6.1 represented in the proposed RFC, but that's
> okay... maybe per-stream queues are implemented in user code.
> Nevertheless, if that is the assumption, I think we need to
> clarify, especially in examples.

You're correct that the FQTSS credit-based shaping algorithm requires
per-stream shaping by Talker endpoints as well, but this is in
addition to the per-class shaping provided by most hardware shaping
implementations that I'm aware of in endpoint network hardware. I
agree that we need to document the need to provide this, but it can
definitely be built on top of the current proposal.

I believe the per-stream shaping could be managed either by a user
space application that manages all use of a streaming traffic class,
or through an additional qdisc module that performs per-stream
management on top of the proposed cbs qdisc, ensuring that the
frames-per-observation interval aspect of each stream's reservation is
obeyed. This becomes a fairly simple qdisc to implement on top of a
per-traffic class shaper, and could even be implemented with the help
of the timestamp that the SO_TXTIME proposal adds to skbuffs, but I
think keeping the layers separate provides more flexibility to
implementations and keeps management of various kinds of hardware
offload support simpler as well.

> 2. Suggestion: Do not assume that a time-aware (i.e. scheduled)
> end-station will always use 802.1Qbv.
>
> For those who are subscribed to the 802.1 mailing list,
> I'd suggest a read of draft P802.1Qcc/D1.6, subclause U.1
> of Annex U. Subclause U.1 assumes that bridges in the network use
> 802.1Qbv, and then it poses the question of what an end-station
> Talker should do. If the end-station also uses 802.1Qbv,
> and that end-station transmits multiple streams, 802.1Qbv is
> a bad implementation. The reason is that the scheduling
> (i.e. order in time) of each stream cannot be controlled, which
> in turn means that the CNC (network manager) cannot optimize
> the 802.1Qbv schedules in bridges. The preferred technique
> is to use "per-stream scheduling" in each Talker, so that
> the CNC can create an optimal schedules (i.e. best determinism).
>
> I'm aware of a small number of proprietary CNC implementations for
> 802.1Qbv in bridges, and they are generally assuming per-stream
> scheduling in end-stations (Talkers).
>
> The i210 NIC's LaunchTime can be used to implement per-stream
> scheduling. I haven't looked at SO_TXTIME in detail, but it sounds
> like per-stream scheduling. If so, then we already have the
> fundamental building blocks for a complete implementation
> of a time-aware end-station.
>
> If we answer the preceding question #1 as "end-station only",
> I would recommend avoiding 802.1Qbv in this interface. There
> isn't really anything wrong with it per-se, but it would lead
> developers down the wrong path.

In some situations, such as device nodes that each incorporate a small
port count switch for the purpose of daisy-chaining a segment of the
network, "end stations" must do a limited subset of local bridge
management as well. I'm not sure how common this is going to be for
industrial control applications, but I know there are audio and
automotive applications built this way.

One particular device I am working with now provides all network
access through a DSA switch chip with hardware Qbv support in addtion
to hardware Qav support. The SoC attached to it has no hardware timed
launch (SO_TXTIME) support. In this case, although the proposed
interface for Qbv is not *sufficient* to make a working time-aware end
station, it does provide a usable building block to provide one. As
with the credit-based shaping system, Talkers must provide an
additional level of per-stream shaping as well, but this is largely
(absent the jitter calculations, which are sort of a middle-level
concern) independent of what sort of hardware offload of the
scheduling is provided.

Both Qbv windows and timed launch support do roughly the same thing;
they *delay* the launch of a hardware-queued frame so it can egress at
a precisely specified time, and at least with the i210 and Qbv, ensure
that no other traffic will be in-progress when that time arrives. For
either to be used effectively, the application still has to prepare
the frame slightly ahead-of-time and thus must have the same level of
time-awareness. This is, again, largely independent of what kind of
hardware offloading support is provided and is also largely
independent of the network stack itself. Neither queue window
management nor SO_TXTIME help the application present its
time-sensitive traffic at the right time; that's a matter to be worked
out with the application taking advantage of PTP and the OS scheduler.
Whether you rely on managed windows or hardware launch time to provide
the precisely correct amount of delay beyond that is immaterial to the
application. In the absence of SO_TXTIME offloading (or even with it,
and in the presence of sufficient OS scheduling jitter), an additional
layer may need to be provided to ensure different applications' frames
are queued in the correct order for egress during the window. Again,
this could be a purely user-space application multiplexer or a
separate qdisc module.

I wholeheartedly agree with you and Richard that we ought to
eventually provide application-level APIs that don't require users to
have deep knowledge of various 802.1Q intricacies. But I believe that
the hardware offloading capability being provided now, and the variety
of the way things are hooked up in real hardware, suggests that we
ought to also build the support for the underlying protocols in layers
so that we don't create unnecessary mismatches between offloading
capability (which can be essential to overall network performance) and
APIs, such that one configuration of offload support is privileged
above others even when comparable scheduling accuracy could be
provided by either.

In any case, only the cbs qdisc has been included in the post-RFC
patch cover page for its last couple of iterations, so there is plenty
of time to discuss how time-aware shaping, preemption, etc. management
should occur beyond the cbs and SO_TXTIME proposals.

Levi

^ permalink raw reply

* Re: Fw: [Bug 197099] New: Kernel panic in interrupt [l2tp_ppp]
From: SviMik @ 2017-10-02 18:35 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1506952566.8061.3.camel@edumazet-glaptop3.roam.corp.google.com>

Hi, James!

No, I'm suffering from kernel panics since I started using 4.x
kernels. See my current collection:
http://svimik.com/hdmmsk1kp1.png
http://svimik.com/hdmmsk2kp2.png
http://svimik.com/hdmmsk2kp3.png
http://svimik.com/hdmmsk2kp4.png
http://svimik.com/hdmmsk2kp5.png
http://svimik.com/hdmmsk7kp1.png

Screenshots are from three different machines, kernels from 4.8.13 to 4.13.4.

2017-10-02 16:56 GMT+03:00 Eric Dumazet <eric.dumazet@gmail.com>:
> CC svimik@gmail.com so that he is aware of this netdev thread.
>
> On Mon, 2017-10-02 at 14:32 +0100, James Chapman wrote:
>> This seems to be a NULL pointer exception caused by tunnel->sock being
>> NULL at the call to bh_lock_sock() in l2tp_xmit_skb() at
>> l2tp_core.c:1135.
>>
>> tunnel->sock is set NULL in l2tp_core's tunnel socket destructor.
>>
>> At the moment, I don't understand how this happens because
>> pppol2tp_xmit() does a sock_hold() on the tunnel socket before
>> l2tp_xmit_skb() is called. I'm still looking at this.
>>
>> Has this problem only recently started happening?
>>
>>
>>
>>
>>
>> On 1 October 2017 at 18:21, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> >
>> >
>> > Begin forwarded message:
>> >
>> > Date: Sun, 01 Oct 2017 16:22:33 +0000
>> > From: bugzilla-daemon@bugzilla.kernel.org
>> > To: stephen@networkplumber.org
>> > Subject: [Bug 197099] New: Kernel panic in interrupt [l2tp_ppp]
>> >
>> >
>> > https://bugzilla.kernel.org/show_bug.cgi?id=197099
>> >
>> >             Bug ID: 197099
>> >            Summary: Kernel panic in interrupt [l2tp_ppp]
>> >            Product: Networking
>> >            Version: 2.5
>> >     Kernel Version: 4.8.13-1.el6.elrepo.x86_64
>> >           Hardware: x86-64
>> >                 OS: Linux
>> >               Tree: Mainline
>> >             Status: NEW
>> >           Severity: normal
>> >           Priority: P1
>> >          Component: Other
>> >           Assignee: stephen@networkplumber.org
>> >           Reporter: svimik@gmail.com
>> >         Regression: No
>> >
>> > Created attachment 258685
>> >   --> https://bugzilla.kernel.org/attachment.cgi?id=258685&action=edit
>> > stacktrace screenshot
>> >
>> > Hello!
>> >
>> > Getting kernel panics on multiple servers. Since it mentions l2tp_core,
>> > l2tp_ppp and ppp_generic, I decided to report it to Networking (correct me if
>> > I'm wrong).
>> >
>> > Unfortunately I'm still struggling with making kdump work, so the trace
>> > screenshot is all I have at this moment. The only hope is that this stacktrace
>> > means something to the guys that wrote the code.
>> >
>> > --
>> > You are receiving this mail because:
>> > You are the assignee for the bug.
>
>

^ permalink raw reply

* Re: [PATCH v3 00/19] Thunderbolt networking
From: David Miller @ 2017-10-02 18:25 UTC (permalink / raw)
  To: mika.westerberg
  Cc: gregkh, andreas.noever, michael.jamet, yehezkel.bernat,
	amir.jer.levy, Mario.Limonciello, lukas, andriy.shevchenko,
	andrew, netdev, linux-kernel
In-Reply-To: <20171002103846.64602-1-mika.westerberg@linux.intel.com>

From: Mika Westerberg <mika.westerberg@linux.intel.com>
Date: Mon,  2 Oct 2017 13:38:27 +0300

> In addition of tunneling PCIe, Display Port and USB traffic, Thunderbolt
> allows connecting two hosts (domains) over a Thunderbolt cable. It is
> possible to tunnel arbitrary data packets over such connection using
> high-speed DMA rings available in the Thunderbolt host controller.

Series applied to net-next, thanks!

^ permalink raw reply

* Re: [patch net-next 0/2] mlxsw: Fixlets
From: David Miller @ 2017-10-02 18:20 UTC (permalink / raw)
  To: jiri; +Cc: netdev, petrm, idosch, mlxsw
In-Reply-To: <20171002102158.2443-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Mon,  2 Oct 2017 12:21:56 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Couple of small nit fixes from Petr

Series applied.

^ permalink raw reply

* Re: [patch net 0/2] mlxsw: Fixes in GRE offloading
From: David Miller @ 2017-10-02 18:19 UTC (permalink / raw)
  To: jiri; +Cc: netdev, petrm, idosch, mlxsw
In-Reply-To: <20171002101457.1462-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Mon,  2 Oct 2017 12:14:55 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Petr says:
> 
> This patchset fixes a couple unrelated problems in offloading IP-in-IP tunnels
> in mlxsw driver.
> 
> - The first patch fixes a potential reference-counting problem that might lead
>   to a kernel crash.
> 
> - The second patch associates IPIP next hops with their loopback RIFs. Besides
>   being the right thing to do, it also fixes a problem where offloaded IPv6
>   routes that forward to IP-in-IP netdevices were not flagged as such.

Series applied.

^ permalink raw reply

* Re: [RFC net-next 1/5] net: dsa: Add infrastructure to support LAG
From: Florian Fainelli @ 2017-10-02 18:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, vivien.didelot, jiri, idosch, Woojung.Huh, john,
	sean.wang
In-Reply-To: <20171002020327.GA21593@lunn.ch>

On 10/01/2017 07:03 PM, Andrew Lunn wrote:
> On Sun, Oct 01, 2017 at 12:46:35PM -0700, Florian Fainelli wrote:
>> Add the necessary logic to support network device events targetting LAG events,
>> this is loosely inspired from mlxsw/spectrum.c.
>>
>> In the process we change dsa_slave_changeupper() to be more generic and be called
>> from both LAG events as well as normal bridge enslaving events paths.
>>
>> The DSA layer takes care of managing the LAG group identifiers, how many LAGs
>> may be supported by a switch, and how many members per LAG are supported by a
>> switch device. When a LAG group is identified, the port is then configured to
>> be a part of that group. When a LAG group no longer has any users, we remove it
>> and we tell the drivers whether it is safe to disable trunking altogether.
>>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>>  include/net/dsa.h  |  25 +++++++++
>>  net/dsa/dsa2.c     |  12 ++++
>>  net/dsa/dsa_priv.h |   7 +++
>>  net/dsa/port.c     |  92 +++++++++++++++++++++++++++++++
>>  net/dsa/slave.c    | 157 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>  net/dsa/switch.c   |  30 ++++++++++
>>  6 files changed, 312 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> index 10dceccd9ce8..247ea58add68 100644
>> --- a/include/net/dsa.h
>> +++ b/include/net/dsa.h
>> @@ -182,12 +182,20 @@ struct dsa_port {
>>  	u8			stp_state;
>>  	struct net_device	*bridge_dev;
>>  	struct devlink_port	devlink_port;
>> +	u8			lag_id;
>> +	bool			lagged;
>>  	/*
>>  	 * Original copy of the master netdev ethtool_ops
>>  	 */
>>  	const struct ethtool_ops *orig_ethtool_ops;
>>  };
>>  
>> +struct dsa_lag_group {
>> +	/* Used to know when we can disable lag on the switch */
>> +	unsigned int		ref_count;
> 
> Hi Florian
> 
> In what contexts is ref_count manipulated. Normally you use would
> refcounf_t and the operations in linux/refcount.h. But if you know
> there is some other protection, e.g. rtnl, an unsigned int is O.K.
> Maybe scatter some assert_RTNL() in the code?

Hi Andrew,

This is called with rtnl held, but this is a good point. In fact, I
don't think we need the reference count at all, what I am going to
propose now is that we just maintain a bitmask of port members per lag
group (along with the reference to the lag device) and when the hamming
weight of that bitmask is 1, that means we were removing the lat port of
the LAG group and we can stop using that LAG group. This also allow us
to remove the port_lag_member operation, since we would be maintaining
that at the DSA layer now.

> 
>> +static bool dsa_slave_lag_check(struct net_device *dev, struct net_device *lag_dev,
>> +				struct netdev_lag_upper_info *lag_upper_info)
>> +{
>> +	struct dsa_slave_priv *p = netdev_priv(dev);
>> +	u8 lag_id;
>> +
>> +	/* No more lag identifiers available or already in use */
>> +	if (dsa_switch_lag_get_index(p->dp->ds, lag_dev, &lag_id) != 0)
>> +		return false;
>> +
>> +	if (lag_upper_info->tx_type != NETDEV_LAG_TX_TYPE_HASH)
>> +		return false;
> 
> I wounder if the driver needs to decide this? Can different hardware
> support different tx_types?

That is a valid point. For instance, the b53/bcm_sf2 switches can only
do MAC DA and SA, SA only, DA only hashing, but you can't do hashing at
a higher level than L2 addresses, this does appear to be something that
the driver should indeed decide.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] selftests: rtnetlink.sh: add vxlan and fou test cases
From: David Miller @ 2017-10-02 18:15 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20171002100529.602-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Mon,  2 Oct 2017 12:05:29 +0200

> fou test lifted from ip-fou man page.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>

I love seeing new testcases ;-)

Applied, thanks.

^ permalink raw reply

* Re: [kernel-hardening] [PATCH 0/2] capability controlled user-namespaces
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-10-02 18:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Mahesh Bandewar, LKML, Netdev, Kernel-hardening, Linux API,
	Kees Cook, Eric W . Biederman, Eric Dumazet, David Miller
In-Reply-To: <20171002171410.GA19611-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>

On Mon, Oct 2, 2017 at 10:14 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Mahesh Bandewar (mahesh-bmGAjcP2qsnk1uMJSBkQmQ@public.gmane.org):
>> From: Mahesh Bandewar <maheshb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>
>> [Same as the previous RFC series sent on 9/21]
>>
>> TL;DR version
>> -------------
>> Creating a sandbox environment with namespaces is challenging
>> considering what these sandboxed processes can engage into. e.g.
>> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
>> Current form of user-namespaces, however, if changed a bit can allow
>> us to create a sandbox environment without locking down user-
>> namespaces.
>>
>> Detailed version
>> ----------------
>
> Hi,
>
> still struggling with how I feel about the idea in general.
>
> So is the intent mainly that if/when there comes an 0-day which allows
> users with CAP_NET_ADMIN in any namespace to gain privilege on the host,
> then this can be used as a stop-gap measure until there is a proper fix?
>
Thank for looking at this Serge.

Yes, but at the same time it's not just limited to NET_ADMIN but could
be any of the current capabilities.

> Otherwise, do you have any guidance for how people should use this?
>
> IMO it should be heavily discouraged to use this tool as a regular
> day to day configuration, as I'm not sure there is any "educated"
> decision to be made, even by those who are in the know, about what
> to put in this set.
>
I think that really depends on the environment. e.g. in certain
sandboxes third-part / semi-trusted workload is executed where network
resource is not used. In that environment I can easily take off
NET_ADMIN and NET_RAW without affecting anything there. At the same
time I wont have to worry about 0-day related to these two
capabilities. I would say the Admins at these places are in the best
place to decide what they can take-off safely and what they cannot.
Even if they decide not to take-off anything, having a tool at hand to
gain control is important when the next 0-day strikes us that can be
exploited using any of the currently used capabilities.

However, you are absolutely right in terms of using it as a stop-gap
measure to protect environment until it's fixed and the capability in
question can not be safely taken off permanently without hampering
operations.

thanks,
--mahesh..

[...]

^ permalink raw reply

* Re: [PATCH net-next 0/2] flow_dissector: dissect tunnel info
From: David Miller @ 2017-10-02 18:06 UTC (permalink / raw)
  To: simon.horman; +Cc: jiri, jhs, xiyou.wangcong, netdev, oss-drivers
In-Reply-To: <1506933676-20121-1-git-send-email-simon.horman@netronome.com>

From: Simon Horman <simon.horman@netronome.com>
Date: Mon,  2 Oct 2017 10:41:14 +0200

> Move dissection of tunnel info from the flower classifier to the flow
> dissector where all other dissection occurs.  This should not have any
> behavioural affect on other users of the flow dissector.

Series applied, thanks Simon.

^ permalink raw reply

* Re: [PATCH net-next 0/3] bridge: neigh msg proxy and flood suppression support
From: Stephen Hemminger @ 2017-10-02 18:02 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Nikolay Aleksandrov,
	bridge
In-Reply-To: <CAJieiUim2XLMGAomb3S5AeWfYqjxV_raetedWcA_PBiaGPRHWg@mail.gmail.com>

On Mon, 2 Oct 2017 07:49:09 -0700
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On Sun, Oct 1, 2017 at 9:36 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> > From: Roopa Prabhu <roopa@cumulusnetworks.com>
> >
> > This series implements arp and nd suppression in the bridge
> > driver for ethernet vpns. It implements rfc7432, section 10
> > https://tools.ietf.org/html/rfc7432#section-10
> > for ethernet VPN deployments. It is similar to the existing
> > BR_ARP_PROXY flag but has a few semantic differences to conform
> > to EVPN standard. In case of EVPN, it is mainly used to avoid flooding to
> > tunnel ports like vxlan/mpls. Unlike the existing flags it suppresses flood
> > of all neigh discovery packets (arp, nd) to tunnel ports.
> >
> > Roopa Prabhu (3):
> >   bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd
> >     flood
> >   neigh arp suppress first
> >   bridge: suppress nd messages from going to BR_NEIGH_SUPPRESS ports
> >  
> 
> pls ignore, shows conflict applying over recent net-next bridge
> changes. Will rebase and submit v2.

Ok, but the concept looks good.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox