Netdev List


* [net-next 04/11] i40evf: Add support for 10G base T parts
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Paul M Stillwell Jr, netdev, nhorman, sassmann, jogreene,
	Patrick Lu, Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>

Add 10G-Base-T support in i40evf.

Change-ID: I98a1c3138d7d6572fe7903a7c1c4692cae3260d5
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40evf/i40e_common.c | 1 +
 drivers/net/ethernet/intel/i40evf/i40e_type.h   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c b/drivers/net/ethernet/intel/i40evf/i40e_common.c
index 9525605..28c40c5 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c
@@ -50,6 +50,7 @@ i40e_status i40e_set_mac_type(struct i40e_hw *hw)
 		case I40E_DEV_ID_QSFP_A:
 		case I40E_DEV_ID_QSFP_B:
 		case I40E_DEV_ID_QSFP_C:
+		case I40E_DEV_ID_10G_BASE_T:
 			hw->mac.type = I40E_MAC_XL710;
 			break;
 		case I40E_DEV_ID_VF:
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index 1537643..8fe34fc 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -43,6 +43,7 @@
 #define I40E_DEV_ID_QSFP_A		0x1583
 #define I40E_DEV_ID_QSFP_B		0x1584
 #define I40E_DEV_ID_QSFP_C		0x1585
+#define I40E_DEV_ID_10G_BASE_T		0x1586
 #define I40E_DEV_ID_VF		0x154C
 #define I40E_DEV_ID_VF_HV		0x1571
 
-- 
1.9.3

* [net-next 02/11] i40evf: properly handle multiple AQ messages
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mitch Williams <mitch.a.williams@intel.com>

When we receive an admin queue message, the msg_size field in the event
struct gets overwritten. Because of this, we need to reinit the field
each time we go through the loop. Without this we may receive truncated
messages due to the firmware thinking we have insufficient buffer size.

Change-ID: I21dcca5114d91365d731169965ce3ffec0e4a190
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index dabe6a4..b2f01eb 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1647,10 +1647,8 @@ static void i40evf_adminq_task(struct work_struct *work)
 					   v_msg->v_retval, event.msg_buf,
 					   event.msg_size);
 		if (pending != 0) {
-			dev_info(&adapter->pdev->dev,
-				 "%s: ARQ: Pending events %d\n",
-				 __func__, pending);
 			memset(event.msg_buf, 0, I40EVF_MAX_AQ_BUF_SIZE);
+			event.msg_size = I40EVF_MAX_AQ_BUF_SIZE;
 		}
 	} while (pending);
 
-- 
1.9.3
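
The truncation described in the commit message can be illustrated outside the driver. The following is a minimal user-space sketch, not the i40evf code: `aq_receive()`, `struct aq_event` and `MAX_AQ_BUF_SIZE` are hypothetical stand-ins, and the receive call is assumed to overwrite `msg_size` with the number of bytes actually delivered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_AQ_BUF_SIZE 64	/* hypothetical stand-in for I40EVF_MAX_AQ_BUF_SIZE */

struct aq_event {
	size_t msg_size;	/* in: buffer capacity, out: bytes received */
	unsigned char msg_buf[MAX_AQ_BUF_SIZE];
};

/* Hypothetical receive call: copies at most ev->msg_size bytes, then
 * overwrites ev->msg_size with the number of bytes copied.
 */
static size_t aq_receive(struct aq_event *ev, const unsigned char *msg,
			 size_t len)
{
	size_t n = len < ev->msg_size ? len : ev->msg_size;

	memcpy(ev->msg_buf, msg, n);
	ev->msg_size = n;
	return n;
}

/* Receive a short message, then a longer one, and return how many bytes
 * of the second message arrived. With reinit == 0 the stale msg_size from
 * the first message caps the second one.
 */
static size_t second_message_len(int reinit)
{
	static const unsigned char short_msg[8];
	static const unsigned char long_msg[32];
	struct aq_event ev = { .msg_size = MAX_AQ_BUF_SIZE };

	aq_receive(&ev, short_msg, sizeof(short_msg));
	memset(ev.msg_buf, 0, MAX_AQ_BUF_SIZE);
	if (reinit)
		ev.msg_size = MAX_AQ_BUF_SIZE;
	return aq_receive(&ev, long_msg, sizeof(long_msg));
}
```

In this sketch the second message is cut to 8 bytes unless the size field is reset before the next pass, which is the behaviour the one-line addition in the hunk guards against.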

* [net-next 05/11] i40e: avoid disable of interrupt when changing ITR
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Jesse Brandeburg, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

The call to irq_dynamic_disable was turning off the interrupt completely
when trying to set ITR to 0 (for lowest moderation).  Just remove the
call as setting the values to 0 later in this function will suffice.

Change-ID: I47caf1ecbe65653cf63ec833db93094cd83fd84d
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-By: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 12adc08..b6e745f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1574,7 +1574,6 @@ static int i40e_set_coalesce(struct net_device *netdev,
 		vsi->rx_itr_setting = ec->rx_coalesce_usecs;
 	} else if (ec->rx_coalesce_usecs == 0) {
 		vsi->rx_itr_setting = ec->rx_coalesce_usecs;
-		i40e_irq_dynamic_disable(vsi, vector);
 		if (ec->use_adaptive_rx_coalesce)
 			netif_info(pf, drv, netdev,
 				   "Rx-secs=0, need to disable adaptive-Rx for a complete disable\n");
@@ -1589,7 +1588,6 @@ static int i40e_set_coalesce(struct net_device *netdev,
 		vsi->tx_itr_setting = ec->tx_coalesce_usecs;
 	} else if (ec->tx_coalesce_usecs == 0) {
 		vsi->tx_itr_setting = ec->tx_coalesce_usecs;
-		i40e_irq_dynamic_disable(vsi, vector);
 		if (ec->use_adaptive_tx_coalesce)
 			netif_info(pf, drv, netdev,
 				   "Tx-secs=0, need to disable adaptive-Tx for a complete disable\n");
-- 
1.9.3

* [net-next 08/11] i40e: better wording for resource tracking errors
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@intel.com>

Tweak and homogenize the error reporting for get_lump() resource
tracking errors.

Change-ID: I11330161cc6ad8d04371c499c63071c816171c3b
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 83fee7f..6a481bf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7957,8 +7957,8 @@ static int i40e_vsi_setup_vectors(struct i40e_vsi *vsi)
 						 vsi->num_q_vectors, vsi->idx);
 	if (vsi->base_vector < 0) {
 		dev_info(&pf->pdev->dev,
-			 "failed to get queue tracking for VSI %d, err=%d\n",
-			 vsi->seid, vsi->base_vector);
+			 "failed to get tracking for %d vectors for VSI %d, err=%d\n",
+			 vsi->num_q_vectors, vsi->seid, vsi->base_vector);
 		i40e_vsi_free_q_vectors(vsi);
 		ret = -ENOENT;
 		goto vector_setup_out;
@@ -7994,8 +7994,9 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 
 	ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs, vsi->idx);
 	if (ret < 0) {
-		dev_info(&pf->pdev->dev, "VSI %d get_lump failed %d\n",
-			 vsi->seid, ret);
+		dev_info(&pf->pdev->dev,
+			 "failed to get tracking for %d queues for VSI %d err=%d\n",
+			 vsi->alloc_queue_pairs, vsi->seid, ret);
 		goto err_vsi;
 	}
 	vsi->base_queue = ret;
@@ -8124,8 +8125,9 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
 	ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs,
 				vsi->idx);
 	if (ret < 0) {
-		dev_info(&pf->pdev->dev, "VSI %d get_lump failed %d\n",
-			 vsi->seid, ret);
+		dev_info(&pf->pdev->dev,
+			 "failed to get tracking for %d queues for VSI %d err=%d\n",
+			 vsi->alloc_queue_pairs, vsi->seid, ret);
 		goto err_vsi;
 	}
 	vsi->base_queue = ret;
-- 
1.9.3

* [net-next 06/11] i40e: remove debugfs dump stats
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@intel.com>

The debugfs dump stats wasn't being kept up-to-date, was redundant with
the ethtool output, and didn't offer any useful additional info.  Rather
than continue trying to keep them aligned, just remove the debugfs command.

Change-ID: Id130ed9aef01c6369ab662c7b4c5ec5b1dbc5b40
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <Jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 93 +-------------------------
 1 file changed, 2 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 7067f4b..a03f459 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -895,90 +895,6 @@ static void i40e_dbg_dump_eth_stats(struct i40e_pf *pf,
 }
 
 /**
- * i40e_dbg_dump_stats - handles dump stats write into command datum
- * @pf: the i40e_pf created in command write
- * @stats: the stats structure to be dumped
- **/
-static void i40e_dbg_dump_stats(struct i40e_pf *pf,
-				struct i40e_hw_port_stats *stats)
-{
-	int i;
-
-	dev_info(&pf->pdev->dev, "  stats:\n");
-	dev_info(&pf->pdev->dev,
-		 "    crc_errors = \t\t%lld \tillegal_bytes = \t%lld \terror_bytes = \t\t%lld\n",
-		 stats->crc_errors, stats->illegal_bytes, stats->error_bytes);
-	dev_info(&pf->pdev->dev,
-		 "    mac_local_faults = \t%lld \tmac_remote_faults = \t%lld \trx_length_errors = \t%lld\n",
-		 stats->mac_local_faults, stats->mac_remote_faults,
-		 stats->rx_length_errors);
-	dev_info(&pf->pdev->dev,
-		 "    link_xon_rx = \t\t%lld \tlink_xoff_rx = \t\t%lld \tlink_xon_tx = \t\t%lld\n",
-		 stats->link_xon_rx, stats->link_xoff_rx, stats->link_xon_tx);
-	dev_info(&pf->pdev->dev,
-		 "    link_xoff_tx = \t\t%lld \trx_size_64 = \t\t%lld \trx_size_127 = \t\t%lld\n",
-		 stats->link_xoff_tx, stats->rx_size_64, stats->rx_size_127);
-	dev_info(&pf->pdev->dev,
-		 "    rx_size_255 = \t\t%lld \trx_size_511 = \t\t%lld \trx_size_1023 = \t\t%lld\n",
-		 stats->rx_size_255, stats->rx_size_511, stats->rx_size_1023);
-	dev_info(&pf->pdev->dev,
-		 "    rx_size_big = \t\t%lld \trx_undersize = \t\t%lld \trx_jabber = \t\t%lld\n",
-		 stats->rx_size_big, stats->rx_undersize, stats->rx_jabber);
-	dev_info(&pf->pdev->dev,
-		 "    rx_fragments = \t\t%lld \trx_oversize = \t\t%lld \ttx_size_64 = \t\t%lld\n",
-		 stats->rx_fragments, stats->rx_oversize, stats->tx_size_64);
-	dev_info(&pf->pdev->dev,
-		 "    tx_size_127 = \t\t%lld \ttx_size_255 = \t\t%lld \ttx_size_511 = \t\t%lld\n",
-		 stats->tx_size_127, stats->tx_size_255, stats->tx_size_511);
-	dev_info(&pf->pdev->dev,
-		 "    tx_size_1023 = \t\t%lld \ttx_size_big = \t\t%lld \tmac_short_packet_dropped = \t%lld\n",
-		 stats->tx_size_1023, stats->tx_size_big,
-		 stats->mac_short_packet_dropped);
-	for (i = 0; i < 8; i += 4) {
-		dev_info(&pf->pdev->dev,
-			 "    priority_xon_rx[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld\n",
-			 i, stats->priority_xon_rx[i],
-			 i+1, stats->priority_xon_rx[i+1],
-			 i+2, stats->priority_xon_rx[i+2],
-			 i+3, stats->priority_xon_rx[i+3]);
-	}
-	for (i = 0; i < 8; i += 4) {
-		dev_info(&pf->pdev->dev,
-			 "    priority_xoff_rx[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld\n",
-			 i, stats->priority_xoff_rx[i],
-			 i+1, stats->priority_xoff_rx[i+1],
-			 i+2, stats->priority_xoff_rx[i+2],
-			 i+3, stats->priority_xoff_rx[i+3]);
-	}
-	for (i = 0; i < 8; i += 4) {
-		dev_info(&pf->pdev->dev,
-			 "    priority_xon_tx[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld\n",
-			 i, stats->priority_xon_tx[i],
-			 i+1, stats->priority_xon_tx[i+1],
-			 i+2, stats->priority_xon_tx[i+2],
-			 i+3, stats->priority_xon_rx[i+3]);
-	}
-	for (i = 0; i < 8; i += 4) {
-		dev_info(&pf->pdev->dev,
-			 "    priority_xoff_tx[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld\n",
-			 i, stats->priority_xoff_tx[i],
-			 i+1, stats->priority_xoff_tx[i+1],
-			 i+2, stats->priority_xoff_tx[i+2],
-			 i+3, stats->priority_xoff_tx[i+3]);
-	}
-	for (i = 0; i < 8; i += 4) {
-		dev_info(&pf->pdev->dev,
-			 "    priority_xon_2_xoff[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld \t[%d] = \t%lld\n",
-			 i, stats->priority_xon_2_xoff[i],
-			 i+1, stats->priority_xon_2_xoff[i+1],
-			 i+2, stats->priority_xon_2_xoff[i+2],
-			 i+3, stats->priority_xon_2_xoff[i+3]);
-	}
-
-	i40e_dbg_dump_eth_stats(pf, &stats->eth);
-}
-
-/**
  * i40e_dbg_dump_veb_seid - handles dump stats of a single given veb
  * @pf: the i40e_pf created in command write
  * @seid: the seid the user put in
@@ -1342,11 +1258,6 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
 					 "dump desc rx <vsi_seid> <ring_id> [<desc_n>]\n");
 				dev_info(&pf->pdev->dev, "dump desc aq\n");
 			}
-		} else if (strncmp(&cmd_buf[5], "stats", 5) == 0) {
-			dev_info(&pf->pdev->dev, "pf stats:\n");
-			i40e_dbg_dump_stats(pf, &pf->stats);
-			dev_info(&pf->pdev->dev, "pf stats_offsets:\n");
-			i40e_dbg_dump_stats(pf, &pf->stats_offsets);
 		} else if (strncmp(&cmd_buf[5], "reset stats", 11) == 0) {
 			dev_info(&pf->pdev->dev,
 				 "core reset count: %d\n", pf->corer_count);
@@ -1464,8 +1375,8 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
 		} else {
 			dev_info(&pf->pdev->dev,
 				 "dump desc tx <vsi_seid> <ring_id> [<desc_n>], dump desc rx <vsi_seid> <ring_id> [<desc_n>],\n");
-			dev_info(&pf->pdev->dev, "dump switch, dump vsi [seid] or\n");
-			dev_info(&pf->pdev->dev, "dump stats\n");
+			dev_info(&pf->pdev->dev, "dump switch\n");
+			dev_info(&pf->pdev->dev, "dump vsi [seid]\n");
 			dev_info(&pf->pdev->dev, "dump reset stats\n");
 			dev_info(&pf->pdev->dev, "dump port\n");
 			dev_info(&pf->pdev->dev,
-- 
1.9.3

* [net-next 07/11] i40e: scale msix vector use when more cores than vectors
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@intel.com>

When there are more cores than vectors available to the PF, scale back
the LAN msix usage to force queue/vector sharing and leave some vectors
for Flow Director, VMDq, etc.

Change-ID: Ie0317732eb85ad8d851d7da7d9af86b1bf8c21ad
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index f95c04a..83fee7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6699,6 +6699,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 {
 	i40e_status err = 0;
 	struct i40e_hw *hw = &pf->hw;
+	int other_vecs = 0;
 	int v_budget, i;
 	int vec;
 
@@ -6724,10 +6725,10 @@ static int i40e_init_msix(struct i40e_pf *pf)
 	 */
 	pf->num_lan_msix = pf->num_lan_qps - (pf->rss_size_max - pf->rss_size);
 	pf->num_vmdq_msix = pf->num_vmdq_qps;
-	v_budget = 1 + pf->num_lan_msix;
-	v_budget += (pf->num_vmdq_vsis * pf->num_vmdq_msix);
+	other_vecs = 1;
+	other_vecs += (pf->num_vmdq_vsis * pf->num_vmdq_msix);
 	if (pf->flags & I40E_FLAG_FD_SB_ENABLED)
-		v_budget++;
+		other_vecs++;
 
 #ifdef I40E_FCOE
 	if (pf->flags & I40E_FLAG_FCOE_ENABLED) {
@@ -6737,7 +6738,9 @@ static int i40e_init_msix(struct i40e_pf *pf)
 
 #endif
 	/* Scale down if necessary, and the rings will share vectors */
-	v_budget = min_t(int, v_budget, hw->func_caps.num_msix_vectors);
+	pf->num_lan_msix = min_t(int, pf->num_lan_msix,
+			(hw->func_caps.num_msix_vectors - other_vecs));
+	v_budget = pf->num_lan_msix + other_vecs;
 
 	pf->msix_entries = kcalloc(v_budget, sizeof(struct msix_entry),
 				   GFP_KERNEL);
-- 
1.9.3
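
The arithmetic in the last hunk can be read in isolation. This is a sketch of the clamping step only, with hypothetical names (`scale_vector_budget()`, `lan_wanted`, `hw_vectors`); it is not the driver function and leaves out the FCoE accounting.

```c
#include <assert.h>

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* lan_wanted: vectors the LAN queues would like to have
 * other_vecs: vectors set aside for everything else (misc, FD, VMDq)
 * hw_vectors: MSI-X vectors the function actually reports
 *
 * Stores the clamped LAN count in *lan_out and returns the total budget.
 */
static int scale_vector_budget(int lan_wanted, int other_vecs,
			       int hw_vectors, int *lan_out)
{
	int lan = min_int(lan_wanted, hw_vectors - other_vecs);

	*lan_out = lan;
	return lan + other_vecs;
}
```

With 64 LAN vectors wanted, 10 reserved and 32 available, the LAN share drops to 22 and the budget is 32, so only the LAN queues end up sharing vectors. Like the hunk, the sketch assumes the hardware reports more vectors than are reserved for other uses.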

* [net-next 09/11] i40e: enable debug earlier
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@intel.com>

Check the debug module parameter earlier to be able to catch the early
configuration phase adminq messages.

Change-ID: Ic84fabd72393489bbf96042de770790a80fd8468
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6a481bf..ea62267 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9023,6 +9023,11 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hw->bus.func = PCI_FUNC(pdev->devfn);
 	pf->instance = pfs_found;
 
+	if (debug != -1) {
+		pf->msg_enable = pf->hw.debug_mask;
+		pf->msg_enable = debug;
+	}
+
 	/* do a special CORER for clearing PXE mode once at init */
 	if (hw->revision_id == 0 &&
 	    (rd32(hw, I40E_GLLAN_RCTL_0) & I40E_GLLAN_RCTL_0_PXE_MODE_MASK)) {
-- 
1.9.3

* [net-next 11/11] i40e: properly parse MDET registers
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mitch Williams <mitch.a.williams@intel.com>

Fix a few problems with our parsing of the MDET registers:
* Queue IDs are longer than 8 bits
* Queue IDs are absolute for the device and the base queue must be
  subtracted out.
* VF IDs are longer than 8 bits
* Use the MASK define to mask the event value, instead of the SHIFT
  define.

Change-ID: I3dc7237f480c02e1192a2a8ea782f8a02ab2a8b7
Reported-by: Marc Neustadter <marc.neustadter@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 1b0c437..1a98e23 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6174,12 +6174,13 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
 	if (reg & I40E_GL_MDET_TX_VALID_MASK) {
 		u8 pf_num = (reg & I40E_GL_MDET_TX_PF_NUM_MASK) >>
 				I40E_GL_MDET_TX_PF_NUM_SHIFT;
-		u8 vf_num = (reg & I40E_GL_MDET_TX_VF_NUM_MASK) >>
+		u16 vf_num = (reg & I40E_GL_MDET_TX_VF_NUM_MASK) >>
 				I40E_GL_MDET_TX_VF_NUM_SHIFT;
 		u8 event = (reg & I40E_GL_MDET_TX_EVENT_MASK) >>
 				I40E_GL_MDET_TX_EVENT_SHIFT;
-		u8 queue = (reg & I40E_GL_MDET_TX_QUEUE_MASK) >>
-				I40E_GL_MDET_TX_QUEUE_SHIFT;
+		u16 queue = ((reg & I40E_GL_MDET_TX_QUEUE_MASK) >>
+				I40E_GL_MDET_TX_QUEUE_SHIFT) -
+				pf->hw.func_caps.base_queue;
 		if (netif_msg_tx_err(pf))
 			dev_info(&pf->pdev->dev, "Malicious Driver Detection event 0x%02x on TX queue %d pf number 0x%02x vf number 0x%02x\n",
 				 event, queue, pf_num, vf_num);
@@ -6192,8 +6193,9 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
 				I40E_GL_MDET_RX_FUNCTION_SHIFT;
 		u8 event = (reg & I40E_GL_MDET_RX_EVENT_MASK) >>
 				I40E_GL_MDET_RX_EVENT_SHIFT;
-		u8 queue = (reg & I40E_GL_MDET_RX_QUEUE_MASK) >>
-				I40E_GL_MDET_RX_QUEUE_SHIFT;
+		u16 queue = ((reg & I40E_GL_MDET_RX_QUEUE_MASK) >>
+				I40E_GL_MDET_RX_QUEUE_SHIFT) -
+				pf->hw.func_caps.base_queue;
 		if (netif_msg_rx_err(pf))
 			dev_info(&pf->pdev->dev, "Malicious Driver Detection event 0x%02x on RX queue %d of function 0x%02x\n",
 				 event, queue, func);
-- 
1.9.3
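
To see why the narrower type and the missing base subtraction matter, here is a small sketch of mask-then-shift field extraction. The register layout below is invented for the example (the `DEMO_*` macros are not the real I40E_GL_MDET_* definitions), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Invented layout for illustration only:
 * bits 0..11 hold an absolute queue index, bits 12..19 an event code.
 */
#define DEMO_QUEUE_SHIFT	0
#define DEMO_QUEUE_MASK		(0xFFFu << DEMO_QUEUE_SHIFT)
#define DEMO_EVENT_SHIFT	12
#define DEMO_EVENT_MASK		(0xFFu << DEMO_EVENT_SHIFT)

/* Mask first, then shift, then convert the device-absolute queue index
 * into one relative to this function's first queue.
 */
static uint16_t demo_queue(uint32_t reg, uint16_t base_queue)
{
	uint16_t abs_queue = (uint16_t)((reg & DEMO_QUEUE_MASK) >>
					DEMO_QUEUE_SHIFT);

	return (uint16_t)(abs_queue - base_queue);
}

/* The same field stored in 8 bits: the upper bits are silently lost. */
static uint8_t demo_queue_u8(uint32_t reg)
{
	return (uint8_t)((reg & DEMO_QUEUE_MASK) >> DEMO_QUEUE_SHIFT);
}

static uint8_t demo_event(uint32_t reg)
{
	return (uint8_t)((reg & DEMO_EVENT_MASK) >> DEMO_EVENT_SHIFT);
}
```

For a register holding event 5 and absolute queue 0x234, with a base queue of 0x1F0, the 16-bit path yields queue 68 while the 8-bit path yields 0x34, which names neither the absolute nor the relative queue.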

* [net-next 10/11] i40e: configure VM ID in qtx_ctl
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
  To: davem
  Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Patrick Lu,
	Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Mitch Williams <mitch.a.williams@intel.com>

We must insert the VSI ID in the QTX_CTL register when
configuring queues for VMDQ VSIs.

Change-ID: Iedfe36bd42ca0adc90a7cc2b7cf04795a98f4761
Reported-by: Marc Neustadter <marc.neustadter@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ea62267..1b0c437 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2462,10 +2462,14 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 	}
 
 	/* Now associate this queue with this PCI function */
-	if (vsi->type == I40E_VSI_VMDQ2)
+	if (vsi->type == I40E_VSI_VMDQ2) {
 		qtx_ctl = I40E_QTX_CTL_VM_QUEUE;
-	else
+		qtx_ctl |= ((vsi->id) << I40E_QTX_CTL_VFVM_INDX_SHIFT) &
+			   I40E_QTX_CTL_VFVM_INDX_MASK;
+	} else {
 		qtx_ctl = I40E_QTX_CTL_PF_QUEUE;
+	}
+
 	qtx_ctl |= ((hw->pf_id << I40E_QTX_CTL_PF_INDX_SHIFT) &
 		    I40E_QTX_CTL_PF_INDX_MASK);
 	wr32(hw, I40E_QTX_CTL(pf_q), qtx_ctl);
-- 
1.9.3

* Re: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Daniel Borkmann @ 2014-11-03 15:27 UTC (permalink / raw)
  To: David Laight; +Cc: 'Florian Westphal', netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9E44D6@AcuExch.aculab.com>

On 11/03/2014 03:41 PM, David Laight wrote:
...
>>>> +/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
>>>> + * stores TCP options:
>>>> + *
>>>> + * MSB                               LSB
>>>> + * | 31 ...   6 |  5  |  4   | 3 2 1 0 |
>>>> + * |  Timestamp | ECN | SACK | WScale  |
>>>> + *
>>>> + * When we receive a valid cookie-ACK, we look at the echoed tsval (if
>>>> + * any) to figure out which TCP options we should use for the rebuilt
>>>> + * connection.
>>>> + *
>>>> + * A WScale setting of '0xf' (which is an invalid scaling value)
>>>> + * means that original syn did not include the TCP window scaling option.
>>>> + */
>>>> +#define TS_OPT_WSCALE_MASK	0xf
>>>> +#define TS_OPT_SACK		BIT(4)
>>>> +#define TS_OPT_ECN		BIT(5)
>>>> +/* There is no TS_OPT_TIMESTAMP:
>>>> + * if ACK contains timestamp option, we already know it was
>>>> + * requested/supported by the syn/synack exchange.
>>>> + */
>>>> +#define TSBITS	6
>>>> +#define TSMASK	(((__u32)1 << TSBITS) - 1)
>>>
>>> Personally I'd define all the values as hex constants instead of mixing
>>> and matching the defines.
>>>
>>> So probably just:
>>> #define TS_OPT_WSCALE_MASK	0x0f
>>> #define TS_OPT_SACK		0x10
>>> #define TS_OPT_ECN		0x20
>>> #define TSMASK                0x3f
>>
>> If you look at the above comment and then take a peek at the actual TS_OPT_*,
>> it is much easier to follow. Moreover, how is having TSMASK as 0x3f better?!
>> Currently, it is a constant calculated based upon TSBITS.
>
> TSMASK is also (TS_OPT_WSCALE_MASK | TS_OPT_SACK | TS_OPT_ECN); defining
> the values in hex makes this even more clear.

Right, that's your personal taste. ;) Besides, the definition of TSBITS/TSMASK
itself is not even altered here.

> Defining TSBITS from the mask would save the extra definition - which might
> be easier done by replacing the shifts with multiply/divide by (TSMASK + 1)
> (probably in a #define/inline function to make the code easier to read).

Sure, let's make it more complicated than it actually needs to be ... again,
I think the code is fine as is, sorry.
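
For readers following the encoding under discussion, a user-space sketch of how the low six timestamp bits decode is given below. The macro values are the ones quoted above, with BIT(n) spelled out as a shift; `struct ts_opts` and `decode_ts_opts()` are hypothetical names for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_WSCALE_MASK	0xfu
#define TS_OPT_SACK		(1u << 4)	/* BIT(4) in the quoted patch */
#define TS_OPT_ECN		(1u << 5)	/* BIT(5) in the quoted patch */
#define TSBITS			6
#define TSMASK			(((uint32_t)1 << TSBITS) - 1)

/* Hypothetical container for the decoded option bits. */
struct ts_opts {
	bool wscale_ok;		/* false when the SYN had no window scaling */
	uint8_t wscale;
	bool sack_ok;
	bool ecn_ok;
};

static struct ts_opts decode_ts_opts(uint32_t tsecr)
{
	uint32_t bits = tsecr & TSMASK;
	struct ts_opts o;

	o.wscale = (uint8_t)(bits & TS_OPT_WSCALE_MASK);
	o.wscale_ok = o.wscale != 0xf;	/* 0xf marks "no wscale option" */
	o.sack_ok = (bits & TS_OPT_SACK) != 0;
	o.ecn_ok = (bits & TS_OPT_ECN) != 0;
	return o;
}
```

Both spellings discussed in the thread describe the same bits: TSMASK computed from TSBITS is 0x3f, which equals the OR of the three option masks.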

* Re: [PATCH] ipv4: avoid divide 0 error in tcp_incr_quickack
From: Eric Dumazet @ 2014-11-03 15:30 UTC (permalink / raw)
  To: chenweilong, Yuchung Cheng, Neal Cardwell
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev, linux-kernel
In-Reply-To: <54571335.2060709@huawei.com>

On Mon, 2014-11-03 at 13:31 +0800, chenweilong wrote:

> Hi Eric,
> 
> I checked the code and found that:
> 
> 1. In function "tcp_rcv_state_process",
> the "tcp_initialize_rcv_mss" is called at "step 5: check the ACK field" when the sk->sk_state is TCP_SYN_RECV
> and there is a "tcp_validate_incoming" just before it.
> So when we call "tcp_validate_incoming", the rcv_mss may not have been initialized.
> 
> 2. In function "tcp_validate_incoming",
> the "Step 1: check sequence number", according to RFC793 page 69,
> If an incoming segment is not acceptable, an acknowledgment should be sent in reply (unless the RST
> bit is set, if so drop the segment and return).
> So we may call "tcp_send_dupack" while the rcv_mss hasn't been initialized.
> 
> 3. In function "tcp_send_dupack",
> when the condition is suitable, it'll enter quick ack mode. Notice it only checks the seq!
> So I think adding another state check should be OK.
> 
> Any suggestions?
> 

You did find what the immediate conditions for the crash (rcv_mss = 0,
state = TCP_SYN_RECV) were.

Your patch avoids the zero divide, but leaves other issues. rcv_mss = 0
here is a sign some logic is wrong in the stack.

Given this potential zero divide has been there for years, I believe we
should take the time for a more complete fix, instead of papering over
the immediate problem.

We have been working with Neal to reproduce the issue with packetdrill,
we'll post our results when we manage to get our first crash ;)

Thanks !

* Re: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Eric Dumazet @ 2014-11-03 15:41 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-2-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:01 +0100, Florian Westphal wrote:
> Was a bit more difficult to read than needed due to magic shifts;
> add defines and document the used encoding scheme.
> 
> Joint work with Daniel Borkmann.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

* Re: [PATCH] PPC: bpf_jit_comp: add SKF_AD_PKTTYPE instruction
From: Philippe Bergheaud @ 2014-11-03 15:45 UTC (permalink / raw)
  To: Denis Kirjanov; +Cc: linuxppc-dev, netdev, Matt Evans, mpe
In-Reply-To: <CAOJe8K0t3G-bHm_24GjrTp9mmKnYZS1_bdGgrwQBLjC_s5is6w@mail.gmail.com>

Denis Kirjanov wrote:
> Any feedback from PPC folks?

I have reviewed the patch and it looks fine to me.
I have tested it successfully on ppc64le.
I could not test it on ppc64.

Philippe

> On 10/26/14, Denis Kirjanov <kda@linux-powerpc.org> wrote:
> 
>>Cc: Matt Evans <matt@ozlabs.org>
>>Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
>>---
>> arch/powerpc/include/asm/ppc-opcode.h | 1 +
>> arch/powerpc/net/bpf_jit.h            | 7 +++++++
>> arch/powerpc/net/bpf_jit_comp.c       | 5 +++++
>> 3 files changed, 13 insertions(+)
>>
>>diff --git a/arch/powerpc/include/asm/ppc-opcode.h
>>b/arch/powerpc/include/asm/ppc-opcode.h
>>index 6f85362..1a52877 100644
>>--- a/arch/powerpc/include/asm/ppc-opcode.h
>>+++ b/arch/powerpc/include/asm/ppc-opcode.h
>>@@ -204,6 +204,7 @@
>> #define PPC_INST_ERATSX_DOT		0x7c000127
>>
>> /* Misc instructions for BPF compiler */
>>+#define PPC_INST_LBZ			0x88000000
>> #define PPC_INST_LD			0xe8000000
>> #define PPC_INST_LHZ			0xa0000000
>> #define PPC_INST_LHBRX			0x7c00062c
>>diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
>>index 9aee27c..c406aa9 100644
>>--- a/arch/powerpc/net/bpf_jit.h
>>+++ b/arch/powerpc/net/bpf_jit.h
>>@@ -87,6 +87,9 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
>> #define PPC_STD(r, base, i)	EMIT(PPC_INST_STD | ___PPC_RS(r) |	      \
>> 				     ___PPC_RA(base) | ((i) & 0xfffc))
>>
>>+
>>+#define PPC_LBZ(r, base, i)	EMIT(PPC_INST_LBZ | ___PPC_RT(r) |	      \
>>+				     ___PPC_RA(base) | IMM_L(i))
>> #define PPC_LD(r, base, i)	EMIT(PPC_INST_LD | ___PPC_RT(r) |	      \
>> 				     ___PPC_RA(base) | IMM_L(i))
>> #define PPC_LWZ(r, base, i)	EMIT(PPC_INST_LWZ | ___PPC_RT(r) |	      \
>>@@ -96,6 +99,10 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
>> #define PPC_LHBRX(r, base, b)	EMIT(PPC_INST_LHBRX | ___PPC_RT(r) |	      \
>> 				     ___PPC_RA(base) | ___PPC_RB(b))
>> /* Convenience helpers for the above with 'far' offsets: */
>>+#define PPC_LBZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LBZ(r, base, i);
>>  \
>>+		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
>>+			PPC_LBZ(r, r, IMM_L(i)); } } while(0)
>>+
>> #define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i);
>>  \
>> 		else {	PPC_ADDIS(r, base, IMM_HA(i));			      \
>> 			PPC_LD(r, r, IMM_L(i)); } } while(0)
>>diff --git a/arch/powerpc/net/bpf_jit_comp.c
>>b/arch/powerpc/net/bpf_jit_comp.c
>>index cbae2df..d110e28 100644
>>--- a/arch/powerpc/net/bpf_jit_comp.c
>>+++ b/arch/powerpc/net/bpf_jit_comp.c
>>@@ -407,6 +407,11 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32
>>*image,
>> 			PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
>> 							  queue_mapping));
>> 			break;
>>+		case BPF_ANC | SKF_AD_PKTTYPE:
>>+			PPC_LBZ_OFFS(r_A, r_skb, PKT_TYPE_OFFSET());
>>+			PPC_ANDI(r_A, r_A, PKT_TYPE_MAX);
>>+			PPC_SRWI(r_A, r_A, 5);
>>+			break;
>> 		case BPF_ANC | SKF_AD_CPU:
>> #ifdef CONFIG_SMP
>> 			/*
>>--
>>2.1.0
>>
>>

* Re: [PATCH -next v3 2/3] syncookies: split cookie_check_timestamp() into two functions
From: Eric Dumazet @ 2014-11-03 16:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-3-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:01 +0100, Florian Westphal wrote:
> The function cookie_check_timestamp(), both called from IPv4/6 context,
> is being used to decode the echoed timestamp from the SYN/ACK into TCP
> options used for follow-up communication with the peer.
> 
> We can remove ECN handling from that function, split it into a separate
> one, and simply rename the original function into cookie_decode_options().
> cookie_decode_options() just fills in tcp_option struct based on the
> echoed timestamp received from the peer. Anything that fails in this
> function will actually discard the request socket.
> 
> While this is the natural place for decoding options such as ECN which
> commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
> that in particular for ECN handling, it can be checked at a later point
> in time as the request sock would actually not need to be dropped from
> this, but just ECN support turned off.
> 
> Therefore, we split this functionality into cookie_ecn_ok(), which tells
> us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
> 
> This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
> won't be enough anymore at that point; if the timestamp indicates ECN
> and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
> 
> This would mean adding a route lookup to cookie_check_timestamp(), which
> we definitely want to avoid. As we already do a route lookup at a later
> point in cookie_{v4,v6}_check(), we can simply make use of that as well
> for the new cookie_ecn_ok() function w/o any additional cost.
> 
> Joint work with Daniel Borkmann.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH -next v3 3/3] net: allow setting ecn via routing table
From: Eric Dumazet @ 2014-11-03 16:11 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Daniel Borkmann
In-Reply-To: <1415019720-17106-4-git-send-email-fw@strlen.de>

On Mon, 2014-11-03 at 14:02 +0100, Florian Westphal wrote:
> This patch allows to set ECN on a per-route basis in case the sysctl
> tcp_ecn is not set to 1. In other words, when ECN is set for specific
> routes, it provides a tcp_ecn=1 behaviour for that route while the rest
> of the stack acts according to the global settings.
> 
> One can use 'ip route change dev $dev $net features ecn' to toggle this.
> 
> Having a more fine-grained per-route setting can be beneficial for various
> reasons, for example, 1) within data centers, or 2) local ISPs may deploy
> ECN support for their own video/streaming services [1], etc.
> 
> There was a recent measurement study/paper [2] which scanned the Alexa's
> publicly available top million websites list from a vantage point in US,
> Europe and Asia:
> 
> Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
> blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
> only ECN") ;)); the break in connectivity on-path was found is about
> 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
> more common in the negotiation phase (and mostly seen in the Alexa
> middle band, ranks around 50k-150k): from 12-thousand hosts on which
> there _may_ be ECN-linked connection failures, only 79 failed with RST
> when _not_ failing with RST when ECN is not requested.
> 
> It's unclear though, how much equipment in the wild actually marks CE
> when buffers start to fill up.
> 
> We thought about a fallback to non-ECN for retransmitted SYNs as another
> global option (which could perhaps one day be made default), but as Eric
> points out, there's much more work needed to detect broken middleboxes.
> 
> Two examples Eric mentioned are buggy firewalls that accept only a single
> SYN per flow, and middleboxes that successfully let an ECN flow establish,
> but later mark CE for all packets (so cwnd converges to 1).
> 
>  [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
>  [2] http://ecn.ethz.ch/
> 
> Joint work with Daniel Borkmann.
> 
> Reference: http://thread.gmane.org/gmane.linux.network/335797
> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  Changes since v2:
>   - alter new cookie_ecn_ok() ok function to also evaluate
>   RTAX_FEATURE_ECN if timestamp-ecn-bit is set and the ecn sysctl
>   is off.
>   - extra blank line in tcp_output.c to appease checkpatch.pl
> 
>  include/net/tcp.h     |  2 +-
>  net/ipv4/syncookies.c |  6 +++---
>  net/ipv4/tcp_input.c  | 25 +++++++++++++++----------
>  net/ipv4/tcp_output.c | 13 +++++++++++--
>  net/ipv6/syncookies.c |  2 +-
>  5 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 36c5084..f50f29faf 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -493,7 +493,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
>  __u32 cookie_init_timestamp(struct request_sock *req);
>  bool cookie_timestamp_decode(struct tcp_options_received *opt);
>  bool cookie_ecn_ok(const struct tcp_options_received *opt,
> -		   const struct net *net);
> +		   const struct net *net, const struct dst_entry *dst);
>  
>  /* From net/ipv6/syncookies.c */
>  int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 6de7725..45fe60c 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -273,7 +273,7 @@ bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
>  EXPORT_SYMBOL(cookie_timestamp_decode);
>  
>  bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
> -		   const struct net *net)
> +		   const struct net *net, const struct dst_entry *dst)
>  {
>  	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
>  
> @@ -283,7 +283,7 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
>  	if (net->ipv4.sysctl_tcp_ecn)
>  		return true;
>  
> -	return false;
> +	return dst_feature(dst, RTAX_FEATURE_ECN);
>  }
>  EXPORT_SYMBOL(cookie_ecn_ok);
>  
> @@ -387,7 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>  				  dst_metric(&rt->dst, RTAX_INITRWND));
>  
>  	ireq->rcv_wscale  = rcv_wscale;
> -	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
> +	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
>  
>  	ret = get_cookie_sock(sk, skb, req, &rt->dst);
>  	/* ip_queue_xmit() depends on our flow being setup
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 4e4617e..9db942a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5876,20 +5876,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
>   */
>  static void tcp_ecn_create_request(struct request_sock *req,
>  				   const struct sk_buff *skb,
> -				   const struct sock *listen_sk)
> +				   const struct sock *listen_sk,
> +				   struct dst_entry *dst)

Nit : This probably should be 'const struct dst_entry *dst'

Otherwise, patch looks fine, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net-next] net: add rbnode to struct sk_buff
From: Eric Dumazet @ 2014-11-03 16:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yaogong Wang

From: Eric Dumazet <edumazet@google.com>

Yaogong replaces the TCP out-of-order receive queue with an RB tree.

As netem already does a private skb->{next/prev/tstamp} union
with a 'struct rb_node', let's do this in a cleaner way.
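
To illustrate the layout idea, here is a minimal user-space sketch. It is
not the kernel code: struct rb_node_model and struct skb_model are
hypothetical, heavily trimmed stand-ins for struct rb_node and
struct sk_buff, and model_container_of() only mimics the kernel's
container_of() without its type checking. It shows that once the rb node
is a named union member, the owning buffer can be recovered from the node
address without casting a pointer to the first list field.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct rb_node (three words). */
struct rb_node_model {
	uintptr_t parent_color;
	struct rb_node_model *right;
	struct rb_node_model *left;
};

/* Hypothetical, trimmed stand-in for struct sk_buff. */
struct skb_model {
	union {
		struct {
			/* list linkage and timestamp, as in the patch */
			struct skb_model *next;
			struct skb_model *prev;
			int64_t tstamp;
		};
		/* shares storage with next/prev/tstamp */
		struct rb_node_model rbnode;
	};
	int payload;
};

/* Same idea as container_of(): step back from a member to its container. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Counterpart of netem_rb_to_skb() after the patch. */
static struct skb_model *rb_to_skb(struct rb_node_model *rb)
{
	return model_container_of(rb, struct skb_model, rbnode);
}
```

Because the node and the list fields overlap, a user of rbnode has to save
any value it still needs from tstamp first, which is what netem does with
tstamp_save in its control block.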

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
---
 include/linux/skbuff.h |   20 +++++++++++++-------
 net/sched/sch_netem.c  |   27 +++++++--------------------
 2 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6c8b6f604e7609f2d320a7aa572a8ec6a870f4b8..5ad9675b6fe17b2819476c181a4707780a2e03f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -20,6 +20,7 @@
 #include <linux/time.h>
 #include <linux/bug.h>
 #include <linux/cache.h>
+#include <linux/rbtree.h>
 
 #include <linux/atomic.h>
 #include <asm/types.h>
@@ -440,6 +441,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@next: Next buffer in list
  *	@prev: Previous buffer in list
  *	@tstamp: Time we arrived/left
+ *	@rbnode: RB tree node, alternative to next/prev for netem/tcp
  *	@sk: Socket we are owned by
  *	@dev: Device we arrived on/are leaving by
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
@@ -504,15 +506,19 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  */
 
 struct sk_buff {
-	/* These two members must be first. */
-	struct sk_buff		*next;
-	struct sk_buff		*prev;
-
 	union {
-		ktime_t		tstamp;
-		struct skb_mstamp skb_mstamp;
+		struct {
+			/* These two members must be first. */
+			struct sk_buff		*next;
+			struct sk_buff		*prev;
+
+			union {
+				ktime_t		tstamp;
+				struct skb_mstamp skb_mstamp;
+			};
+		};
+		struct rb_node	rbnode; /* used in netem & tcp stack */
 	};
-
 	struct sock		*sk;
 	struct net_device	*dev;
 
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index b34331967e020b6f1151b25be8f744e494a80ad6..179f1c8c0d8bba4aa00705e6c9ca41ef909f7d50 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -139,33 +139,20 @@ struct netem_sched_data {
 
 /* Time stamp put into socket buffer control block
  * Only valid when skbs are in our internal t(ime)fifo queue.
+ *
+ * As skb->rbnode uses same storage than skb->next, skb->prev and skb->tstamp,
+ * and skb->next & skb->prev are scratch space for a qdisc,
+ * we save skb->tstamp value in skb->cb[] before destroying it.
  */
 struct netem_skb_cb {
 	psched_time_t	time_to_send;
 	ktime_t		tstamp_save;
 };
 
-/* Because space in skb->cb[] is tight, netem overloads skb->next/prev/tstamp
- * to hold a rb_node structure.
- *
- * If struct sk_buff layout is changed, the following checks will complain.
- */
-static struct rb_node *netem_rb_node(struct sk_buff *skb)
-{
-	BUILD_BUG_ON(offsetof(struct sk_buff, next) != 0);
-	BUILD_BUG_ON(offsetof(struct sk_buff, prev) !=
-		     offsetof(struct sk_buff, next) + sizeof(skb->next));
-	BUILD_BUG_ON(offsetof(struct sk_buff, tstamp) !=
-		     offsetof(struct sk_buff, prev) + sizeof(skb->prev));
-	BUILD_BUG_ON(sizeof(struct rb_node) > sizeof(skb->next) +
-					      sizeof(skb->prev) +
-					      sizeof(skb->tstamp));
-	return (struct rb_node *)&skb->next;
-}
 
 static struct sk_buff *netem_rb_to_skb(struct rb_node *rb)
 {
-	return (struct sk_buff *)rb;
+	return container_of(rb, struct sk_buff, rbnode);
 }
 
 static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
@@ -403,8 +390,8 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 		else
 			p = &parent->rb_left;
 	}
-	rb_link_node(netem_rb_node(nskb), parent, p);
-	rb_insert_color(netem_rb_node(nskb), &q->t_root);
+	rb_link_node(&nskb->rbnode, parent, p);
+	rb_insert_color(&nskb->rbnode, &q->t_root);
 	sch->q.qlen++;
 }
 

^ permalink raw reply related

* Re: [PATCH 1/7] can: m_can: fix possible sleep in napi poll
From: Marc Kleine-Budde @ 2014-11-03 16:24 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-1-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 500 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> The m_can_get_berr_counter function can sleep and it may be called in napi poll function.
> Rework it to fix the following warning.

Applied to can/master.

Thanks, Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 2/7] can: m_can: fix the incorrect error messages
From: Marc Kleine-Budde @ 2014-11-03 16:25 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel
In-Reply-To: <1414579527-31100-2-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> Fix a few error messages.
> 
> Signed-off-by: Dong Aisheng <b29396@freescale.com>

Applied to can/master.

Thanks,
Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 5/7] can: clear ctrlmode when close candev
From: Marc Kleine-Budde @ 2014-11-03 16:28 UTC (permalink / raw)
  To: Dong Aisheng, linux-can
  Cc: wg, varkabhadram, netdev, socketcan, linux-arm-kernel,
	Oliver Hartkopp
In-Reply-To: <1414579527-31100-5-git-send-email-b29396@freescale.com>

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

On 10/29/2014 11:45 AM, Dong Aisheng wrote:
> Currently priv->ctrlmode is not cleared when close_candev, so next time
> the driver will still use this value to set controller even user
> does not set any ctrl mode.
> e.g.
> Step 1. ip link set can0 up type can0 bitrate 1000000 loopback on
> Controller will be in loopback mode
> Step 2. ip link set can0 down
> Step 3. ip link set can0 up type can0 bitrate 1000000
> Controller will still be set to loopback mode in driver due to saved
> priv->ctrlmode.
> 
> This patch clears priv->ctrlmode when the CAN interface is closed,
> and set it to correct mode according to next user setting.

Oliver, what do you think of this patch? It will introduce a subtle
change to the userspace.

Marc

> Signed-off-by: Dong Aisheng <b29396@freescale.com>
> ---
>  drivers/net/can/dev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
> index 02492d2..1fce485 100644
> --- a/drivers/net/can/dev.c
> +++ b/drivers/net/can/dev.c
> @@ -671,6 +671,7 @@ void close_candev(struct net_device *dev)
>  
>  	del_timer_sync(&priv->restart_timer);
>  	can_flush_echo_skb(dev);
> +	priv->ctrlmode = 0;
>  }
>  EXPORT_SYMBOL_GPL(close_candev);
>  
> 


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [PATCH -next v4 0/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev

Here is v4 of the patchset; it's exactly the same as v3 except in patch 3/3,
where I added the missing 'const' qualifier to a function argument that
Eric spotted during review.

I preserved Eric's Acks so that he doesn't have to resend them.

v3 cover letter:

When using syn cookies, do not simply trust that the echoed timestamp
was not modified; this makes sure that ecn is not turned on magically
when it is disabled on the host.

The first two patches, which were not part of earlier series, prepare
the cookie code for the ecn route metrics change by allowing us to
more easily use the existing dst object for ecn validation.

The 3rd patch adds the ecn route metric feature support.
It is almost the same as in v2, except that we'll now also test the
dst_features when decoding a syn cookie timestamp that indicates ecn support.

These three patches then allow turning on explicit congestion notification
based on the destination network.

For example, assuming the default tcp_ecn sysctl '2', the following will
enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a
tcp connection) for all connections to hosts inside the 192.168.2/24 network:

ip route change 192.168.2.0/24 dev eth0 features ecn

Having a more fine-grained per-route setting can be beneficial for
various reasons, for example 1) within data centers, or 2) local ISPs
may deploy ECN support for their own video/streaming services [1], etc.

Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa.

The patch to enable this in iproute2 will be posted shortly; it is currently
also available here:
http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e

[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15

^ permalink raw reply

* [PATCH -next v4 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

The code was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.
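
As a rough user-space illustration of the encoding (a sketch, not the
kernel functions): the TS_OPT_* and TSMASK values below are taken from
this patch, while struct ts_opts, ts_encode() and ts_decode() are
hypothetical names used only here. The sketch stops at the lines visible
in the diff; any further adjustment of the timestamp value and all sysctl
checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_WSCALE_MASK	0xfu
#define TS_OPT_SACK		(1u << 4)
#define TS_OPT_ECN		(1u << 5)
#define TSBITS			6
#define TSMASK			((1u << TSBITS) - 1)

/* Hypothetical container for the options stored in the timestamp. */
struct ts_opts {
	bool wscale_ok;
	uint8_t snd_wscale;
	bool sack_ok;
	bool ecn_ok;
};

/* Replace the 6 lowest timestamp bits with the option bits. */
static uint32_t ts_encode(uint32_t ts_now, const struct ts_opts *o)
{
	/* 0xf is an invalid scale and means "no window scaling" */
	uint32_t options = o->wscale_ok ? o->snd_wscale : TS_OPT_WSCALE_MASK;

	if (o->sack_ok)
		options |= TS_OPT_SACK;
	if (o->ecn_ok)
		options |= TS_OPT_ECN;

	return (ts_now & ~TSMASK) | options;
}

/* Recover the options from the echoed timestamp. */
static void ts_decode(uint32_t tsecr, struct ts_opts *o)
{
	uint32_t wscale = tsecr & TS_OPT_WSCALE_MASK;

	o->sack_ok = (tsecr & TS_OPT_SACK) != 0;
	o->ecn_ok = (tsecr & TS_OPT_ECN) != 0;
	o->wscale_ok = wscale != TS_OPT_WSCALE_MASK;
	o->snd_wscale = o->wscale_ok ? (uint8_t)wscale : 0;
}
```

For example, a SYN with window scale 7 and SACK but no ECN yields the low
bits 0x17, and a SYN without window scaling yields 0xf in the low nibble.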

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 No changes since v3; preserved Eric's Acked-by tag.

 net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 4ac7bca..c3792c0 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -19,10 +19,6 @@
 #include <net/tcp.h>
 #include <net/route.h>
 
-/* Timestamps: lowest bits store TCP options */
-#define TSBITS 6
-#define TSMASK (((__u32)1 << TSBITS) - 1)
-
 extern int sysctl_tcp_syncookies;
 
 static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
@@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
 #define COOKIEBITS 24	/* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
+/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
+ * stores TCP options:
+ *
+ * MSB                               LSB
+ * | 31 ...   6 |  5  |  4   | 3 2 1 0 |
+ * |  Timestamp | ECN | SACK | WScale  |
+ *
+ * When we receive a valid cookie-ACK, we look at the echoed tsval (if
+ * any) to figure out which TCP options we should use for the rebuilt
+ * connection.
+ *
+ * A WScale setting of '0xf' (which is an invalid scaling value)
+ * means that original syn did not include the TCP window scaling option.
+ */
+#define TS_OPT_WSCALE_MASK	0xf
+#define TS_OPT_SACK		BIT(4)
+#define TS_OPT_ECN		BIT(5)
+/* There is no TS_OPT_TIMESTAMP:
+ * if ACK contains timestamp option, we already know it was
+ * requested/supported by the syn/synack exchange.
+ */
+#define TSBITS	6
+#define TSMASK	(((__u32)1 << TSBITS) - 1)
+
 static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
 		      ipv4_cookie_scratch);
 
@@ -67,9 +87,11 @@ __u32 cookie_init_timestamp(struct request_sock *req)
 
 	ireq = inet_rsk(req);
 
-	options = ireq->wscale_ok ? ireq->snd_wscale : 0xf;
-	options |= ireq->sack_ok << 4;
-	options |= ireq->ecn_ok << 5;
+	options = ireq->wscale_ok ? ireq->snd_wscale : TS_OPT_WSCALE_MASK;
+	if (ireq->sack_ok)
+		options |= TS_OPT_SACK;
+	if (ireq->ecn_ok)
+		options |= TS_OPT_ECN;
 
 	ts = ts_now & ~TSMASK;
 	ts |= options;
@@ -219,16 +241,13 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
  * additional tcp options in the timestamp.
  * This extracts these options from the timestamp echo.
  *
- * The lowest 4 bits store snd_wscale.
- * next 2 bits indicate SACK and ECN support.
- *
  * return false if we decode an option that should not be.
  */
 bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 			struct net *net, bool *ecn_ok)
 {
 	/* echoed timestamp, lowest bits contain options */
-	u32 options = tcp_opt->rcv_tsecr & TSMASK;
+	u32 options = tcp_opt->rcv_tsecr;
 
 	if (!tcp_opt->saw_tstamp)  {
 		tcp_clear_options(tcp_opt);
@@ -238,19 +257,20 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 	if (!sysctl_tcp_timestamps)
 		return false;
 
-	tcp_opt->sack_ok = (options & (1 << 4)) ? TCP_SACK_SEEN : 0;
-	*ecn_ok = (options >> 5) & 1;
+	tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
+	*ecn_ok = options & TS_OPT_ECN;
 	if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
 		return false;
 
 	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
 		return false;
 
-	if ((options & 0xf) == 0xf)
+	if ((options & TS_OPT_WSCALE_MASK) == TS_OPT_WSCALE_MASK)
 		return true; /* no window scaling */
 
 	tcp_opt->wscale_ok = 1;
-	tcp_opt->snd_wscale = options & 0xf;
+	tcp_opt->snd_wscale = options & TS_OPT_WSCALE_MASK;
+
 	return sysctl_tcp_window_scaling != 0;
 }
 EXPORT_SYMBOL(cookie_check_timestamp);
-- 
2.0.4

^ permalink raw reply related

* [PATCH -next v4 2/3] syncookies: split cookie_check_timestamp() into two functions
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

The function cookie_check_timestamp(), called from both IPv4 and IPv6
context, is used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.

We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function to cookie_timestamp_decode().
cookie_timestamp_decode() just fills in the tcp_options_received struct
based on the echoed timestamp received from the peer. Anything that fails
in this function will actually discard the request socket.

While this is the natural place for decoding options such as ECN which
commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.

Therefore, we split this functionality into cookie_ecn_ok(), which tells
us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.
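
The resulting control flow can be sketched in user space as follows. This
is only a model under stated assumptions: struct host_cfg, enum verdict
and the model_*() functions are hypothetical names, the decode step is
reduced to the SACK check, and the route lookup is represented by a
comment. It shows the point of the split: a decode failure drops the
request, while a failed ECN check merely leaves ECN off.

```c
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_SACK	(1u << 4)
#define TS_OPT_ECN	(1u << 5)

/* Hypothetical stand-in for the host's sysctl settings. */
struct host_cfg {
	bool tcp_sack;
	bool tcp_ecn;
};

enum verdict { DROP, ACCEPT_NO_ECN, ACCEPT_ECN };

/* Models cookie_timestamp_decode(): false means "discard the request". */
static bool model_timestamp_decode(uint32_t tsecr, const struct host_cfg *cfg)
{
	if ((tsecr & TS_OPT_SACK) && !cfg->tcp_sack)
		return false;
	return true;
}

/* Models cookie_ecn_ok() as added by this patch: timestamp bit AND sysctl. */
static bool model_ecn_ok(uint32_t tsecr, const struct host_cfg *cfg)
{
	if (!(tsecr & TS_OPT_ECN))
		return false;
	return cfg->tcp_ecn;
}

/* Models the order of the two calls in cookie_{v4,v6}_check(). */
static enum verdict model_cookie_check(uint32_t tsecr,
				       const struct host_cfg *cfg)
{
	if (!model_timestamp_decode(tsecr, cfg))
		return DROP;

	/* ... the existing route lookup happens here in the real code ... */

	return model_ecn_ok(tsecr, cfg) ? ACCEPT_ECN : ACCEPT_NO_ECN;
}
```

With tcp_ecn disabled, a cookie whose timestamp carries the ECN bit is
still accepted in this model, just without ECN, whereas before the split
the same cookie would have been rejected.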

Joint work with Daniel Borkmann.

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 No changes since v3, preserved Eric's Acked-by tag.

 include/net/tcp.h     |  9 ++++-----
 net/ipv4/syncookies.c | 31 +++++++++++++++++++++----------
 net/ipv6/syncookies.c |  5 ++---
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a35b15..36c5084 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -490,17 +490,16 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
 			      u16 *mssp);
 __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
 			      __u16 *mss);
-#endif
-
 __u32 cookie_init_timestamp(struct request_sock *req);
-bool cookie_check_timestamp(struct tcp_options_received *opt, struct net *net,
-			    bool *ecn_ok);
+bool cookie_timestamp_decode(struct tcp_options_received *opt);
+bool cookie_ecn_ok(const struct tcp_options_received *opt,
+		   const struct net *net);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
-#ifdef CONFIG_SYN_COOKIES
+
 u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
 			      const struct tcphdr *th, u16 *mssp);
 __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index c3792c0..6de7725 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -241,10 +241,10 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
  * additional tcp options in the timestamp.
  * This extracts these options from the timestamp echo.
  *
- * return false if we decode an option that should not be.
+ * return false if we decode a tcp option that is disabled
+ * on the host.
  */
-bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
-			struct net *net, bool *ecn_ok)
+bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
 {
 	/* echoed timestamp, lowest bits contain options */
 	u32 options = tcp_opt->rcv_tsecr;
@@ -258,9 +258,6 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 		return false;
 
 	tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
-	*ecn_ok = options & TS_OPT_ECN;
-	if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
-		return false;
 
 	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
 		return false;
@@ -273,7 +270,22 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
 
 	return sysctl_tcp_window_scaling != 0;
 }
-EXPORT_SYMBOL(cookie_check_timestamp);
+EXPORT_SYMBOL(cookie_timestamp_decode);
+
+bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
+		   const struct net *net)
+{
+	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
+
+	if (!ecn_ok)
+		return false;
+
+	if (net->ipv4.sysctl_tcp_ecn)
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL(cookie_ecn_ok);
 
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 {
@@ -289,7 +301,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct rtable *rt;
 	__u8 rcv_wscale;
-	bool ecn_ok = false;
 	struct flowi4 fl4;
 
 	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
@@ -310,7 +321,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
-	if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -328,7 +339,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->ir_loc_addr	= ip_hdr(skb)->daddr;
 	ireq->ir_rmt_addr	= ip_hdr(skb)->saddr;
 	ireq->ir_mark		= inet_request_mark(sk, skb);
-	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
@@ -377,6 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
 
 	ret = get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index be291ba..52cc8cb 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -166,7 +166,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
-	bool ecn_ok = false;
 
 	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -186,7 +185,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
-	if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -223,7 +222,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	req->expires = 0UL;
 	req->num_retrans = 0;
-	ireq->ecn_ok		= ecn_ok;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
@@ -264,6 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
 
 	ret = get_cookie_sock(sk, skb, req, dst);
 out:
-- 
2.0.4

^ permalink raw reply related

* [PATCH -next v4 3/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415032503-4936-1-git-send-email-fw@strlen.de>

This patch allows setting ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.
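
For the passive (listening) side, the decision made in
tcp_ecn_create_request() after this patch can be modelled roughly as
below. This is a user-space sketch only: struct syn_info and
model_request_ecn_ok() are hypothetical names, and each boolean stands in
for the corresponding kernel check (header flags, tcp_ca_needs_ecn(), the
tcp_ecn sysctl, and dst_feature(dst, RTAX_FEATURE_ECN)).

```c
#include <stdbool.h>

/* Hypothetical summary of the inputs the kernel function looks at. */
struct syn_info {
	bool th_ecn;		/* SYN carried both ECE and CWR */
	bool ect;		/* IP header of the SYN was ECT-marked */
	bool need_ecn;		/* congestion control of the listener needs ECN */
	bool sysctl_ecn;	/* global tcp_ecn sysctl is non-zero */
	bool route_ecn;		/* route carries the ecn feature */
};

/* Returns true when the request socket would get ecn_ok set. */
static bool model_request_ecn_ok(const struct syn_info *s)
{
	/* new in this patch: the route feature counts like the sysctl */
	bool ecn_ok = s->sysctl_ecn || s->route_ecn;

	if (!s->th_ecn)
		return false;

	if (!s->ect && !s->need_ecn && ecn_ok)
		return true;
	if (s->ect && s->need_ecn)
		return true;

	return false;
}
```

In this model a SYN that asks for ECN is granted it when the global sysctl
is 0 but the route has the ecn feature, and is refused when both are off.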

One can use 'ip route change dev $dev $net features ecn' to toggle this.

Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.

There was a recent measurement study/paper [2] which scanned Alexa's
publicly available top million websites list from vantage points in the
US, Europe and Asia:

Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
attributable to commit 255cac91c3 ("tcp: extend ECN sysctl to allow
server-side only ECN") ;)); a break in on-path connectivity was found in
about 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): of 12 thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
while _not_ failing with RST when ECN is not requested.

It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.

We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.

Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).

 [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
 [2] http://ecn.ethz.ch/

Joint work with Daniel Borkmann.

Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v3:
 the 'struct dst_entry *' argument of tcp_ecn_create_request() can be 'const', spotted by Eric.
 Changes since v2:
  - alter new cookie_ecn_ok() ok function to also evaluate
  RTAX_FEATURE_ECN if timestamp-ecn-bit is set and the ecn sysctl
  is off.
  - extra blank line in tcp_output.c to appease checkpatch.pl

 include/net/tcp.h     |  2 +-
 net/ipv4/syncookies.c |  6 +++---
 net/ipv4/tcp_input.c  | 25 +++++++++++++++----------
 net/ipv4/tcp_output.c | 13 +++++++++++--
 net/ipv6/syncookies.c |  2 +-
 5 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 36c5084..f50f29faf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -493,7 +493,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
 __u32 cookie_init_timestamp(struct request_sock *req);
 bool cookie_timestamp_decode(struct tcp_options_received *opt);
 bool cookie_ecn_ok(const struct tcp_options_received *opt,
-		   const struct net *net);
+		   const struct net *net, const struct dst_entry *dst);
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 6de7725..45fe60c 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -273,7 +273,7 @@ bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
 EXPORT_SYMBOL(cookie_timestamp_decode);
 
 bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
-		   const struct net *net)
+		   const struct net *net, const struct dst_entry *dst)
 {
 	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
 
@@ -283,7 +283,7 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
 	if (net->ipv4.sysctl_tcp_ecn)
 		return true;
 
-	return false;
+	return dst_feature(dst, RTAX_FEATURE_ECN);
 }
 EXPORT_SYMBOL(cookie_ecn_ok);
 
@@ -387,7 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
 
 	ret = get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4617e..196b438 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5876,20 +5876,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
  */
 static void tcp_ecn_create_request(struct request_sock *req,
 				   const struct sk_buff *skb,
-				   const struct sock *listen_sk)
+				   const struct sock *listen_sk,
+				   const struct dst_entry *dst)
 {
 	const struct tcphdr *th = tcp_hdr(skb);
 	const struct net *net = sock_net(listen_sk);
 	bool th_ecn = th->ece && th->cwr;
-	bool ect, need_ecn;
+	bool ect, need_ecn, ecn_ok;
 
 	if (!th_ecn)
 		return;
 
 	ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
 	need_ecn = tcp_ca_needs_ecn(listen_sk);
+	ecn_ok = net->ipv4.sysctl_tcp_ecn || dst_feature(dst, RTAX_FEATURE_ECN);
 
-	if (!ect && !need_ecn && net->ipv4.sysctl_tcp_ecn)
+	if (!ect && !need_ecn && ecn_ok)
 		inet_rsk(req)->ecn_ok = 1;
 	else if (ect && need_ecn)
 		inet_rsk(req)->ecn_ok = 1;
@@ -5954,13 +5956,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
-	if (!want_cookie || tmp_opt.tstamp_ok)
-		tcp_ecn_create_request(req, skb, sk);
-
-	if (want_cookie) {
-		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-		req->cookie_ts = tmp_opt.tstamp_ok;
-	} else if (!isn) {
+	if (!want_cookie && !isn) {
 		/* VJ's idea. We save last timestamp seen
 		 * from the destination in peer table, when entering
 		 * state TIME-WAIT, and check against it before
@@ -6008,6 +6004,15 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 			goto drop_and_free;
 	}
 
+	tcp_ecn_create_request(req, skb, sk, dst);
+
+	if (want_cookie) {
+		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+		req->cookie_ts = tmp_opt.tstamp_ok;
+		if (!tmp_opt.tstamp_ok)
+			inet_rsk(req)->ecn_ok = 0;
+	}
+
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_openreq_init_rwin(req, sk, dst);
 	fastopen = !want_cookie &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a3d453b..0b88158 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -333,10 +333,19 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+		       tcp_ca_needs_ecn(sk);
+
+	if (!use_ecn) {
+		const struct dst_entry *dst = __sk_dst_get(sk);
+
+		if (dst && dst_feature(dst, RTAX_FEATURE_ECN))
+			use_ecn = true;
+	}
 
 	tp->ecn_flags = 0;
-	if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-	    tcp_ca_needs_ecn(sk)) {
+
+	if (use_ecn) {
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
 		tp->ecn_flags = TCP_ECN_OK;
 		if (tcp_ca_needs_ecn(sk))
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 52cc8cb..7337fc7 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -262,7 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
 
 	ret = get_cookie_sock(sk, skb, req, dst);
 out:
-- 
2.0.4

^ permalink raw reply related

* Re: TCP NewReno and single retransmit
From: Marcelo Ricardo Leitner @ 2014-11-03 16:38 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Neal Cardwell, netdev, Eric Dumazet
In-Reply-To: <CAK6E8=dYfJw16Q0D40QD7RLr=wq=y+5W59zHmZ24L49OPS9O5A@mail.gmail.com>

On 31-10-2014 01:51, Yuchung Cheng wrote:
> On Thu, Oct 30, 2014 at 7:24 PM, Marcelo Ricardo Leitner
> <mleitner@redhat.com> wrote:
>> On 30-10-2014 00:03, Neal Cardwell wrote:
>>>
>>> On Mon, Oct 27, 2014 at 2:49 PM, Marcelo Ricardo Leitner
>>> <mleitner@redhat.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have a report from a customer saying that on a very calm connection,
>>>> like having only a single data packet within some minutes, if this
>>>> packet gets to be re-transmitted, retrans_stamp is only cleared when the
>>>> next acked packet is received. But this may make us abort the connection
>>>> too soon if this next packet also gets lost, because the reference for
>>>> the initial loss is still from a long while ago.
>>>
>>> ...
>>>>
>>>> @@ -2382,31 +2382,32 @@ static inline bool tcp_may_undo(const struct tcp_sock *tp)
>>>>    static bool tcp_try_undo_recovery(struct sock *sk)
>>>
>>> ...
>>>>
>>>>           if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
>>>>                   /* Hold old state until something *above* high_seq
>>>>                    * is ACKed. For Reno it is MUST to prevent false
>>>>                    * fast retransmits (RFC2582). SACK TCP is safe. */
> Or we can just remove this strange state-holding logic?
>
> I couldn't find such a "MUST" statement in RFC2582. RFC2582 section 3
> step 5 suggests exiting the recovery procedure when an ACK acknowledges
> the "recover" variable (== tp->high_seq - 1).
>
> Since we've called tcp_reset_reno_sack() before tcp_try_undo_recovery(),
> I couldn't see how false fast retransmits can be triggered without
> this state-holding.
>
> Any insights?

Nice one, me neither. Neal?

From RFC2582, Section 5, Avoiding Multiple Fast Retransmits:

    Nevertheless, unnecessary Fast Retransmits can occur with Reno or
    NewReno TCP, particularly if a Retransmit Timeout occurs during Fast
    Recovery.  (This is illustrated for Reno on page 6 of [F98], and for
    NewReno on page 8 of [F98].)  With NewReno, the data sender remains
    in Fast Recovery until either a Retransmit Timeout, or *until all of
    the data outstanding when Fast Retransmit was entered has been
    acknowledged*.  Thus with NewReno, the problem of multiple Fast
    Retransmits from a single window of data can only occur after a
    Retransmit Timeout.

Emphasis mine. If I didn't miss anything, since that condition was met, we
should be fine keeping that cwnd reduction (required by section 3 step 5)
but getting back to Open state right away.

Marcelo

>>>>                   tcp_moderate_cwnd(tp);
>>>> +               tp->retrans_stamp = 0;
>>>>                   return true;
>>>>           }
>>>>           tcp_set_ca_state(sk, TCP_CA_Open);
>>>>           return false;
>>>>    }
>>>>
>>>> We would still hold state, at least part of it.. WDYT?
>>>
>>>
>>> This approach sounds OK to me as long as we include a check of
>>> tcp_any_retrans_done(), as we do in the similar code paths (for
>>> motivation, see the comment above tcp_any_retrans_done()).
>>
>>
>> Yes, okay. I thought that this would be taken care of already by then but
>> reading the code again now after your comment, I can see what you're saying.
>> Thanks.
>>
>>> So it sounds fine to me if you change that one new line to the following
>>> 2:
>>>
>>> +  if (!tcp_any_retrans_done(sk))
>>> +    tp->retrans_stamp = 0;
>>
>>
>> Will do.
>>
>>> Nice catch!
>>
>>
>> A good part of it (including the diagram) was done by customer. :)
>> I'll post the patch as soon as we sync with them (credits).
>>
>> Marcelo
>>
>
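
The effect of the two-line change Neal suggests can be illustrated with a toy
userspace model. This is a sketch under stated assumptions, not the kernel
implementation: the toy_* names and struct fields are hypothetical, the
any_retrans_done flag stands in for tcp_any_retrans_done(), and the timeout
check is a simplification in which the clock starts at retrans_stamp when it
is set and at the time of the new loss otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced view of the sender state. */
struct toy_tcp {
	uint32_t snd_una;          /* first unacknowledged sequence number */
	uint32_t high_seq;         /* snd_nxt at the time recovery started */
	uint32_t retrans_stamp;    /* time of first retransmit, 0 if none */
	bool     is_reno;          /* true when SACK is not in use */
	bool     any_retrans_done; /* stand-in for tcp_any_retrans_done() */
};

/* Models the quoted branch of tcp_try_undo_recovery() with the fix.
 * Returns true when the recovery state is held, false when the
 * connection would go back to the Open state.
 */
static bool toy_try_undo_recovery(struct toy_tcp *tp)
{
	assert(tp != NULL);
	if (tp->snd_una == tp->high_seq && tp->is_reno) {
		/* Only forget the old loss if nothing retransmitted
		 * is still outstanding.
		 */
		if (!tp->any_retrans_done)
			tp->retrans_stamp = 0;
		return true;
	}
	return false;
}

/* Simplified give-up check: measures from retrans_stamp when set,
 * otherwise from the time of the current loss.
 */
static bool toy_retrans_timed_out(const struct toy_tcp *tp,
				  uint32_t loss_time, uint32_t now,
				  uint32_t limit)
{
	uint32_t start;

	assert(tp != NULL);
	start = tp->retrans_stamp ? tp->retrans_stamp : loss_time;
	return now - start >= limit;
}
```

In this model a retrans_stamp left over from a loss minutes earlier makes the
next loss look as if it had already exceeded the limit, while a cleared stamp
restarts the measurement at the new loss.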

^ permalink raw reply

* Re: DMA allocations from CMA and fatal_signal_pending check
From: Michal Nazarewicz @ 2014-11-03 16:45 UTC (permalink / raw)
  To: Florian Fainelli, Joonsoo Kim
  Cc: linux-arm-kernel, Brian Norris, Gregory Fong, linux-kernel,
	linux-mm, lauraa, gioh.kim, aneesh.kumar, m.szyprowski, akpm,
	netdev@vger.kernel.org
In-Reply-To: <5453F80C.4090006@gmail.com>

On Fri, Oct 31 2014, Florian Fainelli wrote:
> I agree that the CMA allocation should not be allowed to succeed, but
> the dma_alloc_coherent() allocation should succeed. If we look at the
> sysport driver, there are kmalloc() calls to initialize private
> structures, those will succeed (except under high memory pressure), so
> by the same token, a driver expects DMA allocations to succeed (unless
> we are under high memory pressure)
>
> What are we trying to solve exactly with the fatal_signal_pending()
> check here? Are we just optimizing for the case where a process has
> allocated from a CMA region to allow this region to be returned to the
> pool of free pages when it gets killed? Could there be another mechanism
> used to reclaim those pages if we know the process is getting killed
> anyway?

We're guarding against situations where a process may hang around for an
arbitrarily long time after receiving SIGKILL.  If a user does “kill -9
$pid”, the usual expectation is that the $pid process will die within
seconds, and anything longer is perceived by the user as a bug.

What problem are *you* trying to solve?  If the user sent SIGKILL to
a process that initiated device initialisation, what is the point of
continuing to initialise the device?  Just recover and return -EINTR.

> Well, not really. This driver is not an isolated case, there are tons of
> other networking drivers that do exactly the same thing, and we do
> expect these dma_alloc_* calls to succeed.

Again, why do you expect them to succeed?  The code must handle failures
correctly anyway, so why do you wish to ignore a fatal signal?
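
As an illustration of "recover and return -EINTR", here is a minimal
userspace sketch of an init path that unwinds when an allocation is refused.
It is not taken from any driver: all toy_* names are hypothetical, malloc()
stands in for the real allocators, and the fatal_signal_pending flag is
passed in explicitly only so the model can tell the two failure causes apart.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device state for the sketch. */
struct toy_dev {
	void *priv; /* stands in for a kmalloc()ed private structure */
	void *ring; /* stands in for a DMA-coherent buffer */
};

/* Hypothetical stand-in for a CMA-backed allocation that refuses to
 * proceed while a fatal signal is pending.
 */
static void *toy_dma_alloc(size_t size, bool fatal_signal_pending)
{
	if (fatal_signal_pending)
		return NULL;
	return malloc(size);
}

/* Initialise the device, undoing earlier steps on any failure. */
static int toy_dev_init(struct toy_dev *dev, bool fatal_signal_pending)
{
	int err;

	assert(dev != NULL);

	dev->priv = malloc(64);
	if (!dev->priv)
		return -ENOMEM;

	dev->ring = toy_dma_alloc(4096, fatal_signal_pending);
	if (!dev->ring) {
		/* The caller is being killed: stop and report it. */
		err = fatal_signal_pending ? -EINTR : -ENOMEM;
		goto err_free_priv;
	}

	return 0;

err_free_priv:
	free(dev->priv);
	dev->priv = NULL;
	return err;
}

/* Release everything a successful toy_dev_init() set up. */
static void toy_dev_exit(struct toy_dev *dev)
{
	assert(dev != NULL);
	free(dev->ring);
	free(dev->priv);
	dev->ring = NULL;
	dev->priv = NULL;
}
```

The same unwinding path serves both the out-of-memory case and the
fatal-signal case, which is the reason a correct error path makes the
refused allocation harmless.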

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +--<mpn@google.com>--<xmpp:mina86@jabber.org>--ooO--(_)--Ooo--


^ permalink raw reply
