Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v3 0/3] Add QLogic FastLinQ iSCSI (qedi) driver.
From: Manish Rangankar @ 2016-12-01  8:21 UTC (permalink / raw)
  To: martin.petersen, lduncan, cleech
  Cc: linux-scsi, netdev, QLogic-Storage-Upstream, Yuval.Mintz

This series introduces hardware offload iSCSI initiator driver for the
41000 Series Converged Network Adapters (579xx chip) by Qlogic. The overall
driver design includes a common module ('qed') and protocol specific
dependent modules ('qedi' for iSCSI).

This is an open iSCSI driver, modifications to open iSCSI user components
'iscsid', 'iscsiuio', etc. are required for the solution to work. The user
space changes are also in the process of being submitted.

    https://groups.google.com/forum/#!forum/open-iscsi

The 'qed' common module, under drivers/net/ethernet/qlogic/qed/, is
enhanced with functionality required for the iSCSI support. This series
is based on:

    net tree base: Merge of net and net-next as of 11/29/2016

Changes from RFC v2:

  1. qedi patches are squashed into single patch to prevent krobot
     warning.
  2. Fixed 'hw_p_cpuq' incompatible pointer type. 
  3. Fixed sparse incompatible types in comparison expression. 
  4. Misc fixes with latest 'checkpatch --strict' option.
  5. Remove int_mode option from MODULE_PARAM.
  6. Prefix all MODULE_PARAM params with qedi_*.
  7. Use CONFIG_QED_ISCSI instead of CONFIG_QEDI
  8. Added bad task mem access fix.

Manish Rangankar (1):
  qedi: Add QLogic FastLinQ offload iSCSI driver framework.

Yuval Mintz (2):
  qed: Add support for hardware offloaded iSCSI.
  qed: Add iSCSI out of order packet handling.

 MAINTAINERS                                    |    6 +
 drivers/net/ethernet/qlogic/Kconfig            |    3 +
 drivers/net/ethernet/qlogic/qed/Makefile       |    1 +
 drivers/net/ethernet/qlogic/qed/qed.h          |    8 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c      |   22 +
 drivers/net/ethernet/qlogic/qed/qed_iscsi.c    | 1277 +++++++++++++
 drivers/net/ethernet/qlogic/qed/qed_iscsi.h    |   52 +
 drivers/net/ethernet/qlogic/qed/qed_ll2.c      |  555 +++++-
 drivers/net/ethernet/qlogic/qed/qed_ll2.h      |    9 +
 drivers/net/ethernet/qlogic/qed/qed_ooo.c      |  501 +++++
 drivers/net/ethernet/qlogic/qed/qed_ooo.h      |  176 ++
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h |    2 +
 drivers/net/ethernet/qlogic/qed/qed_roce.c     |    1 +
 drivers/net/ethernet/qlogic/qed/qed_spq.c      |   24 +
 drivers/scsi/Kconfig                           |    1 +
 drivers/scsi/Makefile                          |    1 +
 drivers/scsi/qedi/Kconfig                      |   10 +
 drivers/scsi/qedi/Makefile                     |    5 +
 drivers/scsi/qedi/qedi.h                       |  364 ++++
 drivers/scsi/qedi/qedi_dbg.c                   |  143 ++
 drivers/scsi/qedi/qedi_dbg.h                   |  144 ++
 drivers/scsi/qedi/qedi_debugfs.c               |  244 +++
 drivers/scsi/qedi/qedi_fw.c                    | 2378 ++++++++++++++++++++++++
 drivers/scsi/qedi/qedi_gbl.h                   |   73 +
 drivers/scsi/qedi/qedi_hsi.h                   |   52 +
 drivers/scsi/qedi/qedi_iscsi.c                 | 1624 ++++++++++++++++
 drivers/scsi/qedi/qedi_iscsi.h                 |  232 +++
 drivers/scsi/qedi/qedi_main.c                  | 2127 +++++++++++++++++++++
 drivers/scsi/qedi/qedi_sysfs.c                 |   52 +
 drivers/scsi/qedi/qedi_version.h               |   14 +
 include/linux/qed/qed_if.h                     |    2 +
 include/linux/qed/qed_iscsi_if.h               |  229 +++
 32 files changed, 10303 insertions(+), 29 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_iscsi.h
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.h
 create mode 100644 drivers/scsi/qedi/Kconfig
 create mode 100644 drivers/scsi/qedi/Makefile
 create mode 100644 drivers/scsi/qedi/qedi.h
 create mode 100644 drivers/scsi/qedi/qedi_dbg.c
 create mode 100644 drivers/scsi/qedi/qedi_dbg.h
 create mode 100644 drivers/scsi/qedi/qedi_debugfs.c
 create mode 100644 drivers/scsi/qedi/qedi_fw.c
 create mode 100644 drivers/scsi/qedi/qedi_gbl.h
 create mode 100644 drivers/scsi/qedi/qedi_hsi.h
 create mode 100644 drivers/scsi/qedi/qedi_iscsi.c
 create mode 100644 drivers/scsi/qedi/qedi_iscsi.h
 create mode 100644 drivers/scsi/qedi/qedi_main.c
 create mode 100644 drivers/scsi/qedi/qedi_sysfs.c
 create mode 100644 drivers/scsi/qedi/qedi_version.h
 create mode 100644 include/linux/qed/qed_iscsi_if.h

-- 
1.8.3.1

^ permalink raw reply

* [PATCH v3 net-next 2/3] qed: Add iSCSI out of order packet handling.
From: Manish Rangankar @ 2016-12-01  8:21 UTC (permalink / raw)
  To: martin.petersen, lduncan, cleech
  Cc: linux-scsi, netdev, QLogic-Storage-Upstream, Yuval.Mintz
In-Reply-To: <1480580468-31567-1-git-send-email-manish.rangankar@cavium.com>

From: Yuval Mintz <yuval.mintz@cavium.com>

This patch adds out of order packet handling for hardware offloaded
iSCSI. Out of order packet handling requires driver buffer allocation
and assistance.

Signed-off-by: Arun Easi <arun.easi@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/net/ethernet/qlogic/qed/Makefile   |   2 +-
 drivers/net/ethernet/qlogic/qed/qed.h      |   1 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |  14 +-
 drivers/net/ethernet/qlogic/qed/qed_ll2.c  | 551 +++++++++++++++++++++++++++--
 drivers/net/ethernet/qlogic/qed/qed_ll2.h  |   9 +
 drivers/net/ethernet/qlogic/qed/qed_ooo.c  | 501 ++++++++++++++++++++++++++
 drivers/net/ethernet/qlogic/qed/qed_ooo.h  | 176 +++++++++
 drivers/net/ethernet/qlogic/qed/qed_roce.c |   1 +
 drivers/net/ethernet/qlogic/qed/qed_spq.c  |   9 +
 9 files changed, 1234 insertions(+), 30 deletions(-)
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.c
 create mode 100644 drivers/net/ethernet/qlogic/qed/qed_ooo.h

diff --git a/drivers/net/ethernet/qlogic/qed/Makefile b/drivers/net/ethernet/qlogic/qed/Makefile
index 597e15c..729e437 100644
--- a/drivers/net/ethernet/qlogic/qed/Makefile
+++ b/drivers/net/ethernet/qlogic/qed/Makefile
@@ -6,4 +6,4 @@ qed-y := qed_cxt.o qed_dev.o qed_hw.o qed_init_fw_funcs.o qed_init_ops.o \
 qed-$(CONFIG_QED_SRIOV) += qed_sriov.o qed_vf.o
 qed-$(CONFIG_QED_LL2) += qed_ll2.o
 qed-$(CONFIG_QED_RDMA) += qed_roce.o
-qed-$(CONFIG_QED_ISCSI) += qed_iscsi.o
+qed-$(CONFIG_QED_ISCSI) += qed_iscsi.o qed_ooo.o
diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index 15286c1..85c2c6a 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -392,6 +392,7 @@ struct qed_hwfn {
 	/* Protocol related */
 	bool				using_ll2;
 	struct qed_ll2_info		*p_ll2_info;
+	struct qed_ooo_info		*p_ooo_info;
 	struct qed_rdma_info		*p_rdma_info;
 	struct qed_iscsi_info		*p_iscsi_info;
 	struct qed_pf_params		pf_params;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index e3612be..896429c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -32,6 +32,7 @@
 #include "qed_iscsi.h"
 #include "qed_ll2.h"
 #include "qed_mcp.h"
+#include "qed_ooo.h"
 #include "qed_reg_addr.h"
 #include "qed_sp.h"
 #include "qed_sriov.h"
@@ -156,8 +157,10 @@ void qed_resc_free(struct qed_dev *cdev)
 #ifdef CONFIG_QED_LL2
 		qed_ll2_free(p_hwfn, p_hwfn->p_ll2_info);
 #endif
-		if (p_hwfn->hw_info.personality == QED_PCI_ISCSI)
+		if (p_hwfn->hw_info.personality == QED_PCI_ISCSI) {
 			qed_iscsi_free(p_hwfn, p_hwfn->p_iscsi_info);
+			qed_ooo_free(p_hwfn, p_hwfn->p_ooo_info);
+		}
 		qed_iov_free(p_hwfn);
 		qed_dmae_info_free(p_hwfn);
 		qed_dcbx_info_free(p_hwfn, p_hwfn->p_dcbx_info);
@@ -415,6 +418,7 @@ int qed_qm_reconf(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
 int qed_resc_alloc(struct qed_dev *cdev)
 {
 	struct qed_iscsi_info *p_iscsi_info;
+	struct qed_ooo_info *p_ooo_info;
 #ifdef CONFIG_QED_LL2
 	struct qed_ll2_info *p_ll2_info;
 #endif
@@ -542,6 +546,10 @@ int qed_resc_alloc(struct qed_dev *cdev)
 			if (!p_iscsi_info)
 				goto alloc_no_mem;
 			p_hwfn->p_iscsi_info = p_iscsi_info;
+			p_ooo_info = qed_ooo_alloc(p_hwfn);
+			if (!p_ooo_info)
+				goto alloc_no_mem;
+			p_hwfn->p_ooo_info = p_ooo_info;
 		}
 
 		/* DMA info initialization */
@@ -596,8 +604,10 @@ void qed_resc_setup(struct qed_dev *cdev)
 		if (p_hwfn->using_ll2)
 			qed_ll2_setup(p_hwfn, p_hwfn->p_ll2_info);
 #endif
-		if (p_hwfn->hw_info.personality == QED_PCI_ISCSI)
+		if (p_hwfn->hw_info.personality == QED_PCI_ISCSI) {
 			qed_iscsi_setup(p_hwfn, p_hwfn->p_iscsi_info);
+			qed_ooo_setup(p_hwfn, p_hwfn->p_ooo_info);
+		}
 	}
 }
 
diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
index 84d7e91..de4e2a2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
@@ -36,6 +36,7 @@
 #include "qed_int.h"
 #include "qed_ll2.h"
 #include "qed_mcp.h"
+#include "qed_ooo.h"
 #include "qed_reg_addr.h"
 #include "qed_sp.h"
 #include "qed_roce.h"
@@ -296,25 +297,34 @@ static void qed_ll2_txq_flush(struct qed_hwfn *p_hwfn, u8 connection_handle)
 		list_del(&p_pkt->list_entry);
 		b_last_packet = list_empty(&p_tx->active_descq);
 		list_add_tail(&p_pkt->list_entry, &p_tx->free_descq);
-		p_tx->cur_completing_packet = *p_pkt;
-		p_tx->cur_completing_bd_idx = 1;
-		b_last_frag = p_tx->cur_completing_bd_idx == p_pkt->bd_used;
-		tx_frag = p_pkt->bds_set[0].tx_frag;
-		if (p_ll2_conn->gsi_enable)
-			qed_ll2b_release_tx_gsi_packet(p_hwfn,
-						       p_ll2_conn->my_id,
-						       p_pkt->cookie,
-						       tx_frag,
-						       b_last_frag,
-						       b_last_packet);
-		else
-			qed_ll2b_complete_tx_packet(p_hwfn,
-						    p_ll2_conn->my_id,
-						    p_pkt->cookie,
-						    tx_frag,
-						    b_last_frag,
-						    b_last_packet);
+		if (p_ll2_conn->conn_type == QED_LL2_TYPE_ISCSI_OOO) {
+			struct qed_ooo_buffer *p_buffer;
 
+			p_buffer = (struct qed_ooo_buffer *)p_pkt->cookie;
+			qed_ooo_put_free_buffer(p_hwfn, p_hwfn->p_ooo_info,
+						p_buffer);
+		} else {
+			p_tx->cur_completing_packet = *p_pkt;
+			p_tx->cur_completing_bd_idx = 1;
+			b_last_frag =
+				p_tx->cur_completing_bd_idx == p_pkt->bd_used;
+			tx_frag = p_pkt->bds_set[0].tx_frag;
+			if (p_ll2_conn->gsi_enable)
+				qed_ll2b_release_tx_gsi_packet(p_hwfn,
+							       p_ll2_conn->
+							       my_id,
+							       p_pkt->cookie,
+							       tx_frag,
+							       b_last_frag,
+							       b_last_packet);
+			else
+				qed_ll2b_complete_tx_packet(p_hwfn,
+							    p_ll2_conn->my_id,
+							    p_pkt->cookie,
+							    tx_frag,
+							    b_last_frag,
+							    b_last_packet);
+		}
 	}
 }
 
@@ -540,12 +550,457 @@ static void qed_ll2_rxq_flush(struct qed_hwfn *p_hwfn, u8 connection_handle)
 
 		list_move_tail(&p_pkt->list_entry, &p_rx->free_descq);
 
-		rx_buf_addr = p_pkt->rx_buf_addr;
-		cookie = p_pkt->cookie;
+		if (p_ll2_conn->conn_type == QED_LL2_TYPE_ISCSI_OOO) {
+			struct qed_ooo_buffer *p_buffer;
+
+			p_buffer = (struct qed_ooo_buffer *)p_pkt->cookie;
+			qed_ooo_put_free_buffer(p_hwfn, p_hwfn->p_ooo_info,
+						p_buffer);
+		} else {
+			rx_buf_addr = p_pkt->rx_buf_addr;
+			cookie = p_pkt->cookie;
+
+			b_last = list_empty(&p_rx->active_descq);
+		}
+	}
+}
+
+#if IS_ENABLED(CONFIG_QED_ISCSI)
+static u8 qed_ll2_convert_rx_parse_to_tx_flags(u16 parse_flags)
+{
+	u8 bd_flags = 0;
+
+	if (GET_FIELD(parse_flags, PARSING_AND_ERR_FLAGS_TAG8021QEXIST))
+		SET_FIELD(bd_flags, CORE_TX_BD_FLAGS_VLAN_INSERTION, 1);
+
+	return bd_flags;
+}
+
+static int qed_ll2_lb_rxq_handler(struct qed_hwfn *p_hwfn,
+				  struct qed_ll2_info *p_ll2_conn)
+{
+	struct qed_ll2_rx_queue *p_rx = &p_ll2_conn->rx_queue;
+	u16 packet_length = 0, parse_flags = 0, vlan = 0;
+	struct qed_ll2_rx_packet *p_pkt = NULL;
+	u32 num_ooo_add_to_peninsula = 0, cid;
+	union core_rx_cqe_union *cqe = NULL;
+	u16 cq_new_idx = 0, cq_old_idx = 0;
+	struct qed_ooo_buffer *p_buffer;
+	struct ooo_opaque *iscsi_ooo;
+	u8 placement_offset = 0;
+	u8 cqe_type;
+
+	cq_new_idx = le16_to_cpu(*p_rx->p_fw_cons);
+	cq_old_idx = qed_chain_get_cons_idx(&p_rx->rcq_chain);
+	if (cq_new_idx == cq_old_idx)
+		return 0;
+
+	while (cq_new_idx != cq_old_idx) {
+		struct core_rx_fast_path_cqe *p_cqe_fp;
+
+		cqe = qed_chain_consume(&p_rx->rcq_chain);
+		cq_old_idx = qed_chain_get_cons_idx(&p_rx->rcq_chain);
+		cqe_type = cqe->rx_cqe_sp.type;
+
+		if (cqe_type != CORE_RX_CQE_TYPE_REGULAR) {
+			DP_NOTICE(p_hwfn,
+				  "Got a non-regular LB LL2 completion [type 0x%02x]\n",
+				  cqe_type);
+			return -EINVAL;
+		}
+		p_cqe_fp = &cqe->rx_cqe_fp;
+
+		placement_offset = p_cqe_fp->placement_offset;
+		parse_flags = le16_to_cpu(p_cqe_fp->parse_flags.flags);
+		packet_length = le16_to_cpu(p_cqe_fp->packet_length);
+		vlan = le16_to_cpu(p_cqe_fp->vlan);
+		iscsi_ooo = (struct ooo_opaque *)&p_cqe_fp->opaque_data;
+		qed_ooo_save_history_entry(p_hwfn, p_hwfn->p_ooo_info,
+					   iscsi_ooo);
+		cid = le32_to_cpu(iscsi_ooo->cid);
+
+		/* Process delete isle first */
+		if (iscsi_ooo->drop_size)
+			qed_ooo_delete_isles(p_hwfn, p_hwfn->p_ooo_info, cid,
+					     iscsi_ooo->drop_isle,
+					     iscsi_ooo->drop_size);
+
+		if (iscsi_ooo->ooo_opcode == TCP_EVENT_NOP)
+			continue;
+
+		/* Now process create/add/join isles */
+		if (list_empty(&p_rx->active_descq)) {
+			DP_NOTICE(p_hwfn,
+				  "LL2 OOO RX chain has no submitted buffers\n"
+				  );
+			return -EIO;
+		}
+
+		p_pkt = list_first_entry(&p_rx->active_descq,
+					 struct qed_ll2_rx_packet, list_entry);
+
+		if ((iscsi_ooo->ooo_opcode == TCP_EVENT_ADD_NEW_ISLE) ||
+		    (iscsi_ooo->ooo_opcode == TCP_EVENT_ADD_ISLE_RIGHT) ||
+		    (iscsi_ooo->ooo_opcode == TCP_EVENT_ADD_ISLE_LEFT) ||
+		    (iscsi_ooo->ooo_opcode == TCP_EVENT_ADD_PEN) ||
+		    (iscsi_ooo->ooo_opcode == TCP_EVENT_JOIN)) {
+			if (!p_pkt) {
+				DP_NOTICE(p_hwfn,
+					  "LL2 OOO RX packet is not valid\n");
+				return -EIO;
+			}
+			list_del(&p_pkt->list_entry);
+			p_buffer = (struct qed_ooo_buffer *)p_pkt->cookie;
+			p_buffer->packet_length = packet_length;
+			p_buffer->parse_flags = parse_flags;
+			p_buffer->vlan = vlan;
+			p_buffer->placement_offset = placement_offset;
+			qed_chain_consume(&p_rx->rxq_chain);
+			list_add_tail(&p_pkt->list_entry, &p_rx->free_descq);
+
+			switch (iscsi_ooo->ooo_opcode) {
+			case TCP_EVENT_ADD_NEW_ISLE:
+				qed_ooo_add_new_isle(p_hwfn,
+						     p_hwfn->p_ooo_info,
+						     cid,
+						     iscsi_ooo->ooo_isle,
+						     p_buffer);
+				break;
+			case TCP_EVENT_ADD_ISLE_RIGHT:
+				qed_ooo_add_new_buffer(p_hwfn,
+						       p_hwfn->p_ooo_info,
+						       cid,
+						       iscsi_ooo->ooo_isle,
+						       p_buffer,
+						       QED_OOO_RIGHT_BUF);
+				break;
+			case TCP_EVENT_ADD_ISLE_LEFT:
+				qed_ooo_add_new_buffer(p_hwfn,
+						       p_hwfn->p_ooo_info,
+						       cid,
+						       iscsi_ooo->ooo_isle,
+						       p_buffer,
+						       QED_OOO_LEFT_BUF);
+				break;
+			case TCP_EVENT_JOIN:
+				qed_ooo_add_new_buffer(p_hwfn,
+						       p_hwfn->p_ooo_info,
+						       cid,
+						       iscsi_ooo->ooo_isle +
+						       1,
+						       p_buffer,
+						       QED_OOO_LEFT_BUF);
+				qed_ooo_join_isles(p_hwfn,
+						   p_hwfn->p_ooo_info,
+						   cid, iscsi_ooo->ooo_isle);
+				break;
+			case TCP_EVENT_ADD_PEN:
+				num_ooo_add_to_peninsula++;
+				qed_ooo_put_ready_buffer(p_hwfn,
+							 p_hwfn->p_ooo_info,
+							 p_buffer, true);
+				break;
+			}
+		} else {
+			DP_NOTICE(p_hwfn,
+				  "Unexpected event (%d) TX OOO completion\n",
+				  iscsi_ooo->ooo_opcode);
+		}
+	}
+
+	return 0;
+}
+
+static void
+qed_ooo_submit_tx_buffers(struct qed_hwfn *p_hwfn,
+			  struct qed_ll2_info *p_ll2_conn)
+{
+	struct qed_ooo_buffer *p_buffer;
+	int rc;
+	u16 l4_hdr_offset_w;
+	dma_addr_t first_frag;
+	u16 parse_flags;
+	u8 bd_flags;
+
+	/* Submit Tx buffers here */
+	while ((p_buffer = qed_ooo_get_ready_buffer(p_hwfn,
+						    p_hwfn->p_ooo_info))) {
+		l4_hdr_offset_w = 0;
+		bd_flags = 0;
+
+		first_frag = p_buffer->rx_buffer_phys_addr +
+			     p_buffer->placement_offset;
+		parse_flags = p_buffer->parse_flags;
+		bd_flags = qed_ll2_convert_rx_parse_to_tx_flags(parse_flags);
+		SET_FIELD(bd_flags, CORE_TX_BD_FLAGS_FORCE_VLAN_MODE, 1);
+		SET_FIELD(bd_flags, CORE_TX_BD_FLAGS_L4_PROTOCOL, 1);
+
+		rc = qed_ll2_prepare_tx_packet(p_hwfn, p_ll2_conn->my_id, 1,
+					       p_buffer->vlan, bd_flags,
+					       l4_hdr_offset_w,
+					       p_ll2_conn->tx_dest, 0,
+					       first_frag,
+					       p_buffer->packet_length,
+					       p_buffer, true);
+		if (rc) {
+			qed_ooo_put_ready_buffer(p_hwfn, p_hwfn->p_ooo_info,
+						 p_buffer, false);
+			break;
+		}
+	}
+}
+
+static void
+qed_ooo_submit_rx_buffers(struct qed_hwfn *p_hwfn,
+			  struct qed_ll2_info *p_ll2_conn)
+{
+	struct qed_ooo_buffer *p_buffer;
+	int rc;
+
+	while ((p_buffer = qed_ooo_get_free_buffer(p_hwfn,
+						   p_hwfn->p_ooo_info))) {
+		rc = qed_ll2_post_rx_buffer(p_hwfn,
+					    p_ll2_conn->my_id,
+					    p_buffer->rx_buffer_phys_addr,
+					    0, p_buffer, true);
+		if (rc) {
+			qed_ooo_put_free_buffer(p_hwfn,
+						p_hwfn->p_ooo_info, p_buffer);
+			break;
+		}
+	}
+}
+
+static int qed_ll2_lb_rxq_completion(struct qed_hwfn *p_hwfn, void *p_cookie)
+{
+	struct qed_ll2_info *p_ll2_conn = (struct qed_ll2_info *)p_cookie;
+	int rc;
+
+	rc = qed_ll2_lb_rxq_handler(p_hwfn, p_ll2_conn);
+	if (rc)
+		return rc;
+
+	qed_ooo_submit_rx_buffers(p_hwfn, p_ll2_conn);
+	qed_ooo_submit_tx_buffers(p_hwfn, p_ll2_conn);
+
+	return 0;
+}
+
+static int qed_ll2_lb_txq_completion(struct qed_hwfn *p_hwfn, void *p_cookie)
+{
+	struct qed_ll2_info *p_ll2_conn = (struct qed_ll2_info *)p_cookie;
+	struct qed_ll2_tx_queue *p_tx = &p_ll2_conn->tx_queue;
+	struct qed_ll2_tx_packet *p_pkt = NULL;
+	struct qed_ooo_buffer *p_buffer;
+	bool b_dont_submit_rx = false;
+	u16 new_idx = 0, num_bds = 0;
+	int rc;
+
+	new_idx = le16_to_cpu(*p_tx->p_fw_cons);
+	num_bds = ((s16)new_idx - (s16)p_tx->bds_idx);
+
+	if (!num_bds)
+		return 0;
+
+	while (num_bds) {
+		if (list_empty(&p_tx->active_descq))
+			return -EINVAL;
+
+		p_pkt = list_first_entry(&p_tx->active_descq,
+					 struct qed_ll2_tx_packet, list_entry);
+		if (!p_pkt)
+			return -EINVAL;
+
+		if (p_pkt->bd_used != 1) {
+			DP_NOTICE(p_hwfn,
+				  "Unexpectedly many BDs(%d) in TX OOO completion\n",
+				  p_pkt->bd_used);
+			return -EINVAL;
+		}
+
+		list_del(&p_pkt->list_entry);
+
+		num_bds--;
+		p_tx->bds_idx++;
+		qed_chain_consume(&p_tx->txq_chain);
+
+		p_buffer = (struct qed_ooo_buffer *)p_pkt->cookie;
+		list_add_tail(&p_pkt->list_entry, &p_tx->free_descq);
+
+		if (b_dont_submit_rx) {
+			qed_ooo_put_free_buffer(p_hwfn, p_hwfn->p_ooo_info,
+						p_buffer);
+			continue;
+		}
+
+		rc = qed_ll2_post_rx_buffer(p_hwfn, p_ll2_conn->my_id,
+					    p_buffer->rx_buffer_phys_addr, 0,
+					    p_buffer, true);
+		if (rc != 0) {
+			qed_ooo_put_free_buffer(p_hwfn,
+						p_hwfn->p_ooo_info, p_buffer);
+			b_dont_submit_rx = true;
+		}
+	}
+
+	qed_ooo_submit_tx_buffers(p_hwfn, p_ll2_conn);
+
+	return 0;
+}
+
+static int
+qed_ll2_acquire_connection_ooo(struct qed_hwfn *p_hwfn,
+			       struct qed_ll2_info *p_ll2_info,
+			       u16 rx_num_ooo_buffers, u16 mtu)
+{
+	struct qed_ooo_buffer *p_buf = NULL;
+	void *p_virt;
+	u16 buf_idx;
+	int rc = 0;
+
+	if (p_ll2_info->conn_type != QED_LL2_TYPE_ISCSI_OOO)
+		return rc;
+
+	if (!rx_num_ooo_buffers)
+		return -EINVAL;
+
+	for (buf_idx = 0; buf_idx < rx_num_ooo_buffers; buf_idx++) {
+		p_buf = kzalloc(sizeof(*p_buf), GFP_KERNEL);
+		if (!p_buf) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		p_buf->rx_buffer_size = mtu + 26 + ETH_CACHE_LINE_SIZE;
+		p_buf->rx_buffer_size = (p_buf->rx_buffer_size +
+					 ETH_CACHE_LINE_SIZE - 1) &
+					~(ETH_CACHE_LINE_SIZE - 1);
+		p_virt = dma_alloc_coherent(&p_hwfn->cdev->pdev->dev,
+					    p_buf->rx_buffer_size,
+					    &p_buf->rx_buffer_phys_addr,
+					    GFP_KERNEL);
+		if (!p_virt) {
+			kfree(p_buf);
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		p_buf->rx_buffer_virt_addr = p_virt;
+		qed_ooo_put_free_buffer(p_hwfn, p_hwfn->p_ooo_info, p_buf);
+	}
+
+	DP_VERBOSE(p_hwfn, QED_MSG_LL2,
+		   "Allocated [%04x] LL2 OOO buffers [each of size 0x%08x]\n",
+		   rx_num_ooo_buffers, p_buf->rx_buffer_size);
+
+out:
+	return rc;
+}
+
+static void
+qed_ll2_establish_connection_ooo(struct qed_hwfn *p_hwfn,
+				 struct qed_ll2_info *p_ll2_conn)
+{
+	if (p_ll2_conn->conn_type != QED_LL2_TYPE_ISCSI_OOO)
+		return;
+
+	qed_ooo_release_all_isles(p_hwfn, p_hwfn->p_ooo_info);
+	qed_ooo_submit_rx_buffers(p_hwfn, p_ll2_conn);
+}
+
+static void qed_ll2_release_connection_ooo(struct qed_hwfn *p_hwfn,
+					   struct qed_ll2_info *p_ll2_conn)
+{
+	struct qed_ooo_buffer *p_buffer;
+
+	if (p_ll2_conn->conn_type != QED_LL2_TYPE_ISCSI_OOO)
+		return;
+
+	qed_ooo_release_all_isles(p_hwfn, p_hwfn->p_ooo_info);
+	while ((p_buffer = qed_ooo_get_free_buffer(p_hwfn,
+						   p_hwfn->p_ooo_info))) {
+		dma_free_coherent(&p_hwfn->cdev->pdev->dev,
+				  p_buffer->rx_buffer_size,
+				  p_buffer->rx_buffer_virt_addr,
+				  p_buffer->rx_buffer_phys_addr);
+		kfree(p_buffer);
+	}
+}
+
+static void qed_ll2_stop_ooo(struct qed_dev *cdev)
+{
+	struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
+	u8 *handle = &hwfn->pf_params.iscsi_pf_params.ll2_ooo_queue_id;
+
+	DP_VERBOSE(cdev, QED_MSG_STORAGE, "Stopping LL2 OOO queue [%02x]\n",
+		   *handle);
+
+	qed_ll2_terminate_connection(hwfn, *handle);
+	qed_ll2_release_connection(hwfn, *handle);
+	*handle = QED_LL2_UNUSED_HANDLE;
+}
+
+static int qed_ll2_start_ooo(struct qed_dev *cdev,
+			     struct qed_ll2_params *params)
+{
+	struct qed_hwfn *hwfn = QED_LEADING_HWFN(cdev);
+	u8 *handle = &hwfn->pf_params.iscsi_pf_params.ll2_ooo_queue_id;
+	struct qed_ll2_info *ll2_info;
+	int rc;
+
+	ll2_info = kzalloc(sizeof(*ll2_info), GFP_KERNEL);
+	if (!ll2_info)
+		return -ENOMEM;
+	ll2_info->conn_type = QED_LL2_TYPE_ISCSI_OOO;
+	ll2_info->mtu = params->mtu;
+	ll2_info->rx_drop_ttl0_flg = params->drop_ttl0_packets;
+	ll2_info->rx_vlan_removal_en = params->rx_vlan_stripping;
+	ll2_info->tx_tc = OOO_LB_TC;
+	ll2_info->tx_dest = CORE_TX_DEST_LB;
+
+	rc = qed_ll2_acquire_connection(hwfn, ll2_info,
+					QED_LL2_RX_SIZE, QED_LL2_TX_SIZE,
+					handle);
+	kfree(ll2_info);
+	if (rc) {
+		DP_INFO(cdev, "Failed to acquire LL2 OOO connection\n");
+		goto out;
+	}
 
-		b_last = list_empty(&p_rx->active_descq);
+	rc = qed_ll2_establish_connection(hwfn, *handle);
+	if (rc) {
+		DP_INFO(cdev, "Failed to establist LL2 OOO connection\n");
+		goto fail;
 	}
+
+	return 0;
+
+fail:
+	qed_ll2_release_connection(hwfn, *handle);
+out:
+	*handle = QED_LL2_UNUSED_HANDLE;
+	return rc;
 }
+#else /* IS_ENABLED(CONFIG_QED_ISCSI) */
+static int qed_ll2_lb_rxq_completion(struct qed_hwfn *p_hwfn,
+				     void *p_cookie) { return -EINVAL; }
+static int qed_ll2_lb_txq_completion(struct qed_hwfn *p_hwfn,
+				     void *p_cookie) { return -EINVAL; }
+static inline int
+qed_ll2_acquire_connection_ooo(struct qed_hwfn *p_hwfn,
+			       struct qed_ll2_info *p_ll2_info,
+			       u16 rx_num_ooo_buffers, u16 mtu) { return 0; }
+static inline void
+qed_ll2_establish_connection_ooo(struct qed_hwfn *p_hwfn,
+				 struct qed_ll2_info *p_ll2_conn) { return; }
+static inline void
+qed_ll2_release_connection_ooo(struct qed_hwfn *p_hwfn,
+			       struct qed_ll2_info *p_ll2_conn) { return; }
+static inline void qed_ll2_stop_ooo(struct qed_dev *cdev) { return; }
+static inline int qed_ll2_start_ooo(struct qed_dev *cdev,
+				    struct qed_ll2_params *params)
+				    { return -EINVAL; }
+#endif /* IS_ENABLED(CONFIG_QED_ISCSI) */
 
 static int qed_sp_ll2_rx_queue_start(struct qed_hwfn *p_hwfn,
 				     struct qed_ll2_info *p_ll2_conn,
@@ -588,7 +1043,8 @@ static int qed_sp_ll2_rx_queue_start(struct qed_hwfn *p_hwfn,
 	p_ramrod->drop_ttl0_flg = p_ll2_conn->rx_drop_ttl0_flg;
 	p_ramrod->inner_vlan_removal_en = p_ll2_conn->rx_vlan_removal_en;
 	p_ramrod->queue_id = p_ll2_conn->queue_id;
-	p_ramrod->main_func_queue = 1;
+	p_ramrod->main_func_queue = (conn_type == QED_LL2_TYPE_ISCSI_OOO) ? 0
+									  : 1;
 
 	if ((IS_MF_DEFAULT(p_hwfn) || IS_MF_SI(p_hwfn)) &&
 	    p_ramrod->main_func_queue && (conn_type != QED_LL2_TYPE_ROCE)) {
@@ -619,6 +1075,11 @@ static int qed_sp_ll2_tx_queue_start(struct qed_hwfn *p_hwfn,
 	if (!QED_LL2_TX_REGISTERED(p_ll2_conn))
 		return 0;
 
+	if (p_ll2_conn->conn_type == QED_LL2_TYPE_ISCSI_OOO)
+		p_ll2_conn->tx_stats_en = 0;
+	else
+		p_ll2_conn->tx_stats_en = 1;
+
 	/* Get SPQ entry */
 	memset(&init_data, 0, sizeof(init_data));
 	init_data.cid = p_ll2_conn->cid;
@@ -636,7 +1097,6 @@ static int qed_sp_ll2_tx_queue_start(struct qed_hwfn *p_hwfn,
 	p_ramrod->sb_id = cpu_to_le16(qed_int_get_sp_sb_id(p_hwfn));
 	p_ramrod->sb_index = p_tx->tx_sb_index;
 	p_ramrod->mtu = cpu_to_le16(p_ll2_conn->mtu);
-	p_ll2_conn->tx_stats_en = 1;
 	p_ramrod->stats_en = p_ll2_conn->tx_stats_en;
 	p_ramrod->stats_id = p_ll2_conn->tx_stats_id;
 
@@ -860,9 +1320,19 @@ int qed_ll2_acquire_connection(struct qed_hwfn *p_hwfn,
 	if (rc)
 		goto q_allocate_fail;
 
+	rc = qed_ll2_acquire_connection_ooo(p_hwfn, p_ll2_info,
+					    rx_num_desc * 2, p_params->mtu);
+	if (rc)
+		goto q_allocate_fail;
+
 	/* Register callbacks for the Rx/Tx queues */
-	comp_rx_cb = qed_ll2_rxq_completion;
-	comp_tx_cb = qed_ll2_txq_completion;
+	if (p_params->conn_type == QED_LL2_TYPE_ISCSI_OOO) {
+		comp_rx_cb = qed_ll2_lb_rxq_completion;
+		comp_tx_cb = qed_ll2_lb_txq_completion;
+	} else {
+		comp_rx_cb = qed_ll2_rxq_completion;
+		comp_tx_cb = qed_ll2_txq_completion;
+	}
 
 	if (rx_num_desc) {
 		qed_int_register_cb(p_hwfn, comp_rx_cb,
@@ -975,6 +1445,8 @@ int qed_ll2_establish_connection(struct qed_hwfn *p_hwfn, u8 connection_handle)
 	if (p_hwfn->hw_info.personality != QED_PCI_ETH_ROCE)
 		qed_wr(p_hwfn, p_hwfn->p_main_ptt, PRS_REG_USE_LIGHT_L2, 1);
 
+	qed_ll2_establish_connection_ooo(p_hwfn, p_ll2_conn);
+
 	return rc;
 }
 
@@ -1213,6 +1685,7 @@ int qed_ll2_prepare_tx_packet(struct qed_hwfn *p_hwfn,
 			      u16 vlan,
 			      u8 bd_flags,
 			      u16 l4_hdr_offset_w,
+			      enum qed_ll2_tx_dest e_tx_dest,
 			      enum qed_ll2_roce_flavor_type qed_roce_flavor,
 			      dma_addr_t first_frag,
 			      u16 first_frag_len, void *cookie, u8 notify_fw)
@@ -1222,6 +1695,7 @@ int qed_ll2_prepare_tx_packet(struct qed_hwfn *p_hwfn,
 	enum core_roce_flavor_type roce_flavor;
 	struct qed_ll2_tx_queue *p_tx;
 	struct qed_chain *p_tx_chain;
+	enum core_tx_dest tx_dest;
 	unsigned long flags;
 	int rc = 0;
 
@@ -1252,6 +1726,8 @@ int qed_ll2_prepare_tx_packet(struct qed_hwfn *p_hwfn,
 		goto out;
 	}
 
+	tx_dest = e_tx_dest == QED_LL2_TX_DEST_NW ? CORE_TX_DEST_NW :
+						    CORE_TX_DEST_LB;
 	if (qed_roce_flavor == QED_LL2_ROCE) {
 		roce_flavor = CORE_ROCE;
 	} else if (qed_roce_flavor == QED_LL2_RROCE) {
@@ -1266,7 +1742,7 @@ int qed_ll2_prepare_tx_packet(struct qed_hwfn *p_hwfn,
 				      num_of_bds, first_frag,
 				      first_frag_len, cookie, notify_fw);
 	qed_ll2_prepare_tx_packet_set_bd(p_hwfn, p_ll2_conn, p_curp,
-					 num_of_bds, CORE_TX_DEST_NW,
+					 num_of_bds, tx_dest,
 					 vlan, bd_flags, l4_hdr_offset_w,
 					 roce_flavor,
 					 first_frag, first_frag_len);
@@ -1341,6 +1817,9 @@ int qed_ll2_terminate_connection(struct qed_hwfn *p_hwfn, u8 connection_handle)
 		qed_ll2_rxq_flush(p_hwfn, connection_handle);
 	}
 
+	if (p_ll2_conn->conn_type == QED_LL2_TYPE_ISCSI_OOO)
+		qed_ooo_release_all_isles(p_hwfn, p_hwfn->p_ooo_info);
+
 	return rc;
 }
 
@@ -1371,6 +1850,8 @@ void qed_ll2_release_connection(struct qed_hwfn *p_hwfn, u8 connection_handle)
 
 	qed_cxt_release_cid(p_hwfn, p_ll2_conn->cid);
 
+	qed_ll2_release_connection_ooo(p_hwfn, p_ll2_conn);
+
 	mutex_lock(&p_ll2_conn->mutex);
 	p_ll2_conn->b_active = false;
 	mutex_unlock(&p_ll2_conn->mutex);
@@ -1613,6 +2094,17 @@ static int qed_ll2_start(struct qed_dev *cdev, struct qed_ll2_params *params)
 		goto release_terminate;
 	}
 
+	if (cdev->hwfns[0].hw_info.personality == QED_PCI_ISCSI &&
+	    cdev->hwfns[0].pf_params.iscsi_pf_params.ooo_enable) {
+		DP_VERBOSE(cdev, QED_MSG_STORAGE, "Starting OOO LL2 queue\n");
+		rc = qed_ll2_start_ooo(cdev, params);
+		if (rc) {
+			DP_INFO(cdev,
+				"Failed to initialize the OOO LL2 queue\n");
+			goto release_terminate;
+		}
+	}
+
 	p_ptt = qed_ptt_acquire(QED_LEADING_HWFN(cdev));
 	if (!p_ptt) {
 		DP_INFO(cdev, "Failed to acquire PTT\n");
@@ -1662,6 +2154,10 @@ static int qed_ll2_stop(struct qed_dev *cdev)
 	qed_ptt_release(QED_LEADING_HWFN(cdev), p_ptt);
 	eth_zero_addr(cdev->ll2_mac_address);
 
+	if (cdev->hwfns[0].hw_info.personality == QED_PCI_ISCSI &&
+	    cdev->hwfns[0].pf_params.iscsi_pf_params.ooo_enable)
+		qed_ll2_stop_ooo(cdev);
+
 	rc = qed_ll2_terminate_connection(QED_LEADING_HWFN(cdev),
 					  cdev->ll2->handle);
 	if (rc)
@@ -1716,7 +2212,8 @@ static int qed_ll2_start_xmit(struct qed_dev *cdev, struct sk_buff *skb)
 	rc = qed_ll2_prepare_tx_packet(QED_LEADING_HWFN(cdev),
 				       cdev->ll2->handle,
 				       1 + skb_shinfo(skb)->nr_frags,
-				       vlan, flags, 0, 0 /* RoCE FLAVOR */,
+				       vlan, flags, 0, QED_LL2_TX_DEST_NW,
+				       0 /* RoCE FLAVOR */,
 				       mapping, skb->len, skb, 1);
 	if (rc)
 		goto err;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.h b/drivers/net/ethernet/qlogic/qed/qed_ll2.h
index 4e3d62a..6625a3a 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.h
@@ -41,6 +41,12 @@ enum qed_ll2_conn_type {
 	MAX_QED_LL2_RX_CONN_TYPE
 };
 
+enum qed_ll2_tx_dest {
+	QED_LL2_TX_DEST_NW, /* Light L2 TX Destination to the Network */
+	QED_LL2_TX_DEST_LB, /* Light L2 TX Destination to the Loopback */
+	QED_LL2_TX_DEST_MAX
+};
+
 struct qed_ll2_rx_packet {
 	struct list_head list_entry;
 	struct core_rx_bd_with_buff_len *rxq_bd;
@@ -192,6 +198,8 @@ int qed_ll2_post_rx_buffer(struct qed_hwfn *p_hwfn,
  * @param l4_hdr_offset_w	L4 Header Offset from start of packet
  *				(in words). This is needed if both l4_csum
  *				and ipv6_ext are set
+ * @param e_tx_dest             indicates if the packet is to be transmitted via
+ *                              loopback or to the network
  * @param first_frag
  * @param first_frag_len
  * @param cookie
@@ -206,6 +214,7 @@ int qed_ll2_prepare_tx_packet(struct qed_hwfn *p_hwfn,
 			      u16 vlan,
 			      u8 bd_flags,
 			      u16 l4_hdr_offset_w,
+			      enum qed_ll2_tx_dest e_tx_dest,
 			      enum qed_ll2_roce_flavor_type qed_roce_flavor,
 			      dma_addr_t first_frag,
 			      u16 first_frag_len, void *cookie, u8 notify_fw);
diff --git a/drivers/net/ethernet/qlogic/qed/qed_ooo.c b/drivers/net/ethernet/qlogic/qed/qed_ooo.c
new file mode 100644
index 0000000..155abcb
--- /dev/null
+++ b/drivers/net/ethernet/qlogic/qed/qed_ooo.c
@@ -0,0 +1,501 @@
+/* QLogic qed NIC Driver
+ * Copyright (c) 2015 QLogic Corporation
+ *
+ * This software is available under the terms of the GNU General Public License
+ * (GPL) Version 2, available from the file COPYING in the main directory of
+ * this source tree.
+ */
+
+#include <linux/types.h>
+#include <linux/dma-mapping.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include "qed.h"
+#include "qed_iscsi.h"
+#include "qed_ll2.h"
+#include "qed_ooo.h"
+
+static struct qed_ooo_archipelago
+*qed_ooo_seek_archipelago(struct qed_hwfn *p_hwfn,
+			  struct qed_ooo_info
+			  *p_ooo_info,
+			  u32 cid)
+{
+	struct qed_ooo_archipelago *p_archipelago = NULL;
+
+	list_for_each_entry(p_archipelago,
+			    &p_ooo_info->archipelagos_list, list_entry) {
+		if (p_archipelago->cid == cid)
+			return p_archipelago;
+	}
+
+	return NULL;
+}
+
+static struct qed_ooo_isle *qed_ooo_seek_isle(struct qed_hwfn *p_hwfn,
+					      struct qed_ooo_info *p_ooo_info,
+					      u32 cid, u8 isle)
+{
+	struct qed_ooo_archipelago *p_archipelago = NULL;
+	struct qed_ooo_isle *p_isle = NULL;
+	u8 the_num_of_isle = 1;
+
+	p_archipelago = qed_ooo_seek_archipelago(p_hwfn, p_ooo_info, cid);
+	if (!p_archipelago) {
+		DP_NOTICE(p_hwfn,
+			  "Connection %d is not found in OOO list\n", cid);
+		return NULL;
+	}
+
+	list_for_each_entry(p_isle, &p_archipelago->isles_list, list_entry) {
+		if (the_num_of_isle == isle)
+			return p_isle;
+		the_num_of_isle++;
+	}
+
+	return NULL;
+}
+
+void qed_ooo_save_history_entry(struct qed_hwfn *p_hwfn,
+				struct qed_ooo_info *p_ooo_info,
+				struct ooo_opaque *p_cqe)
+{
+	struct qed_ooo_history *p_history = &p_ooo_info->ooo_history;
+
+	if (p_history->head_idx == p_history->num_of_cqes)
+		p_history->head_idx = 0;
+	p_history->p_cqes[p_history->head_idx] = *p_cqe;
+	p_history->head_idx++;
+}
+
+struct qed_ooo_info *qed_ooo_alloc(struct qed_hwfn *p_hwfn)
+{
+	struct qed_ooo_info *p_ooo_info;
+	u16 max_num_archipelagos = 0;
+	u16 max_num_isles = 0;
+	u32 i;
+
+	if (p_hwfn->hw_info.personality != QED_PCI_ISCSI) {
+		DP_NOTICE(p_hwfn,
+			  "Failed to allocate qed_ooo_info: unknown personality\n");
+		return NULL;
+	}
+
+	max_num_archipelagos = p_hwfn->pf_params.iscsi_pf_params.num_cons;
+	max_num_isles = QED_MAX_NUM_ISLES + max_num_archipelagos;
+
+	if (!max_num_archipelagos) {
+		DP_NOTICE(p_hwfn,
+			  "Failed to allocate qed_ooo_info: unknown amount of connections\n");
+		return NULL;
+	}
+
+	p_ooo_info = kzalloc(sizeof(*p_ooo_info), GFP_KERNEL);
+	if (!p_ooo_info)
+		return NULL;
+
+	INIT_LIST_HEAD(&p_ooo_info->free_buffers_list);
+	INIT_LIST_HEAD(&p_ooo_info->ready_buffers_list);
+	INIT_LIST_HEAD(&p_ooo_info->free_isles_list);
+	INIT_LIST_HEAD(&p_ooo_info->free_archipelagos_list);
+	INIT_LIST_HEAD(&p_ooo_info->archipelagos_list);
+
+	p_ooo_info->p_isles_mem = kcalloc(max_num_isles,
+					  sizeof(struct qed_ooo_isle),
+					  GFP_KERNEL);
+	if (!p_ooo_info->p_isles_mem)
+		goto no_isles_mem;
+
+	for (i = 0; i < max_num_isles; i++) {
+		INIT_LIST_HEAD(&p_ooo_info->p_isles_mem[i].buffers_list);
+		list_add_tail(&p_ooo_info->p_isles_mem[i].list_entry,
+			      &p_ooo_info->free_isles_list);
+	}
+
+	p_ooo_info->p_archipelagos_mem =
+				kcalloc(max_num_archipelagos,
+					sizeof(struct qed_ooo_archipelago),
+					GFP_KERNEL);
+	if (!p_ooo_info->p_archipelagos_mem)
+		goto no_archipelagos_mem;
+
+	for (i = 0; i < max_num_archipelagos; i++) {
+		INIT_LIST_HEAD(&p_ooo_info->p_archipelagos_mem[i].isles_list);
+		list_add_tail(&p_ooo_info->p_archipelagos_mem[i].list_entry,
+			      &p_ooo_info->free_archipelagos_list);
+	}
+
+	p_ooo_info->ooo_history.p_cqes =
+				kcalloc(QED_MAX_NUM_OOO_HISTORY_ENTRIES,
+					sizeof(struct ooo_opaque),
+					GFP_KERNEL);
+	if (!p_ooo_info->ooo_history.p_cqes)
+		goto no_history_mem;
+
+	return p_ooo_info;
+
+no_history_mem:
+	kfree(p_ooo_info->p_archipelagos_mem);
+no_archipelagos_mem:
+	kfree(p_ooo_info->p_isles_mem);
+no_isles_mem:
+	kfree(p_ooo_info);
+	return NULL;
+}
+
+void qed_ooo_release_connection_isles(struct qed_hwfn *p_hwfn,
+				      struct qed_ooo_info *p_ooo_info, u32 cid)
+{
+	struct qed_ooo_archipelago *p_archipelago;
+	struct qed_ooo_buffer *p_buffer;
+	struct qed_ooo_isle *p_isle;
+	bool b_found = false;
+
+	if (list_empty(&p_ooo_info->archipelagos_list))
+		return;
+
+	list_for_each_entry(p_archipelago,
+			    &p_ooo_info->archipelagos_list, list_entry) {
+		if (p_archipelago->cid == cid) {
+			list_del(&p_archipelago->list_entry);
+			b_found = true;
+			break;
+		}
+	}
+
+	if (!b_found)
+		return;
+
+	while (!list_empty(&p_archipelago->isles_list)) {
+		p_isle = list_first_entry(&p_archipelago->isles_list,
+					  struct qed_ooo_isle, list_entry);
+
+		list_del(&p_isle->list_entry);
+
+		while (!list_empty(&p_isle->buffers_list)) {
+			p_buffer = list_first_entry(&p_isle->buffers_list,
+						    struct qed_ooo_buffer,
+						    list_entry);
+
+			if (!p_buffer)
+				break;
+
+			list_del(&p_buffer->list_entry);
+			list_add_tail(&p_buffer->list_entry,
+				      &p_ooo_info->free_buffers_list);
+		}
+		list_add_tail(&p_isle->list_entry,
+			      &p_ooo_info->free_isles_list);
+	}
+
+	list_add_tail(&p_archipelago->list_entry,
+		      &p_ooo_info->free_archipelagos_list);
+}
+
+void qed_ooo_release_all_isles(struct qed_hwfn *p_hwfn,
+			       struct qed_ooo_info *p_ooo_info)
+{
+	struct qed_ooo_archipelago *p_arch;
+	struct qed_ooo_buffer *p_buffer;
+	struct qed_ooo_isle *p_isle;
+
+	while (!list_empty(&p_ooo_info->archipelagos_list)) {
+		p_arch = list_first_entry(&p_ooo_info->archipelagos_list,
+					  struct qed_ooo_archipelago,
+					  list_entry);
+
+		list_del(&p_arch->list_entry);
+
+		while (!list_empty(&p_arch->isles_list)) {
+			p_isle = list_first_entry(&p_arch->isles_list,
+						  struct qed_ooo_isle,
+						  list_entry);
+
+			list_del(&p_isle->list_entry);
+
+			while (!list_empty(&p_isle->buffers_list)) {
+				p_buffer =
+				    list_first_entry(&p_isle->buffers_list,
+						     struct qed_ooo_buffer,
+						     list_entry);
+
+				if (!p_buffer)
+					break;
+
+			list_del(&p_buffer->list_entry);
+				list_add_tail(&p_buffer->list_entry,
+					      &p_ooo_info->free_buffers_list);
+			}
+			list_add_tail(&p_isle->list_entry,
+				      &p_ooo_info->free_isles_list);
+		}
+		list_add_tail(&p_arch->list_entry,
+			      &p_ooo_info->free_archipelagos_list);
+	}
+	if (!list_empty(&p_ooo_info->ready_buffers_list))
+		list_splice_tail_init(&p_ooo_info->ready_buffers_list,
+				      &p_ooo_info->free_buffers_list);
+}
+
+void qed_ooo_setup(struct qed_hwfn *p_hwfn, struct qed_ooo_info *p_ooo_info)
+{
+	qed_ooo_release_all_isles(p_hwfn, p_ooo_info);
+	memset(p_ooo_info->ooo_history.p_cqes, 0,
+	       p_ooo_info->ooo_history.num_of_cqes *
+	       sizeof(struct ooo_opaque));
+	p_ooo_info->ooo_history.head_idx = 0;
+}
+
+void qed_ooo_free(struct qed_hwfn *p_hwfn, struct qed_ooo_info *p_ooo_info)
+{
+	struct qed_ooo_buffer *p_buffer;
+
+	qed_ooo_release_all_isles(p_hwfn, p_ooo_info);
+	while (!list_empty(&p_ooo_info->free_buffers_list)) {
+		p_buffer = list_first_entry(&p_ooo_info->free_buffers_list,
+					    struct qed_ooo_buffer, list_entry);
+
+		if (!p_buffer)
+			break;
+
+		list_del(&p_buffer->list_entry);
+		dma_free_coherent(&p_hwfn->cdev->pdev->dev,
+				  p_buffer->rx_buffer_size,
+				  p_buffer->rx_buffer_virt_addr,
+				  p_buffer->rx_buffer_phys_addr);
+		kfree(p_buffer);
+	}
+
+	kfree(p_ooo_info->p_isles_mem);
+	kfree(p_ooo_info->p_archipelagos_mem);
+	kfree(p_ooo_info->ooo_history.p_cqes);
+	kfree(p_ooo_info);
+}
+
+void qed_ooo_put_free_buffer(struct qed_hwfn *p_hwfn,
+			     struct qed_ooo_info *p_ooo_info,
+			     struct qed_ooo_buffer *p_buffer)
+{
+	list_add_tail(&p_buffer->list_entry, &p_ooo_info->free_buffers_list);
+}
+
+struct qed_ooo_buffer *qed_ooo_get_free_buffer(struct qed_hwfn *p_hwfn,
+					       struct qed_ooo_info *p_ooo_info)
+{
+	struct qed_ooo_buffer *p_buffer = NULL;
+
+	if (!list_empty(&p_ooo_info->free_buffers_list)) {
+		p_buffer = list_first_entry(&p_ooo_info->free_buffers_list,
+					    struct qed_ooo_buffer, list_entry);
+
+		list_del(&p_buffer->list_entry);
+	}
+
+	return p_buffer;
+}
+
+void qed_ooo_put_ready_buffer(struct qed_hwfn *p_hwfn,
+			      struct qed_ooo_info *p_ooo_info,
+			      struct qed_ooo_buffer *p_buffer, u8 on_tail)
+{
+	if (on_tail)
+		list_add_tail(&p_buffer->list_entry,
+			      &p_ooo_info->ready_buffers_list);
+	else
+		list_add(&p_buffer->list_entry,
+			 &p_ooo_info->ready_buffers_list);
+}
+
+struct qed_ooo_buffer *qed_ooo_get_ready_buffer(struct qed_hwfn *p_hwfn,
+						struct qed_ooo_info *p_ooo_info)
+{
+	struct qed_ooo_buffer *p_buffer = NULL;
+
+	if (!list_empty(&p_ooo_info->ready_buffers_list)) {
+		p_buffer = list_first_entry(&p_ooo_info->ready_buffers_list,
+					    struct qed_ooo_buffer, list_entry);
+
+		list_del(&p_buffer->list_entry);
+	}
+
+	return p_buffer;
+}
+
+void qed_ooo_delete_isles(struct qed_hwfn *p_hwfn,
+			  struct qed_ooo_info *p_ooo_info,
+			  u32 cid, u8 drop_isle, u8 drop_size)
+{
+	struct qed_ooo_archipelago *p_archipelago = NULL;
+	struct qed_ooo_isle *p_isle = NULL;
+	u8 isle_idx;
+
+	p_archipelago = qed_ooo_seek_archipelago(p_hwfn, p_ooo_info, cid);
+	for (isle_idx = 0; isle_idx < drop_size; isle_idx++) {
+		p_isle = qed_ooo_seek_isle(p_hwfn, p_ooo_info, cid, drop_isle);
+		if (!p_isle) {
+			DP_NOTICE(p_hwfn,
+				  "Isle %d is not found(cid %d)\n",
+				  drop_isle, cid);
+			return;
+		}
+		if (list_empty(&p_isle->buffers_list))
+			DP_NOTICE(p_hwfn,
+				  "Isle %d is empty(cid %d)\n", drop_isle, cid);
+		else
+			list_splice_tail_init(&p_isle->buffers_list,
+					      &p_ooo_info->free_buffers_list);
+
+		list_del(&p_isle->list_entry);
+		p_ooo_info->cur_isles_number--;
+		list_add(&p_isle->list_entry, &p_ooo_info->free_isles_list);
+	}
+
+	if (list_empty(&p_archipelago->isles_list)) {
+		list_del(&p_archipelago->list_entry);
+		list_add(&p_archipelago->list_entry,
+			 &p_ooo_info->free_archipelagos_list);
+	}
+}
+
+void qed_ooo_add_new_isle(struct qed_hwfn *p_hwfn,
+			  struct qed_ooo_info *p_ooo_info,
+			  u32 cid, u8 ooo_isle,
+			  struct qed_ooo_buffer *p_buffer)
+{
+	struct qed_ooo_archipelago *p_archipelago = NULL;
+	struct qed_ooo_isle *p_prev_isle = NULL;
+	struct qed_ooo_isle *p_isle = NULL;
+
+	if (ooo_isle > 1) {
+		p_prev_isle = qed_ooo_seek_isle(p_hwfn,
+						p_ooo_info, cid, ooo_isle - 1);
+		if (!p_prev_isle) {
+			DP_NOTICE(p_hwfn,
+				  "Isle %d is not found(cid %d)\n",
+				  ooo_isle - 1, cid);
+			return;
+		}
+	}
+	p_archipelago = qed_ooo_seek_archipelago(p_hwfn, p_ooo_info, cid);
+	if (!p_archipelago && (ooo_isle != 1)) {
+		DP_NOTICE(p_hwfn,
+			  "Connection %d is not found in OOO list\n", cid);
+		return;
+	}
+
+	if (!list_empty(&p_ooo_info->free_isles_list)) {
+		p_isle = list_first_entry(&p_ooo_info->free_isles_list,
+					  struct qed_ooo_isle, list_entry);
+
+		list_del(&p_isle->list_entry);
+		if (!list_empty(&p_isle->buffers_list)) {
+			DP_NOTICE(p_hwfn, "Free isle is not empty\n");
+			INIT_LIST_HEAD(&p_isle->buffers_list);
+		}
+	} else {
+		DP_NOTICE(p_hwfn, "No more free isles\n");
+		return;
+	}
+
+	if (!p_archipelago &&
+	    !list_empty(&p_ooo_info->free_archipelagos_list)) {
+		p_archipelago =
+		    list_first_entry(&p_ooo_info->free_archipelagos_list,
+				     struct qed_ooo_archipelago, list_entry);
+
+		list_del(&p_archipelago->list_entry);
+		if (!list_empty(&p_archipelago->isles_list)) {
+			DP_NOTICE(p_hwfn,
+				  "Free OOO connection is not empty\n");
+			INIT_LIST_HEAD(&p_archipelago->isles_list);
+		}
+		p_archipelago->cid = cid;
+		list_add(&p_archipelago->list_entry,
+			 &p_ooo_info->archipelagos_list);
+	} else if (!p_archipelago) {
+		DP_NOTICE(p_hwfn, "No more free OOO connections\n");
+		list_add(&p_isle->list_entry,
+			 &p_ooo_info->free_isles_list);
+		list_add(&p_buffer->list_entry,
+			 &p_ooo_info->free_buffers_list);
+		return;
+	}
+
+	list_add(&p_buffer->list_entry, &p_isle->buffers_list);
+	p_ooo_info->cur_isles_number++;
+	p_ooo_info->gen_isles_number++;
+
+	if (p_ooo_info->cur_isles_number > p_ooo_info->max_isles_number)
+		p_ooo_info->max_isles_number = p_ooo_info->cur_isles_number;
+
+	if (!p_prev_isle)
+		list_add(&p_isle->list_entry, &p_archipelago->isles_list);
+	else
+		list_add(&p_isle->list_entry, &p_prev_isle->list_entry);
+}
+
+void qed_ooo_add_new_buffer(struct qed_hwfn *p_hwfn,
+			    struct qed_ooo_info *p_ooo_info,
+			    u32 cid,
+			    u8 ooo_isle,
+			    struct qed_ooo_buffer *p_buffer, u8 buffer_side)
+{
+	struct qed_ooo_isle *p_isle = NULL;
+
+	p_isle = qed_ooo_seek_isle(p_hwfn, p_ooo_info, cid, ooo_isle);
+	if (!p_isle) {
+		DP_NOTICE(p_hwfn,
+			  "Isle %d is not found(cid %d)\n", ooo_isle, cid);
+		return;
+	}
+
+	if (buffer_side == QED_OOO_LEFT_BUF)
+		list_add(&p_buffer->list_entry, &p_isle->buffers_list);
+	else
+		list_add_tail(&p_buffer->list_entry, &p_isle->buffers_list);
+}
+
+void qed_ooo_join_isles(struct qed_hwfn *p_hwfn,
+			struct qed_ooo_info *p_ooo_info, u32 cid, u8 left_isle)
+{
+	struct qed_ooo_archipelago *p_archipelago = NULL;
+	struct qed_ooo_isle *p_right_isle = NULL;
+	struct qed_ooo_isle *p_left_isle = NULL;
+
+	p_right_isle = qed_ooo_seek_isle(p_hwfn, p_ooo_info, cid,
+					 left_isle + 1);
+	if (!p_right_isle) {
+		DP_NOTICE(p_hwfn,
+			  "Right isle %d is not found(cid %d)\n",
+			  left_isle + 1, cid);
+		return;
+	}
+
+	p_archipelago = qed_ooo_seek_archipelago(p_hwfn, p_ooo_info, cid);
+	list_del(&p_right_isle->list_entry);
+	p_ooo_info->cur_isles_number--;
+	if (left_isle) {
+		p_left_isle = qed_ooo_seek_isle(p_hwfn, p_ooo_info, cid,
+						left_isle);
+		if (!p_left_isle) {
+			DP_NOTICE(p_hwfn,
+				  "Left isle %d is not found(cid %d)\n",
+				  left_isle, cid);
+			return;
+		}
+		list_splice_tail_init(&p_right_isle->buffers_list,
+				      &p_left_isle->buffers_list);
+	} else {
+		list_splice_tail_init(&p_right_isle->buffers_list,
+				      &p_ooo_info->ready_buffers_list);
+		if (list_empty(&p_archipelago->isles_list)) {
+			list_del(&p_archipelago->list_entry);
+			list_add(&p_archipelago->list_entry,
+				 &p_ooo_info->free_archipelagos_list);
+		}
+	}
+	list_add_tail(&p_right_isle->list_entry, &p_ooo_info->free_isles_list);
+}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_ooo.h b/drivers/net/ethernet/qlogic/qed/qed_ooo.h
new file mode 100644
index 0000000..7a0670a
--- /dev/null
+++ b/drivers/net/ethernet/qlogic/qed/qed_ooo.h
@@ -0,0 +1,176 @@
+/* QLogic qed NIC Driver
+ * Copyright (c) 2015 QLogic Corporation
+ *
+ * This software is available under the terms of the GNU General Public License
+ * (GPL) Version 2, available from the file COPYING in the main directory of
+ * this source tree.
+ */
+
+#ifndef _QED_OOO_H
+#define _QED_OOO_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include "qed.h"
+
+#define QED_MAX_NUM_ISLES	256
+#define QED_MAX_NUM_OOO_HISTORY_ENTRIES	512
+
+#define QED_OOO_LEFT_BUF	0
+#define QED_OOO_RIGHT_BUF	1
+
+struct qed_ooo_buffer {
+	struct list_head list_entry;
+	void *rx_buffer_virt_addr;
+	dma_addr_t rx_buffer_phys_addr;
+	u32 rx_buffer_size;
+	u16 packet_length;
+	u16 parse_flags;
+	u16 vlan;
+	u8 placement_offset;
+};
+
+struct qed_ooo_isle {
+	struct list_head list_entry;
+	struct list_head buffers_list;
+};
+
+struct qed_ooo_archipelago {
+	struct list_head list_entry;
+	struct list_head isles_list;
+	u32 cid;
+};
+
+struct qed_ooo_history {
+	struct ooo_opaque *p_cqes;
+	u32 head_idx;
+	u32 num_of_cqes;
+};
+
+struct qed_ooo_info {
+	struct list_head free_buffers_list;
+	struct list_head ready_buffers_list;
+	struct list_head free_isles_list;
+	struct list_head free_archipelagos_list;
+	struct list_head archipelagos_list;
+	struct qed_ooo_archipelago *p_archipelagos_mem;
+	struct qed_ooo_isle *p_isles_mem;
+	struct qed_ooo_history ooo_history;
+	u32 cur_isles_number;
+	u32 max_isles_number;
+	u32 gen_isles_number;
+};
+
+#if IS_ENABLED(CONFIG_QED_ISCSI)
+void qed_ooo_save_history_entry(struct qed_hwfn *p_hwfn,
+				struct qed_ooo_info *p_ooo_info,
+				struct ooo_opaque *p_cqe);
+
+struct qed_ooo_info *qed_ooo_alloc(struct qed_hwfn *p_hwfn);
+
+void qed_ooo_release_connection_isles(struct qed_hwfn *p_hwfn,
+				      struct qed_ooo_info *p_ooo_info,
+				      u32 cid);
+
+void qed_ooo_release_all_isles(struct qed_hwfn *p_hwfn,
+			       struct qed_ooo_info *p_ooo_info);
+
+void qed_ooo_setup(struct qed_hwfn *p_hwfn, struct qed_ooo_info *p_ooo_info);
+
+void qed_ooo_free(struct qed_hwfn *p_hwfn, struct qed_ooo_info *p_ooo_info);
+
+void qed_ooo_put_free_buffer(struct qed_hwfn *p_hwfn,
+			     struct qed_ooo_info *p_ooo_info,
+			     struct qed_ooo_buffer *p_buffer);
+
+struct qed_ooo_buffer *
+qed_ooo_get_free_buffer(struct qed_hwfn *p_hwfn,
+			struct qed_ooo_info *p_ooo_info);
+
+void qed_ooo_put_ready_buffer(struct qed_hwfn *p_hwfn,
+			      struct qed_ooo_info *p_ooo_info,
+			      struct qed_ooo_buffer *p_buffer, u8 on_tail);
+
+struct qed_ooo_buffer *
+qed_ooo_get_ready_buffer(struct qed_hwfn *p_hwfn,
+			 struct qed_ooo_info *p_ooo_info);
+
+void qed_ooo_delete_isles(struct qed_hwfn *p_hwfn,
+			  struct qed_ooo_info *p_ooo_info,
+			  u32 cid, u8 drop_isle, u8 drop_size);
+
+void qed_ooo_add_new_isle(struct qed_hwfn *p_hwfn,
+			  struct qed_ooo_info *p_ooo_info,
+			  u32 cid,
+			  u8 ooo_isle, struct qed_ooo_buffer *p_buffer);
+
+void qed_ooo_add_new_buffer(struct qed_hwfn *p_hwfn,
+			    struct qed_ooo_info *p_ooo_info,
+			    u32 cid,
+			    u8 ooo_isle,
+			    struct qed_ooo_buffer *p_buffer, u8 buffer_side);
+
+void qed_ooo_join_isles(struct qed_hwfn *p_hwfn,
+			struct qed_ooo_info *p_ooo_info, u32 cid,
+			u8 left_isle);
+#else /* IS_ENABLED(CONFIG_QED_ISCSI) */
+static inline void qed_ooo_save_history_entry(struct qed_hwfn *p_hwfn,
+					      struct qed_ooo_info *p_ooo_info,
+					      struct ooo_opaque *p_cqe) {}
+
+static inline struct qed_ooo_info *qed_ooo_alloc(
+				struct qed_hwfn *p_hwfn) { return NULL; }
+
+static inline void
+qed_ooo_release_connection_isles(struct qed_hwfn *p_hwfn,
+				 struct qed_ooo_info *p_ooo_info,
+				 u32 cid) {}
+
+static inline void qed_ooo_release_all_isles(struct qed_hwfn *p_hwfn,
+					     struct qed_ooo_info *p_ooo_info)
+					     {}
+
+static inline void qed_ooo_setup(struct qed_hwfn *p_hwfn,
+				 struct qed_ooo_info *p_ooo_info) {}
+
+static inline void qed_ooo_free(struct qed_hwfn *p_hwfn,
+				struct qed_ooo_info *p_ooo_info) {}
+
+static inline void qed_ooo_put_free_buffer(struct qed_hwfn *p_hwfn,
+					   struct qed_ooo_info *p_ooo_info,
+					   struct qed_ooo_buffer *p_buffer) {}
+
+static inline struct qed_ooo_buffer *
+qed_ooo_get_free_buffer(struct qed_hwfn *p_hwfn,
+			struct qed_ooo_info *p_ooo_info) { return NULL; }
+
+static inline void qed_ooo_put_ready_buffer(struct qed_hwfn *p_hwfn,
+					    struct qed_ooo_info *p_ooo_info,
+					    struct qed_ooo_buffer *p_buffer,
+					    u8 on_tail) {}
+
+static inline struct qed_ooo_buffer *
+qed_ooo_get_ready_buffer(struct qed_hwfn *p_hwfn,
+			 struct qed_ooo_info *p_ooo_info) { return NULL; }
+
+static inline void qed_ooo_delete_isles(struct qed_hwfn *p_hwfn,
+					struct qed_ooo_info *p_ooo_info,
+					u32 cid, u8 drop_isle, u8 drop_size) {}
+
+static inline void qed_ooo_add_new_isle(struct qed_hwfn *p_hwfn,
+					struct qed_ooo_info *p_ooo_info,
+					u32 cid, u8 ooo_isle,
+					struct qed_ooo_buffer *p_buffer) {}
+
+static inline void qed_ooo_add_new_buffer(struct qed_hwfn *p_hwfn,
+					  struct qed_ooo_info *p_ooo_info,
+					  u32 cid, u8 ooo_isle,
+					  struct qed_ooo_buffer *p_buffer,
+					  u8 buffer_side) {}
+
+static inline void qed_ooo_join_isles(struct qed_hwfn *p_hwfn,
+				      struct qed_ooo_info *p_ooo_info, u32 cid,
+				      u8 left_isle) {}
+#endif /* IS_ENABLED(CONFIG_QED_ISCSI) */
+
+#endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_roce.c b/drivers/net/ethernet/qlogic/qed/qed_roce.c
index 6a353ff..2a16547 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_roce.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_roce.c
@@ -2771,6 +2771,7 @@ static int qed_roce_ll2_tx(struct qed_dev *cdev,
 	/* Tx header */
 	rc = qed_ll2_prepare_tx_packet(QED_LEADING_HWFN(cdev), roce_ll2->handle,
 				       1 + pkt->n_seg, 0, flags, 0,
+				       QED_LL2_TX_DEST_NW,
 				       qed_roce_flavor, pkt->header.baddr,
 				       pkt->header.len, pkt, 1);
 	if (rc) {
diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index 56d2f64..f022469 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -26,6 +26,7 @@
 #include "qed_int.h"
 #include "qed_iscsi.h"
 #include "qed_mcp.h"
+#include "qed_ooo.h"
 #include "qed_reg_addr.h"
 #include "qed_sp.h"
 #include "qed_sriov.h"
@@ -281,6 +282,14 @@ static int qed_spq_hw_post(struct qed_hwfn *p_hwfn,
 	case PROTOCOLID_ISCSI:
 		if (!IS_ENABLED(CONFIG_QED_ISCSI))
 			return -EINVAL;
+		if (p_eqe->opcode == ISCSI_EVENT_TYPE_ASYN_DELETE_OOO_ISLES) {
+			u32 cid = le32_to_cpu(p_eqe->data.iscsi_info.cid);
+
+			qed_ooo_release_connection_isles(p_hwfn,
+							 p_hwfn->p_ooo_info,
+							 cid);
+			return 0;
+		}
 
 		if (p_hwfn->p_iscsi_info->event_cb) {
 			struct qed_iscsi_info *p_iscsi = p_hwfn->p_iscsi_info;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next 5/8] net/sched: cls_flower: Add offload support using egress Hardware device
From: kbuild test robot @ 2016-12-01  8:11 UTC (permalink / raw)
  To: Hadar Hen Zion
  Cc: kbuild-all, David S. Miller, netdev, Saeed Mahameed, Jiri Pirko,
	Amir Vadai, Or Gerlitz, Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-6-git-send-email-hadarh@mellanox.com>

[-- Attachment #1: Type: text/plain, Size: 1105 bytes --]

Hi Hadar,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Hadar-Hen-Zion/Offloading-tc-rules-using-underline-Hardware-device/20161201-155727
config: i386-randconfig-x004-201648 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   net/sched/cls_api.c: In function 'tcf_exts_get_dev':
>> net/sched/cls_api.c:696:13: error: dereferencing pointer to incomplete type 'const struct tc_action_ops'
      if (a->ops->get_dev) {
                ^~

vim +696 net/sched/cls_api.c

   690	
   691		if (tc_no_actions(exts))
   692			return -EINVAL;
   693	
   694		tcf_exts_to_list(exts, &actions);
   695		list_for_each_entry(a, &actions, list) {
 > 696			if (a->ops->get_dev) {
   697				a->ops->get_dev(a, dev_net(dev), hw_dev);
   698				break;
   699			}

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26047 bytes --]

^ permalink raw reply

* Re: linux-next: manual merge of the net-next tree with the net tree
From: Jiri Pirko @ 2016-12-01  7:56 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Networking, linux-next, linux-kernel, Jiri Pirko,
	Roi Dayan
In-Reply-To: <20161201124159.79f2e4f9@canb.auug.org.au>

Thu, Dec 01, 2016 at 02:41:59AM CET, sfr@canb.auug.org.au wrote:
>Hi all,
>
>Today's linux-next merge of the net-next tree got a conflict in:
>
>  net/sched/cls_flower.c
>
>between commit:
>
>  725cbb62e7ad ("sched: cls_flower: remove from hashtable only in case skip sw flag is not set")
>
>from the net tree and commit:
>
>  13fa876ebd03 ("net/sched: cls_flower: merge filter delete/destroy common code")
>
>from the net-next tree.
>
>I fixed it up (see below) and can carry the fix as necessary. This
>is now fixed as far as linux-next is concerned, but any non trivial
>conflicts should be mentioned to your upstream maintainer when your tree
>is submitted for merging.  You may also want to consider cooperating
>with the maintainer of the conflicting tree to minimise any particularly
>complex conflicts.

Looks fine to me. Thanks.

>
>-- 
>Cheers,
>Stephen Rothwell
>
>diff --cc net/sched/cls_flower.c
>index 904442421db3,e8dd09af0d0c..000000000000
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@@ -273,24 -272,14 +276,32 @@@ static void fl_hw_update_stats(struct t
>  	dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
>  }
>  
> +static void fl_destroy_sleepable(struct work_struct *work)
> +{
> +	struct cls_fl_head *head = container_of(work, struct cls_fl_head,
> +						work);
> +	if (head->mask_assigned)
> +		rhashtable_destroy(&head->ht);
> +	kfree(head);
> +	module_put(THIS_MODULE);
> +}
> +
> +static void fl_destroy_rcu(struct rcu_head *rcu)
> +{
> +	struct cls_fl_head *head = container_of(rcu, struct cls_fl_head, rcu);
> +
> +	INIT_WORK(&head->work, fl_destroy_sleepable);
> +	schedule_work(&head->work);
> +}
> +
>+ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
>+ {
>+ 	list_del_rcu(&f->list);
>+ 	fl_hw_destroy_filter(tp, (unsigned long)f);
>+ 	tcf_unbind_filter(tp, &f->res);
>+ 	call_rcu(&f->rcu, fl_destroy_filter);
>+ }
>+ 
>  static bool fl_destroy(struct tcf_proto *tp, bool force)
>  {
>  	struct cls_fl_head *head = rtnl_dereference(tp->root);
>@@@ -299,14 -288,12 +310,11 @@@
>  	if (!force && !list_empty(&head->filters))
>  		return false;
>  
>- 	list_for_each_entry_safe(f, next, &head->filters, list) {
>- 		fl_hw_destroy_filter(tp, (unsigned long)f);
>- 		list_del_rcu(&f->list);
>- 		call_rcu(&f->rcu, fl_destroy_filter);
>- 	}
>+ 	list_for_each_entry_safe(f, next, &head->filters, list)
>+ 		__fl_delete(tp, f);
> -	RCU_INIT_POINTER(tp->root, NULL);
> -	if (head->mask_assigned)
> -		rhashtable_destroy(&head->ht);
> -	kfree_rcu(head, rcu);
> +
> +	__module_get(THIS_MODULE);
> +	call_rcu(&head->rcu, fl_destroy_rcu);
>  	return true;
>  }
>  
>@@@ -761,13 -782,9 +804,10 @@@ static int fl_delete(struct tcf_proto *
>  	struct cls_fl_head *head = rtnl_dereference(tp->root);
>  	struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
>  
> -	rhashtable_remove_fast(&head->ht, &f->ht_node,
> -			       head->ht_params);
> +	if (!tc_skip_sw(f->flags))
> +		rhashtable_remove_fast(&head->ht, &f->ht_node,
> +				       head->ht_params);
>- 	list_del_rcu(&f->list);
>- 	fl_hw_destroy_filter(tp, (unsigned long)f);
>- 	tcf_unbind_filter(tp, &f->res);
>- 	call_rcu(&f->rcu, fl_destroy_filter);
>+ 	__fl_delete(tp, f);
>  	return 0;
>  }
>  

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-12-01  7:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129162113.GC742-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Nov 29, 2016 at 09:21:13AM -0700, Jason Gunthorpe wrote:
>On Mon, Nov 28, 2016 at 10:29:38PM -0800, Vishwanathapura, Niranjana wrote:
>> On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
>> >>And will move the hfi_vnic module under
>> >>???drivers/infiniband/ulp/hfi_vnic???.
>> >
>> >I would prefer drivers/net/ethernet
>> >
>> >This is clearly not a ULP since it doesn't use verbs.
>> >
>>
>> I understand it is not using verbs, but the control path (ib_device client)
>> is using verbs (IB MAD).
>> Our prefernce is to keep it somewhere under drivers/infiniband. Summarizing
>> reasons again here,
>>
>> - VNIC control driver (ib_device client) is an IB MAD agent.
>> - It is purly a software construct, encapsualtes ethernet packets in
>> Omni-path packet and depends on hfi1 driver here for HW access.
>
>Is the majority of the code MAD focused or net stack focused?
>
>I'm not sure it matters, it isn't like we can review Intel's
>proprietary mad stuff anyhow. :\
>
>Jason

That is an intersting measure. In hfi_vnic driver, I would say, >60% of the 
code is MAD focused, mainly interfacing with the IB MAD agent.
It also includes populating/parsing those MAD packets. At the least it is not 
supporting the driver to be put under net folder.

Even in the remaining <40%, half of it is involved with encapsulating ethernet 
frames with Omni-path header (does this makes it belong under 
drivers/infiniband/hw?).
The net stack interface part is pretty standard, hence is not much of code.

I do see the reason to put it under net folder, but I am seeing more reason for 
it to be somewhere under drivers/infiniband.

Niranjana

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-12-01  7:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Weiny, Ira, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Dalessandro, Dennis
In-Reply-To: <20161129165009.GA3167-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Nov 29, 2016 at 09:50:09AM -0700, Jason Gunthorpe wrote:
>On Tue, Nov 29, 2016 at 04:44:37PM +0000, Hefty, Sean wrote:
>> > You are not making a subsystem. Don't overcomplicate things. A
>> > multi-part device device can just directly link.
>>
>> The VNIC may be usable over multiple generations of HFIs, but I
>> don't know if that is the intent.
>
>If Intel wants to build a HFI subystem within RDMA with multiple
>drivers then sure, but they are not there yet, and we don't even know
>what that could look like. So it is better to leave it simple for now.
>
>Jason

Sorry for the delay, I was weighing in couple options.
We envisioned vnic as a pure software construct and hence should be independent 
(like ipoib). ie., both hfi_vnic and hfi1 should be independently loadable 
(like ipoib) despite hfi_vnic being dependent on hfi1 here for HW access.

There doesn't seem to be much value of hfi_vnic being a 'ib client', if it 
still has compilation and module dependency on hfi1 module.

The more I think of it, having vnic ops added to ib device structure (option 
(b)) makes it cleaner (no dependency). We can probably consider extending 'ib 
device' in hfi1 in order for hfi_vnic to get to the vnic ops. But (b) makes it 
simpler.

Though Jason's suggestion could be a temporary measure for this patch series, 
the above approach is what I would like to target here.

Niranjana

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net] tipc: check minimum bearer MTU
From: kbuild test robot @ 2016-12-01  7:21 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: kbuild-all, Jon Maloy, Ying Xue, David S. Miller, tipc-discussion,
	netdev, linux-kernel, Ben Hutchings, Qian Zhang
In-Reply-To: <20161130095702.DD033A0F14@unicorn.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

Hi Michal,

[auto build test ERROR on net/master]

url:    https://github.com/0day-ci/linux/commits/Michal-Kubecek/tipc-check-minimum-bearer-MTU/20161201-140555
config: i386-randconfig-s0-201648 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   net/tipc/bearer.o: In function `tipc_check_mtu':
>> bearer.c:(.text+0xadd): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/core.o: In function `tipc_check_mtu':
   core.c:(.text+0x239): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/link.o: In function `tipc_check_mtu':
   link.c:(.text+0x110e): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/discover.o: In function `tipc_check_mtu':
   discover.c:(.text+0x2c7): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/name_distr.o: In function `tipc_check_mtu':
   name_distr.c:(.text+0x536): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/monitor.o: In function `tipc_check_mtu':
   monitor.c:(.text+0x906): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/name_table.o: In function `tipc_check_mtu':
   name_table.c:(.text+0x875): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/net.o: In function `tipc_check_mtu':
   net.c:(.text+0xc4): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/netlink.o: In function `tipc_check_mtu':
   netlink.c:(.text+0x0): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/netlink_compat.o: In function `tipc_check_mtu':
   netlink_compat.c:(.text+0x2a9e): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/node.o: In function `tipc_check_mtu':
   node.c:(.text+0x1dd8): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/socket.o: In function `tipc_check_mtu':
   socket.c:(.text+0x62bb): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/eth_media.o: In function `tipc_check_mtu':
   eth_media.c:(.text+0x117): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
   net/tipc/udp_media.o: In function `tipc_check_mtu':
   udp_media.c:(.text+0x10b9): multiple definition of `tipc_check_mtu'
   net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25827 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] tcp: allow to turn tcp timestamp randomization off
From: Yuchung Cheng @ 2016-12-01  7:12 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480508930-24406-2-git-send-email-fw@strlen.de>

On Wed, Nov 30, 2016 at 4:28 AM, Florian Westphal <fw@strlen.de> wrote:
> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduct the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
>
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  change since v2: do check in secure_tcpv4/6_sequence_number so outgoing
>  syn packets won't have a random offset either in if randomization is off.
>
>  Tested:
>  sysctl_tcp_timestamps==1, tcpdump on lo, both ends have same values.
>
>  Documentation/networking/ip-sysctl.txt | 9 +++++++--
>  net/core/secure_seq.c                  | 5 +++--
>  net/ipv4/tcp_input.c                   | 3 ++-
>  3 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 5af48dd7c5fc..de2448313799 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -610,8 +610,13 @@ tcp_syn_retries - INTEGER
>         with the current initial RTO of 1second. With this the final timeout
>         for an active TCP connection attempt will happen after 127seconds.
>
> -tcp_timestamps - BOOLEAN
> -       Enable timestamps as defined in RFC1323.
> +tcp_timestamps - INTEGER
> +Enable timestamps as defined in RFC1323.
> +       0: Disabled.
> +       1: Enable timestamps as defined in RFC1323.
> +       2: Like 1, but also use a random offset for each connection
> +       rather than only using the current time.
> +       Default: 2
Small suggestion: I suspect host/server configs manually set the knob
to 1. Perhaps swap 1 and 2 to maximize the coverage of this new
feature?

>
>  tcp_min_tso_segs - INTEGER
>         Minimal number of segments per TSO frame.
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index a8d6062cbb4a..36addd3d9633 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -12,6 +12,7 @@
>  #include <net/secure_seq.h>
>
>  #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
> +#include <net/tcp.h>
>  #define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
>
>  static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
> @@ -58,7 +59,7 @@ u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
>
>         md5_transform(hash, secret);
>
> -       *tsoff = hash[1];
> +       *tsoff = sysctl_tcp_timestamps == 2 ? hash[1] : 0;
>         return seq_scale(hash[0]);
>  }
>  EXPORT_SYMBOL(secure_tcpv6_sequence_number);
> @@ -100,7 +101,7 @@ u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
>
>         md5_transform(hash, net_secret);
>
> -       *tsoff = hash[1];
> +       *tsoff = sysctl_tcp_timestamps == 2 ? hash[1] : 0;
>         return seq_scale(hash[0]);
>  }
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 1b1921c71f7c..5f6d4efd2551 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -76,7 +76,7 @@
>  #include <asm/unaligned.h>
>  #include <linux/errqueue.h>
>
> -int sysctl_tcp_timestamps __read_mostly = 1;
> +int sysctl_tcp_timestamps __read_mostly = 2;
>  int sysctl_tcp_window_scaling __read_mostly = 1;
>  int sysctl_tcp_sack __read_mostly = 1;
>  int sysctl_tcp_fack __read_mostly = 1;
> @@ -85,6 +85,7 @@ int sysctl_tcp_dsack __read_mostly = 1;
>  int sysctl_tcp_app_win __read_mostly = 31;
>  int sysctl_tcp_adv_win_scale __read_mostly = 1;
>  EXPORT_SYMBOL(sysctl_tcp_adv_win_scale);
> +EXPORT_SYMBOL(sysctl_tcp_timestamps);
>
>  /* rfc5961 challenge ack rate limiting */
>  int sysctl_tcp_challenge_ack_limit = 1000;
> --
> 2.7.3
>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] tcp: randomize tcp timestamp offsets for each connection
From: Yuchung Cheng @ 2016-12-01  7:05 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480508930-24406-1-git-send-email-fw@strlen.de>

On Wed, Nov 30, 2016 at 4:28 AM, Florian Westphal <fw@strlen.de> wrote:
>
> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
>
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
>
> So only two items are left:
>  - add a tsoffset for request sockets
>  - extend the tcp isn generator to also return another 32bit number
>    in addition to the ISN.
>
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
>
> Includes fixes from Eric Dumazet.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Acked-by: Eric Dumazet <edumazet@google.com>
> ---

Acked-by: Yuchung Cheng <ycheng@google.com>

Nice feature!

>
>  No changes since v1, preserved Erics ack.
>
>  include/linux/tcp.h      |  1 +
>  include/net/secure_seq.h |  8 ++++----
>  include/net/tcp.h        |  2 +-
>  net/core/secure_seq.c    | 10 ++++++----
>  net/ipv4/syncookies.c    |  1 +
>  net/ipv4/tcp_input.c     |  7 ++++++-
>  net/ipv4/tcp_ipv4.c      |  9 +++++----
>  net/ipv4/tcp_minisocks.c |  4 +++-
>  net/ipv4/tcp_output.c    |  2 +-
>  net/ipv6/syncookies.c    |  1 +
>  net/ipv6/tcp_ipv6.c      | 10 ++++++----
>  11 files changed, 35 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 32a7c7e35b71..2408bcc579f1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -123,6 +123,7 @@ struct tcp_request_sock {
>         u32                             txhash;
>         u32                             rcv_isn;
>         u32                             snt_isn;
> +       u32                             ts_off;
>         u32                             last_oow_ack_time; /* last SYNACK */
>         u32                             rcv_nxt; /* the ack # by SYNACK. For
>                                                   * FastOpen it's the seq#
> diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
> index 3f36d45b714a..0caee631a836 100644
> --- a/include/net/secure_seq.h
> +++ b/include/net/secure_seq.h
> @@ -6,10 +6,10 @@
>  u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
>  u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
>                                __be16 dport);
> -__u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> -                                __be16 sport, __be16 dport);
> -__u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> -                                  __be16 sport, __be16 dport);
> +u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> +                              __be16 sport, __be16 dport, u32 *tsoff);
> +u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> +                                __be16 sport, __be16 dport, u32 *tsoff);
>  u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
>                                 __be16 sport, __be16 dport);
>  u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7de80739adab..1c09d909bd43 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1809,7 +1809,7 @@ struct tcp_request_sock_ops {
>         struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
>                                        const struct request_sock *req,
>                                        bool *strict);
> -       __u32 (*init_seq)(const struct sk_buff *skb);
> +       __u32 (*init_seq)(const struct sk_buff *skb, u32 *tsoff);
>         int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
>                            struct flowi *fl, struct request_sock *req,
>                            struct tcp_fastopen_cookie *foc,
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index fd3ce461fbe6..a8d6062cbb4a 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -40,8 +40,8 @@ static u32 seq_scale(u32 seq)
>  #endif
>
>  #if IS_ENABLED(CONFIG_IPV6)
> -__u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> -                                  __be16 sport, __be16 dport)
> +u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> +                                __be16 sport, __be16 dport, u32 *tsoff)
>  {
>         u32 secret[MD5_MESSAGE_BYTES / 4];
>         u32 hash[MD5_DIGEST_WORDS];
> @@ -58,6 +58,7 @@ __u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
>
>         md5_transform(hash, secret);
>
> +       *tsoff = hash[1];
>         return seq_scale(hash[0]);
>  }
>  EXPORT_SYMBOL(secure_tcpv6_sequence_number);
> @@ -86,8 +87,8 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
>
>  #ifdef CONFIG_INET
>
> -__u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> -                                __be16 sport, __be16 dport)
> +u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> +                              __be16 sport, __be16 dport, u32 *tsoff)
>  {
>         u32 hash[MD5_DIGEST_WORDS];
>
> @@ -99,6 +100,7 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
>
>         md5_transform(hash, net_secret);
>
> +       *tsoff = hash[1];
>         return seq_scale(hash[0]);
>  }
>
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 0dc6286272aa..3e88467d70ee 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -334,6 +334,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>         treq = tcp_rsk(req);
>         treq->rcv_isn           = ntohl(th->seq) - 1;
>         treq->snt_isn           = cookie;
> +       treq->ts_off            = 0;
>         req->mss                = mss;
>         ireq->ir_num            = ntohs(th->dest);
>         ireq->ir_rmt_port       = th->source;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 22e6a2097ff6..1b1921c71f7c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6301,6 +6301,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>                 goto drop;
>
>         tcp_rsk(req)->af_specific = af_ops;
> +       tcp_rsk(req)->ts_off = 0;
>
>         tcp_clear_options(&tmp_opt);
>         tmp_opt.mss_clamp = af_ops->mss_clamp;
> @@ -6322,6 +6323,9 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>         if (security_inet_conn_request(sk, skb, req))
>                 goto drop_and_free;
>
> +       if (isn && tmp_opt.tstamp_ok)
> +               af_ops->init_seq(skb, &tcp_rsk(req)->ts_off);
> +
>         if (!want_cookie && !isn) {
>                 /* VJ's idea. We save last timestamp seen
>                  * from the destination in peer table, when entering
> @@ -6362,7 +6366,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>                         goto drop_and_release;
>                 }
>
> -               isn = af_ops->init_seq(skb);
> +               isn = af_ops->init_seq(skb, &tcp_rsk(req)->ts_off);
>         }
>         if (!dst) {
>                 dst = af_ops->route_req(sk, &fl, req, NULL);
> @@ -6374,6 +6378,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>
>         if (want_cookie) {
>                 isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
> +               tcp_rsk(req)->ts_off = 0;
>                 req->cookie_ts = tmp_opt.tstamp_ok;
>                 if (!tmp_opt.tstamp_ok)
>                         inet_rsk(req)->ecn_ok = 0;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 5555eb86e549..b50f05905ced 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -95,12 +95,12 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
>  struct inet_hashinfo tcp_hashinfo;
>  EXPORT_SYMBOL(tcp_hashinfo);
>
> -static  __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
> +static u32 tcp_v4_init_sequence(const struct sk_buff *skb, u32 *tsoff)
>  {
>         return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
>                                           ip_hdr(skb)->saddr,
>                                           tcp_hdr(skb)->dest,
> -                                         tcp_hdr(skb)->source);
> +                                         tcp_hdr(skb)->source, tsoff);
>  }
>
>  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
> @@ -237,7 +237,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>                 tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
>                                                            inet->inet_daddr,
>                                                            inet->inet_sport,
> -                                                          usin->sin_port);
> +                                                          usin->sin_port,
> +                                                          &tp->tsoffset);
>
>         inet->inet_id = tp->write_seq ^ jiffies;
>
> @@ -824,7 +825,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
>         tcp_v4_send_ack(sk, skb, seq,
>                         tcp_rsk(req)->rcv_nxt,
>                         req->rsk_rcv_wnd >> inet_rsk(req)->rcv_wscale,
> -                       tcp_time_stamp,
> +                       tcp_time_stamp + tcp_rsk(req)->ts_off,
>                         req->ts_recent,
>                         0,
>                         tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 6234ebaa7db1..28ce5ee831f5 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -532,7 +532,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>                         newtp->rx_opt.ts_recent_stamp = 0;
>                         newtp->tcp_header_len = sizeof(struct tcphdr);
>                 }
> -               newtp->tsoffset = 0;
> +               newtp->tsoffset = treq->ts_off;
>  #ifdef CONFIG_TCP_MD5SIG
>                 newtp->md5sig_info = NULL;      /*XXX*/
>                 if (newtp->af_specific->md5_lookup(sk, newsk))
> @@ -581,6 +581,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>
>                 if (tmp_opt.saw_tstamp) {
>                         tmp_opt.ts_recent = req->ts_recent;
> +                       if (tmp_opt.rcv_tsecr)
> +                               tmp_opt.rcv_tsecr -= tcp_rsk(req)->ts_off;
>                         /* We do not store true stamp, but it is not required,
>                          * it can be estimated (approximately)
>                          * from another data.
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 19105b46a304..1b6d5f34bf45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -640,7 +640,7 @@ static unsigned int tcp_synack_options(struct request_sock *req,
>         }
>         if (likely(ireq->tstamp_ok)) {
>                 opts->options |= OPTION_TS;
> -               opts->tsval = tcp_skb_timestamp(skb);
> +               opts->tsval = tcp_skb_timestamp(skb) + tcp_rsk(req)->ts_off;
>                 opts->tsecr = req->ts_recent;
>                 remaining -= TCPOLEN_TSTAMP_ALIGNED;
>         }
> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index 97830a6a9cbb..a4d49760bf43 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -209,6 +209,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>         treq->snt_synack.v64    = 0;
>         treq->rcv_isn = ntohl(th->seq) - 1;
>         treq->snt_isn = cookie;
> +       treq->ts_off = 0;
>
>         /*
>          * We need to lookup the dst_entry to get the correct window size.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 28ec0a2e7b72..a2185a214abc 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -101,12 +101,12 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
>         }
>  }
>
> -static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
> +static u32 tcp_v6_init_sequence(const struct sk_buff *skb, u32 *tsoff)
>  {
>         return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
>                                             ipv6_hdr(skb)->saddr.s6_addr32,
>                                             tcp_hdr(skb)->dest,
> -                                           tcp_hdr(skb)->source);
> +                                           tcp_hdr(skb)->source, tsoff);
>  }
>
>  static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> @@ -283,7 +283,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
>                 tp->write_seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32,
>                                                              sk->sk_v6_daddr.s6_addr32,
>                                                              inet->inet_sport,
> -                                                            inet->inet_dport);
> +                                                            inet->inet_dport,
> +                                                            &tp->tsoffset);
>
>         err = tcp_connect(sk);
>         if (err)
> @@ -956,7 +957,8 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
>                         tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
>                         tcp_rsk(req)->rcv_nxt,
>                         req->rsk_rcv_wnd >> inet_rsk(req)->rcv_wscale,
> -                       tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
> +                       tcp_time_stamp + tcp_rsk(req)->ts_off,
> +                       req->ts_recent, sk->sk_bound_dev_if,
>                         tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
>                         0, 0);
>  }
> --
> 2.7.3
>

^ permalink raw reply

* Re: Re: [PATCH] ipv6:ip6_xmit remove unnecessary np NULL check
From: Eric Dumazet @ 2016-12-01  6:29 UTC (permalink / raw)
  To: r.thapliyal, Lorenzo Colitti
  Cc: Manjeet Pawar, davem@davemloft.net, kuznet@ms2.inr.ac.ru,
	jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	PANKAJ MISHRA, Ajeet Kumar Yadav
In-Reply-To: <20161201061422epcms5p83ef5b41a335c4556dd2bc725eed38d8d@epcms5p8>

On Thu, 2016-12-01 at 06:14 +0000, Rohit Thapliyal wrote:
> Hi,
> 
Hi,

Do not top post on netdev, and do not use HTML format : it wont reach
netdev.

 
> 
> Found at just one place in ping_v6_sendmsg, where np is checked for
> NULL.
> 
> And I am not sure, if it is really required there also. 
> 
> So, I am not sure if sk_fullsock will ever return NULL in inet6_sk.
> 
Well I am sure. You can trust me. You can read git history for the
details.
> 
> sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
> 


Yes, Lorenzo mentioned this ping_v6_sendmsg() useless code, but
apparently never sent the cleanup. This is really minor you know...

diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 66e2d9dfc43a87ebed092d024e5bf2752b755d0e..775a9365725dd400c36e2510c685d08bf4a7d4b0 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -125,12 +125,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		return PTR_ERR(dst);
 	rt = (struct rt6_info *) dst;
 
-	np = inet6_sk(sk);
-	if (!np) {
-		err = -EBADF;
-		goto dst_err_out;
-	}
-
 	if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
 		fl6.flowi6_oif = np->mcast_oif;
 	else if (!fl6.flowi6_oif)
@@ -165,7 +159,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 	release_sock(sk);
 
-dst_err_out:
 	dst_release(dst);
 
 	if (err)




> 
> Thanks, 
> 
> Rohit Thapliyal
> 
>  
> 
> --------- Original Message ---------
> 
> Sender : Eric Dumazet <eric.dumazet@gmail.com>
> 
> Date : 2016-11-30 00:26 (GMT+9)
> 
> Title : Re: [PATCH] ipv6:ip6_xmit remove unnecessary np NULL check
> 
>  
> 
> On Tue, 2016-11-29 at 12:02 +0530, Manjeet Pawar wrote:
> > From: Rohit Thapliyal <r.thapliyal@samsung.com>
> > 
> > np NULL check doesn't seem required here as it shall never
> > be NULL anyways in inet6_sk(sk).
> > 
> > Signed-off-by: Rohit Thapliyal <r.thapliyal@samsung.com>
> > Signed-off-by: Manjeet Pawar <manjeet.p@samsung.com>
> > Signed-off-by: David Miller <davem@davemloft.net>
> > Reviewed-by: Akhilesh Kumar <akhilesh.k@samsung.com>
> > 
> > ---
> > v2->v3: Modified as per the suggestion from David Miller
> >         ip6_xmit calls are made without checking NULL np
> >         pointer, so no need to explicitly check NULL np in
> >         ip6_xmit.
> > 
> >  include/linux/ipv6.h  | 2 +-
> >  net/ipv6/ip6_output.c | 3 +--
> >  2 files changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > index a064997..6c9c604 100644
> > --- a/include/linux/ipv6.h
> > +++ b/include/linux/ipv6.h
> > @@ -299,7 +299,7 @@ struct tcp6_timewait_sock {
> >  
> >  static inline struct ipv6_pinfo *inet6_sk(const struct sock *__sk)
> >  {
> > -	return sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
> > +	return inet_sk(__sk)->pinet6;
> 
> 
> David suggestion was about np being NULL or not in ip6_xmit()
> 
> But have you checked inet6_sk() was never called for a TCPv6 TIMEWAIT or
> SYN_RECV request ?
> 
> >  }
> >  
> >  static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 59eb4ed..f8c63ec 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -213,8 +213,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
> >  	/*
> >  	 *	Fill in the IPv6 header
> >  	 */
> > -	if (np)
> > -		hlimit = np->hop_limit;
> > +	hlimit = np->hop_limit;
> >  	if (hlimit < 0)
> >  		hlimit = ip6_dst_hoplimit(dst);
> >  
> 
> This part is fine.
> 
> 
> 
> 
> --
> 
> Thanks and Regards,
> 
> Rohit Thapliyal
> 
> Lead Engineer
> 
> Samsung SRI-D
> Phone: +91-9717627969
> 
> 
> 
> 
> 
> 

^ permalink raw reply related

* Re: [PATCH net-next v6 4/6] bpf: Add support for reading socket family, type, protocol
From: Alexei Starovoitov @ 2016-12-01  5:53 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480529810-25850-5-git-send-email-dsa@cumulusnetworks.com>

On Wed, Nov 30, 2016 at 10:16:48AM -0800, David Ahern wrote:
> Add socket family, type and protocol to bpf_sock allowing bpf programs
> read-only access.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> +static u32 convert_sock_access(int sock_field, int dst_reg, int src_reg,
> +			       struct bpf_insn *insn_buf)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +
> +	switch (sock_field) {
> +	case SOCKF_AD_TYPE:
> +		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +				      offsetof(struct sock, __sk_flags_offset));
> +		*insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, SK_FL_TYPE_MASK);
> +		*insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, SK_FL_TYPE_SHIFT);
> +		break;

Wasn't that complex after all? ;)
This part looks good.

> +	case offsetof(struct bpf_sock, type):
> +		return convert_sock_access(SOCKF_AD_TYPE, dst_reg, src_reg,
> +					   insn_buf);

my only request-for-change is to place the above 3 lines of *insn++ here
instead of adding new helper function for it.
In case of skb it was a helper, because it was called from different places.
Here this is the only place where it ever be needed, so extra helper
only confuses the reader.

The rest of the patches look good to me.
I'm not Ack-ing them yet, since you didn't carry over my Ack
from the first patch and I had to re-review it again this time.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v6 6/6] samples/bpf: add userspace example for prohibiting sockets
From: Alexei Starovoitov @ 2016-12-01  5:59 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480529810-25850-7-git-send-email-dsa@cumulusnetworks.com>

On Wed, Nov 30, 2016 at 10:16:50AM -0800, David Ahern wrote:
> Add examples preventing a process in a cgroup from opening a socket
> based family, protocol and type.
> 
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> +++ b/samples/bpf/sock_flags_kern.c
> @@ -0,0 +1,37 @@
> +#include <uapi/linux/bpf.h>
> +#include <linux/socket.h>
> +#include "bpf_helpers.h"
> +
> +SEC("cgroup/sock1")
> +int bpf_prog1(struct bpf_sock *sk)
> +{
> +	char fmt[] = "socket: family %d type %d protocol %d\n";
> +
> +	bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
> +
> +	/* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets
> +	 * ie., make ping6 fail
> +	 */
> +	if (sk->family == PF_INET6 && sk->type == 3 && sk->protocol == 58)
> +		return 0;

why not to use SOCK_RAW and IPPROTO_ICMPV6 instead of constants?

Thanks for the test!

^ permalink raw reply

* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa @ 2016-12-01  5:30 UTC (permalink / raw)
  To: David Miller, david.lebrun; +Cc: netdev
In-Reply-To: <20161130.144946.28181822717893867.davem@davemloft.net>

On Wed, Nov 30, 2016, at 20:49, David Miller wrote:
> From: David Lebrun <david.lebrun@uclouvain.be>
> Date: Tue, 29 Nov 2016 18:15:18 +0100
> 
> > When multiple nexthops are available for a given route, the routing engine
> > chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> > and by taking that value modulo the number of nexthops. The resulting value
> > indexes the nexthop to select. This method causes issues when a new nexthop
> > is added or one is removed (e.g. link failure). In that case, the number
> > of nexthops changes and potentially all the flows get re-routed to another
> > nexthop.
> > 
> > This patch implements a consistent hash method to select the nexthop in
> > case of ECMP. The idea is to generate K slices (or intervals) for each
> > route with multiple nexthops. The nexthops are randomly assigned to those
> > slices, in a uniform manner. The number K is configurable through a sysctl
> > net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> > nexthop, the algorithm takes the flow hash and computes an index which is
> > the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> > simple binary AND operation (idx = hash & (K - 1)). The resulting index
> > references the selected nexthop. The lookup time complexity is thus O(1).
> > 
> > When a nexthop is added, it steals K/N slices from the other nexthops,
> > where N is the new number of nexthops. The slices are stolen randomly and
> > uniformly from the other nexthops. When a nexthop is removed, the orphan
> > slices are randomly reassigned to the other nexthops.
> > 
> > The number of slices for a route also fixes the maximum number of nexthops
> > possible for that route.
> > 
> > Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
> 
> Interesting approach, but like Hannes I worry about the memory
> consumption
> bounds.
> 
> Limiting to 1<<16 is interesting, but if you can limit to 1<<8 (256
> nexthops) maybe the state requirement can be compressed even further?

>From what I have heard the user space DPDK-alike implementations of
consistent hashing actually use as much memory as possible (2^32) to
consistently steer the flow to the target, so every u32 flow hash value
has a unique corresponding mapping. The slice size doesn't affect the
number of nexthops (that is limited by the datatype in the array, which
is in this patch a u8), thus limiting us to 256 nexthops we could refer
to independently of the slice size. Lowering the slices means that the
algorithm becomes inaccurate as more flows map to one nexthop selector,
thus causing more inconsistent routing decisions to be made when changes
happen.

> We can always increase this if necessary in the future if someone
> reports a reasonable use case that really needs it.  Let's start
> simple and small first.

If people really rely on consistent hashing for load balancers setups
where the backends don't synchronize their state across their server
(the IPv6 anycast use case), they would probably always want to go with
the full sized state table, as every update on this map would cause user
errors and inconsistent state on the backend.

Thus, I think that we should also allow for maximum allocations,
wherever they come from, as it is already out there and deployed.

Bye,
Hannes

^ permalink raw reply

* Re: pull-request: wireless-drivers 2016-11-29
From: Kalle Valo @ 2016-12-01  5:19 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20161130.143444.242053190479060640.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Kalle Valo <kvalo@codeaurora.org>
> Date: Tue, 29 Nov 2016 16:59:44 +0200
>
>> if there's still time here's one more patch to 3.9. I think this is good
>> to have in 3.9 as it fixes an issue where we were printing uninitialised
>> memory in mwifiex. I had this in wireless-drivers already for some time
>> as I was waiting for other fixes and nothing serious actually came up.
>> 
>> If this doesn't make it to 3.9 that's not a problem, I'll just merge
>> this to wireless-drivers-next. Let me know what you prefer.
>
> Unless you are in a time-machine of some sort I assume you mean "4.9" not
> "3.9" :-)
>
> Pulled, thanks.

Hah, I have no idea where I came up with this "3.9" :) I was trying mean
"4.9" of course, thanks for pulling.

-- 
Kalle Valo

^ permalink raw reply

* [PATCH net] net: bcmgenet: Utilize correct struct device for all DMA operations
From: Florian Fainelli @ 2016-12-01  5:15 UTC (permalink / raw)
  To: netdev; +Cc: davem, pgynther, jaedon.shin, opendmb, Florian Fainelli

__bcmgenet_tx_reclaim() is not using the struct device reference when
doing its unmap operation, which made the DMA-API debugging warn about
it. Fix this by always using &priv->pdev->dev throughout the driver,
using an identical device reference for all map/unmap calls.

Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 4464bc5db934..d7b553020d66 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -1172,6 +1172,7 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
 					  struct bcmgenet_tx_ring *ring)
 {
 	struct bcmgenet_priv *priv = netdev_priv(dev);
+	struct device *kdev = &priv->pdev->dev;
 	struct enet_cb *tx_cb_ptr;
 	struct netdev_queue *txq;
 	unsigned int pkts_compl = 0;
@@ -1199,13 +1200,13 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
 		if (tx_cb_ptr->skb) {
 			pkts_compl++;
 			bytes_compl += GENET_CB(tx_cb_ptr->skb)->bytes_sent;
-			dma_unmap_single(&dev->dev,
+			dma_unmap_single(kdev,
 					 dma_unmap_addr(tx_cb_ptr, dma_addr),
 					 dma_unmap_len(tx_cb_ptr, dma_len),
 					 DMA_TO_DEVICE);
 			bcmgenet_free_cb(tx_cb_ptr);
 		} else if (dma_unmap_addr(tx_cb_ptr, dma_addr)) {
-			dma_unmap_page(&dev->dev,
+			dma_unmap_page(kdev,
 				       dma_unmap_addr(tx_cb_ptr, dma_addr),
 				       dma_unmap_len(tx_cb_ptr, dma_len),
 				       DMA_TO_DEVICE);
-- 
1.9.1

^ permalink raw reply related

* Re: [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
From: Valo, Kalle @ 2016-12-01  5:13 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: k.eugene.e@gmail.com, Andy Gross, wcn36xx@lists.infradead.org,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org
In-Reply-To: <878ts0u7z3.fsf@kamboji.qca.qualcomm.com>

Kalle Valo <kvalo@qca.qualcomm.com> writes:

> "Valo, Kalle" <kvalo@qca.qualcomm.com> writes:
>
>> Bjorn Andersson <bjorn.andersson@linaro.org> writes:
>>
>>> On Wed 16 Nov 10:49 PST 2016, Kalle Valo wrote:
>>>
>>>> Bjorn Andersson <bjorn.andersson@linaro.org> wrote:
>>>> > The correct include file for getting errno constants and ERR_PTR() is
>>>> > linux/err.h, rather than linux/errno.h, so fix the include.
>>>> > 
>>>> > Fixes: e8b123e60084 ("soc: qcom: smem_state: Add stubs for disabled smem_state")
>>>> > Acked-by: Andy Gross <andy.gross@linaro.org>
>>>> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
>>>> 
>>>> For some reason this fails to compile now. Can you take a look, please?
>>>> 
>>>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
>>>> make[1]: *** [__modpost] Error 1
>>>> make: *** [modules] Error 2
>>>> 
>>>> 5 patches set to Changes Requested.
>>>> 
>>>> 9429045 [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
>>>> 9429047 [v5,2/5] wcn36xx: Transition driver to SMD client
>>>
>>> This patch was updated with the necessary depends in Kconfig to catch
>>> this exact issue and when I pull in your .config (which has QCOM_SMD=n,
>>> QCOM_WCNSS_CTRL=n and WCN36XX=y) I can build this just fine.
>>>
>>> I've tested the various combinations and it seems to work fine. Do you
>>> have any other patches in your tree?
>>
>> This was with the pending branch of my ath.git tree. There are other
>> wireless patches (ath10k etc) but I would guess they don't affect here.
>>
>>> Any stale objects?
>>
>> Not sure what you mean with this question, but I didn't run 'make clean'
>> if that's what you are asking.
>>
>>> Would you mind retesting this, before I invest more time in trying to
>>> reproduce the issue you're seeing?
>>
>> Sure, I'll take a look but that might take few days.
>
> I didn't find enough time to look at this in detail. I applied this to
> my ath.git pending branch, let's see what the kbuild bot finds.

It found the same problem. Interestingly I'm also building x86 with 32
bit, maybe it's related?

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git pending
head:   1ea16a1c457939b4564643f7637d5cc639a8d3b7
commit: 5eb09c672b01460804fd49b1c9cc7d1072a102f0 [96/99] wcn36xx: Transition driver to SMD client
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        git checkout 5eb09c672b01460804fd49b1c9cc7d1072a102f0
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!

-- 
Kalle Valo

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Tom Herbert @ 2016-12-01  5:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
	Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
	Achiad Shochat
In-Reply-To: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is problem or a feature I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too much cycles just dealing with an enormous
> working set, you would not activate BQL.
>
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> >      netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge case where we get completely unsatisfied results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, applications writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care of doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, process scheduler will also run this single thread in a
> single cpu, ie avoiding expensive migrations if they are not needed.
>
> Problem is this behavior locks one TX queue for the duration of the
> multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potentially
configurations that present different tradeoffs. For instance, TX
queues are plentiful now days to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large this would also helps to address device lock contention. The
other thing I'm wondering is if Willem's concepts distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack" maybe we can automatically backoff to
distributing the load to more queues in XPS somehow.

Tom

> Other flows that would share the locked CPU experience high P99
> latencies.
>
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NIC handle TX completion before doing RX processing from thei
> napi poll handler.
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?
>
>
>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  4:56 UTC (permalink / raw)
  To: wangyunjian
  Cc: Jason Wang, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7C38@szxeml561-mbx.china.huawei.com>

On Thu, Dec 01, 2016 at 04:41:40AM +0000, wangyunjian wrote:
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com] 
> >Sent: Thursday, December 01, 2016 11:37 AM
> >To: Michael S. Tsirkin
> >Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> >
> >
> >On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> >> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
> >>> >
> >>> >
> >>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> >>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> >>>>>> > > > >-----Original Message-----
> >>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
> >>>>>> > > > >To: wangyunjian
> >>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
> >>>>>> > > > >vger.kernel.org; caihe
> >>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call 
> >>>>>> > > > >the recvmsg when meet errors
> >>>>>> > > > >
> >>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
> >>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len 
> >>>>>> > > > >return 0 in this case, breaking the loop?
> >>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> >>>>> > > >The soft lockup might happened. The err is -EBADFD.
> >>>>> > > >
> >>>> > >OK, I'd like to figure out what happened here. why don't we get 0 
> >>>> > >when we peek at the head?
> >>>> > >
> >>>> > >EBADFD is from here:
> >>>> > >          struct tun_struct *tun = __tun_get(tfile); ...
> >>>> > >          if (!tun)
> >>>> > >                  return -EBADFD;
> >>>> > >
> >>>> > >but then:
> >>>> > >static int tun_peek_len(struct socket *sock) {
> >>>> > >
> >>>> > >...
> >>>> > >
> >>>> > >          struct tun_struct *tun; ...
> >>>> > >          tun = __tun_get(tfile);
> >>>> > >          if (!tun)
> >>>> > >                  return 0;
> >>>> > >
> >>>> > >
> >>>> > >so peek len should return 0.
> >>>> > >
> >>>> > >then while will exit:
> >>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, 
> >>>> > >sock->sk))) ...
> >>>> > >
> >>> >
> >>> >Consider this case: user do ip link del link tap0 before recvmsg() 
> >>> >but after
> >>> >tun_peek_len() ?
> >> Sure, this can happen, but I think we'll just exit on the next loop, 
> >> won't we?
> >>
> >
> >Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
> >
> 
> Thanks, I understand it don't happen in the latest kernel version.
> My problem happened using kernel version 3.10.0-xx
> The peek len willn't return 0.
> 
> static int peek_head_len(struct sock *sk)
> {
> 	struct sk_buff *head;
> 	int len = 0;
> 	unsigned long flags;
> 
> 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> 	head = skb_peek(&sk->sk_receive_queue);

Should return NULL, should it not?
Maybe sk_receive_queue was not purged on detach back then.


> 	if (likely(head)) {
> 		len = head->len;
> 		if (skb_vlan_tag_present(head))
> 			len += VLAN_HLEN;
> 	}
> 
> 	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> 	return len;
> }

^ permalink raw reply

* RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: wangyunjian @ 2016-12-01  4:41 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <936954dd-c8a5-c0f6-c3b0-84a9d67329f5@redhat.com>

>-----Original Message-----
>From: Jason Wang [mailto:jasowang@redhat.com] 
>Sent: Thursday, December 01, 2016 11:37 AM
>To: Michael S. Tsirkin
>Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>
>
>
>On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
>> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>>> >
>>> >
>>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>>> > > > >-----Original Message-----
>>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>>> > > > >To: wangyunjian
>>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
>>>>>> > > > >vger.kernel.org; caihe
>>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call 
>>>>>> > > > >the recvmsg when meet errors
>>>>>> > > > >
>>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len 
>>>>>> > > > >return 0 in this case, breaking the loop?
>>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>>> > > >
>>>> > >OK, I'd like to figure out what happened here. why don't we get 0 
>>>> > >when we peek at the head?
>>>> > >
>>>> > >EBADFD is from here:
>>>> > >          struct tun_struct *tun = __tun_get(tfile); ...
>>>> > >          if (!tun)
>>>> > >                  return -EBADFD;
>>>> > >
>>>> > >but then:
>>>> > >static int tun_peek_len(struct socket *sock) {
>>>> > >
>>>> > >...
>>>> > >
>>>> > >          struct tun_struct *tun; ...
>>>> > >          tun = __tun_get(tfile);
>>>> > >          if (!tun)
>>>> > >                  return 0;
>>>> > >
>>>> > >
>>>> > >so peek len should return 0.
>>>> > >
>>>> > >then while will exit:
>>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, 
>>>> > >sock->sk))) ...
>>>> > >
>>> >
>>> >Consider this case: user do ip link del link tap0 before recvmsg() 
>>> >but after
>>> >tun_peek_len() ?
>> Sure, this can happen, but I think we'll just exit on the next loop, 
>> won't we?
>>
>
>Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
>

Thanks, I understand it don't happen in the latest kernel version.
My problem happened using kernel version 3.10.0-xx

The peek len willn't return 0.

static int peek_head_len(struct sock *sk)
{
	struct sk_buff *head;
	int len = 0;
	unsigned long flags;

	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
	head = skb_peek(&sk->sk_receive_queue);
	if (likely(head)) {
		len = head->len;
		if (skb_vlan_tag_present(head))
			len += VLAN_HLEN;
	}

	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
	return len;
}

^ permalink raw reply

* Re: [net-next PATCH v3 0/6] XDP for virtio_net
From: John Fastabend @ 2016-12-01  4:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
	netdev, bblanco, brouer
In-Reply-To: <20161130205527-mutt-send-email-mst@kernel.org>

On 16-11-30 10:56 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 30, 2016 at 10:41:04AM -0800, John Fastabend wrote:
>> On 16-11-30 10:35 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 29, 2016 at 12:05:20PM -0800, John Fastabend wrote:
>>>> This implements virtio_net for the mergeable buffers and big_packet
>>>> modes. I tested this with vhost_net running on qemu and did not see
>>>> any issues. For testing num_buf > 1 I added a hack to vhost driver
>>>> to only but 100 bytes per buffer.
>>>>
>>>> There are some restrictions for XDP to be enabled and work well
>>>> (see patch 3) for more details.
>>>>
>>>>   1. LRO must be off
>>>>   2. MTU must be less than PAGE_SIZE
>>>>   3. queues must be available to dedicate to XDP
>>>>   4. num_bufs received in mergeable buffers must be 1
>>>>   5. big_packet mode must have all data on single page
>>>>
>>>> Please review any comments/feedback welcome as always.
>>>>
>>>> v2, fixes rcu usage throughout thanks to Eric and the use of
>>>> num_online_cpus() usage thanks to Jakub.
>>>>
>>>> v3, add slowpath patch to handle num_bufs > 1
>>>>
>>>> Thanks,
>>>> John
>>>
>>> BTW this is threaded incorrectly: patch 1/6 isn't a reply to 0/6,
>>> patches 2 and on are replies to patch 1.
>>>
>>
>> Ah yep, if you mangle the command line git will send the
>> cover letter even if you have mangled 'to' email addresses but when
>> it hits a real patch it aborts. At least on my version of git.
>>
>>> I'm busy until end of week, I'll review Monday. Sorry about the delay.
>>
>> In the meantime I'll post a v4 with better commit message (Alexei) and
>> address a corner cases Jakub pointed out.
> 
> I did a quick look and found some too, but a detailed review will
> have to wait till next week.

Thanks! I'll address all the comments and have a v4 by then.

^ permalink raw reply

* Re: [net-next PATCH v3 3/6] virtio_net: Add XDP support
From: John Fastabend @ 2016-12-01  4:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161130203851-mutt-send-email-mst@kernel.org>

On 16-11-30 10:54 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 29, 2016 at 12:10:21PM -0800, John Fastabend wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>>
>> This adds XDP support to virtio_net. Some requirements must be
>> met for XDP to be enabled depending on the mode. First it will
>> only be supported with LRO disabled so that data is not pushed
>> across multiple buffers. Second the MTU must be less than a page
>> size to avoid having to handle XDP across multiple pages.
>>
>> If mergeable receive is enabled this patch only supports the case
>> where header and data are in the same buf which we can check when
>> a packet is received by looking at num_buf. If the num_buf is
>> greater than 1 and a XDP program is loaded the packet is dropped
>> and a warning is thrown. When any_header_sg is set this does not
>> happen and both header and data is put in a single buffer as expected
>> so we check this when XDP programs are loaded.  Subsequent patches
>> will process the packet in a degraded mode to ensure connectivity
>> and correctness is not lost even if backend pushes packets into
>> multiple buffers.
>>
>> If big packets mode is enabled and MTU/LRO conditions above are
>> met then XDP is allowed.
>>
>> This patch was tested with qemu with vhost=on and vhost=off where
>> mergable and big_packet modes were forced via hard coding feature
>> negotiation. Multiple buffers per packet was forced via a small
>> test patch to vhost.c in the vhost=on qemu mode.
>>
>> Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---

[...]

>>  
>> +static u32 do_xdp_prog(struct virtnet_info *vi,
>> +		       struct bpf_prog *xdp_prog,
>> +		       struct page *page, int offset, int len)
>> +{
>> +	int hdr_padded_len;
>> +	struct xdp_buff xdp;
>> +	u32 act;
>> +	u8 *buf;
>> +
>> +	buf = page_address(page) + offset;
>> +
>> +	if (vi->mergeable_rx_bufs)
>> +		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>> +	else
>> +		hdr_padded_len = sizeof(struct padded_vnet_hdr);
>> +
>> +	xdp.data = buf + hdr_padded_len;
>> +	xdp.data_end = xdp.data + (len - vi->hdr_len);
> 
> so header seems to be ignored completely.
> but the packet could be from the time when
> e.g. checksum offloading was on, and
> so it might gave DATA_VALID (from CHECKSUM_UNNECESSARY
> in host).
> 
> I think you want to verify that flags and gso type
> are 0.

Ah yes agree. I think this might be possible in all the driver
implementations as well I wonder if there are bugs there as well.
The race between setting MTU, LRO, etc. and pushing the XDP prog
into the datapath seems unlikely but possible.

> 
> 
>> +
>> +	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +	switch (act) {
>> +	case XDP_PASS:
>> +		return XDP_PASS;
>> +	default:
>> +		bpf_warn_invalid_xdp_action(act);
>> +	case XDP_TX:
>> +	case XDP_ABORTED:
>> +	case XDP_DROP:
>> +		return XDP_DROP;
>> +	}
>> +}
> 
> do we really want this switch just to warn?
> How about doing != XDP_PASS in the caller?

Well in later patches XDP_TX case gets handled. So its really
three cases, PASS, XDP_TX, {default|ABORT|DROP}. I don't have
a strong preference if its case statement or if/else it should
not matter except for style.

> 
>> +
>>  static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
>>  {
>>  	struct sk_buff * skb = buf;
>> @@ -340,14 +375,28 @@ static struct sk_buff *receive_big(struct net_device *dev,
>>  				   void *buf,
>>  				   unsigned int len)
>>  {
>> +	struct bpf_prog *xdp_prog;
>>  	struct page *page = buf;
>> -	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>> +	struct sk_buff *skb;
>>  
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(rq->xdp_prog);
>> +	if (xdp_prog) {
>> +		u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
>> +
>> +		if (act == XDP_DROP)
>> +			goto err_xdp;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>>  	if (unlikely(!skb))
>>  		goto err;
>>  
>>  	return skb;
>>  
>> +err_xdp:
>> +	rcu_read_unlock();
>>  err:
>>  	dev->stats.rx_dropped++;
>>  	give_pages(rq, page);
>> @@ -366,10 +415,27 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	struct page *page = virt_to_head_page(buf);
>>  	int offset = buf - page_address(page);
>>  	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> 
> This is some useless computation when XDP is used, isn't it?
> 

Seems to be nice catch.

>> +	struct sk_buff *head_skb, *curr_skb;
>> +	struct bpf_prog *xdp_prog;
>>  
>> -	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
>> -					       truesize);
>> -	struct sk_buff *curr_skb = head_skb;
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(rq->xdp_prog);
>> +	if (xdp_prog) {
>> +		u32 act;
>> +
>> +		if (num_buf > 1) {
>> +			bpf_warn_invalid_xdp_buffer();
>> +			goto err_xdp;
>> +		}
>> +
>> +		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
>> +		if (act == XDP_DROP)
>> +			goto err_xdp;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
>> +	curr_skb = head_skb;
>>  
>>  	if (unlikely(!curr_skb))
>>  		goto err_skb;
> 
> I'm confused. Did the requirement to have a page per packet go away?
> I don't think this mode is doing it here.
> 

Correct its not a page per packet but there seems no reason to enforce
that here. XDP works just fine without contiguous buffers. I believe
the page per packet argument came out of the mlx driver implementation.

iirc DaveM had the most push back on the multiple packets per page
thing. Maybe he can comment again. (sorry Dave I'm dragging this up again).

> 
>> @@ -423,6 +489,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>  	ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
>>  	return head_skb;
>>  
>> +err_xdp:
>> +	rcu_read_unlock();
>>  err_skb:
>>  	put_page(page);
>>  	while (--num_buf) {
>> @@ -1328,6 +1396,13 @@ static int virtnet_set_channels(struct net_device *dev,
>>  	if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
>>  		return -EINVAL;
>>  
>> +	/* For now we don't support modifying channels while XDP is loaded
>> +	 * also when XDP is loaded all RX queues have XDP programs so we only
>> +	 * need to check a single RX queue.
>> +	 */
>> +	if (vi->rq[0].xdp_prog)
>> +		return -EINVAL;
>> +
>>  	get_online_cpus();
>>  	err = virtnet_set_queues(vi, queue_pairs);
>>  	if (!err) {
>> @@ -1454,6 +1529,68 @@ static int virtnet_set_features(struct net_device *netdev,
>>  	return 0;
>>  }
>>  
>> +static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	struct bpf_prog *old_prog;
>> +	int i;
>> +
>> +	if ((dev->features & NETIF_F_LRO) && prog) {
>> +		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
>> +		netdev_warn(dev, "XDP expects header/data in single page\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (dev->mtu > PAGE_SIZE) {
>> +		netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (prog) {
>> +		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
>> +		if (IS_ERR(prog))
>> +			return PTR_ERR(prog);
>> +	}
>> +
>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>> +		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
>> +		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
>> +		if (old_prog)
>> +			bpf_prog_put(old_prog);
> 
> don't we need to sync before put?
> 

No should be handled by core infrastructure in bpf_prog_put(). Also this
aligns with other driver implementations.

> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +

Thanks,
John

^ permalink raw reply

* linux-next: manual merge of the tip tree with the net-next tree
From: Stephen Rothwell @ 2016-12-01  3:46 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
	David Miller, Networking
  Cc: linux-next, linux-kernel, Edward Cree, Nicolas Pitre

Hi all,

Today's linux-next merge of the tip tree got a conflict in:

  drivers/net/ethernet/sfc/Kconfig

between commit:

  5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")

from the net-next tree and commit:

  d1cbfd771ce8 ("ptp_clock: Allow for it to be optional")

from the tip tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/ethernet/sfc/Kconfig
index 0e16197e4f18,83f4766a1da0..000000000000
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@@ -5,11 -5,11 +5,11 @@@ config SF
  	select CRC32
  	select I2C
  	select I2C_ALGOBIT
- 	select PTP_1588_CLOCK
 +	select SFC_FALCON
+ 	imply PTP_1588_CLOCK
  	---help---
  	  This driver supports 10/40-gigabit Ethernet cards based on
 -	  the Solarflare SFC4000, SFC9000-family and SFC9100-family
 -	  controllers.
 +	  the Solarflare SFC9000-family and SFC9100-family controllers.

  	  To compile this driver as a module, choose M here.  The module
  	  will be called sfc.

^ permalink raw reply

* Re: [PATCH] tcp_bbr: fix Kconfig to be able to make BBR the default congestion algorithm
From: David Miller @ 2016-12-01  3:43 UTC (permalink / raw)
  To: berny156; +Cc: linux-kernel, netdev, ncardwell
In-Reply-To: <b37e4220-b857-4853-f2e2-b77bb08115e5@gmx.de>

From: Bernhard Held <berny156@gmx.de>
Date: Wed, 30 Nov 2016 23:31:29 +0100

> Add missing line to be able to make BBR the default
> congestion algorithm.
> 
> Signed-off-by: Bernhard Held <berny156@gmx.de>

The current tree already has this fix.

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01  3:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <20161201052657-mutt-send-email-mst@kernel.org>



On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>> >
>> >
>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>> > > > >-----Original Message-----
>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>> > > > >To: wangyunjian
>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@vger.kernel.org; caihe
>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>>> > > > >
>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>>>> > > > >return 0 in this case, breaking the loop?
>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>> > > >
>>> > >OK, I'd like to figure out what happened here. why don't
>>> > >we get 0 when we peek at the head?
>>> > >
>>> > >EBADFD is from here:
>>> > >          struct tun_struct *tun = __tun_get(tfile);
>>> > >...
>>> > >          if (!tun)
>>> > >                  return -EBADFD;
>>> > >
>>> > >but then:
>>> > >static int tun_peek_len(struct socket *sock)
>>> > >{
>>> > >
>>> > >...
>>> > >
>>> > >          struct tun_struct *tun;
>>> > >...
>>> > >          tun = __tun_get(tfile);
>>> > >          if (!tun)
>>> > >                  return 0;
>>> > >
>>> > >
>>> > >so peek len should return 0.
>>> > >
>>> > >then while will exit:
>>> > >          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
>>> > >...
>>> > >
>> >
>> >Consider this case: user do ip link del link tap0 before recvmsg() but after
>> >tun_peek_len() ?
> Sure, this can happen, but I think we'll just exit on the next loop,
> won't we?
>

Right, this is the only case I can image for -EBADFD, let's wait for the 
author to the steps.

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01  3:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	caihe
In-Reply-To: <ee70cb92-3fd9-4148-77ba-739af30b04b6@redhat.com>

On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
> 
> 
> On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> > On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > > Sent: Wednesday, November 30, 2016 9:41 PM
> > > > To: wangyunjian
> > > > Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> > > > Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> > > > 
> > > > On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> > > > > When we meet an error(err=-EBADFD) recvmsg,
> > > > How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> > > > return 0 in this case, breaking the loop?
> > > We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> > > The soft lockup might happened. The err is -EBADFD.
> > > 
> > OK, I'd like to figure out what happened here. why don't
> > we get 0 when we peek at the head?
> > 
> > EBADFD is from here:
> >          struct tun_struct *tun = __tun_get(tfile);
> > ...
> >          if (!tun)
> >                  return -EBADFD;
> > 
> > but then:
> > static int tun_peek_len(struct socket *sock)
> > {
> > 
> > ...
> > 
> >          struct tun_struct *tun;
> > ...
> >          tun = __tun_get(tfile);
> >          if (!tun)
> >                  return 0;
> > 
> > 
> > so peek len should return 0.
> > 
> > then while will exit:
> >          while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> > ...
> > 
> 
> Consider this case: user do ip link del link tap0 before recvmsg() but after
> tun_peek_len() ?

Sure, this can happen, but I think we'll just exit on the next loop,
won't we?

> > > meesage log:
> > > kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> > > call trace:
> > > [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> > > [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> > > [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> > > [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> > > [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> > > [<fffffff810a5c7f>]kthread+0xcf/0xe0
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > > [<fffffff81648898>]ret_from_fork+0x58/0x90
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > So somehow you keep seeing something in tun when we peek.
> > IMO this is the problem we should try to fix.
> > Could you try debugging the root cause here?
> > 
> > > > > the error handling in vhost
> > > > > handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> > > > > 
> > > > > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > > > > ---
> > > > >   drivers/vhost/net.c | 3 +++
> > > > >   1 file changed, 3 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 5dc128a..edc470b 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> > > > >   			pr_debug("Discarded rx packet: "
> > > > >   				 " len %d, expected %zd\n", err, sock_len);
> > > > >   			vhost_discard_vq_desc(vq, headcount);
> > > > > +			/* Don't continue to do, when meet errors. */
> > > > > +			if (err < 0)
> > > > > +				goto out;
> > > > You might get e.g. EAGAIN and I think you need to retry
> > > > in this case.
> > > > 
> > > > >   			continue;
> > > > >   		}
> > > > >   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> > > > > -- 
> > > > > 1.9.5.msysgit.1
> > > > > 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox