public inbox for linux-rdma@vger.kernel.org
* [PATCH for-next 0/7] Add RSS/TSS QP groups and IPoIB support for RSS/TSS
@ 2012-05-08 16:22 Or Gerlitz
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz

Hi Roland,

RSS (Receive Side Scaling) and TSS (Transmit Side Scaling, better known as
MQ/Multi-Queue) are common networking techniques that make it possible to
exploit contemporary NICs supporting multiple receive and transmit descriptor
queues (multi-queue); see also Documentation/networking/scaling.txt

This series introduces the concept of RSS and TSS QP groups, which
low level drivers can implement and which IPoIB (and later also user
space ULPs) can use. An implementation for the mlx4_ib driver is provided.

Next, the IPoIB code is modified to support RSS/TSS, this is done in two steps.

The first step is a vectorization restructure, a pre-step for TSS/RSS,
which turns the driver into a multi-ring one using multiple RX/TX rings.

The second step makes use of the QP groups concept to enable RSS
and TSS.

The patch series is built as follows:

1/7 net/mlx4: add new cap flags field to track more capabilities
2/7 IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones
3/7 IB/mlx4: increase the number of vectors (EQs) available for ULPs

are infrastructure changes in the mlx4 driver, used later to support TSS/RSS.
They were already submitted ten days ago, but are included here for clarity
and ease of review.

4/7 IB/core: Add RSS and TSS QP groups

is the IB core patch that introduces QP groups

5/7 IB/mlx4: Add support for RSS and TSS QP groups

is the mlx4 driver implementation for QP groups and TSS/RSS

6/7 IB/ipoib: Implement vectorization restructure as pre-step for TSS/RSS

is the vectorization restructure, pre-step for TSS/RSS

7/7 IB/ipoib: Add RSS and TSS support for datagram mode

enables TSS and RSS

Enjoy, 

Or.

Shlomo Pongratz (7):
  net/mlx4: add new cap flags field to track more capabilities
  IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones
  IB/mlx4: increase the number of vectors (EQs) available for ULPs
  IB/core: Add RSS and TSS QP groups
  IB/mlx4: Add support for RSS and TSS QP groups
  IB/ipoib: Implement vectorization restructure as pre-step for TSS/RSS
  IB/ipoib: Add RSS and TSS support for datagram mode

 drivers/infiniband/core/verbs.c                |    3 +
 drivers/infiniband/hw/amso1100/c2_provider.c   |    3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c    |    2 +
 drivers/infiniband/hw/cxgb4/qp.c               |    3 +
 drivers/infiniband/hw/ehca/ehca_qp.c           |    6 +
 drivers/infiniband/hw/ipath/ipath_qp.c         |    3 +
 drivers/infiniband/hw/mlx4/cq.c                |   13 +-
 drivers/infiniband/hw/mlx4/main.c              |  102 ++++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h           |   15 +
 drivers/infiniband/hw/mlx4/mr.c                |    2 +-
 drivers/infiniband/hw/mlx4/qp.c                |  353 ++++++++++++-
 drivers/infiniband/hw/mlx4/srq.c               |    2 +-
 drivers/infiniband/hw/mthca/mthca_provider.c   |    3 +
 drivers/infiniband/hw/nes/nes_verbs.c          |    3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c    |    5 +
 drivers/infiniband/hw/qib/qib_qp.c             |    5 +
 drivers/infiniband/ulp/ipoib/ipoib.h           |  100 +++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |   97 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |   92 +++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  552 ++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  363 ++++++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   34 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  660 +++++++++++++++++++++---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c        |   29 +
 drivers/net/ethernet/mellanox/mlx4/fw.h        |    2 +
 drivers/net/ethernet/mellanox/mlx4/main.c      |    2 +
 include/linux/mlx4/device.h                    |    8 +
 include/rdma/ib_verbs.h                        |   26 +-
 29 files changed, 2108 insertions(+), 382 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH RESEND for-next 1/7] net/mlx4: add new cap flags field to track more capabilities
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-05-08 16:22   ` Or Gerlitz
  2012-05-08 16:22   ` [PATCH RESEND for-next 2/7] IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones Or Gerlitz
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds a 64-bit "flags2" features variable to export further features
of the hardware. The original "flags" field tracks features whose support bits
are advertised by the firmware at offsets 0x40 and 0x44 of the query device
capabilities command, whereas flags2 tracks features whose support bits are
scattered across various offsets. RSS support is the first feature exported
through it. The RSS capabilities are located at offset 0x2e, which also holds
the size of the RSS indirection table.
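
To make the offset 0x2e layout concrete, here is a hedged C sketch mirroring
the decoding this patch adds to QUERY_DEV_CAP (the flag macros below are
local stand-ins for the MLX4_DEV_CAP_FLAG2_* enum, and the 0x37 sample byte
is illustrative, not from a real device):

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the MLX4_DEV_CAP_FLAG2_* values added by this patch. */
#define CAP_FLAG2_RSS     (1ULL << 0)
#define CAP_FLAG2_RSS_TOP (1ULL << 1)
#define CAP_FLAG2_RSS_XOR (1ULL << 2)

/* Decode the capability byte at offset 0x2e: bit 5 = XOR hash support,
 * bit 4 = Toeplitz hash support, low 4 bits = log2 of the max RSS
 * indirection table size (0 means RSS is unsupported). */
static uint64_t decode_rss_caps(uint8_t field, int *max_rss_tbl_sz)
{
	uint64_t flags2 = 0;

	if (field & 0x20)
		flags2 |= CAP_FLAG2_RSS_XOR;
	if (field & 0x10)
		flags2 |= CAP_FLAG2_RSS_TOP;
	field &= 0xf;
	if (field) {
		flags2 |= CAP_FLAG2_RSS;
		*max_rss_tbl_sz = 1 << field;
	} else {
		*max_rss_tbl_sz = 0;
	}
	return flags2;
}
```

For example, a capability byte of 0x37 would report both hash functions and
a 128-entry indirection table (2^7).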

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/fw.c   |   29 +++++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/fw.h   |    2 ++
 drivers/net/ethernet/mellanox/mlx4/main.c |    2 ++
 include/linux/mlx4/device.h               |    8 ++++++++
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 2a02ba5..f7488df 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -118,6 +118,20 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u64 flags)
 			mlx4_dbg(dev, "    %s\n", fname[i]);
 }
 
+static void dump_dev_cap_flags2(struct mlx4_dev *dev, u64 flags)
+{
+	static const char * const fname[] = {
+		[0] = "RSS support",
+		[1] = "RSS Toeplitz Hash Function support",
+		[2] = "RSS XOR Hash Function support"
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(fname); ++i)
+		if (fname[i] && (flags & (1LL << i)))
+			mlx4_dbg(dev, "    %s\n", fname[i]);
+}
+
 int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg)
 {
 	struct mlx4_cmd_mailbox *mailbox;
@@ -346,6 +360,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 #define QUERY_DEV_CAP_MAX_REQ_QP_OFFSET		0x29
 #define QUERY_DEV_CAP_MAX_RES_QP_OFFSET		0x2b
 #define QUERY_DEV_CAP_MAX_GSO_OFFSET		0x2d
+#define QUERY_DEV_CAP_RSS_OFFSET		0x2e
 #define QUERY_DEV_CAP_MAX_RDMA_OFFSET		0x2f
 #define QUERY_DEV_CAP_RSZ_SRQ_OFFSET		0x33
 #define QUERY_DEV_CAP_ACK_DELAY_OFFSET		0x35
@@ -390,6 +405,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 #define QUERY_DEV_CAP_RSVD_LKEY_OFFSET		0x98
 #define QUERY_DEV_CAP_MAX_ICM_SZ_OFFSET		0xa0
 
+	dev_cap->flags2 = 0;
 	mailbox = mlx4_alloc_cmd_mailbox(dev);
 	if (IS_ERR(mailbox))
 		return PTR_ERR(mailbox);
@@ -439,6 +455,17 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	else
 		dev_cap->max_gso_sz = 1 << field;
 
+	MLX4_GET(field, outbox, QUERY_DEV_CAP_RSS_OFFSET);
+	if (field & 0x20)
+		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_RSS_XOR;
+	if (field & 0x10)
+		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_RSS_TOP;
+	field &= 0xf;
+	if (field) {
+		dev_cap->flags2 |= MLX4_DEV_CAP_FLAG2_RSS;
+		dev_cap->max_rss_tbl_sz = 1 << field;
+	} else
+		dev_cap->max_rss_tbl_sz = 0;
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_RDMA_OFFSET);
 	dev_cap->max_rdma_global = 1 << (field & 0x3f);
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_ACK_DELAY_OFFSET);
@@ -632,8 +659,10 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 		 dev_cap->max_rq_desc_sz, dev_cap->max_rq_sg);
 	mlx4_dbg(dev, "Max GSO size: %d\n", dev_cap->max_gso_sz);
 	mlx4_dbg(dev, "Max counters: %d\n", dev_cap->max_counters);
+	mlx4_dbg(dev, "Max RSS Table size: %d\n", dev_cap->max_rss_tbl_sz);
 
 	dump_dev_cap_flags(dev, dev_cap->flags);
+	dump_dev_cap_flags2(dev, dev_cap->flags2);
 
 out:
 	mlx4_free_cmd_mailbox(dev, mailbox);
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.h b/drivers/net/ethernet/mellanox/mlx4/fw.h
index e1a5fa5..64c0399 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.h
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.h
@@ -79,6 +79,7 @@ struct mlx4_dev_cap {
 	u64 trans_code[MLX4_MAX_PORTS + 1];
 	u16 stat_rate_support;
 	u64 flags;
+	u64 flags2;
 	int reserved_uars;
 	int uar_size;
 	int min_page_sz;
@@ -110,6 +111,7 @@ struct mlx4_dev_cap {
 	u32 reserved_lkey;
 	u64 max_icm_sz;
 	int max_gso_sz;
+	int max_rss_tbl_sz;
 	u8  supported_port_types[MLX4_MAX_PORTS + 1];
 	u8  suggested_type[MLX4_MAX_PORTS + 1];
 	u8  default_sense[MLX4_MAX_PORTS + 1];
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 8bb05b4..cc68f17 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -272,10 +272,12 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev->caps.max_msg_sz         = dev_cap->max_msg_sz;
 	dev->caps.page_size_cap	     = ~(u32) (dev_cap->min_page_sz - 1);
 	dev->caps.flags		     = dev_cap->flags;
+	dev->caps.flags2		     = dev_cap->flags2;
 	dev->caps.bmme_flags	     = dev_cap->bmme_flags;
 	dev->caps.reserved_lkey	     = dev_cap->reserved_lkey;
 	dev->caps.stat_rate_support  = dev_cap->stat_rate_support;
 	dev->caps.max_gso_sz	     = dev_cap->max_gso_sz;
+	dev->caps.max_rss_tbl_sz	 = dev_cap->max_rss_tbl_sz;
 
 	/* Sense port always allowed on supported devices for ConnectX1 and 2 */
 	if (dev->pdev->device != 0x1003)
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 834c96c..7f5e8d5 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -98,6 +98,12 @@ enum {
 	MLX4_DEV_CAP_FLAG_SENSE_SUPPORT	= 1LL << 55
 };
 
+enum {
+	MLX4_DEV_CAP_FLAG2_RSS			= 1LL <<  0,
+	MLX4_DEV_CAP_FLAG2_RSS_TOP		= 1LL <<  1,
+	MLX4_DEV_CAP_FLAG2_RSS_XOR		= 1LL <<  2
+};
+
 #define MLX4_ATTR_EXTENDED_PORT_INFO	cpu_to_be16(0xff90)
 
 enum {
@@ -292,11 +298,13 @@ struct mlx4_caps {
 	u32			max_msg_sz;
 	u32			page_size_cap;
 	u64			flags;
+	u64			flags2;
 	u32			bmme_flags;
 	u32			reserved_lkey;
 	u16			stat_rate_support;
 	u8			port_width_cap[MLX4_MAX_PORTS + 1];
 	int			max_gso_sz;
+	int			max_rss_tbl_sz;
 	int                     reserved_qps_cnt[MLX4_NUM_QP_REGION];
 	int			reserved_qps;
 	int                     reserved_qps_base[MLX4_NUM_QP_REGION];
-- 
1.7.1



* [PATCH RESEND for-next 2/7] IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2012-05-08 16:22   ` [PATCH RESEND for-next 1/7] net/mlx4: add new cap flags field to track more capabilities Or Gerlitz
@ 2012-05-08 16:22   ` Or Gerlitz
  2012-05-08 16:22   ` [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs Or Gerlitz
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/cq.c   |   10 +++++-----
 drivers/infiniband/hw/mlx4/main.c |   12 ++++++------
 drivers/infiniband/hw/mlx4/mr.c   |    2 +-
 drivers/infiniband/hw/mlx4/qp.c   |   12 ++++++------
 drivers/infiniband/hw/mlx4/srq.c  |    2 +-
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 77c8cb4..34ac0e2 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -50,7 +50,7 @@ static void mlx4_ib_cq_event(struct mlx4_cq *cq, enum mlx4_event type)
 	struct ib_cq *ibcq;
 
 	if (type != MLX4_EVENT_TYPE_CQ_ERROR) {
-		printk(KERN_WARNING "mlx4_ib: Unexpected event type %d "
+		pr_warn("Unexpected event type %d "
 		       "on CQ %06x\n", type, cq->cqn);
 		return;
 	}
@@ -463,7 +463,7 @@ static void dump_cqe(void *cqe)
 {
 	__be32 *buf = cqe;
 
-	printk(KERN_DEBUG "CQE contents %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	pr_debug("CQE contents %08x %08x %08x %08x %08x %08x %08x %08x\n",
 	       be32_to_cpu(buf[0]), be32_to_cpu(buf[1]), be32_to_cpu(buf[2]),
 	       be32_to_cpu(buf[3]), be32_to_cpu(buf[4]), be32_to_cpu(buf[5]),
 	       be32_to_cpu(buf[6]), be32_to_cpu(buf[7]));
@@ -473,7 +473,7 @@ static void mlx4_ib_handle_error_cqe(struct mlx4_err_cqe *cqe,
 				     struct ib_wc *wc)
 {
 	if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR) {
-		printk(KERN_DEBUG "local QP operation err "
+		pr_debug("local QP operation err "
 		       "(QPN %06x, WQE index %x, vendor syndrome %02x, "
 		       "opcode = %02x)\n",
 		       be32_to_cpu(cqe->my_qpn), be16_to_cpu(cqe->wqe_index),
@@ -576,7 +576,7 @@ repoll:
 
 	if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_OPCODE_NOP &&
 		     is_send)) {
-		printk(KERN_WARNING "Completion for NOP opcode detected!\n");
+		pr_warn("Completion for NOP opcode detected!\n");
 		return -EINVAL;
 	}
 
@@ -606,7 +606,7 @@ repoll:
 		mqp = __mlx4_qp_lookup(to_mdev(cq->ibcq.device)->dev,
 				       be32_to_cpu(cqe->vlan_my_qpn));
 		if (unlikely(!mqp)) {
-			printk(KERN_WARNING "CQ %06x with entry for unknown QPN %06x\n",
+			pr_warn("CQ %06x with entry for unknown QPN %06x\n",
 			       cq->mcq.cqn, be32_to_cpu(cqe->vlan_my_qpn) & MLX4_CQE_QPN_MASK);
 			return -EINVAL;
 		}
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 75d3056..1a11475 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -784,7 +784,7 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		list_del(&ge->list);
 		kfree(ge);
 	} else
-		printk(KERN_WARNING "could not find mgid entry\n");
+		pr_warn("could not find mgid entry\n");
 
 	mutex_unlock(&mqp->mutex);
 
@@ -897,7 +897,7 @@ static void update_gids_task(struct work_struct *work)
 
 	mailbox = mlx4_alloc_cmd_mailbox(dev);
 	if (IS_ERR(mailbox)) {
-		printk(KERN_WARNING "update gid table failed %ld\n", PTR_ERR(mailbox));
+		pr_warn("update gid table failed %ld\n", PTR_ERR(mailbox));
 		return;
 	}
 
@@ -908,7 +908,7 @@ static void update_gids_task(struct work_struct *work)
 		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
 		       MLX4_CMD_NATIVE);
 	if (err)
-		printk(KERN_WARNING "set port command failed\n");
+		pr_warn("set port command failed\n");
 	else {
 		memcpy(gw->dev->iboe.gid_table[gw->port - 1], gw->gids, sizeof gw->gids);
 		event.device = &gw->dev->ib_dev;
@@ -1082,7 +1082,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 	printk_once(KERN_INFO "%s", mlx4_ib_version);
 
 	if (mlx4_is_mfunc(dev)) {
-		printk(KERN_WARNING "IB not yet supported in SRIOV\n");
+		pr_warn("IB not yet supported in SRIOV\n");
 		return NULL;
 	}
 
@@ -1248,7 +1248,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 
 err_notif:
 	if (unregister_netdevice_notifier(&ibdev->iboe.nb))
-		printk(KERN_WARNING "failure unregistering notifier\n");
+		pr_warn("failure unregistering notifier\n");
 	flush_workqueue(wq);
 
 err_reg:
@@ -1283,7 +1283,7 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 	ib_unregister_device(&ibdev->ib_dev);
 	if (ibdev->iboe.nb.notifier_call) {
 		if (unregister_netdevice_notifier(&ibdev->iboe.nb))
-			printk(KERN_WARNING "failure unregistering notifier\n");
+			pr_warn("failure unregistering notifier\n");
 		ibdev->iboe.nb.notifier_call = NULL;
 	}
 	iounmap(ibdev->uar_map);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index dca55b1..bbaf617 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -338,7 +338,7 @@ int mlx4_ib_unmap_fmr(struct list_head *fmr_list)
 
 	err = mlx4_SYNC_TPT(mdev);
 	if (err)
-		printk(KERN_WARNING "mlx4_ib: SYNC_TPT error %d when "
+		pr_warn("SYNC_TPT error %d when "
 		       "unmapping FMRs\n", err);
 
 	return 0;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4649d83..5967644 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -261,7 +261,7 @@ static void mlx4_ib_qp_event(struct mlx4_qp *qp, enum mlx4_event type)
 			event.event = IB_EVENT_QP_ACCESS_ERR;
 			break;
 		default:
-			printk(KERN_WARNING "mlx4_ib: Unexpected event type %d "
+			pr_warn("Unexpected event type %d "
 			       "on QP %06x\n", type, qp->qpn);
 			return;
 		}
@@ -725,7 +725,7 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	if (qp->state != IB_QPS_RESET)
 		if (mlx4_qp_modify(dev->dev, NULL, to_mlx4_state(qp->state),
 				   MLX4_QP_STATE_RST, NULL, 0, 0, &qp->mqp))
-			printk(KERN_WARNING "mlx4_ib: modify QP %06x to RESET failed.\n",
+			pr_warn("modify QP %06x to RESET failed.\n",
 			       qp->mqp.qpn);
 
 	get_cqs(qp, &send_cq, &recv_cq);
@@ -958,7 +958,7 @@ static int mlx4_set_path(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah,
 
 	if (ah->ah_flags & IB_AH_GRH) {
 		if (ah->grh.sgid_index >= dev->dev->caps.gid_table_len[port]) {
-			printk(KERN_ERR "sgid_index (%u) too large. max is %d\n",
+			pr_err("sgid_index (%u) too large. max is %d\n",
 			       ah->grh.sgid_index, dev->dev->caps.gid_table_len[port] - 1);
 			return -1;
 		}
@@ -1064,7 +1064,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 			context->mtu_msgmax = (IB_MTU_4096 << 5) | 12;
 	} else if (attr_mask & IB_QP_PATH_MTU) {
 		if (attr->path_mtu < IB_MTU_256 || attr->path_mtu > IB_MTU_4096) {
-			printk(KERN_ERR "path MTU (%u) is invalid\n",
+			pr_err("path MTU (%u) is invalid\n",
 			       attr->path_mtu);
 			goto out;
 		}
@@ -1281,7 +1281,7 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (is_qp0(dev, qp)) {
 		if (cur_state != IB_QPS_RTR && new_state == IB_QPS_RTR)
 			if (mlx4_INIT_PORT(dev->dev, qp->port))
-				printk(KERN_WARNING "INIT_PORT failed for port %d\n",
+				pr_warn("INIT_PORT failed for port %d\n",
 				       qp->port);
 
 		if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR &&
@@ -1480,7 +1480,7 @@ static int build_mlx_header(struct mlx4_ib_sqp *sqp, struct ib_send_wr *wr,
 	header_size = ib_ud_header_pack(&sqp->ud_header, sqp->header_buf);
 
 	if (0) {
-		printk(KERN_ERR "built UD header of size %d:\n", header_size);
+		pr_err("built UD header of size %d:\n", header_size);
 		for (i = 0; i < header_size / 4; ++i) {
 			if (i % 8 == 0)
 				printk("  [%02x] ", i * 4);
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 39542f3..60c5fb0 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -59,7 +59,7 @@ static void mlx4_ib_srq_event(struct mlx4_srq *srq, enum mlx4_event type)
 			event.event = IB_EVENT_SRQ_ERR;
 			break;
 		default:
-			printk(KERN_WARNING "mlx4_ib: Unexpected event type %d "
+			pr_warn("Unexpected event type %d "
 			       "on SRQ %06x\n", type, srq->srqn);
 			return;
 		}
-- 
1.7.1



* [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2012-05-08 16:22   ` [PATCH RESEND for-next 1/7] net/mlx4: add new cap flags field to track more capabilities Or Gerlitz
  2012-05-08 16:22   ` [PATCH RESEND for-next 2/7] IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones Or Gerlitz
@ 2012-05-08 16:22   ` Or Gerlitz
       [not found]     ` <1336494151-31050-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2012-05-08 16:22   ` [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups Or Gerlitz
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Enable IB ULPs to use a larger portion of the device EQs (which map
to IRQs). The mlx4_ib driver follows the mlx4_core scheme under which
the EQs are divided among the device ports. In this scheme, for each IB
port, the number of allocated EQs follows the number of cores, subject
to other system constraints, such as the number of available MSI-X vectors.
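
The per-port budget computed in the patch can be sketched as follows (a
minimal user space rendition of the kernel's rounddown_pow_of_two() is
included so the snippet is self-contained; the sample pool sizes are
illustrative, not from a real device):

```c
#include <assert.h>

/* User space stand-in for the kernel's rounddown_pow_of_two(). */
static int rounddown_pow_of_two_int(int n)
{
	int p;

	if (n < 1)
		return 0;
	p = 1;
	while (p * 2 <= n)
		p *= 2;
	return p;
}

/* Split the pool of extra completion vectors evenly across the IB ports,
 * rounded down to a power of two; an empty pool means legacy mode. */
static int eqs_per_port(int comp_pool, int num_ports)
{
	if (comp_pool == 0)	/* legacy mode: no extra EQs */
		return 0;
	return rounddown_pow_of_two_int(comp_pool / num_ports);
}
```

So a device exposing a pool of 24 extra vectors on a 2-port card would get
8 EQs per port (24/2 = 12, rounded down to 8).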

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/cq.c      |    3 +
 drivers/infiniband/hw/mlx4/main.c    |   85 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx4/mlx4_ib.h |    2 +
 3 files changed, 90 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 34ac0e2..6d4ef71 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -222,6 +222,9 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 		uar = &dev->priv_uar;
 	}
 
+	if (dev->eq_table)
+		vector = dev->eq_table[vector % ibdev->num_comp_vectors];
+
 	err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar,
 			    cq->db.dma, &cq->mcq, vector, 0);
 	if (err)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 1a11475..8aa06da 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1071,6 +1071,87 @@ static int mlx4_ib_netdev_event(struct notifier_block *this, unsigned long event
 	return NOTIFY_DONE;
 }
 
+static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
+{
+	char name[32];
+	int eq_per_port = 0;
+	int added_eqs = 0;
+	int total_eqs = 0;
+	int i, j, eq;
+
+	/* Init eq table */
+	ibdev->eq_table = NULL;
+	ibdev->eq_added = 0;
+
+	/* Legacy mode ? */
+	if (dev->caps.comp_pool == 0)
+		return;
+
+	eq_per_port = rounddown_pow_of_two(dev->caps.comp_pool/
+					dev->caps.num_ports);
+
+	/* Init eq table */
+	added_eqs = 0;
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
+		added_eqs += eq_per_port;
+
+	total_eqs = dev->caps.num_comp_vectors + added_eqs;
+
+	ibdev->eq_table = kzalloc(total_eqs * sizeof(int), GFP_KERNEL);
+	if (!ibdev->eq_table)
+		return;
+
+	ibdev->eq_added = added_eqs;
+
+	eq = 0;
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) {
+		for (j = 0; j < eq_per_port; j++) {
+			sprintf(name , "mlx4-ib-%d-%d@%s",
+				i , j, dev->pdev->bus->name);
+			/* Set IRQ for specific name (per ring) */
+			if (mlx4_assign_eq(dev, name, &ibdev->eq_table[eq])) {
+				/* Use legacy (same as mlx4_en driver) */
+				printk(KERN_WARNING
+					"Can't allocate eq revert to legacy\n");
+				ibdev->eq_table[eq] =
+					(eq % dev->caps.num_comp_vectors);
+			}
+			eq++;
+		}
+	}
+
+	/* Fill the rest of the table with the legacy EQs */
+	for (i = 0, eq = added_eqs; i < dev->caps.num_comp_vectors; i++)
+		ibdev->eq_table[eq++] = i;
+
+	/* Advertise the new EQ number to clients */
+	ibdev->ib_dev.num_comp_vectors = total_eqs;
+}
+
+static void mlx4_ib_free_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
+{
+	int i;
+	int total_eqs;
+
+	/* Reset the advertised EQ number */
+	ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors;
+
+	/* Free only the added eqs */
+	for (i = 0; i < ibdev->eq_added; i++) {
+		/* Don't free legacy eqs if used */
+		if (ibdev->eq_table[i] <= dev->caps.num_comp_vectors)
+			continue;
+		mlx4_release_eq(dev , ibdev->eq_table[i]);
+	}
+
+	total_eqs = dev->caps.num_comp_vectors + ibdev->eq_added;
+	memset(ibdev->eq_table, 0, total_eqs * sizeof(int));
+	kfree(ibdev->eq_table);
+
+	ibdev->eq_table = NULL;
+	ibdev->eq_added = 0;
+}
+
 static void *mlx4_ib_add(struct mlx4_dev *dev)
 {
 	struct mlx4_ib_dev *ibdev;
@@ -1205,6 +1286,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 			(1ull << IB_USER_VERBS_CMD_CLOSE_XRCD);
 	}
 
+	mlx4_ib_alloc_eqs(dev, ibdev);
+
 	spin_lock_init(&iboe->lock);
 
 	if (init_node_data(ibdev))
@@ -1293,6 +1376,8 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 	mlx4_foreach_port(p, dev, MLX4_PORT_TYPE_IB)
 		mlx4_CLOSE_PORT(dev, p);
 
+	mlx4_ib_free_eqs(dev, ibdev);
+
 	mlx4_uar_free(dev, &ibdev->priv_uar);
 	mlx4_pd_free(dev, ibdev->priv_pdn);
 	ib_dealloc_device(&ibdev->ib_dev);
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ed80345..9060771 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -202,6 +202,8 @@ struct mlx4_ib_dev {
 	bool			ib_active;
 	struct mlx4_ib_iboe	iboe;
 	int			counters[MLX4_MAX_PORTS];
+	int			*eq_table;
+	int			eq_added;
 };
 
 static inline struct mlx4_ib_dev *to_mdev(struct ib_device *ibdev)
-- 
1.7.1



* [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2012-05-08 16:22   ` [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs Or Gerlitz
@ 2012-05-08 16:22   ` Or Gerlitz
       [not found]     ` <1336494151-31050-5-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2012-05-08 16:22   ` [PATCH for-next 5/7] IB/mlx4: Add support for " Or Gerlitz
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

RSS (Receive Side Scaling) and TSS (Transmit Side Scaling, better known as
MQ/Multi-Queue) are common networking techniques that make it possible to
exploit contemporary NICs supporting multiple receive and transmit descriptor
queues (multi-queue); see also Documentation/networking/scaling.txt

This patch introduces the concept of RSS and TSS QP groups, which
low level drivers can implement and which IPoIB (and later also user
space ULPs) can use.

A QP group is a set of QPs consisting of a parent QP and two disjoint sets
of RSS and TSS QPs. The creation of a QP group is a two stage process:

In the 1st stage, the parent QP is created.

In the 2nd stage, the children QPs of the parent are created.

Each child QP indicates whether it is an RSS or a TSS QP. Both the TSS
and RSS sets of QPs must have contiguous QP numbers.

A few new elements/concepts are introduced to support this:

Three new device capabilities that can be set by the low level driver:

- IB_DEVICE_QPG which is set to indicate QP groups are supported.

- IB_DEVICE_UD_RSS which is set to indicate that the device supports
RSS, that is applying hash function on incoming TCP/UDP/IP packets and
dispatching them to multiple "rings" (child QPs).

- IB_DEVICE_UD_TSS which is set to indicate that the device supports
"HW TSS", meaning that the HW is capable of overriding the source
UD QPN present in the sent IB datagram header (DETH) with the parent's QPN.

Low level drivers that do not support HW TSS can still support QP groups;
such a combination is referred to as "SW TSS". In this case, the low level
driver fills in the qpg_tss_mask_sz field of the struct ib_qp_cap returned
from ib_create_qp, so that this mask can be used to retrieve the parent QPN
from incoming packets carrying a child QPN (thanks to the contiguous QP
numbers requirement).
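
One plausible reading of the SW TSS mask, as a hedged C sketch: the
qpg_tss_mask_sz field name comes from this patch, but the assumption that
the parent QPN is aligned on a 2^qpg_tss_mask_sz boundary, with the child
QPNs sitting contiguously above it, is ours and not stated explicitly:

```c
#include <assert.h>
#include <stdint.h>

/* Recover the parent QPN from a child QPN under SW TSS, assuming the
 * parent QPN is aligned on a 2^qpg_tss_mask_sz boundary and the children
 * occupy the QPNs directly above it: clearing the low mask bits of the
 * child QPN yields the parent QPN. */
static uint32_t sw_tss_parent_qpn(uint32_t child_qpn, uint8_t qpg_tss_mask_sz)
{
	uint32_t mask = (1U << qpg_tss_mask_sz) - 1;

	return child_qpn & ~mask;
}
```

Under these assumptions, with a mask size of 3, a packet carrying child QPN
0x65 would map back to parent QPN 0x60.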

- a max rss table size device attribute, which is the maximal size of the
RSS indirection table supported by the device

- a qp group type attribute for qp creation, saying whether this is a parent
QP, an rx/tx (rss/tss) child QP, or none of the above for non rss/tss QPs.

- per qp group type, another attribute is added: for parent QPs, the number
of rx/tx child QPs, and for child QPs, a pointer to the parent.

- an IB_QP_GROUP_RSS attribute mask, which should be used when modifying
the parent QP state from reset to init
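
The two-stage creation flow might look like the following sketch.
IB_QPG_NONE and the qpg_type field appear in the diff of this patch; the
parent/child enumerators, the child-count fields, and the qpg_parent
pointer are guesses modeled on the commit message, not the actual
ib_verbs.h additions, and the verbs machinery around them is stubbed out:

```c
#include <assert.h>
#include <stddef.h>

/* Stub mirror of the new creation attributes (hypothetical names). */
enum ib_qpg_type {
	IB_QPG_NONE,		/* ordinary, non-grouped QP */
	IB_QPG_PARENT,		/* stage 1: the group's parent QP */
	IB_QPG_CHILD_RX,	/* stage 2: an RSS (receive) child */
	IB_QPG_CHILD_TX		/* stage 2: a TSS (transmit) child */
};

struct qpg_init_attr {
	enum ib_qpg_type qpg_type;
	int rss_child_count;	/* parent only: number of RX children */
	int tss_child_count;	/* parent only: number of TX children */
	void *qpg_parent;	/* child only: the parent QP (opaque here) */
};

/* Stage 1: fill the creation attributes for the parent QP. */
static void init_parent_attr(struct qpg_init_attr *attr, int rx, int tx)
{
	attr->qpg_type = IB_QPG_PARENT;
	attr->rss_child_count = rx;
	attr->tss_child_count = tx;
	attr->qpg_parent = NULL;
}

/* Stage 2: fill the creation attributes for one RSS child of that parent. */
static void init_rx_child_attr(struct qpg_init_attr *attr, void *parent)
{
	attr->qpg_type = IB_QPG_CHILD_RX;
	attr->rss_child_count = 0;
	attr->tss_child_count = 0;
	attr->qpg_parent = parent;
}
```

A ULP such as IPoIB would run stage 1 once, then repeat stage 2 per ring,
passing each attribute set to ib_create_qp.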

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/verbs.c              |    3 +++
 drivers/infiniband/hw/amso1100/c2_provider.c |    3 +++
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 ++
 drivers/infiniband/hw/cxgb4/qp.c             |    3 +++
 drivers/infiniband/hw/ehca/ehca_qp.c         |    6 ++++++
 drivers/infiniband/hw/ipath/ipath_qp.c       |    3 +++
 drivers/infiniband/hw/mlx4/qp.c              |    3 +++
 drivers/infiniband/hw/mthca/mthca_provider.c |    3 +++
 drivers/infiniband/hw/nes/nes_verbs.c        |    3 +++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    5 +++++
 drivers/infiniband/hw/qib/qib_qp.c           |    5 +++++
 include/rdma/ib_verbs.h                      |   26 +++++++++++++++++++++++++-
 12 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 30f199e..bbe0e5f 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -496,6 +496,9 @@ static const struct {
 						IB_QP_QKEY),
 				[IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
 						IB_QP_QKEY),
+			},
+			.opt_param = {
+				[IB_QPT_UD]  = IB_QP_GROUP_RSS
 			}
 		},
 	},
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 07eb3a8..546760b 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -241,6 +241,9 @@ static struct ib_qp *c2_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 0bdf09a..49850f6 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -902,6 +902,8 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
 	PDBG("%s ib_pd %p\n", __func__, pd);
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
 	php = to_iwch_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 5f940ae..7ff2aa8 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1402,6 +1402,9 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
 
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct c4iw_cq *)attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 964f855..ca8abd1 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -464,6 +464,9 @@ static struct ehca_qp *internal_create_qp(
 	int is_llqp = 0, has_srq = 0, is_user = 0;
 	int qp_type, max_send_sge, max_recv_sge, ret;
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
 	u32 swqe_size = 0, rwqe_size = 0, ib_qp_num;
@@ -980,6 +983,9 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd,
 	if (srq_init_attr->srq_type != IB_SRQT_BASIC)
 		return ERR_PTR(-ENOSYS);
 
+	if (srq_init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	/* For common attributes, internal_create_qp() takes its info
 	 * out of qp_init_attr, so copy all common attrs there.
 	 */
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 0857a9c..117b775 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -755,6 +755,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	if (init_attr->cap.max_send_sge > ib_ipath_max_sges ||
 	    init_attr->cap.max_send_wr > ib_ipath_max_qp_wrs) {
 		ret = ERR_PTR(-EINVAL);
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 5967644..a32482c 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -787,6 +787,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	    (udata || init_attr->qp_type != IB_QPT_UD))
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
 		pd = to_mxrcd(init_attr->xrcd)->pd;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 5b71d43..120aa1e 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -518,6 +518,9 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 	case IB_QPT_UC:
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 8b8812d..24825b5 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1131,6 +1131,9 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	atomic_inc(&qps_created);
 	switch (init_attr->qp_type) {
 		case IB_QPT_RC:
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index e9f74d1..9035aae 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -841,6 +841,11 @@ static int ocrdma_check_qp_params(struct ib_pd *ibpd, struct ocrdma_dev *dev,
 			   __func__, dev->id, attrs->qp_type);
 		return -EINVAL;
 	}
+	if (attrs->qpg_type != IB_QPG_NONE) {
+		ocrdma_err("%s(%d) unsupported qpg type=0x%x requested\n",
+			   __func__, dev->id, attrs->qpg_type);
+		return -ENOSYS;
+	}
 	if (attrs->cap.max_send_wr > dev->attr.max_wqe) {
 		ocrdma_err("%s(%d) unsupported send_wr=0x%x requested\n",
 			   __func__, dev->id, attrs->cap.max_send_wr);
diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
index 7e7e16f..838b1c7 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -986,6 +986,11 @@ struct ib_qp *qib_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE) {
+		ret = ERR_PTR(-ENOSYS);
+		goto bail;
+	}
+
 	/* Check receive queue parameters if no SRQ is specified. */
 	if (!init_attr->srq) {
 		if (init_attr->cap.max_recv_sge > ib_qib_max_sges ||
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 07996af..2e30f89 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -115,6 +115,9 @@ enum ib_device_cap_flags {
 	IB_DEVICE_XRC			= (1<<20),
 	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<21),
 	IB_DEVICE_BLOCK_MULTICAST_LOOPBACK = (1<<22),
+	IB_DEVICE_QPG			= (1<<23),
+	IB_DEVICE_UD_RSS		= (1<<24),
+	IB_DEVICE_UD_TSS		= (1<<25)
 };
 
 enum ib_atomic_cap {
@@ -162,6 +165,7 @@ struct ib_device_attr {
 	int			max_srq_wr;
 	int			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
+	int			max_rss_tbl_sz;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 };
@@ -584,6 +588,7 @@ struct ib_qp_cap {
 	u32	max_send_sge;
 	u32	max_recv_sge;
 	u32	max_inline_data;
+	u32	qpg_tss_mask_sz;
 };
 
 enum ib_sig_type {
@@ -616,6 +621,18 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
 };
 
+enum ib_qpg_type {
+	IB_QPG_NONE	= 0,
+	IB_QPG_PARENT	= (1<<0),
+	IB_QPG_CHILD_RX = (1<<1),
+	IB_QPG_CHILD_TX = (1<<2)
+};
+
+struct ib_qpg_init_attrib {
+	u32 tss_child_count;
+	u32 rss_child_count;
+};
+
 struct ib_qp_init_attr {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void		       *qp_context;
@@ -624,9 +641,14 @@ struct ib_qp_init_attr {
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd;     /* XRC TGT QPs only */
 	struct ib_qp_cap	cap;
+	union {
+		struct ib_qp *qpg_parent; /* see qpg_type */
+		struct ib_qpg_init_attrib parent_attrib;
+	};
 	enum ib_sig_type	sq_sig_type;
 	enum ib_qp_type		qp_type;
 	enum ib_qp_create_flags	create_flags;
+	enum ib_qpg_type	qpg_type;
 	u8			port_num; /* special QP types only */
 };
 
@@ -693,7 +715,8 @@ enum ib_qp_attr_mask {
 	IB_QP_MAX_DEST_RD_ATOMIC	= (1<<17),
 	IB_QP_PATH_MIG_STATE		= (1<<18),
 	IB_QP_CAP			= (1<<19),
-	IB_QP_DEST_QPN			= (1<<20)
+	IB_QP_DEST_QPN			= (1<<20),
+	IB_QP_GROUP_RSS			= (1<<21)
 };
 
 enum ib_qp_state {
@@ -972,6 +995,7 @@ struct ib_qp {
 	void		       *qp_context;
 	u32			qp_num;
 	enum ib_qp_type		qp_type;
+	enum ib_qpg_type	qpg_type;
 };
 
 struct ib_mr {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH for-next 5/7] IB/mlx4: Add support for RSS and TSS QP groups
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2012-05-08 16:22   ` [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups Or Gerlitz
@ 2012-05-08 16:22   ` Or Gerlitz
  2012-05-08 16:22   ` [PATCH for-next 6/7] IB/ipoib: Implement vectorization restructure as pre-step for TSS/RSS Or Gerlitz
  2012-05-08 16:22   ` [PATCH for-next 7/7] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
  6 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Depending on the mlx4 device capabilities, support the RSS IB device
capability, using the Toeplitz or XOR hash functions. Support creating QP
groups in which all RX and TX QPs have contiguous QP numbers.

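The Toeplitz hash referred to above is the standard RSS hash: for every
set bit of the packet's flow tuple, a sliding 32-bit window of the hash
key is XORed into the result. The sketch below is an illustrative
user-space reference only, not the mlx4 firmware implementation; the
function name and the key used in the example are made up:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Reference Toeplitz hash: for each set bit of the input (the packet's
 * flow tuple in network byte order), XOR in the 32-bit window of the
 * key that starts at that bit position.
 */
uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
		       const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	/* sliding 32-bit window over the key, advanced one bit at a time */
	uint32_t v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		     ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
	size_t i;
	int b;

	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= v;
			v <<= 1;
			/* shift the next key bit into the window */
			if (i + 4 < keylen && (key[i + 4] & (1u << b)))
				v |= 1;
		}
	}
	return hash;
}
```

An RX QP is then picked from the low bits of the hash (the
rss_child_count entries starting at the group's RSS QPN base); the XOR
hash variant replaces the computation above with a cheaper XOR fold of
the same tuple.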
Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |    5 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   13 ++
 drivers/infiniband/hw/mlx4/qp.c      |  342 ++++++++++++++++++++++++++++++++-
 3 files changed, 349 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 8aa06da..1034853 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -131,6 +131,11 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC)
 		props->device_cap_flags |= IB_DEVICE_XRC;
 
+	props->device_cap_flags |= IB_DEVICE_QPG;
+	if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS) {
+		props->device_cap_flags |= IB_DEVICE_UD_RSS;
+		props->max_rss_tbl_sz = dev->dev->caps.max_rss_tbl_sz;
+	}
 	props->vendor_id	   = be32_to_cpup((__be32 *) (out_mad->data + 36)) &
 		0xffffff;
 	props->vendor_part_id	   = be16_to_cpup((__be16 *) (out_mad->data + 30));
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 9060771..57a03ad 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -126,6 +126,17 @@ struct mlx4_ib_gid_entry {
 	u8			port;
 };
 
+struct mlx4_ib_qpg_data {
+	unsigned long *tss_bitmap;
+	unsigned long *rss_bitmap;
+	struct mlx4_ib_qp *qpg_parent;
+	int tss_qpn_base;
+	int rss_qpn_base;
+	u32 tss_child_count;
+	u32 rss_child_count;
+	u32 qpg_tss_mask_sz;
+};
+
 struct mlx4_ib_qp {
 	struct ib_qp		ibqp;
 	struct mlx4_qp		mqp;
@@ -154,6 +165,8 @@ struct mlx4_ib_qp {
 	u8			sq_no_prefetch;
 	u8			state;
 	int			mlx_type;
+	enum ib_qpg_type	qpg_type;
+	struct mlx4_ib_qpg_data *qpg_data;
 	struct list_head	gid_list;
 };
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index a32482c..7fe3500 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -34,6 +34,8 @@
 #include <linux/log2.h>
 #include <linux/slab.h>
 #include <linux/netdevice.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
@@ -475,6 +477,241 @@ static int qp_has_rq(struct ib_qp_init_attr *attr)
 	return !attr->srq;
 }
 
+static int init_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp,
+			   struct ib_qp_init_attr *attr, int *qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data;
+	int tss_num, rss_num;
+	int tss_align_num, rss_align_num;
+	int tss_base, rss_base;
+	int err;
+
+	/* Parent is part of the TSS range (in SW TSS ARP is sent via parent) */
+	tss_num = 1 + attr->parent_attrib.tss_child_count;
+	tss_align_num = roundup_pow_of_two(tss_num);
+	rss_num = attr->parent_attrib.rss_child_count;
+	rss_align_num = roundup_pow_of_two(rss_num);
+
+	if (rss_num > 1) {
+		/* RSS is requested */
+		if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+			return -ENOSYS;
+		if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+			return -EINVAL;
+		/* We must work with power of two */
+		attr->parent_attrib.rss_child_count = rss_align_num;
+	}
+
+	qpg_data = kzalloc(sizeof *qpg_data, GFP_KERNEL);
+	if (!qpg_data)
+		return -ENOMEM;
+
+	err = mlx4_qp_reserve_range(dev->dev, tss_align_num,
+				    tss_align_num, &tss_base);
+	if (err)
+		goto err1;
+
+	if (tss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(tss_align_num)  * sizeof(long);
+		qpg_data->tss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->tss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err2;
+		}
+		bitmap_fill(qpg_data->tss_bitmap, tss_num);
+		/* Note parent takes first index */
+		clear_bit(0, qpg_data->tss_bitmap);
+	}
+
+	if (rss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(rss_align_num) * sizeof(long);
+		err = mlx4_qp_reserve_range(dev->dev, rss_align_num,
+					    rss_align_num, &rss_base);
+		if (err)
+			goto err3;
+		qpg_data->rss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->rss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err4;
+		}
+		bitmap_fill(qpg_data->rss_bitmap, rss_align_num);
+	}
+
+	qpg_data->tss_child_count = attr->parent_attrib.tss_child_count;
+	qpg_data->rss_child_count = attr->parent_attrib.rss_child_count;
+	qpg_data->qpg_parent = pqp;
+	qpg_data->qpg_tss_mask_sz = ilog2(tss_align_num);
+	qpg_data->tss_qpn_base = tss_base;
+	qpg_data->rss_qpn_base = rss_base;
+
+	pqp->qpg_data = qpg_data;
+	*qpn = tss_base;
+
+	return 0;
+
+err4:
+	mlx4_qp_release_range(dev->dev, rss_base, rss_align_num);
+
+err3:
+	if (tss_num > 1)
+		kfree(qpg_data->tss_bitmap);
+
+err2:
+	mlx4_qp_release_range(dev->dev, tss_base, tss_align_num);
+
+err1:
+	kfree(qpg_data);
+	return err;
+}
+
+static void free_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp)
+{
+	struct mlx4_ib_qpg_data *qpg_data = pqp->qpg_data;
+	int align_num;
+
+	if (qpg_data->tss_child_count > 1)
+		kfree(qpg_data->tss_bitmap);
+
+	align_num = roundup_pow_of_two(1 + qpg_data->tss_child_count);
+	mlx4_qp_release_range(dev->dev, qpg_data->tss_qpn_base, align_num);
+
+	if (qpg_data->rss_child_count > 1) {
+		kfree(qpg_data->rss_bitmap);
+		align_num = roundup_pow_of_two(qpg_data->rss_child_count);
+		mlx4_qp_release_range(dev->dev, qpg_data->rss_qpn_base,
+					align_num);
+	}
+
+	kfree(qpg_data);
+}
+
+static int alloc_qpg_qpn(struct ib_qp_init_attr *init_attr,
+			 struct mlx4_ib_qp *pqp, int *qpn)
+{
+	struct mlx4_ib_qp *mqp = to_mqp(init_attr->qpg_parent);
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+	u32 idx, old;
+
+	switch (init_attr->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		if (qpg_data->tss_child_count == 0)
+			return -EINVAL;
+		do {
+			/* Parent took index 0 */
+			idx = find_first_bit(qpg_data->tss_bitmap,
+					     qpg_data->tss_child_count + 1);
+			if (idx >= qpg_data->tss_child_count + 1)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->tss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->tss_qpn_base;
+		break;
+	case IB_QPG_CHILD_RX:
+		if (qpg_data->rss_child_count == 0)
+			return -EINVAL;
+		do {
+			idx = find_first_bit(qpg_data->rss_bitmap,
+					     qpg_data->rss_child_count);
+			if (idx >= qpg_data->rss_child_count)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->rss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->rss_qpn_base;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pqp->qpg_data = qpg_data;
+	*qpn = idx;
+
+	return 0;
+}
+
+static void free_qpg_qpn(struct mlx4_ib_qp *mqp, int qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+
+	switch (mqp->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		/* Do range check */
+		qpn -= qpg_data->tss_qpn_base;
+		set_bit(qpn, qpg_data->tss_bitmap);
+		break;
+	case IB_QPG_CHILD_RX:
+		qpn -= qpg_data->rss_qpn_base;
+		set_bit(qpn, qpg_data->rss_bitmap);
+		break;
+	default:
+		/* error */
+		pr_warn("wrong qpg type (%d)\n", mqp->qpg_type);
+		break;
+	}
+}
+
+static int alloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			    struct ib_qp_init_attr *attr, int *qpn)
+{
+	int err = 0;
+
+	switch (attr->qpg_type) {
+	case IB_QPG_NONE:
+		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
+		 * BlueFlame setup flow wrongly causes VLAN insertion. */
+		if (attr->qp_type == IB_QPT_RAW_PACKET)
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, qpn);
+		else
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1, qpn);
+		break;
+	case IB_QPG_PARENT:
+		err = init_qpg_parent(dev, qp, attr, qpn);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		err = alloc_qpg_qpn(attr, qp, qpn);
+		break;
+	default:
+		qp->qpg_type = IB_QPG_NONE;
+		err = -EINVAL;
+		break;
+	}
+	if (err)
+		return err;
+	qp->qpg_type = attr->qpg_type;
+	return 0;
+}
+
+static void free_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			enum ib_qpg_type qpg_type, int qpn)
+{
+	switch (qpg_type) {
+	case IB_QPG_NONE:
+		mlx4_qp_release_range(dev->dev, qpn, 1);
+		break;
+	case IB_QPG_PARENT:
+		free_qpg_parent(dev, qp);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		free_qpg_qpn(qp, qpn);
+		break;
+	default:
+		break;
+	}
+}
+
+/* Revert allocation on create_qp_common */
+static void unalloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			       struct ib_qp_init_attr *attr, int qpn)
+{
+	free_qpn_common(dev, qp, attr->qpg_type, qpn);
+}
+
+static void release_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp)
+{
+	free_qpn_common(dev, qp, qp->qpg_type, qp->mqp.qpn);
+}
+
 static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp *qp)
@@ -578,12 +815,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	if (sqpn) {
 		qpn = sqpn;
 	} else {
-		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
-		 * BlueFlame setup flow wrongly causes VLAN insertion. */
-		if (init_attr->qp_type == IB_QPT_RAW_PACKET)
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, &qpn);
-		else
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn);
+		err = alloc_qpn_common(dev, qp, init_attr, &qpn);
 		if (err)
 			goto err_wrid;
 	}
@@ -607,8 +839,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_qpn:
-	if (!sqpn)
-		mlx4_qp_release_range(dev->dev, qpn, 1);
+	unalloc_qpn_common(dev, qp, init_attr, qpn);
 
 err_wrid:
 	if (pd->uobject) {
@@ -746,7 +977,7 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	mlx4_qp_free(dev->dev, &qp->mqp);
 
 	if (!is_sqp(dev, qp))
-		mlx4_qp_release_range(dev->dev, qp->mqp.qpn, 1);
+		release_qpn_common(dev, qp);
 
 	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
 
@@ -766,6 +997,52 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	del_gid_entries(qp);
 }
 
+static int check_qpg_attr(struct mlx4_ib_dev *dev,
+			  struct ib_qp_init_attr *attr)
+{
+	if (attr->qpg_type == IB_QPG_NONE)
+		return 0;
+
+	if (attr->qp_type != IB_QPT_UD)
+		return -EINVAL;
+
+	if (attr->qpg_type == IB_QPG_PARENT) {
+		if (attr->parent_attrib.tss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if (attr->parent_attrib.rss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if ((attr->parent_attrib.tss_child_count == 0) &&
+			(attr->parent_attrib.rss_child_count == 0))
+			/* Should be called with IB_QPG_NONE */
+			return -EINVAL;
+		if (attr->parent_attrib.rss_child_count > 1) {
+			int rss_align_num;
+			if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+				return -ENOSYS;
+			rss_align_num = roundup_pow_of_two(
+					attr->parent_attrib.rss_child_count);
+			if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+				return -EINVAL;
+		}
+	} else {
+		struct mlx4_ib_qpg_data *qpg_data;
+		if (attr->qpg_parent == NULL)
+			return -EINVAL;
+		if (IS_ERR(attr->qpg_parent))
+			return -EINVAL;
+		qpg_data = to_mqp(attr->qpg_parent)->qpg_data;
+		if (qpg_data == NULL)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_TX &&
+		    !qpg_data->tss_child_count)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_RX &&
+		    !qpg_data->rss_child_count)
+			return -EINVAL;
+	}
+	return 0;
+}
+
 struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 				struct ib_qp_init_attr *init_attr,
 				struct ib_udata *udata)
@@ -787,8 +1064,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	    (udata || init_attr->qp_type != IB_QPT_UD))
 		return ERR_PTR(-EINVAL);
 
-	if (init_attr->qpg_type != IB_QPG_NONE)
-		return ERR_PTR(-ENOSYS);
+	err = check_qpg_attr(to_mdev(pd->device), init_attr);
+	if (err)
+		return ERR_PTR(err);
 
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
@@ -1235,6 +1513,43 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
 		context->rlkey |= (1 << 4);
 
+	if ((attr_mask & IB_QP_GROUP_RSS) &&
+		(qp->qpg_data->rss_child_count > 1)) {
+		struct mlx4_ib_qpg_data *qpg_data = qp->qpg_data;
+		void *rss_context_base = &context->pri_path;
+		struct mlx4_rss_context *rss_context =
+			(struct mlx4_rss_context *) (rss_context_base
+					+ MLX4_RSS_OFFSET_IN_QPC_PRI_PATH);
+
+		context->flags |= cpu_to_be32(1 << MLX4_RSS_QPC_FLAG_OFFSET);
+
+		/* This should be tbl_sz_base_qpn */
+		rss_context->base_qpn = cpu_to_be32(qpg_data->rss_qpn_base |
+				(ilog2(qpg_data->rss_child_count) << 24));
+		rss_context->default_qpn = cpu_to_be32(qpg_data->rss_qpn_base);
+		/* This should be flags_hash_fn */
+		rss_context->flags = MLX4_RSS_TCP_IPV6 |
+				     MLX4_RSS_TCP_IPV4;
+		if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UDP_RSS) {
+			rss_context->base_qpn_udp = rss_context->default_qpn;
+			rss_context->flags |= MLX4_RSS_IPV6 |
+					MLX4_RSS_IPV4     |
+					MLX4_RSS_UDP_IPV6 |
+					MLX4_RSS_UDP_IPV4;
+		}
+		if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_TOP) {
+			static const u32 rsskey[10] = { 0xD181C62C, 0xF7F4DB5B,
+				0x1983A2FC, 0x943E1ADB, 0xD9389E6B, 0xD1039C2C,
+				0xA74499AD, 0x593D56D9, 0xF3253C06, 0x2ADC1FFC};
+			rss_context->hash_fn = MLX4_RSS_HASH_TOP;
+			memcpy(rss_context->rss_key, rsskey,
+				sizeof(rss_context->rss_key));
+		} else {
+			rss_context->hash_fn = MLX4_RSS_HASH_XOR;
+			memset(rss_context->rss_key, 0,
+				sizeof(rss_context->rss_key));
+		}
+	}
 	/*
 	 * Before passing a kernel QP to the HW, make sure that the
 	 * ownership bits of the send queue are set and the SQ
@@ -2191,6 +2506,11 @@ done:
 	if (qp->flags & MLX4_IB_QP_LSO)
 		qp_init_attr->create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
 
+	qp_init_attr->qpg_type = ibqp->qpg_type;
+	if (ibqp->qpg_type == IB_QPG_PARENT)
+		qp_attr->cap.qpg_tss_mask_sz = qp->qpg_data->qpg_tss_mask_sz;
+	else
+		qp_attr->cap.qpg_tss_mask_sz = 0;
 out:
 	mutex_unlock(&qp->mutex);
 	return err;
-- 
1.7.1



* [PATCH for-next 6/7] IB/ipoib: Implement vectorization restructure as pre-step for TSS/RSS
       [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2012-05-08 16:22   ` [PATCH for-next 5/7] IB/mlx4: Add support for " Or Gerlitz
@ 2012-05-08 16:22   ` Or Gerlitz
  2012-05-08 16:22   ` [PATCH for-next 7/7] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
  6 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch is a restructuring step needed to implement RSS (Receive Side
Scaling) and TSS (multi-queue transmit) for IPoIB.

The following structures and flows are changed:

- Addition of struct ipoib_recv_ring and struct ipoib_send_ring, which hold
the per-RX-ring and per-TX-ring fields respectively. These fields replicate,
per ring, the receive and send fields previously kept in struct ipoib_dev_priv.

- Add per send/receive ring stats counters. These counters are accessible
through ethtool. Net device stats are no longer accumulated; instead,
ndo_get_stats is implemented.

- Use the multi-queue APIs for TX and RX: alloc_netdev_mqs(), the
netif_xxx_subqueue() / netif_subqueue_yyy() helpers, a per-TX-queue timer,
and a NAPI instance per RX queue.

As this patch is an intermediate step, the number of RX and TX rings is
fixed at one, and the single TX ring's and single RX ring's QP/CQs are
still taken from the "priv" structure.

The Address Handle (AH) garbage collection mechanism was changed such that
the data path uses a reference count (incremented on post send, decremented
on send completion), and the AH GC thread tests the reference count for zero
instead of comparing tx_head to last_send. Some change was unavoidable here,
since the SAME AH can be used by multiple TX rings: the skb hashing (which
uses the L3/L4 headers) can map the same neighbor to multiple TX rings.

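The refcount-based reaping described above can be sketched in user space
as follows. This is an illustrative model only: the names demo_ah,
ah_get, ah_put and ah_reap are made up for the demo, and ah->freed
stands in for the real destroy/free path of an AH on the dead_ahs list:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model of an AH whose lifetime is governed by in-flight sends */
struct demo_ah {
	atomic_int inflight;  /* posted-but-uncompleted sends using this AH */
	bool dead;            /* queued for garbage collection */
	bool freed;
};

/* Data path: pin the AH when posting a send, on any TX ring */
static void ah_get(struct demo_ah *ah)
{
	atomic_fetch_add(&ah->inflight, 1);
}

/* Data path: unpin on send completion */
static void ah_put(struct demo_ah *ah)
{
	atomic_fetch_sub(&ah->inflight, 1);
}

/*
 * GC thread: a dead AH may be freed only once no ring still has sends
 * in flight on it.  Returns true if the AH was freed.
 */
static bool ah_reap(struct demo_ah *ah)
{
	if (ah->dead && atomic_load(&ah->inflight) == 0) {
		ah->freed = true;
		return true;
	}
	return false;
}
```

The tx_head/last_send comparison this replaces assumed a single send
queue; with one counter shared by all rings, the zero test holds no
matter which rings the skb hash spread the sends across.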
Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |   89 +++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  102 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |   92 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  542 +++++++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  236 +++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   34 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   63 ++-
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 8 files changed, 874 insertions(+), 286 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 86df632..fb880a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -152,6 +152,7 @@ struct ipoib_rx_buf {
 
 struct ipoib_tx_buf {
 	struct sk_buff *skb;
+	struct ipoib_ah *ah;
 	u64		mapping[MAX_SKB_FRAGS + 1];
 };
 
@@ -209,6 +210,7 @@ struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	int index; /* For ring counters */
 };
 
 struct ipoib_cm_tx {
@@ -223,6 +225,7 @@ struct ipoib_cm_tx {
 	unsigned	     tx_tail;
 	unsigned long	     flags;
 	u32		     mtu;
+	int index; /* For ndo_select_queue and ring counters */
 };
 
 struct ipoib_cm_rx_buf {
@@ -253,6 +256,9 @@ struct ipoib_cm_dev_priv {
 	int			nonsrq_conn_qp;
 	int			max_cm_mtu;
 	int			num_frags;
+	u32			rx_cq_ind;
+	u32			tx_cq_ind;
+	u32			tx_ring_ind;
 };
 
 struct ipoib_ethtool_st {
@@ -261,6 +267,59 @@ struct ipoib_ethtool_st {
 };
 
 /*
+ * Per QP stats
+ */
+
+struct ipoib_tx_ring_stats {
+	unsigned long tx_packets;
+	unsigned long tx_bytes;
+	unsigned long tx_errors;
+	unsigned long tx_dropped;
+};
+
+struct ipoib_rx_ring_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long rx_errors;
+	unsigned long rx_dropped;
+};
+
+/*
+ * Encapsulates the per send QP information
+ */
+struct ipoib_send_ring {
+	struct net_device	*dev;
+	struct ib_cq		*send_cq;
+	struct ib_qp		*send_qp;
+	struct ipoib_tx_buf	*tx_ring;
+	unsigned		tx_head;
+	unsigned		tx_tail;
+	struct ib_sge		tx_sge[MAX_SKB_FRAGS + 1];
+	struct ib_send_wr	tx_wr;
+	unsigned		tx_outstanding;
+	struct ib_wc		tx_wc[MAX_SEND_CQE];
+	struct timer_list	poll_timer;
+	struct ipoib_tx_ring_stats stats;
+	unsigned		index;
+};
+
+/*
+ * Encapsulates the per recv QP information
+ */
+struct ipoib_recv_ring {
+	struct net_device	*dev;
+	struct ib_qp		*recv_qp;
+	struct ib_cq		*recv_cq;
+	struct ib_wc		ibwc[IPOIB_NUM_WC];
+	struct napi_struct	napi;
+	struct ipoib_rx_buf	*rx_ring;
+	struct ib_recv_wr	rx_wr;
+	struct ib_sge		rx_sge[IPOIB_UD_RX_SG];
+	struct ipoib_rx_ring_stats stats;
+	unsigned		index;
+};
+
+/*
  * Device private locking: network stack tx_lock protects members used
  * in TX fast path, lock protects everything else.  lock nests inside
  * of tx_lock (ie tx_lock must be acquired first if needed).
@@ -270,8 +329,6 @@ struct ipoib_dev_priv {
 
 	struct net_device *dev;
 
-	struct napi_struct napi;
-
 	unsigned long flags;
 
 	struct mutex vlan_mutex;
@@ -310,21 +367,6 @@ struct ipoib_dev_priv {
 	unsigned int mcast_mtu;
 	unsigned int max_ib_mtu;
 
-	struct ipoib_rx_buf *rx_ring;
-
-	struct ipoib_tx_buf *tx_ring;
-	unsigned	     tx_head;
-	unsigned	     tx_tail;
-	struct ib_sge	     tx_sge[MAX_SKB_FRAGS + 1];
-	struct ib_send_wr    tx_wr;
-	unsigned	     tx_outstanding;
-	struct ib_wc	     send_wc[MAX_SEND_CQE];
-
-	struct ib_recv_wr    rx_wr;
-	struct ib_sge	     rx_sge[IPOIB_UD_RX_SG];
-
-	struct ib_wc ibwc[IPOIB_NUM_WC];
-
 	struct list_head dead_ahs;
 
 	struct ib_event_handler event_handler;
@@ -345,6 +387,10 @@ struct ipoib_dev_priv {
 	int	hca_caps;
 	struct ipoib_ethtool_st ethtool;
 	struct timer_list poll_timer;
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	unsigned int num_rx_queues;
+	unsigned int num_tx_queues;
 };
 
 struct ipoib_ah {
@@ -352,7 +398,7 @@ struct ipoib_ah {
 	struct ib_ah	  *ah;
 	struct list_head   list;
 	struct kref	   ref;
-	unsigned	   last_send;
+	atomic_t	   refcnt;
 };
 
 struct ipoib_path {
@@ -415,8 +461,8 @@ extern struct workqueue_struct *ipoib_workqueue;
 /* functions */
 
 int ipoib_poll(struct napi_struct *napi, int budget);
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr);
+void ipoib_ib_completion(struct ib_cq *cq, void *recv_ring_ptr);
+void ipoib_send_comp_handler(struct ib_cq *cq, void *send_ring_ptr);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 				 struct ib_pd *pd, struct ib_ah_attr *attr);
@@ -436,7 +482,8 @@ void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_mark_paths_invalid(struct net_device *dev);
 void ipoib_flush_paths(struct net_device *dev);
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *format);
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *format,
+					struct ipoib_dev_priv *temp_priv);
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush_light(struct work_struct *work);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 014504d..d708ed2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -250,8 +250,6 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->recv_cq, /* For drain WR */
-		.recv_cq = priv->recv_cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
@@ -259,12 +257,20 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 		.qp_type = IB_QPT_RC,
 		.qp_context = p,
 	};
+	int index;
 
 	if (!ipoib_cm_has_srq(dev)) {
 		attr.cap.max_recv_wr  = ipoib_recvq_size;
 		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
 	}
 
+	index = (priv->cm.rx_cq_ind < priv->num_rx_queues) ?
+			priv->cm.rx_cq_ind : 0;
+	priv->cm.rx_cq_ind = index + 1;
+	/* send_cq for drain WR */
+	attr.send_cq = attr.recv_cq = priv->recv_ring[index].recv_cq;
+	p->index = index;
+
 	return ib_create_qp(priv->pd, &attr);
 }
 
@@ -593,7 +599,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		ipoib_dbg(priv, "cm recv error "
 			   "(status=%d, wrid=%d vend_err %x)\n",
 			   wc->status, wr_id, wc->vendor_err);
-		++dev->stats.rx_dropped;
+		++priv->recv_ring[p->index].stats.rx_dropped;
 		if (has_srq)
 			goto repost;
 		else {
@@ -646,7 +652,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		 * this packet and reuse the old buffer.
 		 */
 		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
-		++dev->stats.rx_dropped;
+		++priv->recv_ring[p->index].stats.rx_dropped;
 		goto repost;
 	}
 
@@ -663,8 +669,8 @@ copied:
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++priv->recv_ring[p->index].stats.rx_packets;
+	priv->recv_ring[p->index].stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
@@ -691,17 +697,18 @@ repost:
 static inline int post_send(struct ipoib_dev_priv *priv,
 			    struct ipoib_cm_tx *tx,
 			    unsigned int wr_id,
-			    u64 addr, int len)
+			    u64 addr, int len,
+				struct ipoib_send_ring *send_ring)
 {
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge[0].addr          = addr;
-	priv->tx_sge[0].length        = len;
+	send_ring->tx_sge[0].addr          = addr;
+	send_ring->tx_sge[0].length        = len;
 
-	priv->tx_wr.num_sge	= 1;
-	priv->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
+	send_ring->tx_wr.num_sge	= 1;
+	send_ring->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(tx->qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
@@ -710,12 +717,17 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	struct ipoib_cm_tx_buf *tx_req;
 	u64 addr;
 	int rc;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
+
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (unlikely(skb->len > tx->mtu)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 			   skb->len, tx->mtu);
-		++dev->stats.tx_dropped;
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_dropped;
+		++send_ring->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, tx->mtu - IPOIB_ENCAP_LEN);
 		return;
 	}
@@ -734,7 +746,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	tx_req->skb = skb;
 	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
@@ -742,22 +754,23 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	tx_req->mapping = addr;
 
 	rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1),
-		       addr, skb->len);
+		       addr, skb->len, send_ring);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
 		dev_kfree_skb_any(skb);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 		++tx->tx_head;
 
-		if (++priv->tx_outstanding == ipoib_sendq_size) {
+		if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
 				  tx->qp->qp_num);
-			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+			if (ib_req_notify_cq(send_ring->send_cq,
+					     IB_CQ_NEXT_COMP))
 				ipoib_warn(priv, "request notify on send CQ failed\n");
-			netif_stop_queue(dev);
+			netif_stop_subqueue(dev, queue_index);
 		}
 	}
 }
@@ -769,6 +782,8 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM;
 	struct ipoib_cm_tx_buf *tx_req;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 
 	ipoib_dbg_data(priv, "cm send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -780,22 +795,24 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	}
 
 	tx_req = &tx->tx_ring[wr_id];
+	queue_index = skb_get_queue_mapping(tx_req->skb);
+	send_ring = priv->send_ring + queue_index;
 
 	ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE);
 
 	/* FIXME: is this right? Shouldn't we only increment on success? */
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
 	netif_tx_lock(dev);
 
 	++tx->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    __netif_subqueue_stopped(dev, queue_index) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+		netif_wake_subqueue(dev, queue_index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -1016,8 +1033,6 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.send_cq		= priv->recv_cq,
-		.recv_cq		= priv->recv_cq,
 		.srq			= priv->cm.srq,
 		.cap.max_send_wr	= ipoib_sendq_size,
 		.cap.max_send_sge	= 1,
@@ -1025,6 +1040,18 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 		.qp_type		= IB_QPT_RC,
 		.qp_context		= tx
 	};
+	int index;
+
+	/* CM uses ipoib_ib_completion for TX completions, processed via NAPI */
+	index = (priv->cm.tx_cq_ind < priv->num_rx_queues) ?
+			priv->cm.tx_cq_ind : 0;
+	priv->cm.tx_cq_ind = index + 1;
+	attr.send_cq = attr.recv_cq = priv->recv_ring[index].recv_cq;
+	/* For ndo_select_queue */
+	index = (priv->cm.tx_ring_ind < priv->num_tx_queues) ?
+			priv->cm.tx_ring_ind : 0;
+	priv->cm.tx_ring_ind = index + 1;
+	tx->index = index;
 
 	return ib_create_qp(priv->pd, &attr);
 }
@@ -1177,16 +1204,21 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 timeout:
 
 	while ((int) p->tx_tail - (int) p->tx_head < 0) {
+		struct ipoib_send_ring *send_ring;
+		u16 queue_index;
 		tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
 		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
 				    DMA_TO_DEVICE);
 		dev_kfree_skb_any(tx_req->skb);
 		++p->tx_tail;
+		queue_index = skb_get_queue_mapping(tx_req->skb);
+		send_ring = priv->send_ring + queue_index;
 		netif_tx_lock_bh(p->dev);
-		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-		    netif_queue_stopped(p->dev) &&
+		if (unlikely(--send_ring->tx_outstanding ==
+				(ipoib_sendq_size >> 1)) &&
+		    __netif_subqueue_stopped(p->dev, queue_index) &&
 		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			netif_wake_queue(p->dev);
+			netif_wake_subqueue(p->dev, queue_index);
 		netif_tx_unlock_bh(p->dev);
 	}
 
@@ -1456,6 +1488,8 @@ static ssize_t set_mode(struct device *d, struct device_attribute *attr,
 {
 	struct net_device *dev = to_net_dev(d);
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
 
 	if (!rtnl_trylock())
 		return restart_syscall();
@@ -1467,7 +1501,11 @@ static ssize_t set_mode(struct device *d, struct device_attribute *attr,
 			   "will cause multicast packet drops\n");
 		netdev_update_features(dev);
 		rtnl_unlock();
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+			send_ring++;
+		}
 
 		ipoib_flush_paths(dev);
 		return count;
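The CM QP changes above bind each new RX/TX QP to the next ring's CQ via a wrapping cursor (ipoib_cm_create_rx_qp()'s rx_cq_ind, ipoib_cm_create_tx_qp()'s tx_cq_ind/tx_ring_ind). A minimal stand-alone sketch of that round-robin pattern, with illustrative names that are not from the kernel tree:

```c
/* Round-robin ring selection as in the CM QP creation paths above:
 * return the cursor if it is still in range, otherwise wrap to 0,
 * then advance the cursor for the next caller. */
static inline unsigned int next_ring_index(unsigned int *cursor,
					   unsigned int num_rings)
{
	unsigned int index = (*cursor < num_rings) ? *cursor : 0;

	*cursor = index + 1;
	return index;
}
```

With three rings the cursor yields 0, 1, 2 and then wraps back to 0, so CM connections spread evenly across the per-ring CQs.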
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index 29bc7b5..f2cc283 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -57,7 +57,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
 			      struct ethtool_coalesce *coal)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int ret;
+	int ret, i;
+
 
 	/*
 	 * These values are saved in the private data and returned
@@ -67,23 +68,100 @@ static int ipoib_set_coalesce(struct net_device *dev,
 	    coal->rx_max_coalesced_frames > 0xffff)
 		return -EINVAL;
 
-	ret = ib_modify_cq(priv->recv_cq, coal->rx_max_coalesced_frames,
-			   coal->rx_coalesce_usecs);
-	if (ret && ret != -ENOSYS) {
-		ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
-		return ret;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ret = ib_modify_cq(priv->recv_ring[i].recv_cq,
+					coal->rx_max_coalesced_frames,
+					coal->rx_coalesce_usecs);
+		if (ret && ret != -ENOSYS) {
+			ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
+			return ret;
+		}
 	}
-
 	priv->ethtool.coalesce_usecs       = coal->rx_coalesce_usecs;
 	priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;
 
 	return 0;
 }
 
+static void ipoib_get_strings(struct net_device *dev, u32 stringset, u8 *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, index = 0;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < priv->num_rx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_dropped", i);
+		}
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_dropped", i);
+		}
+		break;
+	}
+}
+
+static int ipoib_get_sset_count(struct net_device *dev, int sset)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	switch (sset) {
+	case ETH_SS_STATS:
+		return (priv->num_rx_queues + priv->num_tx_queues) * 4;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static void ipoib_get_ethtool_stats(struct net_device *dev,
+				struct ethtool_stats *stats, uint64_t *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	int index = 0;
+	int i;
+
+	/* Get per QP stats */
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rx_stats = &recv_ring->stats;
+		data[index++] = rx_stats->rx_packets;
+		data[index++] = rx_stats->rx_bytes;
+		data[index++] = rx_stats->rx_errors;
+		data[index++] = rx_stats->rx_dropped;
+		recv_ring++;
+	}
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tx_stats = &send_ring->stats;
+		data[index++] = tx_stats->tx_packets;
+		data[index++] = tx_stats->tx_bytes;
+		data[index++] = tx_stats->tx_errors;
+		data[index++] = tx_stats->tx_dropped;
+		send_ring++;
+	}
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo		= ipoib_get_drvinfo,
 	.get_coalesce		= ipoib_get_coalesce,
 	.set_coalesce		= ipoib_set_coalesce,
+	.get_strings		= ipoib_get_strings,
+	.get_sset_count		= ipoib_get_sset_count,
+	.get_ethtool_stats	= ipoib_get_ethtool_stats,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
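The ethtool hunks above rely on a contract: the count returned by get_sset_count() must match the number of strings written by get_strings() and the number of values filled by get_ethtool_stats() — here four counters per RX ring plus four per TX ring. A stand-alone model of that invariant (GSTRING_LEN stands in for ETH_GSTRING_LEN; names are illustrative):

```c
#include <stdio.h>

#define GSTRING_LEN 32	/* stand-in for ETH_GSTRING_LEN */

/* Emit "rxN_*"/"txN_*" stat names, one GSTRING_LEN slot each, and
 * return the number written; this must equal the sset count,
 * (num_rx + num_tx) * 4, or ethtool output is corrupted. */
static int fill_ring_stat_names(char *buf, int num_rx, int num_tx)
{
	static const char *rx_fmt[] = { "rx%d_packets", "rx%d_bytes",
					"rx%d_errors", "rx%d_dropped" };
	static const char *tx_fmt[] = { "tx%d_packets", "tx%d_bytes",
					"tx%d_errors", "tx%d_dropped" };
	int i, j, index = 0;

	for (i = 0; i < num_rx; i++)
		for (j = 0; j < 4; j++)
			snprintf(buf + (index++) * GSTRING_LEN, GSTRING_LEN,
				 rx_fmt[j], i);
	for (i = 0; i < num_tx; i++)
		for (j = 0; j < 4; j++)
			snprintf(buf + (index++) * GSTRING_LEN, GSTRING_LEN,
				 tx_fmt[j], i);
	return index;
}
```

If the two ever disagree, ethtool -S prints shifted or truncated statistics, which is why the patch derives both from num_rx_queues/num_tx_queues.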
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 5c1bc99..55f3e35 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -64,7 +64,6 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		return ERR_PTR(-ENOMEM);
 
 	ah->dev       = dev;
-	ah->last_send = 0;
 	kref_init(&ah->ref);
 
 	vah = ib_create_ah(pd, attr);
@@ -72,6 +71,7 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		kfree(ah);
 		ah = (struct ipoib_ah *)vah;
 	} else {
+		atomic_set(&ah->refcnt, 0);
 		ah->ah = vah;
 		ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah);
 	}
@@ -129,29 +129,32 @@ static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 
 }
 
-static int ipoib_ib_post_receive(struct net_device *dev, int id)
+static int ipoib_ib_post_receive(struct net_device *dev,
+			struct ipoib_recv_ring *recv_ring, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
 	int ret;
 
-	priv->rx_wr.wr_id   = id | IPOIB_OP_RECV;
-	priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
-	priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
+	recv_ring->rx_wr.wr_id   = id | IPOIB_OP_RECV;
+	recv_ring->rx_sge[0].addr = recv_ring->rx_ring[id].mapping[0];
+	recv_ring->rx_sge[1].addr = recv_ring->rx_ring[id].mapping[1];
 
 
-	ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
+	ret = ib_post_recv(recv_ring->recv_qp, &recv_ring->rx_wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
-		dev_kfree_skb_any(priv->rx_ring[id].skb);
-		priv->rx_ring[id].skb = NULL;
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[id].mapping);
+		dev_kfree_skb_any(recv_ring->rx_ring[id].skb);
+		recv_ring->rx_ring[id].skb = NULL;
 	}
 
 	return ret;
 }
 
-static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
+static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev,
+					  struct ipoib_recv_ring *recv_ring,
+					  int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
@@ -174,7 +177,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	 */
 	skb_reserve(skb, 4);
 
-	mapping = priv->rx_ring[id].mapping;
+	mapping = recv_ring->rx_ring[id].mapping;
 	mapping[0] = ib_dma_map_single(priv->ca, skb->data, buf_size,
 				       DMA_FROM_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, mapping[0])))
@@ -192,7 +195,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 			goto partial_error;
 	}
 
-	priv->rx_ring[id].skb = skb;
+	recv_ring->rx_ring[id].skb = skb;
 	return skb;
 
 partial_error:
@@ -202,18 +205,23 @@ error:
 	return NULL;
 }
 
-static int ipoib_ib_post_receives(struct net_device *dev)
+static int ipoib_ib_post_ring_receives(struct net_device *dev,
+				      struct ipoib_recv_ring *recv_ring)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
 	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_alloc_rx_skb(dev, i)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+		if (!ipoib_alloc_rx_skb(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				"failed to allocate receive buffer (%d,%d)\n",
+				recv_ring->index, i);
 			return -ENOMEM;
 		}
-		if (ipoib_ib_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
+		if (ipoib_ib_post_receive(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				"ipoib_ib_post_receive failed for buf (%d,%d)\n",
+				recv_ring->index, i);
 			return -EIO;
 		}
 	}
@@ -221,7 +229,27 @@ static int ipoib_ib_post_receives(struct net_device *dev)
 	return 0;
 }
 
-static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static int ipoib_ib_post_receives(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int err;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		err = ipoib_ib_post_ring_receives(dev, recv_ring);
+		if (err)
+			return err;
+		recv_ring++;
+	}
+
+	return 0;
+}
+
+static void ipoib_ib_handle_rx_wc(struct net_device *dev,
+				  struct ipoib_recv_ring *recv_ring,
+				  struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
@@ -238,16 +266,16 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	skb  = priv->rx_ring[wr_id].skb;
+	skb  = recv_ring->rx_ring[wr_id].skb;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		if (wc->status != IB_WC_WR_FLUSH_ERR)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
 				   wc->status, wr_id, wc->vendor_err);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[wr_id].mapping);
 		dev_kfree_skb_any(skb);
-		priv->rx_ring[wr_id].skb = NULL;
+		recv_ring->rx_ring[wr_id].skb = NULL;
 		return;
 	}
 
@@ -258,18 +286,20 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
 
-	memcpy(mapping, priv->rx_ring[wr_id].mapping,
+	memcpy(mapping, recv_ring->rx_ring[wr_id].mapping,
 	       IPOIB_UD_RX_SG * sizeof *mapping);
 
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
-	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
-		++dev->stats.rx_dropped;
+	if (unlikely(!ipoib_alloc_rx_skb(dev, recv_ring, wr_id))) {
+		++recv_ring->stats.rx_dropped;
 		goto repost;
 	}
 
+	skb_record_rx_queue(skb, recv_ring->index);
+
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
@@ -292,18 +322,18 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++recv_ring->stats.rx_packets;
+	recv_ring->stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	if ((dev->features & NETIF_F_RXCSUM) &&
 			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	napi_gro_receive(&priv->napi, skb);
+	napi_gro_receive(&recv_ring->napi, skb);
 
 repost:
-	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
+	if (unlikely(ipoib_ib_post_receive(dev, recv_ring, wr_id)))
 		ipoib_warn(priv, "ipoib_ib_post_receive failed "
 			   "for buf %d\n", wr_id);
 }
@@ -372,11 +402,14 @@ static void ipoib_dma_unmap_tx(struct ib_device *ca,
 	}
 }
 
-static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
+static void ipoib_ib_handle_tx_wc(struct ipoib_send_ring *send_ring,
+				struct ib_wc *wc)
 {
+	struct net_device *dev = send_ring->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_ah *ah;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -387,20 +420,23 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	tx_req = &send_ring->tx_ring[wr_id];
+
+	ah = tx_req->ah;
+	atomic_dec(&ah->refcnt);
 
 	ipoib_dma_unmap_tx(priv->ca, tx_req);
 
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
-	++priv->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
-	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+	++send_ring->tx_tail;
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+			__netif_subqueue_stopped(dev, send_ring->index) &&
+			test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		netif_wake_subqueue(dev, send_ring->index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR)
@@ -409,45 +445,47 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 			   wc->status, wr_id, wc->vendor_err);
 }
 
-static int poll_tx(struct ipoib_dev_priv *priv)
+static int poll_tx_ring(struct ipoib_send_ring *send_ring)
 {
 	int n, i;
 
-	n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
+	n = ib_poll_cq(send_ring->send_cq, MAX_SEND_CQE, send_ring->tx_wc);
 	for (i = 0; i < n; ++i)
-		ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
+		ipoib_ib_handle_tx_wc(send_ring, send_ring->tx_wc + i);
 
 	return n == MAX_SEND_CQE;
 }
 
 int ipoib_poll(struct napi_struct *napi, int budget)
 {
-	struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi);
-	struct net_device *dev = priv->dev;
+	struct ipoib_recv_ring *rx_ring;
+	struct net_device *dev;
 	int done;
 	int t;
 	int n, i;
 
 	done  = 0;
+	rx_ring = container_of(napi, struct ipoib_recv_ring, napi);
+	dev = rx_ring->dev;
 
 poll_more:
 	while (done < budget) {
 		int max = (budget - done);
 
 		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, t, rx_ring->ibwc);
 
 		for (i = 0; i < n; i++) {
-			struct ib_wc *wc = priv->ibwc + i;
+			struct ib_wc *wc = rx_ring->ibwc + i;
 
 			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
 				if (wc->wr_id & IPOIB_OP_CM)
 					ipoib_cm_handle_rx_wc(dev, wc);
 				else
-					ipoib_ib_handle_rx_wc(dev, wc);
+					ipoib_ib_handle_rx_wc(dev, rx_ring, wc);
 			} else
-				ipoib_cm_handle_tx_wc(priv->dev, wc);
+				ipoib_cm_handle_tx_wc(dev, wc);
 		}
 
 		if (n != t)
@@ -456,7 +494,7 @@ poll_more:
 
 	if (done < budget) {
 		napi_complete(napi);
-		if (unlikely(ib_req_notify_cq(priv->recv_cq,
+		if (unlikely(ib_req_notify_cq(rx_ring->recv_cq,
 					      IB_CQ_NEXT_COMP |
 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
 		    napi_reschedule(napi))
@@ -466,36 +504,37 @@ poll_more:
 	return done;
 }
 
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
+void ipoib_ib_completion(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct net_device *dev = dev_ptr;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring = (struct ipoib_recv_ring *) ctx_ptr;
 
-	napi_schedule(&priv->napi);
+	napi_schedule(&recv_ring->napi);
 }
 
-static void drain_tx_cq(struct net_device *dev)
+static void drain_tx_cq(struct ipoib_send_ring *send_ring)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct netdev_queue *txq;
+
+	txq = netdev_get_tx_queue(send_ring->dev, send_ring->index);
+	__netif_tx_lock(txq, smp_processor_id());
 
-	netif_tx_lock(dev);
-	while (poll_tx(priv))
+	while (poll_tx_ring(send_ring))
 		; /* nothing */
 
-	if (netif_queue_stopped(dev))
-		mod_timer(&priv->poll_timer, jiffies + 1);
+	if (__netif_subqueue_stopped(send_ring->dev, send_ring->index))
+		mod_timer(&send_ring->poll_timer, jiffies + 1);
 
-	netif_tx_unlock(dev);
+	__netif_tx_unlock(txq);
 }
 
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr)
+void ipoib_send_comp_handler(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev_ptr);
+	struct ipoib_send_ring *send_ring = (struct ipoib_send_ring *) ctx_ptr;
 
-	mod_timer(&priv->poll_timer, jiffies);
+	mod_timer(&send_ring->poll_timer, jiffies);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
+static inline int post_send(struct ipoib_send_ring *send_ring,
 			    unsigned int wr_id,
 			    struct ib_ah *address, u32 qpn,
 			    struct ipoib_tx_buf *tx_req,
@@ -509,30 +548,30 @@ static inline int post_send(struct ipoib_dev_priv *priv,
 	u64 *mapping = tx_req->mapping;
 
 	if (skb_headlen(skb)) {
-		priv->tx_sge[0].addr         = mapping[0];
-		priv->tx_sge[0].length       = skb_headlen(skb);
+		send_ring->tx_sge[0].addr         = mapping[0];
+		send_ring->tx_sge[0].length       = skb_headlen(skb);
 		off = 1;
 	} else
 		off = 0;
 
 	for (i = 0; i < nr_frags; ++i) {
-		priv->tx_sge[i + off].addr = mapping[i + off];
-		priv->tx_sge[i + off].length = skb_frag_size(&frags[i]);
+		send_ring->tx_sge[i + off].addr = mapping[i + off];
+		send_ring->tx_sge[i + off].length = skb_frag_size(&frags[i]);
 	}
-	priv->tx_wr.num_sge	     = nr_frags + off;
-	priv->tx_wr.wr_id 	     = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn = qpn;
-	priv->tx_wr.wr.ud.ah 	     = address;
+	send_ring->tx_wr.num_sge	 = nr_frags + off;
+	send_ring->tx_wr.wr_id		 = wr_id;
+	send_ring->tx_wr.wr.ud.remote_qpn = qpn;
+	send_ring->tx_wr.wr.ud.ah	 = address;
 
 	if (head) {
-		priv->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
-		priv->tx_wr.wr.ud.header = head;
-		priv->tx_wr.wr.ud.hlen	 = hlen;
-		priv->tx_wr.opcode	 = IB_WR_LSO;
+		send_ring->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
+		send_ring->tx_wr.wr.ud.header = head;
+		send_ring->tx_wr.wr.ud.hlen	 = hlen;
+		send_ring->tx_wr.opcode	 = IB_WR_LSO;
 	} else
-		priv->tx_wr.opcode	 = IB_WR_SEND;
+		send_ring->tx_wr.opcode	 = IB_WR_SEND;
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(send_ring->send_qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
@@ -540,16 +579,23 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 	int hlen, rc;
 	void *phead;
+	int req_index;
+
+	/* Find the correct QP to submit the IO to */
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (skb_is_gso(skb)) {
 		hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
 		phead = skb->data;
 		if (unlikely(!skb_pull(skb, hlen))) {
 			ipoib_warn(priv, "linear data too small\n");
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			dev_kfree_skb_any(skb);
 			return;
 		}
@@ -557,8 +603,8 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
 			ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 				   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
 			return;
 		}
@@ -576,47 +622,54 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
+	req_index = send_ring->tx_head & (ipoib_sendq_size - 1);
+	tx_req = &send_ring->tx_ring[req_index];
 	tx_req->skb = skb;
+	tx_req->ah = address;
 	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		priv->tx_wr.send_flags |= IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags |= IB_SEND_IP_CSUM;
 	else
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
 
-	if (++priv->tx_outstanding == ipoib_sendq_size) {
+	if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-		if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+		if (ib_req_notify_cq(send_ring->send_cq, IB_CQ_NEXT_COMP))
 			ipoib_warn(priv, "request notify on send CQ failed\n");
-		netif_stop_queue(dev);
+		netif_stop_subqueue(dev, queue_index);
 	}
 
-	rc = post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
+	/*
+	 * Incrementing the reference count after posting could race
+	 * with the completion handler, so increment it before posting
+	 * and decrement it on error.
+	 */
+	atomic_inc(&address->refcnt);
+	rc = post_send(send_ring, req_index,
 		       address->ah, qpn, tx_req, phead, hlen);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
-		--priv->tx_outstanding;
+		++send_ring->stats.tx_errors;
+		--send_ring->tx_outstanding;
 		ipoib_dma_unmap_tx(priv->ca, tx_req);
 		dev_kfree_skb_any(skb);
-		if (netif_queue_stopped(dev))
-			netif_wake_queue(dev);
+		atomic_dec(&address->refcnt);
+		if (__netif_subqueue_stopped(dev, queue_index))
+			netif_wake_subqueue(dev, queue_index);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+		++send_ring->tx_head;
 		skb_orphan(skb);
-
 	}
 
-	if (unlikely(priv->tx_outstanding > MAX_SEND_CQE))
-		while (poll_tx(priv))
+	if (unlikely(send_ring->tx_outstanding > MAX_SEND_CQE))
+		while (poll_tx_ring(send_ring))
 			; /* nothing */
 }
 
@@ -631,7 +684,7 @@ static void __ipoib_reap_ah(struct net_device *dev)
 	spin_lock_irqsave(&priv->lock, flags);
 
 	list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list)
-		if ((int) priv->tx_tail - (int) ah->last_send >= 0) {
+		if (atomic_read(&ah->refcnt) == 0) {
 			list_del(&ah->list);
 			ib_destroy_ah(ah->ah);
 			kfree(ah);
@@ -656,7 +709,31 @@ void ipoib_reap_ah(struct work_struct *work)
 
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
-	drain_tx_cq((struct net_device *)ctx);
+	drain_tx_cq((struct ipoib_send_ring *)ctx);
+}
+
+static void ipoib_napi_enable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		netif_napi_add(dev, &recv_ring->napi,
+						ipoib_poll, 100);
+		napi_enable(&recv_ring->napi);
+		recv_ring++;
+	}
+}
+
+static void ipoib_napi_disable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		napi_disable(&priv->recv_ring[i].napi);
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
@@ -696,7 +773,7 @@ int ipoib_ib_dev_open(struct net_device *dev)
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_enable(&priv->napi);
+		ipoib_napi_enable(dev);
 
 	return 0;
 }
@@ -758,19 +835,47 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 static int recvs_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
 	int pending = 0;
-	int i;
+	int i, j;
 
-	for (i = 0; i < ipoib_recvq_size; ++i)
-		if (priv->rx_ring[i].skb)
-			++pending;
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; j++) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (recv_ring->rx_ring[i].skb)
+				++pending;
+		}
+		recv_ring++;
+	}
 
 	return pending;
 }
 
-void ipoib_drain_cq(struct net_device *dev)
+static int sends_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int pending = 0;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		/*
+		 * Note that since head and tail are unsigned, the
+		 * result of the subtraction is correct even when the
+		 * counters wrap around.
+		 */
+		pending += send_ring->tx_head - send_ring->tx_tail;
+		send_ring++;
+	}
+
+	return pending;
+}
+
+static void ipoib_drain_rx_ring(struct ipoib_dev_priv *priv,
+				struct ipoib_recv_ring *rx_ring)
+{
+	struct net_device *dev = priv->dev;
 	int i, n;
 
 	/*
@@ -781,42 +886,191 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_disable();
 
 	do {
-		n = ib_poll_cq(priv->recv_cq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, IPOIB_NUM_WC, rx_ring->ibwc);
 		for (i = 0; i < n; ++i) {
 			/*
 			 * Convert any successful completions to flush
 			 * errors to avoid passing packets up the
 			 * stack after bringing the device down.
 			 */
-			if (priv->ibwc[i].status == IB_WC_SUCCESS)
-				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+			if (rx_ring->ibwc[i].status == IB_WC_SUCCESS)
+				rx_ring->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
 
-			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			if (rx_ring->ibwc[i].wr_id & IPOIB_OP_RECV) {
+				if (rx_ring->ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev,
+							rx_ring->ibwc + i);
 				else
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_rx_wc(dev, rx_ring,
+							rx_ring->ibwc + i);
 			} else
-				ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+				ipoib_cm_handle_tx_wc(dev, rx_ring->ibwc + i);
 		}
 	} while (n == IPOIB_NUM_WC);
 
-	while (poll_tx(priv))
-		; /* nothing */
-
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+static void drain_rx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ipoib_drain_rx_ring(priv, recv_ring);
+		recv_ring++;
+	}
+}
+
+
+static void drain_tx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int bool_value = 0;
+	int i;
+
+	do {
+		bool_value = 0;
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			local_bh_disable();
+			bool_value |= poll_tx_ring(send_ring);
+			local_bh_enable();
+			send_ring++;
+		}
+	} while (bool_value);
+}
+
+void ipoib_drain_cq(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	drain_rx_rings(priv);
+
+	drain_tx_rings(priv);
+}
+
+static void ipoib_ib_send_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *tx_ring;
+	struct ipoib_tx_buf *tx_req;
+	int i;
+
+	tx_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		while ((int) tx_ring->tx_tail - (int) tx_ring->tx_head < 0) {
+			tx_req = &tx_ring->tx_ring[tx_ring->tx_tail &
+				  (ipoib_sendq_size - 1)];
+			ipoib_dma_unmap_tx(priv->ca, tx_req);
+			dev_kfree_skb_any(tx_req->skb);
+			++tx_ring->tx_tail;
+			--tx_ring->tx_outstanding;
+		}
+		tx_ring++;
+	}
+}
+
+static void ipoib_ib_recv_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i, j;
+
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; ++j) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			struct ipoib_rx_buf *rx_req;
+
+			rx_req = &recv_ring->rx_ring[i];
+			if (!rx_req->skb)
+				continue;
+			ipoib_ud_dma_unmap_rx(priv,
+					      recv_ring->rx_ring[i].mapping);
+			dev_kfree_skb_any(rx_req->skb);
+			rx_req->skb = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void set_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+	/* Init a timer per queue */
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		setup_timer(&send_ring->poll_timer, ipoib_ib_tx_timer_func,
+					(unsigned long) send_ring);
+		send_ring++;
+	}
+}
+
+static void del_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		del_timer_sync(&send_ring->poll_timer);
+		send_ring++;
+	}
+}
+
+static void set_tx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_send_ring *send_ring;
 	struct ib_qp_attr qp_attr;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(send_ring->send_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+					new_state);
+		send_ring++;
+	}
+}
+
+static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_recv_ring *recv_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(recv_ring->recv_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+					new_state);
+		recv_ring++;
+	}
+}
+
+static void set_rings_qp_state(struct ipoib_dev_priv *priv,
+				enum ib_qp_state new_state)
+{
+	set_tx_rings_qp_state(priv, new_state);
+
+	if (priv->num_rx_queues > 1)
+		set_rx_rings_qp_state(priv, new_state);
+}
+
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned long begin;
-	struct ipoib_tx_buf *tx_req;
+	struct ipoib_recv_ring *recv_ring;
 	int i;
 
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_disable(&priv->napi);
+		ipoib_napi_disable(dev);
 
 	ipoib_cm_dev_stop(dev);
 
@@ -824,42 +1078,24 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	 * Move our QP to the error state and then reinitialize in
 	 * when all work requests have completed or have been flushed.
 	 */
-	qp_attr.qp_state = IB_QPS_ERR;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to ERROR state\n");
+	set_rings_qp_state(priv, IB_QPS_ERR);
 
 	/* Wait for all sends and receives to complete */
 	begin = jiffies;
 
-	while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) {
+	while (sends_pending(dev) || recvs_pending(dev)) {
 		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "timing out; %d sends %d receives not completed\n",
-				   priv->tx_head - priv->tx_tail, recvs_pending(dev));
+				   sends_pending(dev), recvs_pending(dev));
 
 			/*
 			 * assume the HW is wedged and just free up
 			 * all our pending work requests.
 			 */
-			while ((int) priv->tx_tail - (int) priv->tx_head < 0) {
-				tx_req = &priv->tx_ring[priv->tx_tail &
-							(ipoib_sendq_size - 1)];
-				ipoib_dma_unmap_tx(priv->ca, tx_req);
-				dev_kfree_skb_any(tx_req->skb);
-				++priv->tx_tail;
-				--priv->tx_outstanding;
-			}
-
-			for (i = 0; i < ipoib_recvq_size; ++i) {
-				struct ipoib_rx_buf *rx_req;
-
-				rx_req = &priv->rx_ring[i];
-				if (!rx_req->skb)
-					continue;
-				ipoib_ud_dma_unmap_rx(priv,
-						      priv->rx_ring[i].mapping);
-				dev_kfree_skb_any(rx_req->skb);
-				rx_req->skb = NULL;
-			}
+			ipoib_ib_send_ring_stop(priv);
+
+			ipoib_ib_recv_ring_stop(priv);
 
 			goto timeout;
 		}
@@ -872,10 +1108,9 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	ipoib_dbg(priv, "All sends and receives done.\n");
 
 timeout:
-	del_timer_sync(&priv->poll_timer);
-	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
+	del_tx_poll_timers(priv);
+
+	set_rings_qp_state(priv, IB_QPS_RESET);
 
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
@@ -896,7 +1131,11 @@ timeout:
 		msleep(1);
 	}
 
-	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP);
+		recv_ring++;
+	}
 
 	return 0;
 }
@@ -914,8 +1153,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		return -ENODEV;
 	}
 
-	setup_timer(&priv->poll_timer, ipoib_ib_tx_timer_func,
-		    (unsigned long) dev);
+	set_tx_poll_timers(priv);
 
 	if (dev->flags & IFF_UP) {
 		if (ipoib_ib_dev_open(dev)) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 3974c29..3e6b651 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -562,10 +562,12 @@ static void neigh_add_path(struct sk_buff *skb, struct neighbour *n, struct net_
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 	unsigned long flags;
+	int index;
 
 	neigh = ipoib_neigh_alloc(n, skb->dev);
 	if (!neigh) {
-		++dev->stats.tx_dropped;
+		index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		return;
 	}
@@ -629,7 +631,8 @@ err_list:
 err_path:
 	ipoib_neigh_free(dev, neigh);
 err_drop:
-	++dev->stats.tx_dropped;
+	index = skb_get_queue_mapping(skb);
+	priv->send_ring[index].stats.tx_dropped++;
 	dev_kfree_skb_any(skb);
 
 	spin_unlock_irqrestore(&priv->lock, flags);
@@ -658,6 +661,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
 	unsigned long flags;
+	int index = skb_get_queue_mapping(skb);
 
 	spin_lock_irqsave(&priv->lock, flags);
 
@@ -680,7 +684,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 			} else
 				__path_add(dev, path);
 		} else {
-			++dev->stats.tx_dropped;
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
@@ -699,7 +703,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		__skb_queue_tail(&path->queue, skb);
 	} else {
-		++dev->stats.tx_dropped;
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 	}
 
@@ -712,12 +716,15 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct ipoib_neigh *neigh;
 	struct neighbour *n = NULL;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
+
+	send_ring = priv->send_ring + skb_get_queue_mapping(skb);
 
 	rcu_read_lock();
 	if (likely(skb_dst(skb))) {
 		n = dst_get_neighbour_noref(skb_dst(skb));
 		if (!n) {
-			++dev->stats.tx_dropped;
+			send_ring->stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 			goto unlock;
 		}
@@ -766,7 +773,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			__skb_queue_tail(&neigh->queue, skb);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else {
-			++dev->stats.tx_dropped;
+			++send_ring->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
 	} else {
@@ -789,7 +796,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 					   IPOIB_QPN(cb->hwaddr),
 					   cb->hwaddr + 4);
 				dev_kfree_skb_any(skb);
-				++dev->stats.tx_dropped;
+				++send_ring->stats.tx_dropped;
 				goto unlock;
 			}
 
@@ -801,18 +808,70 @@ unlock:
 	return NETDEV_TX_OK;
 }
 
+static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+{
+	return 0;
+}
+
 static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	u16 index;
 
 	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));
-	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
-		   netif_queue_stopped(dev),
-		   priv->tx_head, priv->tx_tail);
+
+	for (index = 0; index < priv->num_tx_queues; index++) {
+		if (__netif_subqueue_stopped(dev, index)) {
+			send_ring = priv->send_ring + index;
+			ipoib_warn(priv,
+				"queue (%d) stopped, tx_head %u, tx_tail %u\n",
+				index,
+				send_ring->tx_head, send_ring->tx_tail);
+		}
+	}
 	/* XXX reset QP, etc. */
 }
 
+static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct net_device_stats local_stats;
+	int i;
+
+	memset(&local_stats, 0, sizeof(struct net_device_stats));
+
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rstats = &priv->recv_ring[i].stats;
+		local_stats.rx_packets += rstats->rx_packets;
+		local_stats.rx_bytes   += rstats->rx_bytes;
+		local_stats.rx_errors  += rstats->rx_errors;
+		local_stats.rx_dropped += rstats->rx_dropped;
+	}
+
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tstats = &priv->send_ring[i].stats;
+		local_stats.tx_packets += tstats->tx_packets;
+		local_stats.tx_bytes   += tstats->tx_bytes;
+		local_stats.tx_errors  += tstats->tx_errors;
+		local_stats.tx_dropped += tstats->tx_dropped;
+	}
+
+	stats->rx_packets = local_stats.rx_packets;
+	stats->rx_bytes   = local_stats.rx_bytes;
+	stats->rx_errors  = local_stats.rx_errors;
+	stats->rx_dropped = local_stats.rx_dropped;
+
+	stats->tx_packets = local_stats.tx_packets;
+	stats->tx_bytes   = local_stats.tx_bytes;
+	stats->tx_errors  = local_stats.tx_errors;
+	stats->tx_dropped = local_stats.tx_dropped;
+
+	return stats;
+}
+
 static int ipoib_hard_header(struct sk_buff *skb,
 			     struct net_device *dev,
 			     unsigned short type,
@@ -902,9 +961,11 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
 void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 {
 	struct sk_buff *skb;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
 	while ((skb = __skb_dequeue(&neigh->queue))) {
-		++dev->stats.tx_dropped;
+		int index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 	}
 	if (ipoib_cm_get(neigh))
@@ -922,43 +983,88 @@ static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *par
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rx_allocated, tx_allocated;
+	unsigned long alloc_size;
 
 	/* Allocate RX/TX "rings" to hold queued skbs */
-	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
+	/* Multi queue initialization */
+	priv->recv_ring = kzalloc(priv->num_rx_queues * sizeof(*recv_ring),
 				GFP_KERNEL);
-	if (!priv->rx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
-		       ca->name, ipoib_recvq_size);
+	if (!priv->recv_ring) {
+		pr_warn("%s: failed to allocate RECV ring (%d entries)\n",
+			ca->name, priv->num_rx_queues);
 		goto out;
 	}
 
-	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
-	if (!priv->tx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, ipoib_sendq_size);
-		goto out_rx_ring_cleanup;
+	alloc_size = ipoib_recvq_size * sizeof(*recv_ring->rx_ring);
+	rx_allocated = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_ring = kzalloc(alloc_size, GFP_KERNEL);
+		if (!recv_ring->rx_ring) {
+			pr_warn("%s: failed to allocate RX ring (%d entries)\n",
+				ca->name, ipoib_recvq_size);
+			goto out_recv_ring_cleanup;
+		}
+		recv_ring->dev = dev;
+		recv_ring->index = i;
+		recv_ring++;
+		rx_allocated++;
+	}
+
+	priv->send_ring = kzalloc(priv->num_tx_queues * sizeof(*send_ring),
+				GFP_KERNEL);
+	if (!priv->send_ring) {
+		pr_warn("%s: failed to allocate SEND ring (%d entries)\n",
+			ca->name, priv->num_tx_queues);
+		goto out_recv_ring_cleanup;
 	}
 
+	alloc_size = ipoib_sendq_size * sizeof(*send_ring->tx_ring);
+	tx_allocated = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		send_ring->tx_ring = vzalloc(alloc_size);
+		if (!send_ring->tx_ring) {
+			printk(KERN_WARNING
+				"%s: failed to allocate TX ring (%d entries)\n",
+				ca->name, ipoib_sendq_size);
+			goto out_send_ring_cleanup;
+		}
+		send_ring->dev = dev;
+		send_ring->index = i;
+		send_ring++;
+		tx_allocated++;
+	}
 	/* priv->tx_head, tx_tail & tx_outstanding are already 0 */
 
 	if (ipoib_ib_dev_init(dev, ca, port))
-		goto out_tx_ring_cleanup;
+		goto out_send_ring_cleanup;
 
 	return 0;
 
-out_tx_ring_cleanup:
-	vfree(priv->tx_ring);
+out_send_ring_cleanup:
+	for (i = 0; i < tx_allocated; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
 
-out_rx_ring_cleanup:
-	kfree(priv->rx_ring);
+out_recv_ring_cleanup:
+	for (i = 0; i < rx_allocated; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
 
 out:
+	priv->send_ring = NULL;
+	priv->recv_ring = NULL;
 	return -ENOMEM;
 }
 
 void ipoib_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv;
+	int i;
 
 	ipoib_delete_debug_files(dev);
 
@@ -971,11 +1077,16 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	ipoib_ib_dev_cleanup(dev);
 
-	kfree(priv->rx_ring);
-	vfree(priv->tx_ring);
+	for (i = 0; i < priv->num_tx_queues; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
 
-	priv->rx_ring = NULL;
-	priv->tx_ring = NULL;
+	priv->recv_ring = NULL;
+	priv->send_ring = NULL;
 }
 
 static const struct header_ops ipoib_header_ops = {
@@ -987,23 +1098,25 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_stop		 = ipoib_stop,
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
-	.ndo_start_xmit	 	 = ipoib_start_xmit,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_null,
 	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
 };
 
 static void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	dev->netdev_ops		 = &ipoib_netdev_ops;
-	dev->header_ops		 = &ipoib_header_ops;
+	/* Use correct ops (ndo_select_queue) */
+	dev->netdev_ops = &ipoib_netdev_ops;
+	dev->header_ops = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
 
-	netif_napi_add(dev, &priv->napi, ipoib_poll, 100);
-
 	dev->watchdog_timeo	 = HZ;
 
 	dev->flags		|= IFF_BROADCAST | IFF_MULTICAST;
@@ -1041,15 +1154,21 @@ static void ipoib_setup(struct net_device *dev)
 	INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah);
 }
 
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
+					struct ipoib_dev_priv *template_priv)
 {
 	struct net_device *dev;
 
-	dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name,
-			   ipoib_setup);
+	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
+			   ipoib_setup,
+			   template_priv->num_tx_queues,
+			   template_priv->num_rx_queues);
 	if (!dev)
 		return NULL;
 
+	netif_set_real_num_tx_queues(dev, template_priv->num_tx_queues);
+	netif_set_real_num_rx_queues(dev, template_priv->num_rx_queues);
+
 	return netdev_priv(dev);
 }
 
@@ -1143,7 +1262,8 @@ int ipoib_add_pkey_attr(struct net_device *dev)
 	return device_create_file(&dev->dev, &dev_attr_pkey);
 }
 
-int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
+				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
 	int result = -ENOMEM;
@@ -1166,6 +1286,20 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 
 	kfree(device_attr);
 
+	priv->num_rx_queues = 1;
+	priv->num_tx_queues = 1;
+
+	return 0;
+}
+
+int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+{
+	int result;
+
+	result = ipoib_get_hca_features(priv, hca);
+	if (result)
+		return result;
+
 	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
 		priv->dev->hw_features = NETIF_F_SG |
 			NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
@@ -1182,13 +1316,23 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 static struct net_device *ipoib_add_port(const char *format,
 					 struct ib_device *hca, u8 port)
 {
-	struct ipoib_dev_priv *priv;
+	struct ipoib_dev_priv *priv, *template_priv;
 	struct ib_port_attr attr;
 	int result = -ENOMEM;
 
-	priv = ipoib_intf_alloc(format);
-	if (!priv)
-		goto alloc_mem_failed;
+	template_priv = kmalloc(sizeof *template_priv, GFP_KERNEL);
+	if (!template_priv)
+		goto alloc_mem_failed1;
+
+	if (ipoib_get_hca_features(template_priv, hca))
+		goto device_query_failed;
+
+	priv = ipoib_intf_alloc(format, template_priv);
+	if (!priv) {
+		kfree(template_priv);
+		goto alloc_mem_failed2;
+	}
+	kfree(template_priv);
 
 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
 	priv->dev->dev_id = port - 1;
@@ -1287,7 +1431,13 @@ event_failed:
 device_init_failed:
 	free_netdev(priv->dev);
 
-alloc_mem_failed:
+alloc_mem_failed2:
+	return ERR_PTR(result);
+
+device_query_failed:
+	kfree(template_priv);
+
+alloc_mem_failed1:
 	return ERR_PTR(result);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 20ebc6f..f127296 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -71,7 +71,6 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh, *tmp;
-	int tx_dropped = 0;
 
 	ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group %pI6\n",
 			mcast->mcmember.mgid.raw);
@@ -96,14 +95,15 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 		ipoib_put_ah(mcast->ah);
 
 	while (!skb_queue_empty(&mcast->pkt_queue)) {
-		++tx_dropped;
-		dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+		struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+		int index = skb_get_queue_mapping(skb);
+		/* Modify to lock queue */
+		netif_tx_lock_bh(dev);
+		priv->send_ring[index].stats.tx_dropped++;
+		netif_tx_unlock_bh(dev);
+		dev_kfree_skb_any(skb);
 	}
 
-	netif_tx_lock_bh(dev);
-	dev->stats.tx_dropped += tx_dropped;
-	netif_tx_unlock_bh(dev);
-
 	kfree(mcast);
 }
 
@@ -187,6 +187,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	struct ipoib_ah *ah;
 	int ret;
 	int set_qkey = 0;
+	int i;
 
 	mcast->mcmember = *mcmember;
 
@@ -200,7 +201,8 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		}
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		spin_unlock_irq(&priv->lock);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		for (i = 0; i < priv->num_tx_queues; i++)
+			priv->send_ring[i].tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
 	}
 
@@ -282,6 +284,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET)
@@ -298,8 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 		/* Flush out any queued packets */
 		netif_tx_lock_bh(dev);
 		while (!skb_queue_empty(&mcast->pkt_queue)) {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+			struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
+			dev_kfree_skb_any(skb);
 		}
 		netif_tx_unlock_bh(dev);
 
@@ -666,7 +671,8 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)		||
 	    !priv->broadcast					||
 	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		++dev->stats.tx_dropped;
+		int index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		goto unlock;
 	}
@@ -679,9 +685,10 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
 
 		mcast = ipoib_mcast_alloc(dev, 0);
 		if (!mcast) {
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			ipoib_warn(priv, "unable to allocate memory for "
 				   "multicast structure\n");
-			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			goto out;
 		}
@@ -696,7 +703,8 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
 		else {
-			++dev->stats.tx_dropped;
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 049a997..4be626f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -118,6 +118,10 @@ int ipoib_init_qp(struct net_device *dev)
 		goto out_fail;
 	}
 
+	/* Only one ring currently */
+	priv->recv_ring[0].recv_qp = priv->qp;
+	priv->send_ring[0].send_qp = priv->qp;
+
 	return 0;
 
 out_fail:
@@ -142,8 +146,10 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		.qp_type     = IB_QPT_UD
 	};
 
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
-	int i;
+	int i, j;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -167,19 +173,24 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
+	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				     priv->recv_ring, size, 0);
 	if (IS_ERR(priv->recv_cq)) {
 		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
 		goto out_free_mr;
 	}
 
 	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     dev, ipoib_sendq_size, 0);
+				     priv->send_ring, ipoib_sendq_size, 0);
 	if (IS_ERR(priv->send_cq)) {
 		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
 		goto out_free_recv_cq;
 	}
 
+	/* Only one ring */
+	priv->recv_ring[0].recv_cq = priv->recv_cq;
+	priv->send_ring[0].send_cq = priv->send_cq;
+
 	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
 		goto out_free_send_cq;
 
@@ -205,25 +216,43 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
-		priv->tx_sge[i].lkey = priv->mr->lkey;
+	send_ring = priv->send_ring;
+	for (j = 0; j < priv->num_tx_queues; j++) {
+		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
+			send_ring->tx_sge[i].lkey = priv->mr->lkey;
 
-	priv->tx_wr.opcode	= IB_WR_SEND;
-	priv->tx_wr.sg_list	= priv->tx_sge;
-	priv->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring->tx_wr.opcode	= IB_WR_SEND;
+		send_ring->tx_wr.sg_list	= send_ring->tx_sge;
+		send_ring->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring++;
+	}
 
-	priv->rx_sge[0].lkey = priv->mr->lkey;
+	recv_ring = priv->recv_ring;
+	recv_ring->rx_sge[0].lkey = priv->mr->lkey;
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		priv->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
-		priv->rx_sge[1].length = PAGE_SIZE;
-		priv->rx_sge[1].lkey = priv->mr->lkey;
-		priv->rx_wr.num_sge = IPOIB_UD_RX_SG;
+		recv_ring->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
+		recv_ring->rx_sge[1].length = PAGE_SIZE;
+		recv_ring->rx_sge[1].lkey = priv->mr->lkey;
+		recv_ring->rx_wr.num_sge = IPOIB_UD_RX_SG;
 	} else {
-		priv->rx_sge[0].length = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-		priv->rx_wr.num_sge = 1;
+		recv_ring->rx_sge[0].length =
+				IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
+		recv_ring->rx_wr.num_sge = 1;
+	}
+	recv_ring->rx_wr.next = NULL;
+	recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+
+	/* Copy first RX ring sge and wr parameters to the remaining RX rings */
+	first_recv_ring = recv_ring;
+	recv_ring++;
+	for (i = 1; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_sge[0] = first_recv_ring->rx_sge[0];
+		recv_ring->rx_sge[1] = first_recv_ring->rx_sge[1];
+		recv_ring->rx_wr = first_recv_ring->rx_wr;
+		/* This field is per ring */
+		recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+		recv_ring++;
 	}
-	priv->rx_wr.next = NULL;
-	priv->rx_wr.sg_list = priv->rx_sge;
 
 	return 0;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index d7e9740..5dcb9fb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -84,7 +84,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 
 	snprintf(intf_name, sizeof intf_name, "%s.%04x",
 		 ppriv->dev->name, pkey);
-	priv = ipoib_intf_alloc(intf_name);
+	priv = ipoib_intf_alloc(intf_name, ppriv);
 	if (!priv) {
 		result = -ENOMEM;
 		goto err;
-- 
1.7.1


* [PATCH for-next 7/7] IB/ipoib: Add RSS and TSS support for datagram mode
@ 2012-05-08 16:22   ` Or Gerlitz
  6 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-08 16:22 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds RSS (Receive Side Scaling) and TSS (multi-queue transmit)
support for IPoIB. The RSS and TSS implementation utilizes the new QP
groups concept.

The number of RSS and TSS rings is a function of the number of cores,
and the low level driver capability to support QP groups and RSS.

If the low level driver doesn't support QP groups, then only one RX ring
and one TX ring are created, along with a single QP that both rings use.

If the HW supports RSS then additional receive QPs are created, and each
is assigned to a separate receive ring. The number of additional receive
rings is equal to the number of CPU cores rounded to the next power of two.

If the HW doesn't support RSS then only one receive ring is created
and the parent QP is assigned as its QP.

When TSS is used, additional send QPs are created, and each is assigned to
a separate send ring. The number of additional send rings is equal to the
number of CPU cores rounded to the next power of two.
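The ring-count rule above (CPU cores rounded up to the next power of two)
can be sketched as follows; `rings_for_cpus` is a hypothetical helper
illustrating what the kernel's `roundup_pow_of_two()` computes, not code
from this series:

```c
#include <stdint.h>

/* Hypothetical illustration of the ring-count rule: the number of
 * additional RSS/TSS rings is the CPU count rounded up to the next
 * power of two (in-kernel code would use roundup_pow_of_two()). */
static unsigned int rings_for_cpus(unsigned int ncpus)
{
	unsigned int n = 1;

	while (n < ncpus)
		n <<= 1;
	return n;
}
```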

It turns out that there are IPoIB drivers used by some operating systems
and/or hypervisors in a para-virtualization (PV) scheme which extract the
source QPN from the CQ WC associated with incoming packets in order to
generate the source MAC address in the emulated MAC header they build.

With TSS, different packets targeted for the same entity (e.g. a VM using
a PV IPoIB instance) could potentially be sent through different TX rings
which map to different UD QPs, each with its own QPN. This may break some
assumptions made by the receiving entity (e.g. rules related to security,
monitoring, etc).

If the HW supports TSS, it is capable of overriding the source UD QPN
present in the IB datagram header (DETH) of sent packets with the parent's
QPN, which is part of the device HW address as advertised to the Linux network
stack and hence carried in ARP requests/responses. Thus the above-mentioned
problem doesn't exist.

When the HW doesn't support TSS but QP groups are supported, which means the
low level driver can create a set of QPs with contiguous QP numbers, TSS
can still be used; this is called "SW TSS".

In this case, the low level driver provides IPoIB with a mask when the
parent QP is created. This mask is later written into the reserved field
of the IPoIB header so receivers of SW TSS packets can mask the QPN of
a received packet and discover the parent QPN.
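As a rough sketch of that masking, assuming the QP group holds contiguous
QPNs and the parent QPN is aligned to the (power-of-two) group size; the
function name below is hypothetical and not taken from this patch:

```c
#include <stdint.h>

/* Hypothetical sketch of SW TSS QPN recovery: with a QP group of
 * contiguous QPNs whose parent QPN is aligned to the power-of-two
 * group size, clearing the low bits of any member QPN yields the
 * parent QPN carried in the device HW address. */
static uint32_t sw_tss_parent_qpn(uint32_t src_qpn, uint32_t group_size)
{
	return src_qpn & ~(group_size - 1);
}
```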

In order not to break interoperability with PV IPoIB drivers that have not
yet been enhanced to apply this masking to incoming packets, SW TSS is only
used if the peer has advertised its willingness to accept SW TSS frames;
otherwise the parent QP is used.

The advertisement to accept SW TSS frames is done using a dedicated bit in
the reserved byte of the IPoIB HW address (similar to CM).

With the current way IPoIB deals with neighbours, an IPoIB neighbour can
be deleted on the transmission path, e.g. during either local or remote
bonding fail-over. This area of the code is known to be racy. In order
to avoid the case where multiple TX flows would attempt to delete the
same IPoIB neighbour, the current queue selection implementation binds
all packets (skbs) that use a given IPoIB neighbour to be sent through
a single ring.
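A minimal sketch of that binding, with hypothetical names (the patch itself
caches the chosen ring index in `struct ipoib_neigh`):

```c
#include <stdint.h>

/* Hypothetical sketch of the neighbour-to-ring binding: the first
 * packet sent to a neighbour picks a TX ring (round-robin here) and
 * the index is cached so every later packet for that neighbour goes
 * through the same ring, avoiding races on neighbour deletion. */
struct neigh_binding {
	int index;		/* -1 until bound to a ring */
};

static uint16_t bind_neigh_to_ring(struct neigh_binding *neigh,
				   unsigned int *next_ring,
				   unsigned int num_tx_queues)
{
	if (neigh->index < 0)
		neigh->index = (*next_ring)++ % num_tx_queues;
	return (uint16_t)neigh->index;
}
```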

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |   19 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c    |    5 -
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   10 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  143 ++++++-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  619 ++++++++++++++++++++++++----
 5 files changed, 701 insertions(+), 95 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index fb880a0..6654551 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -115,7 +115,7 @@ enum {
 
 struct ipoib_header {
 	__be16	proto;
-	u16	reserved;
+	__be16	tss_qpn_mask_sz;
 };
 
 struct ipoib_cb {
@@ -225,7 +225,6 @@ struct ipoib_cm_tx {
 	unsigned	     tx_tail;
 	unsigned long	     flags;
 	u32		     mtu;
-	int index; /* For ndo_select_queue and ring counters */
 };
 
 struct ipoib_cm_rx_buf {
@@ -258,7 +257,6 @@ struct ipoib_cm_dev_priv {
 	int			num_frags;
 	u32			rx_cq_ind;
 	u32			tx_cq_ind;
-	u32			tx_ring_ind;
 };
 
 struct ipoib_ethtool_st {
@@ -355,9 +353,7 @@ struct ipoib_dev_priv {
 	u16		  pkey_index;
 	struct ib_pd	 *pd;
 	struct ib_mr	 *mr;
-	struct ib_cq	 *recv_cq;
-	struct ib_cq	 *send_cq;
-	struct ib_qp	 *qp;
+	struct ib_qp	 *qp; /* also parent QP for TSS & RSS */
 	u32		  qkey;
 
 	union ib_gid local_gid;
@@ -389,8 +385,12 @@ struct ipoib_dev_priv {
 	struct timer_list poll_timer;
 	struct ipoib_recv_ring *recv_ring;
 	struct ipoib_send_ring *send_ring;
-	unsigned int num_rx_queues;
-	unsigned int num_tx_queues;
+	unsigned int rss_qp_num; /* 0 when there is no HW RSS support */
+	unsigned int tss_qp_num; /* 0 when no TSS (HW or SW) is used */
+	unsigned int num_rx_queues; /* 1 when there is no HW RSS support */
+	unsigned int num_tx_queues; /* tss_qp_num + 1 when there is no HW TSS */
+	__be16 tss_qpn_mask_sz; /* Written into the ipoib header reserved field */
+	atomic_t tx_ring_ind;
 };
 
 struct ipoib_ah {
@@ -430,6 +430,7 @@ struct ipoib_neigh {
 	struct net_device *dev;
 
 	struct list_head    list;
+	int index; /* For ndo_select_queue and ring counters */
 };
 
 #define IPOIB_UD_MTU(ib_mtu)		(ib_mtu - IPOIB_ENCAP_LEN)
@@ -551,9 +552,11 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca);
 
 #define IPOIB_FLAGS_RC		0x80
 #define IPOIB_FLAGS_UC		0x40
+#define IPOIB_FLAGS_TSS		0x20
 
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
+#define IPOIB_TSS_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_TSS))
 
 extern int ipoib_max_conn_qp;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index d708ed2..bb5a8a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -1047,11 +1047,6 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 			priv->cm.tx_cq_ind : 0;
 	priv->cm.tx_cq_ind = index + 1;
 	attr.send_cq = attr.recv_cq = priv->recv_ring[index].recv_cq;
-	/* For ndo_select_queue */
-	index =  (priv->cm.tx_ring_ind < priv->num_tx_queues) ?
-			priv->cm.tx_ring_ind : 0;
-	priv->cm.tx_ring_ind = index + 1;
-	tx->index = index;
 
 	return ib_create_qp(priv->pd, &attr);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 55f3e35..fe7500e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -282,6 +282,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev,
 	/*
 	 * Drop packets that this interface sent, ie multicast packets
 	 * that the HCA has replicated.
+	 * Note: with SW TSS, MC packets are sent using priv->qp, so there is no need to mask
 	 */
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
@@ -1055,6 +1056,15 @@ static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
 static void set_rings_qp_state(struct ipoib_dev_priv *priv,
 				enum ib_qp_state new_state)
 {
+	if (priv->hca_caps & IB_DEVICE_UD_TSS) {
+		/* HW TSS is supported; the parent QP has no ring (send_ring) */
+		struct ib_qp_attr qp_attr;
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+					new_state);
+	}
+
 	set_tx_rings_qp_state(priv, new_state);
 
 	if (priv->num_rx_queues > 1)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 3e6b651..e47d2ef 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -808,9 +808,56 @@ unlock:
 	return NETDEV_TX_OK;
 }
 
-static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+static u16 ipoib_select_queue_hw(struct net_device *dev, struct sk_buff *skb)
 {
-	return 0;
+	struct neighbour *n = NULL;
+	struct ipoib_neigh *neigh = NULL;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct dst_entry *dst = skb_dst(skb);
+
+	/* Let skb_tx_hash do its work, except for CM */
+	if (likely(dst)) {
+		n = dst_get_neighbour_noref(dst);
+		if (likely(n)) {
+			neigh = *to_ipoib_neigh(n);
+			if (likely(neigh))
+				return neigh->index;
+		}
+	}
+
+	/* We don't have a neighbour, stay on this core */
+	return smp_processor_id() % priv->tss_qp_num;
+}
+
+static u16 ipoib_select_queue_sw(struct net_device *dev, struct sk_buff *skb)
+{
+	struct neighbour *n = NULL;
+	struct ipoib_neigh *neigh = NULL;
+	struct ipoib_header *header;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct dst_entry *dst = skb_dst(skb);
+
+	/* If there is no neighbor (BC/MC) use designated QDISC -> parent QP */
+	if (unlikely(!dst))
+		return priv->tss_qp_num;
+
+	n = dst_get_neighbour_noref(dst);
+	if (unlikely(!n))
+		return priv->tss_qp_num;
+
+	neigh = *to_ipoib_neigh(n);
+	if (unlikely(!neigh))
+		return priv->tss_qp_num;
+
+	/* Did the neighbour advertise TSS support? */
+	if (unlikely(!IPOIB_TSS_SUPPORTED(n->ha)))
+		return priv->tss_qp_num;
+
+	/* We are after ipoib_hard_header so skb->data is O.K. */
+	header = (struct ipoib_header *) skb->data;
+	header->tss_qpn_mask_sz |= priv->tss_qpn_mask_sz;
+
+	return neigh->index;
 }
 
 static void ipoib_timeout(struct net_device *dev)
@@ -882,7 +929,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
 	header->proto = htons(type);
-	header->reserved = 0;
+	header->tss_qpn_mask_sz = 0;
 
 	/*
 	 * If we don't have a dst_entry structure, stuff the
@@ -942,6 +989,7 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
 struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
 				      struct net_device *dev)
 {
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh;
 
 	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
@@ -955,6 +1003,17 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
 	skb_queue_head_init(&neigh->queue);
 	ipoib_cm_set(neigh, NULL);
 
+	/*
+	 * ipoib_neigh_alloc can be called from neigh_add_path without
+	 * the protection of spin lock or from ipoib_mcast_send under
+	 * spin lock protection; thus there is a need to use an atomic
+	 */
+	if (priv->tss_qp_num > 0)
+		neigh->index = atomic_inc_return(&priv->tx_ring_ind)
+			% priv->tss_qp_num;
+	else
+		neigh->index = 0;
+
 	return neigh;
 }
 
@@ -1028,8 +1087,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	for (i = 0; i < priv->num_tx_queues; i++) {
 		send_ring->tx_ring = vzalloc(alloc_size);
 		if (!send_ring->tx_ring) {
-			printk(KERN_WARNING
-				"%s: failed to allocate TX ring (%d entries)\n",
+			pr_warn("%s: failed to allocate TX ring (%d entries)\n",
 				ca->name, ipoib_sendq_size);
 			goto out_send_ring_cleanup;
 		}
@@ -1093,26 +1151,52 @@ static const struct header_ops ipoib_header_ops = {
 	.create	= ipoib_hard_header,
 };
 
-static const struct net_device_ops ipoib_netdev_ops = {
+static const struct net_device_ops ipoib_netdev_ops_no_tss = {
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
+};
+
+static const struct net_device_ops ipoib_netdev_ops_hw_tss = {
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_hw,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
+};
+
+static const struct net_device_ops ipoib_netdev_ops_sw_tss = {
 	.ndo_open		 = ipoib_open,
 	.ndo_stop		 = ipoib_stop,
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
 	.ndo_start_xmit		 = ipoib_start_xmit,
-	.ndo_select_queue	 = ipoib_select_queue_null,
+	.ndo_select_queue	 = ipoib_select_queue_sw,
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 	.ndo_neigh_setup	 = ipoib_neigh_setup_dev,
 };
 
+static const struct net_device_ops *ipoib_netdev_ops;
 
 static void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* Use correct ops (ndo_select_queue) */
-	dev->netdev_ops = &ipoib_netdev_ops;
+	dev->netdev_ops = ipoib_netdev_ops;
 	dev->header_ops = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
@@ -1159,6 +1243,16 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 {
 	struct net_device *dev;
 
+	/* Pick the correct ops (ndo_select_queue variant) used by ipoib_setup */
+	if (template_priv->num_tx_queues > 1) {
+		if (template_priv->hca_caps & IB_DEVICE_UD_TSS)
+			ipoib_netdev_ops = &ipoib_netdev_ops_hw_tss;
+		else
+			ipoib_netdev_ops = &ipoib_netdev_ops_sw_tss;
+	} else
+		ipoib_netdev_ops = &ipoib_netdev_ops_no_tss;
+
+
 	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
 			   ipoib_setup,
 			   template_priv->num_tx_queues,
@@ -1266,6 +1360,8 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
+	int num_cores;
+	int max_rss_tbl_sz;
 	int result = -ENOMEM;
 
 	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
@@ -1286,8 +1382,36 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 
+	/* Read this before device_attr is freed below */
+	max_rss_tbl_sz = device_attr->max_rss_tbl_sz;
 	kfree(device_attr);
 
-	priv->num_rx_queues = 1;
-	priv->num_tx_queues = 1;
+	num_cores = num_online_cpus();
+	if (num_cores == 1 || !(priv->hca_caps & IB_DEVICE_QPG)) {
+		/* No additional QP, only one QP for RX & TX */
+		priv->rss_qp_num = 0;
+		priv->tss_qp_num = 0;
+		priv->num_rx_queues = 1;
+		priv->num_tx_queues = 1;
+		return 0;
+	}
+	num_cores = roundup_pow_of_two(num_cores);
+	if (priv->hca_caps & IB_DEVICE_UD_RSS) {
+		max_rss_tbl_sz = min(num_cores, max_rss_tbl_sz);
+		max_rss_tbl_sz = rounddown_pow_of_two(max_rss_tbl_sz);
+		priv->rss_qp_num    = max_rss_tbl_sz;
+		priv->num_rx_queues = max_rss_tbl_sz;
+	} else {
+		/* No additional QP, only the parent QP for RX */
+		priv->rss_qp_num = 0;
+		priv->num_rx_queues = 1;
+	}
+
+	priv->tss_qp_num = num_cores;
+	if (priv->hca_caps & IB_DEVICE_UD_TSS)
+		/* TSS is supported by HW */
+		priv->num_tx_queues = priv->tss_qp_num;
+	else
+		/* If TSS is not supported by HW, use the parent QP for ARP */
+		priv->num_tx_queues = priv->tss_qp_num + 1;
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 4be626f..d05071e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -35,6 +35,31 @@
 
 #include "ipoib.h"
 
+static int set_qps_qkey(struct ipoib_dev_priv *priv)
+{
+	struct ib_qp_attr *qp_attr;
+	struct ipoib_recv_ring *recv_ring;
+	int ret = -ENOMEM;
+	int i;
+
+	qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
+	if (!qp_attr)
+		return -ENOMEM;
+
+	qp_attr->qkey = priv->qkey;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ret = ib_modify_qp(recv_ring->recv_qp, qp_attr, IB_QP_QKEY);
+		if (ret)
+			break;
+		recv_ring++;
+	}
+
+	kfree(qp_attr);
+
+	return ret;
+}
+
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int set_qkey)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -50,18 +75,9 @@ int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int
 	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 
 	if (set_qkey) {
-		ret = -ENOMEM;
-		qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
-		if (!qp_attr)
-			goto out;
-
-		/* set correct QKey for QP */
-		qp_attr->qkey = priv->qkey;
-		ret = ib_modify_qp(priv->qp, qp_attr, IB_QP_QKEY);
-		if (ret) {
-			ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret);
+		ret = set_qps_qkey(priv);
+		if (ret)
 			goto out;
-		}
 	}
 
 	/* attach QP to multicast group */
@@ -74,16 +90,13 @@ out:
 	return ret;
 }
 
-int ipoib_init_qp(struct net_device *dev)
+static int ipoib_init_one_qp(struct ipoib_dev_priv *priv, struct ib_qp *qp,
+				int init_attr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		return -1;
-
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
@@ -92,17 +105,18 @@ int ipoib_init_qp(struct net_device *dev)
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
 	    IB_QP_PKEY_INDEX |
-	    IB_QP_STATE;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	    IB_QP_STATE | init_attr;
+
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
-		ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret);
+		ipoib_warn(priv, "failed to modify QP to INIT, ret = %d\n", ret);
 		goto out_fail;
 	}
 
 	qp_attr.qp_state = IB_QPS_RTR;
 	/* Can't set this in a INIT->RTR transition */
-	attr_mask &= ~IB_QP_PORT;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	attr_mask &= ~(IB_QP_PORT | init_attr);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret);
 		goto out_fail;
@@ -112,40 +126,415 @@ int ipoib_init_qp(struct net_device *dev)
 	qp_attr.sq_psn = 0;
 	attr_mask |= IB_QP_SQ_PSN;
 	attr_mask &= ~IB_QP_PKEY_INDEX;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret);
 		goto out_fail;
 	}
 
-	/* Only one ring currently */
-	priv->recv_ring[0].recv_qp = priv->qp;
-	priv->send_ring[0].send_qp = priv->qp;
-
 	return 0;
 
 out_fail:
 	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+	if (ib_modify_qp(qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
 	return ret;
 }
 
-int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+static int ipoib_init_rss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, recv_ring->recv_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				"failed to init rss qp, ind = %d, ret=%d\n",
+				i, ret);
+			goto out_free_reset_qp;
+		}
+		recv_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				&qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				"Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_init_tss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	send_ring = priv->send_ring;
+	/*
+	 * Note: if priv->num_tx_queues > priv->tss_qp_num, the last
+	 * QP is the parent QP and it will be initialized later
+	 */
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, send_ring->send_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				"failed to init tss qp, ind = %d, ret=%d\n",
+				i, ret);
+			goto out_free_reset_qp;
+		}
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				&qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				"Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+int ipoib_init_qp(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int ret, i, attr;
+
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
+		ipoib_warn(priv, "PKEY not assigned\n");
+		return -1;
+	}
+
+	/* Init the RSS (RX) QPs */
+	/* If rss_qp_num = 0, the parent QP below also serves as the RX QP */
+	ret = ipoib_init_rss_qps(dev);
+	if (ret)
+		return ret;
+
+	ret = ipoib_init_tss_qps(dev);
+	if (ret)
+		goto out_reset_rss_qp;
+
+	/* Init the parent QP, which can be the only QP */
+	attr = priv->rss_qp_num > 0 ? IB_QP_GROUP_RSS : 0;
+	ret = ipoib_init_one_qp(priv, priv->qp, attr);
+	if (ret) {
+		ipoib_warn(priv, "failed to init parent qp, ret=%d\n", ret);
+		goto out_reset_tss_qp;
+	}
+
+	return 0;
+
+out_reset_tss_qp:
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				&qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				"Failed to modify QP to RESET state\n");
+	}
+
+out_reset_rss_qp:
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				&qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				"Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_transport_cq_init(struct net_device *dev,
+							int size)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	struct ib_cq *cq;
+	int i, allocated_rx, allocated_tx, req_vec;
+
+	allocated_rx = 0;
+	allocated_tx = 0;
+
+	/* We oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		/* Try to spread vectors based on port and ring numbers */
+		cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				  recv_ring, size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			printk(KERN_WARNING "%s: failed to create recv CQ\n",
+					priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring->recv_cq = cq;
+		allocated_rx++;
+		req_vec++;
+		if (ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP)) {
+			printk(KERN_WARNING "%s: req notify recv CQ\n",
+					priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring++;
+	}
+
+	/* We oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		cq = ib_create_cq(priv->ca,
+				  ipoib_send_comp_handler, NULL,
+				  send_ring, ipoib_sendq_size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			printk(KERN_WARNING "%s: failed to create send CQ\n",
+					priv->ca->name);
+			goto out_free_send_cqs;
+		}
+		send_ring->send_cq = cq;
+		allocated_tx++;
+		req_vec++;
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_send_cqs:
+	for (i = 0 ; i < allocated_tx ; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
+
+out_free_recv_cqs:
+	for (i = 0 ; i < allocated_rx ; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
+
+	return -ENODEV;
+}
+
+static int ipoib_create_parent_qp(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
+		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	if (priv->tss_qp_num == 0 && priv->rss_qp_num == 0)
+		/* Legacy mode */
+		init_attr.qpg_type = IB_QPG_NONE;
+	else {
+		init_attr.qpg_type = IB_QPG_PARENT;
+		init_attr.parent_attrib.tss_child_count = priv->tss_qp_num;
+		init_attr.parent_attrib.rss_child_count = priv->rss_qp_num;
+	}
+
+	/*
+	 * No TSS (tss_qp_num = 0, priv->num_tx_queues == 1), or TSS is
+	 * not supported in HW; in either case the parent QP is used
+	 * for ARP and similar transmissions
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num) {
+		init_attr.cap.max_send_wr  = ipoib_sendq_size;
+		init_attr.cap.max_send_sge = 1;
+	}
+
+	/* No RSS: the parent QP will be used for RX */
+	if (priv->rss_qp_num == 0) {
+		init_attr.cap.max_recv_wr  = ipoib_recvq_size;
+		init_attr.cap.max_recv_sge = IPOIB_UD_RX_SG;
+	}
+
+	/* Note that if parent QP is not used for RX/TX then this is harmless */
+	init_attr.recv_cq = priv->recv_ring[0].recv_cq;
+	init_attr.send_cq = priv->send_ring[priv->tss_qp_num].send_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		return -ENODEV;
+	}
+
+	priv->qp = qp;
+
+	/* TSS is not supported in HW or NO TSS (tss_qp_num = 0) */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		priv->send_ring[priv->tss_qp_num].send_qp = qp;
+
+	/* No RSS: the parent QP will be used for RX */
+	if (priv->rss_qp_num == 0)
+		priv->recv_ring[0].recv_qp = qp;
+
+	/* Only with SW TSS is there a need for a mask */
+	if ((priv->hca_caps & IB_DEVICE_UD_TSS) || (priv->tss_qp_num == 0))
+		/* TSS is supported by HW or no TSS at all */
+		priv->tss_qpn_mask_sz = 0;
+	else {
+		/* SW TSS, get mask back from HW, put in the upper nibble */
+		u16 tmp = (u16)init_attr.cap.qpg_tss_mask_sz;
+		priv->tss_qpn_mask_sz = cpu_to_be16((tmp << 12));
+	}
+	return 0;
+}
+
+static struct ib_qp *ipoib_create_tss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr init_attr = {
 		.cap = {
 			.max_send_wr  = ipoib_sendq_size,
-			.max_recv_wr  = ipoib_recvq_size,
 			.max_send_sge = 1,
+		},
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	init_attr.qpg_type = IB_QPG_CHILD_TX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.send_cq = init_attr.recv_cq = priv->send_ring[ind].send_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create TSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
+
+	return qp;
+}
+
+static struct ib_qp *ipoib_create_rss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.cap = {
+			.max_recv_wr  = ipoib_recvq_size,
 			.max_recv_sge = IPOIB_UD_RX_SG
 		},
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type     = IB_QPT_UD
 	};
+	struct ib_qp *qp;
+
+	init_attr.qpg_type = IB_QPG_CHILD_RX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.send_cq = init_attr.recv_cq = priv->recv_ring[ind].recv_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create RSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
 
+	return qp;
+}
+
+static int ipoib_create_other_qps(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rss_created, tss_created;
+	struct ib_qp *qp;
+
+	tss_created = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp = ipoib_create_tss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			printk(KERN_WARNING "%s: failed to create QP\n",
+				ca->name);
+			goto out_free_send_qp;
+		}
+		send_ring->send_qp = qp;
+		send_ring++;
+		tss_created++;
+	}
+
+	rss_created = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp = ipoib_create_rss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			printk(KERN_WARNING "%s: failed to create QP\n",
+				ca->name);
+			goto out_free_recv_qp;
+		}
+		recv_ring->recv_qp = qp;
+		recv_ring++;
+		rss_created++;
+	}
+
+	return 0;
+
+out_free_recv_qp:
+	for (i = 0; i < rss_created; i++) {
+		ib_destroy_qp(priv->recv_ring[i].recv_qp);
+		priv->recv_ring[i].recv_qp = NULL;
+	}
+
+out_free_send_qp:
+	for (i = 0; i < tss_created; i++) {
+		ib_destroy_qp(priv->send_ring[i].send_qp);
+		priv->send_ring[i].send_qp = NULL;
+	}
+
+	return -ENODEV;
+}
+
+int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_send_ring *send_ring;
 	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
@@ -173,49 +562,38 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
-				     priv->recv_ring, size, 0);
-	if (IS_ERR(priv->recv_cq)) {
-		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
+	/* Create the CQ(s) */
+	ret = ipoib_transport_cq_init(dev, size);
+	if (ret) {
+		pr_warn("%s: ipoib_transport_cq_init failed\n", ca->name);
 		goto out_free_mr;
 	}
 
-	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     priv->send_ring, ipoib_sendq_size, 0);
-	if (IS_ERR(priv->send_cq)) {
-		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
-		goto out_free_recv_cq;
-	}
-
-	/* Only one ring */
-	priv->recv_ring[0].recv_cq = priv->recv_cq;
-	priv->send_ring[0].send_cq = priv->send_cq;
-
-	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
-		goto out_free_send_cq;
-
-	init_attr.send_cq = priv->send_cq;
-	init_attr.recv_cq = priv->recv_cq;
-
-	if (priv->hca_caps & IB_DEVICE_UD_TSO)
-		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
-
-	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
-		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
-
-	if (dev->features & NETIF_F_SG)
-		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
-
-	priv->qp = ib_create_qp(priv->pd, &init_attr);
-	if (IS_ERR(priv->qp)) {
-		printk(KERN_WARNING "%s: failed to create QP\n", ca->name);
-		goto out_free_send_cq;
+	/* Init the parent QP */
+	ret = ipoib_create_parent_qp(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		goto out_free_cqs;
 	}
 
+	/*
+	 * Advertise that we are willing to accept frames from a SW TSS
+	 * sender. Note that this only indicates that this side is willing
+	 * to accept TSS frames; it doesn't imply that it will use TSS,
+	 * since for transmission the peer should advertise TSS as well
+	 */
+	priv->dev->dev_addr[0] |= IPOIB_FLAGS_TSS;
 	priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff;
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
+	/* create TSS & RSS QPs */
+	ret = ipoib_create_other_qps(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create QP(s)\n", ca->name);
+		goto out_free_parent_qp;
+	}
+
 	send_ring = priv->send_ring;
 	for (j = 0; j < priv->num_tx_queues; j++) {
 		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
@@ -256,11 +634,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 
 	return 0;
 
-out_free_send_cq:
-	ib_destroy_cq(priv->send_cq);
+out_free_parent_qp:
+	ib_destroy_qp(priv->qp);
+	priv->qp = NULL;
+
+out_free_cqs:
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
 
-out_free_recv_cq:
-	ib_destroy_cq(priv->recv_cq);
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
 
 out_free_mr:
 	ib_dereg_mr(priv->mr);
@@ -271,10 +658,101 @@ out_free_pd:
 	return -ENODEV;
 }
 
+static void ipoib_destroy_tx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		if (send_ring->send_qp) {
+			if (ib_destroy_qp(send_ring->send_qp))
+				ipoib_warn(priv, "ib_destroy_qp (send) failed\n");
+			send_ring->send_qp = NULL;
+		}
+		send_ring++;
+	}
+
+	/*
+	 * No support of TSS in HW
+	 * so there is an extra QP but it is freed later
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		send_ring->send_qp = NULL;
+}
+
+static void ipoib_destroy_rx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		if (recv_ring->recv_qp) {
+			if (ib_destroy_qp(recv_ring->recv_qp))
+				ipoib_warn(priv, "ib_destroy_qp (recv) failed\n");
+			recv_ring->recv_qp = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void ipoib_destroy_tx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		if (send_ring->send_cq) {
+			if (ib_destroy_cq(send_ring->send_cq))
+				ipoib_warn(priv, "ib_destroy_cq (send) failed\n");
+			send_ring->send_cq = NULL;
+		}
+		send_ring++;
+	}
+}
+
+static void ipoib_destroy_rx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		if (recv_ring->recv_cq) {
+			if (ib_destroy_cq(recv_ring->recv_cq))
+				ipoib_warn(priv, "ib_destroy_cq (recv) failed\n");
+			recv_ring->recv_cq = NULL;
+		}
+		recv_ring++;
+	}
+}
+
 void ipoib_transport_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	ipoib_destroy_rx_qps(dev);
+	ipoib_destroy_tx_qps(dev);
+
+	/* Destroy parent or only QP */
 	if (priv->qp) {
 		if (ib_destroy_qp(priv->qp))
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
@@ -283,11 +761,8 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 	}
 
-	if (ib_destroy_cq(priv->send_cq))
-		ipoib_warn(priv, "ib_cq_destroy (send) failed\n");
-
-	if (ib_destroy_cq(priv->recv_cq))
-		ipoib_warn(priv, "ib_cq_destroy (recv) failed\n");
+	ipoib_destroy_rx_cqs(dev);
+	ipoib_destroy_tx_cqs(dev);
 
 	ipoib_cm_dev_cleanup(dev);
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups
       [not found]     ` <1336494151-31050-5-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-05-09 15:06       ` Or Gerlitz
  2012-05-17  8:19       ` Or Gerlitz
  1 sibling, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-09 15:06 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w

On 5/8/2012 7:22 PM, Or Gerlitz wrote:
> --- a/drivers/infiniband/hw/ehca/ehca_qp.c
> +++ b/drivers/infiniband/hw/ehca/ehca_qp.c
> @@ -464,6 +464,9 @@ static struct ehca_qp *internal_create_qp(
>   	int is_llqp = 0, has_srq = 0, is_user = 0;
>   	int qp_type, max_send_sge, max_recv_sge, ret;
>
> +	if (init_attr->qpg_type != IB_QPG_NONE)
> +		return ERR_PTR(-ENOSYS);
> +
>   	/* h_call's out parameters */
>   	struct ehca_alloc_qp_parms parms;
>   	u32 swqe_size = 0, rwqe_size = 0, ib_qp_num;
> @@ -980,6 +983,9 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd,
>   	if (srq_init_attr->srq_type != IB_SRQT_BASIC)
>   		return ERR_PTR(-ENOSYS);
>
> +	if (srq_init_attr->qpg_type != IB_QPG_NONE)
> +		return ERR_PTR(-ENOSYS);
> +
>
Oops, this setting is wrong; I'll send a fixed patch (marked V0.1, to 
leave room for sending a V1 later...)

Or.



* Re: [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs
       [not found]     ` <1336494151-31050-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-05-16 20:42       ` Or Gerlitz
       [not found]         ` <CAJZOPZ+53S5q+S2ToiqFQFo6oUP7tpO6gZA4Cu4pBWxrjL-wXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2012-05-16 20:42 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, shlomop-VPRAkNaXOzVWk0Htik3J/w

On Tue, May 8, 2012 at 7:22 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> Enable IB ULPs to use a larger portion of the device EQs (which map
> to IRQs). The mlx4_ib driver follows the mlx4_core framework of the
> EQs to be divided among the device ports. In this scheme, for each IB
> port, the number of allocated EQs follows the number of cores, subject
> to other system constraints, such as the number of available MSI-X vectors.

Hi Roland,

This patch is kind of following the "occupy wall street" spirit, i.e.
it lets mlx4/IB consumers get the available EQs reserved/allocated for
them by mlx4_core, which does so per port. It's a basic building block
for RSS later, and also for optimizations in more ULPs down the road;
I'd be happy to get any feedback if you have comments here.

Also, another long-pending patch is the one that allows for multiple IPoIB
child interfaces @ http://marc.info/?l=linux-rdma&m=132630202116857&w=2 ,
which will be of use to eIPoIB - the "Ethernet IPoIB" driver that was
presented at the last OFA conference. The patch was initially posted back
in November; so far, no comment.

Or.

* Re: [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups
       [not found]     ` <1336494151-31050-5-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2012-05-09 15:06       ` Or Gerlitz
@ 2012-05-17  8:19       ` Or Gerlitz
  1 sibling, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-17  8:19 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, Hefty, Sean
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, shlomop-VPRAkNaXOzVWk0Htik3J/w,
	Ken Strandberg

On 5/8/2012 7:22 PM, Or Gerlitz wrote:
> From: Shlomo Pongratz<shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> RSS (Receive Side Scaling) and TSS (Transmit Side Scaling, better known as
> MQ/Multi-Queue) are common networking techniques that make use of
> contemporary NICs supporting multiple receive and transmit descriptor
> queues (multi-queue); see also Documentation/networking/scaling.txt
>
> This patch introduces the concept of RSS and TSS QP groups, which
> allows them to be implemented by low-level drivers and used
> by IPoIB and, later, also by user-space ULPs.

Hi Roland, Sean - word reached me tonight that the unbelievable 
happened: OFA has opened their site to everyone, without requiring 
subscription! The concept of QP groups for TSS/RSS was introduced at the 
last OFA conference; maybe you want to take a look at the user-mode 
Ethernet session, slides 10-14. The author was perhaps worried about the MS 
mob, and hence didn't use the terms
RSS/TSS, but that's the intention...

https://openfabrics.org/resources/document-downloads/presentations/cat_view/57-ofa-documents/23-presentations/81-openfabrics-international-workshops/104-2012-ofa-international-workshop/107-2012-ofa-intl-workshop-wednesday.html


Or.


* Re: [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs
       [not found]         ` <CAJZOPZ+53S5q+S2ToiqFQFo6oUP7tpO6gZA4Cu4pBWxrjL-wXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-05-17  8:24           ` Or Gerlitz
  0 siblings, 0 replies; 12+ messages in thread
From: Or Gerlitz @ 2012-05-17  8:24 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Hefty, Sean, linux-rdma, shlomop-VPRAkNaXOzVWk0Htik3J/w

On 5/16/2012 11:42 PM, Or Gerlitz wrote:
> Also, another long pending patch is the one to allow for multiple 
> IPoIB childs @ http://marc.info/?l=linux-rdma&m=132630202116857&w=2 , 
> which will be of use by the eIPoIB - the "Ethernet IPoIB" driver which 
> was presented in the last OFA conference, the patch was initially 
> posted back in November, so far no comment.

Same here: eIPoIB was presented last month at OFA, see the Ethernet 
IPoIB session at that link. The patch I refer to allows creating 
multiple IPoIB interfaces under the same IPoIB parent, where each such 
interface (e.g. ib0.1, ib0.2, ... ib0.N) would eventually serve the 
traffic of VM1, VM2, ... VMN.

https://openfabrics.org/resources/document-downloads/presentations/cat_view/57-ofa-documents/23-presentations/81-openfabrics-international-workshops/104-2012-ofa-international-workshop/107-2012-ofa-intl-workshop-wednesday.html

Or.

end of thread, other threads:[~2012-05-17  8:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-08 16:22 [PATCH for-next 0/7] Add RSS/TSS QP groups and IPoIB support for RSS/TSS Or Gerlitz
     [not found] ` <1336494151-31050-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2012-05-08 16:22   ` [PATCH RESEND for-next 1/7] net/mlx4: add new cap flags field to track more capabilities Or Gerlitz
2012-05-08 16:22   ` [PATCH RESEND for-next 2/7] IB/mlx4: replace KERN_yyy printk calls with pr_yyy ones Or Gerlitz
2012-05-08 16:22   ` [PATCH RESEND for-next 3/7] IB/mlx4: increase the number of vectors (EQs) available for ULPs Or Gerlitz
     [not found]     ` <1336494151-31050-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2012-05-16 20:42       ` Or Gerlitz
     [not found]         ` <CAJZOPZ+53S5q+S2ToiqFQFo6oUP7tpO6gZA4Cu4pBWxrjL-wXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-05-17  8:24           ` Or Gerlitz
2012-05-08 16:22   ` [PATCH for-next 4/7] IB/core: Add RSS and TSS QP groups Or Gerlitz
     [not found]     ` <1336494151-31050-5-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2012-05-09 15:06       ` Or Gerlitz
2012-05-17  8:19       ` Or Gerlitz
2012-05-08 16:22   ` [PATCH for-next 5/7] IB/mlx4: Add support for " Or Gerlitz
2012-05-08 16:22   ` [PATCH for-next 6/7] IB/ipoib: Implement vectorization restructure as pre-step for TSS/RSS Or Gerlitz
2012-05-08 16:22   ` [PATCH for-next 7/7] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
