* [PATCH 01/23] staging/rdma/hfi1: inline clear_ahg()
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Ira Weiny
From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
This additional call is a performance regression from
qib. For small messages, the progress routine always
builds one packet and clears out the AHG state when
the queue has gone empty, which is the predominant
case for small messages. Inline the routine to avoid
the call overhead, and move it to qp.h for scope reasons.
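As a rough standalone sketch of the pattern (hypothetical names, not the
driver's code), a tiny hot-path helper moves from a .c file into a header
as a static inline so the compiler can fold it into its caller:

    /* sketch: a small state-clearing helper defined in a header */
    struct qp_sketch {
            int ahgcount;
            unsigned int flags;
    };

    static inline void clear_state(struct qp_sketch *qp)
    {
            qp->ahgcount = 0;
            qp->flags &= ~0x3u;     /* like clearing AHG_VALID | AHG_CLEAR */
    }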
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/qp.h | 15 +++++++++++++++
drivers/staging/rdma/hfi1/ruc.c | 13 -------------
drivers/staging/rdma/hfi1/verbs.h | 2 --
3 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/qp.h b/drivers/staging/rdma/hfi1/qp.h
index 6b505859b59c..b9c1575990aa 100644
--- a/drivers/staging/rdma/hfi1/qp.h
+++ b/drivers/staging/rdma/hfi1/qp.h
@@ -52,6 +52,7 @@
#include <linux/hash.h>
#include "verbs.h"
+#include "sdma.h"
#define QPN_MAX (1 << 24)
#define QPNMAP_ENTRIES (QPN_MAX / PAGE_SIZE / BITS_PER_BYTE)
@@ -117,6 +118,20 @@ static inline struct hfi1_qp *hfi1_lookup_qpn(struct hfi1_ibport *ibp,
}
/**
+ * clear_ahg - reset ahg status in qp
+ * @qp: qp pointer
+ */
+static inline void clear_ahg(struct hfi1_qp *qp)
+{
+ qp->s_hdr->ahgcount = 0;
+ qp->s_flags &= ~(HFI1_S_AHG_VALID | HFI1_S_AHG_CLEAR);
+ if (qp->s_sde && qp->s_ahgidx >= 0)
+ sdma_ahg_free(qp->s_sde, qp->s_ahgidx);
+ qp->s_ahgidx = -1;
+ qp->s_sde = NULL;
+}
+
+/**
* hfi1_error_qp - put a QP into the error state
* @qp: the QP to put into the error state
* @err: the receive completion error to signal if a RWQE is active
diff --git a/drivers/staging/rdma/hfi1/ruc.c b/drivers/staging/rdma/hfi1/ruc.c
index a4115288db66..faad1b93703e 100644
--- a/drivers/staging/rdma/hfi1/ruc.c
+++ b/drivers/staging/rdma/hfi1/ruc.c
@@ -695,19 +695,6 @@ u32 hfi1_make_grh(struct hfi1_ibport *ibp, struct ib_grh *hdr,
return sizeof(struct ib_grh) / sizeof(u32);
}
-/*
- * free_ahg - clear ahg from QP
- */
-void clear_ahg(struct hfi1_qp *qp)
-{
- qp->s_hdr->ahgcount = 0;
- qp->s_flags &= ~(HFI1_S_AHG_VALID | HFI1_S_AHG_CLEAR);
- if (qp->s_sde)
- sdma_ahg_free(qp->s_sde, qp->s_ahgidx);
- qp->s_ahgidx = -1;
- qp->s_sde = NULL;
-}
-
#define BTH2_OFFSET (offsetof(struct hfi1_pio_header, hdr.u.oth.bth[2]) / 4)
/**
diff --git a/drivers/staging/rdma/hfi1/verbs.h b/drivers/staging/rdma/hfi1/verbs.h
index ed903a93baf7..afaa0fe619fe 100644
--- a/drivers/staging/rdma/hfi1/verbs.h
+++ b/drivers/staging/rdma/hfi1/verbs.h
@@ -1078,8 +1078,6 @@ int hfi1_ruc_check_hdr(struct hfi1_ibport *ibp, struct hfi1_ib_header *hdr,
u32 hfi1_make_grh(struct hfi1_ibport *ibp, struct ib_grh *hdr,
struct ib_global_route *grh, u32 hwords, u32 nwords);
-void clear_ahg(struct hfi1_qp *qp);
-
void hfi1_make_ruc_header(struct hfi1_qp *qp, struct hfi1_other_headers *ohdr,
u32 bth0, u32 bth2, int middle);
--
1.8.2
* [PATCH 02/23] staging/rdma/hfi1: Reset ASIC CSRs on FLR, and once per ASIC
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Easwar Hariharan,
Ira Weiny
From: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The ASIC registers were not reset on FLR, and the code protecting
the ASIC block against multiple initializations by peer HFIs did
not extend to multiple ASICs in a system. This patch addresses
both gaps.
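A minimal sketch of the gating this patch moves to (assuming a per-device
flag decided by an earlier asic_should_init()-style check; names below are
hypothetical):

    #define DO_INIT_ASIC 0x1u

    static void reset_shared_csrs(unsigned int dev_flags)
    {
            /* only the device elected to init the ASIC resets shared state */
            if (!(dev_flags & DO_INIT_ASIC))
                    return;
            /* ... write the shared CSRs back to chip reset defaults ... */
    }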
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 35 +++++++++++++++--------------------
drivers/staging/rdma/hfi1/firmware.c | 4 ++++
2 files changed, 19 insertions(+), 20 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index 11523596ca57..f300d7fa5e5f 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -9327,8 +9327,6 @@ static void reset_cce_csrs(struct hfi1_devdata *dd)
/* set ASIC CSRs to chip reset defaults */
static void reset_asic_csrs(struct hfi1_devdata *dd)
{
- static DEFINE_MUTEX(asic_mutex);
- static int called;
int i;
/*
@@ -9338,15 +9336,8 @@ static void reset_asic_csrs(struct hfi1_devdata *dd)
* a known first load do the reset and blocking all others.
*/
- /*
- * These CSRs should only be reset once - the first one here will
- * do the work. Use a mutex so that a non-first caller waits until
- * the first is finished before it can proceed.
- */
- mutex_lock(&asic_mutex);
- if (called)
- goto done;
- called = 1;
+ if (!(dd->flags & HFI1_DO_INIT_ASIC))
+ return;
if (dd->icode != ICODE_FPGA_EMULATION) {
/* emulation does not have an SBus - leave these alone */
@@ -9366,7 +9357,10 @@ static void reset_asic_csrs(struct hfi1_devdata *dd)
for (i = 0; i < ASIC_NUM_SCRATCH; i++)
write_csr(dd, ASIC_CFG_SCRATCH + (8 * i), 0);
write_csr(dd, ASIC_CFG_MUTEX, 0); /* this will clear it */
+
+ /* We might want to retain this state across FLR if we ever use it */
write_csr(dd, ASIC_CFG_DRV_STR, 0);
+
write_csr(dd, ASIC_CFG_THERM_POLL_EN, 0);
/* ASIC_STS_THERM read-only */
/* ASIC_CFG_RESET leave alone */
@@ -9413,9 +9407,6 @@ static void reset_asic_csrs(struct hfi1_devdata *dd)
/* this also writes a NOP command, clearing paging mode */
write_csr(dd, ASIC_EEP_ADDR_CMD, 0);
write_csr(dd, ASIC_EEP_DATA, 0);
-
-done:
- mutex_unlock(&asic_mutex);
}
/* set MISC CSRs to chip reset defaults */
@@ -9827,6 +9818,7 @@ static void init_chip(struct hfi1_devdata *dd)
restore_pci_variables(dd);
}
+ reset_asic_csrs(dd);
} else {
dd_dev_info(dd, "Resetting CSRs with writes\n");
reset_cce_csrs(dd);
@@ -9837,6 +9829,7 @@ static void init_chip(struct hfi1_devdata *dd)
}
/* clear the DC reset */
write_csr(dd, CCE_DC_CTRL, 0);
+
/* Set the LED off */
if (is_a0(dd))
setextled(dd, 0);
@@ -10332,7 +10325,7 @@ static void asic_should_init(struct hfi1_devdata *dd)
}
/**
- * Allocate an initialize the device structure for the hfi.
+ * Allocate and initialize the device structure for the hfi.
* @dev: the pci_dev for hfi1_ib device
* @ent: pci_device_id struct for this dev
*
@@ -10488,6 +10481,12 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
else if (dd->rcv_intr_timeout_csr == 0 && rcv_intr_timeout)
dd->rcv_intr_timeout_csr = 1;
+ /* needs to be done before we look for the peer device */
+ read_guid(dd);
+
+ /* should this device init the ASIC block? */
+ asic_should_init(dd);
+
/* obtain chip sizes, reset chip CSRs */
init_chip(dd);
@@ -10496,11 +10495,6 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
if (ret)
goto bail_cleanup;
- /* needs to be done before we look for the peer device */
- read_guid(dd);
-
- asic_should_init(dd);
-
/* read in firmware */
ret = hfi1_firmware_init(dd);
if (ret)
@@ -10715,6 +10709,7 @@ static int thermal_init(struct hfi1_devdata *dd)
acquire_hw_mutex(dd);
dd_dev_info(dd, "Initializing thermal sensor\n");
+
/* Thermal Sensor Initialization */
/* Step 1: Reset the Thermal SBus Receiver */
ret = sbus_request_slow(dd, SBUS_THERMAL, 0x0,
diff --git a/drivers/staging/rdma/hfi1/firmware.c b/drivers/staging/rdma/hfi1/firmware.c
index 5c2f2ed8f224..15c9cb7a3150 100644
--- a/drivers/staging/rdma/hfi1/firmware.c
+++ b/drivers/staging/rdma/hfi1/firmware.c
@@ -1614,6 +1614,10 @@ done:
*/
void read_guid(struct hfi1_devdata *dd)
{
+ /* Take the DC out of reset to get a valid GUID value */
+ write_csr(dd, CCE_DC_CTRL, 0);
+ (void) read_csr(dd, CCE_DC_CTRL);
+
dd->base_guid = read_csr(dd, DC_DC8051_CFG_LOCAL_GUID);
dd_dev_info(dd, "GUID %llx",
(unsigned long long)dd->base_guid);
--
1.8.2
* [PATCH 03/23] staging/rdma/hfi1: Extend the offline timeout
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Dean Luick, Ira Weiny
From: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The latest version of the 8051 firmware waits longer
when bringing the link down. Extend the driver's
timeout to match.
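For reference, the underlying wait is a simple poll-until-state-or-timeout
loop; a userspace-flavored sketch (hypothetical, not the driver's
wait_phy_linkstate()):

    #include <time.h>

    /* poll until read_state() == want, or ms milliseconds elapse */
    static int wait_state(int (*read_state)(void), int want, long ms)
    {
            struct timespec tick = { .tv_sec = 0, .tv_nsec = 1000000 };
            long waited;

            for (waited = 0; waited < ms; waited++) {
                    if (read_state() == want)
                            return 0;
                    nanosleep(&tick, NULL);   /* 1 ms per poll */
            }
            return -1;  /* timed out */
    }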
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index f300d7fa5e5f..0a9030ff805f 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -6205,7 +6205,7 @@ static int goto_offline(struct hfi1_pportdata *ppd, u8 rem_reason)
if (do_wait) {
/* it can take a while for the link to go down */
- ret = wait_phy_linkstate(dd, PLS_OFFLINE, 5000);
+ ret = wait_phy_linkstate(dd, PLS_OFFLINE, 10000);
if (ret < 0)
return ret;
}
--
1.8.2
* [PATCH 04/23] staging/rdma/hfi1: Implement time-out for send context halt recovery
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Vennila Megavannan,
Ira Weiny
From: Vennila Megavannan <vennila.megavannan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
This patch increases the timeout for packet egress to 500 us and resets
the timer to zero whenever the packet occupancy changes. On timeout, the
link is bounced.
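The key point is that the timeout counts consecutive polls without
progress rather than total polls; a standalone sketch of the shape
(hypothetical names):

    /* returns 0 once occupancy drains, -1 if stuck for 'limit' polls */
    static int wait_drain(unsigned long (*occupancy)(void), int limit)
    {
            unsigned long cur, prev = 0;
            int loop = 0;

            while ((cur = occupancy()) != 0) {
                    if (cur != prev)
                            loop = 0;       /* progress: restart the clock */
                    if (++loop > limit)
                            return -1;      /* stuck: caller bounces the link */
                    prev = cur;
            }
            return 0;
    }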
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Vennila Megavannan <vennila.megavannan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/pio.c | 14 +++++++++++---
drivers/staging/rdma/hfi1/sdma.c | 15 ++++++++++++---
2 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/pio.c b/drivers/staging/rdma/hfi1/pio.c
index 67dd93a6888c..e5c32db4bc67 100644
--- a/drivers/staging/rdma/hfi1/pio.c
+++ b/drivers/staging/rdma/hfi1/pio.c
@@ -922,10 +922,12 @@ void sc_disable(struct send_context *sc)
static void sc_wait_for_packet_egress(struct send_context *sc, int pause)
{
struct hfi1_devdata *dd = sc->dd;
- u64 reg;
+ u64 reg = 0;
+ u64 reg_prev;
u32 loop = 0;
while (1) {
+ reg_prev = reg;
reg = read_csr(dd, sc->hw_context * 8 +
SEND_EGRESS_CTXT_STATUS);
/* done if egress is stopped */
@@ -934,11 +936,17 @@ static void sc_wait_for_packet_egress(struct send_context *sc, int pause)
reg = packet_occupancy(reg);
if (reg == 0)
break;
- if (loop > 100) {
+ /* counter is reset if occupancy count changes */
+ if (reg != reg_prev)
+ loop = 0;
+ if (loop > 500) {
+ /* timed out - bounce the link */
dd_dev_err(dd,
- "%s: context %u(%u) timeout waiting for packets to egress, remaining count %u\n",
+ "%s: context %u(%u) timeout waiting for packets to egress, remaining count %u, bouncing link\n",
__func__, sc->sw_index,
sc->hw_context, (u32)reg);
+ queue_work(dd->pport->hfi1_wq,
+ &dd->pport->link_bounce_work);
break;
}
loop++;
diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index 63ab72102183..d57531796723 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -303,17 +303,26 @@ static void sdma_wait_for_packet_egress(struct sdma_engine *sde,
u64 off = 8 * sde->this_idx;
struct hfi1_devdata *dd = sde->dd;
int lcnt = 0;
+ u64 reg_prev;
+ u64 reg = 0;
while (1) {
- u64 reg = read_csr(dd, off + SEND_EGRESS_SEND_DMA_STATUS);
+ reg_prev = reg;
+ reg = read_csr(dd, off + SEND_EGRESS_SEND_DMA_STATUS);
reg &= SDMA_EGRESS_PACKET_OCCUPANCY_SMASK;
reg >>= SDMA_EGRESS_PACKET_OCCUPANCY_SHIFT;
if (reg == 0)
break;
- if (lcnt++ > 100) {
- dd_dev_err(dd, "%s: engine %u timeout waiting for packets to egress, remaining count %u\n",
+ /* counter is reest if accupancy count changes */
+ if (reg != reg_prev)
+ lcnt = 0;
+ if (lcnt++ > 500) {
+ /* timed out - bounce the link */
+ dd_dev_err(dd, "%s: engine %u timeout waiting for packets to egress, remaining count %u, bouncing link\n",
__func__, sde->this_idx, (u32)reg);
+ queue_work(dd->pport->hfi1_wq,
+ &dd->pport->link_bounce_work);
break;
}
udelay(1);
--
1.8.2
* [PATCH 05/23] staging/rdma/hfi1: Remove QSFP_ENABLED from HFI capability mask
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Easwar Hariharan,
Ira Weiny
From: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The QSFP interface code has been running without issues, and the flag is
never turned off. This patch removes the QSFP_ENABLED bit from HFI1_CAP.
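For context, the capability mask is a plain bit field, so retiring a bit
simply leaves a hole; a sketch of the shape (bit values here are
placeholders, not the real mask):

    #define CAP_SDMA            (1UL << 1)
    /* 1UL << 16 unused; was CAP_QSFP_ENABLED before it was retired */
    #define CAP_SDMA_HEAD_CHECK (1UL << 17)

    static inline int cap_is_set(unsigned long mask, unsigned long cap)
    {
            return !!(mask & cap);
    }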
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 3 +--
drivers/staging/rdma/hfi1/common.h | 2 --
drivers/staging/rdma/hfi1/qsfp.c | 13 ++++---------
include/uapi/rdma/hfi/hfi1_user.h | 4 ++--
4 files changed, 7 insertions(+), 15 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index 0a9030ff805f..93ca13fceb51 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -5716,8 +5716,7 @@ void init_qsfp(struct hfi1_pportdata *ppd)
u64 qsfp_mask;
if (loopback == LOOPBACK_SERDES || loopback == LOOPBACK_LCB ||
- ppd->dd->icode == ICODE_FUNCTIONAL_SIMULATOR ||
- !HFI1_CAP_IS_KSET(QSFP_ENABLED)) {
+ ppd->dd->icode == ICODE_FUNCTIONAL_SIMULATOR) {
ppd->driver_link_ready = 1;
return;
}
diff --git a/drivers/staging/rdma/hfi1/common.h b/drivers/staging/rdma/hfi1/common.h
index 5f2293729cf9..de62cbe2224c 100644
--- a/drivers/staging/rdma/hfi1/common.h
+++ b/drivers/staging/rdma/hfi1/common.h
@@ -147,7 +147,6 @@
HFI1_CAP_USE_SDMA_HEAD | \
HFI1_CAP_EXTENDED_PSN | \
HFI1_CAP_PRINT_UNIMPL | \
- HFI1_CAP_QSFP_ENABLED | \
HFI1_CAP_NO_INTEGRITY | \
HFI1_CAP_PKEY_CHECK) << \
HFI1_CAP_USER_SHIFT)
@@ -163,7 +162,6 @@
HFI1_CAP_SDMA | \
HFI1_CAP_PRINT_UNIMPL | \
HFI1_CAP_STATIC_RATE_CTRL | \
- HFI1_CAP_QSFP_ENABLED | \
HFI1_CAP_PKEY_CHECK | \
HFI1_CAP_MULTI_PKT_EGR | \
HFI1_CAP_EXTENDED_PSN | \
diff --git a/drivers/staging/rdma/hfi1/qsfp.c b/drivers/staging/rdma/hfi1/qsfp.c
index 3138936157db..ffdb1d787a80 100644
--- a/drivers/staging/rdma/hfi1/qsfp.c
+++ b/drivers/staging/rdma/hfi1/qsfp.c
@@ -403,16 +403,11 @@ static const char *pwr_codes = "1.5W2.0W2.5W3.5W";
int qsfp_mod_present(struct hfi1_pportdata *ppd)
{
- if (HFI1_CAP_IS_KSET(QSFP_ENABLED)) {
- struct hfi1_devdata *dd = ppd->dd;
- u64 reg;
+ struct hfi1_devdata *dd = ppd->dd;
+ u64 reg;
- reg = read_csr(dd,
- dd->hfi1_id ? ASIC_QSFP2_IN : ASIC_QSFP1_IN);
- return !(reg & QSFP_HFI0_MODPRST_N);
- }
- /* always return cable present */
- return 1;
+ reg = read_csr(dd, dd->hfi1_id ? ASIC_QSFP2_IN : ASIC_QSFP1_IN);
+ return !(reg & QSFP_HFI0_MODPRST_N);
}
/*
diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h
index 78c442fbf263..599562fe5d57 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -88,7 +88,7 @@
#define HFI1_CAP_SDMA_AHG (1UL << 2) /* Enable SDMA AHG support */
#define HFI1_CAP_EXTENDED_PSN (1UL << 3) /* Enable Extended PSN support */
#define HFI1_CAP_HDRSUPP (1UL << 4) /* Enable Header Suppression */
-/* 1UL << 5 reserved */
+/* 1UL << 5 unused */
#define HFI1_CAP_USE_SDMA_HEAD (1UL << 6) /* DMA Hdr Q tail vs. use CSR */
#define HFI1_CAP_MULTI_PKT_EGR (1UL << 7) /* Enable multi-packet Egr buffs*/
#define HFI1_CAP_NODROP_RHQ_FULL (1UL << 8) /* Don't drop on Hdr Q full */
@@ -99,7 +99,7 @@
#define HFI1_CAP_NO_INTEGRITY (1UL << 13) /* Enable ctxt integrity checks */
#define HFI1_CAP_PKEY_CHECK (1UL << 14) /* Enable ctxt PKey checking */
#define HFI1_CAP_STATIC_RATE_CTRL (1UL << 15) /* Allow PBC.StaticRateControl */
-#define HFI1_CAP_QSFP_ENABLED (1UL << 16) /* Enable QSFP check during LNI */
+/* 1UL << 16 unused */
#define HFI1_CAP_SDMA_HEAD_CHECK (1UL << 17) /* SDMA head checking */
#define HFI1_CAP_EARLY_CREDIT_RETURN (1UL << 18) /* early credit return */
--
1.8.2
* [PATCH 06/23] staging/rdma/hfi1: Add coalescing support for SDMA TX descriptors
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w,
Niranjana Vishwanathapura, Ira Weiny
From: Niranjana Vishwanathapura <niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
When the number of scatter-gather elements in the request exceeds
the number of per-packet descriptors supported by the hardware,
allocate and coalesce the extra scatter-gather elements into a single
buffer. The last descriptor is reserved and used for this coalesced
buffer.
Verbs potentially need this support when transferring small data
chunks involving different memory regions.
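A compact userspace sketch of the coalescing idea: copy the trailing
elements into one buffer, then pad it to a 4-byte boundary before the
single DMA mapping (hypothetical names; assumes the buffer was allocated
with room for the padding, as the patch does with tlen + sizeof(u32)):

    #include <string.h>

    /* append len bytes; returns 1 when the whole packet has been copied */
    static int coalesce(char *buf, unsigned int *idx, unsigned int total,
                        const void *data, unsigned int len)
    {
            memcpy(buf + *idx, data, len);
            *idx += len;
            if (*idx < total)
                    return 0;               /* more fragments to come */
            if (total & 3)                  /* pad the tail out to a u32 */
                    memset(buf + total, 0, 4 - (total & 3));
            return 1;                       /* ready to map as one element */
    }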
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/sdma.c | 124 ++++++++++++++++++++++++++++++++++++---
drivers/staging/rdma/hfi1/sdma.h | 74 +++++++++++++++--------
2 files changed, 168 insertions(+), 30 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index d57531796723..53b3e4d9518b 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -55,6 +55,7 @@
#include <linux/bitops.h>
#include <linux/timer.h>
#include <linux/vmalloc.h>
+#include <linux/highmem.h>
#include "hfi.h"
#include "common.h"
@@ -2706,27 +2707,134 @@ static void __sdma_process_event(struct sdma_engine *sde,
* of descriptors in the sdma_txreq is exhausted.
*
* The code will bump the allocation up to the max
- * of MAX_DESC (64) descriptors. There doesn't seem
- * much point in an interim step.
+ * of MAX_DESC (64) descriptors. There doesn't seem
+ * much point in an interim step. The last descriptor
+ * is reserved for the coalesce buffer in order to support
+ * cases where an input packet has >MAX_DESC iovecs.
*
*/
-int _extend_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
+static int _extend_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
{
int i;
+ /* Handle last descriptor */
+ if (unlikely((tx->num_desc == (MAX_DESC - 1)))) {
+ /* if tlen is 0, it is for padding, release last descriptor */
+ if (!tx->tlen) {
+ tx->desc_limit = MAX_DESC;
+ } else if (!tx->coalesce_buf) {
+ /* allocate coalesce buffer with space for padding */
+ tx->coalesce_buf = kmalloc(tx->tlen + sizeof(u32),
+ GFP_ATOMIC);
+ if (!tx->coalesce_buf)
+ return -ENOMEM;
+
+ tx->coalesce_idx = 0;
+ }
+ return 0;
+ }
+
+ if (unlikely(tx->num_desc == MAX_DESC))
+ return -ENOMEM;
+
tx->descp = kmalloc_array(
MAX_DESC,
sizeof(struct sdma_desc),
GFP_ATOMIC);
if (!tx->descp)
return -ENOMEM;
- tx->desc_limit = MAX_DESC;
+
+ /* reserve last descriptor for coalescing */
+ tx->desc_limit = MAX_DESC - 1;
/* copy ones already built */
for (i = 0; i < tx->num_desc; i++)
tx->descp[i] = tx->descs[i];
return 0;
}
+/*
+ * ext_coal_sdma_tx_descs() - extend or coalesce sdma tx descriptors
+ *
+ * This is called once the initial nominal allocation of descriptors
+ * in the sdma_txreq is exhausted.
+ *
+ * This function calls _extend_sdma_tx_descs to extend or allocate a
+ * coalesce buffer. If a coalesce buffer is allocated, it will copy
+ * the input packet data into it. It also adds the coalesce buffer
+ * descriptor once the whole packet is received.
+ *
+ * Return:
+ * <0 - error
+ * 0 - coalescing, don't populate descriptor
+ * 1 - continue with populating descriptor
+ */
+int ext_coal_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx,
+ int type, void *kvaddr, struct page *page,
+ unsigned long offset, u16 len)
+{
+ int pad_len, rval;
+ dma_addr_t addr;
+
+ rval = _extend_sdma_tx_descs(dd, tx);
+ if (rval) {
+ sdma_txclean(dd, tx);
+ return rval;
+ }
+
+ /* If coalesce buffer is allocated, copy data into it */
+ if (tx->coalesce_buf) {
+ if (type == SDMA_MAP_NONE) {
+ sdma_txclean(dd, tx);
+ return -EINVAL;
+ }
+
+ if (type == SDMA_MAP_PAGE) {
+ kvaddr = kmap(page);
+ kvaddr += offset;
+ } else if (WARN_ON(!kvaddr)) {
+ sdma_txclean(dd, tx);
+ return -EINVAL;
+ }
+
+ memcpy(tx->coalesce_buf + tx->coalesce_idx, kvaddr, len);
+ tx->coalesce_idx += len;
+ if (type == SDMA_MAP_PAGE)
+ kunmap(page);
+
+ /* If there is more data, return */
+ if (tx->tlen - tx->coalesce_idx)
+ return 0;
+
+ /* Whole packet is received; add any padding */
+ pad_len = tx->packet_len & (sizeof(u32) - 1);
+ if (pad_len) {
+ pad_len = sizeof(u32) - pad_len;
+ memset(tx->coalesce_buf + tx->coalesce_idx, 0, pad_len);
+ /* padding is taken care of for coalescing case */
+ tx->packet_len += pad_len;
+ tx->tlen += pad_len;
+ }
+
+ /* dma map the coalesce buffer */
+ addr = dma_map_single(&dd->pcidev->dev,
+ tx->coalesce_buf,
+ tx->tlen,
+ DMA_TO_DEVICE);
+
+ if (unlikely(dma_mapping_error(&dd->pcidev->dev, addr))) {
+ sdma_txclean(dd, tx);
+ return -ENOSPC;
+ }
+
+ /* Add descriptor for coalesce buffer */
+ tx->desc_limit = MAX_DESC;
+ return _sdma_txadd_daddr(dd, SDMA_MAP_SINGLE, tx,
+ addr, tx->tlen);
+ }
+
+ return 1;
+}
+
/* Update sdes when the lmc changes */
void sdma_update_lmc(struct hfi1_devdata *dd, u64 mask, u32 lid)
{
@@ -2752,13 +2860,15 @@ int _pad_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx)
{
int rval = 0;
+ tx->num_desc++;
if ((unlikely(tx->num_desc == tx->desc_limit))) {
rval = _extend_sdma_tx_descs(dd, tx);
- if (rval)
+ if (rval) {
+ sdma_txclean(dd, tx);
return rval;
+ }
}
- /* finish the one just added */
- tx->num_desc++;
+ /* finish the one just added */
make_tx_sdma_desc(
tx,
SDMA_MAP_NONE,
diff --git a/drivers/staging/rdma/hfi1/sdma.h b/drivers/staging/rdma/hfi1/sdma.h
index 496086903891..52a7d04067e0 100644
--- a/drivers/staging/rdma/hfi1/sdma.h
+++ b/drivers/staging/rdma/hfi1/sdma.h
@@ -352,6 +352,8 @@ struct sdma_txreq {
/* private: */
void *coalesce_buf;
/* private: */
+ u16 coalesce_idx;
+ /* private: */
struct iowait *wait;
/* private: */
callback_t complete;
@@ -735,7 +737,9 @@ static inline void make_tx_sdma_desc(
}
/* helper to extend txreq */
-int _extend_sdma_tx_descs(struct hfi1_devdata *, struct sdma_txreq *);
+int ext_coal_sdma_tx_descs(struct hfi1_devdata *dd, struct sdma_txreq *tx,
+ int type, void *kvaddr, struct page *page,
+ unsigned long offset, u16 len);
int _pad_sdma_tx_descs(struct hfi1_devdata *, struct sdma_txreq *);
void sdma_txclean(struct hfi1_devdata *, struct sdma_txreq *);
@@ -762,11 +766,6 @@ static inline int _sdma_txadd_daddr(
{
int rval = 0;
- if ((unlikely(tx->num_desc == tx->desc_limit))) {
- rval = _extend_sdma_tx_descs(dd, tx);
- if (rval)
- return rval;
- }
make_tx_sdma_desc(
tx,
type,
@@ -798,9 +797,7 @@ static inline int _sdma_txadd_daddr(
*
* Return:
* 0 - success, -ENOSPC - mapping fail, -ENOMEM - couldn't
- * extend descriptor array or couldn't allocate coalesce
- * buffer.
- *
+ * extend/coalesce descriptor array
*/
static inline int sdma_txadd_page(
struct hfi1_devdata *dd,
@@ -809,17 +806,28 @@ static inline int sdma_txadd_page(
unsigned long offset,
u16 len)
{
- dma_addr_t addr =
- dma_map_page(
- &dd->pcidev->dev,
- page,
- offset,
- len,
- DMA_TO_DEVICE);
+ dma_addr_t addr;
+ int rval;
+
+ if ((unlikely(tx->num_desc == tx->desc_limit))) {
+ rval = ext_coal_sdma_tx_descs(dd, tx, SDMA_MAP_PAGE,
+ NULL, page, offset, len);
+ if (rval <= 0)
+ return rval;
+ }
+
+ addr = dma_map_page(
+ &dd->pcidev->dev,
+ page,
+ offset,
+ len,
+ DMA_TO_DEVICE);
+
if (unlikely(dma_mapping_error(&dd->pcidev->dev, addr))) {
sdma_txclean(dd, tx);
return -ENOSPC;
}
+
return _sdma_txadd_daddr(
dd, SDMA_MAP_PAGE, tx, addr, len);
}
@@ -846,6 +854,15 @@ static inline int sdma_txadd_daddr(
dma_addr_t addr,
u16 len)
{
+ int rval;
+
+ if ((unlikely(tx->num_desc == tx->desc_limit))) {
+ rval = ext_coal_sdma_tx_descs(dd, tx, SDMA_MAP_NONE,
+ NULL, NULL, 0, 0);
+ if (rval <= 0)
+ return rval;
+ }
+
return _sdma_txadd_daddr(dd, SDMA_MAP_NONE, tx, addr, len);
}
@@ -862,7 +879,7 @@ static inline int sdma_txadd_daddr(
* The mapping/unmapping of the kvaddr and len is automatically handled.
*
* Return:
- * 0 - success, -ENOSPC - mapping fail, -ENOMEM - couldn't extend
+ * 0 - success, -ENOSPC - mapping fail, -ENOMEM - couldn't extend/coalesce
* descriptor array
*/
static inline int sdma_txadd_kvaddr(
@@ -871,16 +888,27 @@ static inline int sdma_txadd_kvaddr(
void *kvaddr,
u16 len)
{
- dma_addr_t addr =
- dma_map_single(
- &dd->pcidev->dev,
- kvaddr,
- len,
- DMA_TO_DEVICE);
+ dma_addr_t addr;
+ int rval;
+
+ if ((unlikely(tx->num_desc == tx->desc_limit))) {
+ rval = ext_coal_sdma_tx_descs(dd, tx, SDMA_MAP_SINGLE,
+ kvaddr, 0, 0, len);
+ if (rval <= 0)
+ return rval;
+ }
+
+ addr = dma_map_single(
+ &dd->pcidev->dev,
+ kvaddr,
+ len,
+ DMA_TO_DEVICE);
+
if (unlikely(dma_mapping_error(&dd->pcidev->dev, addr))) {
sdma_txclean(dd, tx);
return -ENOSPC;
}
+
return _sdma_txadd_daddr(
dd, SDMA_MAP_SINGLE, tx, addr, len);
}
--
1.8.2
* [PATCH 07/23] staging/rdma/hfi1: optionally prescan rx queue for {B,F}ECNs - UC, RC
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Arthur Kepner, Ira Weiny
From: Arthur Kepner <arthur.kepner-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
(This patch adds RC and UC processing, based on commit 48c479f.)
To respond more rapidly to Explicit Congestion Notifications,
prescan the receive queue and process FECNs and BECNs first.
When a UC or RC packet containing a FECN or BECN is found,
immediately react to the ECN (either by returning a CNP or by
adjusting the injection rate). Afterward, the packet is
processed normally.
During testing a bug was found where, when HFI1_CAP_DMA_RTAIL
is toggled off, the "prescan head" can get out of sync with the
receive context's "head". This happens when, after prescan_rxq(),
newly arrived packets are received and processed by an
RX interrupt handler. This is an unavoidable race, so to avoid
getting out of sync we always start prescanning at the current
"rcd->head" entry.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Arthur Kepner <arthur.kepner-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/driver.c | 77 +++++++++++++++++---------------------
drivers/staging/rdma/hfi1/hfi.h | 13 -------
drivers/staging/rdma/hfi1/rc.c | 10 ++---
drivers/staging/rdma/hfi1/uc.c | 15 +++++---
drivers/staging/rdma/hfi1/verbs.c | 41 +++++++++++++++++---
5 files changed, 84 insertions(+), 72 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/driver.c b/drivers/staging/rdma/hfi1/driver.c
index c0a59001e5cd..4ccc2c87b196 100644
--- a/drivers/staging/rdma/hfi1/driver.c
+++ b/drivers/staging/rdma/hfi1/driver.c
@@ -436,59 +436,58 @@ static inline void init_packet(struct hfi1_ctxtdata *rcd,
#ifndef CONFIG_PRESCAN_RXQ
static void prescan_rxq(struct hfi1_packet *packet) {}
-#else /* CONFIG_PRESCAN_RXQ */
+#else /* !CONFIG_PRESCAN_RXQ */
static int prescan_receive_queue;
static void process_ecn(struct hfi1_qp *qp, struct hfi1_ib_header *hdr,
struct hfi1_other_headers *ohdr,
- u64 rhf, struct ib_grh *grh)
+ u64 rhf, u32 bth1, struct ib_grh *grh)
{
struct hfi1_ibport *ibp = to_iport(qp->ibqp.device, qp->port_num);
- u32 bth1;
+ u32 rqpn = 0;
+ u16 rlid;
u8 sc5, svc_type;
- int is_fecn, is_becn;
switch (qp->ibqp.qp_type) {
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
case IB_QPT_UD:
+ rlid = be16_to_cpu(hdr->lrh[3]);
+ rqpn = be32_to_cpu(ohdr->u.ud.deth[1]) & HFI1_QPN_MASK;
svc_type = IB_CC_SVCTYPE_UD;
break;
- case IB_QPT_UC: /* LATER */
- case IB_QPT_RC: /* LATER */
+ case IB_QPT_UC:
+ rlid = qp->remote_ah_attr.dlid;
+ rqpn = qp->remote_qpn;
+ svc_type = IB_CC_SVCTYPE_UC;
+ break;
+ case IB_QPT_RC:
+ rlid = qp->remote_ah_attr.dlid;
+ rqpn = qp->remote_qpn;
+ svc_type = IB_CC_SVCTYPE_RC;
+ break;
default:
return;
}
- is_fecn = (be32_to_cpu(ohdr->bth[1]) >> HFI1_FECN_SHIFT) &
- HFI1_FECN_MASK;
- is_becn = (be32_to_cpu(ohdr->bth[1]) >> HFI1_BECN_SHIFT) &
- HFI1_BECN_MASK;
-
sc5 = (be16_to_cpu(hdr->lrh[0]) >> 12) & 0xf;
if (rhf_dc_info(rhf))
sc5 |= 0x10;
- if (is_fecn) {
- u32 src_qpn = be32_to_cpu(ohdr->u.ud.deth[1]) & HFI1_QPN_MASK;
+ if (bth1 & HFI1_FECN_SMASK) {
u16 pkey = (u16)be32_to_cpu(ohdr->bth[0]);
u16 dlid = be16_to_cpu(hdr->lrh[1]);
- u16 slid = be16_to_cpu(hdr->lrh[3]);
- return_cnp(ibp, qp, src_qpn, pkey, dlid, slid, sc5, grh);
+ return_cnp(ibp, qp, rqpn, pkey, dlid, rlid, sc5, grh);
}
- if (is_becn) {
+ if (bth1 & HFI1_BECN_SMASK) {
struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
- u32 lqpn = be32_to_cpu(ohdr->bth[1]) & HFI1_QPN_MASK;
+ u32 lqpn = bth1 & HFI1_QPN_MASK;
u8 sl = ibp->sc_to_sl[sc5];
- process_becn(ppd, sl, 0, lqpn, 0, svc_type);
+ process_becn(ppd, sl, rlid, lqpn, rqpn, svc_type);
}
-
- /* turn off BECN, or FECN */
- bth1 = be32_to_cpu(ohdr->bth[1]);
- bth1 &= ~(HFI1_FECN_MASK << HFI1_FECN_SHIFT);
- bth1 &= ~(HFI1_BECN_MASK << HFI1_BECN_SHIFT);
- ohdr->bth[1] = cpu_to_be32(bth1);
}
struct ps_mdata {
@@ -508,15 +507,10 @@ static inline void init_ps_mdata(struct ps_mdata *mdata,
mdata->rcd = rcd;
mdata->rsize = packet->rsize;
mdata->maxcnt = packet->maxcnt;
-
- if (rcd->ps_state.initialized == 0) {
- mdata->ps_head = packet->rhqoff;
- rcd->ps_state.initialized++;
- } else
- mdata->ps_head = rcd->ps_state.ps_head;
+ mdata->ps_head = packet->rhqoff;
if (HFI1_CAP_IS_KSET(DMA_RTAIL)) {
- mdata->ps_tail = packet->hdrqtail;
+ mdata->ps_tail = get_rcvhdrtail(rcd);
mdata->ps_seq = 0; /* not used with DMA_RTAIL */
} else {
mdata->ps_tail = 0; /* used only with DMA_RTAIL*/
@@ -533,12 +527,9 @@ static inline int ps_done(struct ps_mdata *mdata, u64 rhf)
static inline void update_ps_mdata(struct ps_mdata *mdata)
{
- struct hfi1_ctxtdata *rcd = mdata->rcd;
-
mdata->ps_head += mdata->rsize;
- if (mdata->ps_head > mdata->maxcnt)
+ if (mdata->ps_head >= mdata->maxcnt)
mdata->ps_head = 0;
- rcd->ps_state.ps_head = mdata->ps_head;
if (!HFI1_CAP_IS_KSET(DMA_RTAIL)) {
if (++mdata->ps_seq > 13)
mdata->ps_seq = 1;
@@ -571,7 +562,7 @@ static void prescan_rxq(struct hfi1_packet *packet)
struct hfi1_other_headers *ohdr;
struct ib_grh *grh = NULL;
u64 rhf = rhf_to_cpu(rhf_addr);
- u32 etype = rhf_rcv_type(rhf), qpn;
+ u32 etype = rhf_rcv_type(rhf), qpn, bth1;
int is_ecn = 0;
u8 lnh;
@@ -593,15 +584,13 @@ static void prescan_rxq(struct hfi1_packet *packet)
} else
goto next; /* just in case */
- is_ecn |= be32_to_cpu(ohdr->bth[1]) &
- (HFI1_FECN_MASK << HFI1_FECN_SHIFT);
- is_ecn |= be32_to_cpu(ohdr->bth[1]) &
- (HFI1_BECN_MASK << HFI1_BECN_SHIFT);
+ bth1 = be32_to_cpu(ohdr->bth[1]);
+ is_ecn = !!(bth1 & (HFI1_FECN_SMASK | HFI1_BECN_SMASK));
if (!is_ecn)
goto next;
- qpn = be32_to_cpu(ohdr->bth[1]) & HFI1_QPN_MASK;
+ qpn = bth1 & HFI1_QPN_MASK;
rcu_read_lock();
qp = hfi1_lookup_qpn(ibp, qpn);
@@ -610,8 +599,12 @@ static void prescan_rxq(struct hfi1_packet *packet)
goto next;
}
- process_ecn(qp, hdr, ohdr, rhf, grh);
+ process_ecn(qp, hdr, ohdr, rhf, bth1, grh);
rcu_read_unlock();
+
+ /* turn off BECN, FECN */
+ bth1 &= ~(HFI1_FECN_SMASK | HFI1_BECN_SMASK);
+ ohdr->bth[1] = cpu_to_be32(bth1);
next:
update_ps_mdata(&mdata);
}
diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index 8ca171bf3e36..60f4b3d88671 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -139,15 +139,6 @@ extern const struct pci_error_handlers hfi1_pci_err_handler;
struct hfi1_opcode_stats_perctx;
#endif
-/*
- * struct ps_state keeps state associated with RX queue "prescanning"
- * (prescanning for FECNs, and BECNs), if prescanning is in use.
- */
-struct ps_state {
- u32 ps_head;
- int initialized;
-};
-
struct ctxt_eager_bufs {
ssize_t size; /* total size of eager buffers */
u32 count; /* size of buffers array */
@@ -302,10 +293,6 @@ struct hfi1_ctxtdata {
struct list_head sdma_queues;
spinlock_t sdma_qlock;
-#ifdef CONFIG_PRESCAN_RXQ
- struct ps_state ps_state;
-#endif /* CONFIG_PRESCAN_RXQ */
-
/*
* The interrupt handler for a particular receive context can vary
* throughout it's lifetime. This is not a lock protected data member so
diff --git a/drivers/staging/rdma/hfi1/rc.c b/drivers/staging/rdma/hfi1/rc.c
index 1e9caebb0281..6931ebe99b3b 100644
--- a/drivers/staging/rdma/hfi1/rc.c
+++ b/drivers/staging/rdma/hfi1/rc.c
@@ -1971,7 +1971,7 @@ void hfi1_rc_rcv(struct hfi1_packet *packet)
}
psn = be32_to_cpu(ohdr->bth[2]);
- opcode = bth0 >> 24;
+ opcode = (bth0 >> 24) & 0xff;
/*
* Process responses (ACKs) before anything else. Note that the
@@ -2384,19 +2384,19 @@ void hfi1_rc_hdrerr(
struct hfi1_ibport *ibp = to_iport(qp->ibqp.device, qp->port_num);
int diff;
u32 opcode;
- u32 psn;
+ u32 psn, bth0;
/* Check for GRH */
ohdr = &hdr->u.oth;
if (has_grh)
ohdr = &hdr->u.l.oth;
- opcode = be32_to_cpu(ohdr->bth[0]);
- if (hfi1_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode))
+ bth0 = be32_to_cpu(ohdr->bth[0]);
+ if (hfi1_ruc_check_hdr(ibp, hdr, has_grh, qp, bth0))
return;
psn = be32_to_cpu(ohdr->bth[2]);
- opcode >>= 24;
+ opcode = (bth0 >> 24) & 0xff;
/* Only deal with RDMA Writes for now */
if (opcode < IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST) {
diff --git a/drivers/staging/rdma/hfi1/uc.c b/drivers/staging/rdma/hfi1/uc.c
index b536f397737c..d785878d8b90 100644
--- a/drivers/staging/rdma/hfi1/uc.c
+++ b/drivers/staging/rdma/hfi1/uc.c
@@ -268,7 +268,7 @@ void hfi1_uc_rcv(struct hfi1_packet *packet)
u32 tlen = packet->tlen;
struct hfi1_qp *qp = packet->qp;
struct hfi1_other_headers *ohdr = packet->ohdr;
- u32 opcode;
+ u32 bth0, opcode;
u32 hdrsize = packet->hlen;
u32 psn;
u32 pad;
@@ -278,10 +278,9 @@ void hfi1_uc_rcv(struct hfi1_packet *packet)
int has_grh = rcv_flags & HFI1_HAS_GRH;
int ret;
u32 bth1;
- struct ib_grh *grh = NULL;
- opcode = be32_to_cpu(ohdr->bth[0]);
- if (hfi1_ruc_check_hdr(ibp, hdr, has_grh, qp, opcode))
+ bth0 = be32_to_cpu(ohdr->bth[0]);
+ if (hfi1_ruc_check_hdr(ibp, hdr, has_grh, qp, bth0))
return;
bth1 = be32_to_cpu(ohdr->bth[1]);
@@ -303,6 +302,7 @@ void hfi1_uc_rcv(struct hfi1_packet *packet)
}
if (bth1 & HFI1_FECN_SMASK) {
+ struct ib_grh *grh = NULL;
u16 pkey = (u16)be32_to_cpu(ohdr->bth[0]);
u16 slid = be16_to_cpu(hdr->lrh[3]);
u16 dlid = be16_to_cpu(hdr->lrh[1]);
@@ -310,13 +310,16 @@ void hfi1_uc_rcv(struct hfi1_packet *packet)
u8 sc5;
sc5 = ibp->sl_to_sc[qp->remote_ah_attr.sl];
+ if (has_grh)
+ grh = &hdr->u.l.grh;
- return_cnp(ibp, qp, src_qp, pkey, dlid, slid, sc5, grh);
+ return_cnp(ibp, qp, src_qp, pkey, dlid, slid, sc5,
+ grh);
}
}
psn = be32_to_cpu(ohdr->bth[2]);
- opcode >>= 24;
+ opcode = (bth0 >> 24) & 0xff;
/* Compare the PSN verses the expected PSN. */
if (unlikely(cmp_psn(psn, qp->r_psn) != 0)) {
diff --git a/drivers/staging/rdma/hfi1/verbs.c b/drivers/staging/rdma/hfi1/verbs.c
index a43fccd68a73..8456e82b2ab1 100644
--- a/drivers/staging/rdma/hfi1/verbs.c
+++ b/drivers/staging/rdma/hfi1/verbs.c
@@ -2136,11 +2136,40 @@ void hfi1_schedule_send(struct hfi1_qp *qp)
void hfi1_cnp_rcv(struct hfi1_packet *packet)
{
struct hfi1_ibport *ibp = &packet->rcd->ppd->ibport_data;
-
- if (packet->qp->ibqp.qp_type == IB_QPT_UC)
- hfi1_uc_rcv(packet);
- else if (packet->qp->ibqp.qp_type == IB_QPT_UD)
- hfi1_ud_rcv(packet);
- else
+ struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
+ struct hfi1_ib_header *hdr = packet->hdr;
+ struct hfi1_qp *qp = packet->qp;
+ u32 lqpn, rqpn = 0;
+ u16 rlid = 0;
+ u8 sl, sc5, sc4_bit, svc_type;
+ bool sc4_set = has_sc4_bit(packet);
+
+ switch (packet->qp->ibqp.qp_type) {
+ case IB_QPT_UC:
+ rlid = qp->remote_ah_attr.dlid;
+ rqpn = qp->remote_qpn;
+ svc_type = IB_CC_SVCTYPE_UC;
+ break;
+ case IB_QPT_RC:
+ rlid = qp->remote_ah_attr.dlid;
+ rqpn = qp->remote_qpn;
+ svc_type = IB_CC_SVCTYPE_RC;
+ break;
+ case IB_QPT_SMI:
+ case IB_QPT_GSI:
+ case IB_QPT_UD:
+ svc_type = IB_CC_SVCTYPE_UD;
+ break;
+ default:
ibp->n_pkt_drops++;
+ return;
+ }
+
+ sc4_bit = sc4_set << 4;
+ sc5 = (be16_to_cpu(hdr->lrh[0]) >> 12) & 0xf;
+ sc5 |= sc4_bit;
+ sl = ibp->sc_to_sl[sc5];
+ lqpn = qp->ibqp.qp_num;
+
+ process_becn(ppd, sl, rlid, lqpn, rqpn, svc_type);
}
--
1.8.2
* [PATCH 08/23] staging/rdma/hfi1: Method to toggle "fast ECN" detection
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Vennila Megavannan,
Ira Weiny
From: Vennila Megavannan <vennila.megavannan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Add a module parameter to toggle prescan/Fast ECN Detection.
In addition, change the PRESCAN_RXQ Kconfig default to "y".
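A minimal sketch of the run-time gate (the real macro is in the diff
below); the design point is that the flag test is expanded at each call
site, so a disabled scan costs one branch and no function call:

    /* sketch: zero-overhead gate around an optional scan */
    static unsigned int scan_enabled;       /* toggled via module parameter */

    static void __scan(void *pkt)
    {
            /* ... the actual prescan work ... */
    }

    #define maybe_scan(pkt)                 \
            do {                            \
                    if (scan_enabled)       \
                            __scan(pkt);    \
            } while (0)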
Reviewed-by: Arthur Kepner <arthur.kepner-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Vennila Megavannan <vennila.megavannan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/Kconfig | 14 +++++++-------
drivers/staging/rdma/hfi1/driver.c | 24 +++++++++++++++++-------
2 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/Kconfig b/drivers/staging/rdma/hfi1/Kconfig
index fd25078ee923..0336dea60786 100644
--- a/drivers/staging/rdma/hfi1/Kconfig
+++ b/drivers/staging/rdma/hfi1/Kconfig
@@ -26,12 +26,12 @@ config SDMA_VERBOSITY
This is a configuration flag to enable verbose
SDMA debug
config PRESCAN_RXQ
- bool "Enable prescanning of the RX queue for ECNs"
+ bool "Enable optional prescanning of the RX queue for ECNs"
depends on INFINIBAND_HFI1
- default n
+ default y
---help---
- This option toggles the prescanning of the receive queue for
- Explicit Congestion Notifications. If an ECN is detected, it
- is processed as quickly as possible, the ECN is toggled off.
- After the prescanning step, the receive queue is processed as
- usual.
+ This option enables code for the prescanning of the receive queue for
+ Explicit Congestion Notifications. Pre-scanning can be controlled via a
+ module option at run time. If an ECN is detected, it is processed as
+ quickly as possible and the ECN is toggled off. After the prescanning
+ step, the receive queue is processed as usual.
diff --git a/drivers/staging/rdma/hfi1/driver.c b/drivers/staging/rdma/hfi1/driver.c
index 4ccc2c87b196..9f2cf9589b63 100644
--- a/drivers/staging/rdma/hfi1/driver.c
+++ b/drivers/staging/rdma/hfi1/driver.c
@@ -83,6 +83,14 @@ unsigned int hfi1_cu = 1;
module_param_named(cu, hfi1_cu, uint, S_IRUGO);
MODULE_PARM_DESC(cu, "Credit return units");
+#ifdef CONFIG_PRESCAN_RXQ
+static unsigned int prescan_rx_queue;
+module_param_named(prescan_rxq, prescan_rx_queue, uint,
+ S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(prescan_rxq,
+ "Used to toggle rx prescan. Set to 1 to enable prescan");
+#endif /* CONFIG_PRESCAN_RXQ */
+
unsigned long hfi1_cap_mask = HFI1_CAP_MASK_DEFAULT;
static int hfi1_caps_set(const char *, const struct kernel_param *);
static int hfi1_caps_get(char *, const struct kernel_param *);
@@ -435,10 +443,8 @@ static inline void init_packet(struct hfi1_ctxtdata *rcd,
}
#ifndef CONFIG_PRESCAN_RXQ
-static void prescan_rxq(struct hfi1_packet *packet) {}
+#define prescan_rxq(packet)
#else /* !CONFIG_PRESCAN_RXQ */
-static int prescan_receive_queue;
-
static void process_ecn(struct hfi1_qp *qp, struct hfi1_ib_header *hdr,
struct hfi1_other_headers *ohdr,
u64 rhf, u32 bth1, struct ib_grh *grh)
@@ -541,15 +547,19 @@ static inline void update_ps_mdata(struct ps_mdata *mdata)
* containing Excplicit Congestion Notifications (FECNs, or BECNs).
* When an ECN is found, process the Congestion Notification, and toggle
* it off.
+ * This is declared as a macro to allow quick checking of the module param and
+ * avoid the overhead of a function call if not enabled.
*/
-static void prescan_rxq(struct hfi1_packet *packet)
+#define prescan_rxq(packet) \
+ do { \
+ if (prescan_rx_queue) \
+ __prescan_rxq(packet); \
+ } while (0)
+static void __prescan_rxq(struct hfi1_packet *packet)
{
struct hfi1_ctxtdata *rcd = packet->rcd;
struct ps_mdata mdata;
- if (!prescan_receive_queue)
- return;
-
init_ps_mdata(&mdata, packet);
while (1) {
--
1.8.2
* [PATCH 09/23] staging/rdma/hfi1: Fix sparse error in sdma.h file
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w,
Niranjana Vishwanathapura, Ira Weiny
From: Niranjana Vishwanathapura <niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Use NULL instead of 0 for the pointer argument to fix the sparse error.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/sdma.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/rdma/hfi1/sdma.h b/drivers/staging/rdma/hfi1/sdma.h
index 52a7d04067e0..cc22d2ee2054 100644
--- a/drivers/staging/rdma/hfi1/sdma.h
+++ b/drivers/staging/rdma/hfi1/sdma.h
@@ -893,7 +893,7 @@ static inline int sdma_txadd_kvaddr(
if ((unlikely(tx->num_desc == tx->desc_limit))) {
rval = ext_coal_sdma_tx_descs(dd, tx, SDMA_MAP_SINGLE,
- kvaddr, 0, 0, len);
+ kvaddr, NULL, 0, len);
if (rval <= 0)
return rval;
}
--
1.8.2
* [PATCH 10/23] staging/rdma/hfi1: close shared context security hole
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Jareer Abdel-Qader,
Ira Weiny
From: Jareer Abdel-Qader <jareer.h.abdel-qader-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The driver does not verify the user ID for shared context assignments,
allowing a malicious user access to another user's context.
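A sketch of the ownership test being added (assuming generate_jkey()
derives a job key from the uid; the derivation below is a placeholder,
not the driver's):

    static unsigned int generate_jkey(unsigned int uid)
    {
            return uid & 0x7fff;    /* placeholder derivation */
    }

    /* a candidate context may only be shared with its owning user */
    static int may_share(unsigned int ctxt_jkey, unsigned int uid)
    {
            return ctxt_jkey == generate_jkey(uid);
    }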
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jareer H Abdel-Qader <jareer.h.abdel-qader-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/file_ops.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c
index 7d2868050981..3c9cae6f64a3 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -948,6 +948,7 @@ static int find_shared_ctxt(struct file *fp,
/* Skip ctxt if it doesn't match the requested one */
if (memcmp(uctxt->uuid, uinfo->uuid,
sizeof(uctxt->uuid)) ||
+ uctxt->jkey != generate_jkey(current_uid()) ||
uctxt->subctxt_id != uinfo->subctxt_id ||
uctxt->subctxt_cnt != uinfo->subctxt_cnt)
continue;
--
1.8.2
* [PATCH 11/23] staging/rdma/hfi1: Reset firmware instead of reloading Sbus
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Caz Yokoyama, Ira Weiny
From: Caz Yokoyama <caz.yokoyama-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reset the SBus firmware instead of reloading it if it has already been
loaded for this ASIC. To work around a thermal polling problem in the
firmware, do not reload the SBus firmware; instead, reset the firmware
on the initialization of the second HFI.
Reviewed-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Caz Yokoyama <caz.yokoyama-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.h | 1 +
drivers/staging/rdma/hfi1/firmware.c | 4 +---
drivers/staging/rdma/hfi1/pcie.c | 13 ++++++++++++-
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/chip.h b/drivers/staging/rdma/hfi1/chip.h
index f89a432c7334..497c5de23d53 100644
--- a/drivers/staging/rdma/hfi1/chip.h
+++ b/drivers/staging/rdma/hfi1/chip.h
@@ -609,6 +609,7 @@ static inline void write_uctxt_csr(struct hfi1_devdata *dd, int ctxt,
u64 create_pbc(struct hfi1_pportdata *ppd, u64, int, u32, u32);
/* firmware.c */
+#define SBUS_MASTER_BROADCAST 0xfd
#define NUM_PCIE_SERDES 16 /* number of PCIe serdes on the SBus */
extern const u8 pcie_serdes_broadcast[];
extern const u8 pcie_pcs_addrs[2][NUM_PCIE_SERDES];
diff --git a/drivers/staging/rdma/hfi1/firmware.c b/drivers/staging/rdma/hfi1/firmware.c
index 15c9cb7a3150..ace75ce3da3f 100644
--- a/drivers/staging/rdma/hfi1/firmware.c
+++ b/drivers/staging/rdma/hfi1/firmware.c
@@ -924,9 +924,6 @@ static int load_8051_firmware(struct hfi1_devdata *dd,
return 0;
}
-/* SBus Master broadcast address */
-#define SBUS_MASTER_BROADCAST 0xfd
-
/*
* Write the SBus request register
*
@@ -1255,6 +1252,7 @@ int load_firmware(struct hfi1_devdata *dd)
ret = load_sbus_firmware(dd, &fw_sbus);
if (ret)
goto clear;
+ fw_sbus_load = 0;
}
if (fw_fabric_serdes_load) {
diff --git a/drivers/staging/rdma/hfi1/pcie.c b/drivers/staging/rdma/hfi1/pcie.c
index ac5653c0f65e..3b50cdda1c0a 100644
--- a/drivers/staging/rdma/hfi1/pcie.c
+++ b/drivers/staging/rdma/hfi1/pcie.c
@@ -946,9 +946,20 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
__func__);
}
+retry:
+ if (therm) {
+ /* toggle SPICO_ENABLE to get back to the state
+ just after the firmware load */
+ sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
+ WRITE_SBUS_RECEIVER, 0x00000040);
+ sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
+ WRITE_SBUS_RECEIVER, 0x00000140);
+ dd_dev_info(dd, "%s: toggle SPICO_ENABLE to reset the bus\n",
+ __func__);
+ }
+
/* step 3: download SBus Master firmware */
/* step 4: download PCIe Gen3 SerDes firmware */
-retry:
dd_dev_info(dd, "%s: downloading firmware\n", __func__);
ret = load_pcie_firmware(dd);
if (ret)
--
1.8.2
* [PATCH 12/23] staging/rdma/hfi1: Add a schedule in send thread
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (10 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 11/23] staging/rdma/hfi1: Reset firmware instead of reloading Sbus ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 13/23] staging/rdma/hfi1: Fix port bounce issues with 0.22 DC firmware ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (9 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Dean Luick, Ira Weiny
From: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Under heavy load, the send handler can run for too long without giving other
tasks a chance to run. Add a conditional reschedule to break this up.
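Schematically, this is the standard jiffies-based pattern for breaking up a
long-running kernel loop; a minimal sketch assuming the names introduced below
(SEND_RESCHED_TIMEOUT, make_req):

#include <linux/jiffies.h>
#include <linux/sched.h>

#define SEND_RESCHED_TIMEOUT (5 * HZ)	/* force a resched every 5 seconds */

static void do_send_sketch(struct hfi1_qp *qp,
			   int (*make_req)(struct hfi1_qp *))
{
	unsigned long timeout = jiffies + SEND_RESCHED_TIMEOUT;

	do {
		/* ... construct and hand off one packet ... */

		/* yield if we have held the CPU for a full period */
		if (unlikely(time_after(jiffies, timeout))) {
			cond_resched();
			timeout = jiffies + SEND_RESCHED_TIMEOUT;
		}
	} while (make_req(qp));
}

cond_resched() only yields when a reschedule is actually pending, so the common
case stays cheap; the new SendSched counter makes the yields observable.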
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 10 ++++++++++
drivers/staging/rdma/hfi1/chip.h | 1 +
drivers/staging/rdma/hfi1/ruc.c | 12 ++++++++++++
drivers/staging/rdma/hfi1/verbs.h | 1 +
4 files changed, 24 insertions(+)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index 93ca13fceb51..3c48294abd20 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -1530,6 +1530,14 @@ static u64 access_sw_kmem_wait(const struct cntr_entry *entry,
return dd->verbs_dev.n_kmem_wait;
}
+static u64 access_sw_send_schedule(const struct cntr_entry *entry,
+ void *context, int vl, int mode, u64 data)
+{
+ struct hfi1_devdata *dd = (struct hfi1_devdata *)context;
+
+ return dd->verbs_dev.n_send_schedule;
+}
+
#define def_access_sw_cpu(cntr) \
static u64 access_sw_cpu_##cntr(const struct cntr_entry *entry, \
void *context, int vl, int mode, u64 data) \
@@ -1720,6 +1728,8 @@ static struct cntr_entry dev_cntrs[DEV_CNTR_LAST] = {
access_sw_pio_wait),
[C_SW_KMEM_WAIT] = CNTR_ELEM("KmemWait", 0, 0, CNTR_NORMAL,
access_sw_kmem_wait),
+[C_SW_SEND_SCHED] = CNTR_ELEM("SendSched", 0, 0, CNTR_NORMAL,
+ access_sw_send_schedule),
};
static struct cntr_entry port_cntrs[PORT_CNTR_LAST] = {
diff --git a/drivers/staging/rdma/hfi1/chip.h b/drivers/staging/rdma/hfi1/chip.h
index 497c5de23d53..ebf9041a1c5e 100644
--- a/drivers/staging/rdma/hfi1/chip.h
+++ b/drivers/staging/rdma/hfi1/chip.h
@@ -787,6 +787,7 @@ enum {
C_SW_VTX_WAIT,
C_SW_PIO_WAIT,
C_SW_KMEM_WAIT,
+ C_SW_SEND_SCHED,
DEV_CNTR_LAST /* Must be kept last */
};
diff --git a/drivers/staging/rdma/hfi1/ruc.c b/drivers/staging/rdma/hfi1/ruc.c
index faad1b93703e..8614b070545c 100644
--- a/drivers/staging/rdma/hfi1/ruc.c
+++ b/drivers/staging/rdma/hfi1/ruc.c
@@ -820,6 +820,9 @@ void hfi1_make_ruc_header(struct hfi1_qp *qp, struct hfi1_other_headers *ohdr,
ohdr->bth[2] = cpu_to_be32(bth2);
}
+/* when sending, force a reschedule every one of these periods */
+#define SEND_RESCHED_TIMEOUT (5 * HZ) /* 5s in jiffies */
+
/**
* hfi1_do_send - perform a send on a QP
* @work: contains a pointer to the QP
@@ -836,6 +839,7 @@ void hfi1_do_send(struct work_struct *work)
struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
int (*make_req)(struct hfi1_qp *qp);
unsigned long flags;
+ unsigned long timeout;
if ((qp->ibqp.qp_type == IB_QPT_RC ||
qp->ibqp.qp_type == IB_QPT_UC) &&
@@ -864,6 +868,7 @@ void hfi1_do_send(struct work_struct *work)
spin_unlock_irqrestore(&qp->s_lock, flags);
+ timeout = jiffies + SEND_RESCHED_TIMEOUT;
do {
/* Check for a constructed packet to be sent. */
if (qp->s_hdrwords != 0) {
@@ -877,6 +882,13 @@ void hfi1_do_send(struct work_struct *work)
/* Record that s_hdr is empty. */
qp->s_hdrwords = 0;
}
+
+ /* allow other tasks to run */
+ if (unlikely(time_after(jiffies, timeout))) {
+ cond_resched();
+ ppd->dd->verbs_dev.n_send_schedule++;
+ timeout = jiffies + SEND_RESCHED_TIMEOUT;
+ }
} while (make_req(qp));
}
diff --git a/drivers/staging/rdma/hfi1/verbs.h b/drivers/staging/rdma/hfi1/verbs.h
index afaa0fe619fe..e4a8a0d4ccf8 100644
--- a/drivers/staging/rdma/hfi1/verbs.h
+++ b/drivers/staging/rdma/hfi1/verbs.h
@@ -754,6 +754,7 @@ struct hfi1_ibdev {
u64 n_piowait;
u64 n_txwait;
u64 n_kmem_wait;
+ u64 n_send_schedule;
u32 n_pds_allocated; /* number of PDs allocated for device */
spinlock_t n_pds_lock;
--
1.8.2
* [PATCH 13/23] staging/rdma/hfi1: Fix port bounce issues with 0.22 DC firmware
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (11 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 12/23] staging/rdma/hfi1: Add a schedule in send thread ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 14/23] staging/rdma/hfi1: Prevent silent data corruption with user SDMA ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (8 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Easwar Hariharan,
Ira Weiny
From: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The DC firmware overwrites the enable_lane_tx register and does not update it
on a host request to go to Poll. This causes an infinite loop through the LNI
state machine if a link width downgrade occurs. This patch works around the
problem by re-setting the enable_lane_tx register to enable all four lanes.
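As a reading aid: enable_lane_tx is treated here as a bitmask with one enable
bit per TX lane (an assumption consistent with the 0xF value and the "enable
all four lanes" comment in the diff), so the workaround is a single
re-initialization before the write_tx_settings() call:

	enable_lane_tx = 0xF;	/* 0b1111: one bit per lane, all four enabled */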
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index 3c48294abd20..0be37e9430af 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -5417,6 +5417,8 @@ static int set_local_link_attributes(struct hfi1_pportdata *ppd)
if (ppd->link_speed_enabled & OPA_LINK_SPEED_12_5G)
ppd->local_tx_rate |= 1;
}
+
+ enable_lane_tx = 0xF; /* enable all four lanes */
ret = write_tx_settings(dd, enable_lane_tx, tx_polarity_inversion,
rx_polarity_inversion, ppd->local_tx_rate);
if (ret != HCMD_SUCCESS)
--
1.8.2
* [PATCH 14/23] staging/rdma/hfi1: Prevent silent data corruption with user SDMA
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (12 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 13/23] staging/rdma/hfi1: Fix port bounce issues with 0.22 DC firmware ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 15/23] staging/rdma/hfi1: Implement Expected Receive TID caching ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (7 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Mitko Haralanov,
Ira Weiny
From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
User SDMA tracks progress into the submitted IO vectors with an offset that is
updated after each successful submission of a txreq to the SDMA engine.
The same offset was also used to decide whether an IO vector should be 'freed'
(its pages unpinned) in the SDMA callback functions.
This caused silent data corruption on the receive side in big jobs (> 2 nodes,
120 ranks each) because the send side was mistakenly unpinning the vector pages
before the HW had processed all descriptors referencing the vector.
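In outline, the fix below replaces that shared-offset check with explicit
per-vector completion flags carried by each txreq; a stripped-down sketch using
the names from the diff (callback body simplified, surrounding types assumed):

#define TXREQ_FLAGS_IOVEC_LAST_PKT (1 << 0)	/* last txreq touching this vector */

struct txreq_sketch {
	struct {
		struct user_sdma_iovec *vec;
		u8 flags;
	} iovecs[3];	/* a txreq can span up to 3 pages, hence 3 vectors */
	int idx;	/* highest iovecs[] slot in use; -1 if none */
};

/* SDMA completion callback: unpin only vectors explicitly marked done,
 * never by comparing the send-side offset against iov_len */
static void txreq_complete_sketch(struct txreq_sketch *tx)
{
	int i;

	for (i = tx->idx; i >= 0; i--)
		if (tx->iovecs[i].flags & TXREQ_FLAGS_IOVEC_LAST_PKT)
			unpin_vector_pages(tx->iovecs[i].vec);
}

The flag is set only once the submit path knows no later descriptor will
reference the vector, so the pages stay pinned until the hardware is done with
them.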
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/user_sdma.c | 90 ++++++++++++++++++++---------------
1 file changed, 52 insertions(+), 38 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/user_sdma.c b/drivers/staging/rdma/hfi1/user_sdma.c
index 10ab8073d3f2..36c838dcf023 100644
--- a/drivers/staging/rdma/hfi1/user_sdma.c
+++ b/drivers/staging/rdma/hfi1/user_sdma.c
@@ -146,7 +146,8 @@ MODULE_PARM_DESC(sdma_comp_size, "Size of User SDMA completion ring. Default: 12
#define KDETH_OM_MAX_SIZE (1 << ((KDETH_OM_LARGE / KDETH_OM_SMALL) + 1))
/* Last packet in the request */
-#define USER_SDMA_TXREQ_FLAGS_LAST_PKT (1 << 0)
+#define TXREQ_FLAGS_REQ_LAST_PKT (1 << 0)
+#define TXREQ_FLAGS_IOVEC_LAST_PKT (1 << 0)
#define SDMA_REQ_IN_USE 0
#define SDMA_REQ_FOR_THREAD 1
@@ -249,13 +250,22 @@ struct user_sdma_request {
unsigned long flags;
};
+/*
+ * A single txreq could span up to 3 physical pages when the MTU
+ * is sufficiently large (> 4K). Each of the IOV pointers also
+ * needs its own set of flags so that each vector can be handled
+ * independently of the others.
+ */
struct user_sdma_txreq {
/* Packet header for the txreq */
struct hfi1_pkt_header hdr;
struct sdma_txreq txreq;
struct user_sdma_request *req;
- struct user_sdma_iovec *iovec1;
- struct user_sdma_iovec *iovec2;
+ struct {
+ struct user_sdma_iovec *vec;
+ u8 flags;
+ } iovecs[3];
+ int idx;
u16 flags;
unsigned busycount;
u64 seqnum;
@@ -294,21 +304,6 @@ static int defer_packet_queue(
unsigned seq);
static void activate_packet_queue(struct iowait *, int);
-static inline int iovec_may_free(struct user_sdma_iovec *iovec,
- void (*free)(struct user_sdma_iovec *))
-{
- if (ACCESS_ONCE(iovec->offset) == iovec->iov.iov_len) {
- free(iovec);
- return 1;
- }
- return 0;
-}
-
-static inline void iovec_set_complete(struct user_sdma_iovec *iovec)
-{
- iovec->offset = iovec->iov.iov_len;
-}
-
static int defer_packet_queue(
struct sdma_engine *sde,
struct iowait *wait,
@@ -825,11 +820,11 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
tx->flags = 0;
tx->req = req;
tx->busycount = 0;
- tx->iovec1 = NULL;
- tx->iovec2 = NULL;
+ tx->idx = -1;
+ memset(tx->iovecs, 0, sizeof(tx->iovecs));
if (req->seqnum == req->info.npkts - 1)
- tx->flags |= USER_SDMA_TXREQ_FLAGS_LAST_PKT;
+ tx->flags |= TXREQ_FLAGS_REQ_LAST_PKT;
/*
* Calculate the payload size - this is min of the fragment
@@ -858,7 +853,7 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
goto free_tx;
}
- tx->iovec1 = iovec;
+ tx->iovecs[++tx->idx].vec = iovec;
datalen = compute_data_length(req, tx);
if (!datalen) {
SDMA_DBG(req,
@@ -948,10 +943,17 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
iovec->pages[pageidx],
offset, len);
if (ret) {
+ int i;
+
dd_dev_err(pq->dd,
"SDMA txreq add page failed %d\n",
ret);
- iovec_set_complete(iovec);
+ /* Mark all assigned vectors as complete so they
+ * are unpinned in the callback. */
+ for (i = tx->idx; i >= 0; i--) {
+ tx->iovecs[i].flags |=
+ TXREQ_FLAGS_IOVEC_LAST_PKT;
+ }
goto free_txreq;
}
iov_offset += len;
@@ -959,8 +961,11 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
data_sent += len;
if (unlikely(queued < datalen &&
pageidx == iovec->npages &&
- req->iov_idx < req->data_iovs - 1)) {
+ req->iov_idx < req->data_iovs - 1 &&
+ tx->idx < ARRAY_SIZE(tx->iovecs))) {
iovec->offset += iov_offset;
+ tx->iovecs[tx->idx].flags |=
+ TXREQ_FLAGS_IOVEC_LAST_PKT;
iovec = &req->iovs[++req->iov_idx];
if (!iovec->pages) {
ret = pin_vector_pages(req, iovec);
@@ -968,8 +973,7 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
goto free_txreq;
}
iov_offset = 0;
- tx->iovec2 = iovec;
-
+ tx->iovecs[++tx->idx].vec = iovec;
}
}
/*
@@ -981,11 +985,15 @@ static int user_sdma_send_pkts(struct user_sdma_request *req, unsigned maxpkts)
req->tidoffset += datalen;
req->sent += data_sent;
if (req->data_len) {
- if (tx->iovec1 && !tx->iovec2)
- tx->iovec1->offset += iov_offset;
- else if (tx->iovec2)
- tx->iovec2->offset += iov_offset;
+ tx->iovecs[tx->idx].vec->offset += iov_offset;
+ /* If we've reached the end of the io vector, mark it
+ * so the callback can unpin the pages and free it. */
+ if (tx->iovecs[tx->idx].vec->offset ==
+ tx->iovecs[tx->idx].vec->iov.iov_len)
+ tx->iovecs[tx->idx].flags |=
+ TXREQ_FLAGS_IOVEC_LAST_PKT;
}
+
/*
* It is important to increment this here as it is used to
* generate the BTH.PSN and, therefore, can't be bulk-updated
@@ -1214,7 +1222,7 @@ static int set_txreq_header(struct user_sdma_request *req,
req->seqnum));
/* Set ACK request on last packet */
- if (unlikely(tx->flags & USER_SDMA_TXREQ_FLAGS_LAST_PKT))
+ if (unlikely(tx->flags & TXREQ_FLAGS_REQ_LAST_PKT))
hdr->bth[2] |= cpu_to_be32(1UL<<31);
/* Set the new offset */
@@ -1246,7 +1254,7 @@ static int set_txreq_header(struct user_sdma_request *req,
KDETH_SET(hdr->kdeth.ver_tid_offset, TID,
EXP_TID_GET(tidval, IDX));
/* Clear KDETH.SH only on the last packet */
- if (unlikely(tx->flags & USER_SDMA_TXREQ_FLAGS_LAST_PKT))
+ if (unlikely(tx->flags & TXREQ_FLAGS_REQ_LAST_PKT))
KDETH_SET(hdr->kdeth.ver_tid_offset, SH, 0);
/*
* Set the KDETH.OFFSET and KDETH.OM based on size of
@@ -1290,7 +1298,7 @@ static int set_txreq_header_ahg(struct user_sdma_request *req,
/* BTH.PSN and BTH.A */
val32 = (be32_to_cpu(hdr->bth[2]) + req->seqnum) &
(HFI1_CAP_IS_KSET(EXTENDED_PSN) ? 0x7fffffff : 0xffffff);
- if (unlikely(tx->flags & USER_SDMA_TXREQ_FLAGS_LAST_PKT))
+ if (unlikely(tx->flags & TXREQ_FLAGS_REQ_LAST_PKT))
val32 |= 1UL << 31;
AHG_HEADER_SET(req->ahg, diff, 6, 0, 16, cpu_to_be16(val32 >> 16));
AHG_HEADER_SET(req->ahg, diff, 6, 16, 16, cpu_to_be16(val32 & 0xffff));
@@ -1331,7 +1339,7 @@ static int set_txreq_header_ahg(struct user_sdma_request *req,
val = cpu_to_le16(((EXP_TID_GET(tidval, CTRL) & 0x3) << 10) |
(EXP_TID_GET(tidval, IDX) & 0x3ff));
/* Clear KDETH.SH on last packet */
- if (unlikely(tx->flags & USER_SDMA_TXREQ_FLAGS_LAST_PKT)) {
+ if (unlikely(tx->flags & TXREQ_FLAGS_REQ_LAST_PKT)) {
val |= cpu_to_le16(KDETH_GET(hdr->kdeth.ver_tid_offset,
INTR) >> 16);
val &= cpu_to_le16(~(1U << 13));
@@ -1358,10 +1366,16 @@ static void user_sdma_txreq_cb(struct sdma_txreq *txreq, int status,
if (unlikely(!req || !pq))
return;
- if (tx->iovec1)
- iovec_may_free(tx->iovec1, unpin_vector_pages);
- if (tx->iovec2)
- iovec_may_free(tx->iovec2, unpin_vector_pages);
+ /* If we have any io vectors associated with this txreq,
+ * check whether they need to be 'freed'. */
+ if (tx->idx != -1) {
+ int i;
+
+ for (i = tx->idx; i >= 0; i--) {
+ if (tx->iovecs[i].flags & TXREQ_FLAGS_IOVEC_LAST_PKT)
+ unpin_vector_pages(tx->iovecs[i].vec);
+ }
+ }
tx_seqnum = tx->seqnum;
kmem_cache_free(pq->txreq_cache, tx);
--
1.8.2
* [PATCH 15/23] staging/rdma/hfi1: Implement Expected Receive TID caching
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (13 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 14/23] staging/rdma/hfi1: Prevent silent data corruption with user SDMA ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
[not found] ` <1445273027-29634-16-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-10-19 16:43 ` [PATCH 16/23] staging/rdma/hfi1: Allow tuning of SDMA interrupt rate ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (6 subsequent siblings)
21 siblings, 1 reply; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Mitko Haralanov,
Ira Weiny
From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Expected receives work by user-space libraries (PSM) calling into the driver
with information about the user's receive buffer; the driver then DMA-maps that
buffer and programs the HFI to receive data directly into it.
This is an expensive operation, as it requires the driver to pin the pages that
the user's buffer maps to, DMA-map them, and then program the HFI.
When the receive is complete, user-space libraries have to call into the driver
again so the buffer is removed from the HFI, unmapped, and its pages unpinned.
All of these operations are expensive, considering that a lot of applications
(especially micro-benchmarks) use the same buffer over and over.
In order to get better performance for user-space applications, it is highly
beneficial that they don't continuously call into the driver to register and
unregister the same buffer. Rather, they can register the buffer and cache it
for future work. The buffer can be unregistered when it is freed by the user.
This change implements such buffer caching by making use of the kernel's MMU
notifier API. User-space libraries call into the driver only when they need to
register a new buffer.
Once a buffer is registered, it stays programmed into the HFI until the kernel
notifies the driver that the buffer has been freed by the user. At that time,
the user-space library is notified and it can do the necessary work to remove
the buffer from its cache.
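On the driver side, that notification is a small queue of invalidated TIDs
which user space drains with the new HFI1_CMD_TID_INVAL_READ command; a hedged
sketch using the hfi1_filedata fields added by this patch (bounds handling
omitted):

/* sketch: on an MMU-notifier invalidation hit, record the TID for
 * user space to collect via HFI1_CMD_TID_INVAL_READ */
static void note_invalid_tid(struct hfi1_filedata *fd, u32 tid)
{
	spin_lock(&fd->invalid_lock);
	/* invalid_tids[] is sized to the context's expected_count */
	fd->invalid_tids[fd->invalid_tid_idx++] = tid;
	spin_unlock(&fd->invalid_lock);
}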
Buffers which have been invalidated by the kernel are not automatically removed
from the HFI and do not have their pages unpinned. Buffers are only completely
removed when the user-space libraries call into the driver to free them. This
is done to ensure that any ongoing transfers into that buffer are complete.
This matters when a buffer is shrunk rather than completely freed: the
user-space library could still have uncompleted transfers into the remaining
part of the buffer.
With this feature, it is important that systems are set up with reasonable
limits on the amount of lockable memory. Keeping the limit at "unlimited" (as
we've done up to this point) may result in jobs being killed by the kernel's
OOM killer because they take up excessive amounts of memory.
This commit also includes some code clean-up and rearrangement to make the
driver code base easier to maintain and develop.
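The kernel hook underneath all of this is the MMU notifier API; a minimal
registration sketch against the API of this era (driver wiring omitted, the
callback body is a placeholder):

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

/* called when user mappings in [start, end) are being invalidated; the
 * driver would mark overlapping cached buffers invalid here */
static void sketch_inval_range_start(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	/* look up cached TID buffers in [start, end) and flag them */
}

static struct mmu_notifier_ops sketch_ops = {
	.invalidate_range_start = sketch_inval_range_start,
};

static int sketch_register(struct mmu_notifier *mn)
{
	mn->ops = &sketch_ops;
	return mmu_notifier_register(mn, current->mm);	/* 0 on success */
}

Note the graceful fallback in the diff: if registration fails, the driver sets
the TID_UNMAP capability and falls back to the uncached (unmap-on-free)
behavior for all user contexts.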
Reviewed-by: Arthur Kepner <arthur.kepner-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/Makefile | 2 +-
drivers/staging/rdma/hfi1/common.h | 15 +-
drivers/staging/rdma/hfi1/file_ops.c | 494 ++-----------
drivers/staging/rdma/hfi1/hfi.h | 65 +-
drivers/staging/rdma/hfi1/init.c | 5 +-
drivers/staging/rdma/hfi1/trace.h | 132 ++--
drivers/staging/rdma/hfi1/user_exp_rcv.c | 1171 ++++++++++++++++++++++++++++++
drivers/staging/rdma/hfi1/user_exp_rcv.h | 82 +++
drivers/staging/rdma/hfi1/user_pages.c | 110 +--
drivers/staging/rdma/hfi1/user_sdma.c | 13 +
drivers/staging/rdma/hfi1/user_sdma.h | 10 +-
include/uapi/rdma/hfi/hfi1_user.h | 42 +-
12 files changed, 1491 insertions(+), 650 deletions(-)
create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c
create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.h
diff --git a/drivers/staging/rdma/hfi1/Makefile b/drivers/staging/rdma/hfi1/Makefile
index 2e5daa6cdcc2..00c47b788666 100644
--- a/drivers/staging/rdma/hfi1/Makefile
+++ b/drivers/staging/rdma/hfi1/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
hfi1-y := chip.o cq.o device.o diag.o dma.o driver.o eprom.o file_ops.o firmware.o \
init.o intr.o keys.o mad.o mmap.o mr.o pcie.o pio.o pio_copy.o \
qp.o qsfp.o rc.o ruc.o sdma.o srq.o sysfs.o trace.o twsi.o \
- uc.o ud.o user_pages.o user_sdma.o verbs_mcast.o verbs.o
+ uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs_mcast.o verbs.o
hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
CFLAGS_trace.o = -I$(src)
diff --git a/drivers/staging/rdma/hfi1/common.h b/drivers/staging/rdma/hfi1/common.h
index de62cbe2224c..7809093eb55e 100644
--- a/drivers/staging/rdma/hfi1/common.h
+++ b/drivers/staging/rdma/hfi1/common.h
@@ -132,13 +132,14 @@
* HFI1_CAP_RESERVED_MASK bits.
*/
#define HFI1_CAP_WRITABLE_MASK (HFI1_CAP_SDMA_AHG | \
- HFI1_CAP_HDRSUPP | \
- HFI1_CAP_MULTI_PKT_EGR | \
- HFI1_CAP_NODROP_RHQ_FULL | \
- HFI1_CAP_NODROP_EGR_FULL | \
- HFI1_CAP_ALLOW_PERM_JKEY | \
- HFI1_CAP_STATIC_RATE_CTRL | \
- HFI1_CAP_PRINT_UNIMPL)
+ HFI1_CAP_HDRSUPP | \
+ HFI1_CAP_MULTI_PKT_EGR | \
+ HFI1_CAP_NODROP_RHQ_FULL | \
+ HFI1_CAP_NODROP_EGR_FULL | \
+ HFI1_CAP_ALLOW_PERM_JKEY | \
+ HFI1_CAP_STATIC_RATE_CTRL | \
+ HFI1_CAP_PRINT_UNIMPL | \
+ HFI1_CAP_TID_UNMAP)
/*
* A set of capability bits that are "global" and are not allowed to be
* set in the user bitmask.
diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c
index 3c9cae6f64a3..61031a0be0dc 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -47,20 +47,10 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*/
-#include <linux/pci.h>
#include <linux/poll.h>
#include <linux/cdev.h>
-#include <linux/swap.h>
#include <linux/vmalloc.h>
-#include <linux/highmem.h>
#include <linux/io.h>
-#include <linux/jiffies.h>
-#include <asm/pgtable.h>
-#include <linux/delay.h>
-#include <linux/export.h>
-#include <linux/module.h>
-#include <linux/cred.h>
-#include <linux/uio.h>
#include "hfi.h"
#include "pio.h"
@@ -68,6 +58,7 @@
#include "common.h"
#include "trace.h"
#include "user_sdma.h"
+#include "user_exp_rcv.h"
#include "eprom.h"
#undef pr_fmt
@@ -105,9 +96,6 @@ static int user_event_ack(struct hfi1_ctxtdata *, int, unsigned long);
static int set_ctxt_pkey(struct hfi1_ctxtdata *, unsigned, u16);
static int manage_rcvq(struct hfi1_ctxtdata *, unsigned, int);
static int vma_fault(struct vm_area_struct *, struct vm_fault *);
-static int exp_tid_setup(struct file *, struct hfi1_tid_info *);
-static int exp_tid_free(struct file *, struct hfi1_tid_info *);
-static void unlock_exp_tids(struct hfi1_ctxtdata *);
static const struct file_operations hfi1_file_ops = {
.owner = THIS_MODULE,
@@ -170,18 +158,6 @@ enum mmap_types {
HFI1_MMAP_TOKEN_SET(SUBCTXT, subctxt) | \
HFI1_MMAP_TOKEN_SET(OFFSET, (offset_in_page(addr))))
-#define EXP_TID_SET(field, value) \
- (((value) & EXP_TID_TID##field##_MASK) << \
- EXP_TID_TID##field##_SHIFT)
-#define EXP_TID_CLEAR(tid, field) { \
- (tid) &= ~(EXP_TID_TID##field##_MASK << \
- EXP_TID_TID##field##_SHIFT); \
- }
-#define EXP_TID_RESET(tid, field, value) do { \
- EXP_TID_CLEAR(tid, field); \
- (tid) |= EXP_TID_SET(field, value); \
- } while (0)
-
#define dbg(fmt, ...) \
pr_info(fmt, ##__VA_ARGS__)
@@ -195,8 +171,12 @@ static int hfi1_file_open(struct inode *inode, struct file *fp)
{
/* The real work is performed later in assign_ctxt() */
fp->private_data = kzalloc(sizeof(struct hfi1_filedata), GFP_KERNEL);
- if (fp->private_data) /* no cpu affinity by default */
- ((struct hfi1_filedata *)fp->private_data)->rec_cpu_num = -1;
+ if (fp->private_data) {
+ struct hfi1_filedata *fd = fp->private_data;
+
+ /* no cpu affinity by default */
+ fd->rec_cpu_num = -1;
+ }
return fp->private_data ? 0 : -ENOMEM;
}
@@ -208,6 +188,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
struct hfi1_cmd cmd;
struct hfi1_user_info uinfo;
struct hfi1_tid_info tinfo;
+ unsigned long addr;
ssize_t consumed = 0, copy = 0, ret = 0;
void *dest = NULL;
__u64 user_val = 0;
@@ -239,6 +220,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
break;
case HFI1_CMD_TID_UPDATE:
case HFI1_CMD_TID_FREE:
+ case HFI1_CMD_TID_INVAL_READ:
copy = sizeof(tinfo);
dest = &tinfo;
break;
@@ -317,9 +299,8 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
sc_return_credits(uctxt->sc);
break;
case HFI1_CMD_TID_UPDATE:
- ret = exp_tid_setup(fp, &tinfo);
+ ret = hfi1_user_exp_rcv_setup(fp, &tinfo);
if (!ret) {
- unsigned long addr;
/*
* Copy the number of tidlist entries we used
* and the length of the buffer we registered.
@@ -334,8 +315,23 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
ret = -EFAULT;
}
break;
+ case HFI1_CMD_TID_INVAL_READ:
+ ret = hfi1_user_exp_rcv_invalid(fp, &tinfo);
+ if (!ret) {
+ addr = (unsigned long)cmd.addr +
+ offsetof(struct hfi1_tid_info, tidcnt);
+ if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+ sizeof(tinfo.tidcnt)))
+ ret = -EFAULT;
+ }
+ break;
case HFI1_CMD_TID_FREE:
- ret = exp_tid_free(fp, &tinfo);
+ ret = hfi1_user_exp_rcv_clear(fp, &tinfo);
+ addr = (unsigned long)cmd.addr +
+ offsetof(struct hfi1_tid_info, tidcnt);
+ if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+ sizeof(tinfo.tidcnt)))
+ ret = -EFAULT;
break;
case HFI1_CMD_RECV_CTRL:
ret = manage_rcvq(uctxt, subctxt_fp(fp), (int)user_val);
@@ -607,9 +603,9 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
* Use the page where this context's flags are. User level
* knows where it's own bitmap is within the page.
*/
- memaddr = ((unsigned long)dd->events +
- ((uctxt->ctxt - dd->first_user_ctxt) *
- HFI1_MAX_SHARED_CTXTS)) & PAGE_MASK;
+ memaddr = (unsigned long)(dd->events +
+ ((uctxt->ctxt - dd->first_user_ctxt) *
+ HFI1_MAX_SHARED_CTXTS)) & PAGE_MASK;
memlen = PAGE_SIZE;
/*
* v3.7 removes VM_RESERVED but the effect is kept by
@@ -760,6 +756,7 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
mutex_lock(&hfi1_mutex);
flush_wc();
+
/* drain user sdma queue */
if (fdata->pq)
hfi1_user_sdma_free_queues(fdata);
@@ -809,12 +806,9 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
uctxt->pionowait = 0;
uctxt->event_flags = 0;
- hfi1_clear_tids(uctxt);
+ hfi1_user_exp_rcv_free(fdata);
hfi1_clear_ctxt_pkey(dd, uctxt->ctxt);
- if (uctxt->tid_pg_list)
- unlock_exp_tids(uctxt);
-
hfi1_stats.sps_ctxts--;
dd->freectxts++;
mutex_unlock(&hfi1_mutex);
@@ -1016,6 +1010,7 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
ret = sc_enable(uctxt->sc);
if (ret)
return ret;
+
/*
* Setup shared context resources if the user-level has requested
* shared contexts and this is the 'master' process.
@@ -1050,22 +1045,19 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
static int init_subctxts(struct hfi1_ctxtdata *uctxt,
const struct hfi1_user_info *uinfo)
{
- int ret = 0;
unsigned num_subctxts;
num_subctxts = uinfo->subctxt_cnt;
- if (num_subctxts > HFI1_MAX_SHARED_CTXTS) {
- ret = -EINVAL;
- goto bail;
- }
+ if (num_subctxts > HFI1_MAX_SHARED_CTXTS)
+ return -EINVAL;
uctxt->subctxt_cnt = uinfo->subctxt_cnt;
uctxt->subctxt_id = uinfo->subctxt_id;
uctxt->active_slaves = 1;
uctxt->redirect_seq_cnt = 1;
set_bit(HFI1_CTXT_MASTER_UNINIT, &uctxt->event_flags);
-bail:
- return ret;
+
+ return 0;
}
static int setup_subctxt(struct hfi1_ctxtdata *uctxt)
@@ -1122,7 +1114,7 @@ static int user_init(struct file *fp)
ret = wait_event_interruptible(uctxt->wait,
!test_bit(HFI1_CTXT_MASTER_UNINIT,
&uctxt->event_flags));
- goto done;
+ goto expected;
}
/* initialize poll variables... */
@@ -1169,8 +1161,18 @@ static int user_init(struct file *fp)
clear_bit(HFI1_CTXT_MASTER_UNINIT, &uctxt->event_flags);
wake_up(&uctxt->wait);
}
- ret = 0;
+expected:
+ /*
+ * Expected receive has to be set up for all processes (including
+ * shared contexts). However, it has to be done after the master
+ * context has been fully configured as it depends on the
+ * eager/expected split of the RcvArray entries.
+ * Setting it up here ensures that the subcontexts will be waiting
+ * (due to the above wait_event_interruptible()) until the master
+ * is set up.
+ */
+ ret = hfi1_user_exp_rcv_init(fp);
done:
return ret;
}
@@ -1223,6 +1225,7 @@ static int setup_ctxt(struct file *fp)
* is not requested or by the master process.
*/
if (!uctxt->subctxt_cnt || !subctxt_fp(fp)) {
+
ret = hfi1_init_ctxt(uctxt->sc);
if (ret)
goto done;
@@ -1239,46 +1242,6 @@ static int setup_ctxt(struct file *fp)
if (ret)
goto done;
}
- /* Setup Expected Rcv memories */
- uctxt->tid_pg_list = vzalloc(uctxt->expected_count *
- sizeof(struct page **));
- if (!uctxt->tid_pg_list) {
- ret = -ENOMEM;
- goto done;
- }
- uctxt->physshadow = vzalloc(uctxt->expected_count *
- sizeof(*uctxt->physshadow));
- if (!uctxt->physshadow) {
- ret = -ENOMEM;
- goto done;
- }
- /* allocate expected TID map and initialize the cursor */
- atomic_set(&uctxt->tidcursor, 0);
- uctxt->numtidgroups = uctxt->expected_count /
- dd->rcv_entries.group_size;
- uctxt->tidmapcnt = uctxt->numtidgroups / BITS_PER_LONG +
- !!(uctxt->numtidgroups % BITS_PER_LONG);
- uctxt->tidusemap = kzalloc_node(uctxt->tidmapcnt *
- sizeof(*uctxt->tidusemap),
- GFP_KERNEL, uctxt->numa_id);
- if (!uctxt->tidusemap) {
- ret = -ENOMEM;
- goto done;
- }
- /*
- * In case that the number of groups is not a multiple of
- * 64 (the number of groups in a tidusemap element), mark
- * the extra ones as used. This will effectively make them
- * permanently used and should never be assigned. Otherwise,
- * the code which checks how many free groups we have will
- * get completely confused about the state of the bits.
- */
- if (uctxt->numtidgroups % BITS_PER_LONG)
- uctxt->tidusemap[uctxt->tidmapcnt - 1] =
- ~((1ULL << (uctxt->numtidgroups %
- BITS_PER_LONG)) - 1);
- trace_hfi1_exp_tid_map(uctxt->ctxt, subctxt_fp(fp), 0,
- uctxt->tidusemap, uctxt->tidmapcnt);
}
ret = hfi1_user_sdma_alloc_queues(uctxt, fp);
if (ret)
@@ -1514,365 +1477,6 @@ static int user_event_ack(struct hfi1_ctxtdata *uctxt, int subctxt,
return 0;
}
-#define num_user_pages(vaddr, len) \
- (1 + (((((unsigned long)(vaddr) + \
- (unsigned long)(len) - 1) & PAGE_MASK) - \
- ((unsigned long)vaddr & PAGE_MASK)) >> PAGE_SHIFT))
-
-/**
- * tzcnt - count the number of trailing zeros in a 64bit value
- * @value: the value to be examined
- *
- * Returns the number of trailing least significant zeros in the
- * the input value. If the value is zero, return the number of
- * bits of the value.
- */
-static inline u8 tzcnt(u64 value)
-{
- return value ? __builtin_ctzl(value) : sizeof(value) * 8;
-}
-
-static inline unsigned num_free_groups(unsigned long map, u16 *start)
-{
- unsigned free;
- u16 bitidx = *start;
-
- if (bitidx >= BITS_PER_LONG)
- return 0;
- /* "Turn off" any bits set before our bit index */
- map &= ~((1ULL << bitidx) - 1);
- free = tzcnt(map) - bitidx;
- while (!free && bitidx < BITS_PER_LONG) {
- /* Zero out the last set bit so we look at the rest */
- map &= ~(1ULL << bitidx);
- /*
- * Account for the previously checked bits and advance
- * the bit index. We don't have to check for bitidx
- * getting bigger than BITS_PER_LONG here as it would
- * mean extra instructions that we don't need. If it
- * did happen, it would push free to a negative value
- * which will break the loop.
- */
- free = tzcnt(map) - ++bitidx;
- }
- *start = bitidx;
- return free;
-}
-
-static int exp_tid_setup(struct file *fp, struct hfi1_tid_info *tinfo)
-{
- int ret = 0;
- struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
- struct hfi1_devdata *dd = uctxt->dd;
- unsigned tid, mapped = 0, npages, ngroups, exp_groups,
- tidpairs = uctxt->expected_count / 2;
- struct page **pages;
- unsigned long vaddr, tidmap[uctxt->tidmapcnt];
- dma_addr_t *phys;
- u32 tidlist[tidpairs], pairidx = 0, tidcursor;
- u16 useidx, idx, bitidx, tidcnt = 0;
-
- vaddr = tinfo->vaddr;
-
- if (offset_in_page(vaddr)) {
- ret = -EINVAL;
- goto bail;
- }
-
- npages = num_user_pages(vaddr, tinfo->length);
- if (!npages) {
- ret = -EINVAL;
- goto bail;
- }
- if (!access_ok(VERIFY_WRITE, (void __user *)vaddr,
- npages * PAGE_SIZE)) {
- dd_dev_err(dd, "Fail vaddr %p, %u pages, !access_ok\n",
- (void *)vaddr, npages);
- ret = -EFAULT;
- goto bail;
- }
-
- memset(tidmap, 0, sizeof(tidmap[0]) * uctxt->tidmapcnt);
- memset(tidlist, 0, sizeof(tidlist[0]) * tidpairs);
-
- exp_groups = uctxt->expected_count / dd->rcv_entries.group_size;
- /* which group set do we look at first? */
- tidcursor = atomic_read(&uctxt->tidcursor);
- useidx = (tidcursor >> 16) & 0xffff;
- bitidx = tidcursor & 0xffff;
-
- /*
- * Keep going until we've mapped all pages or we've exhausted all
- * RcvArray entries.
- * This iterates over the number of tidmaps + 1
- * (idx <= uctxt->tidmapcnt) so we check the bitmap which we
- * started from one more time for any free bits before the
- * starting point bit.
- */
- for (mapped = 0, idx = 0;
- mapped < npages && idx <= uctxt->tidmapcnt;) {
- u64 i, offset = 0;
- unsigned free, pinned, pmapped = 0, bits_used;
- u16 grp;
-
- /*
- * "Reserve" the needed group bits under lock so other
- * processes can't step in the middle of it. Once
- * reserved, we don't need the lock anymore since we
- * are guaranteed the groups.
- */
- spin_lock(&uctxt->exp_lock);
- if (uctxt->tidusemap[useidx] == -1ULL ||
- bitidx >= BITS_PER_LONG) {
- /* no free groups in the set, use the next */
- useidx = (useidx + 1) % uctxt->tidmapcnt;
- idx++;
- bitidx = 0;
- spin_unlock(&uctxt->exp_lock);
- continue;
- }
- ngroups = ((npages - mapped) / dd->rcv_entries.group_size) +
- !!((npages - mapped) % dd->rcv_entries.group_size);
-
- /*
- * If we've gotten here, the current set of groups does have
- * one or more free groups.
- */
- free = num_free_groups(uctxt->tidusemap[useidx], &bitidx);
- if (!free) {
- /*
- * Despite the check above, free could still come back
- * as 0 because we don't check the entire bitmap but
- * we start from bitidx.
- */
- spin_unlock(&uctxt->exp_lock);
- continue;
- }
- bits_used = min(free, ngroups);
- tidmap[useidx] |= ((1ULL << bits_used) - 1) << bitidx;
- uctxt->tidusemap[useidx] |= tidmap[useidx];
- spin_unlock(&uctxt->exp_lock);
-
- /*
- * At this point, we know where in the map we have free bits.
- * properly offset into the various "shadow" arrays and compute
- * the RcvArray entry index.
- */
- offset = ((useidx * BITS_PER_LONG) + bitidx) *
- dd->rcv_entries.group_size;
- pages = uctxt->tid_pg_list + offset;
- phys = uctxt->physshadow + offset;
- tid = uctxt->expected_base + offset;
-
- /* Calculate how many pages we can pin based on free bits */
- pinned = min((bits_used * dd->rcv_entries.group_size),
- (npages - mapped));
- /*
- * Now that we know how many free RcvArray entries we have,
- * we can pin that many user pages.
- */
- ret = hfi1_get_user_pages(vaddr + (mapped * PAGE_SIZE),
- pinned, pages);
- if (ret) {
- /*
- * We can't continue because the pages array won't be
- * initialized. This should never happen,
- * unless perhaps the user has mpin'ed the pages
- * themselves.
- */
- dd_dev_info(dd,
- "Failed to lock addr %p, %u pages: errno %d\n",
- (void *) vaddr, pinned, -ret);
- /*
- * Let go of the bits that we reserved since we are not
- * going to use them.
- */
- spin_lock(&uctxt->exp_lock);
- uctxt->tidusemap[useidx] &=
- ~(((1ULL << bits_used) - 1) << bitidx);
- spin_unlock(&uctxt->exp_lock);
- goto done;
- }
- /*
- * How many groups do we need based on how many pages we have
- * pinned?
- */
- ngroups = (pinned / dd->rcv_entries.group_size) +
- !!(pinned % dd->rcv_entries.group_size);
- /*
- * Keep programming RcvArray entries for all the <ngroups> free
- * groups.
- */
- for (i = 0, grp = 0; grp < ngroups; i++, grp++) {
- unsigned j;
- u32 pair_size = 0, tidsize;
- /*
- * This inner loop will program an entire group or the
- * array of pinned pages (which ever limit is hit
- * first).
- */
- for (j = 0; j < dd->rcv_entries.group_size &&
- pmapped < pinned; j++, pmapped++, tid++) {
- tidsize = PAGE_SIZE;
- phys[pmapped] = hfi1_map_page(dd->pcidev,
- pages[pmapped], 0,
- tidsize, PCI_DMA_FROMDEVICE);
- trace_hfi1_exp_rcv_set(uctxt->ctxt,
- subctxt_fp(fp),
- tid, vaddr,
- phys[pmapped],
- pages[pmapped]);
- /*
- * Each RcvArray entry is programmed with one
- * page * worth of memory. This will handle
- * the 8K MTU as well as anything smaller
- * due to the fact that both entries in the
- * RcvTidPair are programmed with a page.
- * PSM currently does not handle anything
- * bigger than 8K MTU, so should we even worry
- * about 10K here?
- */
- hfi1_put_tid(dd, tid, PT_EXPECTED,
- phys[pmapped],
- ilog2(tidsize >> PAGE_SHIFT) + 1);
- pair_size += tidsize >> PAGE_SHIFT;
- EXP_TID_RESET(tidlist[pairidx], LEN, pair_size);
- if (!(tid % 2)) {
- tidlist[pairidx] |=
- EXP_TID_SET(IDX,
- (tid - uctxt->expected_base)
- / 2);
- tidlist[pairidx] |=
- EXP_TID_SET(CTRL, 1);
- tidcnt++;
- } else {
- tidlist[pairidx] |=
- EXP_TID_SET(CTRL, 2);
- pair_size = 0;
- pairidx++;
- }
- }
- /*
- * We've programmed the entire group (or as much of the
- * group as we'll use. Now, it's time to push it out...
- */
- flush_wc();
- }
- mapped += pinned;
- atomic_set(&uctxt->tidcursor,
- (((useidx & 0xffffff) << 16) |
- ((bitidx + bits_used) & 0xffffff)));
- }
- trace_hfi1_exp_tid_map(uctxt->ctxt, subctxt_fp(fp), 0, uctxt->tidusemap,
- uctxt->tidmapcnt);
-
-done:
- /* If we've mapped anything, copy relevant info to user */
- if (mapped) {
- if (copy_to_user((void __user *)(unsigned long)tinfo->tidlist,
- tidlist, sizeof(tidlist[0]) * tidcnt)) {
- ret = -EFAULT;
- goto done;
- }
- /* copy TID info to user */
- if (copy_to_user((void __user *)(unsigned long)tinfo->tidmap,
- tidmap, sizeof(tidmap[0]) * uctxt->tidmapcnt))
- ret = -EFAULT;
- }
-bail:
- /*
- * Calculate mapped length. New Exp TID protocol does not "unwind" and
- * report an error if it can't map the entire buffer. It just reports
- * the length that was mapped.
- */
- tinfo->length = mapped * PAGE_SIZE;
- tinfo->tidcnt = tidcnt;
- return ret;
-}
-
-static int exp_tid_free(struct file *fp, struct hfi1_tid_info *tinfo)
-{
- struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
- struct hfi1_devdata *dd = uctxt->dd;
- unsigned long tidmap[uctxt->tidmapcnt];
- struct page **pages;
- dma_addr_t *phys;
- u16 idx, bitidx, tid;
- int ret = 0;
-
- if (copy_from_user(&tidmap, (void __user *)(unsigned long)
- tinfo->tidmap,
- sizeof(tidmap[0]) * uctxt->tidmapcnt)) {
- ret = -EFAULT;
- goto done;
- }
- for (idx = 0; idx < uctxt->tidmapcnt; idx++) {
- unsigned long map;
-
- bitidx = 0;
- if (!tidmap[idx])
- continue;
- map = tidmap[idx];
- while ((bitidx = tzcnt(map)) < BITS_PER_LONG) {
- int i, pcount = 0;
- struct page *pshadow[dd->rcv_entries.group_size];
- unsigned offset = ((idx * BITS_PER_LONG) + bitidx) *
- dd->rcv_entries.group_size;
-
- pages = uctxt->tid_pg_list + offset;
- phys = uctxt->physshadow + offset;
- tid = uctxt->expected_base + offset;
- for (i = 0; i < dd->rcv_entries.group_size;
- i++, tid++) {
- if (pages[i]) {
- hfi1_put_tid(dd, tid, PT_INVALID,
- 0, 0);
- trace_hfi1_exp_rcv_free(uctxt->ctxt,
- subctxt_fp(fp),
- tid, phys[i],
- pages[i]);
- pci_unmap_page(dd->pcidev, phys[i],
- PAGE_SIZE, PCI_DMA_FROMDEVICE);
- pshadow[pcount] = pages[i];
- pages[i] = NULL;
- pcount++;
- phys[i] = 0;
- }
- }
- flush_wc();
- hfi1_release_user_pages(pshadow, pcount);
- clear_bit(bitidx, &uctxt->tidusemap[idx]);
- map &= ~(1ULL<<bitidx);
- }
- }
- trace_hfi1_exp_tid_map(uctxt->ctxt, subctxt_fp(fp), 1, uctxt->tidusemap,
- uctxt->tidmapcnt);
-done:
- return ret;
-}
-
-static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt)
-{
- struct hfi1_devdata *dd = uctxt->dd;
- unsigned tid;
-
- dd_dev_info(dd, "ctxt %u unlocking any locked expTID pages\n",
- uctxt->ctxt);
- for (tid = 0; tid < uctxt->expected_count; tid++) {
- struct page *p = uctxt->tid_pg_list[tid];
- dma_addr_t phys;
-
- if (!p)
- continue;
-
- phys = uctxt->physshadow[tid];
- uctxt->physshadow[tid] = 0;
- uctxt->tid_pg_list[tid] = NULL;
- pci_unmap_page(dd->pcidev, phys, PAGE_SIZE, PCI_DMA_FROMDEVICE);
- hfi1_release_user_pages(&p, 1);
- }
-}
-
static int set_ctxt_pkey(struct hfi1_ctxtdata *uctxt, unsigned subctxt,
u16 pkey)
{
diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index 60f4b3d88671..848042e17320 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -65,6 +65,8 @@
#include <linux/cdev.h>
#include <linux/delay.h>
#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rbtree.h>
#include "chip_registers.h"
#include "common.h"
@@ -157,6 +159,11 @@ struct ctxt_eager_bufs {
} *rcvtids;
};
+struct exp_tid_set {
+ struct list_head list;
+ u32 count;
+};
+
struct hfi1_ctxtdata {
/* shadow the ctxt's RcvCtrl register */
u64 rcvctrl;
@@ -213,20 +220,13 @@ struct hfi1_ctxtdata {
u32 expected_count;
/* index of first expected TID entry. */
u32 expected_base;
- /* cursor into the exp group sets */
- atomic_t tidcursor;
- /* number of exp TID groups assigned to the ctxt */
- u16 numtidgroups;
- /* size of exp TID group fields in tidusemap */
- u16 tidmapcnt;
- /* exp TID group usage bitfield array */
- unsigned long *tidusemap;
- /* pinned pages for exp sends, allocated at open */
- struct page **tid_pg_list;
- /* dma handles for exp tid pages */
- dma_addr_t *physshadow;
+
+ struct exp_tid_set tid_group_list;
+ struct exp_tid_set tid_used_list;
+ struct exp_tid_set tid_full_list;
+
/* lock protecting all Expected TID data */
- spinlock_t exp_lock;
+ struct mutex exp_lock;
/* number of pio bufs for this ctxt (all procs, if shared) */
u32 piocnt;
/* first pio buffer for this ctxt */
@@ -1081,6 +1081,8 @@ struct hfi1_devdata {
#define PT_EAGER 1
#define PT_INVALID 2
+struct mmu_rb_node;
+
/* Private data for file operations */
struct hfi1_filedata {
struct hfi1_ctxtdata *uctxt;
@@ -1089,8 +1091,27 @@ struct hfi1_filedata {
struct hfi1_user_sdma_pkt_q *pq;
/* for cpu affinity; -1 if none */
int rec_cpu_num;
+ struct mmu_notifier mn;
+ struct rb_root tid_rb_root;
+ u32 tid_limit;
+ u32 tid_used;
+ spinlock_t rb_lock;
+ u32 *invalid_tids;
+ u32 invalid_tid_idx;
+ spinlock_t invalid_lock;
+ int (*mmu_rb_insert)(struct rb_root *, struct mmu_rb_node *);
};
+/* for use in system calls, where we want to know device type, etc. */
+#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data)
+#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt)
+#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
+#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor)
+#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq)
+#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq)
+#define notifier_fp(fp) (fp_to_fd((fp))->mn)
+#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root)
+
extern struct list_head hfi1_dev_list;
extern spinlock_t hfi1_devs_lock;
struct hfi1_devdata *hfi1_lookup(int unit);
@@ -1398,18 +1419,6 @@ int snoop_send_pio_handler(struct hfi1_qp *qp, struct ahg_ib_header *ibhdr,
void snoop_inline_pio_send(struct hfi1_devdata *dd, struct pio_buf *pbuf,
u64 pbc, const void *from, size_t count);
-/* for use in system calls, where we want to know device type, etc. */
-#define ctxt_fp(fp) \
- (((struct hfi1_filedata *)(fp)->private_data)->uctxt)
-#define subctxt_fp(fp) \
- (((struct hfi1_filedata *)(fp)->private_data)->subctxt)
-#define tidcursor_fp(fp) \
- (((struct hfi1_filedata *)(fp)->private_data)->tidcursor)
-#define user_sdma_pkt_fp(fp) \
- (((struct hfi1_filedata *)(fp)->private_data)->pq)
-#define user_sdma_comp_fp(fp) \
- (((struct hfi1_filedata *)(fp)->private_data)->cq)
-
static inline struct hfi1_devdata *dd_from_ppd(struct hfi1_pportdata *ppd)
{
return ppd->dd;
@@ -1545,8 +1554,8 @@ void hfi1_set_led_override(struct hfi1_pportdata *ppd, unsigned int val);
*/
#define DEFAULT_RCVHDR_ENTSIZE 32
-int hfi1_get_user_pages(unsigned long, size_t, struct page **);
-void hfi1_release_user_pages(struct page **, size_t);
+int hfi1_acquire_user_pages(unsigned long, size_t, bool, struct page **);
+void hfi1_release_user_pages(struct page **, size_t, bool);
static inline void clear_rcvhdrtail(const struct hfi1_ctxtdata *rcd)
{
@@ -1595,8 +1604,6 @@ int get_platform_config_field(struct hfi1_devdata *dd,
enum platform_config_table_type_encoding table_type,
int table_index, int field_index, u32 *data, u32 len);
-dma_addr_t hfi1_map_page(struct pci_dev *, struct page *, unsigned long,
- size_t, int);
const char *get_unit_name(int unit);
/*
diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c
index cd1508ec0914..62aa7718b6d6 100644
--- a/drivers/staging/rdma/hfi1/init.c
+++ b/drivers/staging/rdma/hfi1/init.c
@@ -219,7 +219,7 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt)
rcd->numa_id = numa_node_id();
rcd->rcv_array_groups = dd->rcv_entries.ngroups;
- spin_lock_init(&rcd->exp_lock);
+ mutex_init(&rcd->exp_lock);
/*
* Calculate the context's RcvArray entry starting point.
@@ -941,13 +941,10 @@ void hfi1_free_ctxtdata(struct hfi1_devdata *dd, struct hfi1_ctxtdata *rcd)
kfree(rcd->egrbufs.buffers);
sc_free(rcd->sc);
- vfree(rcd->physshadow);
- vfree(rcd->tid_pg_list);
vfree(rcd->user_event_mask);
vfree(rcd->subctxt_uregbase);
vfree(rcd->subctxt_rcvegrbuf);
vfree(rcd->subctxt_rcvhdr_base);
- kfree(rcd->tidusemap);
kfree(rcd->opstats);
kfree(rcd);
}
diff --git a/drivers/staging/rdma/hfi1/trace.h b/drivers/staging/rdma/hfi1/trace.h
index d7851c0a0171..0354dca9a6f4 100644
--- a/drivers/staging/rdma/hfi1/trace.h
+++ b/drivers/staging/rdma/hfi1/trace.h
@@ -153,92 +153,130 @@ TRACE_EVENT(hfi1_receive_interrupt,
)
);
-const char *print_u64_array(struct trace_seq *, u64 *, int);
+TRACE_EVENT(hfi1_exp_tid_reg,
+ TP_PROTO(unsigned ctxt, u16 subctxt, u32 rarr,
+ u32 npages, unsigned long va, unsigned long pa,
+ dma_addr_t dma),
+ TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
+ TP_STRUCT__entry(
+ __field(unsigned, ctxt)
+ __field(u16, subctxt)
+ __field(u32, rarr)
+ __field(u32, npages)
+ __field(unsigned long, va)
+ __field(unsigned long, pa)
+ __field(dma_addr_t, dma)
+ ),
+ TP_fast_assign(
+ __entry->ctxt = ctxt;
+ __entry->subctxt = subctxt;
+ __entry->rarr = rarr;
+ __entry->npages = npages;
+ __entry->va = va;
+ __entry->pa = pa;
+ __entry->dma = dma;
+ ),
+ TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
+ __entry->ctxt,
+ __entry->subctxt,
+ __entry->rarr,
+ __entry->npages,
+ __entry->pa,
+ __entry->va,
+ __entry->dma
+ )
+ );
-TRACE_EVENT(hfi1_exp_tid_map,
- TP_PROTO(unsigned ctxt, u16 subctxt, int dir,
- unsigned long *maps, u16 count),
- TP_ARGS(ctxt, subctxt, dir, maps, count),
+TRACE_EVENT(hfi1_exp_tid_unreg,
+ TP_PROTO(unsigned ctxt, u16 subctxt, u32 rarr, u32 npages,
+ unsigned long va, unsigned long pa, dma_addr_t dma),
+ TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
TP_STRUCT__entry(
__field(unsigned, ctxt)
__field(u16, subctxt)
- __field(int, dir)
- __field(u16, count)
- __dynamic_array(unsigned long, maps, sizeof(*maps) * count)
+ __field(u32, rarr)
+ __field(u32, npages)
+ __field(unsigned long, va)
+ __field(unsigned long, pa)
+ __field(dma_addr_t, dma)
),
TP_fast_assign(
__entry->ctxt = ctxt;
__entry->subctxt = subctxt;
- __entry->dir = dir;
- __entry->count = count;
- memcpy(__get_dynamic_array(maps), maps,
- sizeof(*maps) * count);
+ __entry->rarr = rarr;
+ __entry->npages = npages;
+ __entry->va = va;
+ __entry->pa = pa;
+ __entry->dma = dma;
),
- TP_printk("[%3u:%02u] %s tidmaps %s",
+ TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
__entry->ctxt,
__entry->subctxt,
- (__entry->dir ? ">" : "<"),
- print_u64_array(p, __get_dynamic_array(maps),
- __entry->count)
+ __entry->rarr,
+ __entry->npages,
+ __entry->pa,
+ __entry->va,
+ __entry->dma
)
);
-TRACE_EVENT(hfi1_exp_rcv_set,
- TP_PROTO(unsigned ctxt, u16 subctxt, u32 tid,
- unsigned long vaddr, u64 phys_addr, void *page),
- TP_ARGS(ctxt, subctxt, tid, vaddr, phys_addr, page),
+TRACE_EVENT(hfi1_exp_tid_inval,
+ TP_PROTO(unsigned ctxt, u16 subctxt, unsigned long va, u32 rarr,
+ u32 npages, dma_addr_t dma),
+ TP_ARGS(ctxt, subctxt, va, rarr, npages, dma),
TP_STRUCT__entry(
__field(unsigned, ctxt)
__field(u16, subctxt)
- __field(u32, tid)
- __field(unsigned long, vaddr)
- __field(u64, phys_addr)
- __field(void *, page)
+ __field(unsigned long, va)
+ __field(u32, rarr)
+ __field(u32, npages)
+ __field(dma_addr_t, dma)
),
TP_fast_assign(
__entry->ctxt = ctxt;
__entry->subctxt = subctxt;
- __entry->tid = tid;
- __entry->vaddr = vaddr;
- __entry->phys_addr = phys_addr;
- __entry->page = page;
+ __entry->va = va;
+ __entry->rarr = rarr;
+ __entry->npages = npages;
+ __entry->dma = dma;
),
- TP_printk("[%u:%u] TID %u, vaddrs 0x%lx, physaddr 0x%llx, pgp %p",
+ TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx dma: 0x%llx",
__entry->ctxt,
__entry->subctxt,
- __entry->tid,
- __entry->vaddr,
- __entry->phys_addr,
- __entry->page
+ __entry->rarr,
+ __entry->npages,
+ __entry->va,
+ __entry->dma
)
);
-TRACE_EVENT(hfi1_exp_rcv_free,
- TP_PROTO(unsigned ctxt, u16 subctxt, u32 tid,
- unsigned long phys, void *page),
- TP_ARGS(ctxt, subctxt, tid, phys, page),
+TRACE_EVENT(hfi1_mmu_invalidate,
+ TP_PROTO(unsigned ctxt, u16 subctxt, const char *type,
+ unsigned long start, unsigned long end),
+ TP_ARGS(ctxt, subctxt, type, start, end),
TP_STRUCT__entry(
__field(unsigned, ctxt)
__field(u16, subctxt)
- __field(u32, tid)
- __field(unsigned long, phys)
- __field(void *, page)
+ __string(type, type)
+ __field(unsigned long, start)
+ __field(unsigned long, end)
),
TP_fast_assign(
__entry->ctxt = ctxt;
__entry->subctxt = subctxt;
- __entry->tid = tid;
- __entry->phys = phys;
- __entry->page = page;
+ __assign_str(type, type);
+ __entry->start = start;
+ __entry->end = end;
),
- TP_printk("[%u:%u] freeing TID %u, 0x%lx, pgp %p",
+ TP_printk("[%3u:%02u] MMU Invalidate (%s) 0x%lx - 0x%lx",
__entry->ctxt,
__entry->subctxt,
- __entry->tid,
- __entry->phys,
- __entry->page
+ __get_str(type),
+ __entry->start,
+ __entry->end
)
);
+
#undef TRACE_SYSTEM
#define TRACE_SYSTEM hfi1_tx
diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
new file mode 100644
index 000000000000..5274a9f4c3eb
--- /dev/null
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -0,0 +1,1171 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license. When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * - Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * - Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * - Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#include <asm/page.h>
+
+#include "user_exp_rcv.h"
+#include "trace.h"
+
+struct tid_group {
+ struct list_head list;
+ unsigned base;
+ u8 size;
+ u8 used;
+ u8 map;
+};
+
+struct mmu_rb_node {
+ struct rb_node rbnode;
+ unsigned long virt;
+ unsigned long phys;
+ unsigned long len;
+ struct tid_group *grp;
+ u32 rcventry;
+ dma_addr_t dma_addr;
+ bool freed;
+ unsigned npages;
+ struct page *pages[0];
+};
+
+enum mmu_call_types {
+ MMU_INVALIDATE_PAGE = 0,
+ MMU_INVALIDATE_RANGE = 1
+};
+
+static const char * const mmu_types[] = {
+ "PAGE",
+ "RANGE"
+};
+
+struct tid_pageset {
+ u16 idx;
+ u16 count;
+};
+
+
+#define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
+
+#define num_user_pages(vaddr, len) \
+ (1 + (((((unsigned long)(vaddr) + \
+ (unsigned long)(len) - 1) & PAGE_MASK) - \
+ ((unsigned long)vaddr & PAGE_MASK)) >> PAGE_SHIFT))
+
+static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
+ struct rb_root *);
+static u32 find_phys_blocks(struct page **, unsigned, struct tid_pageset *);
+static int set_rcvarray_entry(struct file *, unsigned long, u32,
+ struct tid_group *, struct page **, unsigned);
+
+static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
+ unsigned long);
+static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
+ unsigned long);
+static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *,
+ u32);
+static int mmu_rb_insert_by_addr(struct rb_root *, struct mmu_rb_node *);
+static int mmu_rb_insert_by_entry(struct rb_root *, struct mmu_rb_node *);
+static void mmu_notifier_mem_invalidate(struct mmu_notifier *,
+ unsigned long, unsigned long,
+ enum mmu_call_types);
+static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
+ unsigned long);
+static inline void mmu_notifier_range_start(struct mmu_notifier *,
+ struct mm_struct *,
+ unsigned long, unsigned long);
+static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
+ struct tid_pageset *, unsigned, u16, struct page **,
+ u32 *, unsigned *, unsigned *);
+static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
+static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
+
+static inline u32 rcventry2tidinfo(u32 rcventry)
+{
+ u32 pair = rcventry & ~0x1;
+
+ return EXP_TID_SET(IDX, pair >> 1) |
+ EXP_TID_SET(CTRL, 1 << (rcventry - pair));
+}
+
+static inline void exp_tid_group_init(struct exp_tid_set *set)
+{
+ INIT_LIST_HEAD(&set->list);
+ set->count = 0;
+}
+
+static inline void tid_group_remove(struct tid_group *grp,
+ struct exp_tid_set *set)
+{
+ list_del_init(&grp->list);
+ set->count--;
+}
+
+static inline void tid_group_add_tail(struct tid_group *grp,
+ struct exp_tid_set *set)
+{
+ list_add_tail(&grp->list, &set->list);
+ set->count++;
+}
+
+static inline struct tid_group *tid_group_pop(struct exp_tid_set *set)
+{
+ struct tid_group *grp =
+ list_first_entry(&set->list, struct tid_group, list);
+ list_del_init(&grp->list);
+ set->count--;
+ return grp;
+}
+
+static inline void tid_group_move(struct tid_group *group,
+ struct exp_tid_set *s1,
+ struct exp_tid_set *s2)
+{
+ tid_group_remove(group, s1);
+ tid_group_add_tail(group, s2);
+}
+
+static struct mmu_notifier_ops mn_opts = {
+ .invalidate_page = mmu_notifier_page,
+ .invalidate_range_start = mmu_notifier_range_start,
+};
+
+/*
+ * Initialize context and file private data needed for Expected
+ * receive caching. This needs to be done after the context has
+ * been configured with the eager/expected RcvEntry counts.
+ */
+int hfi1_user_exp_rcv_init(struct file *fp)
+{
+ struct hfi1_filedata *fd = fp_to_fd(fp);
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ struct hfi1_devdata *dd = uctxt->dd;
+ unsigned tidbase;
+ int i, ret = 0;
+
+ INIT_HLIST_NODE(&fd->mn.hlist);
+ spin_lock_init(&fd->rb_lock);
+ spin_lock_init(&fd->invalid_lock);
+ fd->mn.ops = &mn_opts;
+ fd->tid_rb_root = RB_ROOT;
+
+ if (!uctxt->subctxt_cnt || !subctxt_fp(fp)) {
+ exp_tid_group_init(&uctxt->tid_group_list);
+ exp_tid_group_init(&uctxt->tid_used_list);
+ exp_tid_group_init(&uctxt->tid_full_list);
+
+ tidbase = uctxt->expected_base;
+ for (i = 0; i < uctxt->expected_count /
+ dd->rcv_entries.group_size; i++) {
+ struct tid_group *grp;
+
+ grp = kzalloc(sizeof(*grp), GFP_KERNEL);
+ if (!grp) {
+ /*
+ * If we fail here, the groups already
+ * allocated will be freed by the close
+ * call.
+ */
+ ret = -ENOMEM;
+ goto done;
+ }
+ grp->size = dd->rcv_entries.group_size;
+ grp->base = tidbase;
+ tid_group_add_tail(grp, &uctxt->tid_group_list);
+ tidbase += dd->rcv_entries.group_size;
+ }
+ }
+
+ if (!HFI1_CAP_IS_USET(TID_UNMAP)) {
+ fd->invalid_tid_idx = 0;
+ fd->invalid_tids = kzalloc(uctxt->expected_count *
+ sizeof(u32), GFP_KERNEL);
+ if (!fd->invalid_tids) {
+ ret = -ENOMEM;
+ goto done;
+ } else {
+ /*
+ * Register MMU notifier callbacks. If the registration
+ * fails, continue but turn off the TID caching for
+ * all user contexts.
+ */
+ ret = mmu_notifier_register(&notifier_fp(fp),
+ current->mm);
+ if (ret) {
+ dd_dev_info(dd,
+ "Failed MMU notifier registration %d\n",
+ ret);
+ HFI1_CAP_USET(TID_UNMAP);
+ ret = 0;
+ }
+ }
+ }
+
+ if (HFI1_CAP_IS_USET(TID_UNMAP))
+ fd->mmu_rb_insert = mmu_rb_insert_by_entry;
+ else
+ fd->mmu_rb_insert = mmu_rb_insert_by_addr;
+
+ /*
+ * PSM does not have a good way to separate, count, and
+ * effectively enforce a limit on RcvArray entries used by
+ * subctxts (when context sharing is used) when TID caching
+ * is enabled. To help with that, we calculate a per-process
+ * RcvArray entry share and enforce that.
+ * If TID caching is not in use, PSM deals with usage on its
+ * own. In that case, we allow any subctxt to take all of the
+ * entries.
+ *
+ * Make sure that we set the tid counts only after successful
+ * init.
+ */
+ if (uctxt->subctxt_cnt && !HFI1_CAP_IS_USET(TID_UNMAP)) {
+ u16 remainder;
+
+ fd->tid_limit = uctxt->expected_count / uctxt->subctxt_cnt;
+ remainder = uctxt->expected_count % uctxt->subctxt_cnt;
+ if (remainder && subctxt_fp(fp) < remainder)
+ fd->tid_limit++;
+ } else
+ fd->tid_limit = uctxt->expected_count;
+done:
+ return ret;
+}
+
+int hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
+{
+ struct hfi1_ctxtdata *uctxt = fd->uctxt;
+ struct tid_group *grp, *gptr;
+
+ /*
+ * The notifier would have been removed when the process's mm
+ * was freed.
+ */
+ if (current->mm && !HFI1_CAP_IS_USET(TID_UNMAP))
+ mmu_notifier_unregister(&fd->mn, current->mm);
+
+ kfree(fd->invalid_tids);
+
+ if (!uctxt->cnt) {
+ if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
+ unlock_exp_tids(uctxt, &uctxt->tid_full_list,
+ &fd->tid_rb_root);
+ if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
+ unlock_exp_tids(uctxt, &uctxt->tid_used_list,
+ &fd->tid_rb_root);
+ list_for_each_entry_safe(grp, gptr, &uctxt->tid_group_list.list,
+ list) {
+ list_del_init(&grp->list);
+ kfree(grp);
+ }
+ spin_lock(&fd->rb_lock);
+ if (!RB_EMPTY_ROOT(&fd->tid_rb_root)) {
+ struct rb_node *node;
+ struct mmu_rb_node *rbnode;
+
+ while ((node = rb_first(&fd->tid_rb_root))) {
+ rbnode = rb_entry(node, struct mmu_rb_node,
+ rbnode);
+ rb_erase(&rbnode->rbnode, &fd->tid_rb_root);
+ kfree(rbnode);
+ }
+ }
+ spin_unlock(&fd->rb_lock);
+ hfi1_clear_tids(uctxt);
+ }
+ return 0;
+}
+
+/*
+ * Write an "empty" RcvArray entry.
+ * This function exists so the TID registration code can use it
+ * to write to unused/unneeded entries and still take advantage
+ * of the WC performance improvements. The HFI will ignore this
+ * write to the RcvArray entry.
+ */
+static inline void rcv_array_wc_fill(struct hfi1_devdata *dd, u32 index)
+{
+ /* Doing the WC fill writes only makes sense if the device is
+ * present and the RcvArray has been mapped as WC memory. */
+ if ((dd->flags & HFI1_PRESENT) && dd->rcvarray_wc)
+ writeq(0, dd->rcvarray_wc + (index * 8));
+}
+
+/*
+ * RcvArray entry allocation for Expected Receives is done by the
+ * following algorithm:
+ *
+ * The context keeps 3 lists of groups of RcvArray entries:
+ * 1. List of empty groups - tid_group_list
+ * This list is created during user context creation and
+ * contains elements which describe sets (of 8) of empty
+ * RcvArray entries.
+ * 2. List of partially used groups - tid_used_list
+ * This list contains sets of RcvArray entries which are
+ * not completely used up. Another mapping request could
+ * use some or all of the remaining entries.
+ * 3. List of full groups - tid_full_list
+ * This is the list where sets that are completely used
+ * up go.
+ *
+ * An attempt to optimize the usage of RcvArray entries is
+ * made by finding all sets of physically contiguous pages in a
+ * user's buffer.
+ * These physically contiguous sets are further split into
+ * sizes supported by the receive engine of the HFI. The
+ * resulting sets of pages are stored in struct tid_pageset,
+ * which describes the sets as:
+ * * .count - number of pages in this set
+ * * .idx - starting index into struct page ** array
+ * of this set
+ *
+ * From this point on, the algorithm deals with the page sets
+ * described above. The number of pagesets is divided by the
+ * RcvArray group size to produce the number of full groups
+ * needed.
+ *
+ * Groups from the 3 lists are manipulated using the following
+ * rules:
+ * 1. For each set of 8 pagesets, a complete group from
+ * tid_group_list is taken, programmed, and moved to
+ * the tid_full_list list.
+ * 2. For all remaining pagesets:
+ * 2.1 If the tid_used_list is empty and the tid_group_list
+ * is empty, stop processing pagesets and return only
+ * what has been programmed up to this point.
+ * 2.2 If the tid_used_list is empty and the tid_group_list
+ * is not empty, move a group from tid_group_list to
+ * tid_used_list.
+ * 2.3 For each group in tid_used_list, program as much as
+ * can fit into the group. If the group becomes fully
+ * used, move it to tid_full_list.
+ */
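+/*
+ * Illustrative walk-through (assuming a group size of 8): a request
+ * that produces 18 pagesets consumes two complete groups from
+ * tid_group_list under rule 1 (16 pagesets), then moves one more
+ * group from tid_group_list to tid_used_list under rule 2.2 and
+ * programs the remaining 2 pagesets into it under rule 2.3.
+ */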
+int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
+{
+ int ret = 0, need_group = 0, pinned;
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ struct hfi1_devdata *dd = uctxt->dd;
+ unsigned npages, ngroups, pageidx = 0, pageset_count, npagesets,
+ tididx = 0, mapped, mapped_pages = 0;
+ unsigned long vaddr = tinfo->vaddr;
+ struct page **pages = NULL;
+ u32 *tidlist = NULL;
+ struct tid_pageset *pagesets = NULL;
+
+ /* Get the number of pages the user buffer spans */
+ npages = num_user_pages(vaddr, tinfo->length);
+ if (!npages) {
+ ret = -EINVAL;
+ goto bail;
+ }
+
+ if (npages > uctxt->expected_count) {
+ dd_dev_err(dd, "Expected buffer too big\n");
+ ret = -EINVAL;
+ goto bail;
+ }
+
+ pagesets = kcalloc(uctxt->expected_count, sizeof(*pagesets),
+ GFP_KERNEL);
+ if (!pagesets) {
+ ret = -ENOMEM;
+ goto bail;
+ }
+
+ /* Verify that access is OK for the user buffer */
+ if (!access_ok(VERIFY_WRITE, (void __user *)vaddr,
+ npages * PAGE_SIZE)) {
+ dd_dev_err(dd, "Fail vaddr %p, %u pages, !access_ok\n",
+ (void *)vaddr, npages);
+ ret = -EFAULT;
+ goto bail;
+ }
+
+ /* Allocate the array of struct page pointers needed for pinning */
+ pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
+ if (!pages) {
+ ret = -ENOMEM;
+ goto bail;
+ }
+
+ /*
+ * Pin all the pages of the user buffer. If we can't pin all the
+ * pages, accept the amount pinned so far and program only that.
+ * User space knows how to deal with partially programmed buffers.
+ */
+ pinned = hfi1_acquire_user_pages(vaddr, npages, true, pages);
+ if (pinned <= 0) {
+ /*
+ * -EDQUOT has a special meaning (we can't lock any more
+ * pages), which user space knows how to deal with. We
+ * don't need an error message.
+ */
+ if (pinned != -EDQUOT)
+ dd_dev_err(dd,
+ "Failed to lock addr %p, %u pages: errno %d\n",
+ (void *) vaddr, npages, pinned);
+ ret = pinned;
+ goto bail;
+ }
+
+ /* Find sets of physically contiguous pages */
+ npagesets = find_phys_blocks(pages, pinned, pagesets);
+
+ /*
+ * We don't need to access this under a lock since tid_used is per
+ * process and the same process cannot be in hfi1_user_exp_rcv_clear()
+ * and hfi1_user_exp_rcv_setup() at the same time.
+ */
+ if (fp_to_fd(fp)->tid_used + npagesets > fp_to_fd(fp)->tid_limit)
+ pageset_count = fp_to_fd(fp)->tid_limit -
+ fp_to_fd(fp)->tid_used;
+ else
+ pageset_count = npagesets;
+
+ if (!pageset_count)
+ goto bail;
+
+ ngroups = pageset_count / dd->rcv_entries.group_size;
+ tidlist = kcalloc(pageset_count, sizeof(*tidlist), GFP_KERNEL);
+ if (!tidlist) {
+ ret = -ENOMEM;
+ goto nomem;
+ }
+
+ tididx = 0;
+
+ /* From this point on, we are going to be using shared (between master
+ * and subcontexts) context resources. We need to take the lock. */
+ mutex_lock(&uctxt->exp_lock);
+ /* The first step is to program the RcvArray entries which are complete
+ * groups. */
+ while (ngroups && uctxt->tid_group_list.count) {
+ struct tid_group *grp =
+ tid_group_pop(&uctxt->tid_group_list);
+
+ ret = program_rcvarray(fp, vaddr, grp, pagesets,
+ pageidx, dd->rcv_entries.group_size,
+ pages, tidlist, &tididx, &mapped);
+ /*
+ * If there was a failure to program the RcvArray
+ * entries for the entire group, reset the grp fields
+ * and add the grp back to the free group list.
+ */
+ if (ret <= 0) {
+ tid_group_add_tail(grp, &uctxt->tid_group_list);
+ hfi1_cdbg(TID,
+ "Failed to program RcvArray group %d", ret);
+ goto unlock;
+ }
+
+ tid_group_add_tail(grp, &uctxt->tid_full_list);
+ ngroups--;
+ pageidx += ret;
+ mapped_pages += mapped;
+ }
+
+ while (pageidx < pageset_count) {
+ struct tid_group *grp, *ptr;
+ /*
+ * If we don't have any partially used tid groups, check
+ * if we have empty groups. If so, take one from there and
+ * put in the partially used list.
+ */
+ if (!uctxt->tid_used_list.count || need_group) {
+ if (!uctxt->tid_group_list.count)
+ goto unlock;
+
+ grp = tid_group_pop(&uctxt->tid_group_list);
+ tid_group_add_tail(grp, &uctxt->tid_used_list);
+ need_group = 0;
+ }
+ /*
+ * There is an optimization opportunity here - instead of
+ * fitting as many page sets as we can, check for a group
+ * later on in the list that could fit all of them.
+ */
+ list_for_each_entry_safe(grp, ptr, &uctxt->tid_used_list.list,
+ list) {
+ unsigned use = min_t(unsigned, pageset_count - pageidx,
+ grp->size - grp->used);
+
+ ret = program_rcvarray(fp, vaddr, grp, pagesets,
+ pageidx, use, pages, tidlist,
+ &tididx, &mapped);
+ if (ret < 0) {
+ hfi1_cdbg(TID,
+ "Failed to program RcvArray entries %d",
+ ret);
+ ret = -EFAULT;
+ goto unlock;
+ } else if (ret > 0) {
+ if (grp->used == grp->size)
+ tid_group_move(grp,
+ &uctxt->tid_used_list,
+ &uctxt->tid_full_list);
+ pageidx += ret;
+ mapped_pages += mapped;
+ need_group = 0;
+ /* Check if we are done so we break out early */
+ if (pageidx >= pageset_count)
+ break;
+ } else if (WARN_ON(ret == 0)) {
+ /*
+ * If ret is 0, we did not program any entries
+ * into this group, which can only happen if
+ * we've screwed up the accounting somewhere.
+ * Warn and try to continue.
+ */
+ need_group = 1;
+ }
+ }
+ }
+unlock:
+ mutex_unlock(&uctxt->exp_lock);
+nomem:
+ hfi1_cdbg(TID, "total mapped: tidpairs:%u pages:%u (%d)", tididx,
+ mapped_pages, ret);
+ if (tididx) {
+ fp_to_fd(fp)->tid_used += tididx;
+ tinfo->tidcnt = tididx;
+ tinfo->length = mapped_pages * PAGE_SIZE;
+
+ if (copy_to_user((void __user *)(unsigned long)tinfo->tidlist,
+ tidlist, sizeof(tidlist[0]) * tididx)) {
+ /* On failure to copy to the user level, we need to undo
+ * everything done so far so we don't leak resources. */
+ tinfo->tidlist = (unsigned long)&tidlist;
+ hfi1_user_exp_rcv_clear(fp, tinfo);
+ tinfo->tidlist = 0;
+ ret = -EFAULT;
+ goto bail;
+ }
+ }
+
+ /*
+ * If not everything was mapped (due to insufficient RcvArray entries,
+ * for example), unpin all unmapped pages so we can pin them next time.
+ */
+ if (mapped_pages != pinned)
+ hfi1_release_user_pages(&pages[mapped_pages],
+ pinned - mapped_pages,
+ false);
+bail:
+ kfree(pagesets);
+ kfree(pages);
+ kfree(tidlist);
+ return ret > 0 ? 0 : ret;
+}
+
+int hfi1_user_exp_rcv_clear(struct file *fp, struct hfi1_tid_info *tinfo)
+{
+ int ret = 0;
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ u32 *tidinfo;
+ unsigned tididx;
+
+ tidinfo = kcalloc(tinfo->tidcnt, sizeof(*tidinfo), GFP_KERNEL);
+ if (!tidinfo)
+ return -ENOMEM;
+
+ if (copy_from_user(tidinfo, (void __user *)(unsigned long)
+ tinfo->tidlist, sizeof(tidinfo[0]) *
+ tinfo->tidcnt)) {
+ ret = -EFAULT;
+ goto done;
+ }
+
+ mutex_lock(&uctxt->exp_lock);
+ for (tididx = 0; tididx < tinfo->tidcnt; tididx++) {
+ ret = unprogram_rcvarray(fp, tidinfo[tididx], NULL);
+ if (ret) {
+ hfi1_cdbg(TID, "Failed to unprogram rcv array %d",
+ ret);
+ break;
+ }
+ }
+ fp_to_fd(fp)->tid_used -= tididx;
+ tinfo->tidcnt = tididx;
+ mutex_unlock(&uctxt->exp_lock);
+done:
+ kfree(tidinfo);
+ return ret;
+}
+
+int hfi1_user_exp_rcv_invalid(struct file *fp, struct hfi1_tid_info *tinfo)
+{
+ struct hfi1_filedata *fd = fp_to_fd(fp);
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ unsigned long *ev = uctxt->dd->events +
+ (((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
+ HFI1_MAX_SHARED_CTXTS) + subctxt_fp(fp));
+ u32 *array;
+ int ret = 0;
+
+ if (!fd->invalid_tids) {
+ ret = -EINVAL;
+ goto done;
+ }
+
+ /*
+ * copy_to_user() can sleep, which will leave the invalid_lock
+ * locked and cause the MMU notifier to be blocked on the lock
+ * for a long time.
+ * Copy the data to a local buffer so we can release the lock.
+ */
+ array = kcalloc(uctxt->expected_count, sizeof(*array), GFP_KERNEL);
+ if (!array) {
+ ret = -ENOMEM;
+ goto done;
+ }
+
+ spin_lock(&fp_to_fd(fp)->invalid_lock);
+ if (fd->invalid_tid_idx) {
+ memcpy(array, fd->invalid_tids, sizeof(*array) *
+ fd->invalid_tid_idx);
+ memset(fd->invalid_tids, 0, sizeof(*fd->invalid_tids) *
+ fd->invalid_tid_idx);
+ tinfo->tidcnt = fd->invalid_tid_idx;
+ fd->invalid_tid_idx = 0;
+ /* Reset the user flag while still holding the lock.
+ * Otherwise, PSM can miss events. */
+ clear_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+ } else
+ tinfo->tidcnt = 0;
+ spin_unlock(&fp_to_fd(fp)->invalid_lock);
+
+ if (tinfo->tidcnt) {
+ if (copy_to_user((void __user *)tinfo->tidlist,
+ array, sizeof(*array) * tinfo->tidcnt))
+ ret = -EFAULT;
+ }
+ kfree(array);
+done:
+ return ret;
+}
+
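+/*
+ * Illustrative example (assuming 4K pages): for pinned pages with PFNs
+ * 100, 101, 102, 200, the run of three contiguous pages is 12K, which
+ * is not a power of two, so it is split into an 8K set (idx 0, count 2)
+ * and a 4K set (idx 2, count 1); the lone PFN 200 becomes a third set
+ * (idx 3, count 1).
+ */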
+static u32 find_phys_blocks(struct page **pages, unsigned npages,
+ struct tid_pageset *list)
+{
+ unsigned pagecount, pageidx, setcount = 0, i;
+ unsigned long pfn, this_pfn;
+
+ if (!npages)
+ return 0;
+
+ /*
+ * Look for sets of physically contiguous pages in the user buffer.
+ * This will allow us to optimize Expected RcvArray entry usage by
+ * using the bigger supported sizes.
+ */
+ pfn = page_to_pfn(pages[0]);
+ for (pageidx = 0, pagecount = 1, i = 1; i <= npages; i++) {
+ this_pfn = i < npages ? page_to_pfn(pages[i]) : 0;
+
+ /* If the pfn's are not sequential, pages are not physically
+ * contiguous. */
+ if (this_pfn != ++pfn) {
+ /*
+ * At this point we have to loop over the set of
+ * physically contiguous pages and break them down into
+ * sizes supported by the HW.
+ * There are two main constraints:
+ * 1. The max buffer size is MAX_EXPECTED_BUFFER.
+ * If the total set size is bigger than that
+ * program only a MAX_EXPECTED_BUFFER chunk.
+ * 2. The buffer size has to be a power of two. If
+ * it is not, round down to the closest power of
+ * 2 and program that size.
+ */
+ while (pagecount) {
+ int maxpages = pagecount;
+ u32 bufsize = pagecount * PAGE_SIZE;
+
+ if (bufsize > MAX_EXPECTED_BUFFER)
+ maxpages =
+ MAX_EXPECTED_BUFFER >>
+ PAGE_SHIFT;
+ else if (!is_power_of_2(bufsize))
+ maxpages =
+ rounddown_pow_of_two(bufsize) >>
+ PAGE_SHIFT;
+
+ list[setcount].idx = pageidx;
+ list[setcount].count = maxpages;
+ pagecount -= maxpages;
+ pageidx += maxpages;
+ setcount++;
+ }
+ pageidx = i;
+ pagecount = 1;
+ pfn = this_pfn;
+ } else
+ pagecount++;
+ }
+ return setcount;
+}
+
+static int program_rcvarray(struct file *fp, unsigned long vaddr,
+ struct tid_group *grp,
+ struct tid_pageset *sets,
+ unsigned start, u16 count, struct page **pages,
+ u32 *tidlist, unsigned *tididx, unsigned *pmapped)
+{
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ struct hfi1_devdata *dd = uctxt->dd;
+ u16 idx;
+ u32 tidinfo = 0, rcventry, useidx = 0;
+ int mapped = 0;
+
+ /* Count should never be larger than the group size */
+ if (count > grp->size)
+ return -EINVAL;
+
+ /* Find the first unused entry in the group */
+ for (idx = 0; idx < grp->size; idx++) {
+ if (!(grp->map & (1 << idx))) {
+ useidx = idx;
+ break;
+ }
+ rcv_array_wc_fill(dd, grp->base + idx);
+ }
+
+ idx = 0;
+ while (idx < count) {
+ u16 npages, pageidx, setidx = start + idx;
+ int ret = 0;
+
+ /*
+ * If this entry in the group is used, move to the next one.
+ * If we go past the end of the group, exit the loop.
+ */
+ if (useidx >= grp->size)
+ break;
+ else if (grp->map & (1 << useidx)) {
+ rcv_array_wc_fill(dd, grp->base + useidx);
+ useidx++;
+ continue;
+ }
+
+ rcventry = grp->base + useidx;
+ npages = sets[setidx].count;
+ pageidx = sets[setidx].idx;
+
+ ret = set_rcvarray_entry(fp, vaddr + (pageidx * PAGE_SIZE),
+ rcventry, grp, pages + pageidx,
+ npages);
+ if (ret)
+ return ret;
+ mapped += npages;
+
+ tidinfo = rcventry2tidinfo(rcventry - uctxt->expected_base) |
+ EXP_TID_SET(LEN, npages);
+ tidlist[(*tididx)++] = tidinfo;
+ grp->used++;
+ grp->map |= 1 << useidx++;
+ idx++;
+ }
+
+ /* Fill the rest of the group with "blank" writes */
+ for (; useidx < grp->size; useidx++)
+ rcv_array_wc_fill(dd, grp->base + useidx);
+ *pmapped = mapped;
+ return idx;
+}
+
+static int set_rcvarray_entry(struct file *fp, unsigned long vaddr,
+ u32 rcventry, struct tid_group *grp,
+ struct page **pages, unsigned npages)
+{
+ int ret;
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ struct mmu_rb_node *node;
+ struct hfi1_devdata *dd = uctxt->dd;
+ struct rb_root *root = &rb_fp(fp);
+ dma_addr_t phys;
+
+ /*
+ * Allocate the node first so we can handle a potential
+ * failure before we've programmed anything.
+ */
+ node = kzalloc(sizeof(*node) + (sizeof(struct page *) * npages),
+ GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ phys = pci_map_single(dd->pcidev,
+ __va(page_to_phys(pages[0])),
+ npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
+ if (dma_mapping_error(&dd->pcidev->dev, phys)) {
+ dd_dev_err(dd, "Failed to DMA map Exp Rcv pages 0x%llx\n",
+ phys);
+ kfree(node);
+ return -EFAULT;
+ }
+
+ node->virt = vaddr;
+ node->phys = page_to_phys(pages[0]);
+ node->len = npages * PAGE_SIZE;
+ node->npages = npages;
+ node->rcventry = rcventry;
+ node->dma_addr = phys;
+ node->grp = grp;
+ node->freed = false;
+ memcpy(node->pages, pages, sizeof(struct page *) * npages);
+
+ spin_lock(&fp_to_fd(fp)->rb_lock);
+ ret = fp_to_fd(fp)->mmu_rb_insert(root, node);
+ spin_unlock(&fp_to_fd(fp)->rb_lock);
+
+ if (ret) {
+ hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
+ node->rcventry, node->virt, node->phys, ret);
+ pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
+ PCI_DMA_FROMDEVICE);
+ kfree(node);
+ return -EFAULT;
+ }
+ hfi1_put_tid(dd, rcventry, PT_EXPECTED, phys, ilog2(npages) + 1);
+ trace_hfi1_exp_tid_reg(uctxt->ctxt, subctxt_fp(fp), rcventry,
+ npages, node->virt, node->phys, phys);
+ return 0;
+}
+
+static int unprogram_rcvarray(struct file *fp, u32 tidinfo,
+ struct tid_group **grp)
+{
+ struct hfi1_ctxtdata *uctxt = ctxt_fp(fp);
+ struct hfi1_devdata *dd = uctxt->dd;
+ struct mmu_rb_node *node;
+ u8 tidctrl = EXP_TID_GET(tidinfo, CTRL);
+ u32 tidbase = uctxt->expected_base,
+ tididx = EXP_TID_GET(tidinfo, IDX) << 1, rcventry;
+
+ if (tididx > uctxt->expected_count) {
+ dd_dev_err(dd, "Invalid RcvArray entry (%u) index for ctxt %u\n",
+ tididx, uctxt->ctxt);
+ return -EINVAL;
+ }
+
+ if (tidctrl == 0x3)
+ return -EINVAL;
+
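+ /*
+ * Inverse of rcventry2tidinfo(): e.g. IDX = 2 and CTRL = 2 decode
+ * to tididx = 4 and rcventry = tidbase + 5.
+ */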
+ rcventry = tidbase + tididx + (tidctrl - 1);
+
+ spin_lock(&fp_to_fd(fp)->rb_lock);
+ node = mmu_rb_search_by_entry(&rb_fp(fp), rcventry);
+ if (!node) {
+ spin_unlock(&fp_to_fd(fp)->rb_lock);
+ return -EBADF;
+ }
+ rb_erase(&node->rbnode, &rb_fp(fp));
+ spin_unlock(&fp_to_fd(fp)->rb_lock);
+ if (grp)
+ *grp = node->grp;
+ clear_tid_node(fp_to_fd(fp), subctxt_fp(fp), node);
+ return 0;
+}
+
+static void clear_tid_node(struct hfi1_filedata *fd, u16 subctxt,
+ struct mmu_rb_node *node)
+{
+ struct hfi1_ctxtdata *uctxt = fd->uctxt;
+ struct hfi1_devdata *dd = uctxt->dd;
+
+ trace_hfi1_exp_tid_unreg(uctxt->ctxt, fd->subctxt, node->rcventry,
+ node->npages, node->virt, node->phys,
+ node->dma_addr);
+
+ hfi1_put_tid(dd, node->rcventry, PT_INVALID, 0, 0);
+ /* Make sure device has seen the write before we unpin the
+ * pages */
+ flush_wc();
+
+ pci_unmap_single(dd->pcidev, node->dma_addr, node->len,
+ PCI_DMA_FROMDEVICE);
+ hfi1_release_user_pages(node->pages, node->npages, true);
+
+ node->grp->used--;
+ node->grp->map &= ~(1 << (node->rcventry - node->grp->base));
+
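+ /*
+ * used was decremented above, so size - 1 means the group was
+ * full before this entry was cleared; move it back to the
+ * partially-used list. A now-empty group goes back to the free
+ * list.
+ */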
+ if (node->grp->used == node->grp->size - 1)
+ tid_group_move(node->grp, &uctxt->tid_full_list,
+ &uctxt->tid_used_list);
+ else if (!node->grp->used)
+ tid_group_move(node->grp, &uctxt->tid_used_list,
+ &uctxt->tid_group_list);
+ kfree(node);
+}
+
+static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
+ struct exp_tid_set *set, struct rb_root *root)
+{
+ struct tid_group *grp, *ptr;
+ struct hfi1_filedata *fd = container_of(root, struct hfi1_filedata,
+ tid_rb_root);
+ int i;
+
+ list_for_each_entry_safe(grp, ptr, &set->list, list) {
+ list_del_init(&grp->list);
+
+ spin_lock(&fd->rb_lock);
+ for (i = 0; i < grp->size; i++) {
+ if (grp->map & (1 << i)) {
+ u16 rcventry = grp->base + i;
+ struct mmu_rb_node *node;
+
+ node = mmu_rb_search_by_entry(root, rcventry);
+ if (!node)
+ continue;
+ rb_erase(&node->rbnode, root);
+ clear_tid_node(fd, -1, node);
+ }
+ }
+ spin_unlock(&fd->rb_lock);
+ }
+}
+
+static inline void mmu_notifier_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long addr)
+{
+ mmu_notifier_mem_invalidate(mn, addr, addr + PAGE_SIZE,
+ MMU_INVALIDATE_PAGE);
+}
+
+static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ mmu_notifier_mem_invalidate(mn, start, end, MMU_INVALIDATE_RANGE);
+}
+
+static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
+ unsigned long start, unsigned long end,
+ enum mmu_call_types type)
+{
+ struct hfi1_filedata *fd = container_of(mn, struct hfi1_filedata, mn);
+ struct hfi1_ctxtdata *uctxt = fd->uctxt;
+ struct rb_root *root = &fd->tid_rb_root;
+ struct mmu_rb_node *node;
+ unsigned long addr = start;
+
+ trace_hfi1_mmu_invalidate(uctxt->ctxt, fd->subctxt, mmu_types[type],
+ start, end);
+
+ spin_lock(&fd->rb_lock);
+ while (addr < end) {
+ node = mmu_rb_search_by_addr(root, addr);
+
+ if (!node) {
+ /* Didn't find a node at this address. However, the
+ * range could be bigger than what we have registered
+ * so we have to keep looking. */
+ addr += PAGE_SIZE;
+ continue;
+ }
+
+ /*
+ * The next address to be looked up is computed based
+ * on the node's starting address. This is due to the
+ * fact that the range where we start might be in the
+ * middle of the node's buffer so simply incrementing
+ * the address by the node's size would result in a
+ * bad address.
+ */
+ addr = node->virt + (node->npages * PAGE_SIZE);
+ if (node->freed)
+ continue;
+
+ trace_hfi1_exp_tid_inval(uctxt->ctxt, fd->subctxt, node->virt,
+ node->rcventry, node->npages,
+ node->dma_addr);
+ node->freed = true;
+
+ spin_lock(&fd->invalid_lock);
+ if (fd->invalid_tid_idx < uctxt->expected_count) {
+ fd->invalid_tids[fd->invalid_tid_idx] =
+ rcventry2tidinfo(node->rcventry -
+ uctxt->expected_base);
+ fd->invalid_tids[fd->invalid_tid_idx] |=
+ EXP_TID_SET(LEN, node->npages);
+ if (!fd->invalid_tid_idx) {
+ unsigned long *ev;
+
+ /*
+ * hfi1_set_uevent_bits() sets a user event flag
+ * for all processes. Because calling into the
+ * driver to process TID cache invalidations is
+ * expensive and TID cache invalidations are
+ * handled on a per-process basis, we can
+ * optimize this to set the flag only for the
+ * process in question.
+ */
+ ev = uctxt->dd->events +
+ (((uctxt->ctxt -
+ uctxt->dd->first_user_ctxt) *
+ HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
+ set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+ }
+ fd->invalid_tid_idx++;
+ }
+ spin_unlock(&fd->invalid_lock);
+ }
+ spin_unlock(&fd->rb_lock);
+}
+
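+/*
+ * Interval comparison for the address-keyed tree: -1 when the range
+ * [addr, addr + len) lies entirely below the node's buffer, 0 when
+ * addr falls inside it, 1 otherwise. For example, with a node at
+ * virt = 0x2000, len = 0x2000: addr = 0x1000 with len = 0x1000
+ * compares low (-1), while addr = 0x3000 matches (0).
+ */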
+static inline int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
+ unsigned long len)
+{
+ if ((addr + len) <= node->virt)
+ return -1;
+ else if (addr >= node->virt && addr < (node->virt + node->len))
+ return 0;
+ else
+ return 1;
+}
+
+static inline int mmu_entry_cmp(struct mmu_rb_node *node, u32 entry)
+{
+ if (entry < node->rcventry)
+ return -1;
+ else if (entry > node->rcventry)
+ return 1;
+ else
+ return 0;
+}
+
+static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *root,
+ unsigned long addr)
+{
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct mmu_rb_node *mnode =
+ container_of(node, struct mmu_rb_node, rbnode);
+ /*
+ * When searching, use at least one page length for size. The
+ * MMU notifier will not give us anything less than that. We
+ * also don't need anything more than a page because we are
+ * guaranteed to have non-overlapping buffers in the tree.
+ */
+ int result = mmu_addr_cmp(mnode, addr, PAGE_SIZE);
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return mnode;
+ }
+ return NULL;
+}
+
+static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *root,
+ u32 index)
+{
+ struct mmu_rb_node *rbnode;
+ struct rb_node *node;
+
+ if (root && !RB_EMPTY_ROOT(root))
+ for (node = rb_first(root); node; node = rb_next(node)) {
+ rbnode = rb_entry(node, struct mmu_rb_node, rbnode);
+ if (rbnode->rcventry == index)
+ return rbnode;
+ }
+ return NULL;
+}
+
+static int mmu_rb_insert_by_entry(struct rb_root *root,
+ struct mmu_rb_node *node)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ while (*new) {
+ struct mmu_rb_node *this =
+ container_of(*new, struct mmu_rb_node, rbnode);
+ int result = mmu_entry_cmp(this, node->rcventry);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return 1;
+ }
+
+ rb_link_node(&node->rbnode, parent, new);
+ rb_insert_color(&node->rbnode, root);
+ return 0;
+}
+
+static int mmu_rb_insert_by_addr(struct rb_root *root, struct mmu_rb_node *node)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct mmu_rb_node *this =
+ container_of(*new, struct mmu_rb_node, rbnode);
+ int result = mmu_addr_cmp(this, node->virt, node->len);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return 1;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&node->rbnode, parent, new);
+ rb_insert_color(&node->rbnode, root);
+
+ return 0;
+}
diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.h b/drivers/staging/rdma/hfi1/user_exp_rcv.h
new file mode 100644
index 000000000000..28ef98a45a1e
--- /dev/null
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.h
@@ -0,0 +1,82 @@
+#ifndef _HFI1_USER_EXP_RCV_H
+#define _HFI1_USER_EXP_RCV_H
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license. When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * - Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * - Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * - Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include "hfi.h"
+
+#define EXP_TID_TIDLEN_MASK 0x7FFULL
+#define EXP_TID_TIDLEN_SHIFT 0
+#define EXP_TID_TIDCTRL_MASK 0x3ULL
+#define EXP_TID_TIDCTRL_SHIFT 20
+#define EXP_TID_TIDIDX_MASK 0x3FFULL
+#define EXP_TID_TIDIDX_SHIFT 22
+#define EXP_TID_GET(tid, field) \
+ (((tid) >> EXP_TID_TID##field##_SHIFT) & EXP_TID_TID##field##_MASK)
+
+#define EXP_TID_SET(field, value) \
+ (((value) & EXP_TID_TID##field##_MASK) << \
+ EXP_TID_TID##field##_SHIFT)
+#define EXP_TID_CLEAR(tid, field) ({ \
+ (tid) &= ~(EXP_TID_TID##field##_MASK << \
+ EXP_TID_TID##field##_SHIFT); \
+ })
+#define EXP_TID_RESET(tid, field, value) do { \
+ EXP_TID_CLEAR(tid, field); \
+ (tid) |= EXP_TID_SET(field, (value)); \
+ } while (0)
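+
+/*
+ * Illustrative layout of the resulting 32-bit TID word: LEN occupies
+ * bits 10:0, CTRL bits 21:20, and IDX bits 31:22. For example,
+ * EXP_TID_SET(IDX, 2) | EXP_TID_SET(CTRL, 2) | EXP_TID_SET(LEN, 3)
+ * describes RcvArray entry 5 of the expected set (pair 2, second
+ * entry) covering 3 pages.
+ */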
+
+int hfi1_user_exp_rcv_init(struct file *);
+int hfi1_user_exp_rcv_free(struct hfi1_filedata *);
+int hfi1_user_exp_rcv_setup(struct file *, struct hfi1_tid_info *);
+int hfi1_user_exp_rcv_clear(struct file *, struct hfi1_tid_info *);
+int hfi1_user_exp_rcv_invalid(struct file *, struct hfi1_tid_info *);
+
+#endif /* _HFI1_USER_EXP_RCV_H */
diff --git a/drivers/staging/rdma/hfi1/user_pages.c b/drivers/staging/rdma/hfi1/user_pages.c
index 9071afbd7bf4..ec84cc63743e 100644
--- a/drivers/staging/rdma/hfi1/user_pages.c
+++ b/drivers/staging/rdma/hfi1/user_pages.c
@@ -47,110 +47,48 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*/
-
#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/device.h>
-
#include "hfi.h"
-static void __hfi1_release_user_pages(struct page **p, size_t num_pages,
- int dirty)
+int hfi1_acquire_user_pages(unsigned long vaddr, size_t npages, bool writable,
+ struct page **pages)
{
- size_t i;
-
- for (i = 0; i < num_pages; i++) {
- if (dirty)
- set_page_dirty_lock(p[i]);
- put_page(p[i]);
- }
-}
-
-/*
- * Call with current->mm->mmap_sem held.
- */
-static int __hfi1_get_user_pages(unsigned long start_page, size_t num_pages,
- struct page **p)
-{
- unsigned long lock_limit;
- size_t got;
+ unsigned long pinned, lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ bool can_lock = capable(CAP_IPC_LOCK);
int ret;
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
- ret = -ENOMEM;
- goto bail;
- }
-
- for (got = 0; got < num_pages; got += ret) {
- ret = get_user_pages(current, current->mm,
- start_page + got * PAGE_SIZE,
- num_pages - got, 1, 1,
- p + got, NULL);
- if (ret < 0)
- goto bail_release;
- }
-
- current->mm->pinned_vm += num_pages;
-
- ret = 0;
- goto bail;
-
-bail_release:
- __hfi1_release_user_pages(p, got, 0);
-bail:
- return ret;
-}
-
-/**
- * hfi1_map_page - a safety wrapper around pci_map_page()
- *
- */
-dma_addr_t hfi1_map_page(struct pci_dev *hwdev, struct page *page,
- unsigned long offset, size_t size, int direction)
-{
- dma_addr_t phys;
+ down_read(&current->mm->mmap_sem);
+ pinned = current->mm->pinned_vm;
+ up_read(&current->mm->mmap_sem);
- phys = pci_map_page(hwdev, page, offset, size, direction);
+ if (pinned + npages > lock_limit && !can_lock)
+ return -EDQUOT;
- return phys;
-}
-
-/**
- * hfi1_get_user_pages - lock user pages into memory
- * @start_page: the start page
- * @num_pages: the number of pages
- * @p: the output page structures
- *
- * This function takes a given start page (page aligned user virtual
- * address) and pins it and the following specified number of pages. For
- * now, num_pages is always 1, but that will probably change at some point
- * (because caller is doing expected sends on a single virtually contiguous
- * buffer, so we can do all pages at once).
- */
-int hfi1_get_user_pages(unsigned long start_page, size_t num_pages,
- struct page **p)
-{
- int ret;
+ ret = get_user_pages_fast(vaddr, npages, writable, pages);
+ if (ret < 0)
+ return ret;
down_write(&current->mm->mmap_sem);
-
- ret = __hfi1_get_user_pages(start_page, num_pages, p);
-
+ current->mm->pinned_vm += ret;
up_write(&current->mm->mmap_sem);
-
return ret;
}
-void hfi1_release_user_pages(struct page **p, size_t num_pages)
+void hfi1_release_user_pages(struct page **p, size_t npages, bool dirty)
{
- if (current->mm) /* during close after signal, mm can be NULL */
- down_write(&current->mm->mmap_sem);
+ size_t i;
- __hfi1_release_user_pages(p, num_pages, 1);
+ for (i = 0; i < npages; i++) {
+ if (dirty)
+ set_page_dirty_lock(p[i]);
+ put_page(p[i]);
+ }
- if (current->mm) {
- current->mm->pinned_vm -= num_pages;
+ if (current->mm) { /* during close after signal, mm can be NULL */
+ down_write(&current->mm->mmap_sem);
+ current->mm->pinned_vm -= npages;
up_write(&current->mm->mmap_sem);
}
}
diff --git a/drivers/staging/rdma/hfi1/user_sdma.c b/drivers/staging/rdma/hfi1/user_sdma.c
index 36c838dcf023..2355ce7b17b7 100644
--- a/drivers/staging/rdma/hfi1/user_sdma.c
+++ b/drivers/staging/rdma/hfi1/user_sdma.c
@@ -1055,6 +1055,12 @@ static int pin_vector_pages(struct user_sdma_request *req,
/* If called by the kernel thread, use the user's mm */
if (current->flags & PF_KTHREAD)
use_mm(req->user_proc->mm);
+ /*
+ * We should be calling hfi1_acquire_user_pages() so we can keep
+ * the number of pinned pages up-to-date. However, we can't do
+ * that because we can't use hfi1_release_user_pages() (see
+ * comment in unpin_vector_pages()).
+ */
pinned = get_user_pages_fast(
(unsigned long)iovec->iov.iov_base,
iovec->npages, 0, iovec->pages);
@@ -1084,6 +1090,13 @@ static void unpin_vector_pages(struct user_sdma_iovec *iovec)
iovec->offset, iovec->iov.iov_len);
return;
}
+ /*
+ * We should be calling hfi1_release_user_pages() so we can keep
+ * the number of pinned pages up-to-date. However, this function
+ * can be called in IRQ context and this will cause a deadlock
+ * because hfi1_release_user_pages() takes the mm semaphore
+ * (which sleeps).
+ */
for (i = 0; i < iovec->npages; i++)
if (iovec->pages[i])
put_page(iovec->pages[i]);
diff --git a/drivers/staging/rdma/hfi1/user_sdma.h b/drivers/staging/rdma/hfi1/user_sdma.h
index fa4422553e23..0046ffa774fe 100644
--- a/drivers/staging/rdma/hfi1/user_sdma.h
+++ b/drivers/staging/rdma/hfi1/user_sdma.h
@@ -52,15 +52,7 @@
#include "common.h"
#include "iowait.h"
-
-#define EXP_TID_TIDLEN_MASK 0x7FFULL
-#define EXP_TID_TIDLEN_SHIFT 0
-#define EXP_TID_TIDCTRL_MASK 0x3ULL
-#define EXP_TID_TIDCTRL_SHIFT 20
-#define EXP_TID_TIDIDX_MASK 0x7FFULL
-#define EXP_TID_TIDIDX_SHIFT 22
-#define EXP_TID_GET(tid, field) \
- (((tid) >> EXP_TID_TID##field##_SHIFT) & EXP_TID_TID##field##_MASK)
+#include "user_exp_rcv.h"
extern uint extended_psn;
diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h
index 599562fe5d57..54998e86689b 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -66,7 +66,7 @@
* The major version changes when data structures change in an incompatible
* way. The driver must be the same for initialization to succeed.
*/
-#define HFI1_USER_SWMAJOR 4
+#define HFI1_USER_SWMAJOR 5
/*
* Minor version differences are always compatible
@@ -93,7 +93,7 @@
#define HFI1_CAP_MULTI_PKT_EGR (1UL << 7) /* Enable multi-packet Egr buffs*/
#define HFI1_CAP_NODROP_RHQ_FULL (1UL << 8) /* Don't drop on Hdr Q full */
#define HFI1_CAP_NODROP_EGR_FULL (1UL << 9) /* Don't drop on EGR buffs full */
-#define HFI1_CAP_TID_UNMAP (1UL << 10) /* Enable Expected TID caching */
+#define HFI1_CAP_TID_UNMAP (1UL << 10) /* Disable Expected TID caching */
#define HFI1_CAP_PRINT_UNIMPL (1UL << 11) /* Show for unimplemented feats */
#define HFI1_CAP_ALLOW_PERM_JKEY (1UL << 12) /* Allow use of permissive JKEY */
#define HFI1_CAP_NO_INTEGRITY (1UL << 13) /* Enable ctxt integrity checks */
@@ -127,13 +127,14 @@
#define HFI1_CMD_TID_UPDATE 4 /* update expected TID entries */
#define HFI1_CMD_TID_FREE 5 /* free expected TID entries */
#define HFI1_CMD_CREDIT_UPD 6 /* force an update of PIO credit */
-#define HFI1_CMD_SDMA_STATUS_UPD 7 /* force update of SDMA status ring */
+#define HFI1_CMD_SDMA_STATUS_UPD 7 /* force update of SDMA status ring */
#define HFI1_CMD_RECV_CTRL 8 /* control receipt of packets */
#define HFI1_CMD_POLL_TYPE 9 /* set the kind of polling we want */
#define HFI1_CMD_ACK_EVENT 10 /* ack & clear user status bits */
-#define HFI1_CMD_SET_PKEY 11 /* set context's pkey */
-#define HFI1_CMD_CTXT_RESET 12 /* reset context's HW send context */
+#define HFI1_CMD_SET_PKEY 11 /* set context's pkey */
+#define HFI1_CMD_CTXT_RESET 12 /* reset context's HW send context */
+#define HFI1_CMD_TID_INVAL_READ 13 /* read TID cache invalidations */
/* separate EPROM commands from normal PSM commands */
#define HFI1_CMD_EP_INFO 64 /* read EPROM device ID */
#define HFI1_CMD_EP_ERASE_CHIP 65 /* erase whole EPROM */
@@ -144,18 +145,20 @@
#define HFI1_CMD_EP_WRITE_P0 70 /* write EPROM partition 0 */
#define HFI1_CMD_EP_WRITE_P1 71 /* write EPROM partition 1 */
-#define _HFI1_EVENT_FROZEN_BIT 0
-#define _HFI1_EVENT_LINKDOWN_BIT 1
-#define _HFI1_EVENT_LID_CHANGE_BIT 2
-#define _HFI1_EVENT_LMC_CHANGE_BIT 3
-#define _HFI1_EVENT_SL2VL_CHANGE_BIT 4
-#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_SL2VL_CHANGE_BIT
-
-#define HFI1_EVENT_FROZEN (1UL << _HFI1_EVENT_FROZEN_BIT)
-#define HFI1_EVENT_LINKDOWN_BIT (1UL << _HFI1_EVENT_LINKDOWN_BIT)
-#define HFI1_EVENT_LID_CHANGE_BIT (1UL << _HFI1_EVENT_LID_CHANGE_BIT)
-#define HFI1_EVENT_LMC_CHANGE_BIT (1UL << _HFI1_EVENT_LMC_CHANGE_BIT)
-#define HFI1_EVENT_SL2VL_CHANGE_BIT (1UL << _HFI1_EVENT_SL2VL_CHANGE_BIT)
+#define _HFI1_EVENT_FROZEN_BIT 0
+#define _HFI1_EVENT_LINKDOWN_BIT 1
+#define _HFI1_EVENT_LID_CHANGE_BIT 2
+#define _HFI1_EVENT_LMC_CHANGE_BIT 3
+#define _HFI1_EVENT_SL2VL_CHANGE_BIT 4
+#define _HFI1_EVENT_TID_MMU_NOTIFY_BIT 5
+#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_TID_MMU_NOTIFY_BIT
+
+#define HFI1_EVENT_FROZEN (1UL << _HFI1_EVENT_FROZEN_BIT)
+#define HFI1_EVENT_LINKDOWN (1UL << _HFI1_EVENT_LINKDOWN_BIT)
+#define HFI1_EVENT_LID_CHANGE (1UL << _HFI1_EVENT_LID_CHANGE_BIT)
+#define HFI1_EVENT_LMC_CHANGE (1UL << _HFI1_EVENT_LMC_CHANGE_BIT)
+#define HFI1_EVENT_SL2VL_CHANGE (1UL << _HFI1_EVENT_SL2VL_CHANGE_BIT)
+#define HFI1_EVENT_TID_MMU_NOTIFY (1UL << _HFI1_EVENT_TID_MMU_NOTIFY_BIT)
/*
* These are the status bits readable (in ASCII form, 64bit value)
@@ -240,11 +243,6 @@ struct hfi1_tid_info {
__u32 tidcnt;
/* length of transfer buffer programmed by this request */
__u32 length;
- /*
- * pointer to bitmap of TIDs used for this call;
- * checked for being large enough at open
- */
- __u64 tidmap;
};
struct hfi1_cmd {
--
1.8.2
* [PATCH 16/23] staging/rdma/hfi1: Allow tuning of SDMA interrupt rate
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (14 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 15/23] staging/rdma/hfi1: Implement Expected Receive TID caching ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 19/23] staging/rdma/hfi: modify workqueue for parallelism ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (5 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Mitko Haralanov,
Ira Weiny
From: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The SDMA engines were configured to generate progress interrupts every time they
processed N/2 descriptors (where N is the size of the descriptor queue). This
made interrupts too infrequent, leading to degraded performance.
This commit adds a module parameter, which allows for the tuning of the
interrupt frequency until an optimal frequency is found for both PSM and Verbs.
At that time, the parameter could be pulled out, if desired.
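Since the parameter is registered with S_IRUGO | S_IWUSR, it should also be
readable and writable at runtime via /sys/module/hfi1/parameters/desct_intr;
a value of 0 (the default) keeps the previous descq_cnt / 2 behavior, as
sdma_init() below falls back to that when the parameter is unset.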
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/sdma.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index 53b3e4d9518b..452e7edcee7a 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -80,6 +80,10 @@ uint mod_num_sdma;
module_param_named(num_sdma, mod_num_sdma, uint, S_IRUGO);
MODULE_PARM_DESC(num_sdma, "Set max number SDMA engines to use");
+static uint sdma_desct_intr;
+module_param_named(desct_intr, sdma_desct_intr, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(desct_intr, "Number of SDMA descriptors before interrupt");
+
#define SDMA_WAIT_BATCH_SIZE 20
/* max wait time for a SDMA engine to indicate it has halted */
#define SDMA_ERR_HALT_TIMEOUT 10 /* ms */
@@ -1046,6 +1050,8 @@ int sdma_init(struct hfi1_devdata *dd, u8 port)
return -ENOMEM;
idle_cnt = ns_to_cclock(dd, idle_cnt);
+ if (!sdma_desct_intr)
+ sdma_desct_intr = descq_cnt / 2;
/* Allocate memory for SendDMA descriptor FIFOs */
for (this_idx = 0; this_idx < num_engines; ++this_idx) {
sde = &dd->per_sdma[this_idx];
@@ -1548,7 +1554,7 @@ void sdma_engine_interrupt(struct sdma_engine *sde, u64 status)
{
trace_hfi1_sdma_engine_interrupt(sde, status);
write_seqlock(&sde->head_lock);
- sdma_set_desc_cnt(sde, sde->descq_cnt / 2);
+ sdma_set_desc_cnt(sde, sdma_desct_intr);
sdma_make_progress(sde, status);
write_sequnlock(&sde->head_lock);
}
--
1.8.2
* [PATCH 19/23] staging/rdma/hfi: modify workqueue for parallelism
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (15 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 16/23] staging/rdma/hfi1: Allow tuning of SDMA interrupt rate ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 20/23] staging/rdma/hfi1: Load SBus firmware once per ASIC ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (4 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Ira Weiny
From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The workqueue is currently single-threaded per port, which is OK for a small
number of SDMA engines.
For hfi1, there are up to 16 SDMA engines that can be fed descriptors in
parallel.
This patch:
- Converts to use alloc_workqueue
- Changes the workqueue limit from 1 to num_sdma
- Makes the queue WQ_CPU_INTENSIVE and WQ_HIGHPRI
- The sdma_engine now has a cpu that is initialized
as the MSI-X vectors are set up
- Adjusts the post send logic to call a new scheduler
that doesn't get the s_lock
- The new and old workqueue schedule now pass a
cpu
- post send now uses the new scheduler
- RC/UC QPs now pre-compute the sc, sde
- The sde wq is eliminated since the new hfi1_wq is
multi-threaded
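For reference, the queue allocation amounts to the following shape (a sketch
only; the flags and the max_active limit of num_sdma are the key parts,
allowing one in-flight progress work item per SDMA engine):
    ppd->hfi1_wq = alloc_workqueue("hfi%d_%d",
                                   WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
                                   dd->num_sdma, /* max_active */
                                   dd->unit, pidx);
Each engine's work is then queued with queue_work_on() on the CPU recorded
from its MSI-X vector.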
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/chip.c | 1 +
drivers/staging/rdma/hfi1/init.c | 13 +++++------
drivers/staging/rdma/hfi1/iowait.h | 6 +++--
drivers/staging/rdma/hfi1/qp.c | 47 +++++++++++++++++++++++++++++++++-----
drivers/staging/rdma/hfi1/qp.h | 38 +++++++++++++++++++++++++++++-
drivers/staging/rdma/hfi1/ruc.c | 30 ++----------------------
drivers/staging/rdma/hfi1/sdma.c | 8 ++++---
drivers/staging/rdma/hfi1/sdma.h | 8 ++++---
drivers/staging/rdma/hfi1/ud.c | 1 +
drivers/staging/rdma/hfi1/verbs.c | 34 ++++++---------------------
drivers/staging/rdma/hfi1/verbs.h | 6 ++---
11 files changed, 111 insertions(+), 81 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index 352b0d089ae1..af1b3bb10ecb 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -9021,6 +9021,7 @@ static int request_msix_irqs(struct hfi1_devdata *dd)
if (handler == sdma_interrupt) {
dd_dev_info(dd, "sdma engine %d cpu %d\n",
sde->this_idx, sdma_cpu);
+ sde->cpu = sdma_cpu;
cpumask_set_cpu(sdma_cpu, dd->msix_entries[i].mask);
sdma_cpu = cpumask_next(sdma_cpu, def);
if (sdma_cpu >= nr_cpu_ids)
diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c
index 060ab566856a..3f6166b9c59f 100644
--- a/drivers/staging/rdma/hfi1/init.c
+++ b/drivers/staging/rdma/hfi1/init.c
@@ -601,20 +601,19 @@ static int create_workqueues(struct hfi1_devdata *dd)
for (pidx = 0; pidx < dd->num_pports; ++pidx) {
ppd = dd->pport + pidx;
if (!ppd->hfi1_wq) {
- char wq_name[8]; /* 3 + 2 + 1 + 1 + 1 */
-
- snprintf(wq_name, sizeof(wq_name), "hfi%d_%d",
- dd->unit, pidx);
ppd->hfi1_wq =
- create_singlethread_workqueue(wq_name);
+ alloc_workqueue(
+ "hfi%d_%d",
+ WQ_SYSFS | WQ_HIGHPRI | WQ_CPU_INTENSIVE,
+ dd->num_sdma,
+ dd->unit, pidx);
if (!ppd->hfi1_wq)
goto wq_error;
}
}
return 0;
wq_error:
- pr_err("create_singlethread_workqueue failed for port %d\n",
- pidx + 1);
+ pr_err("alloc_workqueue failed for port %d\n", pidx + 1);
for (pidx = 0; pidx < dd->num_pports; ++pidx) {
ppd = dd->pport + pidx;
if (ppd->hfi1_wq) {
diff --git a/drivers/staging/rdma/hfi1/iowait.h b/drivers/staging/rdma/hfi1/iowait.h
index fa361b405851..e8ba5606d08d 100644
--- a/drivers/staging/rdma/hfi1/iowait.h
+++ b/drivers/staging/rdma/hfi1/iowait.h
@@ -150,12 +150,14 @@ static inline void iowait_init(
* iowait_schedule() - initialize wait structure
* @wait: wait struct to schedule
* @wq: workqueue for schedule
+ * @cpu: cpu
*/
static inline void iowait_schedule(
struct iowait *wait,
- struct workqueue_struct *wq)
+ struct workqueue_struct *wq,
+ int cpu)
{
- queue_work(wq, &wait->iowork);
+ queue_work_on(cpu, wq, &wait->iowork);
}
/**
diff --git a/drivers/staging/rdma/hfi1/qp.c b/drivers/staging/rdma/hfi1/qp.c
index df1fa56eaf85..87aa49227bf6 100644
--- a/drivers/staging/rdma/hfi1/qp.c
+++ b/drivers/staging/rdma/hfi1/qp.c
@@ -617,7 +617,7 @@ int hfi1_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
int mig = 0;
int ret;
u32 pmtu = 0; /* for gcc warning only */
- struct hfi1_devdata *dd;
+ struct hfi1_devdata *dd = dd_from_dev(dev);
spin_lock_irq(&qp->r_lock);
spin_lock(&qp->s_lock);
@@ -631,23 +631,35 @@ int hfi1_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
goto inval;
if (attr_mask & IB_QP_AV) {
+ u8 sc;
+
if (attr->ah_attr.dlid >= HFI1_MULTICAST_LID_BASE)
goto inval;
if (hfi1_check_ah(qp->ibqp.device, &attr->ah_attr))
goto inval;
+ sc = ah_to_sc(ibqp->device, &attr->ah_attr);
+ if (!qp_to_sdma_engine(qp, sc) &&
+ dd->flags & HFI1_HAS_SEND_DMA)
+ goto inval;
}
if (attr_mask & IB_QP_ALT_PATH) {
+ u8 sc;
+
if (attr->alt_ah_attr.dlid >= HFI1_MULTICAST_LID_BASE)
goto inval;
if (hfi1_check_ah(qp->ibqp.device, &attr->alt_ah_attr))
goto inval;
- if (attr->alt_pkey_index >= hfi1_get_npkeys(dd_from_dev(dev)))
+ if (attr->alt_pkey_index >= hfi1_get_npkeys(dd))
+ goto inval;
+ sc = ah_to_sc(ibqp->device, &attr->alt_ah_attr);
+ if (!qp_to_sdma_engine(qp, sc) &&
+ dd->flags & HFI1_HAS_SEND_DMA)
goto inval;
}
if (attr_mask & IB_QP_PKEY_INDEX)
- if (attr->pkey_index >= hfi1_get_npkeys(dd_from_dev(dev)))
+ if (attr->pkey_index >= hfi1_get_npkeys(dd))
goto inval;
if (attr_mask & IB_QP_MIN_RNR_TIMER)
@@ -792,6 +804,8 @@ int hfi1_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
qp->remote_ah_attr = attr->ah_attr;
qp->s_srate = attr->ah_attr.static_rate;
qp->srate_mbps = ib_rate_to_mbps(qp->s_srate);
+ qp->s_sc = ah_to_sc(ibqp->device, &qp->remote_ah_attr);
+ qp->s_sde = qp_to_sdma_engine(qp, qp->s_sc);
}
if (attr_mask & IB_QP_ALT_PATH) {
@@ -806,6 +820,8 @@ int hfi1_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
qp->port_num = qp->alt_ah_attr.port_num;
qp->s_pkey_index = qp->s_alt_pkey_index;
qp->s_flags |= HFI1_S_AHG_CLEAR;
+ qp->s_sc = ah_to_sc(ibqp->device, &qp->remote_ah_attr);
+ qp->s_sde = qp_to_sdma_engine(qp, qp->s_sc);
}
}
@@ -1528,9 +1544,6 @@ struct sdma_engine *qp_to_sdma_engine(struct hfi1_qp *qp, u8 sc5)
if (!(dd->flags & HFI1_HAS_SEND_DMA))
return NULL;
switch (qp->ibqp.qp_type) {
- case IB_QPT_UC:
- case IB_QPT_RC:
- break;
case IB_QPT_SMI:
return NULL;
default:
@@ -1685,3 +1698,25 @@ void qp_comm_est(struct hfi1_qp *qp)
qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
}
}
+
+/*
+ * Switch to alternate path.
+ * The QP s_lock should be held and interrupts disabled.
+ */
+void hfi1_migrate_qp(struct hfi1_qp *qp)
+{
+ struct ib_event ev;
+
+ qp->s_mig_state = IB_MIG_MIGRATED;
+ qp->remote_ah_attr = qp->alt_ah_attr;
+ qp->port_num = qp->alt_ah_attr.port_num;
+ qp->s_pkey_index = qp->s_alt_pkey_index;
+ qp->s_flags |= HFI1_S_AHG_CLEAR;
+ qp->s_sc = ah_to_sc(qp->ibqp.device, &qp->remote_ah_attr);
+ qp->s_sde = qp_to_sdma_engine(qp, qp->s_sc);
+
+ ev.device = qp->ibqp.device;
+ ev.element.qp = &qp->ibqp;
+ ev.event = IB_EVENT_PATH_MIG;
+ qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
+}
diff --git a/drivers/staging/rdma/hfi1/qp.h b/drivers/staging/rdma/hfi1/qp.h
index b9c1575990aa..5e1def523f61 100644
--- a/drivers/staging/rdma/hfi1/qp.h
+++ b/drivers/staging/rdma/hfi1/qp.h
@@ -128,7 +128,6 @@ static inline void clear_ahg(struct hfi1_qp *qp)
if (qp->s_sde && qp->s_ahgidx >= 0)
sdma_ahg_free(qp->s_sde, qp->s_ahgidx);
qp->s_ahgidx = -1;
- qp->s_sde = NULL;
}
/**
@@ -247,4 +246,41 @@ void qp_iter_print(struct seq_file *s, struct qp_iter *iter);
*/
void qp_comm_est(struct hfi1_qp *qp);
+/**
+ * _hfi1_schedule_send - schedule progress
+ * @qp: the QP
+ *
+ * This schedules qp progress w/o regard to the s_flags.
+ *
+ * It is only used in the post send, which doesn't hold
+ * the s_lock.
+ */
+static inline void _hfi1_schedule_send(struct hfi1_qp *qp)
+{
+ struct hfi1_ibport *ibp =
+ to_iport(qp->ibqp.device, qp->port_num);
+ struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
+ struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device);
+
+ iowait_schedule(&qp->s_iowait, ppd->hfi1_wq,
+ qp->s_sde ?
+ qp->s_sde->cpu :
+ cpumask_first(cpumask_of_node(dd->assigned_node_id)));
+}
+
+/**
+ * hfi1_schedule_send - schedule progress
+ * @qp: the QP
+ *
+ * This schedules qp progress and caller should hold
+ * the s_lock.
+ */
+static inline void hfi1_schedule_send(struct hfi1_qp *qp)
+{
+ if (hfi1_send_ok(qp))
+ _hfi1_schedule_send(qp);
+}
+
+void hfi1_migrate_qp(struct hfi1_qp *qp);
+
#endif /* _QP_H */
diff --git a/drivers/staging/rdma/hfi1/ruc.c b/drivers/staging/rdma/hfi1/ruc.c
index 8614b070545c..7b11c61ac5d6 100644
--- a/drivers/staging/rdma/hfi1/ruc.c
+++ b/drivers/staging/rdma/hfi1/ruc.c
@@ -241,26 +241,6 @@ bail:
return ret;
}
-/*
- * Switch to alternate path.
- * The QP s_lock should be held and interrupts disabled.
- */
-void hfi1_migrate_qp(struct hfi1_qp *qp)
-{
- struct ib_event ev;
-
- qp->s_mig_state = IB_MIG_MIGRATED;
- qp->remote_ah_attr = qp->alt_ah_attr;
- qp->port_num = qp->alt_ah_attr.port_num;
- qp->s_pkey_index = qp->s_alt_pkey_index;
- qp->s_flags |= HFI1_S_AHG_CLEAR;
-
- ev.device = qp->ibqp.device;
- ev.element.qp = &qp->ibqp;
- ev.event = IB_EVENT_PATH_MIG;
- qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
-}
-
static __be64 get_sguid(struct hfi1_ibport *ibp, unsigned index)
{
if (!index) {
@@ -714,11 +694,8 @@ static inline void build_ahg(struct hfi1_qp *qp, u32 npsn)
clear_ahg(qp);
if (!(qp->s_flags & HFI1_S_AHG_VALID)) {
/* first middle that needs copy */
- if (qp->s_ahgidx < 0) {
- if (!qp->s_sde)
- qp->s_sde = qp_to_sdma_engine(qp, qp->s_sc);
+ if (qp->s_ahgidx < 0)
qp->s_ahgidx = sdma_ahg_alloc(qp->s_sde);
- }
if (qp->s_ahgidx >= 0) {
qp->s_ahgpsn = npsn;
qp->s_hdr->tx_flags |= SDMA_TXREQ_F_AHG_COPY;
@@ -761,7 +738,6 @@ void hfi1_make_ruc_header(struct hfi1_qp *qp, struct hfi1_other_headers *ohdr,
u16 lrh0;
u32 nwords;
u32 extra_bytes;
- u8 sc5;
u32 bth1;
/* Construct the header. */
@@ -775,9 +751,7 @@ void hfi1_make_ruc_header(struct hfi1_qp *qp, struct hfi1_other_headers *ohdr,
lrh0 = HFI1_LRH_GRH;
middle = 0;
}
- sc5 = ibp->sl_to_sc[qp->remote_ah_attr.sl];
- lrh0 |= (sc5 & 0xf) << 12 | (qp->remote_ah_attr.sl & 0xf) << 4;
- qp->s_sc = sc5;
+ lrh0 |= (qp->s_sc & 0xf) << 12 | (qp->remote_ah_attr.sl & 0xf) << 4;
/*
* reset s_hdr/AHG fields
*
diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index 16b1edf2a5cc..6aad23ca1908 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -777,18 +777,19 @@ struct sdma_engine *sdma_select_engine_vl(
struct sdma_engine *rval;
if (WARN_ON(vl > 8))
- return NULL;
+ return &dd->per_sdma[0];
rcu_read_lock();
m = rcu_dereference(dd->sdma_map);
if (unlikely(!m)) {
rcu_read_unlock();
- return NULL;
+ return &dd->per_sdma[0];
}
e = m->map[vl & m->mask];
rval = e->sde[selector & e->mask];
rcu_read_unlock();
+ rval = !rval ? &dd->per_sdma[0] : rval;
trace_hfi1_sdma_engine_select(dd, selector, vl, rval->this_idx);
return rval;
}
@@ -1874,7 +1875,7 @@ static void dump_sdma_state(struct sdma_engine *sde)
}
#define SDE_FMT \
- "SDE %u STE %s C 0x%llx S 0x%016llx E 0x%llx T(HW) 0x%llx T(SW) 0x%x H(HW) 0x%llx H(SW) 0x%x H(D) 0x%llx DM 0x%llx GL 0x%llx R 0x%llx LIS 0x%llx AHGI 0x%llx TXT %u TXH %u DT %u DH %u FLNE %d DQF %u SLC 0x%llx\n"
+ "SDE %u CPU %d STE %s C 0x%llx S 0x%016llx E 0x%llx T(HW) 0x%llx T(SW) 0x%x H(HW) 0x%llx H(SW) 0x%x H(D) 0x%llx DM 0x%llx GL 0x%llx R 0x%llx LIS 0x%llx AHGI 0x%llx TXT %u TXH %u DT %u DH %u FLNE %d DQF %u SLC 0x%llx\n"
/**
* sdma_seqfile_dump_sde() - debugfs dump of sde
* @s: seq file
@@ -1894,6 +1895,7 @@ void sdma_seqfile_dump_sde(struct seq_file *s, struct sdma_engine *sde)
head = sde->descq_head & sde->sdma_mask;
tail = ACCESS_ONCE(sde->descq_tail) & sde->sdma_mask;
seq_printf(s, SDE_FMT, sde->this_idx,
+ sde->cpu,
sdma_state_name(sde->state.current_state),
(unsigned long long)read_sde_csr(sde, SD(CTRL)),
(unsigned long long)read_sde_csr(sde, SD(STATUS)),
diff --git a/drivers/staging/rdma/hfi1/sdma.h b/drivers/staging/rdma/hfi1/sdma.h
index cc22d2ee2054..85701eed1585 100644
--- a/drivers/staging/rdma/hfi1/sdma.h
+++ b/drivers/staging/rdma/hfi1/sdma.h
@@ -410,8 +410,6 @@ struct sdma_engine {
u64 idle_mask;
u64 progress_mask;
/* private: */
- struct workqueue_struct *wq;
- /* private: */
volatile __le64 *head_dma; /* DMA'ed by chip */
/* private: */
dma_addr_t head_phys;
@@ -426,6 +424,8 @@ struct sdma_engine {
u32 sdma_mask;
/* private */
struct sdma_state state;
+ /* private */
+ int cpu;
/* private: */
u8 sdma_shift;
/* private: */
@@ -990,7 +990,9 @@ static inline void sdma_iowait_schedule(
struct sdma_engine *sde,
struct iowait *wait)
{
- iowait_schedule(wait, sde->wq);
+ struct hfi1_pportdata *ppd = sde->dd->pport;
+
+ iowait_schedule(wait, ppd->hfi1_wq, sde->cpu);
}
/* for use by interrupt handling */
diff --git a/drivers/staging/rdma/hfi1/ud.c b/drivers/staging/rdma/hfi1/ud.c
index d40d1a1e10aa..d6b4ba8a811e 100644
--- a/drivers/staging/rdma/hfi1/ud.c
+++ b/drivers/staging/rdma/hfi1/ud.c
@@ -383,6 +383,7 @@ int hfi1_make_ud_req(struct hfi1_qp *qp)
lrh0 |= (sc5 & 0xf) << 12;
qp->s_sc = sc5;
}
+ qp->s_sde = qp_to_sdma_engine(qp, qp->s_sc);
qp->s_hdr->ibh.lrh[0] = cpu_to_be16(lrh0);
qp->s_hdr->ibh.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */
qp->s_hdr->ibh.lrh[2] =
diff --git a/drivers/staging/rdma/hfi1/verbs.c b/drivers/staging/rdma/hfi1/verbs.c
index a5bcb1036aba..b2bb8ce97915 100644
--- a/drivers/staging/rdma/hfi1/verbs.c
+++ b/drivers/staging/rdma/hfi1/verbs.c
@@ -159,6 +159,8 @@ static inline struct hfi1_ucontext *to_iucontext(struct ib_ucontext
return container_of(ibucontext, struct hfi1_ucontext, ibucontext);
}
+static inline void _hfi1_schedule_send(struct hfi1_qp *qp);
+
/*
* Translate ib_wr_opcode into ib_wc_opcode.
*/
@@ -494,9 +496,9 @@ static int post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
nreq++;
}
bail:
- if (nreq && !call_send)
- hfi1_schedule_send(qp);
spin_unlock_irqrestore(&qp->s_lock, flags);
+ if (nreq && !call_send)
+ _hfi1_schedule_send(qp);
if (nreq && call_send)
hfi1_do_send(&qp->s_iowait.iowork);
return err;
@@ -994,7 +996,6 @@ int hfi1_verbs_send_dma(struct hfi1_qp *qp, struct ahg_ib_header *ahdr,
struct verbs_txreq *tx;
struct sdma_txreq *stx;
u64 pbc_flags = 0;
- struct sdma_engine *sde;
u8 sc5 = qp->s_sc;
int ret;
@@ -1015,12 +1016,7 @@ int hfi1_verbs_send_dma(struct hfi1_qp *qp, struct ahg_ib_header *ahdr,
if (IS_ERR(tx))
goto bail_tx;
- if (!qp->s_hdr->sde) {
- tx->sde = sde = qp_to_sdma_engine(qp, sc5);
- if (!sde)
- goto bail_no_sde;
- } else
- tx->sde = sde = qp->s_hdr->sde;
+ tx->sde = qp->s_sde;
if (likely(pbc == 0)) {
u32 vl = sc_to_vlt(dd_from_ibdev(qp->ibqp.device), sc5);
@@ -1035,17 +1031,15 @@ int hfi1_verbs_send_dma(struct hfi1_qp *qp, struct ahg_ib_header *ahdr,
if (qp->s_rdma_mr)
qp->s_rdma_mr = NULL;
tx->hdr_dwords = hdrwords + 2;
- ret = build_verbs_tx_desc(sde, ss, len, tx, ahdr, pbc);
+ ret = build_verbs_tx_desc(tx->sde, ss, len, tx, ahdr, pbc);
if (unlikely(ret))
goto bail_build;
trace_output_ibhdr(dd_from_ibdev(qp->ibqp.device), &ahdr->ibh);
- ret = sdma_send_txreq(sde, &qp->s_iowait, &tx->txreq);
+ ret = sdma_send_txreq(tx->sde, &qp->s_iowait, &tx->txreq);
if (unlikely(ret == -ECOMM))
goto bail_ecomm;
return ret;
-bail_no_sde:
- hfi1_put_txreq(tx);
bail_ecomm:
/* The current one got "sent" */
return 0;
@@ -2120,20 +2114,6 @@ void hfi1_unregister_ib_device(struct hfi1_devdata *dd)
vfree(dev->lk_table.table);
}
-/*
- * This must be called with s_lock held.
- */
-void hfi1_schedule_send(struct hfi1_qp *qp)
-{
- if (hfi1_send_ok(qp)) {
- struct hfi1_ibport *ibp =
- to_iport(qp->ibqp.device, qp->port_num);
- struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
-
- iowait_schedule(&qp->s_iowait, ppd->hfi1_wq);
- }
-}
-
void hfi1_cnp_rcv(struct hfi1_packet *packet)
{
struct hfi1_ibport *ibp = &packet->rcd->ppd->ibport_data;
diff --git a/drivers/staging/rdma/hfi1/verbs.h b/drivers/staging/rdma/hfi1/verbs.h
index e4a8a0d4ccf8..62c6e38cca45 100644
--- a/drivers/staging/rdma/hfi1/verbs.h
+++ b/drivers/staging/rdma/hfi1/verbs.h
@@ -436,7 +436,8 @@ struct hfi1_qp {
struct hfi1_swqe *s_wq; /* send work queue */
struct hfi1_mmap_info *ip;
struct ahg_ib_header *s_hdr; /* next packet header to send */
- u8 s_sc; /* SC[0..4] for next packet */
+ /* sc for UC/RC QPs - based on ah for UD */
+ u8 s_sc;
unsigned long timeout_jiffies; /* computed from timeout */
enum ib_mtu path_mtu;
@@ -841,7 +842,6 @@ static inline int hfi1_send_ok(struct hfi1_qp *qp)
/*
* This must be called with s_lock held.
*/
-void hfi1_schedule_send(struct hfi1_qp *qp);
void hfi1_bad_pqkey(struct hfi1_ibport *ibp, __be16 trap_num, u32 key, u32 sl,
u32 qp1, u32 qp2, __be16 lid1, __be16 lid2);
void hfi1_cap_mask_chg(struct hfi1_ibport *ibp);
@@ -1071,8 +1071,6 @@ int hfi1_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
int hfi1_get_rwqe(struct hfi1_qp *qp, int wr_id_only);
-void hfi1_migrate_qp(struct hfi1_qp *qp);
-
int hfi1_ruc_check_hdr(struct hfi1_ibport *ibp, struct hfi1_ib_header *hdr,
int has_grh, struct hfi1_qp *qp, u32 bth0);
--
1.8.2
* [PATCH 20/23] staging/rdma/hfi1: Load SBus firmware once per ASIC
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (16 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 19/23] staging/rdma/hfi: modify workqueue for parallelism ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 21/23] staging/rdma/hfi1: Add unit # to verbs txreq cache name ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (3 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Easwar Hariharan,
Ira Weiny
From: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Using fw_sbus_load to control the SBus firmware load doesn't scale across
multiple HFI1 cards in a single system. This patch ensures that the SBus
firmware is loaded only once per ASIC.
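In outline, the change replaces gating on the fw_sbus_load parameter alone
with gating on the parameter plus the once-per-ASIC init flag. A minimal
sketch of the resulting check (mirroring the load_pcie_firmware() hunk
below; the error path is simplified here):

	/*
	 * Sketch: download the shared SBus firmware only from the device
	 * that performs ASIC-level initialization (HFI1_DO_INIT_ASIC);
	 * the other HFI on the same ASIC skips the load.
	 */
	if (fw_sbus_load && (dd->flags & HFI1_DO_INIT_ASIC)) {
		turn_off_spicos(dd, SPICO_SBUS);
		ret = load_sbus_firmware(dd, &fw_sbus);
		if (ret)
			return ret;	/* simplified; see the hunk for the actual path */
	}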
Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/firmware.c | 31 ++++++++-----------------------
drivers/staging/rdma/hfi1/pcie.c | 10 ++++++----
2 files changed, 14 insertions(+), 27 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/firmware.c b/drivers/staging/rdma/hfi1/firmware.c
index ace75ce3da3f..b4bdcf341aac 100644
--- a/drivers/staging/rdma/hfi1/firmware.c
+++ b/drivers/staging/rdma/hfi1/firmware.c
@@ -1236,35 +1236,20 @@ int load_firmware(struct hfi1_devdata *dd)
{
int ret;
- if (fw_sbus_load || fw_fabric_serdes_load) {
+ if (fw_fabric_serdes_load) {
ret = acquire_hw_mutex(dd);
if (ret)
return ret;
set_sbus_fast_mode(dd);
- /*
- * The SBus contains part of the fabric firmware and so must
- * also be downloaded.
- */
- if (fw_sbus_load) {
- turn_off_spicos(dd, SPICO_SBUS);
- ret = load_sbus_firmware(dd, &fw_sbus);
- if (ret)
- goto clear;
- fw_sbus_load = 0;
- }
+ set_serdes_broadcast(dd, all_fabric_serdes_broadcast,
+ fabric_serdes_broadcast[dd->hfi1_id],
+ fabric_serdes_addrs[dd->hfi1_id],
+ NUM_FABRIC_SERDES);
+ turn_off_spicos(dd, SPICO_FABRIC);
+ ret = load_fabric_serdes_firmware(dd, &fw_fabric);
- if (fw_fabric_serdes_load) {
- set_serdes_broadcast(dd, all_fabric_serdes_broadcast,
- fabric_serdes_broadcast[dd->hfi1_id],
- fabric_serdes_addrs[dd->hfi1_id],
- NUM_FABRIC_SERDES);
- turn_off_spicos(dd, SPICO_FABRIC);
- ret = load_fabric_serdes_firmware(dd, &fw_fabric);
- }
-
-clear:
clear_sbus_fast_mode(dd);
release_hw_mutex(dd);
if (ret)
@@ -1583,7 +1568,7 @@ int load_pcie_firmware(struct hfi1_devdata *dd)
/* both firmware loads below use the SBus */
set_sbus_fast_mode(dd);
- if (fw_sbus_load) {
+ if (fw_sbus_load && (dd->flags & HFI1_DO_INIT_ASIC)) {
turn_off_spicos(dd, SPICO_SBUS);
ret = load_sbus_firmware(dd, &fw_sbus);
if (ret)
diff --git a/drivers/staging/rdma/hfi1/pcie.c b/drivers/staging/rdma/hfi1/pcie.c
index 3b50cdda1c0a..a956044459a2 100644
--- a/drivers/staging/rdma/hfi1/pcie.c
+++ b/drivers/staging/rdma/hfi1/pcie.c
@@ -947,15 +947,16 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
}
retry:
+
if (therm) {
- /* toggle SPICO_ENABLE to get back to the state
- just after the firmware load */
+ /*
+ * toggle SPICO_ENABLE to get back to the state
+ * just after the firmware load
+ */
sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
WRITE_SBUS_RECEIVER, 0x00000040);
sbus_request(dd, SBUS_MASTER_BROADCAST, 0x01,
WRITE_SBUS_RECEIVER, 0x00000140);
- dd_dev_info(dd, "%s: toggle SPICO_ENABLE to reset the bus\n",
- __func__);
}
/* step 3: download SBus Master firmware */
@@ -1198,6 +1199,7 @@ retry:
/* clear the DC reset */
write_csr(dd, CCE_DC_CTRL, 0);
+
/* Set the LED off */
if (is_a0(dd))
setextled(dd, 0);
--
1.8.2
* [PATCH 21/23] staging/rdma/hfi1: Add unit # to verbs txreq cache name
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (17 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 20/23] staging/rdma/hfi1: Load SBus firmware once per ASIC ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 22/23] staging/rdma/hfi1: add additional rc traces ira.weiny-ral2JQCrhuEAvxtiuMwx3w
` (2 subsequent siblings)
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Jubin John, Ira Weiny
From: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
The name used to create the verbs txreq cache was not qualified with the unit
number. This caused a panic when destroying the cache on dual-HFI systems.
Create a unique name that includes the unit number.
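The fix is a per-unit name built with snprintf() before creating the cache;
a minimal sketch of the pattern (taken from the hunk below):

	char buf[TXREQ_NAME_LEN];

	/* Qualify the cache name with the unit number so each HFI
	 * creates a distinct kmem cache, e.g. "hfi1_0_vtxreq_cache". */
	snprintf(buf, sizeof(buf), "hfi1_%u_vtxreq_cache", dd->unit);
	dev->verbs_txreq_cache = kmem_cache_create(buf,
						   sizeof(struct verbs_txreq),
						   0, SLAB_HWCACHE_ALIGN,
						   verbs_txreq_kmem_cache_ctor);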
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/verbs.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/staging/rdma/hfi1/verbs.c b/drivers/staging/rdma/hfi1/verbs.c
index b2bb8ce97915..514956af5b5d 100644
--- a/drivers/staging/rdma/hfi1/verbs.c
+++ b/drivers/staging/rdma/hfi1/verbs.c
@@ -129,6 +129,9 @@ static void verbs_sdma_complete(
int status,
int drained);
+/* Length of buffer to create verbs txreq cache name */
+#define TXREQ_NAME_LEN 24
+
/*
* Note that it is OK to post send work requests in the SQE and ERR
* states; hfi1_do_send() will process them and generate error
@@ -1899,6 +1902,7 @@ int hfi1_register_ib_device(struct hfi1_devdata *dd)
int ret;
size_t lcpysz = IB_DEVICE_NAME_MAX;
u16 descq_cnt;
+ char buf[TXREQ_NAME_LEN];
ret = hfi1_qp_init(dev);
if (ret)
@@ -1952,8 +1956,9 @@ int hfi1_register_ib_device(struct hfi1_devdata *dd)
descq_cnt = sdma_get_descq_cnt();
+ snprintf(buf, sizeof(buf), "hfi1_%u_vtxreq_cache", dd->unit);
/* SLAB_HWCACHE_ALIGN for AHG */
- dev->verbs_txreq_cache = kmem_cache_create("hfi1_vtxreq_cache",
+ dev->verbs_txreq_cache = kmem_cache_create(buf,
sizeof(struct verbs_txreq),
0, SLAB_HWCACHE_ALIGN,
verbs_txreq_kmem_cache_ctor);
--
1.8.2
* [PATCH 22/23] staging/rdma/hfi1: add additional rc traces
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (18 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 21/23] staging/rdma/hfi1: Add unit # to verbs txreq cache name ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:43 ` [PATCH 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294 ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:54 ` [PATCH 00/23] Update driver " Greg KH
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Ira Weiny
From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Add additional rc traces to aid in debugging rc retry logic.
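Each new trace point amounts to one DEFINE_EVENT() stanza against the shared
event class plus a one-line call at the point of interest; a sketch of the
timeout hook (both pieces appear in the hunks below):

	/* All events stamped from hfi1_rc_template take (qp, psn). */
	DEFINE_EVENT(hfi1_rc_template, hfi1_rc_timeout,
		TP_PROTO(struct hfi1_qp *qp, u32 psn),
		TP_ARGS(qp, psn)
	);

	/* ... and the corresponding call site in rc_timeout(): */
	trace_hfi1_rc_timeout(qp, qp->s_last_psn + 1);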
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/rc.c | 4 ++++
drivers/staging/rdma/hfi1/trace.c | 4 ++--
drivers/staging/rdma/hfi1/trace.h | 48 +++++++++++++++++++++++++++++----------
3 files changed, 42 insertions(+), 14 deletions(-)
diff --git a/drivers/staging/rdma/hfi1/rc.c b/drivers/staging/rdma/hfi1/rc.c
index 8e4f421b2b58..e25c70b9de34 100644
--- a/drivers/staging/rdma/hfi1/rc.c
+++ b/drivers/staging/rdma/hfi1/rc.c
@@ -927,6 +927,7 @@ static void rc_timeout(unsigned long arg)
ibp->n_rc_timeouts++;
qp->s_flags &= ~HFI1_S_TIMER;
del_timer(&qp->s_timer);
+ trace_hfi1_rc_timeout(qp, qp->s_last_psn + 1);
restart_rc(qp, qp->s_last_psn + 1, 1);
hfi1_schedule_send(qp);
}
@@ -1442,6 +1443,8 @@ static void rc_rcv_resp(struct hfi1_ibport *ibp,
spin_lock_irqsave(&qp->s_lock, flags);
+ trace_hfi1_rc_ack(qp, psn);
+
/* Ignore invalid responses. */
if (cmp_psn(psn, qp->s_next_psn) >= 0)
goto ack_done;
@@ -1630,6 +1633,7 @@ static noinline int rc_rcv_error(struct hfi1_other_headers *ohdr, void *data,
u8 i, prev;
int old_req;
+ trace_hfi1_rc_rcv_error(qp, psn);
if (diff > 0) {
/*
* Packet sequence error.
diff --git a/drivers/staging/rdma/hfi1/trace.c b/drivers/staging/rdma/hfi1/trace.c
index 70ad7b9fc1ce..f55b75194847 100644
--- a/drivers/staging/rdma/hfi1/trace.c
+++ b/drivers/staging/rdma/hfi1/trace.c
@@ -126,13 +126,13 @@ const char *parse_everbs_hdrs(
case OP(RC, ACKNOWLEDGE):
trace_seq_printf(p, AETH_PRN,
be32_to_cpu(eh->aeth) >> 24,
- be32_to_cpu(eh->aeth) & HFI1_QPN_MASK);
+ be32_to_cpu(eh->aeth) & HFI1_MSN_MASK);
break;
/* aeth + atomicacketh */
case OP(RC, ATOMIC_ACKNOWLEDGE):
trace_seq_printf(p, AETH_PRN " " ATOMICACKETH_PRN,
(be32_to_cpu(eh->at.aeth) >> 24) & 0xff,
- be32_to_cpu(eh->at.aeth) & HFI1_QPN_MASK,
+ be32_to_cpu(eh->at.aeth) & HFI1_MSN_MASK,
(unsigned long long)ib_u64_get(eh->at.atomic_ack_eth));
break;
/* atomiceth */
diff --git a/drivers/staging/rdma/hfi1/trace.h b/drivers/staging/rdma/hfi1/trace.h
index 0354dca9a6f4..db51cf55c538 100644
--- a/drivers/staging/rdma/hfi1/trace.h
+++ b/drivers/staging/rdma/hfi1/trace.h
@@ -1290,37 +1290,61 @@ TRACE_EVENT(hfi1_sdma_state,
#undef TRACE_SYSTEM
#define TRACE_SYSTEM hfi1_rc
-DECLARE_EVENT_CLASS(hfi1_sdma_rc,
+DECLARE_EVENT_CLASS(hfi1_rc_template,
TP_PROTO(struct hfi1_qp *qp, u32 psn),
TP_ARGS(qp, psn),
TP_STRUCT__entry(
DD_DEV_ENTRY(dd_from_ibdev(qp->ibqp.device))
__field(u32, qpn)
- __field(u32, flags)
+ __field(u32, s_flags)
__field(u32, psn)
- __field(u32, sending_psn)
- __field(u32, sending_hpsn)
+ __field(u32, s_psn)
+ __field(u32, s_next_psn)
+ __field(u32, s_sending_psn)
+ __field(u32, s_sending_hpsn)
+ __field(u32, r_psn)
),
TP_fast_assign(
DD_DEV_ASSIGN(dd_from_ibdev(qp->ibqp.device))
__entry->qpn = qp->ibqp.qp_num;
- __entry->flags = qp->s_flags;
+ __entry->s_flags = qp->s_flags;
__entry->psn = psn;
- __entry->sending_psn = qp->s_sending_psn;
- __entry->sending_hpsn = qp->s_sending_hpsn;
+ __entry->s_psn = qp->s_psn;
+ __entry->s_next_psn = qp->s_next_psn;
+ __entry->s_sending_psn = qp->s_sending_psn;
+ __entry->s_sending_hpsn = qp->s_sending_hpsn;
+ __entry->r_psn = qp->r_psn;
),
TP_printk(
- "[%s] qpn 0x%x flags 0x%x psn 0x%x sending_psn 0x%x sending_hpsn 0x%x",
+ "[%s] qpn 0x%x s_flags 0x%x psn 0x%x s_psn 0x%x s_next_psn 0x%x s_sending_psn 0x%x sending_hpsn 0x%x r_psn 0x%x",
__get_str(dev),
__entry->qpn,
- __entry->flags,
+ __entry->s_flags,
__entry->psn,
- __entry->sending_psn,
- __entry->sending_psn
+ __entry->s_psn,
+ __entry->s_next_psn,
+ __entry->s_sending_psn,
+ __entry->s_sending_hpsn,
+ __entry->r_psn
)
);
-DEFINE_EVENT(hfi1_sdma_rc, hfi1_rc_sendcomplete,
+DEFINE_EVENT(hfi1_rc_template, hfi1_rc_sendcomplete,
+ TP_PROTO(struct hfi1_qp *qp, u32 psn),
+ TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_rc_ack,
+ TP_PROTO(struct hfi1_qp *qp, u32 psn),
+ TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_rc_timeout,
+ TP_PROTO(struct hfi1_qp *qp, u32 psn),
+ TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_rc_rcv_error,
TP_PROTO(struct hfi1_qp *qp, u32 psn),
TP_ARGS(qp, psn)
);
--
1.8.2
* [PATCH 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (19 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 22/23] staging/rdma/hfi1: add additional rc traces ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:43 ` ira.weiny-ral2JQCrhuEAvxtiuMwx3w
2015-10-19 16:54 ` [PATCH 00/23] Update driver " Greg KH
21 siblings, 0 replies; 31+ messages in thread
From: ira.weiny-ral2JQCrhuEAvxtiuMwx3w @ 2015-10-19 16:43 UTC (permalink / raw)
To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w,
mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w, Jubin John, Ira Weiny
From: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
drivers/staging/rdma/hfi1/common.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/rdma/hfi1/common.h b/drivers/staging/rdma/hfi1/common.h
index 7809093eb55e..5dd92720faae 100644
--- a/drivers/staging/rdma/hfi1/common.h
+++ b/drivers/staging/rdma/hfi1/common.h
@@ -205,7 +205,7 @@
* to the driver itself, not the software interfaces it supports.
*/
#ifndef HFI1_DRIVER_VERSION_BASE
-#define HFI1_DRIVER_VERSION_BASE "0.9-248"
+#define HFI1_DRIVER_VERSION_BASE "0.9-294"
#endif
/* create the final driver version string */
--
1.8.2
* Re: [PATCH 00/23] Update driver to 0.9-294
[not found] ` <1445273027-29634-1-git-send-email-ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
` (20 preceding siblings ...)
2015-10-19 16:43 ` [PATCH 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294 ira.weiny-ral2JQCrhuEAvxtiuMwx3w
@ 2015-10-19 16:54 ` Greg KH
2015-10-19 18:16 ` Weiny, Ira
21 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2015-10-19 16:54 UTC (permalink / raw)
To: ira.weiny-ral2JQCrhuEAvxtiuMwx3w
Cc: devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w
On Mon, Oct 19, 2015 at 12:43:24PM -0400, ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> From: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> The following series has bug fixes and updates to the staging hfi1 driver.
Why are you adding new functionality to this driver before it is moved
out of drivers/staging/? I _REALLY_ don't like to see that, because it
implies that you are spending time and energy on new stuff rather than
fixing up the known issues.
Please redo this patch series and don't add new stuff.
thanks,
greg k-h
* RE: [PATCH 00/23] Update driver to 0.9-294
2015-10-19 16:54 ` [PATCH 00/23] Update driver " Greg KH
@ 2015-10-19 18:16 ` Weiny, Ira
[not found] ` <2807E5FD2F6FDA4886F6618EAC48510E1CBCA641-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 31+ messages in thread
From: Weiny, Ira @ 2015-10-19 18:16 UTC (permalink / raw)
To: Greg KH
Cc: devel@driverdev.osuosl.org, linux-rdma@vger.kernel.org,
dledford@redhat.com, Dalessandro, Dennis
>
> On Mon, Oct 19, 2015 at 12:43:24PM -0400, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > The following series has bug fixes and updates to the staging hfi1 driver.
>
> Why are you adding new functionality to this driver before it is moved out of
> drivers/staging/ ? I _REALLY_ don't like to see that, because it implies that you
> are spending time and energy on new stuff, not fixing up the known issues
> instead.
Perhaps I did not choose my words carefully enough.
The largest issue on the TODO list is refactoring the code to be shared between the hfi1 and qib drivers. Because the hfi1 and qib hardware are similar, the initial code looked similar as well; however, our performance tuning on hfi1 has revealed that some changes to the hfi1 code will be required to fully utilize the hardware. We would like to get these updates in first, to make sure the refactoring is done properly for both hardware types.
>
> Please redo this patch series and don't add new stuff.
>
I can do this, but it would force us to do the refactoring work twice in order for it to be done correctly (that is, properly supporting both hardware types). Of the 23 patches I sent, only one contains significant updates that do not conflict with this refactoring work:
[PATCH 15/23] staging/rdma/hfi1: Implement Expected Receive TID caching
However, without this patch, users' performance will be impacted.
In the interest of full disclosure, we still have at least two more series, possibly three, similar to this one: bug fixes and performance improvements.
Given the above explanation, would you find it acceptable to take these "updates"? I can certainly rework patch 15 to separate out the code cleanup, as per your other email.
Ira