linux-rdma.vger.kernel.org archive mirror
* [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4
@ 2014-03-14 16:22 Hariprasad Shenai
  2014-03-14 16:22 ` [PATCH net-next 2/2] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes Hariprasad Shenai
       [not found] ` <1394814128-8815-1-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 2 replies; 25+ messages in thread
From: Hariprasad Shenai @ 2014-03-14 16:22 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, roland-BHEL68pLQRGGvPXPguhicg,
	dm-ut6Up61K2wZBDgjK7y7TUQ, swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	leedom-ut6Up61K2wZBDgjK7y7TUQ, santosh-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ

Hi All,

This patch series provides fixes for Chelsio T4/T5 adapters related to
doorbell (DB) drop avoidance, along with a small fix related to keepalive
negative advice handling in iw_cxgb4.

The patch series is created against David Miller's 'net-next' tree and
includes patches for the cxgb4 and iw_cxgb4 drivers.

We would like this patch series to be merged via David Miller's
'net-next' tree.

We have copied the maintainers of the respective drivers. Kindly review the
changes and let us know if you have any comments.

Thanks


Steve Wise (2):
  cxgb4/iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice
  cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes

 drivers/infiniband/hw/cxgb4/cm.c                |  24 ++--
 drivers/infiniband/hw/cxgb4/device.c            | 177 ++++++++++++++----------
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   9 +-
 drivers/infiniband/hw/cxgb4/provider.c          |  43 +++++-
 drivers/infiniband/hw/cxgb4/qp.c                | 140 +++++++++----------
 drivers/infiniband/hw/cxgb4/t4.h                |   6 +
 drivers/infiniband/hw/cxgb4/user.h              |   5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  87 +++++++-----
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |   8 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   1 +
 11 files changed, 299 insertions(+), 202 deletions(-)

-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH net-next 1/2] cxgb4/iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice
       [not found] ` <1394814128-8815-1-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2014-03-14 16:22   ` Hariprasad Shenai
  2014-03-15  2:44   ` [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4 David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: Hariprasad Shenai @ 2014-03-14 16:22 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, roland-BHEL68pLQRGGvPXPguhicg,
	dm-ut6Up61K2wZBDgjK7y7TUQ, swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	leedom-ut6Up61K2wZBDgjK7y7TUQ, santosh-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ

From: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

Based on original work by Anand Priyadarshee <anandp-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>.

Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/hw/cxgb4/cm.c            | 24 ++++++++++++------------
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |  1 +
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index d286bde..7e98a58 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1647,6 +1647,15 @@ static inline int act_open_has_tid(int status)
 	       status != CPL_ERR_ARP_MISS;
 }
 
+/* Returns whether a CPL status conveys negative advice.
+ */
+static int is_neg_adv(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+	       status == CPL_ERR_PERSIST_NEG_ADVICE ||
+	       status == CPL_ERR_KEEPALV_NEG_ADVICE;
+}
+
 #define ACT_OPEN_RETRY_COUNT 2
 
 static int import_ep(struct c4iw_ep *ep, int iptype, __u8 *peer_ip,
@@ -1835,7 +1844,7 @@ static int act_open_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 	PDBG("%s ep %p atid %u status %u errno %d\n", __func__, ep, atid,
 	     status, status2errno(status));
 
-	if (status == CPL_ERR_RTX_NEG_ADVICE) {
+	if (is_neg_adv(status)) {
 		printk(KERN_WARNING MOD "Connection problems for atid %u\n",
 			atid);
 		return 0;
@@ -2265,15 +2274,6 @@ static int peer_close(struct c4iw_dev *dev, struct sk_buff *skb)
 	return 0;
 }
 
-/*
- * Returns whether an ABORT_REQ_RSS message is a negative advice.
- */
-static int is_neg_adv_abort(unsigned int status)
-{
-	return status == CPL_ERR_RTX_NEG_ADVICE ||
-	       status == CPL_ERR_PERSIST_NEG_ADVICE;
-}
-
 static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 {
 	struct cpl_abort_req_rss *req = cplhdr(skb);
@@ -2287,7 +2287,7 @@ static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 	unsigned int tid = GET_TID(req);
 
 	ep = lookup_tid(t, tid);
-	if (is_neg_adv_abort(req->status)) {
+	if (is_neg_adv(req->status)) {
 		PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
 		     ep->hwtid);
 		return 0;
@@ -3570,7 +3570,7 @@ static int peer_abort_intr(struct c4iw_dev *dev, struct sk_buff *skb)
 		kfree_skb(skb);
 		return 0;
 	}
-	if (is_neg_adv_abort(req->status)) {
+	if (is_neg_adv(req->status)) {
 		PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
 		     ep->hwtid);
 		kfree_skb(skb);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index cd6874b..f2738c7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -116,6 +116,7 @@ enum CPL_error {
 	CPL_ERR_KEEPALIVE_TIMEDOUT = 34,
 	CPL_ERR_RTX_NEG_ADVICE     = 35,
 	CPL_ERR_PERSIST_NEG_ADVICE = 36,
+	CPL_ERR_KEEPALV_NEG_ADVICE = 37,
 	CPL_ERR_ABORT_FAILED       = 42,
 	CPL_ERR_IWARP_FLM          = 50,
 };
-- 
1.8.4



* [PATCH net-next 2/2] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes
  2014-03-14 16:22 [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4 Hariprasad Shenai
@ 2014-03-14 16:22 ` Hariprasad Shenai
       [not found]   ` <1394814128-8815-3-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
       [not found] ` <1394814128-8815-1-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Hariprasad Shenai @ 2014-03-14 16:22 UTC (permalink / raw)
  To: netdev, linux-rdma
  Cc: davem, roland, dm, swise, leedom, santosh, kumaras, nirranjan,
	hariprasad

From: Steve Wise <swise@opengridcomputing.com>

The current logic is slow to disable user DB usage and also fails to avoid
DB FIFO drops under heavy load. This commit fixes these deficiencies by
notifying the ULDs of potential DB problems more efficiently, and by
implementing a smoother flow control algorithm in iw_cxgb4, which is the
ULD that puts the most load on the DB FIFO.

Design:

cxgb4:

Direct ULD callback from the DB FULL/DROP interrupt handler.  This allows
the ULD to stop doing user DB writes as quickly as possible.

While user DB usage is disabled, the LLD accumulates DB write events
for its queues.  Once DB usage is re-enabled, a single DB write is
done for each queue with its accumulated write count.  This reduces the
load on the DB FIFO when re-enabling.

iw_cxgb4:

Instead of marking each QP to indicate that DB writes are disabled, we
create a device-global status page that each user process maps.  This
allows iw_cxgb4 to set a single bit to disable all DB writes for all
user QPs, instead of traversing the idr of all the active QPs.  If
libcxgb4 doesn't support this, we fall back to the old approach of marking
each QP, so the new driver still works with an older libcxgb4.

When the LLD upcalls iw_cxgb4 indicating DB FULL, we disable all DB writes
via the status page and transition the DB state to STOPPED.  As user
processes see that DB writes are disabled, they call into iw_cxgb4
to submit their DB write events.  Since the DB state is STOPPED,
the QP trying to write gets enqueued on a new DB "flow control" list.
As subsequent DB writes are submitted for a flow-controlled QP, the
writes are accumulated per QP on the flow control list.  So all the
user QPs that are actively ringing the DB get put on this list, and
the number of writes they request is accumulated.

When the LLD upcalls iw_cxgb4 indicating DB EMPTY, which happens in a
workqueue context, we change the DB state to FLOW_CONTROL and begin
resuming the QPs that are on the flow control list.  This logic runs until
the flow control list is empty or we exit FLOW_CONTROL mode (due to
a DB DROP upcall, for example).  QPs are removed from this list, and
their accumulated DB write counts are written to the DB FIFO.  Sets of QPs,
called chunks in the code, are removed at one time. The chunk size is 64,
so 64 QPs are resumed at a time, and before the next chunk is resumed, the
logic waits (blocks) for the DB FIFO to drain.  This prevents resuming too
quickly and overflowing the FIFO.  Once the flow control list is empty,
the DB state transitions back to NORMAL and user QPs are again allowed
to write directly to the user DB register.

The algorithm is designed so that if the DB write load is high enough,
all DB writes are submitted by the kernel using this flow-controlled
approach to avoid DB drops.  As the load lightens, we resume direct DB
writes by user applications.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/device.c            | 177 ++++++++++++++----------
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   9 +-
 drivers/infiniband/hw/cxgb4/provider.c          |  43 +++++-
 drivers/infiniband/hw/cxgb4/qp.c                | 140 +++++++++----------
 drivers/infiniband/hw/cxgb4/t4.h                |   6 +
 drivers/infiniband/hw/cxgb4/user.h              |   5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  87 +++++++-----
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |   8 +-
 9 files changed, 286 insertions(+), 190 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 4a03385..ba7335f 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -64,6 +64,10 @@ struct uld_ctx {
 static LIST_HEAD(uld_ctx_list);
 static DEFINE_MUTEX(dev_mutex);
 
+#define DB_FC_RESUME_SIZE 64
+#define DB_FC_RESUME_DELAY 1
+#define DB_FC_DRAIN_THRESH 0
+
 static struct dentry *c4iw_debugfs_root;
 
 struct c4iw_debugfs_data {
@@ -282,7 +286,7 @@ static const struct file_operations stag_debugfs_fops = {
 	.llseek  = default_llseek,
 };
 
-static char *db_state_str[] = {"NORMAL", "FLOW_CONTROL", "RECOVERY"};
+static char *db_state_str[] = {"NORMAL", "FLOW_CONTROL", "RECOVERY", "STOPPED"};
 
 static int stats_show(struct seq_file *seq, void *v)
 {
@@ -311,9 +315,10 @@ static int stats_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "  DB FULL: %10llu\n", dev->rdev.stats.db_full);
 	seq_printf(seq, " DB EMPTY: %10llu\n", dev->rdev.stats.db_empty);
 	seq_printf(seq, "  DB DROP: %10llu\n", dev->rdev.stats.db_drop);
-	seq_printf(seq, " DB State: %s Transitions %llu\n",
+	seq_printf(seq, " DB State: %s Transitions %llu FC Interruptions %llu\n",
 		   db_state_str[dev->db_state],
-		   dev->rdev.stats.db_state_transitions);
+		   dev->rdev.stats.db_state_transitions,
+		   dev->rdev.stats.db_fc_interruptions);
 	seq_printf(seq, "TCAM_FULL: %10llu\n", dev->rdev.stats.tcam_full);
 	seq_printf(seq, "ACT_OFLD_CONN_FAILS: %10llu\n",
 		   dev->rdev.stats.act_ofld_conn_fails);
@@ -643,6 +648,12 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
 		printk(KERN_ERR MOD "error %d initializing ocqp pool\n", err);
 		goto err4;
 	}
+	rdev->status_page = (struct t4_dev_status_page *)
+			    __get_free_page(GFP_KERNEL);
+	if (!rdev->status_page) {
+		pr_err(MOD "error allocating status page\n");
+		goto err4;
+	}
 	return 0;
 err4:
 	c4iw_rqtpool_destroy(rdev);
@@ -656,6 +667,7 @@ err1:
 
 static void c4iw_rdev_close(struct c4iw_rdev *rdev)
 {
+	free_page((unsigned long)rdev->status_page);
 	c4iw_pblpool_destroy(rdev);
 	c4iw_rqtpool_destroy(rdev);
 	c4iw_destroy_resource(&rdev->resource);
@@ -703,18 +715,6 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 		pr_info("%s: On-Chip Queues not supported on this device.\n",
 			pci_name(infop->pdev));
 
-	if (!is_t4(infop->adapter_type)) {
-		if (!allow_db_fc_on_t5) {
-			db_fc_threshold = 100000;
-			pr_info("DB Flow Control Disabled.\n");
-		}
-
-		if (!allow_db_coalescing_on_t5) {
-			db_coalescing_threshold = -1;
-			pr_info("DB Coalescing Disabled.\n");
-		}
-	}
-
 	devp = (struct c4iw_dev *)ib_alloc_device(sizeof(*devp));
 	if (!devp) {
 		printk(KERN_ERR MOD "Cannot allocate ib device\n");
@@ -749,6 +749,7 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 	spin_lock_init(&devp->lock);
 	mutex_init(&devp->rdev.stats.lock);
 	mutex_init(&devp->db_mutex);
+	INIT_LIST_HEAD(&devp->db_fc_list);
 
 	if (c4iw_debugfs_root) {
 		devp->debugfs_root = debugfs_create_dir(
@@ -977,13 +978,16 @@ static int disable_qp_db(int id, void *p, void *data)
 
 static void stop_queues(struct uld_ctx *ctx)
 {
-	spin_lock_irq(&ctx->dev->lock);
-	if (ctx->dev->db_state == NORMAL) {
-		ctx->dev->rdev.stats.db_state_transitions++;
-		ctx->dev->db_state = FLOW_CONTROL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->dev->lock, flags);
+	ctx->dev->rdev.stats.db_state_transitions++;
+	ctx->dev->db_state = STOPPED;
+	if (ctx->dev->rdev.flags & T4_STATUS_PAGE_DISABLED)
 		idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
-	}
-	spin_unlock_irq(&ctx->dev->lock);
+	else
+		ctx->dev->rdev.status_page->db_off = 1;
+	spin_unlock_irqrestore(&ctx->dev->lock, flags);
 }
 
 static int enable_qp_db(int id, void *p, void *data)
@@ -994,15 +998,70 @@ static int enable_qp_db(int id, void *p, void *data)
 	return 0;
 }
 
+static void resume_rc_qp(struct c4iw_qp *qp)
+{
+	spin_lock(&qp->lock);
+	t4_ring_sq_db(&qp->wq, qp->wq.sq.wq_pidx_inc);
+	qp->wq.sq.wq_pidx_inc = 0;
+	t4_ring_rq_db(&qp->wq, qp->wq.rq.wq_pidx_inc);
+	qp->wq.rq.wq_pidx_inc = 0;
+	spin_unlock(&qp->lock);
+}
+
+static void resume_a_chunk(struct uld_ctx *ctx)
+{
+	int i;
+	struct c4iw_qp *qp;
+
+	for (i = 0; i < DB_FC_RESUME_SIZE; i++) {
+		qp = list_first_entry(&ctx->dev->db_fc_list, struct c4iw_qp,
+				      db_fc_entry);
+		list_del_init(&qp->db_fc_entry);
+		resume_rc_qp(qp);
+		if (list_empty(&ctx->dev->db_fc_list))
+			break;
+	}
+}
+
 static void resume_queues(struct uld_ctx *ctx)
 {
 	spin_lock_irq(&ctx->dev->lock);
-	if (ctx->dev->qpcnt <= db_fc_threshold &&
-	    ctx->dev->db_state == FLOW_CONTROL) {
-		ctx->dev->db_state = NORMAL;
-		ctx->dev->rdev.stats.db_state_transitions++;
-		idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
+	if (ctx->dev->db_state != STOPPED)
+		goto out;
+	ctx->dev->db_state = FLOW_CONTROL;
+	while (1) {
+		if (list_empty(&ctx->dev->db_fc_list)) {
+			WARN_ON(ctx->dev->db_state != FLOW_CONTROL);
+			ctx->dev->db_state = NORMAL;
+			ctx->dev->rdev.stats.db_state_transitions++;
+			if (ctx->dev->rdev.flags & T4_STATUS_PAGE_DISABLED) {
+				idr_for_each(&ctx->dev->qpidr, enable_qp_db,
+					     NULL);
+			} else {
+				ctx->dev->rdev.status_page->db_off = 0;
+			}
+			break;
+		} else {
+			if (cxgb4_dbfifo_count(ctx->dev->rdev.lldi.ports[0], 1)
+			    < (ctx->dev->rdev.lldi.dbfifo_int_thresh <<
+			       DB_FC_DRAIN_THRESH)) {
+				resume_a_chunk(ctx);
+			}
+			if (!list_empty(&ctx->dev->db_fc_list)) {
+				spin_unlock_irq(&ctx->dev->lock);
+				if (DB_FC_RESUME_DELAY) {
+					set_current_state(TASK_UNINTERRUPTIBLE);
+					schedule_timeout(DB_FC_RESUME_DELAY);
+				}
+				spin_lock_irq(&ctx->dev->lock);
+				if (ctx->dev->db_state != FLOW_CONTROL)
+					break;
+			}
+		}
 	}
+out:
+	if (ctx->dev->db_state != NORMAL)
+		ctx->dev->rdev.stats.db_fc_interruptions++;
 	spin_unlock_irq(&ctx->dev->lock);
 }
 
@@ -1028,12 +1087,12 @@ static int count_qps(int id, void *p, void *data)
 	return 0;
 }
 
-static void deref_qps(struct qp_list qp_list)
+static void deref_qps(struct qp_list *qp_list)
 {
 	int idx;
 
-	for (idx = 0; idx < qp_list.idx; idx++)
-		c4iw_qp_rem_ref(&qp_list.qps[idx]->ibqp);
+	for (idx = 0; idx < qp_list->idx; idx++)
+		c4iw_qp_rem_ref(&qp_list->qps[idx]->ibqp);
 }
 
 static void recover_lost_dbs(struct uld_ctx *ctx, struct qp_list *qp_list)
@@ -1044,17 +1103,22 @@ static void recover_lost_dbs(struct uld_ctx *ctx, struct qp_list *qp_list)
 	for (idx = 0; idx < qp_list->idx; idx++) {
 		struct c4iw_qp *qp = qp_list->qps[idx];
 
+		spin_lock_irq(&qp->rhp->lock);
+		spin_lock(&qp->lock);
 		ret = cxgb4_sync_txq_pidx(qp->rhp->rdev.lldi.ports[0],
 					  qp->wq.sq.qid,
 					  t4_sq_host_wq_pidx(&qp->wq),
 					  t4_sq_wq_size(&qp->wq));
 		if (ret) {
-			printk(KERN_ERR MOD "%s: Fatal error - "
+			pr_err(MOD "%s: Fatal error - "
 			       "DB overflow recovery failed - "
 			       "error syncing SQ qid %u\n",
 			       pci_name(ctx->lldi.pdev), qp->wq.sq.qid);
+			spin_unlock(&qp->lock);
+			spin_unlock_irq(&qp->rhp->lock);
 			return;
 		}
+		qp->wq.sq.wq_pidx_inc = 0;
 
 		ret = cxgb4_sync_txq_pidx(qp->rhp->rdev.lldi.ports[0],
 					  qp->wq.rq.qid,
@@ -1062,12 +1126,17 @@ static void recover_lost_dbs(struct uld_ctx *ctx, struct qp_list *qp_list)
 					  t4_rq_wq_size(&qp->wq));
 
 		if (ret) {
-			printk(KERN_ERR MOD "%s: Fatal error - "
+			pr_err(MOD "%s: Fatal error - "
 			       "DB overflow recovery failed - "
 			       "error syncing RQ qid %u\n",
 			       pci_name(ctx->lldi.pdev), qp->wq.rq.qid);
+			spin_unlock(&qp->lock);
+			spin_unlock_irq(&qp->rhp->lock);
 			return;
 		}
+		qp->wq.rq.wq_pidx_inc = 0;
+		spin_unlock(&qp->lock);
+		spin_unlock_irq(&qp->rhp->lock);
 
 		/* Wait for the dbfifo to drain */
 		while (cxgb4_dbfifo_count(qp->rhp->rdev.lldi.ports[0], 1) > 0) {
@@ -1083,36 +1152,22 @@ static void recover_queues(struct uld_ctx *ctx)
 	struct qp_list qp_list;
 	int ret;
 
-	/* lock out kernel db ringers */
-	mutex_lock(&ctx->dev->db_mutex);
-
-	/* put all queues in to recovery mode */
-	spin_lock_irq(&ctx->dev->lock);
-	ctx->dev->db_state = RECOVERY;
-	ctx->dev->rdev.stats.db_state_transitions++;
-	idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
-	spin_unlock_irq(&ctx->dev->lock);
-
 	/* slow everybody down */
 	set_current_state(TASK_UNINTERRUPTIBLE);
 	schedule_timeout(usecs_to_jiffies(1000));
 
-	/* Wait for the dbfifo to completely drain. */
-	while (cxgb4_dbfifo_count(ctx->dev->rdev.lldi.ports[0], 1) > 0) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		schedule_timeout(usecs_to_jiffies(10));
-	}
-
 	/* flush the SGE contexts */
 	ret = cxgb4_flush_eq_cache(ctx->dev->rdev.lldi.ports[0]);
 	if (ret) {
 		printk(KERN_ERR MOD "%s: Fatal error - DB overflow recovery failed\n",
 		       pci_name(ctx->lldi.pdev));
-		goto out;
+		return;
 	}
 
 	/* Count active queues so we can build a list of queues to recover */
 	spin_lock_irq(&ctx->dev->lock);
+	WARN_ON(ctx->dev->db_state != STOPPED);
+	ctx->dev->db_state = RECOVERY;
 	idr_for_each(&ctx->dev->qpidr, count_qps, &count);
 
 	qp_list.qps = kzalloc(count * sizeof *qp_list.qps, GFP_ATOMIC);
@@ -1120,7 +1175,7 @@ static void recover_queues(struct uld_ctx *ctx)
 		printk(KERN_ERR MOD "%s: Fatal error - DB overflow recovery failed\n",
 		       pci_name(ctx->lldi.pdev));
 		spin_unlock_irq(&ctx->dev->lock);
-		goto out;
+		return;
 	}
 	qp_list.idx = 0;
 
@@ -1133,29 +1188,13 @@ static void recover_queues(struct uld_ctx *ctx)
 	recover_lost_dbs(ctx, &qp_list);
 
 	/* we're almost done!  deref the qps and clean up */
-	deref_qps(qp_list);
+	deref_qps(&qp_list);
 	kfree(qp_list.qps);
 
-	/* Wait for the dbfifo to completely drain again */
-	while (cxgb4_dbfifo_count(ctx->dev->rdev.lldi.ports[0], 1) > 0) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		schedule_timeout(usecs_to_jiffies(10));
-	}
-
-	/* resume the queues */
 	spin_lock_irq(&ctx->dev->lock);
-	if (ctx->dev->qpcnt > db_fc_threshold)
-		ctx->dev->db_state = FLOW_CONTROL;
-	else {
-		ctx->dev->db_state = NORMAL;
-		idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
-	}
-	ctx->dev->rdev.stats.db_state_transitions++;
+	WARN_ON(ctx->dev->db_state != RECOVERY);
+	ctx->dev->db_state = STOPPED;
 	spin_unlock_irq(&ctx->dev->lock);
-
-out:
-	/* start up kernel db ringers again */
-	mutex_unlock(&ctx->dev->db_mutex);
 }
 
 static int c4iw_uld_control(void *handle, enum cxgb4_control control, ...)
@@ -1165,9 +1204,7 @@ static int c4iw_uld_control(void *handle, enum cxgb4_control control, ...)
 	switch (control) {
 	case CXGB4_CONTROL_DB_FULL:
 		stop_queues(ctx);
-		mutex_lock(&ctx->dev->rdev.stats.lock);
 		ctx->dev->rdev.stats.db_full++;
-		mutex_unlock(&ctx->dev->rdev.stats.lock);
 		break;
 	case CXGB4_CONTROL_DB_EMPTY:
 		resume_queues(ctx);
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 23eaeab..eb18f9b 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -109,6 +109,7 @@ struct c4iw_dev_ucontext {
 
 enum c4iw_rdev_flags {
 	T4_FATAL_ERROR = (1<<0),
+	T4_STATUS_PAGE_DISABLED = (1<<1),
 };
 
 struct c4iw_stat {
@@ -130,6 +131,7 @@ struct c4iw_stats {
 	u64  db_empty;
 	u64  db_drop;
 	u64  db_state_transitions;
+	u64  db_fc_interruptions;
 	u64  tcam_full;
 	u64  act_ofld_conn_fails;
 	u64  pas_ofld_conn_fails;
@@ -150,6 +152,7 @@ struct c4iw_rdev {
 	unsigned long oc_mw_pa;
 	void __iomem *oc_mw_kva;
 	struct c4iw_stats stats;
+	struct t4_dev_status_page *status_page;
 };
 
 static inline int c4iw_fatal_error(struct c4iw_rdev *rdev)
@@ -211,7 +214,8 @@ static inline int c4iw_wait_for_reply(struct c4iw_rdev *rdev,
 enum db_state {
 	NORMAL = 0,
 	FLOW_CONTROL = 1,
-	RECOVERY = 2
+	RECOVERY = 2,
+	STOPPED = 3
 };
 
 struct c4iw_dev {
@@ -225,10 +229,10 @@ struct c4iw_dev {
 	struct mutex db_mutex;
 	struct dentry *debugfs_root;
 	enum db_state db_state;
-	int qpcnt;
 	struct idr hwtid_idr;
 	struct idr atid_idr;
 	struct idr stid_idr;
+	struct list_head db_fc_list;
 };
 
 static inline struct c4iw_dev *to_c4iw_dev(struct ib_device *ibdev)
@@ -432,6 +436,7 @@ struct c4iw_qp_attributes {
 
 struct c4iw_qp {
 	struct ib_qp ibqp;
+	struct list_head db_fc_entry;
 	struct c4iw_dev *rhp;
 	struct c4iw_ep *ep;
 	struct c4iw_qp_attributes attr;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 7e94c9a..e36d2a2 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -106,15 +106,54 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
 {
 	struct c4iw_ucontext *context;
 	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
+	static int warned;
+	struct c4iw_alloc_ucontext_resp uresp;
+	int ret = 0;
+	struct c4iw_mm_entry *mm = NULL;
 
 	PDBG("%s ibdev %p\n", __func__, ibdev);
 	context = kzalloc(sizeof(*context), GFP_KERNEL);
-	if (!context)
-		return ERR_PTR(-ENOMEM);
+	if (!context) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
 	c4iw_init_dev_ucontext(&rhp->rdev, &context->uctx);
 	INIT_LIST_HEAD(&context->mmaps);
 	spin_lock_init(&context->mmap_lock);
+
+	if (udata->outlen < sizeof(uresp)) {
+		if (!warned++)
+			pr_err(MOD "Warning - downlevel libcxgb4 (non-fatal), device status page disabled.\n");
+		rhp->rdev.flags |= T4_STATUS_PAGE_DISABLED;
+	} else {
+		mm = kmalloc(sizeof(*mm), GFP_KERNEL);
+		if (!mm) {
+			ret = -ENOMEM;
+			goto err_free;
+		}
+
+		uresp.status_page_size = PAGE_SIZE;
+
+		spin_lock(&context->mmap_lock);
+		uresp.status_page_key = context->key;
+		context->key += PAGE_SIZE;
+		spin_unlock(&context->mmap_lock);
+
+		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (ret)
+			goto err_mm;
+
+		mm->key = uresp.status_page_key;
+		mm->addr = virt_to_phys(rhp->rdev.status_page);
+		mm->len = PAGE_SIZE;
+		insert_mmap(context, mm);
+	}
 	return &context->ibucontext;
+err_mm:
+	kfree(mm);
+err_free:
+	kfree(context);
+err:
+	return ERR_PTR(ret);
 }
 
 static int c4iw_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 5829367..3b62eb5 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -638,6 +638,46 @@ void c4iw_qp_rem_ref(struct ib_qp *qp)
 		wake_up(&(to_c4iw_qp(qp)->wait));
 }
 
+static void add_to_fc_list(struct list_head *head, struct list_head *entry)
+{
+	if (list_empty(entry))
+		list_add_tail(entry, head);
+}
+
+static int ring_kernel_sq_db(struct c4iw_qp *qhp, u16 inc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&qhp->rhp->lock, flags);
+	spin_lock(&qhp->lock);
+	if (qhp->rhp->db_state == NORMAL) {
+		t4_ring_sq_db(&qhp->wq, inc);
+	} else {
+		add_to_fc_list(&qhp->rhp->db_fc_list, &qhp->db_fc_entry);
+		qhp->wq.sq.wq_pidx_inc += inc;
+	}
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&qhp->rhp->lock, flags);
+	return 0;
+}
+
+static int ring_kernel_rq_db(struct c4iw_qp *qhp, u16 inc)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&qhp->rhp->lock, flags);
+	spin_lock(&qhp->lock);
+	if (qhp->rhp->db_state == NORMAL) {
+		t4_ring_rq_db(&qhp->wq, inc);
+	} else {
+		add_to_fc_list(&qhp->rhp->db_fc_list, &qhp->db_fc_entry);
+		qhp->wq.rq.wq_pidx_inc += inc;
+	}
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&qhp->rhp->lock, flags);
+	return 0;
+}
+
 int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		   struct ib_send_wr **bad_wr)
 {
@@ -750,9 +790,13 @@ int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		t4_sq_produce(&qhp->wq, len16);
 		idx += DIV_ROUND_UP(len16*16, T4_EQ_ENTRY_SIZE);
 	}
-	if (t4_wq_db_enabled(&qhp->wq))
+	if (!qhp->rhp->rdev.status_page->db_off) {
 		t4_ring_sq_db(&qhp->wq, idx);
-	spin_unlock_irqrestore(&qhp->lock, flag);
+		spin_unlock_irqrestore(&qhp->lock, flag);
+	} else {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		ring_kernel_sq_db(qhp, idx);
+	}
 	return err;
 }
 
@@ -812,9 +856,13 @@ int c4iw_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 		wr = wr->next;
 		num_wrs--;
 	}
-	if (t4_wq_db_enabled(&qhp->wq))
+	if (!qhp->rhp->rdev.status_page->db_off) {
 		t4_ring_rq_db(&qhp->wq, idx);
-	spin_unlock_irqrestore(&qhp->lock, flag);
+		spin_unlock_irqrestore(&qhp->lock, flag);
+	} else {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		ring_kernel_rq_db(qhp, idx);
+	}
 	return err;
 }
 
@@ -1200,35 +1248,6 @@ out:
 	return ret;
 }
 
-/*
- * Called by the library when the qp has user dbs disabled due to
- * a DB_FULL condition.  This function will single-thread all user
- * DB rings to avoid overflowing the hw db-fifo.
- */
-static int ring_kernel_db(struct c4iw_qp *qhp, u32 qid, u16 inc)
-{
-	int delay = db_delay_usecs;
-
-	mutex_lock(&qhp->rhp->db_mutex);
-	do {
-
-		/*
-		 * The interrupt threshold is dbfifo_int_thresh << 6. So
-		 * make sure we don't cross that and generate an interrupt.
-		 */
-		if (cxgb4_dbfifo_count(qhp->rhp->rdev.lldi.ports[0], 1) <
-		    (qhp->rhp->rdev.lldi.dbfifo_int_thresh << 5)) {
-			writel(QID(qid) | PIDX(inc), qhp->wq.db);
-			break;
-		}
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		schedule_timeout(usecs_to_jiffies(delay));
-		delay = min(delay << 1, 2000);
-	} while (1);
-	mutex_unlock(&qhp->rhp->db_mutex);
-	return 0;
-}
-
 int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
 		   enum c4iw_qp_attr_mask mask,
 		   struct c4iw_qp_attributes *attrs,
@@ -1278,11 +1297,11 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
 	}
 
 	if (mask & C4IW_QP_ATTR_SQ_DB) {
-		ret = ring_kernel_db(qhp, qhp->wq.sq.qid, attrs->sq_db_inc);
+		ret = ring_kernel_sq_db(qhp, attrs->sq_db_inc);
 		goto out;
 	}
 	if (mask & C4IW_QP_ATTR_RQ_DB) {
-		ret = ring_kernel_db(qhp, qhp->wq.rq.qid, attrs->rq_db_inc);
+		ret = ring_kernel_rq_db(qhp, attrs->rq_db_inc);
 		goto out;
 	}
 
@@ -1465,14 +1484,6 @@ out:
 	return ret;
 }
 
-static int enable_qp_db(int id, void *p, void *data)
-{
-	struct c4iw_qp *qp = p;
-
-	t4_enable_wq_db(&qp->wq);
-	return 0;
-}
-
 int c4iw_destroy_qp(struct ib_qp *ib_qp)
 {
 	struct c4iw_dev *rhp;
@@ -1490,22 +1501,15 @@ int c4iw_destroy_qp(struct ib_qp *ib_qp)
 		c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, &attrs, 0);
 	wait_event(qhp->wait, !qhp->ep);
 
-	spin_lock_irq(&rhp->lock);
-	remove_handle_nolock(rhp, &rhp->qpidr, qhp->wq.sq.qid);
-	rhp->qpcnt--;
-	BUG_ON(rhp->qpcnt < 0);
-	if (rhp->qpcnt <= db_fc_threshold && rhp->db_state == FLOW_CONTROL) {
-		rhp->rdev.stats.db_state_transitions++;
-		rhp->db_state = NORMAL;
-		idr_for_each(&rhp->qpidr, enable_qp_db, NULL);
-	}
-	if (db_coalescing_threshold >= 0)
-		if (rhp->qpcnt <= db_coalescing_threshold)
-			cxgb4_enable_db_coalescing(rhp->rdev.lldi.ports[0]);
-	spin_unlock_irq(&rhp->lock);
+	remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
 	atomic_dec(&qhp->refcnt);
 	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
 
+	spin_lock_irq(&rhp->lock);
+	if (!list_empty(&qhp->db_fc_entry))
+		list_del_init(&qhp->db_fc_entry);
+	spin_unlock_irq(&rhp->lock);
+
 	ucontext = ib_qp->uobject ?
 		   to_c4iw_ucontext(ib_qp->uobject->context) : NULL;
 	destroy_qp(&rhp->rdev, &qhp->wq,
@@ -1516,14 +1520,6 @@ int c4iw_destroy_qp(struct ib_qp *ib_qp)
 	return 0;
 }
 
-static int disable_qp_db(int id, void *p, void *data)
-{
-	struct c4iw_qp *qp = p;
-
-	t4_disable_wq_db(&qp->wq);
-	return 0;
-}
-
 struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 			     struct ib_udata *udata)
 {
@@ -1610,20 +1606,7 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	init_waitqueue_head(&qhp->wait);
 	atomic_set(&qhp->refcnt, 1);
 
-	spin_lock_irq(&rhp->lock);
-	if (rhp->db_state != NORMAL)
-		t4_disable_wq_db(&qhp->wq);
-	rhp->qpcnt++;
-	if (rhp->qpcnt > db_fc_threshold && rhp->db_state == NORMAL) {
-		rhp->rdev.stats.db_state_transitions++;
-		rhp->db_state = FLOW_CONTROL;
-		idr_for_each(&rhp->qpidr, disable_qp_db, NULL);
-	}
-	if (db_coalescing_threshold >= 0)
-		if (rhp->qpcnt > db_coalescing_threshold)
-			cxgb4_disable_db_coalescing(rhp->rdev.lldi.ports[0]);
-	ret = insert_handle_nolock(rhp, &rhp->qpidr, qhp, qhp->wq.sq.qid);
-	spin_unlock_irq(&rhp->lock);
+	ret = insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.sq.qid);
 	if (ret)
 		goto err2;
 
@@ -1709,6 +1692,7 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	}
 	qhp->ibqp.qp_num = qhp->wq.sq.qid;
 	init_timer(&(qhp->timer));
+	INIT_LIST_HEAD(&qhp->db_fc_entry);
 	PDBG("%s qhp %p sq_num_entries %d, rq_num_entries %d qpid 0x%0x\n",
 	     __func__, qhp, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries,
 	     qhp->wq.sq.qid);
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index e73ace7..eeca8b1 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -300,6 +300,7 @@ struct t4_sq {
 	u16 cidx;
 	u16 pidx;
 	u16 wq_pidx;
+	u16 wq_pidx_inc;
 	u16 flags;
 	short flush_cidx;
 };
@@ -324,6 +325,7 @@ struct t4_rq {
 	u16 cidx;
 	u16 pidx;
 	u16 wq_pidx;
+	u16 wq_pidx_inc;
 };
 
 struct t4_wq {
@@ -609,3 +611,7 @@ static inline void t4_set_cq_in_error(struct t4_cq *cq)
 	((struct t4_status_page *)&cq->queue[cq->size])->qp_err = 1;
 }
 #endif
+
+struct t4_dev_status_page {
+	u8 db_off;
+};
diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
index 32b754c..11ccd27 100644
--- a/drivers/infiniband/hw/cxgb4/user.h
+++ b/drivers/infiniband/hw/cxgb4/user.h
@@ -70,4 +70,9 @@ struct c4iw_create_qp_resp {
 	__u32 qid_mask;
 	__u32 flags;
 };
+
+struct c4iw_alloc_ucontext_resp {
+	__u64 status_page_key;
+	__u32 status_page_size;
+};
 #endif
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 50abe1d..32db377 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -500,6 +500,7 @@ struct sge_txq {
 	spinlock_t db_lock;
 	int db_disabled;
 	unsigned short db_pidx;
+	unsigned short db_pidx_inc;
 	u64 udb;
 };
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 0ac53dd..cc04d09 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3578,14 +3578,25 @@ static void drain_db_fifo(struct adapter *adap, int usecs)
 
 static void disable_txq_db(struct sge_txq *q)
 {
-	spin_lock_irq(&q->db_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->db_lock, flags);
 	q->db_disabled = 1;
-	spin_unlock_irq(&q->db_lock);
+	spin_unlock_irqrestore(&q->db_lock, flags);
 }
 
-static void enable_txq_db(struct sge_txq *q)
+static void enable_txq_db(struct adapter *adap, struct sge_txq *q)
 {
 	spin_lock_irq(&q->db_lock);
+	if (q->db_pidx_inc) {
+		/* Make sure that all writes to the TX descriptors
+		 * are committed before we tell HW about them.
+		 */
+		wmb();
+		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
+			     QID(q->cntxt_id) | PIDX(q->db_pidx_inc));
+		q->db_pidx_inc = 0;
+	}
 	q->db_disabled = 0;
 	spin_unlock_irq(&q->db_lock);
 }
@@ -3607,11 +3618,32 @@ static void enable_dbs(struct adapter *adap)
 	int i;
 
 	for_each_ethrxq(&adap->sge, i)
-		enable_txq_db(&adap->sge.ethtxq[i].q);
+		enable_txq_db(adap, &adap->sge.ethtxq[i].q);
 	for_each_ofldrxq(&adap->sge, i)
-		enable_txq_db(&adap->sge.ofldtxq[i].q);
+		enable_txq_db(adap, &adap->sge.ofldtxq[i].q);
 	for_each_port(adap, i)
-		enable_txq_db(&adap->sge.ctrlq[i].q);
+		enable_txq_db(adap, &adap->sge.ctrlq[i].q);
+}
+
+static void notify_rdma_uld(struct adapter *adap, enum cxgb4_control cmd)
+{
+	if (adap->uld_handle[CXGB4_ULD_RDMA])
+		ulds[CXGB4_ULD_RDMA].control(adap->uld_handle[CXGB4_ULD_RDMA],
+				cmd);
+}
+
+static void process_db_full(struct work_struct *work)
+{
+	struct adapter *adap;
+
+	adap = container_of(work, struct adapter, db_full_task);
+
+	drain_db_fifo(adap, dbfifo_drain_delay);
+	enable_dbs(adap);
+	notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
+	t4_set_reg_field(adap, SGE_INT_ENABLE3,
+			 DBFIFO_HP_INT | DBFIFO_LP_INT,
+			 DBFIFO_HP_INT | DBFIFO_LP_INT);
 }
 
 static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
@@ -3619,7 +3651,7 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 	u16 hw_pidx, hw_cidx;
 	int ret;
 
-	spin_lock_bh(&q->db_lock);
+	spin_lock_irq(&q->db_lock);
 	ret = read_eq_indices(adap, (u16)q->cntxt_id, &hw_pidx, &hw_cidx);
 	if (ret)
 		goto out;
@@ -3636,7 +3668,8 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 	}
 out:
 	q->db_disabled = 0;
-	spin_unlock_bh(&q->db_lock);
+	q->db_pidx_inc = 0;
+	spin_unlock_irq(&q->db_lock);
 	if (ret)
 		CH_WARN(adap, "DB drop recovery failed.\n");
 }
@@ -3652,29 +3685,6 @@ static void recover_all_queues(struct adapter *adap)
 		sync_txq_pidx(adap, &adap->sge.ctrlq[i].q);
 }
 
-static void notify_rdma_uld(struct adapter *adap, enum cxgb4_control cmd)
-{
-	mutex_lock(&uld_mutex);
-	if (adap->uld_handle[CXGB4_ULD_RDMA])
-		ulds[CXGB4_ULD_RDMA].control(adap->uld_handle[CXGB4_ULD_RDMA],
-				cmd);
-	mutex_unlock(&uld_mutex);
-}
-
-static void process_db_full(struct work_struct *work)
-{
-	struct adapter *adap;
-
-	adap = container_of(work, struct adapter, db_full_task);
-
-	notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
-	drain_db_fifo(adap, dbfifo_drain_delay);
-	t4_set_reg_field(adap, SGE_INT_ENABLE3,
-			 DBFIFO_HP_INT | DBFIFO_LP_INT,
-			 DBFIFO_HP_INT | DBFIFO_LP_INT);
-	notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
-}
-
 static void process_db_drop(struct work_struct *work)
 {
 	struct adapter *adap;
@@ -3682,11 +3692,13 @@ static void process_db_drop(struct work_struct *work)
 	adap = container_of(work, struct adapter, db_drop_task);
 
 	if (is_t4(adap->params.chip)) {
-		disable_dbs(adap);
+		drain_db_fifo(adap, dbfifo_drain_delay);
 		notify_rdma_uld(adap, CXGB4_CONTROL_DB_DROP);
-		drain_db_fifo(adap, 1);
+		drain_db_fifo(adap, dbfifo_drain_delay);
 		recover_all_queues(adap);
+		drain_db_fifo(adap, dbfifo_drain_delay);
 		enable_dbs(adap);
+		notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
 	} else {
 		u32 dropped_db = t4_read_reg(adap, 0x010ac);
 		u16 qid = (dropped_db >> 15) & 0x1ffff;
@@ -3727,6 +3739,8 @@ static void process_db_drop(struct work_struct *work)
 void t4_db_full(struct adapter *adap)
 {
 	if (is_t4(adap->params.chip)) {
+		disable_dbs(adap);
+		notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
 		t4_set_reg_field(adap, SGE_INT_ENABLE3,
 				 DBFIFO_HP_INT | DBFIFO_LP_INT, 0);
 		queue_work(workq, &adap->db_full_task);
@@ -3735,8 +3749,11 @@ void t4_db_full(struct adapter *adap)
 
 void t4_db_dropped(struct adapter *adap)
 {
-	if (is_t4(adap->params.chip))
-		queue_work(workq, &adap->db_drop_task);
+	if (is_t4(adap->params.chip)) {
+		disable_dbs(adap);
+		notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
+	}
+	queue_work(workq, &adap->db_drop_task);
 }
 
 static void uld_attach(struct adapter *adap, unsigned int uld)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 46429f9..d4db382 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -860,9 +860,10 @@ static void cxgb_pio_copy(u64 __iomem *dst, u64 *src)
 static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
 {
 	unsigned int *wr, index;
+	unsigned long flags;
 
 	wmb();            /* write descriptors before telling HW */
-	spin_lock(&q->db_lock);
+	spin_lock_irqsave(&q->db_lock, flags);
 	if (!q->db_disabled) {
 		if (is_t4(adap->params.chip)) {
 			t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
@@ -878,9 +879,10 @@ static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
 				writel(n,  adap->bar2 + q->udb + 8);
 			wmb();
 		}
-	}
+	} else
+		q->db_pidx_inc += n;
 	q->db_pidx = q->pidx;
-	spin_unlock(&q->db_lock);
+	spin_unlock_irqrestore(&q->db_lock, flags);
 }
 
 /**
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4
       [not found] ` <1394814128-8815-1-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  2014-03-14 16:22   ` [PATCH net-next 1/2] cxgb4/iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice Hariprasad Shenai
@ 2014-03-15  2:44   ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2014-03-15  2:44 UTC (permalink / raw)
  To: hariprasad-ut6Up61K2wZBDgjK7y7TUQ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-BHEL68pLQRGGvPXPguhicg, dm-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	leedom-ut6Up61K2wZBDgjK7y7TUQ, santosh-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ

From: Hariprasad Shenai <hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Date: Fri, 14 Mar 2014 21:52:06 +0530

> This patch series provides fixes for Chelsio T4/T5 adapters
> related to DB drop avoidance, plus another small fix related to keepalive in
> iw_cxgb4.
> 
> The patch series is created against David Miller's 'net-next' tree
> and includes patches for the cxgb4 and iw_cxgb4 drivers.
> 
> We would like to request this patch series to get merged via David Miller's
> 'net-next' tree.
> 
> We have included all the maintainers of respective drivers. Kindly review the
> change and let us know in case of any review comments.

Series applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH net-next 2/2] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes
       [not found]   ` <1394814128-8815-3-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2014-03-26 15:49     ` Yann Droneaud
       [not found]       ` <1395848977.3297.15.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2014-03-28  8:24       ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Dan Carpenter
  0 siblings, 2 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-03-26 15:49 UTC (permalink / raw)
  To: Steve Wise
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, roland-BHEL68pLQRGGvPXPguhicg,
	dm-ut6Up61K2wZBDgjK7y7TUQ, leedom-ut6Up61K2wZBDgjK7y7TUQ,
	santosh-ut6Up61K2wZBDgjK7y7TUQ, kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ

On Friday, 14 March 2014 at 21:52 +0530, Hariprasad Shenai wrote:
> From: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

[...]

> Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
> ---
>  drivers/infiniband/hw/cxgb4/device.c            | 177 ++++++++++++++----------
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   9 +-
>  drivers/infiniband/hw/cxgb4/provider.c          |  43 +++++-
>  drivers/infiniband/hw/cxgb4/qp.c                | 140 +++++++++----------
>  drivers/infiniband/hw/cxgb4/t4.h                |   6 +
>  drivers/infiniband/hw/cxgb4/user.h              |   5 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   1 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  87 +++++++-----
>  drivers/net/ethernet/chelsio/cxgb4/sge.c        |   8 +-
>  9 files changed, 286 insertions(+), 190 deletions(-)
> 

[...]

> diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
> index 7e94c9a..e36d2a2 100644
> --- a/drivers/infiniband/hw/cxgb4/provider.c
> +++ b/drivers/infiniband/hw/cxgb4/provider.c
> @@ -106,15 +106,54 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
>  {
>  	struct c4iw_ucontext *context;
>  	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
> +	static int warned;
> +	struct c4iw_alloc_ucontext_resp uresp;
> +	int ret = 0;
> +	struct c4iw_mm_entry *mm = NULL;
>  
>  	PDBG("%s ibdev %p\n", __func__, ibdev);
>  	context = kzalloc(sizeof(*context), GFP_KERNEL);
> -	if (!context)
> -		return ERR_PTR(-ENOMEM);
> +	if (!context) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
>  	c4iw_init_dev_ucontext(&rhp->rdev, &context->uctx);
>  	INIT_LIST_HEAD(&context->mmaps);
>  	spin_lock_init(&context->mmap_lock);
> +
> +	if (udata->outlen < sizeof(uresp)) {
> +		if (!warned++)
> +			pr_err(MOD "Warning - downlevel libcxgb4 (non-fatal), device status page disabled.");
> +		rhp->rdev.flags |= T4_STATUS_PAGE_DISABLED;
> +	} else {
> +		mm = kmalloc(sizeof(*mm), GFP_KERNEL);
> +		if (!mm)
> +			goto err_free;
> +

OK, that's the origin of the missing error I've noticed in my latest
review on linux-next.

See
<http://marc.info/?i=1395846311-29288-1-git-send-email-ydroneaud@opteya.com>
<http://marc.info/?i=005b01cf4907$9adfa320$d09ee960
$@opengridcomputing.com>

Sorry, I've missed the opportunity to report it.

> +		uresp.status_page_size = PAGE_SIZE;
> +
> +		spin_lock(&context->mmap_lock);
> +		uresp.status_page_key = context->key;
> +		context->key += PAGE_SIZE;
> +		spin_unlock(&context->mmap_lock);
> +

Is it really necessary to spinlock here since context is local to the
function?

> +		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		if (ret)
> +			goto err_mm;
> +
> +		mm->key = uresp.status_page_key;
> +		mm->addr = virt_to_phys(rhp->rdev.status_page);
> +		mm->len = PAGE_SIZE;
> +		insert_mmap(context, mm);
> +	}
>  	return &context->ibucontext;
> +err_mm:
> +	kfree(mm);
> +err_free:
> +	kfree(context);
> +err:
> +	return ERR_PTR(ret);
>  }
>  

[...]

> diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
> index 32b754c..11ccd27 100644
> --- a/drivers/infiniband/hw/cxgb4/user.h
> +++ b/drivers/infiniband/hw/cxgb4/user.h
> @@ -70,4 +70,9 @@ struct c4iw_create_qp_resp {
>  	__u32 qid_mask;
>  	__u32 flags;
>  };
> +
> +struct c4iw_alloc_ucontext_resp {
> +	__u64 status_page_key;
> +	__u32 status_page_size;
> +};

If this is going to be part of the ABI, mind add an explicit padding to
align the structure on 64bits.

Regards

-- 
Yann Droneaud
OPTEYA



* RE: [PATCH net-next 2/2] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes
       [not found]       ` <1395848977.3297.15.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-03-26 15:58         ` Steve Wise
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Wise @ 2014-03-26 15:58 UTC (permalink / raw)
  To: 'Yann Droneaud'
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, roland-BHEL68pLQRGGvPXPguhicg,
	dm-ut6Up61K2wZBDgjK7y7TUQ, leedom-ut6Up61K2wZBDgjK7y7TUQ,
	santosh-ut6Up61K2wZBDgjK7y7TUQ, kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ



> > +		uresp.status_page_size = PAGE_SIZE;
> > +
> > +		spin_lock(&context->mmap_lock);
> > +		uresp.status_page_key = context->key;
> > +		context->key += PAGE_SIZE;
> > +		spin_unlock(&context->mmap_lock);
> > +
> 
> Is it really necessary to spinlock here since context is local to the
> function?
> 

You're correct.

> 
> [...]
> 
> > diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
> > index 32b754c..11ccd27 100644
> > --- a/drivers/infiniband/hw/cxgb4/user.h
> > +++ b/drivers/infiniband/hw/cxgb4/user.h
> > @@ -70,4 +70,9 @@ struct c4iw_create_qp_resp {
> >  	__u32 qid_mask;
> >  	__u32 flags;
> >  };
> > +
> > +struct c4iw_alloc_ucontext_resp {
> > +	__u64 status_page_key;
> > +	__u32 status_page_size;
> > +};
> 
> If this is going to be part of the ABI, mind add an explicit padding to
> align the structure on 64bits.
> 

Sounds good.  Can you provide patches for these?



* [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
@ 2014-03-28  8:24       ` Dan Carpenter
  2014-03-28 10:27         ` Yann Droneaud
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Carpenter @ 2014-03-28  8:24 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

The c4iw_alloc_ucontext_resp struct has a 4 byte hole after the last
member and we should clear it before passing it to the user.

Fixes: 05eb23893c2c ('cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes')
Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index e36d2a2..a72aaa7 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -107,7 +107,7 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
 	struct c4iw_ucontext *context;
 	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
 	static int warned;
-	struct c4iw_alloc_ucontext_resp uresp;
+	struct c4iw_alloc_ucontext_resp uresp = {};
 	int ret = 0;
 	struct c4iw_mm_entry *mm = NULL;
 

* Re: [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
  2014-03-28  8:24       ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Dan Carpenter
@ 2014-03-28 10:27         ` Yann Droneaud
       [not found]           ` <1396002468.3297.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Droneaud @ 2014-03-28 10:27 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Steve Wise, Roland Dreier, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors, netdev, davem, roland, dm, leedom, santosh,
	kumaras, nirranjan, hariprasad, Steve Wise

Hi,

On Friday, 28 March 2014 at 11:24 +0300, Dan Carpenter wrote:
> The c4iw_alloc_ucontext_resp struct has a 4 byte hole after the last
> member and we should clear it before passing it to the user.
> 
> Fixes: 05eb23893c2c ('cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes')
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 

It's not the proper fix for this issue: an explicit padding has to be
added (and initialized), see "Re: [PATCH net-next 2/2] cxgb4/iw_cxgb4:
Doorbell Drop Avoidance Bug Fixes"
http://marc.info/?i=1395848977.3297.15.camel@localhost.localdomain

In its current form, the c4iw_alloc_ucontext_resp structure does not
require padding on i386, so a 32bits userspace program using this
structure against a x86_64 kernel will make the kernel do a buffer
overflow in userspace, likely on stack, as answer of a GET_CONTEXT
request:

#include <infiniband/verbs.h>
#include <infiniband/kern-abi.h>

#define IBV_INIT_CMD_RESP(cmd, size, opcode, out, outsize) \
do {                                                       \
    (cmd)->command = IB_USER_VERBS_CMD_##opcode;           \
    (cmd)->in_words  = (size) / 4;                         \
    (cmd)->out_words = (outsize) / 4;                      \
    (cmd)->response  = (out);                              \
} while (0)

struct c4iw_alloc_ucontext_resp {
        struct ibv_get_context_resp ibv_resp;
        __u64 status_page_key;
        __u32 status_page_size;
};

struct ibv_context *alloc_context(struct ibv_device *ibdev,
                                  int cmd_fd)
{
    struct ibv_get_context cmd;
    struct c4iw_alloc_ucontext_resp resp;
    ssize_t sret;
    ...

    IBV_INIT_CMD_RESP(&cmd, sizeof(cmd),
                      GET_CONTEXT,
                      &resp, sizeof(resp));
    ...
    sret = write(context->cmd_fd,
                 &cmd,
                 sizeof(cmd));

    if (sret != sizeof(cmd)) {
        int err = errno;
        fprintf(stderr, "GET_CONTEXT failed: %d (%s)\n",
                err, strerror(err));
        ...
    }
    ...
}

Unfortunately, it's not the only structure which has this problem. I'm
currently preparing a report on this issue for this driver (cxgb4) and
another.

> diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
> index e36d2a2..a72aaa7 100644
> --- a/drivers/infiniband/hw/cxgb4/provider.c
> +++ b/drivers/infiniband/hw/cxgb4/provider.c
> @@ -107,7 +107,7 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
>  	struct c4iw_ucontext *context;
>  	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
>  	static int warned;
> -	struct c4iw_alloc_ucontext_resp uresp;
> +	struct c4iw_alloc_ucontext_resp uresp = {};
>  	int ret = 0;
>  	struct c4iw_mm_entry *mm = NULL;
>  

Regards.

-- 
Yann Droneaud
OPTEYA


* RE: [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
       [not found]           ` <1396002468.3297.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-03-28 10:58             ` David Laight
  2014-05-02 23:56             ` Dan Carpenter
  1 sibling, 0 replies; 25+ messages in thread
From: David Laight @ 2014-03-28 10:58 UTC (permalink / raw)
  To: 'Yann Droneaud', Dan Carpenter
  Cc: Steve Wise, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org,
	dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org,
	leedom-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org,
	santosh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org,
	nirranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org, Steve Wise

From: Yann Droneaud
> Hi,
> 
> On Friday, 28 March 2014 at 11:24 +0300, Dan Carpenter wrote:
> > The c4iw_alloc_ucontext_resp struct has a 4 byte hole after the last
> > member and we should clear it before passing it to the user.
> >
> > Fixes: 05eb23893c2c ('cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes')
> > Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> 
> It's not the proper fix for this issue: an explicit padding has to be
> added (and initialized), see "Re: [PATCH net-next 2/2] cxgb4/iw_cxgb4:
> Doorbell Drop Avoidance Bug Fixes"
> http://marc.info/?i=1395848977.3297.15.camel@localhost.localdomain
> 
> In its current form, the c4iw_alloc_ucontext_resp structure does not
> require padding on i386, so a 32bits userspace program using this
> structure against a x86_64 kernel will make the kernel do a buffer
> overflow in userspace, likely on stack, as answer of a GET_CONTEXT
> request:
...
> struct c4iw_alloc_ucontext_resp {
>         struct ibv_get_context_resp ibv_resp;
>         __u64 status_page_key;
>         __u32 status_page_size;
> };

Or add __attribute__((aligned(4))) to the 64bit fields.
And maybe a compile time assert on the length of the structure.
Since it is part of an ABI, it must not change.

	David



* Re: [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
       [not found]           ` <1396002468.3297.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2014-03-28 10:58             ` David Laight
@ 2014-05-02 23:56             ` Dan Carpenter
  2014-05-04 21:21               ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
  2014-05-04 21:46               ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Yann Droneaud
  1 sibling, 2 replies; 25+ messages in thread
From: Dan Carpenter @ 2014-05-02 23:56 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: Steve Wise, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	roland-BHEL68pLQRGGvPXPguhicg, dm-ut6Up61K2wZBDgjK7y7TUQ,
	leedom-ut6Up61K2wZBDgjK7y7TUQ, santosh-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	hariprasad-ut6Up61K2wZBDgjK7y7TUQ, Steve Wise

On Fri, Mar 28, 2014 at 11:27:48AM +0100, Yann Droneaud wrote:
> Unfortunately, it's not the only structure which has this problem. I'm
> currently preparing a report on this issue for this driver (cxgb4) and
> another.
> 

This information leak is still present in linux-next.  These days we
count those things as security vulnerabilities with CVEs and everything.

When is it likely to get fixed?

regards,
dan carpenter

* [PATCH for v3.15 0/4] uverbs ABI fixes
  2014-05-02 23:56             ` Dan Carpenter
@ 2014-05-04 21:21               ` Yann Droneaud
  2014-05-04 21:21                 ` [PATCH 2/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_srq Yann Droneaud
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:46               ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Yann Droneaud
  1 sibling, 2 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:21 UTC (permalink / raw)
  To: Eli Cohen, Steve Wise, Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dan Carpenter

Hi,

Please find 4 patches which fix some issues regarding missing explicit
padding at end of structure exchanged between kernel and userspace.

These makes i386 userspace libraries and x86_64 kernel disagree about
the size of the structures.

Additionally, as reported by Dan Carpenter, in one case, stack information
can be leaked by the kernel to userspace due to implicit padding being not
initialized.

Unfortunately, the data structure cannot be fixed alone as it would break
existing applications. So in order to remain compatible with i386 libraries,
providers (hw) functions are modified to use the input length to guess the
expected format of the command in order to check the content of the reserved
field for future usage. Other are modified to not write the padding field in
response to make the kernel able to handle gracefully i386 userspace on x86_64.

For full coherency, patches against the userspace libraries (libcxgb4 and
libmlx5) will be submitted as a followup to update the data structure on
userspace side.

Yann Droneaud (4):
  RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_cq
  RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_srq
  RDMA/cxgb4: add missing padding at end of struct c4iw_create_cq_resp
  RDMA/cxgb4: add missing padding at end of struct
    c4iw_alloc_ucontext_resp

 drivers/infiniband/hw/cxgb4/cq.c       |  8 ++++++--
 drivers/infiniband/hw/cxgb4/provider.c |  8 ++++++--
 drivers/infiniband/hw/cxgb4/user.h     |  2 ++
 drivers/infiniband/hw/mlx5/cq.c        | 13 +++++++++++--
 drivers/infiniband/hw/mlx5/srq.c       | 18 +++++++++++++++---
 drivers/infiniband/hw/mlx5/user.h      |  2 ++
 6 files changed, 42 insertions(+), 9 deletions(-)

-- 
1.9.0


* [PATCH 1/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_cq
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
@ 2014-05-04 21:21                   ` Yann Droneaud
  2014-05-04 21:21                   ` [PATCH 3/4] RDMA/cxgb4: add missing padding at end of struct c4iw_create_cq_resp Yann Droneaud
                                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:21 UTC (permalink / raw)
  To: Eli Cohen, Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	stable-u79uwXL29TY76Z2rM5mHXA

The i386 ABI disagrees with most other ABIs regarding alignment
of data types larger than 4 bytes: on most ABIs padding must
be added at the end of the structure, while it is not required on
i386.

So on most ABIs struct mlx5_ib_create_cq gets padded to be aligned
on an 8-byte multiple, while on i386 such padding is not added.

Tool pahole could be used to find such implicit padding:

  $ pahole --anon_include \
  	 --nested_anon_include \
  	 --recursive \
  	 --class_name mlx5_ib_create_cq \
  	 drivers/infiniband/hw/mlx5/mlx5_ib.o

Then, structure layout can be compared between i386 and x86_64:

  +++ obj-i386/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt    2014-03-28 11:43:07.386413682 +0100
  --- obj-x86_64/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt  2014-03-27 13:06:17.788472721 +0100
  @@ -34,9 +34,8 @@ struct mlx5_ib_create_cq {
          __u64                      db_addr;              /*     8     8 */
          __u32                      cqe_size;             /*    16     4 */

  -       /* size: 20, cachelines: 1, members: 3 */
  -       /* last cacheline: 20 bytes */
  +       /* size: 24, cachelines: 1, members: 3 */
  +       /* padding: 4 */
  +       /* last cacheline: 24 bytes */
   };

This ABI disagreement will make an x86_64 kernel try to read
past the buffer provided by an i386 binary.

When boundary checking is implemented, an x86_64 kernel will
refuse to read past the i386 userspace-provided buffer
and the uverb will fail.

In any case, if the structure lies in memory on a page boundary and
the next page is not mapped, ib_copy_from_udata() will fail when
trying to read the 4 bytes of padding and the uverb will fail.

This patch makes create_cq_user() take care of the input
data size to handle the case where no padding is provided.

This way, an x86_64 kernel will be able to handle struct mlx5_ib_create_cq
as sent by both unpatched and patched i386 libmlx5.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB adapter')
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/mlx5/cq.c   | 13 +++++++++++--
 drivers/infiniband/hw/mlx5/user.h |  1 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 62bb6b49dc1d..5996132e799d 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -32,6 +32,7 @@
 
 #include <linux/kref.h>
 #include <rdma/ib_umem.h>
+#include <rdma/ib_user_verbs.h>
 #include "mlx5_ib.h"
 #include "user.h"
 
@@ -607,8 +608,16 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	int ncont;
 	int err;
 
-	if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd)))
-		return -EFAULT;
+	if (udata->inlen - sizeof(struct ib_uverbs_cmd_hdr) < sizeof(ucmd)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd) - sizeof(ucmd.reserved)))
+			return -EFAULT;
+	} else {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd)))
+			return -EFAULT;
+
+		if (ucmd.reserved != 0)
+			return -EINVAL;
+	}
 
 	if (ucmd.cqe_size != 64 && ucmd.cqe_size != 128)
 		return -EINVAL;
diff --git a/drivers/infiniband/hw/mlx5/user.h b/drivers/infiniband/hw/mlx5/user.h
index 0f4f8e42a17f..d44ecd2c2faf 100644
--- a/drivers/infiniband/hw/mlx5/user.h
+++ b/drivers/infiniband/hw/mlx5/user.h
@@ -91,6 +91,7 @@ struct mlx5_ib_create_cq {
 	__u64	buf_addr;
 	__u64	db_addr;
 	__u32	cqe_size;
+	__u32	reserved; /* explicit padding (optional on i386) */
 };
 
 struct mlx5_ib_create_cq_resp {
-- 
1.9.0


* [PATCH 2/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_srq
  2014-05-04 21:21               ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
@ 2014-05-04 21:21                 ` Yann Droneaud
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:21 UTC (permalink / raw)
  To: Eli Cohen, Roland Dreier; +Cc: Yann Droneaud, linux-rdma, stable

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required on
i386.

So on most ABIs, struct mlx5_ib_create_srq is implicitly padded to
be aligned on an 8-byte boundary, while on i386 no such padding is
added.

The pahole tool can be used to find such implicit padding:

  $ pahole --anon_include \
           --nested_anon_include \
           --recursive \
           --class_name mlx5_ib_create_srq \
           drivers/infiniband/hw/mlx5/mlx5_ib.o

Then, structure layout can be compared between i386 and x86_64:

  --- obj-i386/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt    2014-03-28 11:43:07.386413682 +0100
  +++ obj-x86_64/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt  2014-03-27 13:06:17.788472721 +0100
  @@ -69,7 +68,6 @@ struct mlx5_ib_create_srq {
          __u64                      db_addr;              /*     8     8 */
          __u32                      flags;                /*    16     4 */

  -       /* size: 20, cachelines: 1, members: 3 */
  -       /* last cacheline: 20 bytes */
  +       /* size: 24, cachelines: 1, members: 3 */
  +       /* padding: 4 */
  +       /* last cacheline: 24 bytes */
   };

This ABI disagreement will make an x86_64 kernel try to read past
the buffer provided by an i386 binary.

Once boundary checks are implemented, the x86_64 kernel will
refuse to read past the buffer provided by i386 userspace, and the
uverb will fail.

In any case, if the structure lies in memory against a page boundary
and the next page is not mapped, ib_copy_from_udata() will fail and
the uverb will fail.

This patch makes create_srq_user() take the input data size into
account, to handle the case where no padding was provided.

This way, an x86_64 kernel will be able to handle struct
mlx5_ib_create_srq as sent by both unpatched and patched i386 libmlx5.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud@opteya.com
Cc: stable@vger.kernel.org
Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB adapter')
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
---
 drivers/infiniband/hw/mlx5/srq.c  | 18 +++++++++++++++---
 drivers/infiniband/hw/mlx5/user.h |  1 +
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 210b3eaf188a..38ca52181eb6 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/srq.h>
 #include <linux/slab.h>
 #include <rdma/ib_umem.h>
+#include <rdma/ib_user_verbs.h>
 
 #include "mlx5_ib.h"
 #include "user.h"
@@ -84,10 +85,21 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
 	int ncont;
 	u32 offset;
 
-	if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd))) {
-		mlx5_ib_dbg(dev, "failed copy udata\n");
-		return -EFAULT;
+	if (udata->inlen - sizeof(struct ib_uverbs_cmd_hdr) < sizeof(ucmd)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd) - sizeof(ucmd.reserved))) {
+			mlx5_ib_dbg(dev, "failed copy udata\n");
+			return -EFAULT;
+		}
+	} else {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd))) {
+			mlx5_ib_dbg(dev, "failed copy udata\n");
+			return -EFAULT;
+		}
+
+		if (ucmd.reserved != 0)
+			return -EINVAL;
 	}
+
 	srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
 
 	srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, buf_size,
diff --git a/drivers/infiniband/hw/mlx5/user.h b/drivers/infiniband/hw/mlx5/user.h
index d44ecd2c2faf..d0ba264ac1ed 100644
--- a/drivers/infiniband/hw/mlx5/user.h
+++ b/drivers/infiniband/hw/mlx5/user.h
@@ -110,6 +110,7 @@ struct mlx5_ib_create_srq {
 	__u64	buf_addr;
 	__u64	db_addr;
 	__u32	flags;
+	__u32	reserved; /* explicit padding (optional on i386) */
 };
 
 struct mlx5_ib_create_srq_resp {
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/4] RDMA/cxgb4: add missing padding at end of struct c4iw_create_cq_resp
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:21                   ` [PATCH 1/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_cq Yann Droneaud
@ 2014-05-04 21:21                   ` Yann Droneaud
  2014-05-04 21:21                   ` [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp Yann Droneaud
                                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:21 UTC (permalink / raw)
  To: Steve Wise, Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	stable-u79uwXL29TY76Z2rM5mHXA, Dan Carpenter

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required on
i386.

So on most ABIs, struct c4iw_create_cq_resp is implicitly padded to
be aligned on an 8-byte boundary, while on i386 no such padding is
added.

The pahole tool can be used to find such implicit padding:

  $ pahole --anon_include \
           --nested_anon_include \
           --recursive \
           --class_name c4iw_create_cq_resp \
           drivers/infiniband/hw/cxgb4/iw_cxgb4.o

Then, structure layout can be compared between i386 and x86_64:

  --- obj-i386/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt   2014-03-28 11:43:05.547432195 +0100
  +++ obj-x86_64/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 10:55:10.990133017 +0100
  @@ -14,9 +13,8 @@ struct c4iw_create_cq_resp {
          __u32                      size;                 /*    28     4 */
          __u32                      qid_mask;             /*    32     4 */

  -       /* size: 36, cachelines: 1, members: 6 */
  -       /* last cacheline: 36 bytes */
  +       /* size: 40, cachelines: 1, members: 6 */
  +       /* padding: 4 */
  +       /* last cacheline: 40 bytes */
   };

This ABI disagreement will make an x86_64 kernel try to write past
the buffer provided by an i386 binary.

Once boundary checks are implemented, the x86_64 kernel will
refuse to write past the buffer provided by i386 userspace, and the
uverbs will fail.

If the structure lies in memory against a page boundary and the next
page is not mapped, ib_copy_to_udata() will fail and the uverbs
will fail.

This patch adds explicit padding at the end of structure
c4iw_create_cq_resp and, like 92b0ca7cb149 ('IB/mlx5: Fix stack
info leak in mlx5_ib_alloc_ucontext()'), makes function c4iw_create_cq()
not write this padding field to userspace. This way, an x86_64 kernel
will be able to write struct c4iw_create_cq_resp as expected by both
unpatched and patched i386 libcxgb4.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Fixes: cfdda9d764362 ('RDMA/cxgb4: Add driver for Chelsio T4 RNIC')
Fixes: e24a72a3302a6 ('RDMA/cxgb4: Fix four byte info leak in c4iw_create_cq()')
Cc: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/cxgb4/cq.c   | 8 ++++++--
 drivers/infiniband/hw/cxgb4/user.h | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index cfaa56ada189..5cb3cb4b4c25 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -940,7 +940,6 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
 		if (!mm2)
 			goto err4;
 
-		memset(&uresp, 0, sizeof(uresp));
 		uresp.qid_mask = rhp->rdev.cqmask;
 		uresp.cqid = chp->cq.cqid;
 		uresp.size = chp->cq.size;
@@ -951,7 +950,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int entries,
 		uresp.gts_key = ucontext->key;
 		ucontext->key += PAGE_SIZE;
 		spin_unlock(&ucontext->mmap_lock);
-		ret = ib_copy_to_udata(udata, &uresp, sizeof uresp);
+		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp) - sizeof(uresp.reserved));
 		if (ret)
 			goto err5;
 
diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
index 11ccd276e5d9..9b7534b5f07d 100644
--- a/drivers/infiniband/hw/cxgb4/user.h
+++ b/drivers/infiniband/hw/cxgb4/user.h
@@ -48,6 +48,7 @@ struct c4iw_create_cq_resp {
 	__u32 cqid;
 	__u32 size;
 	__u32 qid_mask;
+	__u32 reserved; /* explicit padding (optional for i386) */
 };
 
 
-- 
1.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:21                   ` [PATCH 1/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_cq Yann Droneaud
  2014-05-04 21:21                   ` [PATCH 3/4] RDMA/cxgb4: add missing padding at end of struct c4iw_create_cq_resp Yann Droneaud
@ 2014-05-04 21:21                   ` Yann Droneaud
       [not found]                     ` <2236129ab4aa1ad1562858f363fb6fef0d6bc93b.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:31                   ` [PATCH libcxgb4 0/2] uverbs ABI fixes Yann Droneaud
                                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:21 UTC (permalink / raw)
  To: Steve Wise, Roland Dreier
  Cc: Yann Droneaud, linux-rdma-u79uwXL29TY76Z2rM5mHXA

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required on
i386.

So on most ABIs, struct c4iw_alloc_ucontext_resp is implicitly padded
to be aligned on an 8-byte boundary, while on i386 no such padding
is added.

The pahole tool can be used to find such implicit padding:

  $ pahole --anon_include \
           --nested_anon_include \
           --recursive \
           --class_name c4iw_alloc_ucontext_resp \
           drivers/infiniband/hw/cxgb4/iw_cxgb4.o

Then, structure layout can be compared between i386 and x86_64:

  --- obj-i386/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt   2014-03-28 11:43:05.547432195 +0100
  +++ obj-x86_64/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 10:55:10.990133017 +0100
  @@ -2,9 +2,8 @@ struct c4iw_alloc_ucontext_resp {
          __u64                      status_page_key;      /*     0     8 */
          __u32                      status_page_size;     /*     8     4 */

  -       /* size: 12, cachelines: 1, members: 2 */
  -       /* last cacheline: 12 bytes */
  +       /* size: 16, cachelines: 1, members: 2 */
  +       /* padding: 4 */
  +       /* last cacheline: 16 bytes */
   };

This ABI disagreement will make an x86_64 kernel try to write
past the buffer provided by an i386 binary.

Once boundary checks are implemented, the x86_64 kernel will
refuse to write past the buffer provided by i386 userspace, and the
uverbs will fail.

If the structure lies in memory against a page boundary and the next
page is not mapped, ib_copy_to_udata() will fail and the uverb
will fail.

Additionally, as reported by Dan Carpenter, without the implicit
padding being properly cleared, an information leak would take
place on most architectures.

This patch adds explicit padding to struct c4iw_alloc_ucontext_resp
and, like 92b0ca7cb149 ('IB/mlx5: Fix stack info leak in
mlx5_ib_alloc_ucontext()'), makes function c4iw_alloc_ucontext()
not write this padding field to userspace. This way, an x86_64 kernel
will be able to write struct c4iw_alloc_ucontext_resp as expected by
both unpatched and patched i386 libcxgb4.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Link: http://marc.info/?i=1395848977.3297.15.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org
Link: http://marc.info/?i=20140328082428.GH25192@mwanda
Fixes: 05eb23893c2c ('cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes')
Reported-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
Reported-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/cxgb4/provider.c | 8 ++++++--
 drivers/infiniband/hw/cxgb4/user.h     | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index a94a3e12c349..9ba01d524f51 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -122,7 +122,7 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
 	INIT_LIST_HEAD(&context->mmaps);
 	spin_lock_init(&context->mmap_lock);
 
-	if (udata->outlen < sizeof(uresp)) {
+	if (udata->outlen < sizeof(uresp) - sizeof(uresp.reserved)) {
 		if (!warned++)
 			pr_err(MOD "Warning - downlevel libcxgb4 (non-fatal), device status page disabled.");
 		rhp->rdev.flags |= T4_STATUS_PAGE_DISABLED;
@@ -134,13 +134,13 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
 		}
 
 		uresp.status_page_size = PAGE_SIZE;
 
 		spin_lock(&context->mmap_lock);
 		uresp.status_page_key = context->key;
 		context->key += PAGE_SIZE;
 		spin_unlock(&context->mmap_lock);
 
-		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp) - sizeof(uresp.reserved));
 		if (ret)
 			goto err_mm;
 
diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
index 9b7534b5f07d..cbd0ce170728 100644
--- a/drivers/infiniband/hw/cxgb4/user.h
+++ b/drivers/infiniband/hw/cxgb4/user.h
@@ -75,5 +75,6 @@ struct c4iw_create_qp_resp {
 struct c4iw_alloc_ucontext_resp {
 	__u64 status_page_key;
 	__u32 status_page_size;
+	__u32 reserved; /* explicit padding (optional for i386) */
 };
 #endif
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH libcxgb4 0/2] uverbs ABI fixes
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
                                     ` (2 preceding siblings ...)
  2014-05-04 21:21                   ` [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp Yann Droneaud
@ 2014-05-04 21:31                   ` Yann Droneaud
       [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
                                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:31 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yann Droneaud

Hi,

Please find two patches for libcxgb4 to keep the library in sync
with the kernel changes I've submitted: explicit padding must be
added at the end of the structures so that the i386 userspace library
has the same structure layout as the x86_64 kernel.

See http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org

Yann Droneaud (2):
  kernel abi: adds explicit padding in struct c4iw_create_cq_resp
  kernel abi: adds explicit padding in struct c4iw_alloc_ucontext_resp

 src/cxgb4-abi.h | 2 ++
 src/dev.c       | 5 +++++
 src/verbs.c     | 5 +++++
 3 files changed, 12 insertions(+)

-- 
1.9.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH libcxgb4 1/2] kernel abi: adds explicit padding in struct c4iw_create_cq_resp
       [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
@ 2014-05-04 21:31                     ` Yann Droneaud
  2014-05-04 21:31                     ` [PATCH libcxgb4 2/2] kernel abi: adds explicit padding in struct c4iw_alloc_ucontext_resp Yann Droneaud
  2014-05-05 18:14                     ` [PATCH libcxgb4 0/2] uverbs ABI fixes Steve Wise
  2 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:31 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yann Droneaud

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required
on i386.

Such an ABI disagreement will make an x86_64 kernel try to write past
the struct c4iw_create_cq_resp buffer provided by an i386
userspace binary. As struct c4iw_create_cq_resp is likely
on the stack (see function c4iw_create_cq()), side effects are
expected.

On the kernel side, this structure was added in kernel v2.6.35-rc1
by the following commit:

  Commit cfdda9d764362ab77b11a410bb928400e6520d57
  Author: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  Date:   Wed Apr 21 15:30:06 2010 -0700

      RDMA/cxgb4: Add driver for Chelsio T4 RNIC

If a boundary check is implemented on the kernel side, the x86_64
kernel will refuse to write past the buffer provided by i386
userspace and the uverbs will fail.

To fix these issues, this patch adds explicit padding at the end
of the structure so that i386 and other ABIs share the same structure
layout. It also makes c4iw_create_cq() check for a value in the
padding field, to detect a newer kernel using the field for a future
purpose (the check is only active in debug builds).

With this patch, libcxgb4 will work against both older kernels and
newer patched kernels.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 src/cxgb4-abi.h | 1 +
 src/verbs.c     | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/src/cxgb4-abi.h b/src/cxgb4-abi.h
index d70b0f132a7f..23870f66dc0d 100644
--- a/src/cxgb4-abi.h
+++ b/src/cxgb4-abi.h
@@ -53,6 +53,7 @@ struct c4iw_create_cq_resp {
 	__u32 cqid;
 	__u32 size;
 	__u32 qid_mask;
+	__u32 reserved;
 };
 
 enum {
diff --git a/src/verbs.c b/src/verbs.c
index ab4a45d7cdbc..4a6c1b47bc9e 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -181,12 +181,17 @@ struct ibv_cq *c4iw_create_cq(struct ibv_context *context, int cqe,
 		return NULL;
 	}
 
+	resp.reserved = 0;
 	ret = ibv_cmd_create_cq(context, cqe, channel, comp_vector,
 				&chp->ibv_cq, &cmd, sizeof cmd,
 				&resp.ibv_resp, sizeof resp);
 	if (ret)
 		goto err1;
 
+	if (resp.reserved)
+		PDBG("%s c4iw_create_cq_resp reserved field modified by kernel\n",
+		     __FUNCTION__);
+
 	pthread_spin_init(&chp->lock, PTHREAD_PROCESS_PRIVATE);
 #ifdef STALL_DETECTION
 	gettimeofday(&chp->time, NULL);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH libcxgb4 2/2] kernel abi: adds explicit padding in struct c4iw_alloc_ucontext_resp
       [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:31                     ` [PATCH libcxgb4 1/2] kernel abi: adds explicit padding in struct c4iw_create_cq_resp Yann Droneaud
@ 2014-05-04 21:31                     ` Yann Droneaud
  2014-05-05 18:14                     ` [PATCH libcxgb4 0/2] uverbs ABI fixes Steve Wise
  2 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:31 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yann Droneaud

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required
on i386.

Such an ABI disagreement will make an x86_64 kernel try to write past
the struct c4iw_alloc_ucontext_resp buffer provided by an i386
userspace binary. As struct c4iw_alloc_ucontext_resp is likely
on the stack (see function c4iw_alloc_context()), side effects are
expected.

On the kernel side, this structure was modified in kernel v3.15-rc1
by the following commit:

  Commit 05eb23893c2cf9502a9cec0c32e7f1d1ed2895c8
  Author: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
  Date:   Fri Mar 14 21:52:08 2014 +0530

      cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes

If a boundary check is implemented on the kernel side, the x86_64
kernel will instead refuse to write past the buffer provided by
i386 userspace and the uverbs will fail.

To fix these issues, this patch adds explicit padding at the end
of the structure so that i386 and other ABIs share the same structure
layout. It also makes c4iw_alloc_context() check for a value in the
padding field, to detect a newer kernel using the field for a future
purpose (the check is only active in debug builds).

With this patch, libcxgb4 will work against both older kernels and
newer patched kernels.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 src/cxgb4-abi.h | 1 +
 src/dev.c       | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/src/cxgb4-abi.h b/src/cxgb4-abi.h
index 23870f66dc0d..0b9f4d99d0e7 100644
--- a/src/cxgb4-abi.h
+++ b/src/cxgb4-abi.h
@@ -38,6 +38,7 @@ struct c4iw_alloc_ucontext_resp {
 	struct ibv_get_context_resp ibv_resp;
 	__u64 status_page_key;
 	__u32 status_page_size;
+	__u32 reserved;
 };
 
 struct c4iw_alloc_pd_resp {
diff --git a/src/dev.c b/src/dev.c
index 3236e6b2db6d..f66df71105e5 100644
--- a/src/dev.c
+++ b/src/dev.c
@@ -125,10 +125,15 @@ static struct ibv_context *c4iw_alloc_context(struct ibv_device *ibdev,
 	context->ibv_ctx.cmd_fd = cmd_fd;
 
 	resp.status_page_size = 0;
+	resp.reserved = 0;
 	if (ibv_cmd_get_context(&context->ibv_ctx, &cmd, sizeof cmd,
 				&resp.ibv_resp, sizeof resp))
 		goto err_free;
 
+	if (resp.reserved)
+		PDBG("%s c4iw_alloc_ucontext_resp reserved field modified by kernel\n",
+		     __FUNCTION__);
+
 	context->status_page_size = resp.status_page_size;
 	if (resp.status_page_size) {
 		context->status_page = mmap(NULL, resp.status_page_size,
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH libmlx5] abi: adds explicit padding on mlx5_create_cq and mlx5_create_srq
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
                                     ` (4 preceding siblings ...)
       [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
@ 2014-05-04 21:41                   ` Yann Droneaud
  2014-05-05  9:01                   ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
  6 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:41 UTC (permalink / raw)
  To: Eli Cohen; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yann Droneaud

The i386 ABI disagrees with most other ABIs regarding the alignment
of data types larger than 4 bytes: on most ABIs, padding must be
added at the end of such structures, while it is not required
on i386.

Such an ABI disagreement will make an x86_64 kernel try to read
past a buffer provided by an i386 binary, as the latter will
not have the expected padding for struct mlx5_create_cq and
struct mlx5_create_srq.

On the kernel side, these structures were added in kernel v3.11-rc1
by the following commit:

  Commit e126ba97dba9edeb6fafa3665b5f8497fc9cdf8c
  Author: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  Date:   Sun Jul 7 17:25:49 2013 +0300

      mlx5: Add driver for Mellanox Connect-IB adapters

If a future kernel uses the padding for an extension, an unpatched
x86_64 kernel might read garbage, as it would read past the buffer
provided by i386 userspace.

On the other hand, if a boundary check is implemented on the kernel
side, the x86_64 kernel will refuse to read past the i386
userspace provided buffer for struct mlx5_create_cq and
struct mlx5_create_srq, making the uverbs fail.

To address all these issues, this patch adds explicit padding
at the end of the structures and initializes it, so that i386 and
other ABIs share the same structure layout.

With this patch, libmlx5 will work against both older kernels and
newer patched kernels.

Link: http://marc.info/?i=cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
---
 src/mlx5-abi.h | 2 ++
 src/verbs.c    | 6 ++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/mlx5-abi.h b/src/mlx5-abi.h
index 6f98e62c59d5..980b24910403 100644
--- a/src/mlx5-abi.h
+++ b/src/mlx5-abi.h
@@ -83,6 +83,7 @@ struct mlx5_create_cq {
 	__u64				buf_addr;
 	__u64				db_addr;
 	__u32				cqe_size;
+	__u32				reserved;
 };
 
 struct mlx5_create_cq_resp {
@@ -95,6 +96,7 @@ struct mlx5_create_srq {
 	__u64				buf_addr;
 	__u64				db_addr;
 	__u32				flags;
+	__u32				reserved;
 };
 
 struct mlx5_create_srq_resp {
diff --git a/src/verbs.c b/src/verbs.c
index 7201e94925c5..1de8692e5264 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -261,7 +261,6 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
 		return NULL;
 	}
 
-	memset(&cmd, 0, sizeof cmd);
 	cq->cons_index = 0;
 
 	if (mlx5_spinlock_init(&cq->lock))
@@ -307,6 +306,7 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
 	cmd.buf_addr = (uintptr_t) cq->buf_a.buf;
 	cmd.db_addr  = (uintptr_t) cq->dbrec;
 	cmd.cqe_size = cqe_sz;
+	cmd.reserved = 0;
 
 	ret = ibv_cmd_create_cq(context, ncqe - 1, channel, comp_vector,
 				&cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd,
@@ -442,7 +442,6 @@ struct ibv_srq *mlx5_create_srq(struct ibv_pd *pd,
 	}
 	ibsrq = &srq->srq;
 
-	memset(&cmd, 0, sizeof cmd);
 	if (mlx5_spinlock_init(&srq->lock)) {
 		fprintf(stderr, "%s-%d:\n", __func__, __LINE__);
 		goto err;
@@ -490,6 +489,9 @@ struct ibv_srq *mlx5_create_srq(struct ibv_pd *pd,
 	srq->wq_sig = srq_sig_enabled();
 	if (srq->wq_sig)
 		cmd.flags = MLX5_SRQ_FLAG_SIGNATURE;
+	else
+		cmd.flags = 0;
+	cmd.reserved = 0;
 
 	attr->attr.max_sge = srq->max_gs;
 	pthread_mutex_lock(&ctx->srq_table_mutex);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext()
  2014-05-02 23:56             ` Dan Carpenter
  2014-05-04 21:21               ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
@ 2014-05-04 21:46               ` Yann Droneaud
  1 sibling, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-04 21:46 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Steve Wise, Roland Dreier, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors, netdev, davem, roland, dm, leedom, santosh,
	kumaras, nirranjan, hariprasad, Steve Wise

Hi,

On Saturday, 3 May 2014 at 02:56 +0300, Dan Carpenter wrote:
> On Fri, Mar 28, 2014 at 11:27:48AM +0100, Yann Droneaud wrote:
> > Unfortunately, it's not the only structure which has this problem. I'm
> > currently preparing a report on this issue for this driver (cxgb4) and
> > another.
> > 
> 
> This information leak is still present in linux-next.  These days we
> count those things as security vulnerabilities with CVEs and everything.
> 
> When is it likely to get fixed?
> 

Thanks for the reminder.

The patches were waiting for some time in my patch queue; I failed to
provide them earlier. I've taken time during the weekend to
complete them. They should be available on linux-rdma@vger.kernel.org.
See http://marc.info/?i=cover.1399216475.git.ydroneaud@opteya.com

Regards.

-- 
Yann Droneaud
OPTEYA



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH for v3.15 0/4] uverbs ABI fixes
       [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
                                     ` (5 preceding siblings ...)
  2014-05-04 21:41                   ` [PATCH libmlx5] abi: adds explicit padding on mlx5_create_cq and mlx5_create_srq Yann Droneaud
@ 2014-05-05  9:01                   ` Yann Droneaud
  6 siblings, 0 replies; 25+ messages in thread
From: Yann Droneaud @ 2014-05-05  9:01 UTC (permalink / raw)
  To: Eli Cohen, Steve Wise, Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dan Carpenter

Hi,

On Sunday, 4 May 2014 at 23:21 +0200, Yann Droneaud wrote:

> Please find 4 patches which fix some issues regarding missing explicit
> padding at end of structure exchanged between kernel and userspace.
> 

I've reviewed all the other drivers and identified the following
structures as part of the ABI:

cxgb3/iw_cxgb3.o     struct iwch_create_cq_req
cxgb3/iw_cxgb3.o     struct iwch_create_cq_resp
cxgb3/iw_cxgb3.o     struct iwch_create_qp_resp
cxgb3/iw_cxgb3.o     struct iwch_reg_user_mr_resp

cxgb4/iw_cxgb4.o     struct c4iw_alloc_ucontext_resp
cxgb4/iw_cxgb4.o     struct c4iw_create_cq_resp
cxgb4/iw_cxgb4.o     struct c4iw_create_qp_resp

ehca/ib_ehca.o       struct ehca_create_cq_resp
ehca/ib_ehca.o       struct ehca_create_qp_resp
ehca/ib_ehca.o       struct ipzu_queue_resp

mlx4/mlx4_ib.o       struct mlx4_ib_alloc_ucontext_resp
mlx4/mlx4_ib.o       struct mlx4_ib_alloc_ucontext_resp_v3
mlx4/mlx4_ib.o       struct mlx4_ib_create_cq
mlx4/mlx4_ib.o       struct mlx4_ib_create_qp
mlx4/mlx4_ib.o       struct mlx4_ib_create_srq
mlx4/mlx4_ib.o       struct mlx4_ib_resize_cq

mlx5/mlx5_ib.o       struct mlx5_ib_alloc_pd_resp
mlx5/mlx5_ib.o       struct mlx5_ib_alloc_ucontext_req_v2
mlx5/mlx5_ib.o       struct mlx5_ib_alloc_ucontext_resp
mlx5/mlx5_ib.o       struct mlx5_ib_create_cq
mlx5/mlx5_ib.o       struct mlx5_ib_create_qp
mlx5/mlx5_ib.o       struct mlx5_ib_create_qp_resp
mlx5/mlx5_ib.o       struct mlx5_ib_create_srq
mlx5/mlx5_ib.o       struct mlx5_ib_resize_cq

mthca/ib_mthca.o     struct mthca_alloc_ucontext_resp
mthca/ib_mthca.o     struct mthca_create_cq
mthca/ib_mthca.o     struct mthca_create_qp
mthca/ib_mthca.o     struct mthca_create_srq
mthca/ib_mthca.o     struct mthca_reg_mr
mthca/ib_mthca.o     struct mthca_resize_cq

nes/iw_nes.o         struct nes_alloc_pd_resp
nes/iw_nes.o         struct nes_alloc_ucontext_req
nes/iw_nes.o         struct nes_alloc_ucontext_resp
nes/iw_nes.o         struct nes_create_cq_req
nes/iw_nes.o         struct nes_create_cq_resp
nes/iw_nes.o         struct nes_create_qp_req
nes/iw_nes.o         struct nes_create_qp_resp
nes/iw_nes.o         struct nes_mem_reg_req

ocrdma/ocrdma.o      struct ocrdma_alloc_pd_uresp
ocrdma/ocrdma.o      struct ocrdma_alloc_ucontext_resp
ocrdma/ocrdma.o      struct ocrdma_create_cq_ureq
ocrdma/ocrdma.o      struct ocrdma_create_cq_uresp
ocrdma/ocrdma.o      struct ocrdma_create_qp_ureq
ocrdma/ocrdma.o      struct ocrdma_create_qp_uresp
ocrdma/ocrdma.o      struct ocrdma_create_srq_uresp

usnic/usnic_verbs.o  struct usnic_ib_create_qp_cmd
usnic/usnic_verbs.o  struct usnic_ib_create_qp_resp
usnic/usnic_verbs.o  struct usnic_transport_spec

It seems that amso1100/iw_c2.o, ipath/ib_ipath.o and qib/ib_qib.o don't
use structures to exchange data with userspace: they use single values,
either u32 or u64.

So using pahole I've found issues in mlx5 and cxgb4 only.

> These make the i386 userspace libraries and the x86_64 kernel disagree
> about the size of the structures.
> 
> Additionally, as reported by Dan Carpenter, in one case stack information
> can be leaked by the kernel to userspace because the implicit padding is
> not initialized.
> 
> Unfortunately, the data structures cannot be fixed alone, as that would break
> existing applications. So, to remain compatible with i386 libraries, some
> provider (hw) functions are modified to use the input length to infer the
> expected command format and to check the content of the field reserved for
> future use. Others are modified to not write the padding field in the
> response, so that the kernel gracefully handles i386 userspace on x86_64.
> 
> For full coherency, patches against the userspace libraries (libcxgb4 and
> libmlx5) will be submitted as a followup to update the data structures on
> the userspace side.
> 

BTW, as I don't have access to the hardware, the patches are not tested
against real hardware (I only have HCAs handled by the mlx4 and qib
drivers).

Regards.

-- 
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
       [not found]                     ` <2236129ab4aa1ad1562858f363fb6fef0d6bc93b.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
@ 2014-05-05 15:06                       ` Steve Wise
  2014-05-05 16:59                         ` Yann Droneaud
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Wise @ 2014-05-05 15:06 UTC (permalink / raw)
  To: 'Yann Droneaud', 'Steve Wise',
	'Roland Dreier'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

<snip>

> diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
> index a94a3e12c349..9ba01d524f51 100644
> --- a/drivers/infiniband/hw/cxgb4/provider.c
> +++ b/drivers/infiniband/hw/cxgb4/provider.c
> @@ -122,7 +122,7 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
>  	INIT_LIST_HEAD(&context->mmaps);
>  	spin_lock_init(&context->mmap_lock);
> 
> -	if (udata->outlen < sizeof(uresp)) {
> +	if (udata->outlen < sizeof(uresp) - sizeof(uresp.reserved)) {
>  		if (!warned++)
>  			pr_err(MOD "Warning - downlevel libcxgb4 (non-fatal), device status page disabled.");
>  		rhp->rdev.flags |= T4_STATUS_PAGE_DISABLED;
> @@ -134,13 +134,13 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
>  		}
> 
>  		uresp.status_page_size = PAGE_SIZE;
> 
>  		spin_lock(&context->mmap_lock);
>  		uresp.status_page_key = context->key;
>  		context->key += PAGE_SIZE;
>  		spin_unlock(&context->mmap_lock);
> 
> -		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> +		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp) - sizeof(uresp.reserved));
>  		if (ret)
>  			goto err_mm;
> 

This chunk doesn't apply.  Not sure why.  I can use -F 6 with patch, but it should apply
cleanly. Can you please respin this against 89ca3b8 Linux 3.15-rc4?

I'll test the cxgb4 stuff today. 

Thanks,

Steve.


* Re: [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
  2014-05-05 15:06                       ` Steve Wise
@ 2014-05-05 16:59                         ` Yann Droneaud
       [not found]                           ` <1399309159.2957.19.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Droneaud @ 2014-05-05 16:59 UTC (permalink / raw)
  To: Steve Wise, 'Steve Wise'
  Cc: 'Roland Dreier', linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi,

On Monday, May 5, 2014 at 10:06 -0500, Steve Wise wrote:
> <snip>
> 
> > diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
> > index a94a3e12c349..9ba01d524f51 100644
> > --- a/drivers/infiniband/hw/cxgb4/provider.c
> > +++ b/drivers/infiniband/hw/cxgb4/provider.c
> > @@ -122,7 +122,7 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
> >  	INIT_LIST_HEAD(&context->mmaps);
> >  	spin_lock_init(&context->mmap_lock);
> > 
> > -	if (udata->outlen < sizeof(uresp)) {
> > +	if (udata->outlen < sizeof(uresp) - sizeof(uresp.reserved)) {
> >  		if (!warned++)
> >  			pr_err(MOD "Warning - downlevel libcxgb4 (non-fatal), device status page disabled.");
> >  		rhp->rdev.flags |= T4_STATUS_PAGE_DISABLED;
> > @@ -134,13 +134,13 @@ static struct ib_ucontext *c4iw_alloc_ucontext(struct ib_device *ibdev,
> >  		}
> > 
> >  		uresp.status_page_size = PAGE_SIZE;
> > 
> >  		spin_lock(&context->mmap_lock);
> >  		uresp.status_page_key = context->key;
> >  		context->key += PAGE_SIZE;
> >  		spin_unlock(&context->mmap_lock);
> > 
> > -		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
> > +		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp) - sizeof(uresp.reserved));
> >  		if (ret)
> >  			goto err_mm;
> > 
> 
> This chunk doesn't apply.  Not sure why.  I can use -F 6 with patch, but it should apply
> cleanly. Can you please respin this against 89ca3b8 Linux 3.15-rc4?
> 

I hand-edited it before sending, but I tested it with git am against
linux-next (as of today).

Anyway, I'm rolling an updated patchset. Sorry for the inconvenience.

> I'll test the cxgb4 stuff today. 
> 

Thanks.

-- 
Yann Droneaud
OPTEYA



* RE: [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
       [not found]                           ` <1399309159.2957.19.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-05-05 17:01                             ` Steve Wise
  0 siblings, 0 replies; 25+ messages in thread
From: Steve Wise @ 2014-05-05 17:01 UTC (permalink / raw)
  To: 'Yann Droneaud', 'Steve Wise'
  Cc: 'Roland Dreier', linux-rdma-u79uwXL29TY76Z2rM5mHXA

> > This chunk doesn't apply.  Not sure why.  I can use -F 6 with patch, but it should apply
> > cleanly. Can you please respin this against 89ca3b8 Linux 3.15-rc4?
> >
> 
> I've hand-edited it before sending, but I've tested it with git am
> against linux-next (as of today).
> 
> Anyway, I'm rolling an updated patchset. Sorry for the inconvenience.
>

No problem.  I hand-applied it and tested these two patches plus the libcxgb4 changes, and they seem to work fine.
 
Steve.


* Re: [PATCH libcxgb4 0/2] uverbs ABI fixes
       [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
  2014-05-04 21:31                     ` [PATCH libcxgb4 1/2] kernel abi: adds explicit padding in struct c4iw_create_cq_resp Yann Droneaud
  2014-05-04 21:31                     ` [PATCH libcxgb4 2/2] kernel abi: adds explicit padding in struct c4iw_alloc_ucontext_resp Yann Droneaud
@ 2014-05-05 18:14                     ` Steve Wise
  2 siblings, 0 replies; 25+ messages in thread
From: Steve Wise @ 2014-05-05 18:14 UTC (permalink / raw)
  To: Yann Droneaud; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Applied.  Thanks!

Steve.



end of thread, other threads:[~2014-05-05 18:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-14 16:22 [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4 Hariprasad Shenai
2014-03-14 16:22 ` [PATCH net-next 2/2] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes Hariprasad Shenai
     [not found]   ` <1394814128-8815-3-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2014-03-26 15:49     ` Yann Droneaud
     [not found]       ` <1395848977.3297.15.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-03-26 15:58         ` Steve Wise
2014-03-28  8:24       ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Dan Carpenter
2014-03-28 10:27         ` Yann Droneaud
     [not found]           ` <1396002468.3297.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-03-28 10:58             ` David Laight
2014-05-02 23:56             ` Dan Carpenter
2014-05-04 21:21               ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
2014-05-04 21:21                 ` [PATCH 2/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_srq Yann Droneaud
     [not found]                 ` <cover.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
2014-05-04 21:21                   ` [PATCH 1/4] RDMA/mlx5: add missing padding at end of struct mlx5_ib_create_cq Yann Droneaud
2014-05-04 21:21                   ` [PATCH 3/4] RDMA/cxgb4: add missing padding at end of struct c4iw_create_cq_resp Yann Droneaud
2014-05-04 21:21                   ` [PATCH 4/4] RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp Yann Droneaud
     [not found]                     ` <2236129ab4aa1ad1562858f363fb6fef0d6bc93b.1399216475.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
2014-05-05 15:06                       ` Steve Wise
2014-05-05 16:59                         ` Yann Droneaud
     [not found]                           ` <1399309159.2957.19.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-05-05 17:01                             ` Steve Wise
2014-05-04 21:31                   ` [PATCH libcxgb4 0/2] uverbs ABI fixes Yann Droneaud
     [not found]                   ` <cover.1399235229.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
2014-05-04 21:31                     ` [PATCH libcxgb4 1/2] kernel abi: adds explicit padding in struct c4iw_create_cq_resp Yann Droneaud
2014-05-04 21:31                     ` [PATCH libcxgb4 2/2] kernel abi: adds explicit padding in struct c4iw_alloc_ucontext_resp Yann Droneaud
2014-05-05 18:14                     ` [PATCH libcxgb4 0/2] uverbs ABI fixes Steve Wise
2014-05-04 21:41                   ` [PATCH libmlx5] abi: adds explicit padding on mlx5_create_cq and mlx5_create_srq Yann Droneaud
2014-05-05  9:01                   ` [PATCH for v3.15 0/4] uverbs ABI fixes Yann Droneaud
2014-05-04 21:46               ` [patch] RDMA/cxgb4: info leak in c4iw_alloc_ucontext() Yann Droneaud
     [not found] ` <1394814128-8815-1-git-send-email-hariprasad-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2014-03-14 16:22   ` [PATCH net-next 1/2] cxgb4/iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice Hariprasad Shenai
2014-03-15  2:44   ` [PATCH net-next 0/2] Doorbell drop Avoidance Bug fix for iw_cxgb4 David Miller

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for the
NNTP newsgroup(s).