Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: 4.9-rc7: (forcedeth?) BUG: sleeping function called from invalid context at kernel/irq/manage.c:110
From: Meelis Roos @ 2016-11-29 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Linux Kernel Mailing List, Network Development
In-Reply-To: <CA+55aFwUFPLbcqpoccKXHsGj_80ANaMwCSgwTUqgxMFmNRHkaQ@mail.gmail.com>

> On Nov 29, 2016 11:58 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
> >
> > nv_do_nic_poll() is simply buggy and needs a fix.
> >
> > synchronize_irq() can sleep.
> 
> Yes, but why did it start showing up now? None of this has changed as far as I can see?

Found one thing that changed - compiler from 6.2.0-9 to 6.2.1-5 and 
binutils too in debian unstable (explicitly upgraded to get fix to the 
binutils bug that broke 64-bit kernels).

> Is it just timing and the transmit queue being busy? Perhaps due to the sheer size of messages? Meelis does seem to have a ton of debugging enabled..

Well, the dmesg is not too verbose but when I debugged some earlier 
problem, netconsole and many other debugging options were left on.

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* [PATCH] netns: avoid disabling irq for netns id
From: Paul Moore @ 2016-11-29 22:11 UTC (permalink / raw)
  To: netdev, linux-audit; +Cc: Cong Wang

From: Paul Moore <paul@paul-moore.com>

Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
id") now that we've fixed some audit multicast issues that caused
problems with original attempt.  Additional information, and history,
can be found in the links below:

 * https://github.com/linux-audit/audit-kernel/issues/22
 * https://github.com/linux-audit/audit-kernel/issues/23

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
---
 net/core/net_namespace.c |   35 +++++++++++++++--------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2c2eb1b..10608dd 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -213,14 +213,13 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id);
  */
 int peernet2id_alloc(struct net *net, struct net *peer)
 {
-	unsigned long flags;
 	bool alloc;
 	int id;
 
-	spin_lock_irqsave(&net->nsid_lock, flags);
+	spin_lock_bh(&net->nsid_lock);
 	alloc = atomic_read(&peer->count) == 0 ? false : true;
 	id = __peernet2id_alloc(net, peer, &alloc);
-	spin_unlock_irqrestore(&net->nsid_lock, flags);
+	spin_unlock_bh(&net->nsid_lock);
 	if (alloc && id >= 0)
 		rtnl_net_notifyid(net, RTM_NEWNSID, id);
 	return id;
@@ -230,12 +229,11 @@ EXPORT_SYMBOL(peernet2id_alloc);
 /* This function returns, if assigned, the id of a peer netns. */
 int peernet2id(struct net *net, struct net *peer)
 {
-	unsigned long flags;
 	int id;
 
-	spin_lock_irqsave(&net->nsid_lock, flags);
+	spin_lock_bh(&net->nsid_lock);
 	id = __peernet2id(net, peer);
-	spin_unlock_irqrestore(&net->nsid_lock, flags);
+	spin_unlock_bh(&net->nsid_lock);
 	return id;
 }
 
@@ -249,18 +247,17 @@ bool peernet_has_id(struct net *net, struct net *peer)
 
 struct net *get_net_ns_by_id(struct net *net, int id)
 {
-	unsigned long flags;
 	struct net *peer;
 
 	if (id < 0)
 		return NULL;
 
 	rcu_read_lock();
-	spin_lock_irqsave(&net->nsid_lock, flags);
+	spin_lock_bh(&net->nsid_lock);
 	peer = idr_find(&net->netns_ids, id);
 	if (peer)
 		get_net(peer);
-	spin_unlock_irqrestore(&net->nsid_lock, flags);
+	spin_unlock_bh(&net->nsid_lock);
 	rcu_read_unlock();
 
 	return peer;
@@ -404,17 +401,17 @@ static void cleanup_net(struct work_struct *work)
 		for_each_net(tmp) {
 			int id;
 
-			spin_lock_irq(&tmp->nsid_lock);
+			spin_lock_bh(&tmp->nsid_lock);
 			id = __peernet2id(tmp, net);
 			if (id >= 0)
 				idr_remove(&tmp->netns_ids, id);
-			spin_unlock_irq(&tmp->nsid_lock);
+			spin_unlock_bh(&tmp->nsid_lock);
 			if (id >= 0)
 				rtnl_net_notifyid(tmp, RTM_DELNSID, id);
 		}
-		spin_lock_irq(&net->nsid_lock);
+		spin_lock_bh(&net->nsid_lock);
 		idr_destroy(&net->netns_ids);
-		spin_unlock_irq(&net->nsid_lock);
+		spin_unlock_bh(&net->nsid_lock);
 
 	}
 	rtnl_unlock();
@@ -542,7 +539,6 @@ static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh)
 {
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tb[NETNSA_MAX + 1];
-	unsigned long flags;
 	struct net *peer;
 	int nsid, err;
 
@@ -563,15 +559,15 @@ static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh)
 	if (IS_ERR(peer))
 		return PTR_ERR(peer);
 
-	spin_lock_irqsave(&net->nsid_lock, flags);
+	spin_lock_bh(&net->nsid_lock);
 	if (__peernet2id(net, peer) >= 0) {
-		spin_unlock_irqrestore(&net->nsid_lock, flags);
+		spin_unlock_bh(&net->nsid_lock);
 		err = -EEXIST;
 		goto out;
 	}
 
 	err = alloc_netid(net, peer, nsid);
-	spin_unlock_irqrestore(&net->nsid_lock, flags);
+	spin_unlock_bh(&net->nsid_lock);
 	if (err >= 0) {
 		rtnl_net_notifyid(net, RTM_NEWNSID, err);
 		err = 0;
@@ -693,11 +689,10 @@ static int rtnl_net_dumpid(struct sk_buff *skb, struct netlink_callback *cb)
 		.idx = 0,
 		.s_idx = cb->args[0],
 	};
-	unsigned long flags;
 
-	spin_lock_irqsave(&net->nsid_lock, flags);
+	spin_lock_bh(&net->nsid_lock);
 	idr_for_each(&net->netns_ids, rtnl_net_dumpid_one, &net_cb);
-	spin_unlock_irqrestore(&net->nsid_lock, flags);
+	spin_unlock_bh(&net->nsid_lock);
 
 	cb->args[0] = net_cb.idx;
 	return skb->len;

^ permalink raw reply related

* Re: [PATCH] netns: avoid disabling irq for netns id
From: Paul Moore @ 2016-11-29 22:14 UTC (permalink / raw)
  To: netdev, linux-audit; +Cc: Cong Wang
In-Reply-To: <148045748887.22539.3188295553967836703.stgit@sifl>

On Tue, Nov 29, 2016 at 5:11 PM, Paul Moore <pmoore@redhat.com> wrote:
> From: Paul Moore <paul@paul-moore.com>
>
> Bring back commit bc51dddf98c9 ("netns: avoid disabling irq for netns
> id") now that we've fixed some audit multicast issues that caused
> problems with original attempt.  Additional information, and history,
> can be found in the links below:
>
>  * https://github.com/linux-audit/audit-kernel/issues/22
>  * https://github.com/linux-audit/audit-kernel/issues/23
>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> Signed-off-by: Paul Moore <paul@paul-moore.com>
> ---
>  net/core/net_namespace.c |   35 +++++++++++++++--------------------
>  1 file changed, 15 insertions(+), 20 deletions(-)

Cong Wang, I added your sign-off to the patch since you were the
original author, if you would prefer I leave it off or want it changed
just let me know.

David/netdev, since this depends on a bunch of audit changes (about
nine patches) I went ahead and just merged this into the audit#next
branch.  If you have a problem with that let me know.

-- 
paul moore
security @ redhat

^ permalink raw reply

* [PATCH net-next 0/7] Mellanox 100G mlx5 updates 2016-11-29
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed

Hi Dave,

The following series from Tariq and Roi, provides some critical fixes
and updates for the mlx5e driver.

>From Tariq: 
 - Fix driver coherent memory huge allocation issues by fragmenting
   completion queues, in a way that is transparent to the netdev driver by
   providing a new buffer type "mlx5_frag_buf" with the same access API.
 - Create UMR MKey per RQ to have better scalability.

>From Roi:
 - Some fixes for the encap-decap support and tc flower added lately to the
   mlx5e driver.

This series was generated against commit:
31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")

Thanks,
Saeed.

Roi Dayan (4):
  net/mlx5e: Remove redundant hashtable lookup in configure flower
  net/mlx5e: Correct cleanup order when deleting offloaded TC rules
  net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance
  net/mlx5e: Remove flow encap entry in the correct place

Tariq Toukan (3):
  net/mlx5e: Implement Fragmented Work Queue (WQ)
  net/mlx5e: Move function mlx5e_create_umr_mkey
  net/mlx5e: Create UMR MKey per RQ

 drivers/net/ethernet/mellanox/mlx5/core/alloc.c    |  66 +++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  14 +--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 121 +++++++++++----------
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  82 ++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/wq.c       |  26 +++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.h       |  18 ++-
 include/linux/mlx5/driver.h                        |  11 ++
 8 files changed, 215 insertions(+), 135 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next 1/7] net/mlx5e: Implement Fragmented Work Queue (WQ)
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

Add new type of struct mlx5_frag_buf which is used to allocate fragmented
buffers rather than contiguous, and make the Completion Queues (CQs) use
it as they are big (default of 2MB per CQ in Striding RQ).

This fixes the failures of type:
"mlx5e_open_locked: mlx5e_open_channels failed, -12"
due to dma_zalloc_coherent insufficient contiguous coherent memory to
satisfy the driver's request when the user tries to setup more or larger
rings.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/alloc.c   | 66 +++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.c      | 26 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/wq.h      | 18 +++++--
 include/linux/mlx5/driver.h                       | 11 ++++
 6 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
index 2c6e3c7..bc8357d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
@@ -106,6 +106,63 @@ void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf)
 }
 EXPORT_SYMBOL_GPL(mlx5_buf_free);
 
+int mlx5_frag_buf_alloc_node(struct mlx5_core_dev *dev, int size,
+			     struct mlx5_frag_buf *buf, int node)
+{
+	int i;
+
+	buf->size = size;
+	buf->npages = 1 << get_order(size);
+	buf->page_shift = PAGE_SHIFT;
+	buf->frags = kcalloc(buf->npages, sizeof(struct mlx5_buf_list),
+			     GFP_KERNEL);
+	if (!buf->frags)
+		goto err_out;
+
+	for (i = 0; i < buf->npages; i++) {
+		struct mlx5_buf_list *frag = &buf->frags[i];
+		int frag_sz = min_t(int, size, PAGE_SIZE);
+
+		frag->buf = mlx5_dma_zalloc_coherent_node(dev, frag_sz,
+							  &frag->map, node);
+		if (!frag->buf)
+			goto err_free_buf;
+		if (frag->map & ((1 << buf->page_shift) - 1)) {
+			dma_free_coherent(&dev->pdev->dev, frag_sz,
+					  buf->frags[i].buf, buf->frags[i].map);
+			mlx5_core_warn(dev, "unexpected map alignment: 0x%p, page_shift=%d\n",
+				       (void *)frag->map, buf->page_shift);
+			goto err_free_buf;
+		}
+		size -= frag_sz;
+	}
+
+	return 0;
+
+err_free_buf:
+	while (--i)
+		dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->frags[i].buf,
+				  buf->frags[i].map);
+	kfree(buf->frags);
+err_out:
+	return -ENOMEM;
+}
+
+void mlx5_frag_buf_free(struct mlx5_core_dev *dev, struct mlx5_frag_buf *buf)
+{
+	int size = buf->size;
+	int i;
+
+	for (i = 0; i < buf->npages; i++) {
+		int frag_sz = min_t(int, size, PAGE_SIZE);
+
+		dma_free_coherent(&dev->pdev->dev, frag_sz, buf->frags[i].buf,
+				  buf->frags[i].map);
+		size -= frag_sz;
+	}
+	kfree(buf->frags);
+}
+
 static struct mlx5_db_pgdir *mlx5_alloc_db_pgdir(struct mlx5_core_dev *dev,
 						 int node)
 {
@@ -230,3 +287,12 @@ void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas)
 	}
 }
 EXPORT_SYMBOL_GPL(mlx5_fill_page_array);
+
+void mlx5_fill_page_frag_array(struct mlx5_frag_buf *buf, __be64 *pas)
+{
+	int i;
+
+	for (i = 0; i < buf->npages; i++)
+		pas[i] = cpu_to_be64(buf->frags[i].map);
+}
+EXPORT_SYMBOL_GPL(mlx5_fill_page_frag_array);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 442dbc3..f16f7fb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -286,7 +286,7 @@ struct mlx5e_cq {
 	u16                        decmprs_wqe_counter;
 
 	/* control */
-	struct mlx5_wq_ctrl        wq_ctrl;
+	struct mlx5_frag_wq_ctrl   wq_ctrl;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_rq;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6b492ca..ba25cd3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1201,7 +1201,7 @@ static int mlx5e_create_cq(struct mlx5e_channel *c,
 
 static void mlx5e_destroy_cq(struct mlx5e_cq *cq)
 {
-	mlx5_wq_destroy(&cq->wq_ctrl);
+	mlx5_cqwq_destroy(&cq->wq_ctrl);
 }
 
 static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
@@ -1218,7 +1218,7 @@ static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
 	int err;
 
 	inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
-		sizeof(u64) * cq->wq_ctrl.buf.npages;
+		sizeof(u64) * cq->wq_ctrl.frag_buf.npages;
 	in = mlx5_vzalloc(inlen);
 	if (!in)
 		return -ENOMEM;
@@ -1227,15 +1227,15 @@ static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
 
 	memcpy(cqc, param->cqc, sizeof(param->cqc));
 
-	mlx5_fill_page_array(&cq->wq_ctrl.buf,
-			     (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas));
+	mlx5_fill_page_frag_array(&cq->wq_ctrl.frag_buf,
+				  (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas));
 
 	mlx5_vector2eqn(mdev, param->eq_ix, &eqn, &irqn_not_used);
 
 	MLX5_SET(cqc,   cqc, cq_period_mode, param->cq_period_mode);
 	MLX5_SET(cqc,   cqc, c_eqn,         eqn);
 	MLX5_SET(cqc,   cqc, uar_page,      mcq->uar->index);
-	MLX5_SET(cqc,   cqc, log_page_size, cq->wq_ctrl.buf.page_shift -
+	MLX5_SET(cqc,   cqc, log_page_size, cq->wq_ctrl.frag_buf.page_shift -
 					    MLX5_ADAPTER_PAGE_SHIFT);
 	MLX5_SET64(cqc, cqc, dbr_addr,      cq->wq_ctrl.db.dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.c b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
index 821a087..921673c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.c
@@ -101,13 +101,15 @@ int mlx5_wq_cyc_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 
 int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		     void *cqc, struct mlx5_cqwq *wq,
-		     struct mlx5_wq_ctrl *wq_ctrl)
+		     struct mlx5_frag_wq_ctrl *wq_ctrl)
 {
 	int err;
 
-	wq->log_stride = 6 + MLX5_GET(cqc, cqc, cqe_sz);
-	wq->log_sz = MLX5_GET(cqc, cqc, log_cq_size);
-	wq->sz_m1 = (1 << wq->log_sz) - 1;
+	wq->log_stride	= 6 + MLX5_GET(cqc, cqc, cqe_sz);
+	wq->log_sz	= MLX5_GET(cqc, cqc, log_cq_size);
+	wq->sz_m1	= (1 << wq->log_sz) - 1;
+	wq->log_frag_strides = PAGE_SHIFT - wq->log_stride;
+	wq->frag_sz_m1	= (1 << wq->log_frag_strides) - 1;
 
 	err = mlx5_db_alloc_node(mdev, &wq_ctrl->db, param->db_numa_node);
 	if (err) {
@@ -115,14 +117,16 @@ int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		return err;
 	}
 
-	err = mlx5_buf_alloc_node(mdev, mlx5_cqwq_get_byte_size(wq),
-				  &wq_ctrl->buf, param->buf_numa_node);
+	err = mlx5_frag_buf_alloc_node(mdev, mlx5_cqwq_get_byte_size(wq),
+				       &wq_ctrl->frag_buf,
+				       param->buf_numa_node);
 	if (err) {
-		mlx5_core_warn(mdev, "mlx5_buf_alloc_node() failed, %d\n", err);
+		mlx5_core_warn(mdev, "mlx5_frag_buf_alloc_node() failed, %d\n",
+			       err);
 		goto err_db_free;
 	}
 
-	wq->buf = wq_ctrl->buf.direct.buf;
+	wq->frag_buf = wq_ctrl->frag_buf;
 	wq->db  = wq_ctrl->db.db;
 
 	wq_ctrl->mdev = mdev;
@@ -184,3 +188,9 @@ void mlx5_wq_destroy(struct mlx5_wq_ctrl *wq_ctrl)
 	mlx5_buf_free(wq_ctrl->mdev, &wq_ctrl->buf);
 	mlx5_db_free(wq_ctrl->mdev, &wq_ctrl->db);
 }
+
+void mlx5_cqwq_destroy(struct mlx5_frag_wq_ctrl *wq_ctrl)
+{
+	mlx5_frag_buf_free(wq_ctrl->mdev, &wq_ctrl->frag_buf);
+	mlx5_db_free(wq_ctrl->mdev, &wq_ctrl->db);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.h b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
index 6c2a8f9..d8afed8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
@@ -47,6 +47,12 @@ struct mlx5_wq_ctrl {
 	struct mlx5_db		db;
 };
 
+struct mlx5_frag_wq_ctrl {
+	struct mlx5_core_dev	*mdev;
+	struct mlx5_frag_buf	frag_buf;
+	struct mlx5_db		db;
+};
+
 struct mlx5_wq_cyc {
 	void			*buf;
 	__be32			*db;
@@ -55,12 +61,14 @@ struct mlx5_wq_cyc {
 };
 
 struct mlx5_cqwq {
-	void			*buf;
+	struct mlx5_frag_buf	frag_buf;
 	__be32			*db;
 	u32			sz_m1;
+	u32			frag_sz_m1;
 	u32			cc; /* consumer counter */
 	u8			log_sz;
 	u8			log_stride;
+	u8			log_frag_strides;
 };
 
 struct mlx5_wq_ll {
@@ -81,7 +89,7 @@ u32 mlx5_wq_cyc_get_size(struct mlx5_wq_cyc *wq);
 
 int mlx5_cqwq_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 		     void *cqc, struct mlx5_cqwq *wq,
-		     struct mlx5_wq_ctrl *wq_ctrl);
+		     struct mlx5_frag_wq_ctrl *wq_ctrl);
 u32 mlx5_cqwq_get_size(struct mlx5_cqwq *wq);
 
 int mlx5_wq_ll_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
@@ -90,6 +98,7 @@ int mlx5_wq_ll_create(struct mlx5_core_dev *mdev, struct mlx5_wq_param *param,
 u32 mlx5_wq_ll_get_size(struct mlx5_wq_ll *wq);
 
 void mlx5_wq_destroy(struct mlx5_wq_ctrl *wq_ctrl);
+void mlx5_cqwq_destroy(struct mlx5_frag_wq_ctrl *wq_ctrl);
 
 static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr)
 {
@@ -116,7 +125,10 @@ static inline u32 mlx5_cqwq_get_ci(struct mlx5_cqwq *wq)
 
 static inline void *mlx5_cqwq_get_wqe(struct mlx5_cqwq *wq, u32 ix)
 {
-	return wq->buf + (ix << wq->log_stride);
+	unsigned int frag = (ix >> wq->log_frag_strides);
+
+	return wq->frag_buf.frags[frag].buf +
+		((wq->frag_sz_m1 & ix) << wq->log_stride);
 }
 
 static inline u32 mlx5_cqwq_get_wrap_cnt(struct mlx5_cqwq *wq)
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 68b85ef..0ae5536 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -318,6 +318,13 @@ struct mlx5_buf {
 	u8			page_shift;
 };
 
+struct mlx5_frag_buf {
+	struct mlx5_buf_list	*frags;
+	int			npages;
+	int			size;
+	u8			page_shift;
+};
+
 struct mlx5_eq_tasklet {
 	struct list_head list;
 	struct list_head process_list;
@@ -822,6 +829,9 @@ int mlx5_buf_alloc_node(struct mlx5_core_dev *dev, int size,
 			struct mlx5_buf *buf, int node);
 int mlx5_buf_alloc(struct mlx5_core_dev *dev, int size, struct mlx5_buf *buf);
 void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf);
+int mlx5_frag_buf_alloc_node(struct mlx5_core_dev *dev, int size,
+			     struct mlx5_frag_buf *buf, int node);
+void mlx5_frag_buf_free(struct mlx5_core_dev *dev, struct mlx5_frag_buf *buf);
 struct mlx5_cmd_mailbox *mlx5_alloc_cmd_mailbox_chain(struct mlx5_core_dev *dev,
 						      gfp_t flags, int npages);
 void mlx5_free_cmd_mailbox_chain(struct mlx5_core_dev *dev,
@@ -866,6 +876,7 @@ void mlx5_unregister_debugfs(void);
 int mlx5_eq_init(struct mlx5_core_dev *dev);
 void mlx5_eq_cleanup(struct mlx5_core_dev *dev);
 void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas);
+void mlx5_fill_page_frag_array(struct mlx5_frag_buf *frag_buf, __be64 *pas);
 void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn);
 void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type);
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 5/7] net/mlx5e: Correct cleanup order when deleting offloaded TC rules
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

According to the reverse unwinding principle, on delete time we should
first handle deletion of the steering rule and later handle the vlan
deletion from the eswitch.

Fixes: 8b32580df1cb ("net/mlx5e: Add TC vlan action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index dd6d954..4d71445 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -151,11 +151,11 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 
 	counter = mlx5_flow_rule_counter(rule);
 
+	mlx5_del_flow_rules(rule);
+
 	if (esw && esw->mode == SRIOV_OFFLOADS)
 		mlx5_eswitch_del_vlan_action(esw, attr);
 
-	mlx5_del_flow_rules(rule);
-
 	mlx5_fc_destroy(priv->mdev, counter);
 
 	if (!mlx5e_tc_num_filters(priv) && (priv->fs.tc.t)) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 7/7] net/mlx5e: Remove flow encap entry in the correct place
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

Handling flow encap entry should be inside tc del flow
and is only relevant for offloaded eswitch TC rules.

Fixes: 11a457e9b6c1 ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 43 +++++++++++++------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 3875c1c..f07ef8c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -142,6 +142,24 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 	return mlx5_eswitch_add_offloaded_rule(esw, spec, attr);
 }
 
+static void mlx5e_detach_encap(struct mlx5e_priv *priv,
+			       struct mlx5e_tc_flow *flow) {
+	struct list_head *next = flow->encap.next;
+
+	list_del(&flow->encap);
+	if (list_empty(next)) {
+		struct mlx5_encap_entry *e;
+
+		e = list_entry(next, struct mlx5_encap_entry, flows);
+		if (e->n) {
+			mlx5_encap_dealloc(priv->mdev, e->encap_id);
+			neigh_release(e->n);
+		}
+		hlist_del_rcu(&e->encap_hlist);
+		kfree(e);
+	}
+}
+
 static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 			      struct mlx5e_tc_flow *flow)
 {
@@ -152,8 +170,11 @@ static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
 
 	mlx5_del_flow_rules(flow->rule);
 
-	if (esw && esw->mode == SRIOV_OFFLOADS)
+	if (esw && esw->mode == SRIOV_OFFLOADS) {
 		mlx5_eswitch_del_vlan_action(esw, flow->attr);
+		if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
+			mlx5e_detach_encap(priv, flow);
+	}
 
 	mlx5_fc_destroy(priv->mdev, counter);
 
@@ -973,24 +994,6 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	return err;
 }
 
-static void mlx5e_detach_encap(struct mlx5e_priv *priv,
-			       struct mlx5e_tc_flow *flow) {
-	struct list_head *next = flow->encap.next;
-
-	list_del(&flow->encap);
-	if (list_empty(next)) {
-		struct mlx5_encap_entry *e;
-
-		e = list_entry(next, struct mlx5_encap_entry, flows);
-		if (e->n) {
-			mlx5_encap_dealloc(priv->mdev, e->encap_id);
-			neigh_release(e->n);
-		}
-		hlist_del_rcu(&e->encap_hlist);
-		kfree(e);
-	}
-}
-
 int mlx5e_delete_flower(struct mlx5e_priv *priv,
 			struct tc_cls_flower_offload *f)
 {
@@ -1006,8 +1009,6 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 
 	mlx5e_tc_del_flow(priv, flow);
 
-	if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
-		mlx5e_detach_encap(priv, flow);
 
 	kfree(flow);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/7] net/mlx5e: Remove redundant hashtable lookup in configure flower
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

We will never find a flow with the same cookie as cls_flower always
allocates a new flow and the cookie is the allocated memory address.

Fixes: e3a2b7ed018e ("net/mlx5e: Support offload cls_flower with drop action")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 26 +++++++------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4d06fab..dd6d954 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -915,25 +915,17 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	u32 flow_tag, action;
 	struct mlx5e_tc_flow *flow;
 	struct mlx5_flow_spec *spec;
-	struct mlx5_flow_handle *old = NULL;
-	struct mlx5_esw_flow_attr *old_attr = NULL;
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 
 	if (esw && esw->mode == SRIOV_OFFLOADS)
 		fdb_flow = true;
 
-	flow = rhashtable_lookup_fast(&tc->ht, &f->cookie,
-				      tc->ht_params);
-	if (flow) {
-		old = flow->rule;
-		old_attr = flow->attr;
-	} else {
-		if (fdb_flow)
-			flow = kzalloc(sizeof(*flow) + sizeof(struct mlx5_esw_flow_attr),
-				       GFP_KERNEL);
-		else
-			flow = kzalloc(sizeof(*flow), GFP_KERNEL);
-	}
+	if (fdb_flow)
+		flow = kzalloc(sizeof(*flow) +
+			       sizeof(struct mlx5_esw_flow_attr),
+			       GFP_KERNEL);
+	else
+		flow = kzalloc(sizeof(*flow), GFP_KERNEL);
 
 	spec = mlx5_vzalloc(sizeof(*spec));
 	if (!spec || !flow) {
@@ -970,17 +962,13 @@ int mlx5e_configure_flower(struct mlx5e_priv *priv, __be16 protocol,
 	if (err)
 		goto err_del_rule;
 
-	if (old)
-		mlx5e_tc_del_flow(priv, old, old_attr);
-
 	goto out;
 
 err_del_rule:
 	mlx5_del_flow_rules(flow->rule);
 
 err_free:
-	if (!old)
-		kfree(flow);
+	kfree(flow);
 out:
 	kvfree(spec);
 	return err;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/7] net/mlx5e: Create UMR MKey per RQ
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

In Striding RQ implementation, we used a single UMR
(User-Mode Memory Registration) memory key for all RQs.
When the product of RQs number*size gets high, we hit a
limitation of u16 field size in FW.

Here we move to using a UMR memory key per RQ, so we can
scale to any number of rings, with the maximum buffer
size in each.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 12 ++---
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 12 +----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 53 ++++++++++++----------
 3 files changed, 35 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f16f7fb..63dd639 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -77,9 +77,9 @@
 						 MLX5_MPWRQ_WQE_PAGE_ORDER)
 
 #define MLX5_MTT_OCTW(npages) (ALIGN(npages, 8) / 2)
-#define MLX5E_REQUIRED_MTTS(rqs, wqes)\
-	(rqs * wqes * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
-#define MLX5E_VALID_NUM_MTTS(num_mtts) (MLX5_MTT_OCTW(num_mtts) <= U16_MAX)
+#define MLX5E_REQUIRED_MTTS(wqes)		\
+	(wqes * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
+#define MLX5E_VALID_NUM_MTTS(num_mtts) (MLX5_MTT_OCTW(num_mtts) - 1 <= U16_MAX)
 
 #define MLX5_UMR_ALIGN				(2048)
 #define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD	(128)
@@ -347,7 +347,6 @@ struct mlx5e_rq {
 		struct {
 			struct mlx5e_mpw_info *info;
 			void                  *mtt_no_align;
-			u32                    mtt_offset;
 		} mpwqe;
 	};
 	struct {
@@ -382,6 +381,7 @@ struct mlx5e_rq {
 	u32                    rqn;
 	struct mlx5e_channel  *channel;
 	struct mlx5e_priv     *priv;
+	struct mlx5_core_mkey  umr_mkey;
 } ____cacheline_aligned_in_smp;
 
 struct mlx5e_umr_dma_info {
@@ -689,7 +689,6 @@ struct mlx5e_priv {
 
 	unsigned long              state;
 	struct mutex               state_lock; /* Protects Interface state */
-	struct mlx5_core_mkey      umr_mkey;
 	struct mlx5e_rq            drop_rq;
 
 	struct mlx5e_channel     **channel;
@@ -838,8 +837,7 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 
 static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
 {
-	return rq->mpwqe.mtt_offset +
-		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
+	return wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
 }
 
 static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index aa963d7..352462a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -499,8 +499,7 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 		return -EINVAL;
 	}
 
-	num_mtts = MLX5E_REQUIRED_MTTS(priv->params.num_channels,
-				       rx_pending_wqes);
+	num_mtts = MLX5E_REQUIRED_MTTS(rx_pending_wqes);
 	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
 	    !MLX5E_VALID_NUM_MTTS(num_mtts)) {
 		netdev_info(dev, "%s: rx_pending (%d) request can't be satisfied, try to reduce.\n",
@@ -565,7 +564,6 @@ static int mlx5e_set_channels(struct net_device *dev,
 	unsigned int count = ch->combined_count;
 	bool arfs_enabled;
 	bool was_opened;
-	u32 num_mtts;
 	int err = 0;
 
 	if (!count) {
@@ -584,14 +582,6 @@ static int mlx5e_set_channels(struct net_device *dev,
 		return -EINVAL;
 	}
 
-	num_mtts = MLX5E_REQUIRED_MTTS(count, BIT(priv->params.log_rq_size));
-	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ &&
-	    !MLX5E_VALID_NUM_MTTS(num_mtts)) {
-		netdev_info(dev, "%s: rx count (%d) request can't be satisfied, try to reduce.\n",
-			    __func__, count);
-		return -EINVAL;
-	}
-
 	if (priv->params.num_channels == count)
 		return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 49ca30b..84a4adb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -471,24 +471,25 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	kfree(rq->mpwqe.info);
 }
 
-static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
+static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv,
+				 u64 npages, u8 page_shift,
+				 struct mlx5_core_mkey *umr_mkey)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
-	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
-					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
 	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
 	void *mkc;
 	u32 *in;
 	int err;
 
+	if (!MLX5E_VALID_NUM_MTTS(npages))
+		return -EINVAL;
+
 	in = mlx5_vzalloc(inlen);
 	if (!in)
 		return -ENOMEM;
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 
-	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
-
 	MLX5_SET(mkc, mkc, free, 1);
 	MLX5_SET(mkc, mkc, umr_en, 1);
 	MLX5_SET(mkc, mkc, lw, 1);
@@ -497,17 +498,25 @@ static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
 
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
-	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
+	MLX5_SET64(mkc, mkc, len, npages << page_shift);
 	MLX5_SET(mkc, mkc, translations_octword_size,
 		 MLX5_MTT_OCTW(npages));
-	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, log_page_size, page_shift);
 
-	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
+	err = mlx5_core_create_mkey(mdev, umr_mkey, in, inlen);
 
 	kvfree(in);
 	return err;
 }
 
+static int mlx5e_create_rq_umr_mkey(struct mlx5e_rq *rq)
+{
+	struct mlx5e_priv *priv = rq->priv;
+	u64 num_mtts = MLX5E_REQUIRED_MTTS(BIT(priv->params.log_rq_size));
+
+	return mlx5e_create_umr_mkey(priv, num_mtts, PAGE_SHIFT, &rq->umr_mkey);
+}
+
 static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
@@ -564,18 +573,20 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
-		rq->mpwqe.mtt_offset = c->ix *
-			MLX5E_REQUIRED_MTTS(1, BIT(priv->params.log_rq_size));
-
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
 		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
 
 		rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
 		byte_count = rq->buff.wqe_sz;
-		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
-		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+
+		err = mlx5e_create_rq_umr_mkey(rq);
 		if (err)
 			goto err_rq_wq_destroy;
+		rq->mkey_be = cpu_to_be32(rq->umr_mkey.key);
+
+		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+		if (err)
+			goto err_destroy_umr_mkey;
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info),
@@ -626,6 +637,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
 	return 0;
 
+err_destroy_umr_mkey:
+	mlx5_core_destroy_mkey(mdev, &rq->umr_mkey);
+
 err_rq_wq_destroy:
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
@@ -644,6 +658,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
+		mlx5_core_destroy_mkey(rq->priv->mdev, &rq->umr_mkey);
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		kfree(rq->dma_info);
@@ -3868,15 +3883,9 @@ int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 	profile = priv->profile;
 	clear_bit(MLX5E_STATE_DESTROYING, &priv->state);
 
-	err = mlx5e_create_umr_mkey(priv);
-	if (err) {
-		mlx5_core_err(mdev, "create umr mkey failed, %d\n", err);
-		goto out;
-	}
-
 	err = profile->init_tx(priv);
 	if (err)
-		goto err_destroy_umr_mkey;
+		goto out;
 
 	err = mlx5e_open_drop_rq(priv);
 	if (err) {
@@ -3916,9 +3925,6 @@ int mlx5e_attach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 err_cleanup_tx:
 	profile->cleanup_tx(priv);
 
-err_destroy_umr_mkey:
-	mlx5_core_destroy_mkey(mdev, &priv->umr_mkey);
-
 out:
 	return err;
 }
@@ -3967,7 +3973,6 @@ void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 	profile->cleanup_rx(priv);
 	mlx5e_close_drop_rq(priv);
 	profile->cleanup_tx(priv);
-	mlx5_core_destroy_mkey(priv->mdev, &priv->umr_mkey);
 	cancel_delayed_work_sync(&priv->update_stats_work);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/7] net/mlx5e: Move function mlx5e_create_umr_mkey
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>

In next patch we are going to create a UMR MKey per RQ, we need
mlx5e_create_umr_mkey declared before mlx5e_create_rq.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 74 +++++++++++------------
 1 file changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ba25cd3..49ca30b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -471,6 +471,43 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	kfree(rq->mpwqe.info);
 }
 
+static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *mdev = priv->mdev;
+	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
+					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
+	void *mkc;
+	u32 *in;
+	int err;
+
+	in = mlx5_vzalloc(inlen);
+	if (!in)
+		return -ENOMEM;
+
+	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
+
+	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
+
+	MLX5_SET(mkc, mkc, free, 1);
+	MLX5_SET(mkc, mkc, umr_en, 1);
+	MLX5_SET(mkc, mkc, lw, 1);
+	MLX5_SET(mkc, mkc, lr, 1);
+	MLX5_SET(mkc, mkc, access_mode, MLX5_MKC_ACCESS_MODE_MTT);
+
+	MLX5_SET(mkc, mkc, qpn, 0xffffff);
+	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
+	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
+	MLX5_SET(mkc, mkc, translations_octword_size,
+		 MLX5_MTT_OCTW(npages));
+	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
+
+	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
+
+	kvfree(in);
+	return err;
+}
+
 static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
@@ -3625,43 +3662,6 @@ static void mlx5e_destroy_q_counter(struct mlx5e_priv *priv)
 	mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
 }
 
-static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
-{
-	struct mlx5_core_dev *mdev = priv->mdev;
-	u64 npages = MLX5E_REQUIRED_MTTS(priv->profile->max_nch(mdev),
-					 BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW));
-	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
-	void *mkc;
-	u32 *in;
-	int err;
-
-	in = mlx5_vzalloc(inlen);
-	if (!in)
-		return -ENOMEM;
-
-	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
-
-	npages = min_t(u32, ALIGN(U16_MAX, 4) * 2, npages);
-
-	MLX5_SET(mkc, mkc, free, 1);
-	MLX5_SET(mkc, mkc, umr_en, 1);
-	MLX5_SET(mkc, mkc, lw, 1);
-	MLX5_SET(mkc, mkc, lr, 1);
-	MLX5_SET(mkc, mkc, access_mode, MLX5_MKC_ACCESS_MODE_MTT);
-
-	MLX5_SET(mkc, mkc, qpn, 0xffffff);
-	MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
-	MLX5_SET64(mkc, mkc, len, npages << PAGE_SHIFT);
-	MLX5_SET(mkc, mkc, translations_octword_size,
-		 MLX5_MTT_OCTW(npages));
-	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
-
-	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen);
-
-	kvfree(in);
-	return err;
-}
-
 static void mlx5e_nic_init(struct mlx5_core_dev *mdev,
 			   struct net_device *netdev,
 			   const struct mlx5e_profile *profile,
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 6/7] net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance
From: Saeed Mahameed @ 2016-11-29 22:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tariq Toukan, Or Gerlitz, Roi Dayan, Sebastian Ott,
	Saeed Mahameed
In-Reply-To: <1480457972-31286-1-git-send-email-saeedm@mellanox.com>

From: Roi Dayan <roid@mellanox.com>

Change the function that deletes offloaded TC rule to get
struct mlx5e_tc_flow instance which contains both the flow
handle and flow attributes. This is a cleanup needed for
downstream patches, it doesn't change any functionality.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4d71445..3875c1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -143,18 +143,17 @@ mlx5e_tc_add_fdb_flow(struct mlx5e_priv *priv,
 }
 
 static void mlx5e_tc_del_flow(struct mlx5e_priv *priv,
-			      struct mlx5_flow_handle *rule,
-			      struct mlx5_esw_flow_attr *attr)
+			      struct mlx5e_tc_flow *flow)
 {
 	struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
 	struct mlx5_fc *counter = NULL;
 
-	counter = mlx5_flow_rule_counter(rule);
+	counter = mlx5_flow_rule_counter(flow->rule);
 
-	mlx5_del_flow_rules(rule);
+	mlx5_del_flow_rules(flow->rule);
 
 	if (esw && esw->mode == SRIOV_OFFLOADS)
-		mlx5_eswitch_del_vlan_action(esw, attr);
+		mlx5_eswitch_del_vlan_action(esw, flow->attr);
 
 	mlx5_fc_destroy(priv->mdev, counter);
 
@@ -1005,7 +1004,7 @@ int mlx5e_delete_flower(struct mlx5e_priv *priv,
 
 	rhashtable_remove_fast(&tc->ht, &flow->node, tc->ht_params);
 
-	mlx5e_tc_del_flow(priv, flow->rule, flow->attr);
+	mlx5e_tc_del_flow(priv, flow);
 
 	if (flow->attr->action & MLX5_FLOW_CONTEXT_ACTION_ENCAP)
 		mlx5e_detach_encap(priv, flow);
@@ -1065,7 +1064,7 @@ static void _mlx5e_tc_del_flow(void *ptr, void *arg)
 	struct mlx5e_tc_flow *flow = ptr;
 	struct mlx5e_priv *priv = arg;
 
-	mlx5e_tc_del_flow(priv, flow->rule, flow->attr);
+	mlx5e_tc_del_flow(priv, flow);
 	kfree(flow);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH] net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during resume
From: Grygorii Strashko @ 2016-11-29 22:27 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Ivan Khoronzhuk,
	Grygorii Strashko, Dave Gerlach

netif_set_real_num_tx/rx_queues() are required to be called with rtnl_lock
taken, otherwise ASSERT_RTNL() warning will be triggered - which happens
now during System resume from suspend:
cpsw_resume()
|- cpsw_ndo_open()
  |- netif_set_real_num_tx/rx_queues()
     |- ASSERT_RTNL();

Hence, fix it by surrounding cpsw_ndo_open() by rtnl_lock/unlock() calls.

Cc: Dave Gerlach <d-gerlach@ti.com>
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Fixes: commit e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index ae1ec6a..fd6c03b 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2944,6 +2944,8 @@ static int cpsw_resume(struct device *dev)
 	/* Select default pin state */
 	pinctrl_pm_select_default_state(dev);
 
+	/* shut up ASSERT_RTNL() warning in netif_set_real_num_tx/rx_queues */
+	rtnl_lock();
 	if (cpsw->data.dual_emac) {
 		int i;
 
@@ -2955,6 +2957,8 @@ static int cpsw_resume(struct device *dev)
 		if (netif_running(ndev))
 			cpsw_ndo_open(ndev);
 	}
+	rtnl_unlock();
+
 	return 0;
 }
 #endif
-- 
2.10.1

^ permalink raw reply related

* Re: 4.9-rc7: (forcedeth?) BUG: sleeping function called from invalid context at kernel/irq/manage.c:110
From: Eric Dumazet @ 2016-11-29 23:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Network Development, Meelis Roos
In-Reply-To: <CA+55aFwUFPLbcqpoccKXHsGj_80ANaMwCSgwTUqgxMFmNRHkaQ@mail.gmail.com>

On Tue, 2016-11-29 at 12:06 -0800, Linus Torvalds wrote:
> On Nov 29, 2016 11:58 AM, "Eric Dumazet" <eric.dumazet@gmail.com>
> wrote:
> >
> > nv_do_nic_poll() is simply buggy and needs a fix.
> >
> > synchronize_irq() can sleep.
> 
> Yes, but why did it start showing up now? None of this has changed as
> far as I can see?
> 
> Is it just timing and the transmit queue being busy? Perhaps due to
> the sheer size of messages? Meelis does seem to have a ton of
> debugging enabled..
> 
>     Linus
> 
> 
Looks like bad luck really or forced_irqthreads ? 


These commits contain what is wanted.

02cea3958664723a5d2236f0f0058de97c7e4693 
genirq: Provide disable_hardirq()



18258f7239a61d8929b8e0c7b6d46c446459074c
genirq: Provide synchronize_hardirq()

^ permalink raw reply

* [PATCH] ethtool: mark mask values as ULL values
From: Jacob Keller @ 2016-11-29 23:08 UTC (permalink / raw)
  To: netdev, Intel Wired LAN; +Cc: David Miller, Jakub Kicinski, Jacob Keller

If compiling with signed checks enabled, there will be warnings
generated by the ETHTOOL_RX_FLOW_SPEC_RING(_VF) masks. These are
currently marked as signed constants, which will generate warnings when
masking with unsigned values. Avoid this by marking them explicitly as
unsigned values.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 include/uapi/linux/ethtool.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index ed650eef9e54..a716112b2b01 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -891,8 +891,8 @@ struct ethtool_rx_flow_spec {
  * space for this at this time. If a future patch consumes the next
  * byte it should be aware of this possiblity.
  */
-#define ETHTOOL_RX_FLOW_SPEC_RING	0x00000000FFFFFFFFLL
-#define ETHTOOL_RX_FLOW_SPEC_RING_VF	0x000000FF00000000LL
+#define ETHTOOL_RX_FLOW_SPEC_RING	0x00000000FFFFFFFFULL
+#define ETHTOOL_RX_FLOW_SPEC_RING_VF	0x000000FF00000000ULL
 #define ETHTOOL_RX_FLOW_SPEC_RING_VF_OFF 32
 static inline __u64 ethtool_get_flow_spec_ring(__u64 ring_cookie)
 {
-- 
2.11.0.rc2.152.g4d04e67

^ permalink raw reply related

* [PATCH for-next 0/6] IB/hns: Bug Fixes for HNS RoCE Driver
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA

This patch-set contains bug fixes for the HNS RoCE driver.

Lijun Ou (1):
  IB/hns: Fix the IB device name

Shaobo Xu (2):
  IB/hns: Fix the bug when free mr
  IB/hns: Fix the bug when free cq

Wei Hu (Xavier) (3):
  IB/hns: Fix the bug when destroy qp
  IB/hns: Fix the bug of setting port mtu
  IB/hns: Delete the redundant memset operation

 drivers/infiniband/hw/hns/hns_roce_cmd.h    |    5 -
 drivers/infiniband/hw/hns/hns_roce_common.h |   42 ++
 drivers/infiniband/hw/hns/hns_roce_cq.c     |   27 +-
 drivers/infiniband/hw/hns/hns_roce_device.h |   18 +
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  967 ++++++++++++++++++++++++---
 drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   57 ++
 drivers/infiniband/hw/hns/hns_roce_main.c   |   26 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c     |   21 +-
 8 files changed, 1026 insertions(+), 137 deletions(-)

-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH for-next 1/6] IB/hns: Fix the bug when destroy qp
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA, Dongdong Huang
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

If send queue is still working when qp is in reset state by modify qp
in destroy qp function, hardware will hold on and don't work in hip06
SoC. In current codes, RoCE driver check hardware pointer of sending and
hardware pointer of processing to ensure that hardware has processed all
the dbs of this qp. But while the environment of wire becomes not good,
The checking time maybe too long.

In order to solve this problem, RoCE driver created a workqueue at probe
function. If there is a timeout when checking the status of qp, driver
initialize work entry and push it into the workqueue, Work function will
finish checking and release the related resources later.

Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_common.h |   40 +++
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  435 +++++++++++++++++++++------
 drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   23 ++
 3 files changed, 402 insertions(+), 96 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_common.h b/drivers/infiniband/hw/hns/hns_roce_common.h
index 0dcb620..a055632 100644
--- a/drivers/infiniband/hw/hns/hns_roce_common.h
+++ b/drivers/infiniband/hw/hns/hns_roce_common.h
@@ -57,6 +57,32 @@
 #define roce_set_bit(origin, shift, val) \
 	roce_set_field((origin), (1ul << (shift)), (shift), (val))
 
+/*
+ * roce_hw_index_cmp_lt - Compare two hardware index values in hisilicon
+ *                        SOC, check if a is less than b.
+ * @a: hardware index value
+ * @b: hardware index value
+ * @bits: the number of bits of a and b, range: 0~31.
+ *
+ * Hardware index increases continuously till max value, and then restart
+ * from zero, again and again. Because the bits of reg field is often
+ * limited, the reg field can only hold the low bits of the hardware index
+ * in hisilicon SOC.
+ * In some scenes we need to compare two values(a,b) getted from two reg
+ * fields in this driver, for example:
+ * If a equals 0xfffe, b equals 0x1 and bits equals 16, we think b has
+ * incresed from 0xffff to 0x1 and a is less than b.
+ * If a equals 0xfffe, b equals 0x0xf001 and bits equals 16, we think a
+ * is bigger than b.
+ *
+ * Return true on a less than b, otherwise false.
+ */
+#define roce_hw_index_mask(bits)	((1ul << (bits)) - 1)
+#define roce_hw_index_shift(bits)	(32 - (bits))
+#define roce_hw_index_cmp_lt(a, b, bits) \
+	((int)((((a) - (b)) & roce_hw_index_mask(bits)) << \
+		roce_hw_index_shift(bits)) < 0)
+
 #define ROCEE_GLB_CFG_ROCEE_DB_SQ_MODE_S 3
 #define ROCEE_GLB_CFG_ROCEE_DB_OTH_MODE_S 4
 
@@ -245,10 +271,22 @@
 #define ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M   \
 	(((1UL << 28) - 1) << ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S)
 
+#define ROCEE_SDB_PTR_CMP_BITS 28
+
 #define ROCEE_SDB_INV_CNT_SDB_INV_CNT_S 0
 #define ROCEE_SDB_INV_CNT_SDB_INV_CNT_M   \
 	(((1UL << 16) - 1) << ROCEE_SDB_INV_CNT_SDB_INV_CNT_S)
 
+#define ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_S	0
+#define ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_M	\
+	(((1UL << 16) - 1) << ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_S)
+
+#define ROCEE_SDB_CNT_CMP_BITS 16
+
+#define ROCEE_TSP_BP_ST_QH_FIFO_ENTRY_S	20
+
+#define ROCEE_CNT_CLR_CE_CNT_CLR_CE_S 0
+
 /*************ROCEE_REG DEFINITION****************/
 #define ROCEE_VENDOR_ID_REG			0x0
 #define ROCEE_VENDOR_PART_ID_REG		0x4
@@ -317,6 +355,8 @@
 #define ROCEE_SDB_ISSUE_PTR_REG			0x758
 #define ROCEE_SDB_SEND_PTR_REG			0x75C
 #define ROCEE_SDB_INV_CNT_REG			0x9A4
+#define ROCEE_SDB_RETRY_CNT_REG			0x9AC
+#define ROCEE_TSP_BP_ST_REG			0x9EC
 #define ROCEE_ECC_UCERR_ALM0_REG		0xB34
 #define ROCEE_ECC_CERR_ALM0_REG			0xB40
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index 125ab90..aee1d01 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -948,6 +948,38 @@ int hns_roce_v1_reset(struct hns_roce_dev *hr_dev, bool dereset)
 	return ret;
 }
 
+static int hns_roce_des_qp_init(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_des_qp *des_qp;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	des_qp = &priv->des_qp;
+
+	des_qp->requeue_flag = 1;
+	des_qp->qp_wq = create_singlethread_workqueue("hns_roce_destroy_qp");
+	if (!des_qp->qp_wq) {
+		dev_err(dev, "Create destroy qp workqueue failed!\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void hns_roce_des_qp_free(struct hns_roce_dev *hr_dev)
+{
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_des_qp *des_qp;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	des_qp = &priv->des_qp;
+
+	des_qp->requeue_flag = 0;
+	flush_workqueue(des_qp->qp_wq);
+	destroy_workqueue(des_qp->qp_wq);
+}
+
 void hns_roce_v1_profile(struct hns_roce_dev *hr_dev)
 {
 	int i = 0;
@@ -1050,8 +1082,6 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 		goto error_failed_raq_init;
 	}
 
-	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_UP);
-
 	ret = hns_roce_bt_init(hr_dev);
 	if (ret) {
 		dev_err(dev, "bt init failed!\n");
@@ -1064,13 +1094,23 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 		goto error_failed_tptr_init;
 	}
 
+	ret = hns_roce_des_qp_init(hr_dev);
+	if (ret) {
+		dev_err(dev, "des qp init failed!\n");
+		goto error_failed_des_qp_init;
+	}
+
+	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_UP);
+
 	return 0;
 
+error_failed_des_qp_init:
+	hns_roce_tptr_free(hr_dev);
+
 error_failed_tptr_init:
 	hns_roce_bt_free(hr_dev);
 
 error_failed_bt_init:
-	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_DOWN);
 	hns_roce_raq_free(hr_dev);
 
 error_failed_raq_init:
@@ -1080,9 +1120,10 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 
 void hns_roce_v1_exit(struct hns_roce_dev *hr_dev)
 {
+	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_DOWN);
+	hns_roce_des_qp_free(hr_dev);
 	hns_roce_tptr_free(hr_dev);
 	hns_roce_bt_free(hr_dev);
-	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_DOWN);
 	hns_roce_raq_free(hr_dev);
 	hns_roce_db_free(hr_dev);
 }
@@ -2906,132 +2947,334 @@ int hns_roce_v1_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
 		hns_roce_v1_q_sqp(ibqp, qp_attr, qp_attr_mask, qp_init_attr) :
 		hns_roce_v1_q_qp(ibqp, qp_attr, qp_attr_mask, qp_init_attr);
 }
-static void hns_roce_v1_destroy_qp_common(struct hns_roce_dev *hr_dev,
-					  struct hns_roce_qp *hr_qp,
-					  int is_user)
+
+static int check_qp_db_process_status(struct hns_roce_dev *hr_dev,
+				      struct hns_roce_qp *hr_qp,
+				      u32 sdb_issue_ptr,
+				      u32 *sdb_inv_cnt,
+				      u32 *wait_stage)
 {
-	u32 sdbinvcnt;
-	unsigned long end = 0;
-	u32 sdbinvcnt_val;
-	u32 sdbsendptr_val;
-	u32 sdbisusepr_val;
-	struct hns_roce_cq *send_cq, *recv_cq;
 	struct device *dev = &hr_dev->pdev->dev;
+	u32 sdb_retry_cnt, old_retry;
+	u32 sdb_send_ptr, old_send;
+	u32 success_flags = 0;
+	u32 cur_cnt, old_cnt;
+	unsigned long end;
+	u32 send_ptr;
+	u32 inv_cnt;
+	u32 tsp_st;
+
+	if (*wait_stage > HNS_ROCE_V1_DB_STAGE2 ||
+	    *wait_stage < HNS_ROCE_V1_DB_STAGE1) {
+		dev_err(dev, "QP(0x%lx) db status wait stage(%d) error!\n",
+			hr_qp->qpn, *wait_stage);
+		return -EINVAL;
+	}
 
-	if (hr_qp->ibqp.qp_type == IB_QPT_RC) {
-		if (hr_qp->state != IB_QPS_RESET) {
-			/*
-			 * Set qp to ERR,
-			 * waiting for hw complete processing all dbs
-			 */
-			if (hns_roce_v1_qp_modify(hr_dev, NULL,
-					to_hns_roce_state(
-						(enum ib_qp_state)hr_qp->state),
-						HNS_ROCE_QP_STATE_ERR, NULL,
-						hr_qp))
-				dev_err(dev, "modify QP %06lx to ERR failed.\n",
-					hr_qp->qpn);
-
-			/* Record issued doorbell */
-			sdbisusepr_val = roce_read(hr_dev,
-					 ROCEE_SDB_ISSUE_PTR_REG);
-			/*
-			 * Query db process status,
-			 * until hw process completely
-			 */
-			end = msecs_to_jiffies(
-			      HNS_ROCE_QP_DESTROY_TIMEOUT_MSECS) + jiffies;
-			do {
-				sdbsendptr_val = roce_read(hr_dev,
+	/* Calculate the total timeout for the entire verification process */
+	end = msecs_to_jiffies(HNS_ROCE_V1_CHECK_DB_TIMEOUT_MSECS) + jiffies;
+
+	if (*wait_stage == HNS_ROCE_V1_DB_STAGE1) {
+		/* Query db process status, until hw process completely */
+		sdb_send_ptr = roce_read(hr_dev, ROCEE_SDB_SEND_PTR_REG);
+		while (roce_hw_index_cmp_lt(sdb_send_ptr, sdb_issue_ptr,
+					    ROCEE_SDB_PTR_CMP_BITS)) {
+			if (!time_before(jiffies, end)) {
+				dev_dbg(dev, "QP(0x%lx) db process stage1 timeout. issue 0x%x send 0x%x.\n",
+					hr_qp->qpn, sdb_issue_ptr,
+					sdb_send_ptr);
+				return 0;
+			}
+
+			msleep(HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS);
+			sdb_send_ptr = roce_read(hr_dev,
 						 ROCEE_SDB_SEND_PTR_REG);
-				if (!time_before(jiffies, end)) {
-					dev_err(dev, "destroy qp(0x%lx) timeout!!!",
-						hr_qp->qpn);
-					break;
-				}
-			} while ((short)(roce_get_field(sdbsendptr_val,
-					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
-					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S) -
-				roce_get_field(sdbisusepr_val,
-					ROCEE_SDB_ISSUE_PTR_SDB_ISSUE_PTR_M,
-					ROCEE_SDB_ISSUE_PTR_SDB_ISSUE_PTR_S)
-				) < 0);
+		}
 
-			/* Get list pointer */
-			sdbinvcnt = roce_read(hr_dev, ROCEE_SDB_INV_CNT_REG);
+		if (roce_get_field(sdb_issue_ptr,
+				   ROCEE_SDB_ISSUE_PTR_SDB_ISSUE_PTR_M,
+				   ROCEE_SDB_ISSUE_PTR_SDB_ISSUE_PTR_S) ==
+		    roce_get_field(sdb_send_ptr,
+				   ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+				   ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S)) {
+			old_send = roce_read(hr_dev, ROCEE_SDB_SEND_PTR_REG);
+			old_retry = roce_read(hr_dev, ROCEE_SDB_RETRY_CNT_REG);
 
-			/* Query db's list status, until hw reversal */
 			do {
-				sdbinvcnt_val = roce_read(hr_dev,
-						ROCEE_SDB_INV_CNT_REG);
+				tsp_st = roce_read(hr_dev, ROCEE_TSP_BP_ST_REG);
+				if (roce_get_bit(tsp_st,
+					ROCEE_TSP_BP_ST_QH_FIFO_ENTRY_S) == 1) {
+					*wait_stage = HNS_ROCE_V1_DB_WAIT_OK;
+					return 0;
+				}
+
 				if (!time_before(jiffies, end)) {
-					dev_err(dev, "destroy qp(0x%lx) timeout!!!",
-						hr_qp->qpn);
-					dev_err(dev, "SdbInvCnt = 0x%x\n",
-						sdbinvcnt_val);
-					break;
+					dev_dbg(dev, "QP(0x%lx) db process stage1 timeout when send ptr equals issue ptr.\n"
+						     "issue 0x%x send 0x%x.\n",
+						hr_qp->qpn, sdb_issue_ptr,
+						sdb_send_ptr);
+					return 0;
 				}
-			} while ((short)(roce_get_field(sdbinvcnt_val,
-				  ROCEE_SDB_INV_CNT_SDB_INV_CNT_M,
-				  ROCEE_SDB_INV_CNT_SDB_INV_CNT_S) -
-				  (sdbinvcnt + SDB_INV_CNT_OFFSET)) < 0);
-
-			/* Modify qp to reset before destroying qp */
-			if (hns_roce_v1_qp_modify(hr_dev, NULL,
-					to_hns_roce_state(
-					(enum ib_qp_state)hr_qp->state),
-					HNS_ROCE_QP_STATE_RST, NULL, hr_qp))
-				dev_err(dev, "modify QP %06lx to RESET failed.\n",
-					hr_qp->qpn);
+
+				msleep(HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS);
+
+				sdb_send_ptr = roce_read(hr_dev,
+							ROCEE_SDB_SEND_PTR_REG);
+				sdb_retry_cnt =	roce_read(hr_dev,
+						       ROCEE_SDB_RETRY_CNT_REG);
+				cur_cnt = roce_get_field(sdb_send_ptr,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S) +
+					roce_get_field(sdb_retry_cnt,
+					ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_M,
+					ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_S);
+				if (!roce_get_bit(tsp_st,
+					ROCEE_CNT_CLR_CE_CNT_CLR_CE_S)) {
+					old_cnt = roce_get_field(old_send,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S) +
+					roce_get_field(old_retry,
+					ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_M,
+					ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_S);
+					if (cur_cnt - old_cnt > SDB_ST_CMP_VAL)
+						success_flags = 1;
+				} else {
+					old_cnt = roce_get_field(old_send,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+					ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S);
+					if (cur_cnt - old_cnt > SDB_ST_CMP_VAL)
+						success_flags = 1;
+					else {
+					    send_ptr = roce_get_field(old_send,
+					    ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+					    ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S) +
+					    roce_get_field(sdb_retry_cnt,
+					    ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_M,
+					    ROCEE_SDB_RETRY_CNT_SDB_RETRY_CT_S);
+					    roce_set_field(old_send,
+					    ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_M,
+					    ROCEE_SDB_SEND_PTR_SDB_SEND_PTR_S,
+						send_ptr);
+					}
+				}
+			} while (!success_flags);
+		}
+
+		*wait_stage = HNS_ROCE_V1_DB_STAGE2;
+
+		/* Get list pointer */
+		*sdb_inv_cnt = roce_read(hr_dev, ROCEE_SDB_INV_CNT_REG);
+		dev_dbg(dev, "QP(0x%lx) db process stage2. inv cnt = 0x%x.\n",
+			hr_qp->qpn, *sdb_inv_cnt);
+	}
+
+	if (*wait_stage == HNS_ROCE_V1_DB_STAGE2) {
+		/* Query db's list status, until hw reversal */
+		inv_cnt = roce_read(hr_dev, ROCEE_SDB_INV_CNT_REG);
+		while (roce_hw_index_cmp_lt(inv_cnt,
+					    *sdb_inv_cnt + SDB_INV_CNT_OFFSET,
+					    ROCEE_SDB_CNT_CMP_BITS)) {
+			if (!time_before(jiffies, end)) {
+				dev_dbg(dev, "QP(0x%lx) db process stage2 timeout. inv cnt 0x%x.\n",
+					hr_qp->qpn, inv_cnt);
+				return 0;
+			}
+
+			msleep(HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS);
+			inv_cnt = roce_read(hr_dev, ROCEE_SDB_INV_CNT_REG);
 		}
+
+		*wait_stage = HNS_ROCE_V1_DB_WAIT_OK;
+	}
+
+	return 0;
+}
+
+static int check_qp_reset_state(struct hns_roce_dev *hr_dev,
+				struct hns_roce_qp *hr_qp,
+				struct hns_roce_qp_work *qp_work_entry,
+				int *is_timeout)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	u32 sdb_issue_ptr;
+	int ret;
+
+	if (hr_qp->state != IB_QPS_RESET) {
+		/* Set qp to ERR, waiting for hw complete processing all dbs */
+		ret = hns_roce_v1_modify_qp(&hr_qp->ibqp, NULL, 0, hr_qp->state,
+					    IB_QPS_ERR);
+		if (ret) {
+			dev_err(dev, "Modify QP(0x%lx) to ERR failed!\n",
+				hr_qp->qpn);
+			return ret;
+		}
+
+		/* Record issued doorbell */
+		sdb_issue_ptr = roce_read(hr_dev, ROCEE_SDB_ISSUE_PTR_REG);
+		qp_work_entry->sdb_issue_ptr = sdb_issue_ptr;
+		qp_work_entry->db_wait_stage = HNS_ROCE_V1_DB_STAGE1;
+
+		/* Query db process status, until hw process completely */
+		ret = check_qp_db_process_status(hr_dev, hr_qp, sdb_issue_ptr,
+						 &qp_work_entry->sdb_inv_cnt,
+						 &qp_work_entry->db_wait_stage);
+		if (ret) {
+			dev_err(dev, "Check QP(0x%lx) db process status failed!\n",
+				hr_qp->qpn);
+			return ret;
+		}
+
+		if (qp_work_entry->db_wait_stage != HNS_ROCE_V1_DB_WAIT_OK) {
+			qp_work_entry->sche_cnt = 0;
+			*is_timeout = 1;
+			return 0;
+		}
+
+		/* Modify qp to reset before destroying qp */
+		ret = hns_roce_v1_modify_qp(&hr_qp->ibqp, NULL, 0, hr_qp->state,
+					    IB_QPS_RESET);
+		if (ret) {
+			dev_err(dev, "Modify QP(0x%lx) to RST failed!\n",
+				hr_qp->qpn);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void hns_roce_v1_destroy_qp_work_fn(struct work_struct *work)
+{
+	struct hns_roce_qp_work *qp_work_entry;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_dev *hr_dev;
+	struct hns_roce_qp *hr_qp;
+	struct device *dev;
+	int ret;
+
+	qp_work_entry = container_of(work, struct hns_roce_qp_work, work);
+	hr_dev = to_hr_dev(qp_work_entry->ib_dev);
+	dev = &hr_dev->pdev->dev;
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	hr_qp = qp_work_entry->qp;
+
+	dev_dbg(dev, "Schedule destroy QP(0x%lx) work.\n", hr_qp->qpn);
+
+	qp_work_entry->sche_cnt++;
+
+	/* Query db process status, until hw process completely */
+	ret = check_qp_db_process_status(hr_dev, hr_qp,
+					 qp_work_entry->sdb_issue_ptr,
+					 &qp_work_entry->sdb_inv_cnt,
+					 &qp_work_entry->db_wait_stage);
+	if (ret) {
+		dev_err(dev, "Check QP(0x%lx) db process status failed!\n",
+			hr_qp->qpn);
+		return;
+	}
+
+	if (qp_work_entry->db_wait_stage != HNS_ROCE_V1_DB_WAIT_OK &&
+	    priv->des_qp.requeue_flag) {
+		queue_work(priv->des_qp.qp_wq, work);
+		return;
+	}
+
+	/* Modify qp to reset before destroying qp */
+	ret = hns_roce_v1_modify_qp(&hr_qp->ibqp, NULL, 0, hr_qp->state,
+				    IB_QPS_RESET);
+	if (ret) {
+		dev_err(dev, "Modify QP(0x%lx) to RST failed!\n", hr_qp->qpn);
+		return;
+	}
+
+	hns_roce_qp_remove(hr_dev, hr_qp);
+	hns_roce_qp_free(hr_dev, hr_qp);
+
+	if (hr_qp->ibqp.qp_type == IB_QPT_RC) {
+		/* RC QP, release QPN */
+		hns_roce_release_range_qp(hr_dev, hr_qp->qpn, 1);
+		kfree(hr_qp);
+	} else
+		kfree(hr_to_hr_sqp(hr_qp));
+
+	kfree(qp_work_entry);
+
+	dev_dbg(dev, "Accomplished destroy QP(0x%lx) work.\n", hr_qp->qpn);
+}
+
+int hns_roce_v1_destroy_qp(struct ib_qp *ibqp)
+{
+	struct hns_roce_dev *hr_dev = to_hr_dev(ibqp->device);
+	struct hns_roce_qp *hr_qp = to_hr_qp(ibqp);
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_qp_work qp_work_entry;
+	struct hns_roce_qp_work *qp_work;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_cq *send_cq, *recv_cq;
+	int is_user = !!ibqp->pd->uobject;
+	int is_timeout = 0;
+	int ret;
+
+	ret = check_qp_reset_state(hr_dev, hr_qp, &qp_work_entry, &is_timeout);
+	if (ret) {
+		dev_err(dev, "QP reset state check failed(%d)!\n", ret);
+		return ret;
 	}
 
 	send_cq = to_hr_cq(hr_qp->ibqp.send_cq);
 	recv_cq = to_hr_cq(hr_qp->ibqp.recv_cq);
 
 	hns_roce_lock_cqs(send_cq, recv_cq);
-
 	if (!is_user) {
 		__hns_roce_v1_cq_clean(recv_cq, hr_qp->qpn, hr_qp->ibqp.srq ?
 				       to_hr_srq(hr_qp->ibqp.srq) : NULL);
 		if (send_cq != recv_cq)
 			__hns_roce_v1_cq_clean(send_cq, hr_qp->qpn, NULL);
 	}
-
-	hns_roce_qp_remove(hr_dev, hr_qp);
-
 	hns_roce_unlock_cqs(send_cq, recv_cq);
 
-	hns_roce_qp_free(hr_dev, hr_qp);
+	if (!is_timeout) {
+		hns_roce_qp_remove(hr_dev, hr_qp);
+		hns_roce_qp_free(hr_dev, hr_qp);
 
-	/* Not special_QP, free their QPN */
-	if ((hr_qp->ibqp.qp_type == IB_QPT_RC) ||
-	    (hr_qp->ibqp.qp_type == IB_QPT_UC) ||
-	    (hr_qp->ibqp.qp_type == IB_QPT_UD))
-		hns_roce_release_range_qp(hr_dev, hr_qp->qpn, 1);
+		/* RC QP, release QPN */
+		if (hr_qp->ibqp.qp_type == IB_QPT_RC)
+			hns_roce_release_range_qp(hr_dev, hr_qp->qpn, 1);
+	}
 
 	hns_roce_mtt_cleanup(hr_dev, &hr_qp->mtt);
 
-	if (is_user) {
+	if (is_user)
 		ib_umem_release(hr_qp->umem);
-	} else {
+	else {
 		kfree(hr_qp->sq.wrid);
 		kfree(hr_qp->rq.wrid);
+
 		hns_roce_buf_free(hr_dev, hr_qp->buff_size, &hr_qp->hr_buf);
 	}
-}
 
-int hns_roce_v1_destroy_qp(struct ib_qp *ibqp)
-{
-	struct hns_roce_dev *hr_dev = to_hr_dev(ibqp->device);
-	struct hns_roce_qp *hr_qp = to_hr_qp(ibqp);
-
-	hns_roce_v1_destroy_qp_common(hr_dev, hr_qp, !!ibqp->pd->uobject);
-
-	if (hr_qp->ibqp.qp_type == IB_QPT_GSI)
-		kfree(hr_to_hr_sqp(hr_qp));
-	else
-		kfree(hr_qp);
+	if (!is_timeout) {
+		if (hr_qp->ibqp.qp_type == IB_QPT_RC)
+			kfree(hr_qp);
+		else
+			kfree(hr_to_hr_sqp(hr_qp));
+	} else {
+		qp_work = kzalloc(sizeof(*qp_work), GFP_KERNEL);
+		if (!qp_work)
+			return -ENOMEM;
+
+		INIT_WORK(&qp_work->work, hns_roce_v1_destroy_qp_work_fn);
+		qp_work->ib_dev	= &hr_dev->ib_dev;
+		qp_work->qp		= hr_qp;
+		qp_work->db_wait_stage	= qp_work_entry.db_wait_stage;
+		qp_work->sdb_issue_ptr	= qp_work_entry.sdb_issue_ptr;
+		qp_work->sdb_inv_cnt	= qp_work_entry.sdb_inv_cnt;
+		qp_work->sche_cnt	= qp_work_entry.sche_cnt;
+
+		priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+		queue_work(priv->des_qp.qp_wq, &qp_work->work);
+		dev_dbg(dev, "Begin destroy QP(0x%lx) work.\n", hr_qp->qpn);
+	}
 
 	return 0;
 }
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
index cf28f1b..1d250c0 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
@@ -102,6 +102,12 @@
 #define HNS_ROCE_V1_EXT_ODB_ALFUL	\
 	(HNS_ROCE_V1_EXT_ODB_DEPTH - HNS_ROCE_V1_DB_RSVD)
 
+#define HNS_ROCE_V1_DB_WAIT_OK				0
+#define HNS_ROCE_V1_DB_STAGE1				1
+#define HNS_ROCE_V1_DB_STAGE2				2
+#define HNS_ROCE_V1_CHECK_DB_TIMEOUT_MSECS		10000
+#define HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS		20
+
 #define HNS_ROCE_BT_RSV_BUF_SIZE			(1 << 17)
 
 #define HNS_ROCE_V1_TPTR_ENTRY_SIZE			2
@@ -144,6 +150,7 @@
 #define SQ_PSN_SHIFT					8
 #define QKEY_VAL					0x80010000
 #define SDB_INV_CNT_OFFSET				8
+#define SDB_ST_CMP_VAL					8
 
 struct hns_roce_cq_context {
 	u32 cqc_byte_4;
@@ -993,11 +1000,27 @@ struct hns_roce_tptr_table {
 	struct hns_roce_buf_list tptr_buf;
 };
 
+struct hns_roce_qp_work {
+	struct	work_struct work;
+	struct	ib_device *ib_dev;
+	struct	hns_roce_qp *qp;
+	u32	db_wait_stage;
+	u32	sdb_issue_ptr;
+	u32	sdb_inv_cnt;
+	u32	sche_cnt;
+};
+
+struct hns_roce_des_qp {
+	struct workqueue_struct	*qp_wq;
+	int	requeue_flag;
+};
+
 struct hns_roce_v1_priv {
 	struct hns_roce_db_table  db_table;
 	struct hns_roce_raq_table raq_table;
 	struct hns_roce_bt_table  bt_table;
 	struct hns_roce_tptr_table tptr_table;
+	struct hns_roce_des_qp des_qp;
 };
 
 int hns_dsaf_roce_reset(struct fwnode_handle *dsaf_fwnode, bool dereset);
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH for-next 2/6] IB/hns: Fix the bug when free mr
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford
  Cc: salil.mehta, xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk,
	lijun_nudt, linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>

From: Shaobo Xu <xushaobo2@huawei.com>

If the resources of mr are freed while executing the user case, hardware
can not been notified in hip06 SoC. Then hardware will hold on when it
reads the payload by the PA which has been released.

In order to slove this problem, RoCE driver creates 8 reserved loopback
QPs to ensure zero wqe when free mr. When the mac address is reset, in
order to avoid loopback failure, we need to release the reserved loopback
QPs and recreate them.

Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_cmd.h    |    5 -
 drivers/infiniband/hw/hns/hns_roce_device.h |   10 +
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  485 +++++++++++++++++++++++++++
 drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   34 ++
 drivers/infiniband/hw/hns/hns_roce_main.c   |    5 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c     |   21 +-
 6 files changed, 545 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h
index ed14ad3..f5a9ee2 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cmd.h
+++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h
@@ -58,11 +58,6 @@ enum {
 	HNS_ROCE_CMD_QUERY_QP		= 0x22,
 };
 
-struct hns_roce_cmd_mailbox {
-	void		       *buf;
-	dma_addr_t		dma;
-};
-
 int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param,
 		      unsigned long in_modifier, u8 op_modifier, u16 op,
 		      unsigned long timeout);
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index e48464d..1050829 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -388,6 +388,11 @@ struct hns_roce_cmdq {
 	u8			toggle;
 };
 
+struct hns_roce_cmd_mailbox {
+	void		       *buf;
+	dma_addr_t		dma;
+};
+
 struct hns_roce_dev;
 
 struct hns_roce_qp {
@@ -522,6 +527,7 @@ struct hns_roce_hw {
 			 struct ib_recv_wr **bad_recv_wr);
 	int (*req_notify_cq)(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
 	int (*poll_cq)(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+	int (*dereg_mr)(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr);
 	void	*priv;
 };
 
@@ -688,6 +694,10 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				   u64 virt_addr, int access_flags,
 				   struct ib_udata *udata);
 int hns_roce_dereg_mr(struct ib_mr *ibmr);
+int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
+		       struct hns_roce_cmd_mailbox *mailbox,
+		       unsigned long mpt_index);
+unsigned long key_to_hw_index(u32 key);
 
 void hns_roce_buf_free(struct hns_roce_dev *hr_dev, u32 size,
 		       struct hns_roce_buf *buf);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index aee1d01..f67a3bf 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -295,6 +295,8 @@ int hns_roce_v1_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_SQ_HEAD_M,
 			       SQ_DOORBELL_U32_4_SQ_HEAD_S,
 			      (qp->sq.head & ((qp->sq.wqe_cnt << 1) - 1)));
+		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_SL_M,
+			       SQ_DOORBELL_U32_4_SL_S, qp->sl);
 		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_PORT_M,
 			       SQ_DOORBELL_U32_4_PORT_S, qp->phy_port);
 		roce_set_field(sq_db.u32_8, SQ_DOORBELL_U32_8_QPN_M,
@@ -622,6 +624,213 @@ static int hns_roce_db_ext_init(struct hns_roce_dev *hr_dev, u32 sdb_ext_mod,
 	return ret;
 }
 
+static struct hns_roce_qp *hns_roce_v1_create_lp_qp(struct hns_roce_dev *hr_dev,
+						    struct ib_pd *pd)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_qp_init_attr init_attr;
+	struct ib_qp *qp;
+
+	memset(&init_attr, 0, sizeof(struct ib_qp_init_attr));
+	init_attr.qp_type		= IB_QPT_RC;
+	init_attr.sq_sig_type		= IB_SIGNAL_ALL_WR;
+	init_attr.cap.max_recv_wr	= HNS_ROCE_MIN_WQE_NUM;
+	init_attr.cap.max_send_wr	= HNS_ROCE_MIN_WQE_NUM;
+
+	qp = hns_roce_create_qp(pd, &init_attr, NULL);
+	if (IS_ERR(qp)) {
+		dev_err(dev, "Create loop qp for mr free failed!");
+		return NULL;
+	}
+
+	return to_hr_qp(qp);
+}
+
+static int hns_roce_v1_rsv_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct hns_roce_caps *caps = &hr_dev->caps;
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_cq_init_attr cq_init_attr;
+	struct hns_roce_free_mr *free_mr;
+	struct ib_qp_attr attr = { 0 };
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_qp *hr_qp;
+	struct ib_cq *cq;
+	struct ib_pd *pd;
+	u64 subnet_prefix;
+	int attr_mask = 0;
+	int i;
+	int ret;
+	u8 phy_port;
+	u8 sl;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	/* Reserved cq for loop qp */
+	cq_init_attr.cqe		= HNS_ROCE_MIN_WQE_NUM * 2;
+	cq_init_attr.comp_vector	= 0;
+	cq = hns_roce_ib_create_cq(&hr_dev->ib_dev, &cq_init_attr, NULL, NULL);
+	if (IS_ERR(cq)) {
+		dev_err(dev, "Create cq for reseved loop qp failed!");
+		return -ENOMEM;
+	}
+	free_mr->mr_free_cq = to_hr_cq(cq);
+	free_mr->mr_free_cq->ib_cq.device		= &hr_dev->ib_dev;
+	free_mr->mr_free_cq->ib_cq.uobject		= NULL;
+	free_mr->mr_free_cq->ib_cq.comp_handler		= NULL;
+	free_mr->mr_free_cq->ib_cq.event_handler	= NULL;
+	free_mr->mr_free_cq->ib_cq.cq_context		= NULL;
+	atomic_set(&free_mr->mr_free_cq->ib_cq.usecnt, 0);
+
+	pd = hns_roce_alloc_pd(&hr_dev->ib_dev, NULL, NULL);
+	if (IS_ERR(pd)) {
+		dev_err(dev, "Create pd for reseved loop qp failed!");
+		ret = -ENOMEM;
+		goto alloc_pd_failed;
+	}
+	free_mr->mr_free_pd = to_hr_pd(pd);
+	free_mr->mr_free_pd->ibpd.device  = &hr_dev->ib_dev;
+	free_mr->mr_free_pd->ibpd.uobject = NULL;
+	atomic_set(&free_mr->mr_free_pd->ibpd.usecnt, 0);
+
+	attr.qp_access_flags	= IB_ACCESS_REMOTE_WRITE;
+	attr.pkey_index		= 0;
+	attr.min_rnr_timer	= 0;
+	/* Disable read ability */
+	attr.max_dest_rd_atomic = 0;
+	attr.max_rd_atomic	= 0;
+	/* Use arbitrary values as rq_psn and sq_psn */
+	attr.rq_psn		= 0x0808;
+	attr.sq_psn		= 0x0808;
+	attr.retry_cnt		= 7;
+	attr.rnr_retry		= 7;
+	attr.timeout		= 0x12;
+	attr.path_mtu		= IB_MTU_256;
+	attr.ah_attr.ah_flags		= 1;
+	attr.ah_attr.static_rate	= 3;
+	attr.ah_attr.grh.sgid_index	= 0;
+	attr.ah_attr.grh.hop_limit	= 1;
+	attr.ah_attr.grh.flow_label	= 0;
+	attr.ah_attr.grh.traffic_class	= 0;
+
+	subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		free_mr->mr_free_qp[i] = hns_roce_v1_create_lp_qp(hr_dev, pd);
+		if (IS_ERR(free_mr->mr_free_qp[i])) {
+			dev_err(dev, "Create loop qp failed!\n");
+			goto create_lp_qp_failed;
+		}
+		hr_qp = free_mr->mr_free_qp[i];
+
+		sl = i / caps->num_ports;
+
+		if (caps->num_ports == HNS_ROCE_MAX_PORTS)
+			phy_port = (i >= HNS_ROCE_MAX_PORTS) ? (i - 2) :
+				(i % caps->num_ports);
+		else
+			phy_port = i % caps->num_ports;
+
+		hr_qp->port		= phy_port + 1;
+		hr_qp->phy_port		= phy_port;
+		hr_qp->ibqp.qp_type	= IB_QPT_RC;
+		hr_qp->ibqp.device	= &hr_dev->ib_dev;
+		hr_qp->ibqp.uobject	= NULL;
+		atomic_set(&hr_qp->ibqp.usecnt, 0);
+		hr_qp->ibqp.pd		= pd;
+		hr_qp->ibqp.recv_cq	= cq;
+		hr_qp->ibqp.send_cq	= cq;
+
+		attr.ah_attr.port_num	= phy_port + 1;
+		attr.ah_attr.sl		= sl;
+		attr.port_num		= phy_port + 1;
+
+		attr.dest_qp_num	= hr_qp->qpn;
+		memcpy(attr.ah_attr.dmac, hr_dev->dev_addr[phy_port],
+		       MAC_ADDR_OCTET_NUM);
+
+		memcpy(attr.ah_attr.grh.dgid.raw,
+			&subnet_prefix, sizeof(u64));
+		memcpy(&attr.ah_attr.grh.dgid.raw[8],
+		       hr_dev->dev_addr[phy_port], 3);
+		memcpy(&attr.ah_attr.grh.dgid.raw[13],
+		       hr_dev->dev_addr[phy_port] + 3, 3);
+		attr.ah_attr.grh.dgid.raw[11] = 0xff;
+		attr.ah_attr.grh.dgid.raw[12] = 0xfe;
+		attr.ah_attr.grh.dgid.raw[8] ^= 2;
+
+		attr_mask |= IB_QP_PORT;
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_RESET, IB_QPS_INIT);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_INIT, IB_QPS_RTR);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_RTR, IB_QPS_RTS);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+	}
+
+	return 0;
+
+create_lp_qp_failed:
+	for (i -= 1; i >= 0; i--) {
+		hr_qp = free_mr->mr_free_qp[i];
+		if (hns_roce_v1_destroy_qp(&hr_qp->ibqp))
+			dev_err(dev, "Destroy qp %d for mr free failed!\n", i);
+	}
+
+	if (hns_roce_dealloc_pd(pd))
+		dev_err(dev, "Destroy pd for create_lp_qp failed!\n");
+
+alloc_pd_failed:
+	if (hns_roce_ib_destroy_cq(cq))
+		dev_err(dev, "Destroy cq for create_lp_qp failed!\n");
+
+	return -EINVAL;
+}
+
+static void hns_roce_v1_release_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_qp *hr_qp;
+	int ret;
+	int i;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		hr_qp = free_mr->mr_free_qp[i];
+		ret = hns_roce_v1_destroy_qp(&hr_qp->ibqp);
+		if (ret)
+			dev_err(dev, "Destroy qp %d for mr free failed(%d)!\n",
+				i, ret);
+	}
+
+	ret = hns_roce_ib_destroy_cq(&free_mr->mr_free_cq->ib_cq);
+	if (ret)
+		dev_err(dev, "Destroy cq for mr_free failed(%d)!\n", ret);
+
+	ret = hns_roce_dealloc_pd(&free_mr->mr_free_pd->ibpd);
+	if (ret)
+		dev_err(dev, "Destroy pd for mr_free failed(%d)!\n", ret);
+}
+
 static int hns_roce_db_init(struct hns_roce_dev *hr_dev)
 {
 	struct device *dev = &hr_dev->pdev->dev;
@@ -659,6 +868,223 @@ static int hns_roce_db_init(struct hns_roce_dev *hr_dev)
 	return 0;
 }
 
+void hns_roce_v1_recreate_lp_qp_work_fn(struct work_struct *work)
+{
+	struct hns_roce_recreate_lp_qp_work *lp_qp_work;
+	struct hns_roce_dev *hr_dev;
+
+	lp_qp_work = container_of(work, struct hns_roce_recreate_lp_qp_work,
+				  work);
+	hr_dev = to_hr_dev(lp_qp_work->ib_dev);
+
+	hns_roce_v1_release_lp_qp(hr_dev);
+
+	if (hns_roce_v1_rsv_lp_qp(hr_dev))
+		dev_err(&hr_dev->pdev->dev, "create reserver qp failed\n");
+
+	if (lp_qp_work->comp_flag)
+		complete(lp_qp_work->comp);
+
+	kfree(lp_qp_work);
+}
+
+static int hns_roce_v1_recreate_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_recreate_lp_qp_work *lp_qp_work;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct completion comp;
+	unsigned long end =
+	  msecs_to_jiffies(HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS) + jiffies;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	lp_qp_work = kzalloc(sizeof(struct hns_roce_recreate_lp_qp_work),
+			     GFP_KERNEL);
+
+	INIT_WORK(&(lp_qp_work->work), hns_roce_v1_recreate_lp_qp_work_fn);
+
+	lp_qp_work->ib_dev = &(hr_dev->ib_dev);
+	lp_qp_work->comp = &comp;
+	lp_qp_work->comp_flag = 1;
+
+	init_completion(lp_qp_work->comp);
+
+	queue_work(free_mr->free_mr_wq, &(lp_qp_work->work));
+
+	while (time_before_eq(jiffies, end)) {
+		if (try_wait_for_completion(&comp))
+			return 0;
+		msleep(HNS_ROCE_V1_RECREATE_LP_QP_WAIT_VALUE);
+	}
+
+	lp_qp_work->comp_flag = 0;
+	if (try_wait_for_completion(&comp))
+		return 0;
+
+	dev_warn(dev, "recreate lp qp failed 20s timeout and return failed!\n");
+	return -ETIMEDOUT;
+}
+
+static int hns_roce_v1_send_lp_wqe(struct hns_roce_qp *hr_qp)
+{
+	struct hns_roce_dev *hr_dev = to_hr_dev(hr_qp->ibqp.device);
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_send_wr send_wr, *bad_wr;
+	int ret;
+
+	memset(&send_wr, 0, sizeof(send_wr));
+	send_wr.next	= NULL;
+	send_wr.num_sge	= 0;
+	send_wr.send_flags = 0;
+	send_wr.sg_list	= NULL;
+	send_wr.wr_id	= (unsigned long long)&send_wr;
+	send_wr.opcode	= IB_WR_RDMA_WRITE;
+
+	ret = hns_roce_v1_post_send(&hr_qp->ibqp, &send_wr, &bad_wr);
+	if (ret) {
+		dev_err(dev, "Post write wqe for mr free failed(%d)!", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void hns_roce_v1_mr_free_work_fn(struct work_struct *work)
+{
+	struct hns_roce_mr_free_work *mr_work;
+	struct ib_wc wc[HNS_ROCE_V1_RESV_QP];
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_cq *mr_free_cq;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_dev *hr_dev;
+	struct hns_roce_mr *hr_mr;
+	struct hns_roce_qp *hr_qp;
+	struct device *dev;
+	unsigned long end =
+		msecs_to_jiffies(HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS) + jiffies;
+	int i;
+	int ret;
+	int ne;
+
+	mr_work = container_of(work, struct hns_roce_mr_free_work, work);
+	hr_mr = (struct hns_roce_mr *)mr_work->mr;
+	hr_dev = to_hr_dev(mr_work->ib_dev);
+	dev = &hr_dev->pdev->dev;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+	mr_free_cq = free_mr->mr_free_cq;
+
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		hr_qp = free_mr->mr_free_qp[i];
+		ret = hns_roce_v1_send_lp_wqe(hr_qp);
+		if (ret) {
+			dev_err(dev,
+			     "Send wqe (qp:0x%lx) for mr free failed(%d)!\n",
+			     hr_qp->qpn, ret);
+			goto free_work;
+		}
+	}
+
+	ne = HNS_ROCE_V1_RESV_QP;
+	do {
+		ret = hns_roce_v1_poll_cq(&mr_free_cq->ib_cq, ne, wc);
+		if (ret < 0) {
+			dev_err(dev,
+			   "(qp:0x%lx) starts, Poll cqe failed(%d) for mr 0x%x free! Remain %d cqe\n",
+			   hr_qp->qpn, ret, hr_mr->key, ne);
+			goto free_work;
+		}
+		ne -= ret;
+		msleep(HNS_ROCE_V1_FREE_MR_WAIT_VALUE);
+	} while (ne && time_before_eq(jiffies, end));
+
+	if (ne != 0)
+		dev_err(dev,
+			"Poll cqe for mr 0x%x free timeout! Remain %d cqe\n",
+			hr_mr->key, ne);
+
+free_work:
+	if (mr_work->comp_flag)
+		complete(mr_work->comp);
+	kfree(mr_work);
+}
+
+int hns_roce_v1_dereg_mr(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_mr_free_work *mr_work;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct completion comp;
+	unsigned long end =
+		msecs_to_jiffies(HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS) + jiffies;
+	unsigned long start = jiffies;
+	int npages;
+	int ret = 0;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	if (mr->enabled) {
+		if (hns_roce_hw2sw_mpt(hr_dev, NULL, key_to_hw_index(mr->key)
+				       & (hr_dev->caps.num_mtpts - 1)))
+			dev_warn(dev, "HW2SW_MPT failed!\n");
+	}
+
+	mr_work = kzalloc(sizeof(*mr_work), GFP_KERNEL);
+	if (!mr_work) {
+		ret = -ENOMEM;
+		goto free_mr;
+	}
+
+	INIT_WORK(&(mr_work->work), hns_roce_v1_mr_free_work_fn);
+
+	mr_work->ib_dev = &(hr_dev->ib_dev);
+	mr_work->comp = &comp;
+	mr_work->comp_flag = 1;
+	mr_work->mr = (void *)mr;
+	init_completion(mr_work->comp);
+
+	queue_work(free_mr->free_mr_wq, &(mr_work->work));
+
+	while (time_before_eq(jiffies, end)) {
+		if (try_wait_for_completion(&comp))
+			goto free_mr;
+		msleep(HNS_ROCE_V1_FREE_MR_WAIT_VALUE);
+	}
+
+	mr_work->comp_flag = 0;
+	if (try_wait_for_completion(&comp))
+		goto free_mr;
+
+	dev_warn(dev, "Free mr work 0x%x over 50s and failed!\n", mr->key);
+	ret = -ETIMEDOUT;
+
+free_mr:
+	dev_dbg(dev, "Free mr 0x%x use 0x%x us.\n",
+		mr->key, jiffies_to_usecs(jiffies) - jiffies_to_usecs(start));
+
+	if (mr->size != ~0ULL) {
+		npages = ib_umem_page_count(mr->umem);
+		dma_free_coherent(dev, npages * 8, mr->pbl_buf,
+				  mr->pbl_dma_addr);
+	}
+
+	hns_roce_bitmap_free(&hr_dev->mr_table.mtpt_bitmap,
+			     key_to_hw_index(mr->key), 0);
+
+	if (mr->umem)
+		ib_umem_release(mr->umem);
+
+	kfree(mr);
+
+	return ret;
+}
+
 static void hns_roce_db_free(struct hns_roce_dev *hr_dev)
 {
 	struct device *dev = &hr_dev->pdev->dev;
@@ -899,6 +1325,46 @@ static void hns_roce_tptr_free(struct hns_roce_dev *hr_dev)
 			  tptr_buf->buf, tptr_buf->map);
 }
 
+static int hns_roce_free_mr_init(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	int ret = 0;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	free_mr->free_mr_wq = create_singlethread_workqueue("hns_roce_free_mr");
+	if (!free_mr->free_mr_wq) {
+		dev_err(dev, "Create free mr workqueue failed!\n");
+		return -ENOMEM;
+	}
+
+	ret = hns_roce_v1_rsv_lp_qp(hr_dev);
+	if (ret) {
+		dev_err(dev, "Reserved loop qp failed(%d)!\n", ret);
+		flush_workqueue(free_mr->free_mr_wq);
+		destroy_workqueue(free_mr->free_mr_wq);
+	}
+
+	return ret;
+}
+
+static void hns_roce_free_mr_free(struct hns_roce_dev *hr_dev)
+{
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	flush_workqueue(free_mr->free_mr_wq);
+	destroy_workqueue(free_mr->free_mr_wq);
+
+	hns_roce_v1_release_lp_qp(hr_dev);
+}
+
 /**
  * hns_roce_v1_reset - reset RoCE
  * @hr_dev: RoCE device struct pointer
@@ -1100,10 +1566,19 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 		goto error_failed_des_qp_init;
 	}
 
+	ret = hns_roce_free_mr_init(hr_dev);
+	if (ret) {
+		dev_err(dev, "free mr init failed!\n");
+		goto error_failed_free_mr_init;
+	}
+
 	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_UP);
 
 	return 0;
 
+error_failed_free_mr_init:
+	hns_roce_des_qp_free(hr_dev);
+
 error_failed_des_qp_init:
 	hns_roce_tptr_free(hr_dev);
 
@@ -1121,6 +1596,7 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 void hns_roce_v1_exit(struct hns_roce_dev *hr_dev)
 {
 	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_DOWN);
+	hns_roce_free_mr_free(hr_dev);
 	hns_roce_des_qp_free(hr_dev);
 	hns_roce_tptr_free(hr_dev);
 	hns_roce_bt_free(hr_dev);
@@ -1161,6 +1637,14 @@ void hns_roce_v1_set_mac(struct hns_roce_dev *hr_dev, u8 phy_port, u8 *addr)
 	u32 *p;
 	u32 val;
 
+	/*
+	 * When mac changed, loopback may fail
+	 * because of smac not equal to dmac.
+	 * We Need to release and create reserved qp again.
+	 */
+	if (hr_dev->hw->dereg_mr && hns_roce_v1_recreate_lp_qp(hr_dev))
+		dev_warn(&hr_dev->pdev->dev, "recreate lp qp timeout!\n");
+
 	p = (u32 *)(&addr[0]);
 	reg_smac_l = *p;
 	roce_raw_write(reg_smac_l, hr_dev->reg_base + ROCEE_SMAC_L_0_REG +
@@ -3299,5 +3783,6 @@ struct hns_roce_hw hns_roce_hw_v1 = {
 	.post_recv = hns_roce_v1_post_recv,
 	.req_notify_cq = hns_roce_v1_req_notify_cq,
 	.poll_cq = hns_roce_v1_poll_cq,
+	.dereg_mr = hns_roce_v1_dereg_mr,
 	.priv = &hr_v1_priv,
 };
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
index 1d250c0..b213b5e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
@@ -58,6 +58,7 @@
 #define HNS_ROCE_V1_PHY_UAR_NUM				8
 
 #define HNS_ROCE_V1_GID_NUM				16
+#define HNS_ROCE_V1_RESV_QP				8
 
 #define HNS_ROCE_V1_NUM_COMP_EQE			0x8000
 #define HNS_ROCE_V1_NUM_ASYNC_EQE			0x400
@@ -107,6 +108,10 @@
 #define HNS_ROCE_V1_DB_STAGE2				2
 #define HNS_ROCE_V1_CHECK_DB_TIMEOUT_MSECS		10000
 #define HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS		20
+#define HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS		50000
+#define HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS	10000
+#define HNS_ROCE_V1_FREE_MR_WAIT_VALUE			5
+#define HNS_ROCE_V1_RECREATE_LP_QP_WAIT_VALUE		20
 
 #define HNS_ROCE_BT_RSV_BUF_SIZE			(1 << 17)
 
@@ -969,6 +974,10 @@ struct hns_roce_sq_db {
 #define SQ_DOORBELL_U32_4_SQ_HEAD_M   \
 	(((1UL << 15) - 1) << SQ_DOORBELL_U32_4_SQ_HEAD_S)
 
+#define SQ_DOORBELL_U32_4_SL_S 16
+#define SQ_DOORBELL_U32_4_SL_M   \
+	(((1UL << 2) - 1) << SQ_DOORBELL_U32_4_SL_S)
+
 #define SQ_DOORBELL_U32_4_PORT_S 18
 #define SQ_DOORBELL_U32_4_PORT_M  (((1UL << 3) - 1) << SQ_DOORBELL_U32_4_PORT_S)
 
@@ -1015,14 +1024,39 @@ struct hns_roce_des_qp {
 	int	requeue_flag;
 };
 
+struct hns_roce_mr_free_work {
+	struct	work_struct work;
+	struct	ib_device *ib_dev;
+	struct	completion *comp;
+	int	comp_flag;
+	void	*mr;
+};
+
+struct hns_roce_recreate_lp_qp_work {
+	struct	work_struct work;
+	struct	ib_device *ib_dev;
+	struct	completion *comp;
+	int	comp_flag;
+};
+
+struct hns_roce_free_mr {
+	struct workqueue_struct *free_mr_wq;
+	struct hns_roce_qp *mr_free_qp[HNS_ROCE_V1_RESV_QP];
+	struct hns_roce_cq *mr_free_cq;
+	struct hns_roce_pd *mr_free_pd;
+};
+
 struct hns_roce_v1_priv {
 	struct hns_roce_db_table  db_table;
 	struct hns_roce_raq_table raq_table;
 	struct hns_roce_bt_table  bt_table;
 	struct hns_roce_tptr_table tptr_table;
 	struct hns_roce_des_qp des_qp;
+	struct hns_roce_free_mr free_mr;
 };
 
 int hns_dsaf_roce_reset(struct fwnode_handle *dsaf_fwnode, bool dereset);
+int hns_roce_v1_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int hns_roce_v1_destroy_qp(struct ib_qp *ibqp);
 
 #endif
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 914d0ac..0cedec0 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -129,7 +129,6 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 {
 	struct device *dev = &hr_dev->pdev->dev;
 	struct net_device *netdev;
-	unsigned long flags;
 
 	netdev = hr_dev->iboe.netdevs[port];
 	if (!netdev) {
@@ -137,7 +136,7 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 		return -ENODEV;
 	}
 
-	spin_lock_irqsave(&hr_dev->iboe.lock, flags);
+	spin_lock_bh(&hr_dev->iboe.lock);
 
 	switch (event) {
 	case NETDEV_UP:
@@ -156,7 +155,7 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 		break;
 	}
 
-	spin_unlock_irqrestore(&hr_dev->iboe.lock, flags);
+	spin_unlock_bh(&hr_dev->iboe.lock);
 	return 0;
 }
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 9b8a1ad..4139abe 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -42,7 +42,7 @@ static u32 hw_index_to_key(unsigned long ind)
 	return (u32)(ind >> 24) | (ind << 8);
 }
 
-static unsigned long key_to_hw_index(u32 key)
+unsigned long key_to_hw_index(u32 key)
 {
 	return (key << 24) | (key >> 8);
 }
@@ -56,7 +56,7 @@ static int hns_roce_sw2hw_mpt(struct hns_roce_dev *hr_dev,
 				 HNS_ROCE_CMD_TIMEOUT_MSECS);
 }
 
-static int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
+int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
 			      struct hns_roce_cmd_mailbox *mailbox,
 			      unsigned long mpt_index)
 {
@@ -607,13 +607,20 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 int hns_roce_dereg_mr(struct ib_mr *ibmr)
 {
+	struct hns_roce_dev *hr_dev = to_hr_dev(ibmr->device);
 	struct hns_roce_mr *mr = to_hr_mr(ibmr);
+	int ret = 0;
 
-	hns_roce_mr_free(to_hr_dev(ibmr->device), mr);
-	if (mr->umem)
-		ib_umem_release(mr->umem);
+	if (hr_dev->hw->dereg_mr) {
+		ret = hr_dev->hw->dereg_mr(hr_dev, mr);
+	} else {
+		hns_roce_mr_free(hr_dev, mr);
 
-	kfree(mr);
+		if (mr->umem)
+			ib_umem_release(mr->umem);
 
-	return 0;
+		kfree(mr);
+	}
+
+	return ret;
 }
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH for-next 3/6] IB/hns: Fix the bug of setting port mtu
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

In hns_roce driver, we need not call iboe_get_mtu to reduce
IB headers from effective IBoE MTU because hr_dev->caps.max_mtu
has already been reduced.

Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |   16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 0cedec0..5e620f9 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -72,18 +72,6 @@ static void hns_roce_set_mac(struct hns_roce_dev *hr_dev, u8 port, u8 *addr)
 	hr_dev->hw->set_mac(hr_dev, phy_port, addr);
 }
 
-static void hns_roce_set_mtu(struct hns_roce_dev *hr_dev, u8 port, int mtu)
-{
-	u8 phy_port = hr_dev->iboe.phy_port[port];
-	enum ib_mtu tmp;
-
-	tmp = iboe_get_mtu(mtu);
-	if (!tmp)
-		tmp = IB_MTU_256;
-
-	hr_dev->hw->set_mtu(hr_dev, phy_port, tmp);
-}
-
 static int hns_roce_add_gid(struct ib_device *device, u8 port_num,
 			    unsigned int index, const union ib_gid *gid,
 			    const struct ib_gid_attr *attr, void **context)
@@ -188,8 +176,8 @@ static int hns_roce_setup_mtu_mac(struct hns_roce_dev *hr_dev)
 	u8 i;
 
 	for (i = 0; i < hr_dev->caps.num_ports; i++) {
-		hns_roce_set_mtu(hr_dev, i,
-				 ib_mtu_enum_to_int(hr_dev->caps.max_mtu));
+		hr_dev->hw->set_mtu(hr_dev, hr_dev->iboe.phy_port[i],
+				    hr_dev->caps.max_mtu);
 		hns_roce_set_mac(hr_dev, i, hr_dev->iboe.netdevs[i]->dev_addr);
 	}
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH for-next 4/6] IB/hns: Delete the redundant memset operation
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

It deleted the redundant memset operation because the memory allocated
by ib_alloc_device has been set zero.

Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 5e620f9..28a8f24 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -843,9 +843,6 @@ static int hns_roce_probe(struct platform_device *pdev)
 	if (!hr_dev)
 		return -ENOMEM;
 
-	memset((u8 *)hr_dev + sizeof(struct ib_device), 0,
-		sizeof(struct hns_roce_dev) - sizeof(struct ib_device));
-
 	hr_dev->pdev = pdev;
 	platform_set_drvdata(pdev, hr_dev);
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH for-next 6/6] IB/hns: Fix the IB device name
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

This patch mainly fix the name for IB device in order
to match with libhns.

Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 28a8f24..eddb053 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -433,7 +433,7 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
 	spin_lock_init(&iboe->lock);
 
 	ib_dev = &hr_dev->ib_dev;
-	strlcpy(ib_dev->name, "hisi_%d", IB_DEVICE_NAME_MAX);
+	strlcpy(ib_dev->name, "hns_%d", IB_DEVICE_NAME_MAX);
 
 	ib_dev->owner			= THIS_MODULE;
 	ib_dev->node_type		= RDMA_NODE_IB_CA;
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: qed, qedi patchset submission
From: Arun Easi @ 2016-11-29 23:11 UTC (permalink / raw)
  To: David Miller, Martin K. Petersen; +Cc: linux-scsi, netdev
In-Reply-To: <yq11sydfwts.fsf@sermon.lab.mkp.net>

Hi David,

As we are preparing to post V3 of the qed (drivers/net/) and qedi 
(drivers/scsi/), I would like to get your opinion on the "qedi" part. As 
qedi (the iSCSI driver) is dependent on the accompanying "qed" changes, 
qedi changes have to follow or be together with qed changes. Martin is ok 
in having you taking the qedi part as well (see below discussion) or wait 
until qed is in, and then take qedi through SCSI tree.

Hi David, Martin,

So far, we have been posting qedi changes split into functional blocks, 
for review, but was not bisectable. With Martin ok to our request to 
squash all patches while committing to tree, we were wondering if we 
should post the qedi patches squashed, with all the Reviewed-by added, or 
continue to post as before?

Regards,
-Arun

On Mon, 14 Nov 2016, 3:47pm, Martin K. Petersen wrote:

> >>>>> "Arun" == Arun Easi <arun.easi@cavium.com> writes:
> 
> Arun,
> 
> Arun> Do you have any preference or thoughts on how the "qed" patches be
> Arun> approached? Just as a reference, our rdma driver "qedr" went
> Arun> through something similar[1], and eventually "qed" patches were
> Arun> taken by David in the net tree and "qedr", in the rdma tree
> Arun> (obviously) by Doug L.
> 
> DaveM can queue the whole series or I can hold the SCSI pieces for
> rc1. Either way works for me.
> 
> However, I would like to do a review of the driver first. I will try to
> get it done this week.
> 
> 

^ permalink raw reply

* [PATCH for-next 5/6] IB/hns: Fix the bug when free cq
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford
  Cc: salil.mehta, xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk,
	lijun_nudt, linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>

From: Shaobo Xu <xushaobo2@huawei.com>

If the resources of cq are freed while executing the user case, hardware
can not been notified in hip06 SoC. Then hardware will hold on when it
writes the cq buffer which has been released.

In order to slove this problem, RoCE driver checks the CQE counter, and
ensure that the outstanding CQE have been written. Then the cq buffer
can be released.

Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_common.h |    2 +
 drivers/infiniband/hw/hns/hns_roce_cq.c     |   27 ++++++++------
 drivers/infiniband/hw/hns/hns_roce_device.h |    8 ++++
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |   53 +++++++++++++++++++++++++++
 4 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_common.h b/drivers/infiniband/hw/hns/hns_roce_common.h
index a055632..4af403e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_common.h
+++ b/drivers/infiniband/hw/hns/hns_roce_common.h
@@ -354,6 +354,8 @@
 
 #define ROCEE_SDB_ISSUE_PTR_REG			0x758
 #define ROCEE_SDB_SEND_PTR_REG			0x75C
+#define ROCEE_CAEP_CQE_WCMD_EMPTY		0x850
+#define ROCEE_SCAEP_WR_CQE_CNT			0x8D0
 #define ROCEE_SDB_INV_CNT_REG			0x9A4
 #define ROCEE_SDB_RETRY_CNT_REG			0x9AC
 #define ROCEE_TSP_BP_ST_REG			0x9EC
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index c9f6c3d..ff9a6a3 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -179,8 +179,7 @@ static int hns_roce_hw2sw_cq(struct hns_roce_dev *dev,
 				 HNS_ROCE_CMD_TIMEOUT_MSECS);
 }
 
-static void hns_roce_free_cq(struct hns_roce_dev *hr_dev,
-			     struct hns_roce_cq *hr_cq)
+void hns_roce_free_cq(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
 {
 	struct hns_roce_cq_table *cq_table = &hr_dev->cq_table;
 	struct device *dev = &hr_dev->pdev->dev;
@@ -392,19 +391,25 @@ int hns_roce_ib_destroy_cq(struct ib_cq *ib_cq)
 {
 	struct hns_roce_dev *hr_dev = to_hr_dev(ib_cq->device);
 	struct hns_roce_cq *hr_cq = to_hr_cq(ib_cq);
+	int ret = 0;
 
-	hns_roce_free_cq(hr_dev, hr_cq);
-	hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
+	if (hr_dev->hw->destroy_cq) {
+		ret = hr_dev->hw->destroy_cq(ib_cq);
+	} else {
+		hns_roce_free_cq(hr_dev, hr_cq);
+		hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
 
-	if (ib_cq->uobject)
-		ib_umem_release(hr_cq->umem);
-	else
-		/* Free the buff of stored cq */
-		hns_roce_ib_free_cq_buf(hr_dev, &hr_cq->hr_buf, ib_cq->cqe);
+		if (ib_cq->uobject)
+			ib_umem_release(hr_cq->umem);
+		else
+			/* Free the buff of stored cq */
+			hns_roce_ib_free_cq_buf(hr_dev, &hr_cq->hr_buf,
+						ib_cq->cqe);
 
-	kfree(hr_cq);
+		kfree(hr_cq);
+	}
 
-	return 0;
+	return ret;
 }
 
 void hns_roce_cq_completion(struct hns_roce_dev *hr_dev, u32 cqn)
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 1050829..d4f0fce 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -56,6 +56,12 @@
 #define HNS_ROCE_MAX_INNER_MTPT_NUM		0x7
 #define HNS_ROCE_MAX_MTPT_PBL_NUM		0x100000
 
+#define HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS	20
+#define HNS_ROCE_MAX_FREE_CQ_WAIT_CNT	\
+	(5000 / HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS)
+#define HNS_ROCE_CQE_WCMD_EMPTY_BIT		0x2
+#define HNS_ROCE_MIN_CQE_CNT			16
+
 #define HNS_ROCE_MAX_IRQ_NUM			34
 
 #define HNS_ROCE_COMP_VEC_NUM			32
@@ -528,6 +534,7 @@ struct hns_roce_hw {
 	int (*req_notify_cq)(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
 	int (*poll_cq)(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
 	int (*dereg_mr)(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr);
+	int (*destroy_cq)(struct ib_cq *ibcq);
 	void	*priv;
 };
 
@@ -734,6 +741,7 @@ struct ib_cq *hns_roce_ib_create_cq(struct ib_device *ib_dev,
 				    struct ib_udata *udata);
 
 int hns_roce_ib_destroy_cq(struct ib_cq *ib_cq);
+void hns_roce_free_cq(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq);
 
 void hns_roce_cq_completion(struct hns_roce_dev *hr_dev, u32 cqn);
 void hns_roce_cq_event(struct hns_roce_dev *hr_dev, u32 cqn, int event_type);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index f67a3bf..b8111b0 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -3763,6 +3763,58 @@ int hns_roce_v1_destroy_qp(struct ib_qp *ibqp)
 	return 0;
 }
 
+int hns_roce_v1_destroy_cq(struct ib_cq *ibcq)
+{
+	struct hns_roce_dev *hr_dev = to_hr_dev(ibcq->device);
+	struct hns_roce_cq *hr_cq = to_hr_cq(ibcq);
+	struct device *dev = &hr_dev->pdev->dev;
+	u32 cqe_cnt_ori;
+	u32 cqe_cnt_cur;
+	u32 cq_buf_size;
+	int wait_time = 0;
+	int ret = 0;
+
+	hns_roce_free_cq(hr_dev, hr_cq);
+
+	/*
+	 * Before freeing cq buffer, we need to ensure that the outstanding CQE
+	 * have been written by checking the CQE counter.
+	 */
+	cqe_cnt_ori = roce_read(hr_dev, ROCEE_SCAEP_WR_CQE_CNT);
+	while (1) {
+		if (roce_read(hr_dev, ROCEE_CAEP_CQE_WCMD_EMPTY) &
+		    HNS_ROCE_CQE_WCMD_EMPTY_BIT)
+			break;
+
+		cqe_cnt_cur = roce_read(hr_dev, ROCEE_SCAEP_WR_CQE_CNT);
+		if ((cqe_cnt_cur - cqe_cnt_ori) >= HNS_ROCE_MIN_CQE_CNT)
+			break;
+
+		msleep(HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS);
+		if (wait_time > HNS_ROCE_MAX_FREE_CQ_WAIT_CNT) {
+			dev_warn(dev, "Destroy cq 0x%lx timeout!\n",
+				hr_cq->cqn);
+			ret = -ETIMEDOUT;
+			break;
+		}
+		wait_time++;
+	}
+
+	hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
+
+	if (ibcq->uobject)
+		ib_umem_release(hr_cq->umem);
+	else {
+		/* Free the buff of stored cq */
+		cq_buf_size = (ibcq->cqe + 1) * hr_dev->caps.cq_entry_sz;
+		hns_roce_buf_free(hr_dev, cq_buf_size, &hr_cq->hr_buf.hr_buf);
+	}
+
+	kfree(hr_cq);
+
+	return ret;
+}
+
 struct hns_roce_v1_priv hr_v1_priv;
 
 struct hns_roce_hw hns_roce_hw_v1 = {
@@ -3784,5 +3836,6 @@ struct hns_roce_hw hns_roce_hw_v1 = {
 	.req_notify_cq = hns_roce_v1_req_notify_cq,
 	.poll_cq = hns_roce_v1_poll_cq,
 	.dereg_mr = hns_roce_v1_dereg_mr,
+	.destroy_cq = hns_roce_v1_destroy_cq,
 	.priv = &hr_v1_priv,
 };
-- 
1.7.9.5

^ permalink raw reply related

* Re: netlink: GPF in sock_sndtimeo
From: Cong Wang @ 2016-11-29 23:13 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Dmitry Vyukov, David Miller, Johannes Berg, Florian Westphal,
	Eric Dumazet, Herbert Xu, netdev, LKML, syzkaller
In-Reply-To: <20161129164859.GD26673@madcap2.tricolour.ca>

On Tue, Nov 29, 2016 at 8:48 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2016-11-26 17:11, Cong Wang wrote:
>> It is racy on audit_sock, especially on the netns exit path.
>
> I think that is the only place it is racy.  The other places audit_sock
> is set is when the socket failure has just triggered a reset.
>
> Is there a notifier callback for failed or reaped sockets?

Is NETLINK_URELEASE event what you are looking for?

^ permalink raw reply

* Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
From: Alexander Duyck @ 2016-11-29 23:14 UTC (permalink / raw)
  To: Robert Shearman; +Cc: David Miller, Netdev, Alexander Duyck
In-Reply-To: <1480441605-27855-1-git-send-email-rshearma@brocade.com>

On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman <rshearma@brocade.com> wrote:
> With certain distributions of routes it can take a long time to add
> and delete further routes at scale. For example, with a route for
> 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
> from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
> is update_suffix:
>
>       40.39%  [kernel]                      [k] update_suffix
>        8.02%  libnl-3.so.200.19.0           [.] nl_hash_table_lookup

Well yeah, it will be expensive when you have over 512K entries in a
single node.  I'm assuming that is the case based on the fact that
200K routes will double up in the trie node due to broadcast and the
route ending up in adjacent key vectors.

> With these changes, the time is reduced to 4s for the same scale and
> distribution of routes.
>
> The issue is that update_suffix does an O(n) walk on the children of a
> node and the with a dense distribtion of routes the number of children
> in a node tends towards the number of nodes in the tree.

Other than the performance I was just wondering if you did any other
validation to confirm that nothing else differs.  Basically it would
just be a matter of verifying that /proc/net/fib_trie is the same
before and after your patch.

> In the add case it isn't necessary to walk all the other children to
> update the largest suffix length of the node (which is what
> update_suffix is doing) since we already know what the largest suffix
> length of all the other children is and we only need to update it if
> the new node/leaf has a larger suffix. However, it is currently called
> from resize which is called on any update to rebalance the trie.
>
> Therefore improve the scaling by moving the responsibility of
> recalculating the node's largest suffix length out of the resize
> function into its callers. For fib_insert_node this can be taken care
> of by extending put_child to not only update the largest suffix length
> in the parent node, but to propagate it all the way up the trie as
> required, using a new function, node_push_suffix, based on
> leaf_push_suffix, but renamed to reflect its purpose and made safe if
> the node has no parent.
>
> No changes are required to inflate/halve since the maximum suffix
> length of the sub-trie starting from the node's parent isn't changed,
> and the suffix lengths of the halved/inflated nodes are updated
> through the creation of the new nodes and put_child called to add the
> children to them.
>
> In both fib_table_flush and fib_table_flush_external, given that a
> walk of the entire trie is being done there's a choice of optimising
> either for the case of a small number of leafs being flushed by
> updating the suffix on a change and propagating it up, or optimising
> for large numbers of leafs being flushed by deferring the updating of
> the largest suffix length to the walk up to the parent node. I opted
> for the latter to keep the algorithm linear in complexity in the case
> of flushing all leafs and because it is close to the status quo.
>
> Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
> Signed-off-by: Robert Shearman <rshearma@brocade.com>

So I am not entirely convinced this is a regression, I was wondering
what your numbers looked like before the patch you are saying this
fixes?  I recall I fixed a number of issues leading up to this so I am
just curious.

My thought is this seems more like a performance optimization than a
fix.  If that is the case this probably belongs in net-next.

It seems to me we might be able to simplify update_suffix rather than
going through all this change.  That might be something that is more
acceptable for net.  Have you looked at simply adding code so that
there is a break inside update_suffix if (slen == tn->slen)?  We don't
have to call it for if a leaf is added so it might make sense to add
that check.

> ---
>  net/ipv4/fib_trie.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 62 insertions(+), 24 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 026f309c51e9..701cae8af44a 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n)
>         return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
>  }
>
> +static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
> +{
> +       while (tn->slen < l->slen) {
> +               tn->slen = l->slen;
> +               tn = node_parent(tn);
> +               if (!tn)
> +                       break;

I don't think the NULL check is necessary, at least it wasn't with the old code.

> +       }
> +}
> +
>  /* Add a child at position i overwriting the old value.
>   * Update the value of full_children and empty_children.
> + *
> + * The suffix length of the parent node and the rest of the tree is
> + * updated (if required) when adding/replacing a node, but is caller's
> + * responsibility on removal.
>   */
>  static void put_child(struct key_vector *tn, unsigned long i,
>                       struct key_vector *n)
> @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i,
>         else if (!wasfull && isfull)
>                 tn_info(tn)->full_children++;
>
> -       if (n && (tn->slen < n->slen))
> -               tn->slen = n->slen;
> +       if (n)
> +               node_push_suffix(tn, n);

This is just creating redundant work if we have to perform a resize.

>         rcu_assign_pointer(tn->tnode[i], n);
>  }
> @@ -919,34 +933,35 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn)
>         if (max_work != MAX_WORK)
>                 return tp;
>
> -       /* push the suffix length to the parent node */
> -       if (tn->slen > tn->pos) {
> -               unsigned char slen = update_suffix(tn);
> -
> -               if (slen > tp->slen)
> -                       tp->slen = slen;
> -       }
> -
>         return tp;
>  }
>

So it seems like the goal here is to move update_suffix out of the
resize code.  I'm mostly okay with that.  I don't recall why I was
doing it as a part of the resize instead of just doing it whenever a
longer suffix was added.

> -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
> +static void node_set_suffix(struct key_vector *tp, unsigned char slen)
>  {
> -       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
> -               if (update_suffix(tp) > l->slen)
> +       if (slen > tp->slen)
> +               tp->slen = slen;
> +}
> +
> +static void node_pull_suffix(struct key_vector *tn)
> +{
> +       struct key_vector *tp;
> +       unsigned char slen;
> +
> +       slen = update_suffix(tn);
> +       tp = node_parent(tn);
> +       while ((tp->slen > tp->pos) && (tp->slen > slen)) {
> +               if (update_suffix(tp) > slen)
>                         break;
>                 tp = node_parent(tp);
>         }
>  }
>

I really hate all the renaming.  Can't you just leave these as
leaf_pull_suffix and leaf_push _suffix.  I'm fine with us dropping the
leaf of the suffix but the renaming just makes adds a bunch of noise.
Really it might work better if you broke the replacement of the
key_vector as a leaf with the suffix length into a separate patch,
then you could do the rename as a part of that.

> -static void leaf_push_suffix(struct key_vector *tn, struct key_vector *l)
> +static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
>  {
> -       /* if this is a new leaf then tn will be NULL and we can sort
> -        * out parent suffix lengths as a part of trie_rebalance
> -        */
> -       while (tn->slen < l->slen) {
> -               tn->slen = l->slen;
> -               tn = node_parent(tn);
> +       while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
> +               if (update_suffix(tp) > l->slen)
> +                       break;
> +               tp = node_parent(tp);
>         }
>  }

If possible it would work better if you could avoid moving functions
around as a result of your changes.

> @@ -1107,7 +1122,7 @@ static int fib_insert_alias(struct trie *t, struct key_vector *tp,
>         /* if we added to the tail node then we need to update slen */
>         if (l->slen < new->fa_slen) {
>                 l->slen = new->fa_slen;
> -               leaf_push_suffix(tp, l);
> +               node_push_suffix(tp, l);
>         }
>
>         return 0;
> @@ -1495,12 +1510,15 @@ static void fib_remove_alias(struct trie *t, struct key_vector *tp,
>         /* remove the fib_alias from the list */
>         hlist_del_rcu(&old->fa_list);
>
> -       /* if we emptied the list this leaf will be freed and we can sort
> -        * out parent suffix lengths as a part of trie_rebalance
> -        */
> +       /* if we emptied the list this leaf will be freed */
>         if (hlist_empty(&l->leaf)) {
>                 put_child_root(tp, l->key, NULL);
>                 node_free(l);
> +               /* only need to update suffixes if this alias was
> +                * possibly the one with the largest suffix in the parent
> +                */
> +               if (tp->slen == old->fa_slen)
> +                       node_pull_suffix(tp);
>                 trie_rebalance(t, tp);
>                 return;
>         }
> @@ -1783,6 +1801,16 @@ void fib_table_flush_external(struct fib_table *tb)
>                         if (IS_TRIE(pn))
>                                 break;
>
> +                       /* push the suffix length to the parent node,
> +                        * since a previous leaf removal may have
> +                        * caused it to change
> +                        */
> +                       if (pn->slen > pn->pos) {
> +                               unsigned char slen = update_suffix(pn);
> +
> +                               node_set_suffix(node_parent(pn), slen);
> +                       }
> +

Why bother with the local variable?  You could just pass
update_suffix(pn) as the parameter to your node_set_suffix function.

>                         /* resize completed node */
>                         pn = resize(t, pn);
>                         cindex = get_index(pkey, pn);
> @@ -1849,6 +1877,16 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
>                         if (IS_TRIE(pn))
>                                 break;
>
> +                       /* push the suffix length to the parent node,
> +                        * since a previous leaf removal may have
> +                        * caused it to change
> +                        */
> +                       if (pn->slen > pn->pos) {
> +                               unsigned char slen = update_suffix(pn);
> +
> +                               node_set_suffix(node_parent(pn), slen);
> +                       }
> +

Same here.

>                         /* resize completed node */
>                         pn = resize(t, pn);
>                         cindex = get_index(pkey, pn);
> --
> 2.1.4
>

^ permalink raw reply

* RE: [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Woojung.Huh @ 2016-11-29 23:20 UTC (permalink / raw)
  To: f.fainelli, davem, andrew; +Cc: netdev, UNGLinuxDriver
In-Reply-To: <dcabc07f-ab5f-ad25-5beb-09f717c06e4e@gmail.com>

> There are two ways to get these settings propagated to the PHY driver:
> 
> - using a board fixup which is going to be invoked during
> drv->config_init() time
> 
> - specifying a phydev->dev_flags and reading it from the PHY driver to
> act upon and configure the PHY based on that value, there are only
> 32-bits available though, and you need to make sure they are not
> conflicting with other potential users in tree
> 
> My preference would go with 1, since you could just register it in your
> PHY driver and re-use the code you are proposing to include here.

fixup looks good to me too. Will submit revision soon.

Thanks.
- Woojung

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox