Linux RDMA and InfiniBand development

* Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of allocating memory using APIs
From: Leon Romanovsky @ 2016-11-16  8:36 UTC
  To: Salil Mehta
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Huwei (Xavier),
	oulijun, mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linuxarm,
	Zhangping (ZP)
In-Reply-To: <F4CC6FACFEB3C54C9141D49AD221F7F91A7A2371@lhreml503-mbx>

[-- Attachment #1: Type: text/plain, Size: 3338 bytes --]

On Tue, Nov 15, 2016 at 03:52:46PM +0000, Salil Mehta wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org]
> > Sent: Wednesday, November 09, 2016 7:22 AM
> > To: Salil Mehta
> > Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; Huwei (Xavier); oulijun;
> > mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> > netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Linuxarm;
> > Zhangping (ZP)
> > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> > allocating memory using APIs
> >
> > On Fri, Nov 04, 2016 at 04:36:25PM +0000, Salil Mehta wrote:
> > > From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > >
> > > This patch modifies the memory allocation logic in the hns RoCE
> > > driver. It uses kcalloc instead of kmalloc_array plus bitmap_zero,
> > > and falls back to vzalloc when kcalloc fails.
> > >
> > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > Signed-off-by: Ping Zhang <zhangping5-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > Signed-off-by: Salil Mehta  <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  drivers/infiniband/hw/hns/hns_roce_mr.c |   15 ++++++++-------
> > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > index fb87883..d3dfb5f 100644
> > > --- a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > @@ -137,11 +137,12 @@ static int hns_roce_buddy_init(struct hns_roce_buddy *buddy, int max_order)
> > >
> > >  	for (i = 0; i <= buddy->max_order; ++i) {
> > >  		s = BITS_TO_LONGS(1 << (buddy->max_order - i));
> > > -		buddy->bits[i] = kmalloc_array(s, sizeof(long), GFP_KERNEL);
> > > -		if (!buddy->bits[i])
> > > -			goto err_out_free;
> > > -
> > > -		bitmap_zero(buddy->bits[i], 1 << (buddy->max_order - i));
> > > +		buddy->bits[i] = kcalloc(s, sizeof(long), GFP_KERNEL);
> > > +		if (!buddy->bits[i]) {
> > > +			buddy->bits[i] = vzalloc(s * sizeof(long));
> >
> > I wonder why you don't use vzalloc directly instead of kcalloc
> > with a vzalloc fallback?
> If the kcalloc call succeeds, we get physically contiguous pages.
> This gives us a chance of better performance than with allocations
> that are only virtually contiguous, as returned by vzalloc().
> Therefore, the latter is used only as a fallback when our memory
> request cannot be satisfied by kcalloc.
>
> Are you suggesting that there will not be much of a performance
> penalty if we just use vzalloc?

Not exactly.
I asked because we have similar code in our drivers and this
construction looks strange to me.

1. If performance is critical, we use kmalloc.
2. If performance is not critical, we use vmalloc.

But in this case, the construction tells me that we can live with
vmalloc performance, so the kmalloc allocation is not really needed.

In your specific case, I'm not sure that kcalloc will ever fail.
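To put the allocation sizes in perspective, the user-space sketch below reproduces the per-order bitmap size computed in the hns_roce_buddy_init() loop. BITS_PER_LONG and BITS_TO_LONGS are local stand-ins assumed to match the kernel macros, and buddy_bitmap_longs() / buddy_total_bytes() are hypothetical helper names; the max_order the driver actually passes is not shown in this thread.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Local stand-ins for the kernel macros (assumption: same semantics,
 * i.e. round a bit count up to a whole number of longs). */
#define BITS_PER_LONG     (CHAR_BIT * sizeof(long))
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Longs allocated for the order-i bitmap, as in
 * "s = BITS_TO_LONGS(1 << (buddy->max_order - i))". */
static size_t buddy_bitmap_longs(int max_order, int i)
{
	assert(i >= 0 && i <= max_order);
	return BITS_TO_LONGS((size_t)1 << (max_order - i));
}

/* Total bytes requested across all orders 0..max_order. */
static size_t buddy_total_bytes(int max_order)
{
	size_t total = 0;

	for (int i = 0; i <= max_order; ++i)
		total += buddy_bitmap_longs(max_order, i) * sizeof(long);
	return total;
}
```

For example, with max_order = 10 on a 64-bit build the order-0 bitmap needs 1024 bits (16 longs), and the sum over all orders is 37 longs, a little over twice the order-0 size, so each kcalloc request stays small unless max_order is large.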

Thanks


>
> >
> > > +			if (!buddy->bits[i])
> > > +				goto err_out_free;
> > > +		}
> > >  	}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

* [PATCH rdma-rc 0/5] RXE fixes for 4.9
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Doug,

Please find below the RXE fixes for 4.9 from Yonatan and Moni.

This patchset was generated against v4.9-rc3.

Available in the "topic/rxe-fixes-4.9" topic branch of this git repo:
git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git

Or for browsing:
https://git.kernel.org/cgit/linux/kernel/git/leon/linux-rdma.git/log/?h=topic/rxe-fixes-4.9

Thanks

Yonatan Cohen (5):
  IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
  IB/rxe: Fix handling of erroneous WR
  IB/rxe: Increase max number of completions to 32k
  IB/rxe: Clear queue buffer when modifying QP to reset
  IB/rxe: Update qp state for user query

 drivers/infiniband/sw/rxe/rxe_net.c   |  8 ++------
 drivers/infiniband/sw/rxe/rxe_param.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    |  2 ++
 drivers/infiniband/sw/rxe/rxe_queue.c |  9 +++++++++
 drivers/infiniband/sw/rxe/rxe_queue.h |  2 ++
 drivers/infiniband/sw/rxe/rxe_req.c   | 21 +++++++++++++--------
 6 files changed, 29 insertions(+), 15 deletions(-)

--
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* [PATCH rdma-rc 1/5] IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen
In-Reply-To: <1479285558-19627-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

From: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Missing initialization of udp_tunnel_sock_cfg causes the following
kernel panic when the kernel tries to execute gro_receive().

While at it, we converted udp_port_cfg to use the same
initialization scheme as udp_tunnel_sock_cfg.
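The fix relies on the C rule that an initializer such as "= {0}" zero-initializes every member that is not explicitly named, including function pointers, which become null pointers. A minimal user-space sketch: struct tunnel_cfg, make_cfg() and gro_hook_present() are hypothetical stand-ins, not the real udp_tunnel_sock_cfg layout or the networking API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a tunnel socket config: some
 * data fields plus optional callbacks that are only invoked if set. */
struct tunnel_cfg {
	void *sk_user_data;
	unsigned char encap_type;
	int (*encap_rcv)(void *sk, void *skb);
	void (*encap_destroy)(void *sk);
	void *(*gro_receive)(void *sk, void *head, void *skb);
};

static int demo_encap_rcv(void *sk, void *skb)
{
	(void)sk;
	(void)skb;
	return 0;
}

/* Same pattern as the patch: "= {0}" clears every member, so callbacks
 * the caller never assigns (gro_receive here) are NULL, not stack junk. */
static struct tunnel_cfg make_cfg(void)
{
	struct tunnel_cfg cfg = {0};

	cfg.encap_type = 1;
	cfg.encap_rcv = demo_encap_rcv;
	return cfg;
}

/* A GRO-style caller only invokes the hook when it is non-NULL. */
static int gro_hook_present(const struct tunnel_cfg *cfg)
{
	return cfg->gro_receive != NULL;
}
```

Without the initializer, the unassigned callback fields of a stack variable hold indeterminate values, which is consistent with the oops below, where udp_gro_receive() jumps into a non-executable address inside ib_rxe.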

------------[ cut here ]------------
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffffffffa0588c50
IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163
Oops: 0011 [#1] SMP
Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000
RIP: 0010:[<ffffffffa0588c50>]
[<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP: 0018:ffff880237343c80  EFLAGS: 00010282
RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712
RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00
RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262
R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578
FS:  0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0
Stack:
ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014
000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb
0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000
Call Trace:
<IRQ>
[<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130
[<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0
[<ffffffff81611373>] inet_gro_receive+0x1d3/0x270
[<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0
[<ffffffff81595188>] napi_gro_receive+0x38/0x120
[<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core]
[<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core]
[<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core]
[<ffffffff815949a0>] net_rx_action+0x160/0x380
[<ffffffff816a9197>] __do_softirq+0xd7/0x2c5
[<ffffffff81085c35>] irq_exit+0xf5/0x100
[<ffffffff816a8f16>] do_IRQ+0x56/0xd0
[<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c
<EOI>
[<ffffffff81061f96>] ? native_safe_halt+0x6/0x10
[<ffffffff81037ade>] default_idle+0x1e/0xd0
[<ffffffff8103828f>] arch_cpu_idle+0xf/0x20
[<ffffffff810c37dc>] default_idle_call+0x3c/0x50
[<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0
[<ffffffff81050d8c>] start_secondary+0x15c/0x1a0
RIP  [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP <ffff880237343c80>
CR2: ffffffffa0588c50
---[ end trace 489ee31fa7614ac5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index b8258e4..ffff5a5 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -243,10 +243,8 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
 {
 	int err;
 	struct socket *sock;
-	struct udp_port_cfg udp_cfg;
-	struct udp_tunnel_sock_cfg tnl_cfg;
-
-	memset(&udp_cfg, 0, sizeof(udp_cfg));
+	struct udp_port_cfg udp_cfg = {0};
+	struct udp_tunnel_sock_cfg tnl_cfg = {0};
 
 	if (ipv6) {
 		udp_cfg.family = AF_INET6;
@@ -264,10 +262,8 @@ static struct socket *rxe_setup_udp_tunnel(struct net *net, __be16 port,
 		return ERR_PTR(err);
 	}
 
-	tnl_cfg.sk_user_data = NULL;
 	tnl_cfg.encap_type = 1;
 	tnl_cfg.encap_rcv = rxe_udp_encap_recv;
-	tnl_cfg.encap_destroy = NULL;
 
 	/* Setup UDP tunnel */
 	setup_udp_tunnel_sock(net, sock, &tnl_cfg);
-- 
2.7.4

* [PATCH rdma-rc 2/5] IB/rxe: Fix handling of erroneous WR
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen
In-Reply-To: <1479285558-19627-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

From: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

To correctly handle an erroneous WR, this fix does the following:
1. Makes sure the bad WQE generates a user completion event.
2. Calls rxe_completer to handle the erroneous WQE.

Before the fix, when rxe_requester found a bad WQE, it changed the
WQE's status to IB_WC_LOC_PROT_ERR and exited with 0 for non-RC QPs.

If this was the first WQE, there would be no ACK to invoke the
completer, and this bad WQE would be stuck in the QP's send queue.

On top of that, the requester exiting with 0 caused rxe_do_task to
endlessly invoke rxe_requester, resulting in the soft lockup shown
below.
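The endless loop follows from how the task helper treats the worker's return value: 0 means "call me again", and only a non-zero value such as -EAGAIN stops the loop. The sketch below models that contract in user space; run_task(), requester_old() and requester_new() are hypothetical names, and the max_calls cap exists only so the demo terminates.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the rxe task loop: the worker is called again
 * for as long as it returns 0; any non-zero return ends the loop.
 * max_calls exists only so this demo cannot spin forever. */
typedef int (*task_fn)(void *arg);

static int run_task(task_fn fn, void *arg, int max_calls, int *calls)
{
	int ret = 0;

	*calls = 0;
	while (*calls < max_calls) {
		++*calls;
		ret = fn(arg);
		if (ret != 0)
			break;	/* worker reports nothing more to do now */
	}
	return ret;
}

/* Old error path: returns 0 ("made progress") although the bad WQE
 * never leaves the send queue, so the loop never ends on its own. */
static int requester_old(void *arg)
{
	(void)arg;
	return 0;
}

/* Patched error path: hand the WQE to the completer (modelled here as
 * a counter) and return -EAGAIN so the loop stops. */
static int requester_new(void *arg)
{
	int *completions = arg;

	(*completions)++;
	return -EAGAIN;
}
```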

If the WQE was not the first and rxe_completer did get a chance to
handle the bad WQE, it did not generate a completion event, since the
WQE's IB_SEND_SIGNALED flag was not set.

Setting the IB_SEND_SIGNALED flag on the erroneous WQE follows the IBA
spec, version 1.2.1, section 10.7.3.1 (Signaled Completions).
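The completion rule the patch enforces can be written as a small predicate: a successful WQE produces a work completion only if it was posted signaled (or the QP signals every WR), while an erroneous one must always produce one. The sketch uses made-up DEMO_* names and values rather than the real ib_verbs constants, and models the QP-wide setting with an assumed sq_sig_all flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced WQE: only what the completion rule needs. The
 * names echo IB_WC_LOC_PROT_ERR / IB_SEND_SIGNALED; values are made up. */
enum demo_wc_status { DEMO_WC_SUCCESS, DEMO_WC_LOC_PROT_ERR };
#define DEMO_SEND_SIGNALED 0x2u

struct demo_wqe {
	unsigned int send_flags;
	enum demo_wc_status status;
};

/* Does the completer post a work completion for this WQE? sq_sig_all
 * models a QP created to signal every work request. */
static bool generates_wc(const struct demo_wqe *wqe, bool sq_sig_all)
{
	return sq_sig_all || (wqe->send_flags & DEMO_SEND_SIGNALED) != 0;
}

/* The patched error path: record the error and force the signaled
 * flag, so even an unsignaled WR produces a completion. */
static void fail_wqe(struct demo_wqe *wqe)
{
	wqe->status = DEMO_WC_LOC_PROT_ERR;
	wqe->send_flags |= DEMO_SEND_SIGNALED;
}
```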

NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
[<ffffffffa0590145>] ? rxe_pool_get_index+0x35/0xb0 [rdma_rxe]
[<ffffffffa05952ec>] lookup_mem+0x3c/0xc0 [rdma_rxe]
[<ffffffffa0595534>] copy_data+0x1c4/0x230 [rdma_rxe]
[<ffffffffa058c180>] rxe_requester+0x9d0/0x1100 [rdma_rxe]
[<ffffffff8158e98a>] ? kfree_skbmem+0x5a/0x60
[<ffffffffa05962c9>] rxe_do_task+0x89/0xf0 [rdma_rxe]
[<ffffffffa05963e2>] rxe_run_task+0x12/0x30 [rdma_rxe]
[<ffffffffa059110a>] rxe_post_send+0x41a/0x550 [rdma_rxe]
[<ffffffff811ef922>] ? __kmalloc+0x182/0x200
[<ffffffff816ba512>] ? down_read+0x12/0x40
[<ffffffffa054bd32>] ib_uverbs_post_send+0x532/0x540 [ib_uverbs]
[<ffffffff815f8722>] ? tcp_sendmsg+0x402/0xb80
[<ffffffffa05453dc>] ib_uverbs_write+0x18c/0x3f0 [ib_uverbs]
[<ffffffff81623c2e>] ? inet_recvmsg+0x7e/0xb0
[<ffffffff8158764d>] ? sock_recvmsg+0x3d/0x50
[<ffffffff81215b87>] __vfs_write+0x37/0x140
[<ffffffff81216892>] vfs_write+0xb2/0x1b0
[<ffffffff81217ce5>] SyS_write+0x55/0xc0
[<ffffffff816bc672>] entry_SYSCALL_64_fastpath+0x1a/0xa

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/sw/rxe/rxe_req.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 832846b..22bd963 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -696,7 +696,8 @@ int rxe_requester(void *arg)
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
-			goto complete;
+			__rxe_do_task(&qp->comp.task);
+			return 0;
 		}
 		payload = mtu;
 	}
@@ -745,13 +746,17 @@ int rxe_requester(void *arg)
 	wqe->status = IB_WC_LOC_PROT_ERR;
 	wqe->state = wqe_state_error;
 
-complete:
-	if (qp_type(qp) != IB_QPT_RC) {
-		while (rxe_completer(qp) == 0)
-			;
-	}
-
-	return 0;
+	/*
+	 * IBA Spec. Section 10.7.3.1 SIGNALED COMPLETIONS
+	 * ---------8<---------8<-------------
+	 * ...Note that if a completion error occurs, a Work Completion
+	 * will always be generated, even if the signaling
+	 * indicator requests an Unsignaled Completion.
+	 * ---------8<---------8<-------------
+	 */
+	wqe->wr.send_flags |= IB_SEND_SIGNALED;
+	__rxe_do_task(&qp->comp.task);
+	return -EAGAIN;
 
 exit:
 	return -EAGAIN;
-- 
2.7.4

* [PATCH rdma-rc 3/5] IB/rxe: Increase max number of completions to 32k
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen
In-Reply-To: <1479285558-19627-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

From: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Increase the max CQE limit from 8K to 32K so that demanding
applications can run over SoftRoCE with the same configuration
that most RoCEv2 HW vendors support.
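As a quick check of the numbers: RXE_MAX_LOG_CQE is a base-2 logarithm, so the nominal limit is 2^13 = 8192 before the patch and 2^15 = 32768 after it. A tiny sketch; cqe_limit() is a hypothetical helper, and whether the driver reserves an entry out of that power of two is not visible in this hunk.

```c
#include <assert.h>

/* RXE_MAX_LOG_CQE before and after the patch, taken from the diff. */
enum { OLD_MAX_LOG_CQE = 13, NEW_MAX_LOG_CQE = 15 };

/* Nominal CQ depth for a log2 limit: 2^log_cqe entries. */
static unsigned int cqe_limit(unsigned int log_cqe)
{
	return 1u << log_cqe;
}
```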

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/sw/rxe/rxe_param.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index f459c43..13ed2cc 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -82,7 +82,7 @@ enum rxe_device_param {
 	RXE_MAX_SGE			= 32,
 	RXE_MAX_SGE_RD			= 32,
 	RXE_MAX_CQ			= 16384,
-	RXE_MAX_LOG_CQE			= 13,
+	RXE_MAX_LOG_CQE			= 15,
 	RXE_MAX_MR			= 2 * 1024,
 	RXE_MAX_PD			= 0x7ffc,
 	RXE_MAX_QP_RD_ATOM		= 128,
-- 
2.7.4

* [PATCH rdma-rc 4/5] IB/rxe: Clear queue buffer when modifying QP to reset
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen
In-Reply-To: <1479285558-19627-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

From: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

RXE clears the send queue only once, in rxe_qp_init_req() when the
QP is created. When the QP is reused after a QP reset, the send queue
still holds stale data from its previous use.

This stale data causes CQEs that should have completed successfully
to be wrongly reported as failed.
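The reset clears the element storage while preserving the management header that precedes it in the same buffer. A user-space sketch of the same arithmetic, assuming a simplified header with two index fields followed by a flexible array member; demo_queue_buf, demo_queue and their helpers are hypothetical and do not reproduce the real struct rxe_queue_buf.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a management header followed by the element
 * storage as a flexible array member. Field names are illustrative. */
struct demo_queue_buf {
	uint32_t producer_index;
	uint32_t consumer_index;
	uint8_t data[];
};

struct demo_queue {
	struct demo_queue_buf *buf;
	size_t buf_size;	/* header + element storage, in bytes */
};

static int demo_queue_init(struct demo_queue *q, size_t data_bytes)
{
	q->buf_size = sizeof(struct demo_queue_buf) + data_bytes;
	q->buf = calloc(1, q->buf_size);
	return q->buf ? 0 : -1;
}

/* Same arithmetic as rxe_queue_reset(): clear only the storage that
 * follows the header, leaving the header fields untouched. */
static void demo_queue_reset(struct demo_queue *q)
{
	memset(q->buf->data, 0, q->buf_size - sizeof(struct demo_queue_buf));
}
```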

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/sw/rxe/rxe_qp.c    | 1 +
 drivers/infiniband/sw/rxe/rxe_queue.c | 9 +++++++++
 drivers/infiniband/sw/rxe/rxe_queue.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index b8036cf..95aaaa2 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -522,6 +522,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 	if (qp->sq.queue) {
 		__rxe_do_task(&qp->comp.task);
 		__rxe_do_task(&qp->req.task);
+		rxe_queue_reset(qp->sq.queue);
 	}
 
 	/* cleanup attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_queue.c b/drivers/infiniband/sw/rxe/rxe_queue.c
index 0827425..d14bf49 100644
--- a/drivers/infiniband/sw/rxe/rxe_queue.c
+++ b/drivers/infiniband/sw/rxe/rxe_queue.c
@@ -84,6 +84,15 @@ int do_mmap_info(struct rxe_dev *rxe,
 	return -EINVAL;
 }
 
+inline void rxe_queue_reset(struct rxe_queue *q)
+{
+	/* The queue consists of a management header followed by the
+	 * memory of the actual queue (see struct rxe_queue_buf in
+	 * rxe_queue.h). Reset only the queue itself, not the header.
+	 */
+	memset(q->buf->data, 0, q->buf_size - sizeof(struct rxe_queue_buf));
+}
+
 struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe,
 				 int *num_elem,
 				 unsigned int elem_size)
diff --git a/drivers/infiniband/sw/rxe/rxe_queue.h b/drivers/infiniband/sw/rxe/rxe_queue.h
index 239fd60..8c8641c 100644
--- a/drivers/infiniband/sw/rxe/rxe_queue.h
+++ b/drivers/infiniband/sw/rxe/rxe_queue.h
@@ -84,6 +84,8 @@ int do_mmap_info(struct rxe_dev *rxe,
 		 size_t buf_size,
 		 struct rxe_mmap_info **ip_p);
 
+void rxe_queue_reset(struct rxe_queue *q);
+
 struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe,
 				 int *num_elem,
 				 unsigned int elem_size);
-- 
2.7.4

* [PATCH rdma-rc 5/5] IB/rxe: Update qp state for user query
From: Leon Romanovsky @ 2016-11-16  8:39 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen
In-Reply-To: <1479285558-19627-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

From: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

rxe_qp_error() moves the QP to the error state and makes sure the
QP is drained. However, it did not update the QP state reported to
the user's query.

This patch fixes that.
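The bug is that the query reads a different field from the ones rxe_qp_error() updates. A minimal model with hypothetical names: the internal requester and responder states and the attribute reported to the user are separate fields, and all three must move to the error state together.

```c
#include <assert.h>

/* Hypothetical, reduced QP: internal requester/responder states plus
 * the attribute that a user's query reports back. */
enum demo_state { DEMO_STATE_RTS, DEMO_STATE_ERROR };

struct demo_qp {
	enum demo_state req_state;
	enum demo_state resp_state;
	enum demo_state attr_qp_state;	/* what the user's query sees */
};

static void demo_qp_error(struct demo_qp *qp)
{
	qp->req_state = DEMO_STATE_ERROR;
	qp->resp_state = DEMO_STATE_ERROR;
	qp->attr_qp_state = DEMO_STATE_ERROR;	/* the update the patch adds */
}

/* A query only looks at the user-visible attribute. */
static enum demo_state demo_query_state(const struct demo_qp *qp)
{
	return qp->attr_qp_state;
}
```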

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 drivers/infiniband/sw/rxe/rxe_qp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 95aaaa2..c3e60e4 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -574,6 +574,7 @@ void rxe_qp_error(struct rxe_qp *qp)
 {
 	qp->req.state = QP_STATE_ERROR;
 	qp->resp.state = QP_STATE_ERROR;
+	qp->attr.qp_state = IB_QPS_ERR;
 
 	/* drain work and packet queues */
 	rxe_run_task(&qp->resp.task, 1);
-- 
2.7.4

* [PATCH rdma-core] rxe: Remove perl::switch module dependency
From: Yonatan Cohen @ 2016-11-16  8:46 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA, leon-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yonatan Cohen

Remove the perl::switch dependency from RXE, since that module is
not installed by default.

Signed-off-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 providers/rxe/rxe_cfg | 35 +++++++++++++----------------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/providers/rxe/rxe_cfg b/providers/rxe/rxe_cfg
index 6c414fb..c2dbd0e 100755
--- a/providers/rxe/rxe_cfg
+++ b/providers/rxe/rxe_cfg
@@ -37,7 +37,6 @@ use strict;
 
 use File::Basename;
 use Getopt::Long;
-use Switch;
 
 my $help = 0;
 my $no_persist = 0;
@@ -559,26 +558,21 @@ sub do_debug {
     my $debugfile = "$parms/debug";
     chomp($arg2);
 
-    #print "debug $arg2\n";
-    #system("echo 'debug $arg2' > $proc");
-
     if (!(-e "$debugfile")) {
 	print "Error: debug is compiled out of this rxe driver\n";
 	return;
     }
 
-    switch ($arg2) {
-	case "on"   { 	system("echo '31' > $debugfile"); }
-	case "off"  { 	system("echo '0'  > $debugfile"); }
-	case "0"    { 	system("echo '0'  > $debugfile"); }
-	case ""     { }
+    if    ($arg2 eq "on")  { system("echo '31' > $debugfile"); }
+    elsif ($arg2 eq "off") { system("echo '0'  > $debugfile"); }
+    elsif ($arg2 eq "0")   { system("echo '0'  > $debugfile"); }
+    elsif ($arg2 eq "")    { }
 	elsif ($arg2 ge "0" && $arg2 le "31") {
 	    system("echo '$arg2' > $debugfile");
 	}
 	else {
 	    print "unrecognized debug cmd ($arg2)\n";
 	}
-    }
 
     my $current = read_file($debugfile);
     chomp($current);
@@ -645,11 +639,10 @@ sub main {
     }
 
     # stuff that does not require modules to be loaded
-    switch($arg1) {
-        case "help"		{ usage(); exit; }
-        case "start"		{ do_start(); do_status(); exit; }
-        case "persistent"	{ system("cat $persistence_file"); exit; }
-    }
+    if    ($arg1 eq "help")       { usage(); exit; }
+    elsif ($arg1 eq "start")      { do_start(); do_status(); exit; }
+    elsif ($arg1 eq "persistent") { system("cat $persistence_file"); exit; }
+
 
     # can't do much else, bail if modules aren't loaded
     if (check_module_status()) {
@@ -668,13 +661,11 @@ sub main {
     get_dev_info();
 
     # Stuff that requires the rdma_rxe module to be loaded
-    switch($arg1) {
-        case "stop"	{ do_stop(); 			exit; }
-        case "debug"	{ do_debug($arg2);	 	exit; }
-        case "add"	{ rxe_add($arg2); 		exit; }
-        case "remove"	{ rxe_remove($arg2); 		exit; }
-        case "help"	{ usage();			exit; }
-    }
+    if    ($arg1 eq "stop")   { do_stop(); 	   exit; }
+    elsif ($arg1 eq "debug")  { do_debug($arg2);   exit; }
+    elsif ($arg1 eq "add")    { rxe_add($arg2);    exit; }
+    elsif ($arg1 eq "remove") { rxe_remove($arg2); exit; }
+    elsif ($arg1 eq "help")   { usage();	   exit; }
 }
 
 main();
-- 
1.8.3.1
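For comparison, the if/elsif chain that replaces Switch in do_debug() maps directly onto a C string dispatch. The sketch below is an illustrative port, not part of rxe_cfg: parse_debug_arg() and its negative return codes are made up, and it checks the 0..31 range numerically with strtol(), whereas the Perl code compares with the string operators ge and le.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical return codes for this sketch. */
#define DEBUG_NO_CHANGE (-1)	/* empty argument: only report the level */
#define DEBUG_BAD_ARG   (-2)	/* unrecognized debug command */

/* Map a debug argument to the level that would be written out. */
static int parse_debug_arg(const char *arg)
{
	char *end;
	long level;

	if (strcmp(arg, "on") == 0)
		return 31;
	if (strcmp(arg, "off") == 0 || strcmp(arg, "0") == 0)
		return 0;
	if (arg[0] == '\0')
		return DEBUG_NO_CHANGE;

	/* Numeric level: the whole string must parse and be in 0..31. */
	errno = 0;
	level = strtol(arg, &end, 10);
	if (errno != 0 || end == arg || *end != '\0' || level < 0 || level > 31)
		return DEBUG_BAD_ARG;
	return (int)level;
}
```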

* RE: Configuration of cq->cqe is lower than entries by 1
From: Amrani, Ram @ 2016-11-16  9:45 UTC
  To: Majd Dibbiny
  Cc: Leon Romanovsky,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <2DF5C492-C364-4353-8FA9-51FA5EE760F0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

>> 192         entries      = roundup_pow_of_two(entries + 1);
>> 193         cq->ibcq.cqe = entries - 1;
>> 
>> I thought something else might hide there.
>> Hi Ram,
>> 
> For CQs, we always reserve an extra CQE, and thus report one less CQE to the
> user.
> This CQE is used for resize CQ operations.
> When the CQ is resized, the HW posts a CQE to indicate that the operation was
> completed on the original CQ buffer.
> Hope now it's clear.

Thanks. I guessed this was related.
Note that the behavior may differ from what you describe. If the user requests (2^n) - 1 entries, it receives exactly the number of entries it requested, rather than the (at least one) extra entry.
For example, if the user requests 63 entries (n = 6), it receives the same number of entries:
        entries      = roundup_pow_of_two(63 + 1);	// this gives 64
        cq->ibcq.cqe = entries - 1;			// this gives 63 again
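The two quoted lines can be checked with a user-space sketch. roundup_pow_of_two_ul() is a local stand-in assumed to behave like the kernel's roundup_pow_of_two() for n >= 1, and size_cq() is a hypothetical wrapper returning both the allocated depth and the value reported in cq->ibcq.cqe.

```c
#include <assert.h>

/* Stand-in for roundup_pow_of_two(): smallest power of two >= n
 * (assumed equivalent for n >= 1). */
static unsigned long roundup_pow_of_two_ul(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* The quoted sizing: allocate entries + 1 rounded up to a power of
 * two, and report one less than that to the user. */
static void size_cq(unsigned long requested, unsigned long *allocated,
		    unsigned long *reported)
{
	*allocated = roundup_pow_of_two_ul(requested + 1);
	*reported = *allocated - 1;
}
```

Requesting 63 allocates 64 and reports 63, as in the example above, while requesting 64 allocates 128 and reports 127.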

* RE: Configuration of cq->cqe is lower than entries by 1
From: Amrani, Ram @ 2016-11-16 10:05 UTC
  To: Majd Dibbiny
  Cc: Leon Romanovsky,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <84A08B2A-85D5-467C-AE80-63134CA07767-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> Correct. We create it with 64 (the result of the roundup) and let him work with
> 63 only.

ACK

* [PATCH rdma-core] ccan: Add likely implementation
From: Leon Romanovsky @ 2016-11-16 17:40 UTC
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	yishaih-VPRAkNaXOzVWk0Htik3J/w,
	Tatyana.E.Nikolova-ral2JQCrhuEAvxtiuMwx3w,
	oulijun-hv44wF8Li93QT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Add likely/unlikely macros to the ccan directory.
This change also adjusts the providers that defined them locally
(nes, mlx4, mlx5 and hns).

The i40iw code had such definitions too, but never used them, so
this patch removes that dead code.
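The non-debug macros boil down to __builtin_expect() on a normalized condition. A minimal sketch, assuming GCC or Clang for the builtin (the header itself keys off HAVE_BUILTIN_EXPECT from config.h); count_nonzero() and checked_inc() are hypothetical examples in the spirit of the header's own doc comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the non-debug definitions; __GNUC__ is used here instead
 * of HAVE_BUILTIN_EXPECT (an assumption for this sketch). */
#if defined(__GNUC__)
#define likely(cond)   __builtin_expect(!!(cond), 1)
#define unlikely(cond) __builtin_expect(!!(cond), 0)
#else
#define likely(cond)   (!!(cond))
#define unlikely(cond) (!!(cond))
#endif

/* !!(cond) folds any non-zero value to exactly 1, so it can be
 * compared against the expected constant 1 or 0. */
static int count_nonzero(const int *v, int n)
{
	int count = 0;

	for (int i = 0; i < n; i++)
		if (likely(v[i]))
			count++;
	return count;
}

/* Returns false if the increment wrapped around to zero. */
static bool checked_inc(unsigned int *val)
{
	(*val)++;
	if (unlikely(*val == 0))
		return false;
	return true;
}
```

The !! matters because __builtin_expect() compares the expression with the expected constant: without it, likely(flags & MASK) would pass the masked value, which need not equal 1.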

Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 ccan/CMakeLists.txt           |   2 +
 ccan/likely.c                 | 136 ++++++++++++++++++++++++++++++++++++++++++
 ccan/likely.h                 | 111 ++++++++++++++++++++++++++++++++++
 providers/hns/hns_roce_u.h    |   5 +-
 providers/i40iw/i40iw_umain.h |   7 ---
 providers/mlx4/mlx4.h         |   8 ---
 providers/mlx4/qp.c           |   1 +
 providers/mlx5/mlx5.h         |   6 +-
 providers/nes/nes_umain.h     |   8 +--
 9 files changed, 253 insertions(+), 31 deletions(-)
 create mode 100644 ccan/likely.c
 create mode 100644 ccan/likely.h

diff --git a/ccan/CMakeLists.txt b/ccan/CMakeLists.txt
index b5de515..153426f 100644
--- a/ccan/CMakeLists.txt
+++ b/ccan/CMakeLists.txt
@@ -6,11 +6,13 @@ publish_internal_headers(ccan
   minmax.h
   str.h
   str_debug.h
+  likely.h
   )
 
 set(C_FILES
   list.c
   str.c
+  likely.c
   )
 add_library(ccan STATIC ${C_FILES})
 add_library(ccan_pic STATIC ${C_FILES})
diff --git a/ccan/likely.c b/ccan/likely.c
new file mode 100644
index 0000000..83e8d6f
--- /dev/null
+++ b/ccan/likely.c
@@ -0,0 +1,136 @@
+/* CC0 (Public domain) - see LICENSE file for details. */
+#ifdef CCAN_LIKELY_DEBUG
+#include <ccan/likely/likely.h>
+#include <ccan/hash/hash.h>
+#include <ccan/htable/htable_type.h>
+#include <stdlib.h>
+#include <stdio.h>
+struct trace {
+	const char *condstr;
+	const char *file;
+	unsigned int line;
+	bool expect;
+	unsigned long count, right;
+};
+
+static size_t hash_trace(const struct trace *trace)
+{
+	return hash(trace->condstr, strlen(trace->condstr),
+		    hash(trace->file, strlen(trace->file),
+			 trace->line + trace->expect));
+}
+
+static bool trace_eq(const struct trace *t1, const struct trace *t2)
+{
+	return t1->condstr == t2->condstr
+		&& t1->file == t2->file
+		&& t1->line == t2->line
+		&& t1->expect == t2->expect;
+}
+
+/* struct thash */
+HTABLE_DEFINE_TYPE(struct trace, (const struct trace *), hash_trace, trace_eq,
+		   thash);
+
+static struct thash htable
+= { HTABLE_INITIALIZER(htable.raw, thash_hash, NULL) };
+
+static void init_trace(struct trace *trace,
+		       const char *condstr, const char *file, unsigned int line,
+		       bool expect)
+{
+	trace->condstr = condstr;
+	trace->file = file;
+	trace->line = line;
+	trace->expect = expect;
+	trace->count = trace->right = 0;
+}
+
+static struct trace *add_trace(const struct trace *t)
+{
+	struct trace *trace = malloc(sizeof(*trace));
+	*trace = *t;
+	thash_add(&htable, trace);
+	return trace;
+}
+
+long _likely_trace(bool cond, bool expect,
+		   const char *condstr,
+		   const char *file, unsigned int line)
+{
+	struct trace *p, trace;
+
+	init_trace(&trace, condstr, file, line, expect);
+	p = thash_get(&htable, &trace);
+	if (!p)
+		p = add_trace(&trace);
+
+	p->count++;
+	if (cond == expect)
+		p->right++;
+
+	return cond;
+}
+
+static double right_ratio(const struct trace *t)
+{
+	return (double)t->right / t->count;
+}
+
+char *likely_stats(unsigned int min_hits, unsigned int percent)
+{
+	struct trace *worst;
+	double worst_ratio;
+	struct thash_iter i;
+	char *ret;
+	struct trace *t;
+
+	worst = NULL;
+	worst_ratio = 2;
+
+	/* This is O(n), but it's not likely called that often. */
+	for (t = thash_first(&htable, &i); t; t = thash_next(&htable, &i)) {
+		if (t->count >= min_hits) {
+			if (right_ratio(t) < worst_ratio) {
+				worst = t;
+				worst_ratio = right_ratio(t);
+			}
+		}
+	}
+
+	if (worst_ratio * 100 > percent)
+		return NULL;
+
+	ret = malloc(strlen(worst->condstr) +
+		     strlen(worst->file) +
+		     sizeof(long int) * 8 +
+		     sizeof("%s:%u:%slikely(%s) correct %u%% (%lu/%lu)"));
+	sprintf(ret, "%s:%u:%slikely(%s) correct %u%% (%lu/%lu)",
+		worst->file, worst->line,
+		worst->expect ? "" : "un", worst->condstr,
+		(unsigned)(worst_ratio * 100),
+		worst->right, worst->count);
+
+	thash_del(&htable, worst);
+	free(worst);
+
+	return ret;
+}
+
+void likely_stats_reset(void)
+{
+	struct thash_iter i;
+	struct trace *t;
+
+	/* This is a bit better than O(n^2), but we have to loop since
+	 * first/next during delete is unreliable. */
+	while ((t = thash_first(&htable, &i)) != NULL) {
+		for (; t; t = thash_next(&htable, &i)) {
+			thash_del(&htable, t);
+			free(t);
+		}
+	}
+
+	thash_clear(&htable);
+}
+#endif /*CCAN_LIKELY_DEBUG*/
diff --git a/ccan/likely.h b/ccan/likely.h
new file mode 100644
index 0000000..a8f003d
--- /dev/null
+++ b/ccan/likely.h
@@ -0,0 +1,111 @@
+/* CC0 (Public domain) - see LICENSE file for details */
+#ifndef CCAN_LIKELY_H
+#define CCAN_LIKELY_H
+#include "config.h"
+#include <stdbool.h>
+
+#ifndef CCAN_LIKELY_DEBUG
+#if HAVE_BUILTIN_EXPECT
+/**
+ * likely - indicate that a condition is likely to be true.
+ * @cond: the condition
+ *
+ * This uses a compiler extension where available to indicate a likely
+ * code path and optimize appropriately; it's also useful for readers
+ * to quickly identify exceptional paths through functions.  The
+ * threshold for "likely" is usually considered to be between 90 and
+ * 99%; marginal cases should not be marked either way.
+ *
+ * See Also:
+ *	unlikely(), likely_stats()
+ *
+ * Example:
+ *	// Returns false if we overflow.
+ *	static inline bool inc_int(unsigned int *val)
+ *	{
+ *		(*val)++;
+ *		if (likely(*val))
+ *			return true;
+ *		return false;
+ *	}
+ */
+#define likely(cond) __builtin_expect(!!(cond), 1)
+
+/**
+ * unlikely - indicate that a condition is unlikely to be true.
+ * @cond: the condition
+ *
+ * This uses a compiler extension where available to indicate an unlikely
+ * code path and optimize appropriately; see likely() above.
+ *
+ * See Also:
+ *	likely(), likely_stats(), COLD (compiler.h)
+ *
+ * Example:
+ *	// Prints a warning if we overflow.
+ *	static inline void inc_int(unsigned int *val)
+ *	{
+ *		(*val)++;
+ *		if (unlikely(*val == 0))
+ *			fprintf(stderr, "Overflow!");
+ *	}
+ */
+#define unlikely(cond) __builtin_expect(!!(cond), 0)
+#else
+#define likely(cond) (!!(cond))
+#define unlikely(cond) (!!(cond))
+#endif
+#else /* CCAN_LIKELY_DEBUG versions */
+#include <ccan/str/str.h>
+
+#define likely(cond) \
+	(_likely_trace(!!(cond), 1, stringify(cond), __FILE__, __LINE__))
+#define unlikely(cond) \
+	(_likely_trace(!!(cond), 0, stringify(cond), __FILE__, __LINE__))
+
+long _likely_trace(bool cond, bool expect,
+		   const char *condstr,
+		   const char *file, unsigned int line);
+/**
+ * likely_stats - return description of abused likely()/unlikely()
+ * @min_hits: minimum number of hits
+ * @percent: maximum percentage correct
+ *
+ * When CCAN_LIKELY_DEBUG is defined, likely() and unlikely() trace their
+ * results: this causes a significant slowdown, but allows analysis of
+ * whether the branches are labelled correctly.
+ *
+ * This function returns a malloc'ed description of the least-correct
+ * usage of likely() or unlikely().  It ignores places which have been
+ * called less than @min_hits times, and those which were predicted
+ * correctly more than @percent of the time.  It returns NULL when
+ * nothing meets those criteria.
+ *
+ * Note that this call is destructive; the returned offender is
+ * removed from the trace so that the next call to likely_stats() will
+ * return the next-worst likely()/unlikely() usage.
+ *
+ * Example:
+ *	// Print every place hit more than twice which was wrong > 5%.
+ *	static void report_stats(void)
+ *	{
+ *	#ifdef CCAN_LIKELY_DEBUG
+ *		const char *bad;
+ *
+ *		while ((bad = likely_stats(2, 95)) != NULL) {
+ *			printf("Suspicious likely: %s", bad);
+ *			free(bad);
+ *		}
+ *	#endif
+ *	}
+ */
+char *likely_stats(unsigned int min_hits, unsigned int percent);
+
+/**
+ * likely_stats_reset - free up memory of likely()/unlikely() branches.
+ *
+ * This can also plug memory leaks.
+ */
+void likely_stats_reset(void);
+#endif /* CCAN_LIKELY_DEBUG */
+#endif /* CCAN_LIKELY_H */
diff --git a/providers/hns/hns_roce_u.h b/providers/hns/hns_roce_u.h
index 4a6ed8e..1659958 100644
--- a/providers/hns/hns_roce_u.h
+++ b/providers/hns/hns_roce_u.h
@@ -39,6 +39,7 @@
 #include <infiniband/arch.h>
 #include <infiniband/verbs.h>
 #include <ccan/container_of.h>
+#include <ccan/likely.h>
 
 #define HNS_ROCE_CQE_ENTRY_SIZE		0x20
 
@@ -51,10 +52,6 @@
 
 #define PFX				"hns: "
 
-#ifndef likely
-#define likely(x)     __builtin_expect(!!(x), 1)
-#endif
-
 #define roce_get_field(origin, mask, shift) \
 	(((origin) & (mask)) >> (shift))
 
diff --git a/providers/i40iw/i40iw_umain.h b/providers/i40iw/i40iw_umain.h
index 719aefc..6a23504 100644
--- a/providers/i40iw/i40iw_umain.h
+++ b/providers/i40iw/i40iw_umain.h
@@ -47,13 +47,6 @@
 #include "i40iw_status.h"
 #include "i40iw_user.h"
 
-#ifndef likely
-#define likely(x)   __builtin_expect((x), 1)
-#endif
-#ifndef unlikely
-#define unlikely(x) __builtin_expect((x), 0)
-#endif
-
 #define PFX "libi40iw-"
 
 #define  I40IW_BASE_PUSH_PAGE	1
diff --git a/providers/mlx4/mlx4.h b/providers/mlx4/mlx4.h
index b851e95..6467d5a 100644
--- a/providers/mlx4/mlx4.h
+++ b/providers/mlx4/mlx4.h
@@ -51,14 +51,6 @@ enum {
 	MLX4_STAT_RATE_OFFSET		= 5
 };
 
-#ifndef likely
-#ifdef __GNUC__
-#define likely(x)       __builtin_expect(!!(x),1)
-#else
-#define likely(x)      (x)
-#endif
-#endif
-
 enum {
 	MLX4_QP_TABLE_BITS		= 8,
 	MLX4_QP_TABLE_SIZE		= 1 << MLX4_QP_TABLE_BITS,
diff --git a/providers/mlx4/qp.c b/providers/mlx4/qp.c
index 268fb7d..af08874 100644
--- a/providers/mlx4/qp.c
+++ b/providers/mlx4/qp.c
@@ -40,6 +40,7 @@
 #include <string.h>
 #include <errno.h>
 #include <util/compiler.h>
+#include <ccan/likely.h>
 
 #include "mlx4.h"
 #include "doorbell.h"
diff --git a/providers/mlx5/mlx5.h b/providers/mlx5/mlx5.h
index cb65429..d5704d4 100644
--- a/providers/mlx5/mlx5.h
+++ b/providers/mlx5/mlx5.h
@@ -42,11 +42,7 @@
 #include <ccan/list.h>
 #include "bitmap.h"
 #include <ccan/minmax.h>
-
-#ifdef __GNUC__
-#define likely(x)	__builtin_expect((x), 1)
-#define unlikely(x)	__builtin_expect((x), 0)
-#endif
+#include <ccan/likely.h>
 
 #include <valgrind/memcheck.h>
 
diff --git a/providers/nes/nes_umain.h b/providers/nes/nes_umain.h
index 093956a..5f357a2 100644
--- a/providers/nes/nes_umain.h
+++ b/providers/nes/nes_umain.h
@@ -40,13 +40,7 @@
 
 #include <infiniband/driver.h>
 #include <infiniband/arch.h>
-
-#ifndef likely
-#define likely(x)   __builtin_expect((x),1)
-#endif
-#ifndef unlikely
-#define unlikely(x) __builtin_expect((x),0)
-#endif
+#include <ccan/likely.h>
 
 #define PFX	"libnes: "
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
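The selection rule that likely_stats() implements in the patch above can be sketched on its own. It skips call sites hit fewer than @min_hits times, picks the site with the lowest fraction of correct predictions, and returns nothing if even that site was right more than @percent of the time. This is a minimal sketch, not ccan code: struct branch_trace and worst_branch() are hypothetical names, and a plain array stands in for the ccan hash table.

```c
#include <stddef.h>

/* Hypothetical per-call-site record; ccan keeps similar data in a hash table. */
struct branch_trace {
	const char *condstr;   /* stringified condition */
	unsigned long right;   /* times the hint matched the actual outcome */
	unsigned long count;   /* times the branch was evaluated */
};

static double right_ratio(const struct branch_trace *t)
{
	return (double)t->right / (double)t->count;
}

/*
 * Return the least-correct site hit at least min_hits times, or NULL if
 * none qualifies or if even the worst one was right more than percent%
 * of the time.
 */
static const struct branch_trace *
worst_branch(const struct branch_trace *traces, size_t n,
	     unsigned int min_hits, unsigned int percent)
{
	const struct branch_trace *worst = NULL;
	double worst_ratio = 2.0;  /* larger than any real ratio */

	for (size_t i = 0; i < n; i++) {
		const struct branch_trace *t = &traces[i];

		/* Sites with count == 0 are skipped to avoid dividing by zero. */
		if (t->count == 0 || t->count < min_hits)
			continue;
		if (right_ratio(t) < worst_ratio) {
			worst = t;
			worst_ratio = right_ratio(t);
		}
	}

	if (worst == NULL || worst_ratio * 100 > percent)
		return NULL;
	return worst;
}
```

Unlike likely_stats(), this sketch neither formats a malloc'ed string nor removes the entry it returns, so repeated calls return the same site.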

* mlx4 BUG_ON in probe path
From: Bjorn Helgaas @ 2016-11-16 18:25 UTC
  To: Yishai Hadas; +Cc: netdev, linux-rdma, Johannes Thumshirn, linux-kernel

Hi Yishai,

Johannes has been working on an mlx4 initialization problem on an
IBM x3850 X6.  The underlying problem is a PCI core issue -- we're
setting RCB in the Mellanox device, which means it thinks it can
generate 128-byte Completions, even though the Root Port above it
can't handle them.  That issue is
https://bugzilla.kernel.org/show_bug.cgi?id=187781

The machine crashed when this happened, apparently not because of any
error reported via AER, but because mlx4 contains a BUG_ON, probably
the one in mlx4_enter_error_state().

That one happens if pci_channel_offline() returns false.  Is this
telling us about a problem in PCI error handling, or is it just a case
where mlx4 isn't as smart as it could be?

Ideally, if mlx4 can't initialize the device, it should just return an
error from the probe function instead of crashing the whole machine.

Here's the crash (the entire dmesg log is in the bugzilla above):

  mlx4_core 0000:41:00.0: command 0xfff timed out (go bit not cleared)
  mlx4_core 0000:41:00.0: device is going to be reset
  mlx4_core 0000:41:00.0: Failed to obtain HW semaphore, aborting
  mlx4_core 0000:41:00.0: Fail to reset HCA
  ------------[ cut here ]------------
  kernel BUG at drivers/net/ethernet/mellanox/mlx4/catas.c:193!
  invalid opcode: 0000 [#1] SMP 
  Modules linked in: sr_mod(E) cdrom(E) uas(E) usb_storage(E) mlx4_core(E+) cdc_ether(E) usbnet(E) mii(E) joydev(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) drbg(E) ansi_cprng(E) aesni_intel(E) iTCO_wdt(E) aes_x86_64(E) igb(E) ipmi_devintf(E) iTCO_vendor_support(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) ptp(E) cryptd(E) pps_core(E) sb_edac(E) pcspkr(E) lpc_ich(E) ipmi_ssif(E) ioatdma(E) edac_core(E) shpchp(E) mfd_core(E) dca(E) wmi(E) ipmi_si(E) ipmi_msghandler(E) fjes(E) button(E) processor(E) acpi_pad(E) hid_generic(E) usbhid(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) xhci_pci(E) sysfillrect(E) ehci_pci(E) sysimgblt(E)
   fb_sys_fops(E) xhci_hcd(E) ehci_hcd(E) ttm(E) usbcore(E) drm(E) usb_common(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
  Supported: Yes
  CPU: 27 PID: 2867 Comm: modprobe Tainted: G            E      4.4.21-default #6
  Hardware name: IBM x3850 X6 -[3837Z7P]-/00FN772, BIOS -[A8E120CUS-1.30]- 08/22/2016
  task: ffff881fb2ff9280 ti: ffff881fbd3c4000 task.ti: ffff881fbd3c4000
  RIP: 0010:[<ffffffffa0446740>]  [<ffffffffa0446740>] mlx4_enter_error_state+0x240/0x320 [mlx4_core]
  RSP: 0018:ffff881fbd3c79a0  EFLAGS: 00010246
  RAX: ffff8820b2486e00 RBX: ffff883fbe240000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff881fbf63b000
  RBP: ffff8820b2486e60 R08: 0000000000000029 R09: ffff88803feda50f
  R10: 00000000000d1b50 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: ffff883fbe240460 R15: 00000000fffffffb
  FS:  00007f7c55203700(0000) GS:ffff883fbf900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1813c88000 CR3: 0000003fbe637000 CR4: 00000000001406e0
  Stack:
   15b30000c0000100 ffff883fbe240000 0000000000000fff 0000000000000000
   ffffffffa0447d54 000000000000ffff ffffffff00000000 000000000000ea60
   0000000000000000 000000000000ea60 ffffc90031dba680 ffff883fbe240000
  Call Trace:
   [<ffffffffa0447d54>] __mlx4_cmd+0x594/0x8a0 [mlx4_core]
   [<ffffffffa045191b>] mlx4_map_cmd+0x2ab/0x3c0 [mlx4_core]
   [<ffffffffa045a855>] mlx4_load_one+0x515/0x1220 [mlx4_core]
   [<ffffffffa045bb69>] mlx4_init_one+0x4e9/0x6a0 [mlx4_core]
   [<ffffffff8135626f>] local_pci_probe+0x3f/0xa0
   [<ffffffff81357694>] pci_device_probe+0xd4/0x120
   [<ffffffff8144d0b7>] driver_probe_device+0x1f7/0x420
   [<ffffffff8144d35b>] __driver_attach+0x7b/0x80
   [<ffffffff8144afc8>] bus_for_each_dev+0x58/0x90
   [<ffffffff8144c519>] bus_add_driver+0x1c9/0x280
   [<ffffffff8144dccb>] driver_register+0x5b/0xd0
   [<ffffffffa03f911a>] mlx4_init+0x11a/0x1000 [mlx4_core]
   [<ffffffff81002138>] do_one_initcall+0xc8/0x1f0
   [<ffffffff81182a08>] do_init_module+0x5a/0x1d7
   [<ffffffff81103726>] load_module+0x1366/0x1c50
   [<ffffffff811041c0>] SYSC_finit_module+0x70/0xa0
   [<ffffffff815e14ae>] entry_SYSCALL_64_fastpath+0x12/0x71
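Bjorn's suggestion, returning an error from probe instead of crashing the machine, can be sketched in user space. Everything here is hypothetical: struct fake_pci_dev, channel_offline(), handle_reset_result() and fake_probe() stand in for the PCI core and mlx4 code and do not reproduce them. The only point is that the "reset failed while the channel is still online" case becomes an error code the caller can propagate instead of a BUG_ON().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the PCI device state seen by the driver. */
struct fake_pci_dev {
	bool channel_offline;  /* what pci_channel_offline() would report */
};

static bool channel_offline(const struct fake_pci_dev *pdev)
{
	return pdev->channel_offline;
}

/*
 * Decide what to do after an HCA reset attempt.  reset_err is 0 on
 * success or a negative errno.  Instead of BUG_ON() when the reset failed
 * but the channel is still online, log the condition and return the error.
 */
static int handle_reset_result(const struct fake_pci_dev *pdev, int reset_err)
{
	if (reset_err == 0)
		return 0;
	if (channel_offline(pdev))
		fprintf(stderr, "reset failed, channel offline: leave it to PCI error recovery\n");
	else
		fprintf(stderr, "reset failed with channel online: failing probe\n");
	return reset_err;
}

/* Hypothetical probe step: any failure is propagated, never fatal. */
static int fake_probe(const struct fake_pci_dev *pdev, int reset_err)
{
	int err = handle_reset_result(pdev, reset_err);

	if (err)
		return err;  /* in the kernel, the driver core would see a failed probe */
	return 0;
}
```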

* Re: NFSD generic R/W API (sendto path) performance results
From: Chuck Lever @ 2016-11-16 19:45 UTC
  To: Steve Wise; +Cc: Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing
In-Reply-To: <024601d23f7f$cef62500$6ce26f00$@opengridcomputing.com>


> On Nov 15, 2016, at 3:35 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> 
>> 
>> I've built a prototype conversion of the in-kernel NFS server's sendto
>> path to use the new generic R/W API. This path handles NFS Replies, so
>> it is responsible for building and sending RDMA Writes carrying NFS
>> READ payloads, and for transmitting all NFS Replies.
>> 
>> I've published the prototype (against my for-4.10 server series) here:
>> 
>> 
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfsd-rdma-rw-api
>> 
>> It's the very last patch in the series.
>> 
>> 
>> "iozone -i0 -i1 -s2g -r1m -I" with NFSv3, sec=sys, CX-3 on both sides,
>> FDR fabric, share is a tmpfs. This test writes and reads a 2GB file with
>> 1MB direct writes and reads.
>> 
>> The client forms NFS requests with a single 1MB RDMA segment to catch
>> the NFS READ payload. Before the conversion, the server posts a series
>> of single Write WRs with 30 pages each, for each RDMA segment written
>> to the client. After the conversion, the server posts a single chain
>> of 30-page Write WRs for each RDMA segment written to the client.
>> 
>> Before the API conversion: rdma_stat_post_send = 45097
>> 
>> After the API conversion: rdma_stat_post_send = 16411
>> 
>> That's what I expected to see. This shows the number of ib_post_send
>> calls is significantly lower after the conversion.
>> 
>> 
>> Unfortunately the throughput and latency numbers are worse (ignore
>> the write/rewrite numbers for now). Output is in kBytes/sec.
>> 
>> Before conversion, one iozone run:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   772835   931267  1895922  1927848
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.006345     RTT: 0.321132     total execute time: 0.332113
>> 
>> After conversion:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   703850   913824  1561682  1441448
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.010737     RTT: 0.469497     total execute time: 0.488043
>> 
>> That's roughly 148us worse RTT per READ in this run. The gap between before and
>> after was roughly the same for all runs.
>> 
>> 
>> To partially explain this, I captured traffic on the server using ibdump
>> during a similar iozone test. This removes fabric and client HCA latencies
>> from the picture.
>> 
>> This is a QD=1 test, so it's easy to analyze individual NFS READ operations
>> in each capture. I computed three latency numbers per READ transaction
>> based on the timestamps in the capture file, which should be accurate to
>> 1 microsecond:
>> 
>> 1. Call took: the time between when the server i/f sees the incoming RDMA
>> Send carrying the NFS READ Call, and when the server i/f sees the outgoing
>> RDMA Send carrying the NFS READ Reply.
>> 
>> 2. Call-to-first-Write: the time between when the server i/f sees the
>> incoming RDMA Send carrying the NFS READ Call, and when the server i/f
>> sees the first outgoing RDMA Write request. Roughly how long it takes
>> the server to prepare and post the RDMA Writes.
>> 
>> 3. First-to-last-Write: the time between when the server i/f sees the
>> first outgoing RDMA Write request, and when the server i/f sees the
>> last outgoing RDMA Write request. Roughly how long it takes the HCA
>> to transmit the RDMA Writes.
>> 
>> 
>> Averages over 5 NFS READ calls chosen at random, before conversion:
>> Call took 414us. Call-to-first-Write 85us. First-to-last-Write 327us
>> 
>> Averages over 5 NFS READ calls chosen at random, after conversion:
>> Call took 521us. Call-to-first-Write 160us. First-to-last-Write 360us
>> 
>> The gap between before and after results was 100% consistent with
>> the average results across the individual NFS READ operations.
>> 
>> 
> 
> Good work here! 
> 
>> There are two stories here:
>> 
>> 1. Call-to-first-Write takes longer. My first guess is that the server
>> takes longer to build and DMA map a long Write WR chain than it does
>> to build, map, and post a single Write WR. With single-WR posts, the HCA
>> can start transmitting Writes sooner, while the server continues working
>> on posting Write WRs in parallel with the on-the-wire activity.
>> 
> 
> So perhaps the RDMA R/W API can have a threshold where it will post the list of
> WRs built so far once it exceeds the threshold, and continue chunking?  That
> threshold, by the way, is probably device-specific.
> 
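Steve's threshold idea (post what has been built so far once a chain reaches some length, then keep chunking) could look roughly like the following. This is a hypothetical user-space sketch: struct wr, post_chunked() and the counting hook are illustrative and are not the kernel's ib_send_wr, ib_post_send(), or the R/W API. As Steve notes, a real threshold would presumably be device-specific.

```c
#include <stddef.h>

/* Hypothetical work request node, linked like a send-WR chain. */
struct wr {
	struct wr *next;
	unsigned int num_sge;
};

/* Hypothetical "post this chain" hook; returns 0 or a negative errno. */
typedef int (*post_fn)(struct wr *head, void *ctx);

/*
 * Post the chain starting at head in sub-chains of at most max_chain WRs
 * (values below 1 behave like 1).  Each sub-chain is terminated
 * temporarily, posted, then re-linked so the caller's chain stays intact.
 */
static int post_chunked(struct wr *head, size_t max_chain, post_fn post, void *ctx)
{
	while (head != NULL) {
		struct wr *tail = head;
		size_t n = 1;

		while (tail->next != NULL && n < max_chain) {
			tail = tail->next;
			n++;
		}

		struct wr *rest = tail->next;
		tail->next = NULL;          /* cut the sub-chain */
		int err = post(head, ctx);
		tail->next = rest;          /* restore the original link */
		if (err)
			return err;
		head = rest;
	}
	return 0;
}

/* Example hook that only counts post calls and posted WRs. */
struct post_counter {
	size_t posts;
	size_t wrs;
};

static int count_posts(struct wr *head, void *ctx)
{
	struct post_counter *pc = ctx;

	pc->posts++;
	for (; head != NULL; head = head->next)
		pc->wrs++;
	return 0;
}
```

With max_chain = 1 this degenerates to one post per WR (the before-conversion behaviour); with a very large max_chain it becomes a single chained post.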
>> 2. First-to-last-Write takes longer. I don't have any explanation
>> for the HCA taking 10% longer to transmit the full 1MB payload.
>> 
> 
> Perhaps the single-WR posts are hitting the device's fast path and lowering latency
> vs. a long chain post that must be DMAed by the device?  I'm not sure exactly how
> the MLX devices work, but they do have a fast path that uses the CPU's
> write-combining logic to send a WR over the bus as a single PCIe transaction.
> But your WRs are probably large since each carries 30 pages in its SGE list.  I'm not
> sure what the threshold is for this fast-path logic on mlx.  For cxgb, it's 64B,
> so the WR would have to fit in 64B to take advantage.

Out of curiosity, I hacked up my NFS client to limit the size of RDMA
segments to 30 pages (the server HCA's max_sge).

A 1MB NFS READ now takes 9 segments. That forces the after-conversion
server to build single-Write chains and use 9 post_send calls to
transmit the READ payload, just like the before-conversion server.

Performance of before- and after-conversion servers is now equivalent.

              kB  reclen    write  rewrite    read    reread
         2097152    1024  1061237  1141614  1961410  2000223

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.006345     RTT: 0.314300     total execute time: 0.325037

At 60-page segments (2 Write WRs per chain), I see about the same
throughput, and RTT is a touch higher.

At 61-page segments (3 Write WRs per chain), throughput drops
significantly:

              kB  reclen    write  rewrite    read    reread
         2097152    1024   932665   976784  1627842  1627169

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140	avg bytes received per op: 1048704
    backlog wait: 0.009761 	RTT: 0.383358 	total execute time: 0.398731
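The segment and WR counts in this experiment follow from ceiling division. Assuming 4 KiB pages (so a 1 MB READ payload is 256 pages) and one page per SGE with max_sge = 30, a short sketch reproduces them. The helper names are illustrative.

```c
/* Ceiling division for positive values: smallest q with q * d >= n. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/* Write WRs needed for one RDMA segment, assuming one page per SGE. */
static unsigned int wrs_per_segment(unsigned int seg_pages, unsigned int max_sge)
{
	return div_round_up(seg_pages, max_sge);
}

/* Segments the client needs to cover a payload of payload_pages pages. */
static unsigned int segments_per_payload(unsigned int payload_pages,
					 unsigned int seg_pages)
{
	return div_round_up(payload_pages, seg_pages);
}
```

The 256-page case is the original single-segment configuration: one chain of 9 Write WRs after the conversion versus 9 separate posts before it.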

A couple of random samples of an ibdump capture show that most of the
latency increase is in the Call-to-first-Write gap (1. above).


--
Chuck Lever



* Re: [PATCH rdma-core] ccan: Add likely implementation
From: Jason Gunthorpe @ 2016-11-16 20:13 UTC
  To: Leon Romanovsky
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	Tatyana.E.Nikolova-ral2JQCrhuEAvxtiuMwx3w,
	oulijun-hv44wF8Li93QT0dZR+AlfA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479318011-26878-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Wed, Nov 16, 2016 at 07:40:11PM +0200, Leon Romanovsky wrote:

> index b5de515..153426f 100644
> +++ b/ccan/CMakeLists.txt
> @@ -6,11 +6,13 @@ publish_internal_headers(ccan
>    minmax.h
>    str.h
>    str_debug.h
> +  likely.h
>    )
>  
>  set(C_FILES
>    list.c
>    str.c
> +  likely.c
>    )

Keep these lists sorted please

> +++ b/ccan/likely.c
> @@ -0,0 +1,136 @@
> +/* CC0 (Public domain) - see LICENSE file for details. */
> +#ifdef CCAN_LIKELY_DEBUG
> +#include <ccan/likely/likely.h>
> +#include <ccan/hash/hash.h>
> +#include <ccan/htable/htable_type.h>

Hmm, this isn't going to compile if the debug is set, maybe drop the
.c - but this seems really interesting to see if likely is being used
sensibly....

> +#ifndef CCAN_LIKELY_DEBUG
> +#if HAVE_BUILTIN_EXPECT

You need to add '#define HAVE_BUILTIN_EXPECT 1' to
buildlib/config.h.in - or this doesn't work at all.

Jason
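Jason's point can be seen in isolation. In the C preprocessor, an undefined identifier in #if evaluates to 0, so without HAVE_BUILTIN_EXPECT in config.h the ccan header silently falls back to the plain (!!(cond)) definitions and the branch hints disappear. In the sketch below, an inline #define is an assumption standing in for the generated config.h. The rest mirrors the two likely.h branches and needs GCC or Clang for __builtin_expect.

```c
#include <stdbool.h>

/*
 * Assumption: stands in for the line Jason suggests adding to
 * buildlib/config.h.in.  Without it, the #else branch is used, because an
 * undefined macro in #if evaluates to 0.
 */
#define HAVE_BUILTIN_EXPECT 1

#if HAVE_BUILTIN_EXPECT
#define likely(cond)   __builtin_expect(!!(cond), 1)
#define unlikely(cond) __builtin_expect(!!(cond), 0)
#else
#define likely(cond)   (!!(cond))
#define unlikely(cond) (!!(cond))
#endif

/* Same example as in the likely.h documentation: returns false on overflow. */
static inline bool inc_int(unsigned int *val)
{
	(*val)++;
	if (likely(*val))
		return true;
	return false;
}
```

Either branch yields the same results; the hint only influences how the compiler lays out the code.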

* rdma-core release process questions
From: Nikolova, Tatyana E @ 2016-11-17  4:56 UTC
  To: Jason Gunthorpe, Doug Ledford,
	leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi,

We are submitting patches to the kernel space driver i40iw and to the user space plugin libi40iw, which is currently part of rdma-core. Some of the changes need to be coordinated so that they appear in both kernel space and user space in corresponding releases. We have some questions regarding the process for submitting patches to rdma-core that have dependencies on kernel patches.

1) Can user space patches target a for-next rdma-core release, if the corresponding kernel patches are queued for the next kernel? 

2) How are ABI changes handled in rdma-core?

3) Could you explain the release process for rdma-core? 

4) Is each rdma-core release going to correspond to a specific kernel version?

Thank you,
Tatyana

* Re: rdma-core release process questions
From: Jason Gunthorpe @ 2016-11-17  5:14 UTC
  To: Nikolova, Tatyana E
  Cc: Doug Ledford, leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <13AA599688F47243B14FCFCCC2C803BB10AB79D5-96pTJSsuoYQ64kNsxIetb7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Thu, Nov 17, 2016 at 04:56:45AM +0000, Nikolova, Tatyana E wrote:

> We are submitting patches to the kernel space driver i40iw and to
> the user space plugin libi40iw, which is currently part of
> rdma-core. Some of the changes need to be coordinated so that they
> appear in both kernel space and user space in corresponding
> releases. We have some questions regarding the process for
> submitting patches to rdma-core which have dependencies on kernel
> patches.

Incompatible changes are very strongly discouraged; just don't do it.

> 4) Is each rdma-core release going to correspond to a specific kernel version?

No. All versions must work with all kernels.

Jason

* Re: [PATCH rdma-core] ccan: Add likely implementation
From: Leon Romanovsky @ 2016-11-17  8:30 UTC
  To: Jason Gunthorpe
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	Tatyana.E.Nikolova-ral2JQCrhuEAvxtiuMwx3w,
	oulijun-hv44wF8Li93QT0dZR+AlfA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161116201343.GB19593-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Wed, Nov 16, 2016 at 01:13:43PM -0700, Jason Gunthorpe wrote:
> On Wed, Nov 16, 2016 at 07:40:11PM +0200, Leon Romanovsky wrote:
>
> > index b5de515..153426f 100644
> > +++ b/ccan/CMakeLists.txt
> > @@ -6,11 +6,13 @@ publish_internal_headers(ccan
> >    minmax.h
> >    str.h
> >    str_debug.h
> > +  likely.h
> >    )
> >
> >  set(C_FILES
> >    list.c
> >    str.c
> > +  likely.c
> >    )
>
> Keep these lists sorted please

Sure

>
> > +++ b/ccan/likely.c
> > @@ -0,0 +1,136 @@
> > +/* CC0 (Public domain) - see LICENSE file for details. */
> > +#ifdef CCAN_LIKELY_DEBUG
> > +#include <ccan/likely/likely.h>
> > +#include <ccan/hash/hash.h>
> > +#include <ccan/htable/htable_type.h>
>
> Hmm, this isn't going to compile if the debug is set, maybe drop the
> .c - but this seems really interesting to see if likely is being used
> sensibly....
>
> > +#ifndef CCAN_LIKELY_DEBUG
> > +#if HAVE_BUILTIN_EXPECT
>
> You need to add '#define HAVE_BUILTIN_EXPECT 1' to
> buildlib/config.h.in - or this doesn't work at all.

This is exactly what I wanted to discuss over ML.

On one hand, I wanted to keep the ccan files as close as possible to the
official ones, so upgrading to new versions will be seamless.

On the other hand, I don't see a real use for the likely/unlikely debug
facilities.

So my approach was to add these files as-is, but not to hook up the
debug functionality.

>
> Jason

* Re: mlx4 BUG_ON in probe path
From: Yishai Hadas @ 2016-11-17 10:22 UTC
  To: Bjorn Helgaas
  Cc: Yishai Hadas, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Johannes Thumshirn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161116182527.GC26600-1RhO1Y9PlrlHTL0Zs8A6p5iNqAH0jzoTYJqu5kTmcBRl57MIdRCFDg@public.gmane.org>

On 11/16/2016 8:25 PM, Bjorn Helgaas wrote:
> Hi Yishai,
>
> Johannes has been working on an mlx4 initialization problem on an
> IBM x3850 X6.  The underlying problem is a PCI core issue -- we're
> setting RCB in the Mellanox device, which means it thinks it can
> generate 128-byte Completions, even though the Root Port above it
> can't handle them.  That issue is
> https://bugzilla.kernel.org/show_bug.cgi?id=187781
>
> The machine crashed when this happened, apparently not because of any
> error reported via AER, but because mlx4 contains a BUG_ON, probably
> the one in mlx4_enter_error_state().
>
> That one happens if pci_channel_offline() returns false.  Is this
> telling us about a problem in PCI error handling, or is it just a case
> where mlx4 isn't as smart as it could be?

Yes, at that step we expect a problem/bug in the PCI layer that should
be fixed (e.g., the device is reported online but is really offline).
Can you please evaluate and confirm that?


* Re: rdma-core release process questions
From: Leon Romanovsky @ 2016-11-17 10:28 UTC
  To: Nikolova, Tatyana E
  Cc: Jason Gunthorpe, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <13AA599688F47243B14FCFCCC2C803BB10AB79D5-96pTJSsuoYQ64kNsxIetb7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Thu, Nov 17, 2016 at 04:56:45AM +0000, Nikolova, Tatyana E wrote:
> Hi,
>
> We are submitting patches to the kernel space driver i40iw and to the user space plugin libi40iw, which is currently part of rdma-core. Some of the changes need to be coordinated so that they appear in both kernel space and user space in corresponding releases. We have some questions regarding the process for submitting patches to rdma-core that have dependencies on kernel patches.
>
> 1) Can user space patches target a for-next rdma-core release, if the corresponding kernel patches are queued for the next kernel?

I don't see any problem with that. Once the patches are accepted for -next by Doug,
they can be accepted into rdma-core too. In any case, these changes should remain
compatible with older kernels that lack the new feature.
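One common way to keep a provider working on kernels without a new feature is to gate the feature on something the kernel reports and fall back otherwise. The sketch below is hypothetical throughout: struct kernel_caps, the HYP_CAP_NEW_WQE bit, and the WQE format values are invented for illustration and are not rdma-core or i40iw definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: capabilities the provider learned from the kernel driver. */
struct kernel_caps {
	uint64_t cap_flags;
};

/* Hypothetical capability bit that only newer kernels would set. */
#define HYP_CAP_NEW_WQE (UINT64_C(1) << 0)

enum wqe_format {
	WQE_FORMAT_LEGACY = 1,  /* understood by every kernel */
	WQE_FORMAT_NEW = 2,     /* needs the matching kernel change */
};

static bool kernel_supports(const struct kernel_caps *caps, uint64_t flag)
{
	return (caps->cap_flags & flag) != 0;
}

/* Use the new format only when the kernel advertises it; otherwise fall back. */
static enum wqe_format pick_wqe_format(const struct kernel_caps *caps)
{
	return kernel_supports(caps, HYP_CAP_NEW_WQE) ? WQE_FORMAT_NEW
						      : WQE_FORMAT_LEGACY;
}
```

Because an older kernel never sets the bit, the same library build keeps working there, which is the compatibility property described in this thread.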

> 2) How are ABI changes handled in rdma-core?

Do you have something specific in mind?
Generally speaking, send it to the ML; once it passes review, we will apply it.

>
> 3) Could you explain the release process for rdma-core?

The agreed process will be roughly as follows:
a. Review/accept/decline patches within a 1-2 week time frame.
b. Once a kernel is released, stop accepting new features.
c. Wait 1-2 weeks to make sure no one complains. This is just to be on the safe side,
because the library is always ready for release and is checked constantly.
d. Create a new tag and push the release.

>
> 4) Is each rdma-core release going to correspond to a specific kernel version?

No. Releases will be aligned in time with kernel releases, but, as Jason
wrote, the library should remain backward compatible.

>
> Thank you,
> Tatyana

* [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Dan Carpenter @ 2016-11-17 11:00 UTC
  To: Moni Shoua
  Cc: Doug Ledford, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

It makes me nervous when we cast pointer parameters.  I would estimate
that around 50% of the time, it indicates a bug.  Here the cast is not
needed because u32 and unsigned int are the same thing.  Removing the
cast makes the code more robust and future-proof in case any of the
types change.

Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index 2a6e3cd..efc832a 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 			}
 		}
 
-		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
+		err = rxe_queue_resize(q, &attr->max_wr,
 				       rcv_wqe_size(srq->rq.max_sge),
 				       srq->rq.queue->ip ?
 						srq->rq.queue->ip->context :
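The reasoning in Dan's patch, that the cast is redundant because u32 and unsigned int are the same type, can be checked at compile time rather than assumed. In this user-space sketch, uint32_t stands in for the kernel's u32, and resize_queue() is a hypothetical helper that takes an unsigned int * like the one the removed cast targeted. The _Static_assert relies on C11 _Generic.

```c
#include <stdint.h>

/*
 * Compile-time check that the cast really is redundant: if uint32_t were
 * ever a different type, this fails to build instead of silently passing
 * a mismatched pointer.
 */
_Static_assert(_Generic((uint32_t)0, unsigned int: 1, default: 0),
	       "uint32_t is expected to be unsigned int");

/* Hypothetical attribute struct with a u32-style max_wr field. */
struct srq_attr_sketch {
	uint32_t max_wr;
};

/*
 * Hypothetical resize helper: rounds the requested count up to a power of
 * two and writes the result back through the pointer.
 */
static int resize_queue(unsigned int *num_elem)
{
	unsigned int n = 1;

	if (*num_elem == 0 || *num_elem > (1u << 31))
		return -1;
	while (n < *num_elem)
		n <<= 1;
	*num_elem = n;
	return 0;
}
```

With an explicit cast, the call resize_queue((unsigned int *)&attr.max_wr) would have compiled just as quietly if max_wr were later changed to a uint64_t, and the helper would then write through a pointer of the wrong width; that is the class of bug the commit message warns about.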

* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Leon Romanovsky @ 2016-11-17 11:49 UTC
  To: Dan Carpenter
  Cc: Moni Shoua, Doug Ledford, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors
In-Reply-To: <20161117110005.GB32143@mwanda>

On Thu, Nov 17, 2016 at 02:00:05PM +0300, Dan Carpenter wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future-proof in case any of the
> types change.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

>
> diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
> index 2a6e3cd..efc832a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_srq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_srq.c
> @@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
>  			}
>  		}
>
> -		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
> +		err = rxe_queue_resize(q, &attr->max_wr,
>  				       rcv_wqe_size(srq->rq.max_sge),
>  				       srq->rq.queue->ip ?
>  						srq->rq.queue->ip->context :

* [PULL REQUEST] Please pull rdma.git
From: Doug Ledford @ 2016-11-17 12:13 UTC
  To: Torvalds, Linus, linux-rdma


Hi Linus,

Due to various issues, I've been away and couldn't send a pull request
for about three weeks.  There were a number of -rc patches that built up
in the meantime (some were there already from the early -rc stages).
Obviously, there were way too many to send now, so I tried to pare the
list down to the more important patches for the -rc cycle.  Most of the
code has had plenty of soak time at the various vendors' testing setups,
so I doubt there will be another -rc pull request this cycle.  I also
tried to limit the patches to those with smaller footprints, so even
though the shortlog is longer than I would like, the actual diffstat is
mostly very small with the exception of just three files that had more
changes, and a couple of files with pure removals.  Here's the boilerplate:

The following changes since commit a909d3e636995ba7c349e2ca5dbb528154d4ac30:

  Linux 4.9-rc3 (2016-10-29 13:52:02 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
tags/for-linus

for you to fetch changes up to 5c6b2aaf9316fd0983c0c999d920306ddc65bd2d:

  iw_cxgb4: invalidate the mr when posting a read_w_inv wr (2016-11-16 20:10:36 -0500)

----------------------------------------------------------------
First round of -rc fixes

- Misc Intel hfi1 fixes
- Misc Mellanox mlx4, mlx5, and rxe fixes
- A couple cxgb4 fixes

----------------------------------------------------------------
Daniel Jurgens (2):
      IB/mlx5: Use cache line size to select CQE stride
      IB/mlx4: Check gid_index return value

Dasaratharaman Chandramouli (1):
      IB/hfi1: Fix ECN processing in prescan_rxq

Dennis Dalessandro (3):
      IB/rdmavt: rdmavt can handle non aligned page maps
      IB/hfi1: Remove leftover snoop references
      IB/hfi1: Remove incorrect IS_ERR check

Doug Ledford (1):
      Merge branches 'hfi1' and 'mlx' into k.o/for-4.9-rc

Easwar Hariharan (2):
      IB/hfi1: Clean up unused argument
      IB/hfi1: Delete unused lock

Eli Cohen (2):
      IB/mlx5: Fix fatal error dispatching
      IB/mlx5: Fix NULL pointer dereference on debug print

Ira Weiny (1):
      IB/hfi1: Fix rnr_timer addition

Jakub Pawlak (2):
      IB/hfi1: Fix integrity check flags default values
      IB/hfi1: Fix status error code for unsupported packets

Jianxin Xiong (2):
      IB/hfi1: Fix a potential memory leak in hfi1_create_ctxts()
      IB/hfi1: Prevent hardware counter names from being cut off

Krzysztof Blaszkowski (2):
      IB/hfi1: Return ENODEV for unsupported PCI device ids.
      IB/hfi1: Relocate rcvhdrcnt module parameter check.

Leon Romanovsky (1):
      IB/core: Set routable RoCE gid type for ipv4/ipv6 networks

Majd Dibbiny (1):
      IB/mlx5: Fix memory leak in query device

Maor Gottlieb (1):
      IB/mlx5: Validate requested RQT size

Mark Bloch (3):
      IB/cm: Mark stale CM id's whenever the mad agent was unregistered
      IB/core: Add missing check for addr_resolve callback return value
      IB/core: Avoid unsigned int overflow in sg_alloc_table

Matan Barak (1):
      IB/mlx4: Fix create CQ error flow

Moshe Lazer (1):
      IB/mlx5: Resolve soft lock on massive reg MRs

Steve Wise (2):
      iw_cxgb4: set *bad_wr for post_send/post_recv errors
      iw_cxgb4: invalidate the mr when posting a read_w_inv wr

Tadeusz Struk (2):
      IB/hfi1: Remove redundant sysfs irq affinity entry
      IB/hfi1: Fix an Oops on pci device force remove

Tariq Toukan (1):
      IB/uverbs: Fix leak of XRC target QPs

Yonatan Cohen (4):
      IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
      IB/rxe: Fix handling of erroneous WR
      IB/rxe: Clear queue buffer when modifying QP to reset
      IB/rxe: Update qp state for user query

 drivers/infiniband/core/addr.c         |  11 ++-
 drivers/infiniband/core/cm.c           | 126 ++++++++++++++++++++++++++++-----
 drivers/infiniband/core/cma.c          |  21 +++++-
 drivers/infiniband/core/umem.c         |   2 +-
 drivers/infiniband/core/uverbs_main.c  |   7 +-
 drivers/infiniband/hw/cxgb4/cq.c       |  17 +----
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   2 +-
 drivers/infiniband/hw/cxgb4/mem.c      |  12 ++++
 drivers/infiniband/hw/cxgb4/qp.c       |  20 +++---
 drivers/infiniband/hw/hfi1/affinity.c  |  72 -------------------
 drivers/infiniband/hw/hfi1/affinity.h  |   4 --
 drivers/infiniband/hw/hfi1/chip.c      |  27 +++----
 drivers/infiniband/hw/hfi1/chip.h      |   3 +
 drivers/infiniband/hw/hfi1/driver.c    |  37 +++++++---
 drivers/infiniband/hw/hfi1/file_ops.c  |  19 ++++-
 drivers/infiniband/hw/hfi1/hfi.h       |  89 +++++++++--------------
 drivers/infiniband/hw/hfi1/init.c      | 104 ++++++++++++++++-----------
 drivers/infiniband/hw/hfi1/pcie.c      |   3 +-
 drivers/infiniband/hw/hfi1/pio.c       |  13 +---
 drivers/infiniband/hw/hfi1/rc.c        |   2 +-
 drivers/infiniband/hw/hfi1/sdma.c      |  19 +----
 drivers/infiniband/hw/hfi1/sysfs.c     |  25 -------
 drivers/infiniband/hw/hfi1/trace_rx.h  |  60 ----------------
 drivers/infiniband/hw/hfi1/user_sdma.c |   2 +-
 drivers/infiniband/hw/mlx4/ah.c        |   5 +-
 drivers/infiniband/hw/mlx4/cq.c        |   5 +-
 drivers/infiniband/hw/mlx5/cq.c        |   3 +-
 drivers/infiniband/hw/mlx5/main.c      |  11 +--
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   2 +
 drivers/infiniband/hw/mlx5/mr.c        |   6 +-
 drivers/infiniband/hw/mlx5/qp.c        |  12 +++-
 drivers/infiniband/sw/rdmavt/dma.c     |   3 -
 drivers/infiniband/sw/rxe/rxe_net.c    |   8 +--
 drivers/infiniband/sw/rxe/rxe_qp.c     |   2 +
 drivers/infiniband/sw/rxe/rxe_queue.c  |   9 +++
 drivers/infiniband/sw/rxe/rxe_queue.h  |   2 +
 drivers/infiniband/sw/rxe/rxe_req.c    |  21 +++---
 37 files changed, 391 insertions(+), 395 deletions(-)

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD




* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Yuval Shaia @ 2016-11-17 12:16 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Moni Shoua, Doug Ledford, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161117110005.GB32143@mwanda>

Besides the soft-aggressive commit message -:)
Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Thu, Nov 17, 2016 at 02:00:05PM +0300, Dan Carpenter wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future-proof in case any of the
> types change.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
> index 2a6e3cd..efc832a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_srq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_srq.c
> @@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
>  			}
>  		}
>  
> -		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
> +		err = rxe_queue_resize(q, &attr->max_wr,
>  				       rcv_wqe_size(srq->rq.max_sge),
>  				       srq->rq.queue->ip ?
>  						srq->rq.queue->ip->context :
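A minimal userspace sketch of the point made in the commit message, assuming (as the patch states) that the field is a u32 and the callee takes an unsigned int pointer. `resize_queue()` and `struct srq_attr_sketch` are hypothetical stand-ins, not the actual rxe_queue_resize() or the rxe attribute structure; only the pointer-type relationship is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* In the kernel, u32 is a typedef for unsigned int; this sketch mirrors
 * that so the cast-free call below is type-correct. */
typedef unsigned int u32;

/* Hypothetical stand-in for a resize helper: rounds the requested element
 * count up to a power of two and writes the new size back. */
static int resize_queue(unsigned int *num_elem)
{
	unsigned int n = 1;

	assert(num_elem != NULL);
	/* Reject zero and sizes that cannot be rounded up without overflow. */
	if (*num_elem == 0 || *num_elem > (1u << 31))
		return -1;
	while (n < *num_elem)
		n <<= 1;
	*num_elem = n;
	return 0;
}

/* Hypothetical attribute structure; max_wr is a u32 as in the patch. */
struct srq_attr_sketch {
	u32 max_wr;
};

static int resize_from_attr(struct srq_attr_sketch *attr)
{
	/* No cast needed: u32 * and unsigned int * are the same type.  If
	 * max_wr were ever changed to a 16- or 64-bit type, the compiler would
	 * report incompatible pointer types here, whereas a cast such as
	 * (unsigned int *)&attr->max_wr would silence that diagnostic and let
	 * resize_queue() access the wrong number of bytes. */
	return resize_queue(&attr->max_wr);
}
```

With `max_wr = 100`, `resize_from_attr()` succeeds and leaves `max_wr` at 128 under this sketch's power-of-two rounding.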


* Re: NFSD generic R/W API (sendto path) performance results
From: Christoph Hellwig @ 2016-11-17 12:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Steve Wise, Christoph Hellwig, Sagi Grimberg,
	List Linux RDMA Mailing
In-Reply-To: <BA9DC9F7-C893-428B-AFE5-EFCCD13C9F25-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Wed, Nov 16, 2016 at 02:45:33PM -0500, Chuck Lever wrote:
> Out of curiosity, I hacked up my NFS client to limit the size of RDMA
> segments to 30 pages (the server HCA's max_sge).
> 
> A 1MB NFS READ now takes 9 segments. That forces the after-conversion
> server to build single-Write chains and use 9 post_send calls to
> transmit the READ payload, just like the before-conversion server.
> 
> Performance of before- and after-conversion servers is now equivalent.
> 
>               kB  reclen    write  rewrite     read   reread
>          2097152    1024  1061237  1141614  1961410  2000223

What HCA is this, btw?  Also did you try to always register for > max_sge
calls?  The code can already register all segments with the
rdma_rw_force_mr module option, so it would only need a small tweak for
that behavior.
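The 9-segment figure in Chuck's message follows from simple arithmetic. Treating 1MB as 1 MiB and assuming 4 KiB pages with one page per SGE (neither is stated in the thread), a READ payload spans 1048576 / 4096 = 256 pages, and splitting those into segments of at most 30 pages gives ceil(256 / 30) = 9. `div_round_up()` and `segments_needed()` are hypothetical helpers for illustration only.

```c
#include <assert.h>

/* Assumption: 4 KiB pages and one page per SGE; the thread does not
 * state the page size. */
#define SKETCH_PAGE_SIZE 4096u

/* Integer division rounded up, like the kernel's DIV_ROUND_UP(). */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Hypothetical helper: segments needed to carry `bytes` of payload when
 * each segment may cover at most `max_pages` pages (30 in the thread,
 * the server HCA's max_sge). */
static unsigned int segments_needed(unsigned int bytes, unsigned int max_pages)
{
	unsigned int pages = div_round_up(bytes, SKETCH_PAGE_SIZE);

	return div_round_up(pages, max_pages);
}
```

Raising `max_pages` to 256 or more makes the same payload fit in a single segment under these assumptions.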


* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Moni Shoua @ 2016-11-17 13:51 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Doug Ledford, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors
In-Reply-To: <20161117110005.GB32143@mwanda>

On Thu, Nov 17, 2016 at 1:00 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future-proof in case any of the
> types change.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Moni Shoua <monis@mellanox.com>

