public inbox for linux-rdma@vger.kernel.org
* CQ overrun with ib_send_bw
@ 2010-08-13 18:44 Sumeet Lahorani
       [not found] ` <4C659288.4030402-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Sumeet Lahorani @ 2010-08-13 18:44 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA


Hi,

When I run ib_send_bw with the -a option, I get CQ overrun errors.

Server :
[root@dscbad01 ~]# ib_send_bw
------------------------------------------------------------------
                    Send BW Test
Connection type : RC
Inline data is used up to 1 bytes message
  local address:  LID 0x24, QPN 0x1c004c, PSN 0x85c292
  remote address: LID 0x2a, QPN 0x14004a, PSN 0x858358
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec] 
------------------------------------------------------------------

Client :
[root@dscbad03 ~]# ib_send_bw -a dscbad01
------------------------------------------------------------------
                    Send BW Test
Connection type : RC
Inline data is used up to 1 bytes message
  local address:  LID 0x2a, QPN 0x14004a, PSN 0x858358
  remote address: LID 0x24, QPN 0x1c004c, PSN 0x85c292
Mtu : 2048
------------------------------------------------------------------
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec] 
      2        1000               5.99                  5.45
Completion wth error at client:
Failed status 12: wr_id 1 syndrom 0x81
scnt=600, ccnt=300

and on the client console

mlx4_core 0000:13:00.0: CQ overrun on CQN 000086
mlx4_core 0000:13:00.0: Internal error detected:
mlx4_core 0000:13:00.0:   buf[00]: 00328f6f
mlx4_core 0000:13:00.0:   buf[01]: 00000000
mlx4_core 0000:13:00.0:   buf[02]: 20070000
mlx4_core 0000:13:00.0:   buf[03]: 00000000
mlx4_core 0000:13:00.0:   buf[04]: 00328f3c
mlx4_core 0000:13:00.0:   buf[05]: 0014004a
mlx4_core 0000:13:00.0:   buf[06]: 00340000
mlx4_core 0000:13:00.0:   buf[07]: 00000044
mlx4_core 0000:13:00.0:   buf[08]: 00000804
mlx4_core 0000:13:00.0:   buf[09]: 00000804
mlx4_core 0000:13:00.0:   buf[0a]: 00000000
mlx4_core 0000:13:00.0:   buf[0b]: 00000000
mlx4_core 0000:13:00.0:   buf[0c]: 00000000
mlx4_core 0000:13:00.0:   buf[0d]: 00000000
mlx4_core 0000:13:00.0:   buf[0e]: 00000000
mlx4_core 0000:13:00.0:   buf[0f]: 00000000

This is with OFED 1.5.1, but it also happens with OFED 1.4.2. Sometimes
the node crashes because it runs out of memory, but most of the time I
see just the errors above. What could be wrong?

- Sumeet

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: CQ overrun with ib_send_bw
       [not found] ` <4C659288.4030402-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2010-08-13 19:06   ` Ralph Campbell
       [not found]     ` <1281726396.2313.44.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ralph Campbell @ 2010-08-13 19:06 UTC (permalink / raw)
  To: Sumeet Lahorani; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

I know there is a bug with "ib_send_bw -b" (bi-directional)
since it doesn't create a CQ that is large enough for all the
posted sends *and* receives.  I have tried several times to get the
following patch applied but I never got a reply and nothing was
done.

diff --git a/send_bw.c b/send_bw.c
index ddd2b73..e3f644a 100644
--- a/send_bw.c
+++ b/send_bw.c
@@ -746,6 +746,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 	if (user_parm->use_mcg && !user_parm->servername) {
 		cq_rx_depth *= user_parm->num_of_clients_mcg;
 	}
+	if (user_parm->duplex)
+		cq_rx_depth += ctx->tx_depth;
 	ctx->cq = ibv_create_cq(ctx->context,cq_rx_depth, NULL, ctx->channel, 0);
 	if (!ctx->cq) {
 		fprintf(stderr, "Couldn't create CQ\n");

There should be enough CQEs in the normal case though.
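
To make the sizing rule behind the patch concrete, here is a minimal sketch. It assumes a single CQ shared by the send and receive queues and assumes the depth starts at the receive depth; the initial value of cq_rx_depth is not shown in the hunk above. The helper name required_cq_depth and its parameters are hypothetical and are not part of perftest.

```c
/*
 * Hypothetical helper, not part of perftest: number of CQEs one shared CQ
 * needs so that every posted work request can complete without overrun.
 */
static int required_cq_depth(int rx_depth, int tx_depth, int duplex,
                             int use_mcg, int is_server,
                             int num_of_clients_mcg)
{
    /* Assumption: the CQ is initially sized for the posted receives only. */
    int cq_depth = rx_depth;

    /* Multicast server: scale by the number of clients, mirroring the
     * context lines of the patch. */
    if (use_mcg && is_server)
        cq_depth *= num_of_clients_mcg;

    /* Bidirectional run: send completions land on the same CQ as the
     * receive completions, so reserve room for the posted sends too. */
    if (duplex)
        cq_depth += tx_depth;

    return cq_depth;
}
```

With illustrative depths of 300 for both queues, this returns 300 for a unidirectional run and 600 once duplex is set, which is the extra room the patch adds.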


* RE: CQ overrun with ib_send_bw
       [not found]     ` <1281726396.2313.44.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-08-13 19:14       ` Hefty, Sean
       [not found]         ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25A96887A2-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hefty, Sean @ 2010-08-13 19:14 UTC (permalink / raw)
  To: Ralph Campbell, Sumeet Lahorani
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> I know there is a bug with "ib_send_bw -b" (bi-directional)
> since it doesn't create a CQ that is large enough for all the
> posted sends *and* receives.  I have tried several times to get the
> following patch applied but I never got a reply and nothing was
> done.

Who's the maintainer of these tests?

* RE: CQ overrun with ib_send_bw
       [not found]         ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25A96887A2-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-08-13 19:21           ` Ralph Campbell
       [not found]             ` <1281727297.2313.47.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ralph Campbell @ 2010-08-13 19:21 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Sumeet Lahorani,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Fri, 2010-08-13 at 12:14 -0700, Hefty, Sean wrote:
> > I know there is a bug with "ib_send_bw -b" (bi-directional)
> > since it doesn't create a CQ that is large enough for all the
> > posted sends *and* receives.  I have tried several times to get the
> > following patch applied but I never got a reply and nothing was
> > done.
> 
> Who's the maintainer of these tests?

I believe it is:

Ido Shamai <idos-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

git://git.openfabrics.org/~shamoya/perftest.git


* RE: CQ overrun with ib_send_bw
       [not found]             ` <1281727297.2313.47.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-08-17 11:19               ` Tziporet Koren
       [not found]                 ` <E113D394D7C5DB4F8FF691FA7EE9DB443B5668DE17-WQlSmcKwN8Te+A/uUDamNg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Tziporet Koren @ 2010-08-17 11:19 UTC (permalink / raw)
  To: Ralph Campbell, Hefty, Sean, Ido Shamay, Amir Ancel
  Cc: Sumeet Lahorani,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 8/13/2010 10:21 PM, Ralph Campbell wrote:
> On Fri, 2010-08-13 at 12:14 -0700, Hefty, Sean wrote:
>>> I know there is a bug with "ib_send_bw -b" (bi-directional)
>>> since it doesn't create a CQ that is large enough for all the
>>> posted sends *and* receives.  I have tried several times to get the
>>> following patch applied but I never got a reply and nothing was
>>> done.
>>
>> Who's the maintainer of these tests?
>
> I believe it is:
>
> Ido Shamai <idos-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
>
> git://git.openfabrics.org/~shamoya/perftest.git
>
>

Yes, Ido is the maintainer; however, he is on vacation until September.
I have added Amir, who may be able to help for now.

Tziporet

* RE: CQ overrun with ib_send_bw
       [not found]                 ` <E113D394D7C5DB4F8FF691FA7EE9DB443B5668DE17-WQlSmcKwN8Te+A/uUDamNg@public.gmane.org>
@ 2010-08-17 11:36                   ` Amir Ancel
       [not found]                     ` <1EEC75D0B27041449A1EEA2927D1B145380145A7DA-WQlSmcKwN8Te+A/uUDamNg@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Amir Ancel @ 2010-08-17 11:36 UTC (permalink / raw)
  To: Tziporet Koren, Ralph Campbell, Hefty, Sean, Ido Shamay
  Cc: Sumeet Lahorani,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Raz Baussi

Hi Sean,

We've seen this issue as well.

Can you send the patch directly to us?

I have added Raz from my team, who is replacing Ido while he is out of office.


Thanks,

Amir Ancel
Performance Team Manager
Mellanox Technologies


* RE: CQ overrun with ib_send_bw
       [not found]                     ` <1EEC75D0B27041449A1EEA2927D1B145380145A7DA-WQlSmcKwN8Te+A/uUDamNg@public.gmane.org>
@ 2010-08-17 18:59                       ` Ralph Campbell
       [not found]                         ` <1282071547.2313.100.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Ralph Campbell @ 2010-08-17 18:59 UTC (permalink / raw)
  To: Amir Ancel
  Cc: Tziporet Koren, Hefty, Sean, Ido Shamay, Sumeet Lahorani,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Raz Baussi

[-- Attachment #1: Type: text/plain, Size: 1318 bytes --]

The patch is attached.


[-- Attachment #2: send_bw.patch --]
[-- Type: text/x-patch, Size: 491 bytes --]

diff --git a/send_bw.c b/send_bw.c
index ddd2b73..e3f644a 100644
--- a/send_bw.c
+++ b/send_bw.c
@@ -746,6 +746,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 	if (user_parm->use_mcg && !user_parm->servername) {
 		cq_rx_depth *= user_parm->num_of_clients_mcg;
 	}
+	if (user_parm->duplex)
+		cq_rx_depth += ctx->tx_depth;
 	ctx->cq = ibv_create_cq(ctx->context,cq_rx_depth, NULL, ctx->channel, 0);
 	if (!ctx->cq) {
 		fprintf(stderr, "Couldn't create CQ\n");


* RE: CQ overrun with ib_send_bw
       [not found]                         ` <1282071547.2313.100.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-08-17 19:08                           ` Amir Ancel
  0 siblings, 0 replies; 8+ messages in thread
From: Amir Ancel @ 2010-08-17 19:08 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Tziporet Koren, Hefty, Sean, Ido Shamay, Sumeet Lahorani,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Raz Baussi

Thanks,

We'll apply it soon.

Amir Ancel
Performance Team Manager
Mellanox Technologies


