* rdma problems on Sun / ConnectX hardware
@ 2009-12-31 19:08 Jeff Haferman
[not found] ` <20091231190859.041281D90009-uDbadAYOwZ9eoWH0uzbU5w@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Jeff Haferman @ 2009-12-31 19:08 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Linux kernel = 2.6.18-92.1.26.el5_lustre.1.6.7.2smp
OS = Centos 5.2
We have a Sun Blade system with Sun IB products
(switch = Sun part number X2821A-Z, 36-port QDR switch)
(HCAs = Sun part number X4216A-Z, dual-port DDR PCI-E)
I can SOMETIMES run mvapich or openmpi over IB successfully, but usually I get
a "CQ polling error". So I went back to the basic rdma tests, and they show problems too.
We have installed OFED 1.4.1-4, and because I was having problems I upgraded the firmware on the HCAs:
lspci | grep -i infin
0b:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
mstflint -d 0b:00.0 q
Image type: ConnectX
FW Version: 2.6.0
Device ID: 25418
Chip Revision: A0
Description: Node Port1 Port2 Sys
image
GUIDs: 0003ba000100d770 0003ba000100d771 0003ba000100d772
0003ba000100d773
MACs: 0003ba00d771 0003ba00d772
Board ID: (SUN0060000001)
VSD:
PSID: SUN0060000001
Running rping from the client to the server gives:
verbose
client
created cm_id 0x10ca7c70
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x10ca7c70 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x10ca7c70 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x10caa3d0
created channel 0x10caa3f0
created cq 0x10caa410
created qp 0x10caa550
rping_setup_buffers called on cb 0x10ca5010
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x10ca7c70 (parent)
ESTABLISHED
rmda_connect successful
RDMA addr 10caaa90 rkey 2002800 len 100
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x10ca7c70 (parent)
client DISCONNECT EVENT...
wait for RDMA_WRITE_ADV state 6
cq completion failed status 5
rping_free_buffers called on cb 0x10ca5010
destroy cm_id 0x10ca7c70
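For reference, the "status 5" in the rping output above is a numeric ibv_wc_status value from libibverbs. The sketch below is a small Python lookup table that assumes the enum ordering found in <infiniband/verbs.h>; the helper name wc_status_name is made up for illustration. Status 5 decodes to IBV_WC_WR_FLUSH_ERR, which is reported when outstanding work requests are flushed after the QP has entered the error state, so it is usually a consequence of an earlier failure or disconnect rather than the root cause.

```python
# Work-completion status names, indexed by numeric value.
# Assumption: ordering matches enum ibv_wc_status in <infiniband/verbs.h>.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",             # 0
    "IBV_WC_LOC_LEN_ERR",         # 1
    "IBV_WC_LOC_QP_OP_ERR",       # 2
    "IBV_WC_LOC_EEC_OP_ERR",      # 3
    "IBV_WC_LOC_PROT_ERR",        # 4
    "IBV_WC_WR_FLUSH_ERR",        # 5
    "IBV_WC_MW_BIND_ERR",         # 6
    "IBV_WC_BAD_RESP_ERR",        # 7
    "IBV_WC_LOC_ACCESS_ERR",      # 8
    "IBV_WC_REM_INV_REQ_ERR",     # 9
    "IBV_WC_REM_ACCESS_ERR",      # 10
    "IBV_WC_REM_OP_ERR",          # 11
    "IBV_WC_RETRY_EXC_ERR",       # 12
    "IBV_WC_RNR_RETRY_EXC_ERR",   # 13
    "IBV_WC_LOC_RDD_VIOL_ERR",    # 14
    "IBV_WC_REM_INV_RD_REQ_ERR",  # 15
    "IBV_WC_REM_ABORT_ERR",       # 16
    "IBV_WC_INV_EECN_ERR",        # 17
    "IBV_WC_INV_EEC_STATE_ERR",   # 18
    "IBV_WC_FATAL_ERR",           # 19
    "IBV_WC_RESP_TIMEOUT_ERR",    # 20
    "IBV_WC_GENERAL_ERR",         # 21
]


def wc_status_name(code: int) -> str:
    """Translate a numeric completion status into its symbolic name."""
    if 0 <= code < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[code]
    return f"unknown status {code}"


if __name__ == "__main__":
    # The value reported by "cq completion failed status 5".
    print(wc_status_name(5))
```

In the log, the DISCONNECTED event is printed before the failed completion, which fits a flush that follows the connection being torn down; the table alone does not say why the disconnect happened.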
Running ib_rdma_lat gives:
local address: LID 0x16 QPN 0x004f PSN 0x743778 RKey 0x002500 VAddr 0x00000007c72001
remote address: LID 0x01 QPN 0x004f PSN 0x6497a1 RKey 0x002500 VAddr 0x00000018780001
Conflicting CPU frequency values detected: 2336.000000 != 2003.000000
Latency typical: inf usec
Latency best : inf usec
Latency worst : inf usec
Any ideas?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: rdma problems on Sun / ConnectX hardware
@ 2010-01-03 20:00 Jeff Haferman
[not found] ` <20100103200023.169DF1D90008-uDbadAYOwZ9eoWH0uzbU5w@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Jeff Haferman @ 2010-01-03 20:00 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
I tried posting this to general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org and got an auto-reply saying that
the list is no longer active and that I should post here instead. I posted here a few days ago but
got no response, so my question is: does anyone have any ideas, or is there a more
appropriate place to post?
I've made a bit of progress: the latest ibtools provide a "-F" option that can be
passed to "ib_write_lat" to ignore the CPU frequency mismatch, and with it I now get latencies returned.
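The "inf usec" results are consistent with the tool having no usable CPU frequency to convert cycle counts into microseconds. The Python sketch below illustrates that reasoning only; it is not the tool's code. It assumes the tool reads the per-core "cpu MHz" lines from /proc/cpuinfo, gives up with 0 MHz when they disagree by more than about 1%, and then divides cycles by that value. The function names and the ignore_conflict parameter (standing in for "-F") are hypothetical.

```python
import math


def cpu_mhz(cpuinfo_text, ignore_conflict=False):
    """Return one CPU frequency in MHz, or 0.0 if the cores disagree.

    Assumption: a relative difference above 1% counts as a conflict.
    """
    mhz = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() != "cpu MHz":
            continue
        current = float(value)
        if mhz is None:
            mhz = current  # first core seen sets the reference value
        elif abs(current - mhz) / mhz > 0.01 and not ignore_conflict:
            return 0.0     # conflicting values: no trustworthy frequency
    return mhz or 0.0


def cycles_to_usec(cycles, mhz):
    """Convert a cycle count to microseconds; 0 MHz yields infinity."""
    return cycles / mhz if mhz > 0 else math.inf


if __name__ == "__main__":
    # Two cores reporting the frequencies from the warning in the log.
    sample = "cpu MHz\t\t: 2336.000\ncpu MHz\t\t: 2003.000\n"
    print(cycles_to_usec(4672, cpu_mhz(sample)))
    print(cycles_to_usec(4672, cpu_mhz(sample, ignore_conflict=True)))
```

Under these assumptions, frequency scaling on the nodes (cores running at different clock speeds) would be enough to produce the warning, independent of any fabric problem.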
"rping" however always seems to fail with CQ errors.
mvapich / openmpi over infiniband usually fails with CQ errors but sometimes my test
programs run to completion.
* Re: rdma problems on Sun / ConnectX hardware
[not found] ` <20100103200023.169DF1D90008-uDbadAYOwZ9eoWH0uzbU5w@public.gmane.org>
@ 2010-01-03 20:19 ` Joe Landman
0 siblings, 0 replies; 4+ messages in thread
From: Joe Landman @ 2010-01-03 20:19 UTC (permalink / raw)
To: jeff-ruUnomVL5WBWk0Htik3J/w; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Jeff Haferman wrote:
> I tried posting this to general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org and got an auto-reply saying that
> list is no longer active and to instead post here... I posted here a few days ago but
> no response, so, my question is, does anyone have any ideas, or, is there a more
> appropriate place to post?
Hi Jeff
This should be fine.
[...]
>
> I've made a bit of progress, with the latest ibtools there is a "-F" option that can be
> passed to "ib_write_lat" to ignore cpufreq stuff, and I now get latencies returned.
>
> "rping" however always seems to fail with CQ errors.
>
> mvapich / openmpi over infiniband usually fails with CQ errors but sometimes my test
> programs run to completion.
[...]
>> lspci | grep -i infin
>> 0b:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
>>
>> mstflint -d 0b:00.0 q
>> Image type: ConnectX
>> FW Version: 2.6.0
This is old firmware. Can you update to 2.6.100 or 2.7.0?
[...]
>> Linux kernel = 2.6.18-92.1.26.el5_lustre.1.6.7.2smp
This could also be an issue: the 2.6.18 kernel is ancient. Of course,
with the Lustre patches, you might not be able to use a more modern
kernel. Lustre 1.8.x might allow you to update the kernel. I don't
know if OFED 1.5 works with Lustre just yet. Which OFED stack are you
using?
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman-nyOC7EYE20mM0MU9lROt9DlRY1/6cnIP@public.gmane.org
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
* Re: rdma problems on Sun / ConnectX hardware
[not found] ` <20091231190859.041281D90009-uDbadAYOwZ9eoWH0uzbU5w@public.gmane.org>
@ 2010-01-04 2:36 ` Frank Leers
0 siblings, 0 replies; 4+ messages in thread
From: Frank Leers @ 2010-01-04 2:36 UTC (permalink / raw)
To: jeff-ruUnomVL5WBWk0Htik3J/w; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Jeff,
On Dec 31, 2009, at 11:08 AM, Jeff Haferman wrote:
>
> Linux kernel = 2.6.18-92.1.26.el5_lustre.1.6.7.2smp
> OS = Centos 5.2
>
> We have a Sun Blade system with Sun IB products
> (switch= Sun part number X2821A-Z 36 port QDR switch)
> (hcas = Sun part number X4216A-Z dual port DDR PCI-E)
>
> I can SOMETIMES run mvapich or openmpi over IB and it works, but generally I get
> a "CQ polling error". So I went back to the rdma tests and see some problems.
Has anyone 'blessed' your fabric yet? What do the error counters look like for the ports in question, and what does ibdiagnet report? What does the fabric topology look like?
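As a rough illustration of checking the error counters, the Python sketch below scans perfquery-style text for counters that are not zero. It assumes lines of the form "Name:....value"; the sample text, its values, and the set of counters treated as traffic rather than errors are all hypothetical, and counter names differ between tool versions.

```python
import re

# Assumption: each counter line looks like "Name:......123".
COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)$")

# Counters that describe traffic or selection rather than errors
# (assumed list; adjust to the names your tool version prints).
NON_ERROR = {"PortSelect", "XmtData", "RcvData", "XmtPkts", "RcvPkts"}


def nonzero_error_counters(text):
    """Return {counter name: value} for every error counter above zero."""
    errors = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if not match:
            continue  # headers and non-decimal values are skipped
        name, value = match.group(1), int(match.group(2))
        if name not in NON_ERROR and value != 0:
            errors[name] = value
    return errors


# Hypothetical sample, not output captured from this system.
SAMPLE = """\
# Port counters: Lid 22 port 1
PortSelect:......................1
SymbolErrors:....................137
LinkRecovers:....................0
LinkDowned:......................2
RcvErrors:.......................0
XmtData:.........................123456
RcvData:.........................654321
"""

if __name__ == "__main__":
    for name, value in nonzero_error_counters(SAMPLE).items():
        print(f"{name} = {value}")
```

Counters that keep growing between two runs of a failing test would point at a physical-layer problem such as a cable or port, which is worth ruling out before looking at software.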
I'd be happy to work with you off-list if you like; please reply directly to take me up on the offer.
-frank