* IB errors with openMPI
@ 2010-02-22  5:46 Pradeep Satyanarayana
From: Pradeep Satyanarayana @ 2010-02-22  5:46 UTC
  To: EWG; +Cc: linux-rdma

We are trying to run Open MPI with OFED-1.5 on the 2.6.31-rt11-preempt-rt kernel and see the following errors:

[[45393,1],8][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
from elm3b107 to: elm3b17 error polling HP CQ with status WORK REQUEST FLUSHED
ERROR status number 5 for wr_id 1289846528 opcode -1782678528  vendor error 244
qp_idx 0
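
For reference, the numeric fields in the message above can be decoded mechanically. The following is a minimal sketch, not part of Open MPI: the helper name decode_wc_error and the regular expression are hypothetical, while the status table follows enum ibv_wc_status in libibverbs' <infiniband/verbs.h>. Status 5 maps to IBV_WC_WR_FLUSH_ERR. Note that ibv_poll_cq documents only wr_id, status, qp_num and vendor_err as valid for a failed completion, which would explain the meaningless opcode value. The vendor error is device specific and is deliberately left undecoded here.

```python
import re

# Values of enum ibv_wc_status from libibverbs' <infiniband/verbs.h>.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern matching the second line of the message quoted above.
ERROR_RE = re.compile(
    r"status number (?P<status>-?\d+) for wr_id (?P<wr_id>-?\d+)\s+"
    r"opcode (?P<opcode>-?\d+)\s+vendor error (?P<vendor_err>-?\d+)"
)


def decode_wc_error(line):
    """Return the numeric fields of an openib BTL error line, or None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # Statuses missing from the table are reported as "unknown", not guessed.
    fields["status_name"] = IBV_WC_STATUS.get(fields["status"], "unknown")
    return fields


line = ("ERROR status number 5 for wr_id 1289846528 "
        "opcode -1782678528  vendor error 244")
print(decode_wc_error(line))
```

A flush error is normally a consequence rather than a cause: work requests are flushed after the QP has already moved to the error state, so the first non-flush completion error on either side is usually the more informative one.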

At this point I looked at the mlx4 diag counters and saw some non-zero values. Because we were doing
a series of runs, we do not know when the counters first became non-zero. Are these counters correlated
with the MPI error above?

[root@elm3b17 diag_counters]# pwd
/sys/class/infiniband/mlx4_0/diag_counters
[root@elm3b17 diag_counters]#

[root@elm3b17 diag_counters]# cat rq_num_rnr
19
[root@elm3b17 diag_counters]# cat rq_num_wrfe 
2009
[root@elm3b17 diag_counters]# cat sq_num_tree 
12
[root@elm3b17 diag_counters]# cat sq_num_wrfe
12
[root@elm3b17 diag_counters]#

Similarly, here are the counters on elm3b107:

[root@elm3b107 diag_counters]# cat rq_num_wrfe
5156
[root@elm3b107 diag_counters]# cat sq_num_rnr
18
[root@elm3b107 diag_counters]# cat sq_num_tree
20
[root@elm3b107 diag_counters]# cat sq_num_wrfe
20
[root@elm3b107 diag_counters]#
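
One way to find out which run moves the counters is to snapshot them before and after each run and compare. The following is a minimal sketch under stated assumptions: it assumes the sysfs directory shown above and that every file in it holds a single integer. The function names read_counters and diff_counters are hypothetical, not part of OFED.

```python
import os

# Directory taken from the session above; adjust for other HCAs.
DIAG_DIR = "/sys/class/infiniband/mlx4_0/diag_counters"


def read_counters(diag_dir=DIAG_DIR):
    """Read every integer-valued file in diag_dir into a dict."""
    counters = {}
    try:
        names = sorted(os.listdir(diag_dir))
    except OSError:
        # Directory missing or unreadable: report no counters.
        return counters
    for name in names:
        try:
            with open(os.path.join(diag_dir, name)) as handle:
                counters[name] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not plain integers.
            continue
    return counters


def diff_counters(before, after):
    """Return only the counters whose value changed, as deltas."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


# Usage: take one snapshot before an MPI run and one after it.
before = read_counters()
# ... launch the MPI job here ...
after = read_counters()
print(diff_counters(before, after))
```

Running this on both nodes around each individual run would show whether, for example, sq_num_tree and sq_num_wrfe increase together with the run that reports the error.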


We are using ConnectX dual-port DDR HCAs (FW version 2.6). What does vendor error 244 mean? Any suggestions for
debugging this further?

Thanks
Pradeep
