* NFS/RDMA connection establish/break in loop
@ 2012-05-29 18:35 Goldwyn Rodrigues
From: Goldwyn Rodrigues @ 2012-05-29 18:35 UTC (permalink / raw)
  To: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

Hi,

When we try to establish a connection with an NFS/RDMA server, we get the
following messages with RPC debugging enabled:

[ 2937.577657] RPC:       rpcrdma_conn_upcall: established: 192.168.1.13:20049 (ep 0xffff88012f980628 event 0x9)
[ 2937.597566] RPC:       rpcrdma_conn_upcall: connected
[ 2937.597569] RPC:  6385 __rpc_wake_up_task (now 4295627490)
[ 2937.597572] RPC:  6385 disabling timer
[ 2937.597576] RPC:  6385 removed from queue ffff88012f9802f0 "xprt_pending"
[ 2937.597580] RPC:       __rpc_wake_up_task done
[ 2937.597586] RPC:  6385 sync task resuming
[ 2937.597592] rpcrdma: connection to 192.168.1.13:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 2937.597597] RPC:  6385 marshaling NULL cred ffffffffa0437c60
[ 2937.597603] RPC:  6385 using AUTH_NULL cred ffffffffa0437c60 to wrap rpc data
[ 2937.597607] RPC:       rpcrdma_ep_connect: connected
[ 2937.597611] RPC:  6385 sleep_on(queue "xprt_pending" time 4295627490)
[ 2937.597615] RPC:       xprt_rdma_connect_worker: exit
[ 2937.597620] RPC:  6385 added to queue ffff88012f9802f0 "xprt_pending"
[ 2937.597625] RPC:  6385 setting alarm for 60000 ms
[ 2937.597631] RPC:  6385 sync task going to sleep
[ 2937.597812] RPC:       rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ffff88012f980628
[ 2937.597817] RPC:  6385 __rpc_wake_up_task (now 4295627490)
[ 2937.597818] RPC:  6385 disabling timer
[ 2937.597821] RPC:  6385 removed from queue ffff88012f9802f0 "xprt_pending"
[ 2937.597824] RPC:       __rpc_wake_up_task done
[ 2937.597830] RPC:       rpcrdma_event_process: event rep ffff880139eb7000 status 5 opcode FFFFFFFF length 4294936578
[ 2937.597833] RPC:       rpcrdma_event_process: recv WC status 5, connection lost
[ 2937.597841] RPC:  6385 sync task resuming
[ 2937.597844] RPC:  6385 sleep_on(queue "xprt_pending" time 4295627490)
[ 2937.597846] RPC:  6385 added to queue ffff88012f9802f0 "xprt_pending"
[ 2937.597848] RPC:  6385 setting alarm for 60000 ms
[ 2937.597850] RPC:  6385 sync task going to sleep
[ 2937.598207] RPC:       rpcrdma_conn_upcall: disconnected: 192.168.1.13:20049 (ep 0xffff88012f980628 event 0xa)
[ 2937.598210] RPC:       rpcrdma_conn_upcall: disconnected
[ 2937.598213] rpcrdma: connection to 192.168.1.13:20049 closed (-103)
[ 2967.547845] RPC:       xprt_rdma_connect_worker: reconnect
[ 2967.558976] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
[ 2967.561651] RPC:       rpcrdma_conn_upcall: 4 responder resources (1 initiator)
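
The numeric codes in the log above are easier to follow with names attached.
The sketch below is only an illustration: the helper names cm_event_name()
and wc_status_name() are hypothetical, and the tables assume the ordering of
enum rdma_cm_event_type in <rdma/rdma_cm.h> and of the first six values of
enum ib_wc_status in <rdma/ib_verbs.h>. Check both against the headers of the
kernel actually in use.

```c
#include <stddef.h>

/* Assumed to mirror enum rdma_cm_event_type; array index == event value. */
static const char *const cm_event_names[] = {
    "RDMA_CM_EVENT_ADDR_RESOLVED",    /* 0x0 */
    "RDMA_CM_EVENT_ADDR_ERROR",       /* 0x1 */
    "RDMA_CM_EVENT_ROUTE_RESOLVED",   /* 0x2 */
    "RDMA_CM_EVENT_ROUTE_ERROR",      /* 0x3 */
    "RDMA_CM_EVENT_CONNECT_REQUEST",  /* 0x4 */
    "RDMA_CM_EVENT_CONNECT_RESPONSE", /* 0x5 */
    "RDMA_CM_EVENT_CONNECT_ERROR",    /* 0x6 */
    "RDMA_CM_EVENT_UNREACHABLE",      /* 0x7 */
    "RDMA_CM_EVENT_REJECTED",         /* 0x8 */
    "RDMA_CM_EVENT_ESTABLISHED",      /* 0x9 */
    "RDMA_CM_EVENT_DISCONNECTED",     /* 0xa */
};

/* Assumed to mirror the first six values of enum ib_wc_status. */
static const char *const wc_status_names[] = {
    "IB_WC_SUCCESS",        /* 0 */
    "IB_WC_LOC_LEN_ERR",    /* 1 */
    "IB_WC_LOC_QP_OP_ERR",  /* 2 */
    "IB_WC_LOC_EEC_OP_ERR", /* 3 */
    "IB_WC_LOC_PROT_ERR",   /* 4 */
    "IB_WC_WR_FLUSH_ERR",   /* 5 */
};

/* Hypothetical helper: name for the "event 0x.." value in rpcrdma_conn_upcall. */
const char *cm_event_name(unsigned int ev)
{
    size_t n = sizeof(cm_event_names) / sizeof(cm_event_names[0]);
    return ev < n ? cm_event_names[ev] : "unknown";
}

/* Hypothetical helper: name for the "status N" value in rpcrdma_event_process. */
const char *wc_status_name(unsigned int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);
    return status < n ? wc_status_names[status] : "unknown";
}
```

Under these assumptions, cm_event_name(0x9) and cm_event_name(0xa) give the
ESTABLISHED and DISCONNECTED events, and wc_status_name(5) gives
IB_WC_WR_FLUSH_ERR. A flush error normally means the posted receive was
flushed after the QP had already entered the error state. That would make it
a consequence of the QP error logged just before it rather than a separate
failure, and would explain why the opcode and length fields of that
completion look like garbage. The -103 in the "closed" line is -ECONNABORTED
on Linux.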

This keeps looping until the mount is cancelled.
Looking at the code, rpcrdma_qp_async_error_upcall is called with
event=3 (IB_EVENT_QP_ACCESS_ERR) on device mlx4_0.
The event is raised from mlx4_ib_qp_event, which is receiving
MLX4_EVENT_TYPE_WQ_ACCESS_ERROR.
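
To show how "QP error 3" maps to an event name, here is a minimal sketch. It
assumes the first four values of enum ib_event_type in <rdma/ib_verbs.h> are
ordered as listed below. The helper name ib_event_name() is hypothetical and
is not a kernel function in this form.

```c
#include <stddef.h>

/* Assumed to mirror the first four values of enum ib_event_type. */
static const char *const ib_event_names[] = {
    "IB_EVENT_CQ_ERR",        /* 0 */
    "IB_EVENT_QP_FATAL",      /* 1 */
    "IB_EVENT_QP_REQ_ERR",    /* 2 */
    "IB_EVENT_QP_ACCESS_ERR", /* 3 */
};

/* Hypothetical helper: name for the value printed as "QP error N". */
const char *ib_event_name(unsigned int event)
{
    size_t n = sizeof(ib_event_names) / sizeof(ib_event_names[0]);
    return event < n ? ib_event_names[event] : "unknown";
}
```

With this table, ib_event_name(3) returns "IB_EVENT_QP_ACCESS_ERR". That is
the verbs-level event mlx4_ib_qp_event reports when the hardware signals
MLX4_EVENT_TYPE_WQ_ACCESS_ERROR.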

What could cause the mlx4 driver to be unable to access the WQ, or to raise
such an interrupt? I checked the QP setup in mlx4_ib_create_qp and it returns
success.

This is SLES11SP1 - kernel 2.6.32.59-0.3

-- 
Goldwyn