* how to debug (mlx4) CQ overrun
@ 2011-09-23 21:15 Wendy Cheng
[not found] ` <CABgxfbHAMu9Lvd1j8nJF8DdTk0UYQOuxN70Z73XJv3VuLSk7-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 3+ messages in thread
From: Wendy Cheng @ 2011-09-23 21:15 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
I have a test program that does RDMA read-write as follows:
node A: server listens and handles connection requests
        setup a piece of memory initialized to "0"
node B: two processes, parent & child
  child:
    1. setup a new channel with server, including a CQ with 1024 entries
       (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
    2. RDMA sequential write (8192 bytes a time) to server memory
    4. sync with parent
  parent:
    1. setup the new channel with server, including a CQ with 1024 entries
       (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
    3. RDMA sequential read (8192 bytes a time) of the same piece of
       memory from server
       - check the buffer contents.
       - if memory content is still zero, re-read
    4. sync with child
The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
mlx4_core 0000:06:00.0: CQ overrun on CQN 000087
I have my own counters that restrict the read (and write) to 512 max.
Both write and read are blocking (i.e. cq is polled after each
read/write). I suspect I do not have the cq poll logic correct. The
question here is .. is there any diag tool available to check on the
internal counters (and /or states) of ibverbs library and/or kernel
drivers (to help RDMA applications debug) ? In my case, it hangs
around block 14546 (i.e. after 14546*8192 bytes).
Thanks,
Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: how to debug (mlx4) CQ overrun
@ 2011-09-23 21:30 Jason Gunthorpe
0 siblings, 1 reply; 3+ messages in thread
From: Jason Gunthorpe @ 2011-09-23 21:30 UTC (permalink / raw)
To: Wendy Cheng; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:
> I have my own counters that restrict the read (and write) to 512 max.
> Both write and read are blocking (i.e. cq is polled after each
> read/write). I suspect I do not have the cq poll logic correct. The
> question here is .. is there any diag tool available to check on the
> internal counters (and /or states) of ibverbs library and/or kernel
> drivers (to help RDMA applications debug) ? In my case, it hangs
> around 14546 block (i.e. after 14546*8192 bytes).

There are not really any tools, but this is usually straightforward to
look at from your app.

Every time you post to the send Q increment a counter. Every time you
get something back from ibv_poll_cq increment another counter. The
(A - B) must never exceed the number of entries in the CQ, and it must
not exceed the number of entries in the send Q (very important).

This assumes you are posting everything with IBV_SEND_SIGNALED. Doing
otherwise is basically the same but there is a bit more complexity to
manage the CQ counter as each completion represents multiple sendQ
entries.

Make sure you check for error codes from ibv_post_send.

Jason
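[Editorial sketch] The two-counter check described above might look roughly like the following in C. This is only an illustration under stated assumptions: `struct wr_tracker` and the `wr_tracker_*` functions are hypothetical names, not part of libibverbs, and the block deliberately makes no verbs calls so that it compiles without RDMA hardware. The comments mark where `ibv_post_send` and `ibv_poll_cq` would sit in a real application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a libibverbs type): tracks counters A and B. */
struct wr_tracker {
    uint64_t posted;    /* A: work requests accepted by ibv_post_send() */
    uint64_t completed; /* B: completions returned by ibv_poll_cq() */
    uint32_t limit;     /* min(CQ entries, send queue entries) */
};

static void wr_tracker_init(struct wr_tracker *t,
                            uint32_t cq_depth, uint32_t sq_depth)
{
    t->posted = 0;
    t->completed = 0;
    /* (A - B) must stay within both the CQ and the send queue. */
    t->limit = cq_depth < sq_depth ? cq_depth : sq_depth;
}

/* Number of work requests posted but not yet reaped from the CQ. */
static uint64_t wr_tracker_outstanding(const struct wr_tracker *t)
{
    return t->posted - t->completed;
}

/* Check before ibv_post_send(); false means poll the CQ first. */
static bool wr_tracker_can_post(const struct wr_tracker *t)
{
    return wr_tracker_outstanding(t) < t->limit;
}

/* Call only after ibv_post_send() has returned 0 (success). */
static void wr_tracker_on_post(struct wr_tracker *t)
{
    assert(wr_tracker_can_post(t));
    t->posted++;
}

/*
 * Call with the number of work requests retired by a poll. With every
 * send posted as IBV_SEND_SIGNALED this is the non-negative return
 * value of ibv_poll_cq(); a negative return is an error and must be
 * handled before calling this.
 */
static void wr_tracker_on_poll(struct wr_tracker *t, int n)
{
    assert(n >= 0 && (uint64_t)n <= wr_tracker_outstanding(t));
    t->completed += (uint64_t)n;
}
```

In use, the tracker would be initialized once per queue pair, for example `wr_tracker_init(&t, 1024, 512)` for the 1024-entry CQ from the original post and an assumed send queue depth of 512 (the thread does not state the actual depth). If some sends are posted without IBV_SEND_SIGNALED, the value passed to `wr_tracker_on_poll` would have to be the number of send queue entries each completion covers rather than the raw completion count.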
* Re: how to debug (mlx4) CQ overrun
@ 2011-10-08 11:21 Wendy Cheng
0 siblings, 0 replies; 3+ messages in thread
From: Wendy Cheng @ 2011-10-08 11:21 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> There are not really any tools, but this is usually straightforward to
> look at from your app.
>

Many thanks for the response. It helped (to ensure our cq handling
logic was ok). The issue turned out to be build related. After doing a
clean rebuild of the OFED IB modules with the modified header files, the
problem went away. The (header file) change was a result of exporting
kernel FMR (fast memory registration) to user space for an experimental
project.

Again, thank you for the write-up. It is much appreciated.

-- Wendy