* how to debug (mlx4) CQ overrun
@ 2011-09-23 21:15 Wendy Cheng
From: Wendy Cheng @ 2011-09-23 21:15 UTC
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
I have a test program that does RDMA read-write as follows:
node A: server listens and handles connection requests
setup a piece of memory initialized to "0"
node B: two processes parent & child
child:
1. setup a new channel with server, including a CQ with 1024 entries
(ibv_create_cq(ctx, 1024, NULL, channel, 0);)
2. RDMA sequential write (8192 bytes at a time) to server memory
4. sync with parent
parent:
1. setup the new channel with server, including a CQ with 1024 entries
(ibv_create_cq(ctx, 1024, NULL, channel, 0);)
3. RDMA sequential read (8192 bytes at a time) from the same piece of
memory on the server
- check the buffer contents.
- if memory content is still zero, re-read
4. sync with child
The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
mlx4_core 0000:06:00.0: CQ overrun on CQN 000087
I have my own counters that restrict the read (and write) to 512 max.
Both write and read are blocking (i.e. cq is polled after each
read/write). I suspect I do not have the cq poll logic correct. The
question here is .. is there any diag tool available to check on the
internal counters (and /or states) of ibverbs library and/or kernel
drivers (to help debug RDMA applications)? In my case, it hangs
around block 14546 (i.e. after 14546*8192 bytes).
Thanks,
Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: how to debug (mlx4) CQ overrun
@ 2011-09-23 21:30 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2011-09-23 21:30 UTC
To: Wendy Cheng; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:
> I have my own counters that restrict the read (and write) to 512 max.
> Both write and read are blocking (i.e. cq is polled after each
> read/write). I suspect I do not have the cq poll logic correct. The
> question here is .. is there any diag tool available to check on the
> internal counters (and /or states) of ibverbs library and/or kernel
> drivers (to help debug RDMA applications)? In my case, it hangs
> around block 14546 (i.e. after 14546*8192 bytes).
There are not really any tools, but this is usually straightforward to
look at from your app.
Every time you post to the send Q, increment a counter (A). Every time
you get a completion back from ibv_poll_cq, increment another counter (B).
The difference (A - B) must never exceed the number of entries in the CQ,
and it must not exceed the number of entries in the send Q (very important).
This assumes you are posting everything with IBV_SEND_SIGNALED. Doing
otherwise is basically the same but there is a bit more complexity to
manage the CQ counter as each completion represents multiple sendQ
entries.
Make sure you check for error codes from ibv_post_send.
Jason
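The accounting described above could be sketched roughly as follows. This
is only an illustration, not code from the thread: the struct and the
wr_* helper names are hypothetical, it assumes every work request is
posted with IBV_SEND_SIGNALED, and it assumes one send queue feeding one
CQ. The real ibv_post_send and ibv_poll_cq calls are indicated only in
comments so the bookkeeping can be read (and compiled) on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one send queue that completes into one CQ. */
struct wr_accounting {
    uint64_t posted;     /* work requests successfully posted */
    uint64_t completed;  /* completions returned by ibv_poll_cq() */
    uint32_t cq_depth;   /* the cqe value given to ibv_create_cq() */
    uint32_t sq_depth;   /* the max_send_wr value used when creating the QP */
};

static inline void wr_accounting_init(struct wr_accounting *a,
                                      uint32_t cq_depth, uint32_t sq_depth)
{
    a->posted = 0;
    a->completed = 0;
    a->cq_depth = cq_depth;
    a->sq_depth = sq_depth;
}

/* Posted-but-not-yet-polled work requests (the "A - B" of the text). */
static inline uint64_t wr_outstanding(const struct wr_accounting *a)
{
    return a->posted - a->completed;
}

/* True if one more signaled work request fits in both the SQ and the CQ. */
static inline bool wr_can_post(const struct wr_accounting *a)
{
    uint32_t limit = a->cq_depth < a->sq_depth ? a->cq_depth : a->sq_depth;
    return wr_outstanding(a) < limit;
}

/* Call only after ibv_post_send() has returned 0 for a single request. */
static inline void wr_note_posted(struct wr_accounting *a)
{
    assert(wr_can_post(a));   /* tripping this means the app over-posted */
    a->posted++;
}

/* Call with the non-negative return value of ibv_poll_cq(). */
static inline void wr_note_completed(struct wr_accounting *a, int n)
{
    assert(n >= 0 && (uint64_t)n <= wr_outstanding(a));
    a->completed += (uint64_t)n;
}
```

In an application, wr_can_post() would be checked before each
ibv_post_send, and the CQ would be polled until it returns true again. If
the asserts never fire and the overrun still appears, the application
logic is probably not the cause. Unsignaled sends need more care, because
one completion then retires several send queue entries.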
* Re: how to debug (mlx4) CQ overrun
@ 2011-10-08 11:21 ` Wendy Cheng
From: Wendy Cheng @ 2011-10-08 11:21 UTC
To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> There are not really any tools, but this is usually straightforward to
> look at from your app.
>
Many thanks for the response. It helped (by confirming that our CQ handling
logic was OK). The issue turns out to be build related. After doing a
clean rebuild of OFED IB modules with the modified header files, the
problem went away. The (header file) change was a result of exporting
kernel FMR (fast memory registration) to user space for an
experimental project.
Again, thank you for the write-up. It is much appreciated.
-- Wendy