public inbox for linux-rdma@vger.kernel.org
* how to debug (mlx4) CQ overrun
@ 2011-09-23 21:15 Wendy Cheng
From: Wendy Cheng @ 2011-09-23 21:15 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I have a test program that does RDMA read-write as follows:

node A: server listens and handles connection requests
               setup a piece of memory initialized to "0"
node B: two processes parent & child

child:
  1. setup a new channel with server, including a CQ with 1024 entries
                (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes a time) to server memory
  4. sync with parent

parent:
   1. setup the new channel with server, including a CQ with 1024 entries
                  (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
    3. RDMA sequential read (8192 bytes a time) of the same piece of
memory from server
                 - check the buffer contents.
                 - if memory content is still zero, re-read
    4. sync with child

The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
 mlx4_core 0000:06:00.0: CQ overrun on CQN 000087

I have my own counters that restrict the read (and write) to 512 max.
Both write and read are blocking (i.e. the CQ is polled after each
read/write). I suspect my CQ poll logic is not correct. The question
here is: is there any diagnostic tool available to check the internal
counters (and/or states) of the ibverbs library and/or kernel drivers
(to help debug RDMA applications)? In my case, it hangs around
block 14546 (i.e. after 14546*8192 bytes).

Thanks,
Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: how to debug (mlx4) CQ overrun
@ 2011-09-23 21:30   ` Jason Gunthorpe
From: Jason Gunthorpe @ 2011-09-23 21:30 UTC
  To: Wendy Cheng; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 23, 2011 at 02:15:30PM -0700, Wendy Cheng wrote:

> I have my own counters that restrict the read (and write) to 512 max.
> Both write and read are blocking (i.e. the CQ is polled after each
> read/write). I suspect my CQ poll logic is not correct. The question
> here is: is there any diagnostic tool available to check the internal
> counters (and/or states) of the ibverbs library and/or kernel drivers
> (to help debug RDMA applications)? In my case, it hangs around
> block 14546 (i.e. after 14546*8192 bytes).

There are not really any tools, but this is usually straightforward to
look at from your app.

Every time you post to the send Q, increment a counter (call it A).
For every completion you get back from ibv_poll_cq, increment another
counter (call it B).

(A - B) must never exceed the number of entries in the CQ, and it
must not exceed the number of entries in the send Q (very important).

This assumes you are posting everything with IBV_SEND_SIGNALED. Doing
otherwise is basically the same but there is a bit more complexity to
manage the CQ counter as each completion represents multiple sendQ
entries.

Make sure you check for error codes from ibv_post_send.

Jason

* Re: how to debug (mlx4) CQ overrun
@ 2011-10-08 11:21       ` Wendy Cheng
From: Wendy Cheng @ 2011-10-08 11:21 UTC
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> There are not really any tools, but this is usually straightforward to
> look at from your app.
>

Many thanks for the response. It helped (to confirm our CQ handling
logic was OK). The issue turned out to be build related. After doing a
clean rebuild of the OFED IB modules with the modified header files,
the problem went away. The (header file) change was a result of
exporting kernel FMR (fast memory registration) to user space for an
experimental project.

Again, thank you for the write-up. It is much appreciated.

-- Wendy
