* asynchronous operation with poll()
@ 2010-11-09 15:58 Jonathan Rosser
2010-11-09 20:44 ` Jason Gunthorpe
0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Rosser @ 2010-11-09 15:58 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
I have a client and server test program to explore fully asynchronous
communication written as close to a conventional sockets application as
possible and am encountering difficulty.
Both programs run the same code in a thread, sending buffers to each
other as fast as possible. On the client side only, my poll() call never
blocks and cm_id->send_cq_channel->fd always seems to be readable. This
causes the program to loop wildly and consume 100% CPU.
Any ideas? I have ensured that O_NONBLOCK is set on the underlying file
descriptors. I'm not sure why the server side should run with almost no
cpu usage yet the client does not.
Here is the client/server loop:
> struct ibv_mr *mr;
> int ret;
> int send_buf_num = 0;
> int recv_buf_num = 0;
>
> #define NUM_BUFFERS 20
> #define SIZE 1024*1024
> uint8_t *buffer = (uint8_t*)malloc(SIZE * NUM_BUFFERS * 2);
> uint8_t *send_msg[NUM_BUFFERS];
> uint8_t *recv_msg[NUM_BUFFERS];
>
> for(int i=0; i<NUM_BUFFERS; i++) {
> send_msg[i] = buffer + (i*SIZE);
> recv_msg[i] = buffer + ((i+NUM_BUFFERS) * SIZE);
> }
>
> //--------------------------------------------------------------------
> // setup
> fprintf(stderr, "rdma_reg_msgs\n");
> mr = rdma_reg_msgs(cm_id, buffer, SIZE*NUM_BUFFERS*2);
> if (!mr) {
> perror("rdma_reg_msgs");
> }
>
> //prepare for the first receive before connecting
> for(int i=0; i<10; i++) {
> fprintf(stderr, "rdma_post_recv\n");
> ret = rdma_post_recv(cm_id, NULL, recv_msg[recv_buf_num++], SIZE, mr);
> recv_buf_num %= NUM_BUFFERS;
> if (ret) {
> perror("rdma_post_recv");
> }
> }
>
> //connect
> fprintf(stderr, "rdma_connect\n");
> ret = rdma_connect(cm_id, NULL);
> if (ret) {
> perror("rdma_connect");
> }
>
> const int NUM_FDS = 4;
>
> const int POLL_CM = 0;
> const int POLL_RECV_CQ = 1;
> const int POLL_SEND_CQ = 2;
> const int POLL_WAKE = 3;
> struct pollfd fds[NUM_FDS];
>
> //prime notification of events on the recv completion queue
> ibv_req_notify_cq(cm_id->recv_cq, 0);
> //
>
> //--------------------------------------------------------------------
> // main loop
> while(ret == 0)
> {
> memset(fds, 0, sizeof(pollfd) * NUM_FDS);
> fds[POLL_CM].fd = cm_channel->fd;
> fds[POLL_CM].events = POLLIN;
>
> fds[POLL_RECV_CQ].fd = cm_id->recv_cq_channel->fd;
> fds[POLL_RECV_CQ].events = POLLIN;
>
> fds[POLL_SEND_CQ].fd = cm_id->send_cq_channel->fd;
> fds[POLL_SEND_CQ].events = POLLIN;
>
> fds[POLL_WAKE].fd = wake_fds[0];
> fds[POLL_WAKE].events = POLLIN;
>
> int nready = poll(fds, NUM_FDS, -1);
> if(nready < 0) {
> perror("poll");
> }
>
> if(fds[POLL_CM].revents & POLLIN) {
> struct rdma_cm_event *cm_event;
> ret = rdma_get_cm_event(cm_channel, &cm_event);
> if(ret) {
> perror("client connection rdma_get_cm_event");
> }
> fprintf(stderr, "Got cm event %s\n", rdma_event_str(cm_event->event));
>
> if(cm_event->event == RDMA_CM_EVENT_ESTABLISHED) {
> //send as soon as we are connected
> ibv_req_notify_cq(cm_id->send_cq, 0);
> ret = rdma_post_send(cm_id, NULL, send_msg[send_buf_num], SIZE, mr, 0);
> send_buf_num++;
> send_buf_num %= NUM_BUFFERS;
> if (ret) {
> perror("rdma_post_send");
> }
> }
>
> int finish=0;
> if(cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL)
> finish = 1;
>
> rdma_ack_cm_event(cm_event);
> if(finish) {
> goto out;
> }
> }
>
> //if the send completed
> if(fds[POLL_SEND_CQ].revents & POLLIN) {
> struct ibv_cq *cq;
> struct ibv_wc wc[10];
> void *context;
> int num_send = ibv_poll_cq(cm_id->send_cq, 10, &wc[0]);
>
> if(num_send == 0) fprintf(stderr, ".");
>
> for(int i=0; i<num_send; i++) {
> fprintf(stderr,"Got SEND CQ event : %d of %d %s\n", i, num_send, ibv_wc_status_str(wc[i].status));
> ibv_get_cq_event(cm_id->send_cq_channel, &cq, &context);
> assert(cq == cm_id->send_cq);
>
> //our send completed, send some more right away
> fprintf(stderr, "rdma_post_send\n");
> ret = rdma_post_send(cm_id, NULL, send_msg[send_buf_num++], SIZE, mr, 0);
> send_buf_num %= NUM_BUFFERS;
> if (ret) {
> perror("rdma_post_send");
> }
> }
>
> //expensive call, ack all received events together
> ibv_ack_cq_events(cm_id->send_cq, num_send);
> ibv_req_notify_cq(cm_id->send_cq, 0);
> }
>
> //if the receive completed, prepare to receive more
> if(fds[POLL_RECV_CQ].revents & POLLIN) {
> struct ibv_cq *cq;
> struct ibv_wc wc[10];
> void *context;
> int num_recv=ibv_poll_cq(cm_id->recv_cq, 10, &wc[0]);
>
> for(int i=0; i<num_recv; i++) {
> fprintf(stderr,"Got RECV CQ event : %d of %d %s\n", i, num_recv, ibv_wc_status_str(wc[i].status));
> ibv_get_cq_event(cm_id->recv_cq_channel, &cq, &context);
> assert(cq == cm_id->recv_cq);
>
> //we received some payload, prepare to receive more
> fprintf(stderr, "rdma_post_recv\n");
> ret = rdma_post_recv(cm_id, NULL, recv_msg[recv_buf_num++], SIZE, mr);
> recv_buf_num %= NUM_BUFFERS;
> if (ret) {
> perror("rdma_post_recv");
> }
> }
>
> //expensive call, ack all received events together
> ibv_ack_cq_events(cm_id->recv_cq, num_recv);
> ibv_req_notify_cq(cm_id->recv_cq, 0);
> }
>
> if(fds[POLL_WAKE].revents & POLLIN) {
> fprintf(stderr, "poll WAKE\n");
> char buffer[1];
> int nread = read(wake_fds[0], buffer, 1);
> fprintf(stderr, "Got Wake event %d\n", nread);
> goto out;
> }
>
> }
>
> out:
> rdma_disconnect(cm_id);
> rdma_dereg_mr(mr);
> rdma_destroy_ep(cm_id);
>
> free(buffer);
> fprintf(stderr, "poll: client completed\n");
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
2010-11-09 15:58 asynchronous operation with poll() Jonathan Rosser
@ 2010-11-09 20:44 ` Jason Gunthorpe
[not found] ` <20101109204452.GG909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2010-11-09 20:44 UTC (permalink / raw)
To: Jonathan Rosser; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Tue, Nov 09, 2010 at 03:58:27PM +0000, Jonathan Rosser wrote:
> I have a client and server test program to explore fully
> asynchronous communication written as close to a conventional
> sockets application as possible and am encountering difficulty.
Broadly it looks to me like your actions are in the wrong order.
A poll based RDMA loop should look like this:
- exit poll
- Check poll bit
- call ibv_get_cq_event
- call ibv_req_notify_cq
- repeatedly call ibv_poll_cq (while rc == num requested)
- Issue new work
- return to poll
Generally, for your own sanity, I recommend splitting into 3 functions
- Do the stuff with ibv_get_cq_event
- Drain and process WC's
- Issue new work
Most real use cases will also want to call the latter two functions
from other waiters in the poll loop (ie whatever your wake_fds is for).
Some random mild comments for you:
> > const int NUM_FDS = 4;
> >
> > const int POLL_CM = 0;
> > const int POLL_RECV_CQ = 1;
> > const int POLL_SEND_CQ = 2;
> > const int POLL_WAKE = 3;
You can use an enum for these constants
> > //prime notification of events on the recv completion queue
> > ibv_req_notify_cq(cm_id->recv_cq, 0);
Do this earlier, before posting recvs, otherwise you could race
getting your first recv.
> > while(ret == 0)
> > {
> > memset(fds, 0, sizeof(pollfd) * NUM_FDS);
> > fds[POLL_CM].fd = cm_channel->fd;
> > fds[POLL_CM].events = POLLIN;
> >
> > fds[POLL_RECV_CQ].fd = cm_id->recv_cq_channel->fd;
> > fds[POLL_RECV_CQ].events = POLLIN;
> >
> > fds[POLL_SEND_CQ].fd = cm_id->send_cq_channel->fd;
> > fds[POLL_SEND_CQ].events = POLLIN;
> >
> > fds[POLL_WAKE].fd = wake_fds[0];
> > fds[POLL_WAKE].events = POLLIN;
The efficient use of poll does not put these inside the main loop. You
only need to initialize fd and events once at the start.
> > if(fds[POLL_CM].revents & POLLIN) {
> > struct rdma_cm_event *cm_event;
> > ret = rdma_get_cm_event(cm_channel, &cm_event);
> > if(ret) {
> > perror("client connection rdma_get_cm_event");
> > }
> > fprintf(stderr, "Got cm event %s\n", rdma_event_str(cm_event->event));
> >
> > if(cm_event->event == RDMA_CM_EVENT_ESTABLISHED) {
> > //send as soon as we are connected
> > ibv_req_notify_cq(cm_id->send_cq, 0);
Again, this should be done once, right after the cq is created.
> > //if the send completed
> > if(fds[POLL_SEND_CQ].revents & POLLIN) {
> > struct ibv_cq *cq;
> > struct ibv_wc wc[10];
> > void *context;
> > int num_send = ibv_poll_cq(cm_id->send_cq, 10, &wc[0]);
> > if(num_send == 0) fprintf(stderr, ".");
Check that num_sends == 10 and loop again
> > for(int i=0; i<num_send; i++) {
> > fprintf(stderr,"Got SEND CQ event : %d of %d %s\n", i, num_send, ibv_wc_status_str(wc[i].status));
> > ibv_get_cq_event(cm_id->send_cq_channel, &cq, &context);
cq_events are not tied to send WC's, this should be done
exactly once, prior to calling ibv_poll_cq
> > //expensive call, ack all received events together
> > ibv_ack_cq_events(cm_id->send_cq, num_send);
You don't have to do this at all in the loop unless you are
doing multithreaded things. Using num_send is wrong, I use this:
bool checkCQPoll(struct pollfd &p)
{
if ((p.revents & POLLIN) == 0 ||
ibv_get_cq_event(comp,&jnk1,&jnk2) != 0)
return false;
compEvents++;
if (compEvents >= INT_MAX)
{
ibv_ack_cq_events(cq,compEvents);
compEvents = 0;
}
int rc;
if ((rc = ibv_req_notify_cq(cq,0)) == -1)
{
errno = rc;
[..]
And then call ibv_ack_cq_events(cq,compEvents) before trying to
destroy the CQ. All it is used for is synchronizing exits between threads.
> > ibv_req_notify_cq(cm_id->send_cq, 0);
Do right after calling ibv_get_cq_event
> > //if the receive completed, prepare to receive more
> > if(fds[POLL_RECV_CQ].revents & POLLIN) {
> > struct ibv_cq *cq;
> > struct ibv_wc wc[10];
> > void *context;
> > int num_recv=ibv_poll_cq(cm_id->recv_cq, 10, &wc[0]);
Same problems as for send, they should be the same. Implement a
function like my checkCQPoll example and call it for both cases.
Continually posting sends and recvs will get you into trouble, you
will run out of recvs and get RNR's. These days the wisdom for
implementing RDMA is that you should have explicit message flow
control. Ie for something simple like this you could say that getting
a recv means another send is OK, but you still need a mechanism to
wait for a send buffer to be returned on the send CQ - there is no
ordering guarantee.
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
[not found] ` <20101109204452.GG909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-11-10 10:30 ` Andrea Gozzelino
[not found] ` <4538690.1289385043068.SLOX.WebMail.wwwrun-XDIR3SKYeFbgKi2NxijLtw@public.gmane.org>
2010-11-10 14:39 ` Jonathan Rosser
1 sibling, 1 reply; 9+ messages in thread
From: Andrea Gozzelino @ 2010-11-10 10:30 UTC (permalink / raw)
To: Jonathan Rosser; +Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: text/plain, Size: 6035 bytes --]
Hi Jonathan,
I wrote down a test (latency and transfer speed) with RDMA.
Server and client work with the same code and they change defined size
buffers for n times (loop). In the makefile.txt, you can find help on how to
use the code.
I tested Intel NetEffect NE020 E10G81GP cards with this code and I found
minimum latency about 11 us,maximum transfer speed about 9,6 GBytes, CPU
usage up to 90% on client side.
The last value is not good for us.
I hope these help you.
Thank you,
Andrea
Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2 -I-35020 - Legnaro (PD)- ITALIA
Office: E-101
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzelino-PK20h7lG/Rc1GQ1Ptb7lUw@public.gmane.org
Cell: +39 3488245552
On Nov 09, 2010 09:44 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, Nov 09, 2010 at 03:58:27PM +0000, Jonathan Rosser wrote:
> > I have a client and server test program to explore fully
> > asynchronous communication written as close to a conventional
> > sockets application as possible and am encountering difficulty.
>
> Broadly it looks to me like your actions are in the wrong order.
> A poll based RDMA loop should look like this:
>
> - exit poll
> - Check poll bit
> - call ibv_get_cq_event
> - call ibv_req_notify_cq
> - repeatedly call ibv_poll_cq (while rc == num requested)
> - Issue new work
> - return to poll
>
> Generally, for your own sanity, I recommend splitting into 3 functions
> - Do the stuff with ibv_get_cq_event
> - Drain and process WC's
> - Issue new work
>
> Most real use cases will also want to call the latter two functions
> from other waiters in the poll loop (ie whatever your wak_fds is for).
>
> Some random mild comments for you:
>
> > > const int NUM_FDS = 4;
> > >
> > > const int POLL_CM = 0;
> > > const int POLL_RECV_CQ = 1;
> > > const int POLL_SEND_CQ = 2;
> > > const int POLL_WAKE = 3;
>
> You can use an enum for these constants
>
> > > //prime notification of events on the recv completion queue
> > > ibv_req_notify_cq(cm_id->recv_cq, 0);
>
> Do this earlier, before posting recvs, otherwise you could race
> getting your first recv.
>
>
> > > while(ret == 0)
> > > {
> > > memset(fds, 0, sizeof(pollfd) * NUM_FDS);
> > > fds[POLL_CM].fd = cm_channel->fd;
> > > fds[POLL_CM].events = POLLIN;
> > >
> > > fds[POLL_RECV_CQ].fd = cm_id->recv_cq_channel->fd;
> > > fds[POLL_RECV_CQ].events = POLLIN;
> > >
> > > fds[POLL_SEND_CQ].fd = cm_id->send_cq_channel->fd;
> > > fds[POLL_SEND_CQ].events = POLLIN;
> > >
> > > fds[POLL_WAKE].fd = wake_fds[0];
> > > fds[POLL_WAKE].events = POLLIN;
>
> The efficient use of poll does not put these inside the main loop. You
> only need to initialize fd and events once at the start.
>
> > > if(fds[POLL_CM].revents & POLLIN) {
> > > struct rdma_cm_event *cm_event;
> > > ret = rdma_get_cm_event(cm_channel, &cm_event);
> > > if(ret) {
> > > perror("client connection rdma_get_cm_event");
> > > }
> > > fprintf(stderr, "Got cm event %s
", rdma_event_str(cm_event->event));
> > >
> > > if(cm_event->event == RDMA_CM_EVENT_ESTABLISHED) {
> > > //send as soon as we are connected
> > > ibv_req_notify_cq(cm_id->send_cq, 0);
>
> Again, this should be done once, right after the cq is created.
>
> > > //if the send completed
> > > if(fds[POLL_SEND_CQ].revents & POLLIN) {
> > > struct ibv_cq *cq;
> > > struct ibv_wc wc[10];
> > > void *context;
> > > int num_send = ibv_poll_cq(cm_id->send_cq, 10, &wc[0]);
> > > if(num_send == 0) fprintf(stderr, ".");
>
> Check that num_sends == 10 and loop again
>
> > > for(int i=0; i<num_send; i++) {
> > > fprintf(stderr,"Got SEND CQ event : %d of %d %s
", i, num_send, ibv_wc_status_str(wc[i].status));
> > > ibv_get_cq_event(cm_id->send_cq_channel, &cq, &context);
>
> cq_events are not tied to send WC's, this should be done
> exactly once, prior to calling ibv_poll_cq
>
> > > //expensive call, ack all received events together
> > > ibv_ack_cq_events(cm_id->send_cq, num_send);
>
> You don't have to do this at all in the loop unless you are
> doing multithreaded things. Using num_send is wrong, I use this:
>
> bool checkCQPoll(struct pollfd &p)
> {
> if ((p.revents & POLLIN) == 0 ||
> ibv_get_cq_event(comp,&jnk1,&jnk2) != 0)
> return false;
>
> compEvents++;
> if (compEvents >= INT_MAX)
> {
> ibv_ack_cq_events(cq,compEvents);
> compEvents = 0;
> }
> int rc;
> if ((rc = ibv_req_notify_cq(cq,0)) == -1)
> {
> errno = rc;
> [..]
>
> And then call ibv_ack_cq_events(cq,compEvents) before trying to
> destroy the CQ. All it is used for is synchronizing exits between
> threads.
>
> > > ibv_req_notify_cq(cm_id->send_cq, 0);
>
> Do right after calling ibv_get_cq_event
>
> > > //if the receive completed, prepare to receive more
> > > if(fds[POLL_RECV_CQ].revents & POLLIN) {
> > > struct ibv_cq *cq;
> > > struct ibv_wc wc[10];
> > > void *context;
> > > int num_recv=ibv_poll_cq(cm_id->recv_cq, 10, &wc[0]);
>
> Same problems as for send, they should be the same. Implement a
> function like my checkCQPoll example and call it for both cases.
>
> Continually posting sends and recvs will get you into trouble, you
> will run out of recvs and get RNR's. These days the wisdom for
> implementing RDMA is that you should have explicit message flow
> control. Ie for something simple like this you could say that getting
> a recv means another send is OK, but you still need a mechanism to
> wait for a send buffer to be returned on the send CQ - there is no
> ordering guarantee.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
[-- Attachment #2: server_client.tar.gz --]
[-- Type: application/x-gzip, Size: 65406 bytes --]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
[not found] ` <20101109204452.GG909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-11-10 10:30 ` Andrea Gozzelino
@ 2010-11-10 14:39 ` Jonathan Rosser
2010-11-10 17:43 ` Roland Dreier
2010-11-10 18:04 ` Jason Gunthorpe
1 sibling, 2 replies; 9+ messages in thread
From: Jonathan Rosser @ 2010-11-10 14:39 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 11/09/10 20:44, Jason Gunthorpe wrote:
> Broadly it looks to me like your actions are in the wrong order.
> A poll based RDMA loop should look like this:
>
> - exit poll
> - Check poll bit
> - call ibv_get_cq_event
> - call ibv_req_notify_cq
> - repeatedly call ibv_poll_cq (while rc == num requested)
> - Issue new work
> - return to poll
>
> Generally, for your own sanity, I recommend splitting into 3 functions
> - Do the stuff with ibv_get_cq_event
> - Drain and process WC's
> - Issue new work
>
Hi Jason,
Thanks very much for your advice. I had misunderstood the relationship
between CQ events and available WC's.
> doing multithreaded things. Using num_send is wrong, I use this:
>
> bool checkCQPoll(struct pollfd&p)
Right - using a function like your checkCQPoll has sorted out the
behaviour of the poll() loop.
> Continually posting sends and recvs will get you into trouble, you
> will run out of recvs and get RNR's. These days the wisdom for
> implementing RDMA is that you should have explicit message flow
OK - I appreciate that a real world protocol ought to have flow control
rather than just send as fast as possible. I've been trying to exercise
the interfaces as far as possible and make sure my RDMA implementation
is solid before building something real on top of it.
> control. Ie for something simple like this you could say that getting
> a recv means another send is OK, but you still need a mechanism to
> wait for a send buffer to be returned on the send CQ - there is no
> ordering guarantee.
Could I get some clarification on where there is no ordering guarantee?
The WC's do not necessarily come back in the order that the sends were
posted?
Many Thanks,
Jonathan.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
[not found] ` <4538690.1289385043068.SLOX.WebMail.wwwrun-XDIR3SKYeFbgKi2NxijLtw@public.gmane.org>
@ 2010-11-10 16:05 ` Jonathan Rosser
2010-11-11 8:43 ` Andrea Gozzelino
0 siblings, 1 reply; 9+ messages in thread
From: Jonathan Rosser @ 2010-11-10 16:05 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 11/10/10 10:30, Andrea Gozzelino wrote:
> Hi Jonathan,
>
> I wrote down a test (latency and transfer speed) with RDMA.
> Server and client work with the same code and they change defined size
> buffers for n times (loop). In the makefile.txt, you can find an help to
> use the code.
>
> I tested Intel NetEffect NE020 E10G81GP cards with this code and I found
> minimum latency about 11 us,maximum transfer speed about 9,6 GBytes, CPU
> usage up to 90% on client side.
> The last value is not good for us.
Hi Andrea,
Thanks for the code. With the advice from Jason I have changed my test
program to get reliable communication using 1Mbyte buffers. The CPU
usage is less than 2% on both client and server for 10Gb throughput. I
have Chelsio S310CR.
I find using the poll() approach more natural as I have experience with
conventional sockets based programming before.
The rdma_client/rdma_server example programs from librdmacm were the
easiest to start from and I have incrementally changed them from
synchronous to asynchronous operation, and moved the internals of the
high level functions in <rdma/rdma_verbs.h> into my own code piece by
piece. The learning curve is very steep :)
I found this paper quite interesting
www.systems.ethz.ch/research/awards/minimizingthehidden.pdf
Cheers,
Jonathan.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
2010-11-10 14:39 ` Jonathan Rosser
@ 2010-11-10 17:43 ` Roland Dreier
[not found] ` <adak4klqdlb.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2010-11-10 18:04 ` Jason Gunthorpe
1 sibling, 1 reply; 9+ messages in thread
From: Roland Dreier @ 2010-11-10 17:43 UTC (permalink / raw)
To: Jonathan Rosser; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
> Could I get some clarification on where there is no ordering
> guarantee? The WC's do not necessarily come back in the order that the
> sends were posted?
For a given queue, completions are always returned in the order that
work requests were posted. However there is no ordering between
different queues.
- R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
[not found] ` <adak4klqdlb.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
@ 2010-11-10 17:56 ` Jason Gunthorpe
0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2010-11-10 17:56 UTC (permalink / raw)
To: Roland Dreier; +Cc: Jonathan Rosser, linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, Nov 10, 2010 at 09:43:12AM -0800, Roland Dreier wrote:
> > Could I get some clarification on where there is no ordering
> > guarantee? The WC's do not necessarily come back in the order that the
> > sends were posted?
>
> For a given queue, completions are always returned in the order that
> work requests were posted. However there is no ordering between
> different queues.
Right, so if you design a scheme where getting a recv grants a send
credit then you are doing
Post Send #1 ---> Recv WC
Recv WC <---- Post Send #2
Send WC #2 Send WC #2
PostSend #3 ---> Recv WC
and there is no order guarantee for when you get a Recv WC vs a Send
WC, even though by protocol design they are in fact ordered. If you
load the HCA the Send Complete WC's do get behind Recv WCs.
This is true even if you use the same CQ for Recv WC and Send Complete
WCs.
In practice this can make the implementation quite troublesome since
the posting of a new Send can't necessarily be done in the Recv WC
path as no send buffers could be available at that time.
I've found properly placing all the activities in a RDMA system to be
the hardest challenge of the whole design. Avoiding deadlock and
starvation and untangling the various data dependencies can be
hard :(
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
2010-11-10 14:39 ` Jonathan Rosser
2010-11-10 17:43 ` Roland Dreier
@ 2010-11-10 18:04 ` Jason Gunthorpe
1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2010-11-10 18:04 UTC (permalink / raw)
To: Jonathan Rosser; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, Nov 10, 2010 at 02:39:03PM +0000, Jonathan Rosser wrote:
> >Continually posting sends and recvs will get you into trouble, you
> >will run out of recvs and get RNR's. These days the wisdom for
> >implementing RDMA is that you should have explicit message flow
>
> OK - I appreciate that a real world protocol ought to have flow
> control rather than just send as fast as possible. I've been trying
> to exercise the interfaces as far as possible and make sure my RDMA
> implementation is solid before building something real on top of it.
With iWarp if you send more than the receiver is able to receive you
will break the connection with an error. Some kind of flow control
scheme is mandatory..
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: asynchronous operation with poll()
2010-11-10 16:05 ` Jonathan Rosser
@ 2010-11-11 8:43 ` Andrea Gozzelino
0 siblings, 0 replies; 9+ messages in thread
From: Andrea Gozzelino @ 2010-11-11 8:43 UTC (permalink / raw)
To: Jonathan Rosser; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Nov 10, 2010 05:05 PM, Jonathan Rosser <jonathan.rosser-gjMux1o1B1/QXOPxS62xeg@public.gmane.org>
wrote:
> On 11/10/10 10:30, Andrea Gozzelino wrote:
> > Hi Jonathan,
> >
> > I wrote down a test (latency and transfer speed) with RDMA.
> > Server and client work with the same code and they change defined
> > size
> > buffers for n times (loop). In the makefile.txt, you can find an
> > help to
> > use the code.
> >
> > I tested Intel NetEffect NE020 E10G81GP cards with this code and I
> > found
> > minimum latency about 11 us,maximum transfer speed about 9,6 GBytes,
> > CPU
> > usage up to 90% on client side.
> > The last value is not good for us.
>
> Hi Andrea,
>
> Thanks for the code. With the advice from Jason I have changed my test
> program to get reliable communication using 1Mbyte buffers. The CPU
> usage is less than 2% on both client and server for 10Gb throughput. I
> have Chelsio S310CR.
>
> I find using the poll() approach more natural as I have experience
> with
> conventional sockets based programming before.
>
> The rdma_client/rdma_server example programs from librdmacm were the
> easiest to start from and I have incrementally changed them from
> synchronous to asynchronous operation, and moved the internals of the
> high level functions in <rdma/rdma_verbs.h> into my own code piece by
> piece. The learning curve is very steep :)
>
> I found this paper quite interesting
> www.systems.ethz.ch/research/awards/minimizingthehidden.pdf
>
> Cheers,
> Jonathan.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Hi Jonathan,
I would like testing my Intel NetEffect cards with your code, if it is
possible.
Could you please send me the files client/server and the compiling
commands?
Thank you very much for the pdf paper: I'm going to read it during the
morning.
Cheers,
Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2 -I-35020 - Legnaro (PD)- ITALIA
Office: E-101
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzelino-PK20h7lG/Rc1GQ1Ptb7lUw@public.gmane.org
Cell: +39 3488245552
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-11-11 8:43 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-09 15:58 asynchronous operation with poll() Jonathan Rosser
2010-11-09 20:44 ` Jason Gunthorpe
[not found] ` <20101109204452.GG909-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-11-10 10:30 ` Andrea Gozzelino
[not found] ` <4538690.1289385043068.SLOX.WebMail.wwwrun-XDIR3SKYeFbgKi2NxijLtw@public.gmane.org>
2010-11-10 16:05 ` Jonathan Rosser
2010-11-11 8:43 ` Andrea Gozzelino
2010-11-10 14:39 ` Jonathan Rosser
2010-11-10 17:43 ` Roland Dreier
[not found] ` <adak4klqdlb.fsf-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
2010-11-10 17:56 ` Jason Gunthorpe
2010-11-10 18:04 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox