Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation

From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com,
	abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com,
	pbonzini@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Thu, 11 Apr 2013 17:37:18 +0300
Message-ID: <20130411143718.GC24942@redhat.com>
In-Reply-To: <5166C19A.1040402@linux.vnet.ibm.com>

On Thu, Apr 11, 2013 at 09:58:50AM -0400, Michael R. Hines wrote:
> On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >>>On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >>>Maybe we should just say "RDMA is incompatible with memory
> >>>overcommit" and be done with it then. But see below.
> >>>>I would like to propose a compromise:
> >>>>
> >>>>How about we *keep* the registration capability and leave it enabled
> >>>>by default?
> >>>>
> >>>>This gives management tools the ability to get performance if they want to,
> >>>>but also satisfies your requirements in case management doesn't know the
> >>>>feature exists - they will just get the default enabled?
> >>>Well unfortunately the "overcommit" feature as implemented seems useless
> >>>really.  Someone wants to migrate with RDMA but with low performance?
> >>>Why not migrate with TCP then?
> >>Answer below.
> >>
> >>>>Either way, I agree that the optimization would be very useful,
> >>>>but I disagree that it is possible for an optimized registration algorithm
> >>>>to perform *as well as* the case when there is no dynamic
> >>>>registration at all.
> >>>>
> >>>>The point is that dynamic registration *only* helps overcommitment.
> >>>>
> >>>>It does nothing for performance - and since that's true any optimizations
> >>>>that improve on dynamic registrations will always be sub-optimal to turning
> >>>>off dynamic registration in the first place.
> >>>>
> >>>>- Michael
> >>>So you've given up on it.  Question is, sub-optimal by how much?  And
> >>>where's the bottleneck?
> >>>
> >>>Let's do some math. Assume you send a 16-byte registration request and
> >>>get back a 16-byte response for each 4 Kbyte page (16 bytes enough?).  That's
> >>>32/4096 < 1% transport overhead. Negligible.
> >>>
> >>>Is it the source CPU then? But CPU on source is basically doing same
> >>>things as with pre-registration: you do not pin all memory on source.
> >>>
> >>>So it must be the destination CPU that does not keep up then?
> >>>But it has to do even less than the source CPU.
> >>>
> >>>I suggest one explanation: the protocol you proposed is inefficient.
> >>>It seems to basically do everything in a single thread:
> >>>get a chunk, pin, wait for control credit, request, response, rdma, unpin.
> >>>There are two round-trips of send/receive here where you are not
> >>>doing anything useful. Why not let migration proceed?
> >>>
> >>>Doesn't all of this sound worth checking before we give up?
> >>>
> >>First, let me remind you:
> >>
> >>Chunks are already doing this!
> >>
> >>Perhaps you don't fully understand how chunks work, or perhaps I
> >>should be more verbose in the documentation. The protocol is already
> >>joining multiple pages into a single chunk without issuing any writes.
> >>It is only once the chunk is full that an actual page registration
> >>request occurs.
> >I think I got that at a high level.
> >But there is a stall between chunks. If you make chunks smaller,
> >but pipeline registration, then there will never be any stall.
> 
> Pipelining == chunking.

pipelining:
https://en.wikipedia.org/wiki/Pipeline_%28computing%29
chunking:
https://en.wikipedia.org/wiki/Chunking_%28computing%29

> You cannot eliminate the stall,
> that's impossible.

Sure, you can eliminate the stalls. Just hide them
behind data transfers. See a diagram below.


> You can *grow* the chunk size (i.e. the pipeline)
> to amortize the cost of the stall, but you cannot eliminate
> the stall at the end of the pipeline.
> 
> At some point you have to flush the pipeline (i.e. the chunk),
> whether you like it or not.

You can process many chunks in parallel. Make chunks smaller but process
them in a pipelined fashion.  Yes, the pipe might stall, but it won't if
the receive side is as fast as the send side, and then you won't have to
flush at all.


> >>So, basically what you want to know is what happens if we *change*
> >>the chunk size
> >>dynamically?
> >What I wanted to know is: where is the performance going?
> >Why is chunk-based slower? It's not the extra messages
> >on the wire; these take up negligible BW.
> 
> Answer above.


Here's how things are supposed to work in a pipeline:

req -> registration request
res -> response
done -> rdma done notification (remote can unregister)
pgX  -> page, or chunk, or whatever unit is used
        for registration
rdma -> one or more rdma write requests



pg1 ->  pin -> req -> res -> rdma -> done
        pg2 ->  pin -> req -> res -> rdma -> done
                pg3 -> pin -> req -> res -> rdma -> done
                       pg4 -> pin -> req -> res -> rdma -> done
                              pg5 -> pin -> req -> res -> rdma -> done



It's like an assembly line, see?  So while software does the registration
roundtrip dance, hardware is processing rdma requests for previous
chunks.
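
To put rough numbers on the assembly line above, here is a minimal
sketch of a timing model. It is not taken from the patch series. It
assumes each unit costs a fixed software registration time `reg`
(pin + req + res) and a fixed hardware transfer time `xfer` (rdma),
both hypothetical parameters, and that software can register the next
unit while hardware transfers the previous one.

```python
def serial_time(n, reg, xfer):
    """Single thread: each unit waits for its registration, then its transfer."""
    return n * (reg + xfer)


def overlapped_time(n, reg, xfer):
    """Software registers unit i+1 while hardware transfers unit i."""
    reg_done = 0.0   # time at which the current registration completes
    link_free = 0.0  # time at which the previous rdma transfer completes
    for _ in range(n):
        reg_done += reg                   # registrations run back to back
        start = max(reg_done, link_free)  # rdma needs the response and a free link
        link_free = start + xfer
    return link_free


if __name__ == "__main__":
    for reg, xfer in [(2.0, 1.0), (1.0, 2.0)]:
        print("reg", reg, "xfer", xfer,
              "serial", serial_time(4, reg, xfer),
              "overlapped", overlapped_time(4, reg, xfer))
```

In this model, when the transfer takes at least as long as the
registration, only the first registration is exposed and the total is
reg + n * xfer. When registration is the slower stage, overlapping
still helps, but the link idles between transfers.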

....

When do you have to stall? When you run out of rx buffer credits, so you
cannot start a new req.  Your protocol has 2 outstanding buffers,
so you can only have one req in the air. Allow more outstanding requests
and you may never need to stall at all.
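
The effect of the number of credits can be sketched with a small
simulation. This is a simplified model, not the protocol in the patch:
it assumes a req consumes one credit, the credit comes back when the
matching res arrives after `reg_latency`, and rdma transfers are
serialized on one link at `xfer` each. The function names and
parameters are made up for illustration.

```python
import math


def pipelined_time(n, reg_latency, xfer, credits):
    """Total time to move n units with at most `credits` reqs in the air."""
    response_at = []
    link_free = 0.0
    for i in range(n):
        # A new req can be issued only once an earlier res returned a credit.
        issue = 0.0 if i < credits else response_at[i - credits]
        response_at.append(issue + reg_latency)
        start = max(response_at[i], link_free)
        link_free = start + xfer
    return link_free


def min_credits(reg_latency, xfer):
    """Smallest credit count that keeps the link busy in this model."""
    return max(1, math.ceil(reg_latency / xfer))


if __name__ == "__main__":
    for credits in (1, 2, 3):
        print(credits, pipelined_time(6, 3.0, 1.0, credits))
    print("credits needed:", min_credits(3.0, 1.0))
```

With one credit, every unit pays the full registration latency before
its transfer can start. Once credits * xfer covers reg_latency, the
link stays busy after the first response and the total settles at
reg_latency + n * xfer.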

One other minor point is that your protocol requires extra explicit
ready commands. You can pass the number of rx buffers as extra payload
in the traffic you are sending anyway, and reduce that overhead.
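
Piggybacking the buffer count could look roughly like the sketch below.
The header layout, field names and message type values are
hypothetical, chosen only for illustration. They are not the wire
format used by the patch series.

```python
import struct

# Hypothetical 8-byte control header, network byte order:
#   1 byte message type, 1 pad byte,
#   2 bytes rx buffers the sender has reposted, 4 bytes payload length.
HEADER = struct.Struct("!BxHI")

MSG_REGISTER_REQ = 1  # hypothetical type values
MSG_REGISTER_RES = 2


def pack_message(msg_type, rx_credits, payload=b""):
    """Prefix payload with a header that also advertises rx credits."""
    return HEADER.pack(msg_type, rx_credits, len(payload)) + payload


def unpack_message(data):
    """Return (msg_type, rx_credits, payload) from one packed message."""
    msg_type, rx_credits, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    return msg_type, rx_credits, payload


if __name__ == "__main__":
    wire = pack_message(MSG_REGISTER_RES, 4, b"rkey-bytes")
    print(unpack_message(wire))
```

Every message then refreshes the peer's view of available rx buffers,
so a separate ready command is only needed when no other traffic is
flowing.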

-- 
MST
