From: "Michael R. Hines"
Date: Wed, 10 Apr 2013 22:47:10 -0400
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation
To: Eric Blake
Cc: aliguori@us.ibm.com, mst@redhat.com, qemu-devel@nongnu.org, owasserm@redhat.com,
    abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com
Message-ID: <5166242E.1030904@linux.vnet.ibm.com>
In-Reply-To: <51662342.8090802@redhat.com>
References: <1365632901-15470-1-git-send-email-mrhines@linux.vnet.ibm.com>
 <1365632901-15470-13-git-send-email-mrhines@linux.vnet.ibm.com>
 <51662342.8090802@redhat.com>

Great comments, thanks.

On 04/10/2013 10:43 PM, Eric Blake wrote:
> On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"
>>
>> Full documentation on the rdma protocol: docs/rdma.txt
>>
>> Signed-off-by: Michael R. Hines
>> ---
>>  docs/rdma.txt | 331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 331 insertions(+)
>>  create mode 100644 docs/rdma.txt
>>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> new file mode 100644
>> index 0000000..ae68d2f
>> --- /dev/null
>> +++ b/docs/rdma.txt
>> @@ -0,0 +1,331 @@
>> +Changes since v6:
>> +
>> +(Thanks, Paolo - things look much cleaner now.)
>> +
>> +- Try to get patch-ordering correct =)
>> +- Much cleaner use of QEMUFileOps
>> +- Much fewer header files changes
>> +- Convert zero check capability to QMP command instead
>> +- Updated documentation
> The above text probably shouldn't be in the file.
>
>> +
>> +Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
>> +Github: git@github.com:hinesmr/qemu.git
>> +Contact: Michael R. Hines, mrhines@us.ibm.com
> Missing a copyright statement, but that's just following the example of
> other docs, so I guess it's okay?
>
>> +
>> +RDMA Live Migration Specification, Version # 1
>> +
>> +Contents:
>> +=================================
>> +* Running
>> +* RDMA Protocol Description
>> +* Versioning and Capabilities
>> +* QEMUFileRDMA Interface
>> +* Migration of pc.ram
>> +* Error handling
>> +* TODO
>> +* Performance
>> +
> No high-level overview of what the acronym RDMA even stands for?
>
>> +RUNNING:
>> +===============================
>> +
>> +First, decide if you want dynamic page registration on the server-side.
>> +This always happens on the primary-VM side, but is optional on the server.
>> +Doing this allows you to support overcommit (such as cgroups or ballooning)
>> +with a smaller footprint on the server-side without having to register the
>> +entire VM memory footprint.
>> +NOTE: This significantly slows down RDMA throughput (about 30% slower).
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default
> 'virsh qemu-monitor-command' is documented as unsupported by libvirt
> (it's intended solely as a development/debugging aid); but I guess until
> libvirt learns to expose RDMA support by default, this is okay for a
> first cut of documentation.  Furthermore, you are missing a domain argument.
>
> Do you really want to be requiring the user to do everything through
> libvirt?  This is qemu documentation, so you should document how things
> work without needing libvirt in the picture.
>
>> +
>> +Next, if you decided *not* to use chunked registration on the server,
>> +it is recommended to also disable zero page detection.  While this is not
>> +strictly necessary, zero page detection also significantly slows down
>> +throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_check_for_zero off" # enabled by default
> Missing a domain argument.
>
>> +
>> +Finally, set the migration speed to match your hardware's capabilities:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> This modifies qemu state behind libvirt's back, and won't necessarily do
> what you want if libvirt tries to change things back to the speed it
> thought it was managing.  Instead, use 'virsh migrate-setspeed $dom 40'.
>
>> +
>> +Finally, perform the actual migration:
>> +
>> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> That's not quite valid syntax for 'virsh migrate'.  Again, do you really
> want to be documenting libvirt's interface, or qemu's interface?
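
For reference, the same steps issued directly in the qemu monitor (with no
libvirt in the picture) would look roughly like this, using the capability
names and the rdma: URI from this series:

    (qemu) migrate_set_capability chunk_register_destination off
    (qemu) migrate_check_for_zero off
    (qemu) migrate_set_speed 40g
    (qemu) migrate -d rdma:xx.xx.xx.xx:port

with the destination-side qemu started ahead of time with the matching
"-incoming rdma:xx.xx.xx.xx:port" option, the same way a tcp: migration
is normally set up.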
>
>> +
>> +RDMA Protocol Description:
>> +=================================
> Aesthetics: match the length of === to the line above it.
>
> I'm not reviewing technical content, just face value...
>
>> +
>> +These two functions are very short and simply used the protocol
>> +describe above to deliver bytes without changing the upper-level
>> +users of QEMUFile that depend on a bytstream abstraction.
> s/bytstream/bytestream/
>
> ...
>> +
>> +After pinning, an RDMA Write is generated and tramsmitted
>> +for the entire chunk.
> s/tramsmitted/transmitted/
>
>> +5. Also, some form of balloon-device usage tracking would also
>> +   help aleviate some of these issues.
> s/aleviate/alleviate/
>
>> +
>> +PERFORMANCE
>> +===================
>> +
>> +Using a 40gbps infinband link performing a worst-case stress test:
> s/infinband/infiniband/
>
>> +
>> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
>> +Approximately 30 gpbs (little better than the paper)
> which paper?  Call that out in your high-level summary
>
> ...
>> +
>> +An *exhaustive* paper (2010) shows additional performance details
>> +linked on the QEMU wiki:
> Missing the actual reference?  And it would help to mention it at the
> beginning of the file.
>
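
For what it's worth, the "worst-case" workload quoted above is just
stress(1) run inside the guest:

    $ stress --vm-bytes 1024M --vm 1 --vm-keep

i.e. a single worker continuously re-dirtying the same 1024M of anonymous
memory, so every pass of the migration has to retransmit it.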