From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5165C60E.20006@linux.vnet.ibm.com>
Date: Wed, 10 Apr 2013 16:05:34 -0400
From: "Michael R. Hines"
References: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com> <1365476681-31593-4-git-send-email-mrhines@linux.vnet.ibm.com> <20130410052714.GB12777@redhat.com> <5165636C.1090908@linux.vnet.ibm.com> <20130410133448.GA18128@redhat.com> <51658554.2000909@linux.vnet.ibm.com> <20130410174107.GB32247@redhat.com>
In-Reply-To: <20130410174107.GB32247@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
To: "Michael S. Tsirkin"
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com,
 abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com

On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
>>>> Thanks.
>>>>
>>>> However, IMHO restricting the policy to only use chunk-based is really
>>>> not an acceptable choice:
>>>>
>>>> Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
>>>> dive from 10gbps to 6gbps.
>>> Who cares about the throughput really? What we do care about
>>> is how long the whole process takes.
>>>
>> Low latency and high throughput are very important =)
>>
>> Without these properties of RDMA, many workloads simply either
>> take too long to finish migrating or do not converge to a stopping
>> point altogether.
>>
>> *Not* making this a configurable option would defeat the purpose of
>> using RDMA altogether.
>>
>> Otherwise, you're no better off than just using TCP.
> So we have two protocols implemented: one is slow, the other pins all
> memory on destination indefinitely.
>
> I see two options here:
> - improve the slow version so it's fast, drop the pin-all version
> - give up and declare RDMA requires pinning all memory on destination
>
> But giving management a way to do RDMA at the speed of TCP? Why is this
> useful?
This is "useful" because of the overcommit concerns you brought up before,
which is the reason why I volunteered to write dynamic server registration
in the first place.

We never required that overcommit and performance had to co-exist. From
prior experience, I don't believe overcommit and good performance are
compatible with each other in general (i.e. using compression, page
sharing, etc.), but that's a debate for another day =)

I would like to propose a compromise: how about we *keep* the registration
capability and leave it enabled by default? This gives management tools
the ability to get performance if they want to, but it also satisfies
your requirements in case management doesn't know the feature exists -
they will simply get the (enabled) default.

>> But the problem is more complicated than that: there is no coordination
>> between the migration_thread and RDMA right now, because Paolo is
>> trying to maintain a very clean separation of function.
>>
>> However, we *can* do what you described in a future patch, like this:
>>
>> 1. Migration thread says "iteration starts, how much memory is dirty?"
>> 2. RDMA protocol says "Is there a lot of dirty memory?"
>>
>>    Yes? Then batch all the registration messages into a single request,
>>    but do not write the memory until all the registrations have
>>    completed.
>>
>>    No? Then just issue registrations with very little batching, so that
>>    we can quickly move on to the next iteration round.
>>
>> Make sense?
> Actually, I think you just need to get a page from migration core and
> give it to the FSM above. Then let it give you another page, until you
> have N pages in flight in the FSM, all at different stages in the
> pipeline. That's the theory.
>
> But if you want to try changing migration core, go wild. Very little
> is written in stone here.

The FSM and what I described are basically the same thing - I just
described it more abstractly than you did.
Either way, I agree that the optimization would be very useful, but I
disagree that it is possible for an optimized registration algorithm to
perform *as well as* the case where there is no dynamic registration at
all. The point is that dynamic registration *only* helps overcommitment;
it does nothing for performance. Since that's true, any optimization
that improves dynamic registration will always be sub-optimal compared
to turning dynamic registration off in the first place.

- Michael