From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5165C60E.20006@linux.vnet.ibm.com>
Date: Wed, 10 Apr 2013 16:05:34 -0400
From: "Michael R. Hines"
References: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com> <1365476681-31593-4-git-send-email-mrhines@linux.vnet.ibm.com> <20130410052714.GB12777@redhat.com> <5165636C.1090908@linux.vnet.ibm.com> <20130410133448.GA18128@redhat.com> <51658554.2000909@linux.vnet.ibm.com> <20130410174107.GB32247@redhat.com>
In-Reply-To: <20130410174107.GB32247@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
To: "Michael S. Tsirkin"
Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, owasserm@redhat.com,
 abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com

On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
>>>> Thanks.
>>>>
>>>> However, IMHO restricting the policy to only use chunk-based is really
>>>> not an acceptable choice:
>>>>
>>>> Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
>>>> dive from 10gbps to 6gbps.
>>> Who cares about the throughput really? What we do care about
>>> is how long the whole process takes.
>>>
>> Low latency and high throughput are very important =)
>>
>> Without these properties of RDMA, many workloads simply either
>> take too long to finish migrating or do not converge to a stopping
>> point altogether.
>>
>> *Not* making this a configurable option would defeat the purpose of
>> using RDMA altogether.
>>
>> Otherwise, you're no better off than just using TCP.
> So we have two protocols implemented: one is slow, the other pins all
> memory on destination indefinitely.
>
> I see two options here:
> - improve the slow version so it's fast, drop the pin-all version
> - give up and declare RDMA requires pinning all memory on destination
>
> But giving management a way to do RDMA at the speed of TCP? Why is this
> useful?
This is "useful" because of the overcommit concerns you brought up before,
which is the reason why I volunteered to write dynamic server registration
in the first place.

We never required that overcommit and performance had to co-exist. From
prior experience, I don't believe overcommit and good performance are
compatible with each other in general (i.e. using compression, page
sharing, etc.), but that's a debate for another day =)

I would like to propose a compromise: how about we *keep* the registration
capability and leave it enabled by default? This gives management tools
the ability to get performance if they want to, but it also satisfies
your requirements in case management doesn't know the feature exists -
they will simply get the (enabled) default.

>> But the problem is more complicated than that: there is no coordination
>> between the migration_thread and RDMA right now, because Paolo is
>> trying to maintain a very clean separation of function.
>>
>> However, we *can* do what you described in a future patch, like this:
>>
>> 1. Migration thread says "iteration starts, how much memory is dirty?"
>> 2. RDMA protocol says "Is there a lot of dirty memory?"
>>
>>    Yes? Then batch all the registration messages into a single request,
>>    but do not write the memory until all the registrations have
>>    completed.
>>
>>    No? Then just issue registrations with very little batching, so that
>>    we can quickly move on to the next iteration round.
>>
>> Make sense?
> Actually, I think you just need to get a page from migration core and
> give it to the FSM above. Then let it give you another page, until you
> have N pages in flight in the FSM, all at different stages in the
> pipeline. That's the theory.
>
> But if you want to try changing migration core, go wild. Very little
> is written in stone here.

The FSM and what I described are basically the same thing - I just
described it more abstractly than you did.
Either way, I agree that the optimization would be very useful, but I
disagree that it is possible for an optimized registration algorithm to
perform *as well as* the case where there is no dynamic registration at
all. The point is that dynamic registration *only* helps overcommitment;
it does nothing for performance. Since that's true, any optimization
that improves dynamic registration will always be sub-optimal compared
to turning dynamic registration off in the first place.

- Michael