Date: Tue, 11 May 2010 21:09:40 +0300
From: Avi Kivity
Subject: [Qemu-devel] Re: [PATCH v5 4/5] Inter-VM shared memory PCI device
To: Anthony Liguori
Cc: Cam Macdonell, qemu-devel@nongnu.org, kvm@vger.kernel.org

On 05/11/2010 06:51 PM, Anthony Liguori wrote:
> On 05/11/2010 09:53 AM, Avi Kivity wrote:
>> On 05/11/2010 05:17 PM, Cam Macdonell wrote:
>>>
>>>> The master is the shared memory area. It's a completely separate
>>>> entity that is represented by the backing file (or shared memory
>>>> server handing out the fd to mmap). It can exist independently of
>>>> any guest.
>>>
>>> I think the master/peer idea would be necessary if we were sharing
>>> guest memory (sharing guest A's memory with guest B). Then if the
>>> master (guest A) dies, perhaps something needs to happen to
>>> preserve the memory contents.
>>
>> Definitely. But we aren't...
>
> Then transparent live migration is impossible. IMHO, that's a
> fundamental mistake that we will regret down the road.

I don't see why the two cases are any different. In all cases, all
guests have to be migrated simultaneously, or we have to support
distributed shared memory (likely at the kernel level). Who owns the
memory makes no difference.

There are two non-transparent variants:

- forcibly disconnect the migrating guest, and migrate it later -
  puts all the burden on the guest application
- ask the guest to detach from the memory device - host is at the
  mercy of the guest

Since the consumers of shared memory are academia, they'll probably
implement DSM.

>>> But since we're sharing host memory, the applications in the
>>> guests can race to determine the master by grabbing a lock at
>>> offset 0 or by using the lowest VM ID.
>>>
>>> Looking at it another way, it is the applications using shared
>>> memory that may or may not need a master; the Qemu processes don't
>>> need the concept of a master since the memory belongs to the host.
>>
>> Exactly. Furthermore, even in a master/slave relationship, there
>> will be different masters for different sub-areas; it would be a
>> pity to expose all this in the hardware abstraction. This way we
>> have an external device, and PCI HBAs which connect to it - just
>> like a multi-tailed SCSI disk.
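(Cam's lock-at-offset-0 election could look something like the
minimal sketch below. It assumes the region is reachable as a POSIX
shm object named /ivshmem - illustrative only; in the patch the fd
comes from the shared memory server - and claims mastership with an
atomic compare-and-swap on the first word of the region.)

  /* Sketch: guest applications race for the "master" role by
   * atomically claiming the first word of the shared region.
   * The segment name and claim value are illustrative. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      /* In the real setup the fd is handed out by the server. */
      int fd = shm_open("/ivshmem", O_RDWR, 0);
      if (fd < 0) { perror("shm_open"); return 1; }

      uint32_t *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
      if (base == MAP_FAILED) { perror("mmap"); return 1; }

      /* Word at offset 0 starts as 0; the first CAS to succeed
       * becomes master, everyone else falls back to peer. */
      uint32_t expected = 0;
      if (__atomic_compare_exchange_n(base, &expected, 1, 0,
                                      __ATOMIC_ACQ_REL,
                                      __ATOMIC_ACQUIRE)) {
          printf("won the race: acting as master\n");
      } else {
          printf("lost the race: acting as peer\n");
      }

      munmap(base, 4096);
      close(fd);
      return 0;
  }

(Link with -lrt on older glibc. The lowest-VM-ID rule is just a
different tiebreak over the same region; either way the election
lives entirely in the applications, not in qemu.)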
>
> To support transparent live migration, it's necessary to do two
> things:
>
> 1) Preserve the memory contents of the PCI BAR after disconnecting
> from a shared memory segment.
> 2) Synchronize any changes made to the PCI BAR with the shared
> memory segment upon reconnect/initial connection.

Disconnect/reconnect means it's no longer transparent.

> N.B. savevm and loadvm constitute disconnect and reconnect events,
> respectively.
>
> Supporting (1) is easy since we just need to memcpy() the contents
> of the shared memory segment to a temporary RAM area upon
> disconnect.
>
> Supporting (2) is easy when the shared memory segment is viewed as
> owned by the guest, since the guest has the definitive copy of the
> data. IMHO, this is what role=master means.

There is no 'the guest'; if the memory is to be shared, there will be
multiple guests (or multiple entities).

> However, if we want to support a model where the guest does not have
> a definitive copy of the data, then upon reconnect we need to throw
> away the guest's changes and make the shared memory segment appear
> to update simultaneously to the guest. This is what role=peer means.
>
> For role=peer, it's necessary to signal to the guest when it's not
> connected. This means prior to savevm it's necessary to indicate to
> the guest that it has been disconnected.
>
> I think it's important that we build this mechanism in from the
> start because, as I've stated in the past, I don't think role=peer
> is going to be the dominant use-case. I actually don't think that
> shared memory between guests is all that interesting compared to
> shared memory to an external process on the host.

I'd like to avoid making the distinction. Why limit at the outset?

-- 
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
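(For concreteness, the disconnect/reconnect bookkeeping Anthony
describes might look like the sketch below; struct shm_bar,
bar_disconnect() and bar_reconnect() are hypothetical names invented
for illustration, not anything from the patch or from qemu.)

  /* Sketch of the proposed scheme: on disconnect, preserve the BAR
   * contents in a private copy; on reconnect, role=master pushes the
   * guest's copy back into the segment, while role=peer discards it
   * and adopts the segment's current contents. */
  #include <stdlib.h>
  #include <string.h>

  enum role { ROLE_MASTER, ROLE_PEER };

  struct shm_bar {
      void     *shared;  /* mmap of the shared memory segment */
      void     *saved;   /* private copy held while disconnected */
      size_t    size;
      enum role role;
  };

  /* (1) Preserve the BAR contents when detaching from the segment. */
  static int bar_disconnect(struct shm_bar *bar)
  {
      bar->saved = malloc(bar->size);
      if (!bar->saved)
          return -1;
      memcpy(bar->saved, bar->shared, bar->size);
      return 0;
  }

  /* (2) Resolve the two copies when reattaching. */
  static void bar_reconnect(struct shm_bar *bar)
  {
      if (bar->role == ROLE_MASTER) {
          /* Guest owns the definitive copy: overwrite the segment. */
          memcpy(bar->shared, bar->saved, bar->size);
      }
      /* role=peer: drop the saved copy; the guest simply sees the
       * segment's current contents again. */
      free(bar->saved);
      bar->saved = NULL;
  }

(A savevm/loadvm pair would call bar_disconnect() before saving and
bar_reconnect() after loading; the role field is the only place the
master/peer distinction shows up.)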