From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <53F14A76.2020608@huawei.com>
Date: Mon, 18 Aug 2014 08:36:06 +0800
From: zhanghailiang
In-Reply-To: <20140814095651.GA30944@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] mlock: fix bug when mlockall called before mbind
To: "Michael S. Tsirkin"
Cc: Hu Tao, luonengjun@huawei.com, peter.huangpeng@huawei.com, xiexiangyou, qemu-devel@nongnu.org, aliguori@amazon.com, pbonzini@redhat.com, imammedo@redhat.com, gaowanlong@cn.fujitsu.com

On 2014/8/14 17:56, Michael S. Tsirkin wrote:
> On Thu, Aug 14, 2014 at 05:09:00PM +0800, zhanghailiang wrote:
>> On 2014/8/14 15:15, Hu Tao wrote:
>>> On Thu, Aug 14, 2014 at 02:31:41PM +0800, zhanghailiang wrote:
>>>> On 2014/8/13 19:50, Michael S. Tsirkin wrote:
>>>>> On Wed, Aug 13, 2014 at 07:21:57PM +0800, zhanghailiang wrote:
>>>>>> If we configure QEMU with realtime-mlock-on and memory-node-bind at the
>>>>>> same time, QEMU fails to start, and mbind() fails with the message
>>>>>> "Input/output error".
>>>>>>
>>>>>> From the man page:
>>>>>>
>>>>>>     int mbind(void *addr, unsigned long len, int mode,
>>>>>>               unsigned long *nodemask, unsigned long maxnode,
>>>>>>               unsigned flags);
>>>>>>
>>>>>> The MPOL_BIND mode specifies a strict policy that restricts memory
>>>>>> allocation to the nodes specified in nodemask. If MPOL_MF_STRICT is
>>>>>> passed in flags and the policy is not MPOL_DEFAULT (in QEMU it is
>>>>>> MPOL_BIND), the call fails with the error EIO if the existing pages
>>>>>> in the memory range do not follow the policy.
>>>>>>
>>>>>> Memory locked ahead of time by mlockall() is not guaranteed to follow
>>>>>> the policy above, and when it does not, mbind() fails with EIO.
>>>>>>
>>>>>> So we should call mlock after mbind. This patch moves the mlock call
>>>>>> to pc_memory_init().
>>>>>>
>>>>>> Signed-off-by: xiexiangyou
>>>>>> Signed-off-by: zhanghailiang
>>>>>
>>>>> OK, but won't this still fail in the case of memory hotplug?
>>>>> We set MCL_FUTURE, so the same will apply?
>>>
>>> It has already been set.
>>>
>>>>> Maybe it's enough to set MPOL_MF_MOVE?
>>>>> Does the following work for you?
>>>>
>>>> Hi Michael,
>>>>
>>>> I have tested memory hotplug with a virsh command like
>>>> 'virsh setmem redhat-6.4 6388608 --config --live'; it works, and
>>>> mbind is not called for that kind of memory hotplug.
>>>> But I don't know whether there is a command like 'memory-node hotplug'?
>>>
>>> Use the QEMU monitor command:
>>> object_add memory-backend-ram,id=ram1,size=128M,host-nodes=0,policy=bind
>>
>> Hi Hu Tao,
>>
>> I have tested it using the above command, and yes, it fails.
>> If I remove the MCL_FUTURE flag from mlockall(), it works fine.
>>
>> MCL_FUTURE: "Lock all pages which will become mapped into the
>> address space of the process in the future." (From the man page.)
>>
>> So I think we could remove this flag and call mlockall(MCL_CURRENT)
>> every time we do the 'object_add' above.
>
> No, that's wrong. We want allocations of QEMU memory locked as well,
> not just guest memory.
>
>>>> MPOL_MF_MOVE can work; it is simpler, but it is not perfect. It
>>>> costs more time to move the memory (I guess it reconstructs pages
>>>> and copies memory) that has already been locked by mlockall. As a
>>>> result, the VM starts more slowly than in the scenario above.
>>>
>>> I think your patch makes sense, but it doesn't work for hotplugged
>>> memory. We need MPOL_MF_MOVE, too.
>>
>> Thank you very much. Before sending another patch, I will look into
>> QEMU's memory hotplug path and try to solve this fault. :)
>
> Does setting MPOL_MF_MOVE solve all problems?

Hmm, if we don't consider its side effect (the VM starts more slowly),
that solves all problems. :)

>>>> BTW, I think the following process is clearer and more logical:
>>>> allocate memory ---> set memory policy ---> lock memory.
>>>>
>>>> So what's your opinion? Thanks very much.
>>>>
>>>>> -->
>>>>>
>>>>> hostmem: set MPOL_MF_MOVE
>>>>>
>>>>> When memory is allocated on a wrong node, MPOL_MF_STRICT
>>>>> doesn't move it - it just fails the allocation.
>>>>> A simple way to reproduce the failure is with the mlock=on
>>>>> realtime feature.
>>>>>
>>>>> The code comment actually says: "ensure policy won't be ignored",
>>>>> so setting MPOL_MF_MOVE seems like a better way to do this.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/backends/hostmem.c b/backends/hostmem.c
>>>>> index ca10c51..a9905c0 100644
>>>>> --- a/backends/hostmem.c
>>>>> +++ b/backends/hostmem.c
>>>>> @@ -304,7 +304,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
>>>>>          /* ensure policy won't be ignored in case memory is preallocated
>>>>>           * before mbind(). note: MPOL_MF_STRICT is ignored on hugepages so
>>>>>           * this doesn't catch hugepage case. */
>>>>> -        unsigned flags = MPOL_MF_STRICT;
>>>>> +        unsigned flags = MPOL_MF_STRICT | MPOL_MF_MOVE;
>>>>>
>>>>>          /* check for invalid host-nodes and policies and give more verbose
>>>>>           * error messages than mbind(). */