From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1MImWv-0000qN-4w
	for qemu-devel@nongnu.org; Mon, 22 Jun 2009 12:38:01 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1MImWq-0000mk-FU
	for qemu-devel@nongnu.org; Mon, 22 Jun 2009 12:38:00 -0400
Received: from [199.232.76.173] (port=57582 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1MImWq-0000mU-6L
	for qemu-devel@nongnu.org; Mon, 22 Jun 2009 12:37:56 -0400
Received: from mx2.redhat.com ([66.187.237.31]:43661)
	by monty-python.gnu.org with esmtp (Exim 4.60)
	(envelope-from )
	id 1MImWp-0004DN-Lu
	for qemu-devel@nongnu.org; Mon, 22 Jun 2009 12:37:55 -0400
Message-ID: <4A3FB390.4060809@redhat.com>
Date: Mon, 22 Jun 2009 19:38:40 +0300
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of
 writing a zero page, madvise it away
References: <200906221549.n5MFn3Qd015389@d03av02.boulder.ibm.com>
 <4A3FAD69.60507@redhat.com> <4A3FB077.4040607@codemonkey.ws>
In-Reply-To: <4A3FB077.4040607@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Anthony Liguori, qemu-devel

On 06/22/2009 07:25 PM, Anthony Liguori wrote:
> Avi Kivity wrote:
>> On 06/22/2009 06:51 PM, Anthony Liguori wrote:
>>> From: Anthony Liguori
>>>
>>> Otherwise, after migration, we end up with a much larger RSS size
>>> than we ought to have.
>>>
>>
>> We have the same issue on the migration source node.  I don't see a
>> simple way to solve it, though.
>
> I don't follow.  In this case, the issue is:
>
> 1) Start a guest with 1024MB, balloon down to 128MB.  RSS size is now
> ~128MB
> 2) Live migrate to a different node
> 3) RSS on different node jumps to ~1GB

3.5) RSS on source node jumps to ~1GB, since reading the page
instantiates the pte

> 4) Weep at all your lost memory

4.5) And at the swapping going on in the source node

>
> Xen had a similar issue.  This ends up biting people who overcommit
> their VMs via ballooning, live migration, and badness ensues.  At
> least for us, the error is swapping but madvise also avoids the issue
> by never consuming that memory to begin with.

Right.  I'd love to do madvise() on the source node as well if we fault
in a page and find out it's zero, but the guest (and aio) is still
running and we might drop live data.  We need a
madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if the
page exists (vs. swapped).  ksm would also do this, but it is overkill
for some applications.

Note that the patch contains a small bug -- the kernel is allowed to
ignore the advice according to the manual page, so it's better to
memset() the memory before dropping it.

-- 
error compiling committee.c: too many arguments to function
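
To make the memset()-then-madvise() suggestion concrete, here is a rough
sketch of how the destination side might handle an incoming page.  This
is not the code from the commit: load_page() and buffer_is_zero() are
hypothetical names used only for illustration, and the sketch assumes a
page-aligned, private anonymous mapping on Linux.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if the buffer holds only zero bytes, 0 otherwise. */
static int buffer_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/*
 * Hypothetical helper: store one incoming page at 'host'.
 * Assumes 'host' is page-aligned and lies in a private anonymous mapping.
 * Returns 1 if the page was zero and was handed back to the kernel,
 * 0 if it was copied normally.
 */
static int load_page(uint8_t *host, const uint8_t *incoming, size_t page_size)
{
    /* madvise() needs a page-aligned start address. */
    assert(((uintptr_t)host % page_size) == 0);

    if (!buffer_is_zero(incoming, page_size)) {
        memcpy(host, incoming, page_size);
        return 0;
    }

    /* Zero first, so the contents are right even if the advice is ignored. */
    memset(host, 0, page_size);

    /* Then ask the kernel to drop the backing page to keep RSS down. */
    if (madvise(host, page_size, MADV_DONTNEED) != 0) {
        perror("madvise"); /* not fatal: the page already reads as zero */
    }
    return 1;
}
```

The memset() does touch the page briefly, but the guest-visible contents
are then correct whether or not the kernel acts on the advice.  This
only covers the destination node; it does nothing for the source-node
problem described above, where the guest is still running.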