From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:47481)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1Z4QLK-000873-1d
    for qemu-devel@nongnu.org; Mon, 15 Jun 2015 05:06:11 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1Z4QLG-00007h-S3
    for qemu-devel@nongnu.org; Mon, 15 Jun 2015 05:06:09 -0400
Received: from mx1.redhat.com ([209.132.183.28]:49265)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1Z4QLG-00007W-KE
    for qemu-devel@nongnu.org; Mon, 15 Jun 2015 05:06:06 -0400
Date: Mon, 15 Jun 2015 11:06:02 +0200
From: "Michael S. Tsirkin"
Message-ID: <20150615104654-mutt-send-email-mst@redhat.com>
References: <1433845144-26889-1-git-send-email-den@openvz.org>
    <1433845144-26889-2-git-send-email-den@openvz.org>
    <5576C1CF.40305@de.ibm.com>
    <5578274D.6070900@openvz.org>
    <20150610151113-mutt-send-email-mst@redhat.com>
    <557AC8F5.6040105@de.ibm.com>
    <20150612185256-mutt-send-email-mst@redhat.com>
    <557E7861.7070207@de.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <557E7861.7070207@de.ibm.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] balloon: add a feature bit to let Guest OS deflate balloon on oom
To: Christian Borntraeger
Cc: James.Bottomley@HansenPartnership.com, "Denis V. Lunev" , qemu-devel@nongnu.org, Raushaniya Maksudova , Anthony Liguori

On Mon, Jun 15, 2015 at 09:01:53AM +0200, Christian Borntraeger wrote:
> On 13.06.2015 at 22:10, Michael S. Tsirkin wrote:
> > On Fri, Jun 12, 2015 at 01:56:37PM +0200, Christian Borntraeger wrote:
> >> On 10.06.2015 at 15:13, Michael S. Tsirkin wrote:
> >>> On Wed, Jun 10, 2015 at 03:02:21PM +0300, Denis V. Lunev wrote:
> >>>> On 09/06/15 13:37, Christian Borntraeger wrote:
> >>>>> On 09.06.2015 at 12:19, Denis V. Lunev wrote:
> >>>>>> Excessive virtio_balloon inflation can cause invocation of OOM-killer,
> >>>>>> when Linux is under severe memory pressure. Various mechanisms are
> >>>>>> responsible for correct virtio_balloon memory management. Nevertheless it
> >>>>>> is often the case that these control tools do not have enough time to
> >>>>>> react to fast-changing memory load. As a result OS runs out of memory and
> >>>>>> invokes OOM-killer. The balancing of memory by use of the virtio balloon
> >>>>>> should not cause the termination of processes while there are pages in the
> >>>>>> balloon. Now there is no way for virtio balloon driver to free memory at
> >>>>>> the last moment before some process gets killed by OOM-killer.
> >>>>>>
> >>>>>> This does not introduce a security breach as balloon itself is running
> >>>>>> inside Guest OS and is working in cooperation with the host. Thus
> >>>>>> some improvements from Guest side should be considered as normal.
> >>>>>>
> >>>>>> To solve the problem, introduce a virtio_balloon callback which is
> >>>>>> expected to be called from the oom notifier call chain in out_of_memory()
> >>>>>> function. If virtio balloon could release some memory, it will make the
> >>>>>> system return and retry the allocation that forced the out of memory
> >>>>>> killer to run.
> >>>>>>
> >>>>>> This behavior should be enabled if and only if appropriate feature bit
> >>>>>> is set on the device. It is off by default.
> >>>>> The balloon frees pages in this way
> >>>>>
> >>>>> static void balloon_page(void *addr, int deflate)
> >>>>> {
> >>>>> #if defined(__linux__)
> >>>>>     if (!kvm_enabled() || kvm_has_sync_mmu())
> >>>>>         qemu_madvise(addr, TARGET_PAGE_SIZE,
> >>>>>                      deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> >>>>> #endif
> >>>>> }
> >>>>>
> >>>>> The guest can re-touch that page and get an empty zero page or the old
> >>>>> page back without tampering with the host integrity.
> >>>>> This should work for all cases I am aware of (without sync_mmu it's a nop
> >>>>> anyway), so why not enable that by default? Anything that I missed?
> >>>>>
> >>>>> Christian
> >>>>
> >>>> I'd like to do that :) Actually original version of kernel patch
> >>>> has enabled this unconditionally. But Michael asked to make
> >>>> it configurable and off by default.
> >>>>
> >>>> Den
> >>>
> >>> That's not the question here. The question is why it is limited by
> >>> kvm_has_sync_mmu.
> >>
> >> Well we have two interesting options here:
> >>
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST and VIRTIO_BALLOON_F_DEFLATE_ON_OOM
> >>
> >> For any sane host with on-demand paging just re-accessing the page
> >> should simply work. So the common case could be
> >> VIRTIO_BALLOON_F_MUST_TELL_HOST == off
> >
> > Disabling this breaks useful optimizations such as
> > ability not to migrate memory in the balloon.
>
> memory in the balloon is usually backed by the empty zero page after
> the madvise (MADV_DONTNEED will finally result in zap_pte_range for the
> common case). In an ideal world migration should be able to optimize
> zero pages.

This still involves reading them in as opposed to just skipping them.

>
> >> VIRTIO_BALLOON_F_DEFLATE_ON_OOM == on
> >
> > AFAIK management tools depend on balloon not deflating
> > below host-specified threshold to avoid OOM on the host.
> > So I don't think we can make this a default,
> > management needs to enable this explicitly.
>
> If the ballooning is required to keep the host memory management
> from OOM - iow abusing ballooning as memory hotplug between guests
> then yes better let the guest oom - that makes sense.
>
> Now: I think that doing so (not having enough swap in the host if
> all guests deflate) and relying on balloon semantics is fundamentally
> broken. Let me explain this: The problem is that we rely on guest
> cooperation for the host integrity. As I explained using madvise
> MADV_DONTNEED will replace the current PTEs with invalid/empty PTEs.
> As soon as the guest kernel re-touches the page (e.g. a malicious
> kernel module - not the balloon driver) it will be backed by the VMA's
> default method - so usually with a shared R/O copy of the empty
> zero page. Write accesses will result in a copy-on-write and allocate
> new memory in the host.
> There is nothing we can do in the balloon protocol to protect the host
> against malicious guests allocating all the maximum memory.

If we want to try and harden the host, we can unmap it so the guest will
crash if it touches pages without deflating.

> If you need host integrity against guest memory usage, something like
> cgroups_memory or so is probably the only reliable way.

In the original design, protection against a malicious guest
is not the point of the balloon; it's a technology
that lets you overcommit cooperative guests.

> >
> >> Only for the rare case of hypervisors without paging or other memory
> >> related restrictions we have to enable MUST_TELL_HOST.
> >> Now: QEMU knows exactly which case we have, so why not let QEMU tell
> >> the guest what the capabilities are. (e.g. sync_mmu ---> no need to
> >> tell the host).
> >>
> >> I can at least imagine that some admin wants to make the oom case
> >> configurable, but a sane default seems to be to not kill random
> >> guest processes.
> >>
> >> Christian
> >
> >
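
For reference, the re-touch behaviour discussed above can be reproduced in
plain userspace. The following is only a minimal sketch, assuming Linux and a
private anonymous mapping; it is not QEMU code, and the helper name
byte_after_dontneed() is made up for illustration. It dirties one page, drops
it with MADV_DONTNEED (what balloon_page() does on inflate), then touches the
page again without any "deflate" step.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of QEMU.
 * Returns the first byte read back after MADV_DONTNEED, or -1 on error.
 */
int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Dirty the page so it is backed by real memory. */
    memset(p, 0xAA, len);

    /* Drop the backing memory; the mapping itself stays valid. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* Re-touch for reading: the old contents are gone. */
    int seen = p[0];

    /* Re-touch for writing: the kernel has to provide a fresh page. */
    p[0] = 0x55;
    assert(p[0] == 0x55);

    munmap(p, len);
    return seen;
}
```

On Linux the helper returns 0 for this kind of mapping: the read after
MADV_DONTNEED sees a zero-filled page, and the subsequent write succeeds by
allocating new memory. That is the reason the balloon cannot, by itself, stop
a guest from taking the memory back.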