Message-ID: <54771880.8000504@parallels.com>
Date: Thu, 27 Nov 2014 15:26:40 +0300
From: "Denis V. Lunev"
References: <1417088742-4538-1-git-send-email-den@openvz.org> <1417088742-4538-3-git-send-email-den@openvz.org>
Subject: Re: [Qemu-devel] [PATCH 2/2] balloon: add a feature bit to let Guest OS deflate balloon on oom
To: Andrey Korolyov, "Denis V. Lunev"
Cc: Raushaniya Maksudova, "qemu-devel@nongnu.org", Anthony Liguori, "Michael S. Tsirkin"

On 27/11/14 14:50, Andrey Korolyov wrote:
> On Thu, Nov 27, 2014 at 2:45 PM, Denis V. Lunev wrote:
>> Excessive virtio_balloon inflation can cause invocation of OOM-killer,
>> when Linux is under severe memory pressure. Various mechanisms are
>> responsible for correct virtio_balloon memory management. Nevertheless it
>> is often the case that these control tools do not have enough time to
>> react to fast-changing memory load. As a result OS runs out of memory and
>> invokes OOM-killer. The balancing of memory by use of the virtio balloon
>> should not cause the termination of processes while there are pages in the
>> balloon.
>> Now there is no way for virtio balloon driver to free memory at
>> the last moment before some process gets killed by OOM-killer.
>>
>> This does not provide a security breach as balloon itself is running
>> inside Guest OS and is working in the cooperation with the host. Thus
>> some improvements from Guest side should be considered as normal.
>>
>> To solve the problem, introduce a virtio_balloon callback which is
>> expected to be called from the oom notifier call chain in out_of_memory()
>> function. If virtio balloon could release some memory, it will make the
>> system to return and retry the allocation that forced the out of memory
>> killer to run.
>>
>> This behavior should be enabled if and only if appropriate feature bit
>> is set on the device. It is off by default.
>>
>> This functionality was recently merged into vanilla Linux (actually in
>> linux-next at the moment)
>>
>>   commit 5a10b7dbf904bfe01bb9fcc6298f7df09eed77d5
>>   Author: Raushaniya Maksudova
>>   Date:   Mon Nov 10 09:36:29 2014 +1030
>>
>> This patch adds respective control bits into QEMU. It introduces
>> deflate-on-oom option for balloon device which does the trick.
>>
>> Signed-off-by: Denis V. Lunev
>> CC: Raushaniya Maksudova
>> CC: Anthony Liguori
>> CC: Michael S. Tsirkin
>> ---
>>  hw/virtio/virtio-balloon.c         | 6 ++++--
>>  include/hw/virtio/virtio-balloon.h | 2 ++
>>  2 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index 7bfbb75..4d043ce 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -305,8 +305,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
>>
>>  static uint32_t virtio_balloon_get_features(VirtIODevice *vdev, uint32_t f)
>>  {
>> -    f |= (1 << VIRTIO_BALLOON_F_STATS_VQ);
>> -    return f;
>> +    VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
>> +    return (f | VIRTIO_BALLOON_F_STATS_VQ) | dev->host_features;
>>  }
>>
>>  static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
>> @@ -409,6 +409,8 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>>  }
>>
>>  static Property virtio_balloon_properties[] = {
>> +    DEFINE_PROP_BIT("deflate-on-oom", VirtIOBalloon, host_features,
>> +                    VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>
>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>> index f863bfe..2e1ccd9 100644
>> --- a/include/hw/virtio/virtio-balloon.h
>> +++ b/include/hw/virtio/virtio-balloon.h
>> @@ -30,6 +30,7 @@
>>  /* The feature bitmap for virtio balloon */
>>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
>>  #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory stats virtqueue */
>> +#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
>>
>>  /* Size of a PFN in the balloon interface. */
>>  #define VIRTIO_BALLOON_PFN_SHIFT 12
>> @@ -67,6 +68,7 @@ typedef struct VirtIOBalloon {
>>      QEMUTimer *stats_timer;
>>      int64_t stats_last_update;
>>      int64_t stats_poll_interval;
>> +    uint32_t host_features;
>>  } VirtIOBalloon;
>>
>>  #endif
>> --
>> 1.9.1
>>
>>
>
> Had you tried this with a system-wide OOM on a real workload?
> This behavior can work perfectly with dedicated memory cgroups, but I'm
> afraid it would be unusable when entire system stalls and waits for a
> balloon deflation.
>

We have tried this with test workloads only so far. I think that this
is a matter of setup. Yes, such a setup will probably result in host OOM.
But the host system has quite a lot of options to toss host memory
(including VM memory), so the system will survive longer. A host cgroup
is also a good idea, but in that case you will most probably have the
entire QEMU process killed.

We could think about this in the following terms: OOM in the guest is
equivalent to OOM on the host from the point of view of critical service
interaction. Most likely guest OOM will kill the fattest eater in the
guest, which is the most critical one, and this will not be seen by the
host at all. If the entire QEMU is killed, the VM could be restarted by
the fault tolerance system, and this restart could even happen on a
different node.

These are just simple speculations...

Anyway, this behavior is quite native from the guest's point of view and
is off by default. I do not see much of a problem with it. Still, this
ability combined with proper guest-to-host feedback seems promising from
the management point of view.
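
For reference, the feature-bit handling discussed in the quoted patch can be sketched in isolation. This is a minimal standalone illustration, not QEMU code: the struct `balloon_sketch` and the helper `balloon_sketch_get_features()` are hypothetical names, and the sketch assumes that `host_features` stores masks (`1 << bit index`) rather than raw indexes. Note that the quoted hunk ORs in `VIRTIO_BALLOON_F_STATS_VQ` itself (the bit index, value 1), where the removed line used `1 << VIRTIO_BALLOON_F_STATS_VQ`; the sketch keeps the shifted form.

```c
#include <stdint.h>

/* Feature bit indexes, as listed in the quoted header hunk. */
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ       1 /* Memory stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */

/* Hypothetical stand-in for the device state: only the field the patch adds. */
struct balloon_sketch {
    uint32_t host_features; /* assumption: holds masks, i.e. 1u << bit index */
};

/*
 * Hypothetical helper: combine the features passed in by the caller, the
 * optional bits selected on the device, and the always-offered stats queue.
 */
static uint32_t balloon_sketch_get_features(const struct balloon_sketch *dev,
                                            uint32_t f)
{
    f |= dev->host_features;              /* optional bits, e.g. deflate-on-oom */
    f |= 1u << VIRTIO_BALLOON_F_STATS_VQ; /* mask for bit 1, not the index 1 */
    return f;
}
```

Under these assumptions, a device with the option off reports 0x2 (stats queue only), and a device whose `host_features` is `1u << VIRTIO_BALLOON_F_DEFLATE_ON_OOM` reports 0x6. ORing in the raw index 1 instead would set bit 0, which is `VIRTIO_BALLOON_F_MUST_TELL_HOST`.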