Date: Mon, 1 Apr 2019 09:24:18 -0400
From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Hildenbrand
Cc: Nitesh Narayan Lal, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com,
 pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com,
 riel@surriel.com, dodgen@google.com, konrad.wilk@oracle.com,
 dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com
Subject: Re: On guest free page hinting and OOM
Message-ID: <20190401073007-mutt-send-email-mst@kernel.org>
References: <20190329084058-mutt-send-email-mst@kernel.org>
 <20190329104311-mutt-send-email-mst@kernel.org>
 <7a3baa90-5963-e6e2-c862-9cd9cc1b5f60@redhat.com>
 <20190329125034-mutt-send-email-mst@kernel.org>

On Mon, Apr 01, 2019 at 10:17:51AM +0200, David Hildenbrand wrote:
> On 29.03.19 17:51, Michael S. Tsirkin wrote:
> > On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
> >> On 29.03.19 16:37, David Hildenbrand wrote:
> >>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
> >>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> >>>>>
> >>>>> We had a very simple idea in mind: as long as a hinting request is
> >>>>> pending, don't actually trigger any OOM activity, but wait for it to
> >>>>> be processed. This can be done using a simple atomic variable.
> >>>>>
> >>>>> This is a scenario that will only pop up when already pretty low on
> >>>>> memory. And the main difference to ballooning is that we *know* we
> >>>>> will get more memory soon.
> >>>>
> >>>> No we don't. If we keep polling we are quite possibly keeping the CPU
> >>>> busy, thereby delaying the hint request processing. Again, the issue is a
> >>>
> >>> You can always yield. But that's a different topic.
> >>>
> >>>> tradeoff: one kind of performance traded for the other. It is very
> >>>> hard to know in advance which path you will hit, and in the real world
> >>>> no one has the time to profile and tune things. By comparison, trading
> >>>> memory for performance is well understood.
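To make the atomic-variable idea and the yield above concrete, a minimal
sketch could look like the following. This is kernel-style C with all names
invented purely for illustration; none of it is taken from an existing
driver or patch:

#include <linux/atomic.h>
#include <linux/sched.h>

/* Illustrative only: delay OOM while free page hints are in flight. */
static atomic_t hint_requests_in_flight = ATOMIC_INIT(0);

/* Called by the hinting code when a request is queued / acked by the host. */
static void hint_request_queued(void)
{
	atomic_inc(&hint_requests_in_flight);
}

static void hint_request_completed(void)
{
	atomic_dec(&hint_requests_in_flight);
}

/*
 * Called from the reclaim/OOM path: as long as a hinting request is
 * pending, yield and wait for it to be processed instead of declaring
 * OOM, since the hinted pages become usable again once the host is done.
 */
static bool hinting_delay_oom(void)
{
	bool waited = false;

	while (atomic_read(&hint_requests_in_flight) > 0) {
		waited = true;
		cond_resched();	/* give the hinting side a chance to run */
	}
	return waited;		/* if true, the caller retries the allocation */
}

Whether that wait loop is cheap enough, or ends up occupying the CPU that
should be processing the hints, is exactly the tradeoff being debated here.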
> >>>>
> >>>>
> >>>>> "appended to guest memory", "global list of memory", malicious guests
> >>>>> always using that memory, and what about NUMA?
> >>>>
> >>>> This can be up to the guest. A good approach would be to take
> >>>> a chunk out of each node and add it to the hints buffer.
> >>>
> >>> This might lead to you not using the buffer efficiently. But also, a
> >>> different topic.
> >>>
> >>>>
> >>>>> What about different page
> >>>>> granularity?
> >>>>
> >>>> Seems like an orthogonal issue to me.
> >>>
> >>> It is similar, yes. But if you support multiple granularities (e.g.
> >>> MAX_ORDER - 1, MAX_ORDER - 2, ...) you might have to implement some sort
> >>> of buddy for the buffer. This is different from just a list for each node.
> >
> > Right, but we don't plan to do it yet.
>
> MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits), and
> early performance numbers indicate it might be the right thing to do. So
> it could be very desirable once we do more performance tests.
>
> >> Oh, and before I forget, different zones might of course also be a problem.
> >
> > I would just split the hint buffer evenly between zones.
>
> Thinking about your approach, there is one elementary thing to notice:
>
> Giving the guest pages from the buffer while hinting requests are being
> processed means that the guest can and will temporarily make use of more
> memory than desired, essentially up to the point where MADV_FREE is
> finally called for the hinted pages.

Right - but that seems like exactly the reverse of the issue with the
current approach, which is that the guest can temporarily use less memory
than desired.

> Even then the guest will logically
> make use of more memory than desired until core MM takes the pages away.

That sounds more like a host issue though. If it wants to, it can use
e.g. MADV_DONTNEED.

> So:
> 1) Unmodified guests will make use of more memory than desired.

One interesting possibility for this is to add the buffer memory by
hotplug after the feature has been negotiated. I agree this sounds
complex. But I have an idea: how about we include the hint size in the
num_pages counter? Then unmodified guests put it in the balloon and
don't use it. Modified ones will know to use it just for hinting.

> 2) Malicious guests will make use of more memory than desired.

Well, this limitation is fundamental to the balloon, right? If the host
wants to add tracking of balloon memory, it can enforce the limits. So
far no one has bothered, but maybe with this feature we should start to
do that.

> 3) Sane, modified guests will make use of more memory than desired.
>
> Instead, we could make our life much easier by doing the following:
>
> 1) Introduce a parameter to cap the amount of memory concurrently hinted,
> similar to what you suggested, just don't consider it a buffer value:
> "-device virtio-balloon,hinting_size=1G". This gives us control over the
> hinting process.
>
> hinting_size=0 (default) disables hinting
>
> The admin can tweak the number along with the memory requirements of the
> guest. We can make suggestions (e.g. calculate it depending on #cores,
> #size of memory, or simply use "1GB").

So if it's all up to the guest and for the benefit of the guest, and
with no cost/benefit to the host, then why are we supplying this value
from the host?

> 2) In the guest, track the size of hints in progress and cap it at the
> hinting_size.
>
> 3) Document the hinting behavior:
>
> "When hinting is enabled, memory up to hinting_size might temporarily be
> removed from your guest in order to be hinted to the hypervisor. This is
> only for a very short time, but might affect applications. Consider the
> hinting_size when sizing your guest. If your application was tested with
> X GB and a hinting size of 1G is used, please configure X+1 GB for the
> guest. Otherwise, performance degradation might be possible."

OK, so let's start with this. Now let us assume that the guest follows
the advice. We thus know that 1GB is not needed for guest applications.
So why do we want to allow applications to still use this extra memory?

> 4) Do the loop/yield on OOM as discussed, to improve performance when OOM
> and to avoid false OOM triggers, just to be sure.

Yes, I'm not against trying the simpler approach as a first step. But
then we need this path actually tested, to see whether hinting introduces
unreasonable overhead on it. And it is tricky to test OOM as you are
skating close to the system's limits. That's one reason I prefer avoiding
the OOM handler if possible.

When you say yield, I would guess that would involve a config space
access to the balloon to flush out outstanding hints?
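A rough sketch of the guest-side accounting for 2) and 4) could look as
follows. Again, all names are invented for illustration only, and
hinting_size stands for the proposed (not yet existing) device parameter:

#include <linux/atomic.h>
#include <linux/types.h>

/* Illustrative only: cap the amount of memory concurrently being hinted. */
static atomic64_t hinted_bytes = ATOMIC64_INIT(0);
static u64 hinting_size;	/* would be filled from the proposed parameter */

/* Reserve budget before isolating pages for a hint request. */
static bool hint_budget_get(u64 bytes)
{
	if ((u64)atomic64_add_return(bytes, &hinted_bytes) > hinting_size) {
		atomic64_sub(bytes, &hinted_bytes);
		return false;	/* over the cap: skip hinting for now */
	}
	return true;
}

/* Release budget once the host has processed the request. */
static void hint_budget_put(u64 bytes)
{
	atomic64_sub(bytes, &hinted_bytes);
}

The OOM-time loop/yield from 4) would then simply wait for hinted_bytes to
drop back to zero, along the lines of the earlier sketch.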
>
> BTW, one alternative I initially had in mind was to add pages from the
> buffer from the OOM handler only, and to put these pages back into the
> buffer once freed.

I don't think that works easily - pages get used, so we can't return
them into the buffer.

Another problem with only handling OOM is that OOM is a guest decision,
so the host really can't enforce any limits even if it wants to.

> I thought this might help for certain memory offlining
> scenarios where pages stuck in the buffer might hinder offlining of
> memory. And of course, improve performance as the buffer is only touched
> when really needed. But it would only help for memory (e.g. DIMMs) added
> after boot, so it is also not 100% safe. Also, it has the same issues as
> your approach.

So you can look at this approach as a combination of
- balloon inflate with separate accounting
- deflate on OOM
- hinting
?

Put this way, it seems rather uncontroversial, right?

> --
>
> Thanks,
>
> David / dhildenb