Date: Tue, 31 Jan 2023 10:28:47 +0000
Message-ID: <86h6w70zhc.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Ricardo Koller <ricarkol@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>, pbonzini@redhat.com,
    oupton@google.com, yuzenghui@huawei.com, dmatlack@google.com,
    kvm@vger.kernel.org, kvmarm@lists.linux.dev, qperret@google.com,
    catalin.marinas@arm.com, andrew.jones@linux.dev, seanjc@google.com,
    alexandru.elisei@arm.com, suzuki.poulose@arm.com,
    eric.auger@redhat.com, gshan@redhat.com, reijiw@google.com,
    rananta@google.com, bgardon@google.com, ricarkol@gmail.com
Subject: Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
References: <20230113035000.480021-1-ricarkol@google.com>
    <20230113035000.480021-7-ricarkol@google.com>
    <86v8ktkqfx.wl-maz@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller wrote:
>
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes.
> > 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace-visible
> > option, selectable on a per-VM (or memslot) basis?
>
> It should be possible. I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M). However, they add
> complexity and I'm not sure they help much.
>
> (I will be using PAGE_SIZE=4K to make things simpler.)
>
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.

This is an important difference. It avoids the "thermal shock" of the
upfront allocation, giving the kernel time to reclaim memory from
somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM your guest uses.

>
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees, which is needed in order to use FEAT_BBM
> and skip the expensive TLBI broadcasts. 513 is just the number of
> pages needed to break a 1G huge-page.

I understand that. But it is also clear that 1GB huge pages are
unlikely to be THPs, and I wonder if we should treat the two
differently. Using HugeTLBFS pages is significant here.

>
> We could optimize for smaller huge-pages, like 2M, by splitting one
> huge-page at a time: only preallocate one 4K page at a time. The
> trick is how to know that we are splitting 2M huge-pages. We could
> either get the vma pagesize or use hints from userspace. I'm not sure
> that this is worth it though. The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
>
> alloc 1 page       |    | alloc 512 pages
> split 2M huge-page |    | split 2M huge-page
> alloc 1 page       |    | split 2M huge-page
> split 2M huge-page | => | split 2M huge-page
> ...
> alloc 1 page       |    | split 2M huge-page
> split 2M huge-page |    | split 2M huge-page
>
> Still thinking of what else to do.

I think the 1G case fits your own use case, but I doubt this covers
the majority of the users. Most people rely on the kernel's ability to
use THPs, which are capped at the first level of block mapping. 2MB
(and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB with 64kB pages are vanishingly rare).

Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user is not involved, this is a lot more debatable. This is why I'd
like this behaviour to be a buy-in, either directly (a first-class
userspace API) or indirectly (the provenance of the memory).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.