From: Greg Kurz <groug@kaod.org>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: lvivier@redhat.com, Thomas Huth <thuth@redhat.com>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
farosas@linux.ibm.com, aik@ozlabs.ru,
"Michael S. Tsirkin" <mst@redhat.com>,
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>,
qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
qemu-ppc@nongnu.org, clg@kaod.org,
Igor Mammedov <imammedo@redhat.com>,
"Edgar E. Iglesias" <edgar.iglesias@gmail.com>,
paulus@samba.org
Subject: Re: [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constraint
Date: Tue, 3 Mar 2020 10:55:37 +0100 [thread overview]
Message-ID: <20200303105537.2c7ece09@bahia.home> (raw)
In-Reply-To: <20200303034351.333043-15-david@gibson.dropbear.id.au>
On Tue, 3 Mar 2020 14:43:48 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> The Real Mode Area (RMA) is the part of memory which a guest can access
> when in real (MMU off) mode. Of course, for a guest under KVM, the MMU
> isn't really turned off, it's just in a special translation mode - Virtual
> Real Mode Area (VRMA) - which looks like real mode in guest mode.
>
> The mechanics of how this works when using the hash MMU (HPT) put a
> constraint on the size of the RMA, which depends on the size of the
> HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
> we advertise to the guest based on this VRMA limit.
>
> There are several things wrong with this:
> 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
> of Node 0 memory size and the VRMA limit. That will *often* work the
> same as clamping, but there can be other constraints on RMA size which
> supersede Node 0 memory size. We have real bugs caused by this
> (currently worked around in the guest kernel)
> 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
> we're past the point that we can actually advertise an RMA limit to the
> guest
> 3) But most fundamentally, the VRMA limit depends on host configuration
> (page size) which shouldn't be visible to the guest, but this partially
> exposes it. This can cause problems with migration in certain edge
> cases, although we will mostly get away with it.
>
> In practice, this clamping is almost never applied anyway. With 64kiB
> pages and the normal rules for sizing of the HPT, the theoretical VRMA
> limit will be 4x(guest memory size) and so never hit. It will hit with
> 4kiB pages, where it will be (guest memory size)/4. However all mainstream
> distro kernels for POWER have used a 64kiB page size for at least 10 years.
>
> So, simply replace this logic with a check that the RMA we've calculated
> based only on guest visible configuration will fit within the host implied
> VRMA limit. This can break if running HPT guests on a host kernel with
> 4kiB page size. As noted that's very rare. There also exist several
> possible workarounds:
> * Change the host kernel to use 64kiB pages
> * Use radix MMU (RPT) guests instead of HPT
> * Use 64kiB hugepages on the host to back guest memory
> * Increase the guest memory size so that the RMA hits one of the fixed
> limits before the RMA limit. This is relatively easy on POWER8 which
> has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
> * Use a guest NUMA configuration which artificially constrains the RMA
> within the VRMA limit (the RMA must always fit within Node 0).
>
> Previously, on KVM, we also temporarily reduced the rma_size to 256M so
> that the we'd load the kernel and initrd safely, regardless of the VRMA
> limit. This was a) confusing, b) could significantly limit the size of
> images we could load and c) introduced a behavioural difference between
> KVM and TCG. So we remove that as well.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
Reviewed-by: Greg Kurz <groug@kaod.org>
> hw/ppc/spapr.c | 28 ++++++++++------------------
> hw/ppc/spapr_hcall.c | 4 ++--
> include/hw/ppc/spapr.h | 3 +--
> 3 files changed, 13 insertions(+), 22 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 18bf4bc3de..ef7667455c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1569,7 +1569,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int shift,
> spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
> }
>
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> +void spapr_setup_hpt(SpaprMachineState *spapr)
> {
> int hpt_shift;
>
> @@ -1585,10 +1585,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> }
> spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
>
> - if (spapr->vrma_adjust) {
> + if (kvm_enabled()) {
> hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift);
>
> - spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit);
> + /* Check our RMA fits in the possible VRMA */
> + if (vrma_limit < spapr->rma_size) {
> + error_report("Unable to create %" HWADDR_PRIu
> + "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB",
> + spapr->rma_size / MiB, vrma_limit / MiB);
> + exit(EXIT_FAILURE);
> + }
> }
> }
>
> @@ -1628,7 +1634,7 @@ static void spapr_machine_reset(MachineState *machine)
> spapr->patb_entry = PATE1_GR;
> spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
> } else {
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
>
> qemu_devices_reset();
> @@ -2695,20 +2701,6 @@ static void spapr_machine_init(MachineState *machine)
>
> spapr->rma_size = node0_size;
>
> - /* With KVM, we don't actually know whether KVM supports an
> - * unbounded RMA (PR KVM) or is limited by the hash table size
> - * (HV KVM using VRMA), so we always assume the latter
> - *
> - * In that case, we also limit the initial allocations for RTAS
> - * etc... to 256M since we have no way to know what the VRMA size
> - * is going to be as it depends on the size of the hash table
> - * which isn't determined yet.
> - */
> - if (kvm_enabled()) {
> - spapr->vrma_adjust = 1;
> - spapr->rma_size = MIN(spapr->rma_size, 0x10000000);
> - }
> -
> /* Actually we don't support unbounded RMA anymore since we added
> * proper emulation of HV mode. The max we can get is 16G which
> * also happens to be what we configure for PAPR mode so make sure
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index c2b3286625..40c86e91eb 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1458,7 +1458,7 @@ static void spapr_check_setup_free_hpt(SpaprMachineState *spapr,
> spapr_free_hpt(spapr);
> } else if (!(patbe_new & PATE1_GR)) {
> /* RADIX->HASH || NOTHING->HASH : Allocate HPT */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
> return;
> }
> @@ -1845,7 +1845,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> * (because the guest isn't going to use radix) then set it up here. */
> if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
> /* legacy hash or new hash: */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
>
> if (fdt_bufsize < sizeof(hdr)) {
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a4216935a1..90dbc55931 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -156,7 +156,6 @@ struct SpaprMachineState {
> SpaprPendingHpt *pending_hpt; /* in-progress resize */
>
> hwaddr rma_size;
> - int vrma_adjust;
> uint32_t fdt_size;
> uint32_t fdt_initial_size;
> void *fdt_blob;
> @@ -795,7 +794,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
> void spapr_events_init(SpaprMachineState *sm);
> void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> void close_htab_fd(SpaprMachineState *spapr);
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> +void spapr_setup_hpt(SpaprMachineState *spapr);
> void spapr_free_hpt(SpaprMachineState *spapr);
> SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> void spapr_tce_table_enable(SpaprTceTable *tcet,
next prev parent reply other threads:[~2020-03-03 9:56 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-03 3:43 [PATCH v7 00/17] target/ppc: Correct some errors with real mode handling David Gibson
2020-03-03 3:43 ` [PATCH v7 01/17] ppc: Remove stub support for 32-bit hypervisor mode David Gibson
2020-03-05 8:58 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 02/17] ppc: Remove stub of PPC970 HID4 implementation David Gibson
2020-03-03 3:43 ` [PATCH v7 03/17] target/ppc: Correct handling of real mode accesses with vhyp on hash MMU David Gibson
2020-03-03 3:43 ` [PATCH v7 04/17] target/ppc: Introduce ppc_hash64_use_vrma() helper David Gibson
2020-03-05 8:59 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 05/17] spapr, ppc: Remove VPM0/RMLS hacks for POWER9 David Gibson
2020-03-03 3:43 ` [PATCH v7 06/17] target/ppc: Remove RMOR register from POWER9 & POWER10 David Gibson
2020-03-03 3:43 ` [PATCH v7 07/17] target/ppc: Use class fields to simplify LPCR masking David Gibson
2020-03-05 8:56 ` Philippe Mathieu-Daudé
2020-03-10 10:06 ` Cédric Le Goater
2020-03-11 3:15 ` David Gibson
2020-03-03 3:43 ` [PATCH v7 08/17] target/ppc: Streamline calculation of RMA limit from LPCR[RMLS] David Gibson
2020-03-03 7:52 ` Greg Kurz
2020-03-05 8:53 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 09/17] target/ppc: Correct RMLS table David Gibson
2020-03-03 3:43 ` [PATCH v7 10/17] target/ppc: Only calculate RMLS derived RMA limit on demand David Gibson
2020-03-03 8:57 ` Greg Kurz
2020-03-05 8:48 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 11/17] target/ppc: Don't store VRMA SLBE persistently David Gibson
2020-03-03 9:37 ` Greg Kurz
2020-03-05 9:17 ` Philippe Mathieu-Daudé
2020-03-05 9:47 ` Greg Kurz
2020-03-05 10:35 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 12/17] spapr: Don't use weird units for MIN_RMA_SLOF David Gibson
2020-03-05 8:39 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 13/17] spapr,ppc: Simplify signature of kvmppc_rma_size() David Gibson
2020-03-05 8:47 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constraint David Gibson
2020-03-03 9:55 ` Greg Kurz [this message]
2020-03-03 3:43 ` [PATCH v7 15/17] spapr: Don't clamp RMA to 16GiB on new machine types David Gibson
2020-03-03 10:02 ` Greg Kurz
2020-03-05 8:45 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 16/17] spapr: Clean up RMA size calculation David Gibson
2020-03-03 10:18 ` Greg Kurz
2020-03-05 8:43 ` Philippe Mathieu-Daudé
2020-03-03 3:43 ` [PATCH v7 17/17] spapr: Fold spapr_node0_size() into its only caller David Gibson
2020-03-03 10:32 ` Greg Kurz
2020-03-04 1:25 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200303105537.2c7ece09@bahia.home \
--to=groug@kaod.org \
--cc=aik@ozlabs.ru \
--cc=clg@kaod.org \
--cc=david@gibson.dropbear.id.au \
--cc=edgar.iglesias@gmail.com \
--cc=farosas@linux.ibm.com \
--cc=imammedo@redhat.com \
--cc=lvivier@redhat.com \
--cc=mark.cave-ayland@ilande.co.uk \
--cc=mst@redhat.com \
--cc=paulus@samba.org \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
--cc=thuth@redhat.com \
--cc=xiaoguangrong.eric@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).