qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: peter.maydell@linaro.org
Cc: lvivier@redhat.com, aik@ozlabs.ru, mdroth@linux.vnet.ibm.com,
	qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org,
	clg@kaod.org, David Gibson <david@gibson.dropbear.id.au>
Subject: [PULL 19/45] spapr: Don't attempt to clamp RMA to VRMA constraint
Date: Tue, 17 Mar 2020 21:03:57 +1100	[thread overview]
Message-ID: <20200317100423.622643-20-david@gibson.dropbear.id.au> (raw)
In-Reply-To: <20200317100423.622643-1-david@gibson.dropbear.id.au>

The Real Mode Area (RMA) is the part of memory which a guest can access
when in real (MMU off) mode.  Of course, for a guest under KVM, the MMU
isn't really turned off, it's just in a special translation mode - Virtual
Real Mode Area (VRMA) - which looks like real mode in guest mode.

The mechanics of how this works when using the hash MMU (HPT) put a
constraint on the size of the RMA, which depends on the size of the
HPT.  So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
we advertise to the guest based on this VRMA limit.

There are several things wrong with this:
 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
    of Node 0 memory size and the VRMA limit.  That will *often* work the
    same as clamping, but there can be other constraints on RMA size which
    supersede Node 0 memory size.  We have real bugs caused by this
    (currently worked around in the guest kernel)
 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
    we're past the point that we can actually advertise an RMA limit to the
    guest
 3) But most fundamentally, the VRMA limit depends on host configuration
    (page size) which shouldn't be visible to the guest, but this partially
    exposes it.  This can cause problems with migration in certain edge
    cases, although we will mostly get away with it.

In practice, this clamping is almost never applied anyway.  With 64kiB
pages and the normal rules for sizing of the HPT, the theoretical VRMA
limit will be 4x(guest memory size) and so never hit.  It will hit with
4kiB pages, where it will be (guest memory size)/4.  However all mainstream
distro kernels for POWER have used a 64kiB page size for at least 10 years.

So, simply replace this logic with a check that the RMA we've calculated
based only on guest visible configuration will fit within the host implied
VRMA limit.  This can break if running HPT guests on a host kernel with
4kiB page size.  As noted that's very rare.  There also exist several
possible workarounds:
  * Change the host kernel to use 64kiB pages
  * Use radix MMU (RPT) guests instead of HPT
  * Use 64kiB hugepages on the host to back guest memory
  * Increase the guest memory size so that the RMA hits one of the fixed
    limits before the RMA limit.  This is relatively easy on POWER8 which
    has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
  * Use a guest NUMA configuration which artificially constrains the RMA
    within the VRMA limit (the RMA must always fit within Node 0).

Previously, on KVM, we also temporarily reduced the rma_size to 256M so
that the we'd load the kernel and initrd safely, regardless of the VRMA
limit.  This was a) confusing, b) could significantly limit the size of
images we could load and c) introduced a behavioural difference between
KVM and TCG.  So we remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/ppc/spapr.c         | 28 ++++++++++------------------
 hw/ppc/spapr_hcall.c   |  4 ++--
 include/hw/ppc/spapr.h |  3 +--
 3 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 18bf4bc3de..ef7667455c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1569,7 +1569,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int shift,
     spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
 }
 
-void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
+void spapr_setup_hpt(SpaprMachineState *spapr)
 {
     int hpt_shift;
 
@@ -1585,10 +1585,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
     }
     spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
 
-    if (spapr->vrma_adjust) {
+    if (kvm_enabled()) {
         hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift);
 
-        spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit);
+        /* Check our RMA fits in the possible VRMA */
+        if (vrma_limit < spapr->rma_size) {
+            error_report("Unable to create %" HWADDR_PRIu
+                         "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB",
+                         spapr->rma_size / MiB, vrma_limit / MiB);
+            exit(EXIT_FAILURE);
+        }
     }
 }
 
@@ -1628,7 +1634,7 @@ static void spapr_machine_reset(MachineState *machine)
         spapr->patb_entry = PATE1_GR;
         spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
     } else {
-        spapr_setup_hpt_and_vrma(spapr);
+        spapr_setup_hpt(spapr);
     }
 
     qemu_devices_reset();
@@ -2695,20 +2701,6 @@ static void spapr_machine_init(MachineState *machine)
 
     spapr->rma_size = node0_size;
 
-    /* With KVM, we don't actually know whether KVM supports an
-     * unbounded RMA (PR KVM) or is limited by the hash table size
-     * (HV KVM using VRMA), so we always assume the latter
-     *
-     * In that case, we also limit the initial allocations for RTAS
-     * etc... to 256M since we have no way to know what the VRMA size
-     * is going to be as it depends on the size of the hash table
-     * which isn't determined yet.
-     */
-    if (kvm_enabled()) {
-        spapr->vrma_adjust = 1;
-        spapr->rma_size = MIN(spapr->rma_size, 0x10000000);
-    }
-
     /* Actually we don't support unbounded RMA anymore since we added
      * proper emulation of HV mode. The max we can get is 16G which
      * also happens to be what we configure for PAPR mode so make sure
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index c2b3286625..40c86e91eb 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1458,7 +1458,7 @@ static void spapr_check_setup_free_hpt(SpaprMachineState *spapr,
         spapr_free_hpt(spapr);
     } else if (!(patbe_new & PATE1_GR)) {
         /* RADIX->HASH || NOTHING->HASH : Allocate HPT */
-        spapr_setup_hpt_and_vrma(spapr);
+        spapr_setup_hpt(spapr);
     }
     return;
 }
@@ -1845,7 +1845,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
          * (because the guest isn't going to use radix) then set it up here. */
         if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
             /* legacy hash or new hash: */
-            spapr_setup_hpt_and_vrma(spapr);
+            spapr_setup_hpt(spapr);
         }
 
         if (fdt_bufsize < sizeof(hdr)) {
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a4216935a1..90dbc55931 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -156,7 +156,6 @@ struct SpaprMachineState {
     SpaprPendingHpt *pending_hpt; /* in-progress resize */
 
     hwaddr rma_size;
-    int vrma_adjust;
     uint32_t fdt_size;
     uint32_t fdt_initial_size;
     void *fdt_blob;
@@ -795,7 +794,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
 void spapr_events_init(SpaprMachineState *sm);
 void spapr_dt_events(SpaprMachineState *sm, void *fdt);
 void close_htab_fd(SpaprMachineState *spapr);
-void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
+void spapr_setup_hpt(SpaprMachineState *spapr);
 void spapr_free_hpt(SpaprMachineState *spapr);
 SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
 void spapr_tce_table_enable(SpaprTceTable *tcet,
-- 
2.24.1



  parent reply	other threads:[~2020-03-17 10:18 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-17 10:03 [PULL 00/45] ppc-for-5.0 queue 20200317 David Gibson
2020-03-17 10:03 ` [PULL 01/45] pseries: Update SLOF firmware image David Gibson
2020-03-17 10:03 ` [PULL 02/45] spapr: Handle pending hot plug/unplug requests at CAS David Gibson
2020-03-17 10:03 ` [PULL 03/45] ppc: Officially deprecate the CPU "compat" property David Gibson
2020-03-17 10:03 ` [PULL 04/45] spapr: Fix Coverity warning while validating nvdimm options David Gibson
2020-03-17 10:03 ` [PULL 05/45] hw/ppc/pnv: Fix typo in comment David Gibson
2020-03-17 10:03 ` [PULL 06/45] ppc: Remove stub support for 32-bit hypervisor mode David Gibson
2020-03-17 10:03 ` [PULL 07/45] ppc: Remove stub of PPC970 HID4 implementation David Gibson
2020-03-17 10:03 ` [PULL 08/45] target/ppc: Correct handling of real mode accesses with vhyp on hash MMU David Gibson
2020-03-17 10:03 ` [PULL 09/45] target/ppc: Introduce ppc_hash64_use_vrma() helper David Gibson
2020-03-17 10:03 ` [PULL 10/45] spapr, ppc: Remove VPM0/RMLS hacks for POWER9 David Gibson
2020-03-17 10:03 ` [PULL 11/45] target/ppc: Remove RMOR register from POWER9 & POWER10 David Gibson
2020-03-17 10:03 ` [PULL 12/45] target/ppc: Use class fields to simplify LPCR masking David Gibson
2020-03-17 10:03 ` [PULL 13/45] target/ppc: Streamline calculation of RMA limit from LPCR[RMLS] David Gibson
2020-03-17 10:03 ` [PULL 14/45] target/ppc: Correct RMLS table David Gibson
2020-03-17 10:03 ` [PULL 15/45] target/ppc: Only calculate RMLS derived RMA limit on demand David Gibson
2020-03-17 10:03 ` [PULL 16/45] target/ppc: Don't store VRMA SLBE persistently David Gibson
2020-03-17 10:03 ` [PULL 17/45] spapr: Don't use weird units for MIN_RMA_SLOF David Gibson
2020-03-17 10:03 ` [PULL 18/45] spapr,ppc: Simplify signature of kvmppc_rma_size() David Gibson
2020-03-17 10:03 ` David Gibson [this message]
2020-03-17 10:03 ` [PULL 20/45] spapr: Don't clamp RMA to 16GiB on new machine types David Gibson
2020-03-17 10:03 ` [PULL 21/45] spapr: Clean up RMA size calculation David Gibson
2020-03-17 10:04 ` [PULL 22/45] hw/scsi/viosrp: Add missing 'hw/scsi/srp.h' include David Gibson
2020-03-17 10:04 ` [PULL 23/45] hw/scsi/spapr_vscsi: Use SRP_MAX_IU_LEN instead of sizeof flexible array David Gibson
2020-03-17 10:04 ` [PULL 24/45] hw/scsi/spapr_vscsi: Simplify a bit David Gibson
2020-03-17 10:04 ` [PULL 25/45] hw/scsi/spapr_vscsi: Introduce req_iu() helper David Gibson
2020-03-17 10:04 ` [PULL 26/45] hw/scsi/spapr_vscsi: Do not mix SRP IU size with DMA buffer size David Gibson
2020-03-17 10:04 ` [PULL 27/45] hw/scsi/spapr_vscsi: Prevent buffer overflow David Gibson
2020-03-17 10:04 ` [PULL 28/45] hw/scsi/spapr_vscsi: Convert debug fprintf() to trace event David Gibson
2020-03-17 10:04 ` [PULL 29/45] spapr/xive: use SPAPR_IRQ_IPI to define IPI ranges exposed to the guest David Gibson
2020-03-17 10:04 ` [PULL 30/45] target/ppc: Fix rlwinm on ppc64 David Gibson
2020-03-17 10:04 ` [PULL 31/45] ppc/spapr: Move GPRs setup to one place David Gibson
2020-03-17 10:04 ` [PULL 32/45] pseries: Update SLOF firmware image David Gibson
2020-03-17 10:04 ` [PULL 33/45] spapr/rtas: Reserve space for RTAS blob and log David Gibson
2020-03-17 10:04 ` [PULL 34/45] spapr: Move creation of ibm, dynamic-reconfiguration-memory dt node David Gibson
2020-03-17 10:04 ` [PULL 35/45] spapr: Move creation of ibm,architecture-vec-5 property David Gibson
2020-03-17 10:04 ` [PULL 36/45] spapr: Rename DT functions to newer naming convention David Gibson
2020-03-17 10:04 ` [PULL 37/45] ppc/spapr: Fix FWNMI machine check failure handling David Gibson
2020-03-17 10:04 ` [PULL 38/45] ppc/spapr: Change FWNMI names David Gibson
2020-03-17 10:04 ` [PULL 39/45] ppc/spapr: Add FWNMI System Reset state David Gibson
2020-03-17 10:04 ` [PULL 40/45] ppc/spapr: Fix FWNMI machine check interrupt delivery David Gibson
2020-03-17 10:04 ` [PULL 41/45] ppc/spapr: Allow FWNMI on TCG David Gibson
2020-03-17 10:04 ` [PULL 42/45] target/ppc: allow ppc_cpu_do_system_reset to take an alternate vector David Gibson
2020-03-17 10:04 ` [PULL 43/45] ppc/spapr: Implement FWNMI System Reset delivery David Gibson
2020-03-17 10:04 ` [PULL 44/45] ppc/spapr: Ignore common "ibm,nmi-interlock" Linux bug David Gibson
2020-03-17 10:04 ` [PULL 45/45] pseries: Update SLOF firmware image David Gibson
2020-03-17 10:30 ` [PULL 00/45] ppc-for-5.0 queue 20200317 Paolo Bonzini
2020-03-17 22:33   ` David Gibson
2020-03-17 23:58     ` Alexey Kardashevskiy
2020-03-18  5:46       ` David Gibson
2020-03-17 11:24 ` no-reply
2020-03-18 17:57 ` Peter Maydell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200317100423.622643-20-david@gibson.dropbear.id.au \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=clg@kaod.org \
    --cc=groug@kaod.org \
    --cc=lvivier@redhat.com \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).