From: David Gibson <david@gibson.dropbear.id.au>
To: Andrea Bolognani <abologna@redhat.com>
Cc: groug@kaod.org, aik@ozlabs.ru, qemu-ppc@nongnu.org,
	qemu-devel@nongnu.org, clg@kaod.org
Subject: Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
Date: Thu, 26 Apr 2018 10:55:55 +1000
Message-ID: <20180426005555.GA8800@umbus.fritz.box>
In-Reply-To: <1524672566.23669.15.camel@redhat.com>

On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > Is the 16 MiB page size available for both POWER8 and POWER9?
> > 
> > No.  That's a big part of what makes this such a mess.  HPT has 16MiB
> > and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
> > technically Power9 does have 16MiB pages - but only in hash mode, which
> > the host won't be).
> > 
> [...]
> > > > This does mean, for example, that if
> > > > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > > > work correctly (in practice it would act identically to setting it to
> > > > 16).
> > > 
> > > Wouldn't that lead to different behavior depending on whether you
> > > start the guest on a POWER9 or POWER8 machine? The former would be
> > > able to use 2 MiB hugepages, while the latter would be stuck using
> > > regular 64 KiB pages.
> > 
> > Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
> > mode it'd be able to use 2MiB hugepages either way, because the
> > limitations only apply to HPT mode.
> > 
> > > Migration of such a guest from POWER9 to
> > > POWER8 wouldn't work because the hugepage allocation couldn't be
> > > fulfilled,
> > 
> > Sort of, you couldn't even get as far as starting the incoming qemu
> > with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
> > backing).
> > 
> > > but the other way around would probably work and lead to
> > > different page sizes being available inside the guest after a power
> > > cycle, no?
> > 
> > Well.. there are a few cases here.  If you migrated p8 -> p9 with
> > hpt-mps=21 on both ends, you couldn't actually start the guest on the
> > source without giving it hugepage backing.  In which case it'll be
> > fine on the p9 with hugepage mapping.
> > 
> > If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
> > well, you don't get to count on anything because you changed the VM
> > definition.  In fact it would work in this case, and you wouldn't even
> > get new page sizes after restart because HPT mode doesn't support any
> > pagesizes between 64kiB and 16MiB.
> > 
> > > > > I guess 34 corresponds to 1 GiB hugepages?
> > > > 
> > > > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > > > machines.  It's a simple shift: (1 << 34) == 16 GiB.  1GiB pages would
> > > > be 30 (but wouldn't let the guest use anything larger than 24, i.e.
> > > > 16 MiB, in practice).
> > > 
> > > Isn't 1 GiB hugepages support at least being worked on[1]?
> > 
> > That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.
> 
> So, I've spent some more time trying to wrap my head around the
> whole ordeal.  I'm still unclear about some of the details, though;
> hopefully you'll be willing to answer a few more questions.
> 
> Basically the only page sizes you can have for HPT guests are
> 4 KiB, 64 KiB, 16 MiB and 16 GiB; in each case, for KVM, you need
> the guest memory to be backed by host pages which are at least as
> big, or it won't work. The same limitation doesn't apply to either
> RPT or TCG guests.

That's right.  The limitation also doesn't apply to KVM PR, just KVM
HV.

[If you're interested, the reason for the limitation is that unlike
 x86 or POWER9 there aren't separate sets of gva->gpa and gpa->hpa
 pagetables. Instead there's just a single gva->hpa (hash) pagetable
 that's managed by the _host_.  When the guest wants to create a new
 mapping it uses an hcall to insert a PTE, and the hcall
 implementation translates the gpa into an hpa before inserting it
 into the HPT.  The full contents of the real HPT aren't visible to
 the guest, but the precise slot numbers within it are, so the
 assumption that there's an exact 1:1 correspondence between guest
 PTEs and host PTEs is pretty much baked into the PAPR interface.  So,
 if a hugepage is to be inserted into the guest HPT, then it's also
 being inserted into the host HPT, and needs to be really, truly host
 contiguous]
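
[Purely to illustrate, here's a rough C sketch of what an hcall along
 those lines conceptually does on the host side.  The function and
 helper names are all invented for illustration; this is not the
 actual KVM code:

    /* Guest asks the host to insert a PTE into the single,
     * host-managed hash page table.  The guest picks the slot, so
     * guest and host entries stay exactly 1:1.
     * translate_gpa_to_hpa() and insert_hpte() are made-up helpers. */
    long h_enter_sketch(struct guest *g, unsigned long slot,
                        unsigned long gpa, unsigned long flags)
    {
        unsigned long hpa;

        /* gpa -> hpa; for a hugepage PTE the memory backing this
         * range must really be one host-contiguous hugepage */
        if (translate_gpa_to_hpa(g, gpa, flags, &hpa) < 0)
            return H_PARAMETER;

        return insert_hpte(g->hpt, slot, hpa, flags);
    }
]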

> The new parameter would make it possible to make sure you will
> actually be able to use the page size you're interested in inside
> the guest, by preventing it from starting at all if the host didn't
> provide big enough backing pages;

That's right

> it would also ensure the guest
> gets access to different page sizes when running using TCG as an
> accelerator instead of KVM.

Uh.. it would ensure the guest *doesn't* get access to different page
sizes in TCG vs. KVM.  Is that what you meant to say?

> For a KVM guest running on a POWER8 host, the matrix would look
> like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
>    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> with backing page sizes from top to bottom, requested max page
> sizes from left to right, actual max page sizes in the cells and
> empty cells meaning the guest won't be able to start; on a POWER9
> machine, the matrix would look like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     2 MiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
> 
> instead, and finally on TCG the backing page size wouldn't matter
> and you would simply have
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> Does everything up until here make sense?

Yes, that all looks right.
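
For what it's worth, the rule behind all three tables can be written
down mechanically.  A sketch in C, with page sizes expressed as shifts
the way hpt-mps does ((1 << 16) == 64 KiB, (1 << 21) == 2 MiB,
(1 << 24) == 16 MiB, (1 << 30) == 1 GiB, (1 << 34) == 16 GiB); the
helper names are mine, not anything in the tree:

    #include <stdbool.h>

    /* Above 4 KiB, HPT mode only has 64 KiB, 16 MiB and 16 GiB, so a
     * requested maximum rounds down to one of these shifts */
    static const int hpt_shifts[] = { 16, 24, 34 };

    static int effective_max(int requested_shift)
    {
        int best = -1;
        for (int i = 0; i < 3; i++)
            if (hpt_shifts[i] <= requested_shift)
                best = hpt_shifts[i];
        return best;
    }

    /* Max page size shift the guest sees, or -1 for the empty cells
     * (guest refuses to start).  The backing page size only matters
     * for KVM HV, not for TCG (or KVM PR). */
    static int guest_max_pagesize(int requested_shift,
                                  int backing_shift, bool kvm_hv)
    {
        int eff = effective_max(requested_shift);
        if (kvm_hv && backing_shift < eff)
            return -1;
        return eff;
    }

Running that over your rows reproduces all three tables, including
the empty cells.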

> While trying to figure this out, one of the things I attempted to
> do was run a guest in POWER8 compatibility mode on a POWER9 host
> and use hugepages for backing, but that didn't seem to work at
> all, possibly hinting at the fact that not all of the above is
> actually accurate and I need you to correct me :)
> 
> This is the command line I used:
> 
>   /usr/libexec/qemu-kvm \
>   -machine pseries,accel=kvm \
>   -cpu host,compat=power8 \
>   -m 2048 \
>   -mem-prealloc \
>   -mem-path /dev/hugepages \
>   -smp 8,sockets=8,cores=1,threads=1 \
>   -display none \
>   -no-user-config \
>   -nodefaults \
>   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=vda \
>   -drive file=/var/lib/libvirt/images/huge.qcow2,format=qcow2,if=none,id=vda \
>   -serial mon:stdio

Ok, so note that the scheme I'm talking about here is *not* merged as
yet.  The above command line will run the guest with 2MiB backing.

With the existing code that should work, but the guest will only be
able to use 64kiB pages.  If it didn't work at all, you may have hit
a bug, fixed relatively recently, that broke all hugepage backing;
try updating to a more recent host kernel.
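
[In case it helps to see the existing behaviour spelled out: with the
 current code the effect of the backing page size is a filter, i.e.
 every guest HPT page size bigger than the backing page is dropped,
 which with 2MiB backing leaves just 4 KiB and 64 KiB.  A sketch of
 the idea only; the helper is invented, not the real QEMU code:

    /* Drop guest page size shifts too big for the backing pages */
    static void filter_pagesizes_sketch(int *shifts, int *n,
                                        int backing_shift)
    {
        int kept = 0;
        for (int i = 0; i < *n; i++)
            if (shifts[i] <= backing_shift)
                shifts[kept++] = shifts[i];
        *n = kept;    /* 2 MiB backing: only {12, 16} survive */
    }
]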

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
