From: Joao Martins <joao.m.martins@oracle.com>
To: qemu-devel@nongnu.org
Cc: Eduardo Habkost <eduardo@habkost.net>,
"Michael S . Tsirkin" <mst@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
Daniel Jordan <daniel.m.jordan@oracle.com>,
David Edmondson <david.edmondson@oracle.com>,
Alex Williamson <alex.williamson@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>, Ani Sinha <ani@anisinha.ca>,
Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
Igor Mammedov <imammedo@redhat.com>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject: Re: [PATCH v4 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Date: Tue, 10 May 2022 10:43:27 +0100 [thread overview]
Message-ID: <d4f9ce8f-42df-e06c-e9dc-8b6ff12f2e5d@oracle.com> (raw)
In-Reply-To: <20220420201138.23854-1-joao.m.martins@oracle.com>
On 4/20/22 21:11, Joao Martins wrote:
> v3[4] -> v4:
> (changes in patch 4 and 5 only)
> * Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
> * Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
> * Squash previous patch 5 into patch 4 to tie in the phys-bits check
> into the relocate-4g-start logic: We now error out if the phys-bits
> aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
> * Make the error message more explicit when phys-bits isn't enough to also
> mention: "cannot avoid AMD HT range"
> * Add comments inside x86_update_above_4g_mem_start() explaining the
> logic behind it. (Michael Tsirkin)
> * Tested on old guests old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
> alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin)
> Validated -numa topologies too as well as making sure qtests observe no regressions;
>
> Notes:
>
> * the machine attribute that enables this new logic (see last patch)
> is called ::enforce_valid_iova since the RFC. Let me know if folks think it
> is poorly named, and whether something a bit more obvious is preferred
> (e.g. ::amd_relocate_1t).
>
> * @mst one of the comments you said was to add "host checks" in vdpa/vfio devices.
> In discussion with Alex and you over the last version of the patches it seems
> that we weren't keen on making this device-specific or behind any machine
> property flags (besides machine-compat). Just to reiterate there, making sure we do
> the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
> vendor (as this series) already ensures thtat this is going to be right for
> offending configuration with VDPA/VFIO device that might be
> configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
> start poking into machine-specific details when we fail to relocate due to the
> lack of phys-bits? Otherwise Qemu, just doesn't have enough information to tell
> what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
> that needs fixing (as VFIO did in v5.4).
>
Ping -- any further comments on this series?
> ---
>
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
>
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
>
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
> failed to setup container for group 258: memory listener initialization failed:
> Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
>
> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as following:
>
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
> address of the 4G boundary
>
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> to get accessing to pci_hole64_size. The actual pci-host
> initialization is kept as is, only the qdev_new.
>
> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> possible address acrosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
>
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
>
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_BITS). Today there's already a precedent to depend on the user to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios.
>
> Additionally, the reserved region is added to E820 if the relocation is done.
>
> Alternative options considered (in RFC[0]):
>
> a) Dealing with the 1T hole like the 4G hole -- which also represents what
> hardware closely does.
>
> Thanks,
> Joao
>
> Older Changelog,
>
> RFCv2[3] -> v3[4]:
>
> * Add missing brackets in single line statement, in patch 5 (David)
> * Change ranges printf to use PRIx64, in patch 5 (David)
> * Move the check to after changing above_4g_mem_start, in patch 5 (David)
> * Make the check generic and move it to pc_memory_init rather being specific
> to AMD, as the check is useful to capture invalid phys-bits
> configs (patch 5, Igor).
> * Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
> * Consider pci_hole64_size in patch 4 (Igor)
> * To consider pci_hole64_size in max used addr we need to get it from pci-host,
> so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
> qdev_new("q35") to be before pc_memory_init().
> * Consider sgx_epc.size in max used address, in patch 4 (Igor)
> * Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
> * Keep warn_report() in patch 5, as erroring out will break a few x86_64 qtests
> due to pci_hole64 accounting surprass phys-bits possible maxphysaddr.
>
> RFC[0] -> RFCv2[3]:
>
> * At Igor's suggestion in one of the patches I reworked the series enterily,
> and more or less as he was thinking it is far simpler to relocate the
> ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
> and less intrusive. (patch 1 & 2)
> * Check phys-bits is big enough prior to relocating (new patch 3)
> * Remove the machine property, and it's only internal and set by new machine
> version (Igor, patch 4).
> * Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2)
> * Add IOMMU SDM in the commit message (Igor, patch 2)
>
> [0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> [3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
> [4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
>
> Joao Martins (5):
> hw/i386: add 4g boundary start to X86MachineState
> i386/pc: create pci-host qdev prior to pc_memory_init()
> i386/pc: pass pci_hole64_size to pc_memory_init()
> i386/pc: relocate 4g start to 1T where applicable
> i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
> type
>
> hw/i386/acpi-build.c | 2 +-
> hw/i386/pc.c | 126 +++++++++++++++++++++++++++++++++--
> hw/i386/pc_piix.c | 12 +++-
> hw/i386/pc_q35.c | 14 +++-
> hw/i386/sgx.c | 2 +-
> hw/i386/x86.c | 1 +
> hw/pci-host/i440fx.c | 10 ++-
> include/hw/i386/pc.h | 4 +-
> include/hw/i386/x86.h | 3 +
> include/hw/pci-host/i440fx.h | 3 +-
> 10 files changed, 161 insertions(+), 16 deletions(-)
>
prev parent reply other threads:[~2022-05-10 10:09 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-20 20:11 [PATCH v4 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-04-20 20:11 ` [PATCH v4 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-04-20 20:11 ` [PATCH v4 2/5] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
2022-04-20 20:11 ` [PATCH v4 3/5] i386/pc: pass pci_hole64_size " Joao Martins
2022-04-20 20:11 ` [PATCH v4 4/5] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-05-13 12:33 ` Michael S. Tsirkin
2022-05-13 15:06 ` Joao Martins
2022-05-13 18:28 ` Joao Martins
2022-04-20 20:11 ` [PATCH v4 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
2022-05-10 9:43 ` Joao Martins [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d4f9ce8f-42df-e06c-e9dc-8b6ff12f2e5d@oracle.com \
--to=joao.m.martins@oracle.com \
--cc=alex.williamson@redhat.com \
--cc=ani@anisinha.ca \
--cc=daniel.m.jordan@oracle.com \
--cc=david.edmondson@oracle.com \
--cc=eduardo@habkost.net \
--cc=imammedo@redhat.com \
--cc=marcel.apfelbaum@gmail.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=suravee.suthikulpanit@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).