From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
"Michael S . Tsirkin" <mst@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
qemu-devel@nongnu.org, Daniel Jordan <daniel.m.jordan@oracle.com>,
David Edmondson <david.edmondson@oracle.com>,
Auger Eric <eric.auger@redhat.com>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
Igor Mammedov <imammedo@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Joao Martins <joao.m.martins@oracle.com>
Subject: Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Date: Thu, 24 Jun 2021 10:22:11 +0100 [thread overview]
Message-ID: <YNROw6ATTRUlmHbU@work-vm> (raw)
In-Reply-To: <20210623132736.1c7b326a.alex.williamson@redhat.com>
* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>
> > On 6/22/21 10:16 PM, Alex Williamson wrote:
> > > On Tue, 22 Jun 2021 16:48:59 +0100
> > > Joao Martins <joao.m.martins@oracle.com> wrote:
> > >
> > >> Hey,
> > >>
> > >> This series lets QEMU properly spawn i386 guests with >= 1TB of memory with VFIO,
> > >> particularly when running on AMD systems with an IOMMU.
> > >>
> > >> Since Linux v5.4, VFIO validates whether the IOVA passed to the DMA_MAP ioctl is
> > >> valid and returns -EINVAL in those cases. On x86, Intel hosts aren't particularly
> > >> affected by this extra validation. But AMD systems with an IOMMU have a hole at
> > >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
> > >> at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
> > >> section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.
> > >>
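Just to make that window concrete: here's a rough, untested sketch of the
overlap check (the boundaries come from the paragraph above; the helper name
and the standalone program are invented purely for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* AMD HyperTransport reserved window just below 1TB (IOMMU spec, Table 3). */
#define AMD_HT_START 0xFD00000000ULL
#define AMD_HT_END   0xFFFFFFFFFFULL   /* inclusive */

/* True if [addr, addr + size) overlaps the reserved HT window. */
static bool overlaps_amd_ht_hole(uint64_t addr, uint64_t size)
{
    return size && addr <= AMD_HT_END && addr + size - 1 >= AMD_HT_START;
}

int main(void)
{
    /* IOVA and size taken from the failure log quoted further down. */
    uint64_t iova = 0x100000000ULL, len = 0xff30000000ULL;

    printf("overlaps HT hole: %d\n", overlaps_amd_ht_hole(iova, len));
    return 0;
}
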
> > >> VFIO DMA_MAP calls in this IOVA address range trip over this check and hence return
> > >> -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
> > >> of the failure:
> > >>
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1:
> > >> failed to setup container for group 258: memory listener initialization failed:
> > >> Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> > >>
> > >> Prior to v5.4 we could map using these IOVAs, *but* that's still not the right thing
> > >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> > >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155 [2]),
> > >> as documented in the links down below.
> > >>
> > >> This series tries to address that by dealing with this AMD-specific 1TB hole,
> > >> similarly to how we deal with the 4G hole today on x86 in general. It is split up
> > >> as follows:
> > >>
> > >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> > >> which is also used in other parts of pc/acpi besides MR creation. The
> > >> allowed IOVAs *only* change if it's an AMD host, so there is no change for
> > >> Intel. We walk the allowed ranges for memory above 4G and
> > >> add an E820_RESERVED entry every time we find a hole (which is at the
> > >> 1TB boundary).
> > >>
> > >> NOTE: For the purposes of this RFC I rely on cpuid in hw/i386/pc.c, but I
> > >> understand that it doesn't cover the non-x86 host case running TCG.
> > >>
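As a purely standalone illustration of that walk (the range values and names
below are invented; this is not the actual patch code, which presumably uses
QEMU's existing e820 machinery):

#include <inttypes.h>
#include <stdio.h>

/* Hypothetical stand-in for one allowed IOVA range above 4G (inclusive). */
struct iova_range { uint64_t start, end; };

/*
 * Walk the allowed ranges and print an E820-style reserved entry for every
 * gap between consecutive ranges; on an AMD host with an IOMMU that gap is
 * the HyperTransport window at the 1TB boundary.
 */
static void reserve_gaps(const struct iova_range *r, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        uint64_t gap_start = r[i].end + 1;
        uint64_t gap_end = r[i + 1].start - 1;

        if (gap_end >= gap_start) {
            printf("e820: [0x%012" PRIx64 " - 0x%012" PRIx64 "] reserved\n",
                   gap_start, gap_end);
        }
    }
}

int main(void)
{
    /* Allowed ranges on an AMD host: 4G up to the HT hole, then above it. */
    struct iova_range r[] = {
        { 0x100000000ULL,   0xFCFFFFFFFFULL },
        { 0x10000000000ULL, 0x7FFFFFFFFFFFULL },
    };

    reserve_gaps(r, 2);
    return 0;
}
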
> > >> Additionally, as an alternative to the hardcoded ranges we use today,
> > >> VFIO could advertise the platform's valid IOVA ranges without necessarily
> > >> requiring a PCI device to be added to the vfio container. That would mean
> > >> fetching the valid IOVA ranges from VFIO, rather than hardcoding them
> > >> as we do today. But sadly, that wouldn't work for older hypervisors.
> > >
> > >
> > > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > > 0x00000000fee00000 0x00000000feefffff msi
> > > 0x000000fd00000000 0x000000ffffffffff reserved
> > >
> > Yeap, I am aware.
> >
> > The VFIO advertising extension came up just because we already advertise the above info,
> > albeit behind a non-empty vfio container; we seem to use that, for example, in
> > collect_usable_iova_ranges().
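For what it's worth, that per-group information can already be read from
userspace with no container at all; here's a rough standalone sketch (not QEMU
code) that just dumps the same sysfs data as the grep above:

#include <glob.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    glob_t g;

    if (glob("/sys/kernel/iommu_groups/*/reserved_regions", 0, NULL, &g)) {
        return 1;
    }
    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        uint64_t start, end;
        char type[32];

        if (!f) {
            continue;
        }
        /* Each line is "<start> <end> <type>", addresses in hex. */
        while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                      &start, &end, type) == 3) {
            printf("%s: [0x%" PRIx64 ", 0x%" PRIx64 "] %s\n",
                   g.gl_pathv[i], start, end, type);
        }
        fclose(f);
    }
    globfree(&g);
    return 0;
}
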
>
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container. Each group might have unique ranges. A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>
> > > Ideally we might take that into account on all hosts, but of course
> > > then we run into massive compatibility issues when we consider
> > > migration. We run into similar problems when people try to assign
> > > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > > memory hole overlapping the msi range.
> > >
> > > The issue here is similar to trying to find a set of supported CPU
> > > flags across hosts: QEMU only has visibility of the host where it runs,
> > > so an upper-level tool needs to be able to pass through information about
> > > compatibility to all possible migration targets.
> >
> > I agree with your general sentiment (and idea), but are we sure this is really something as
> > dynamic, and needing a common denominator, as CPU features? The memory map looks to be deeply
> > embedded in the devices (ARM) or machine model (x86) that we pass in and doesn't change very
> > often. pc/q35 is one very good example, because it hasn't changed since its inception [a
> > decade?] (and this limitation is there only for multi-socket AMD machines with an IOMMU
> > and more than 1TB). Additionally, there might be architectural impositions, like on x86
> > where e.g. CMOS seems to tie in with memory above certain boundaries. Unless by migration
> > targets you also mean to cover migrating between Intel and AMD hosts (which may need to
> > keep the reserved range in the common denominator nonetheless).
>
> I like the flexibility that being able to specify reserved ranges would
> provide, but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize. Cross-vendor migration should be a consideration, and only
> an inter-system management policy could specify the importance of that.
On x86 at least, the cross vendor part doesn't seem to be an issue; I
wouldn't expect an Intel->AMD migration to work reliably anyway.
> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole. That
> would likely be the more traditional solution to this issue. Thanks,
To me this seems a combination of machine type + CPU model; perhaps what
we're looking at here is having a list of holes, which can be
contributed to by any of:
 a) The machine type
 b) The CPU model
 c) An extra command line option, like you list
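None of that exists today, of course; just to sketch the shape of the idea
(all of the names below are invented):

#include <inttypes.h>
#include <stdio.h>

/* Hypothetical description of one reserved hole in guest physical space. */
typedef struct GPAHole {
    uint64_t start, len;
    const char *source;     /* "machine-type", "cpu-model" or "user" */
} GPAHole;

#define MAX_HOLES 16

static GPAHole holes[MAX_HOLES];
static int nr_holes;

static void add_hole(uint64_t start, uint64_t len, const char *source)
{
    if (nr_holes < MAX_HOLES) {
        holes[nr_holes++] = (GPAHole){ start, len, source };
    }
}

int main(void)
{
    /* a) machine type: the classic sub-4G PCI hole (size illustrative). */
    add_hole(0xC0000000ULL, 0x40000000ULL, "machine-type");
    /* b) CPU model: the AMD HyperTransport window just below 1TB (12G). */
    add_hole(0xFD00000000ULL, 0x300000000ULL, "cpu-model");
    /* c) a hypothetical extra command line option supplied by the user. */
    add_hole(0x2000000000ULL, 0x100000000ULL, "user");

    for (int i = 0; i < nr_holes; i++) {
        printf("%-12s hole at 0x%012" PRIx64 ", len 0x%" PRIx64 "\n",
               holes[i].source, holes[i].start, holes[i].len);
    }
    return 0;
}

Whatever actually builds the memory map (and the e820/SRAT tables) would then
just take the union of that list.
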
Dave
> Alex
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Thread overview: 38+ messages
2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
2021-06-23 7:11 ` Igor Mammedov
2021-06-23 9:37 ` Joao Martins
2021-06-23 11:39 ` Igor Mammedov
2021-06-23 13:04 ` Joao Martins
2021-06-28 14:32 ` Igor Mammedov
2021-08-06 10:41 ` Joao Martins
2021-06-23 9:03 ` Igor Mammedov
2021-06-23 9:51 ` Joao Martins
2021-06-23 12:09 ` Igor Mammedov
2021-06-23 13:07 ` Joao Martins
2021-06-28 13:25 ` Igor Mammedov
2021-06-28 13:43 ` Joao Martins
2021-06-28 15:21 ` Igor Mammedov
2021-06-24 9:32 ` Dr. David Alan Gilbert
2021-06-28 14:42 ` Igor Mammedov
2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
2021-06-23 12:30 ` Igor Mammedov
2021-06-23 13:22 ` Joao Martins
2021-06-28 15:37 ` Igor Mammedov
2021-06-23 16:33 ` Laszlo Ersek
2021-06-25 17:19 ` Joao Martins
2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
2021-06-23 9:18 ` Igor Mammedov
2021-06-23 9:59 ` Joao Martins
2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
2021-06-23 7:40 ` David Edmondson
2021-06-23 19:13 ` Alex Williamson
2021-06-23 9:30 ` Joao Martins
2021-06-23 11:58 ` Igor Mammedov
2021-06-23 13:15 ` Joao Martins
2021-06-23 19:27 ` Alex Williamson
2021-06-24 9:22 ` Dr. David Alan Gilbert [this message]
2021-06-25 16:54 ` Joao Martins