* [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gordan Bobic @ 2015-03-03 10:32 UTC
To: qemu-devel
I need to pass a custom e820 map to a virtual machine for
troubleshooting purposes and working around IOMMU hardware
bugs.
I have found references to a custom map being providable
via an external file, mentioned as "etc/e820" and "fw_cfg".
Unfortunately, I have not found any documentation that
explains how to use this from userspace when invoking
qemu. Can anybody point me in the right direction?
What is the exact format of this e820 map file and how
do I tell qemu to use it (and where to find it) when
initializing the guest environment?
Many thanks.
Gordan
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gerd Hoffmann @ 2015-03-04 13:20 UTC
To: Gordan Bobic; +Cc: qemu-devel
On Tue, 2015-03-03 at 10:32 +0000, Gordan Bobic wrote:
> I need to pass a custom e820 map to a virtual machine for
> troubleshooting purposes and working around IOMMU hardware
> bugs.
>
> I have found references to a custom map being providable
> via an external file, mentioned as "etc/e820" and "fw_cfg".
That is the (filesystem-like) interface between qemu and firmware
(seabios usually), it doesn't refer to an on-disk file.
> Unfortunately, I have not found any documentation that
> explains how to use this from userspace when invoking
> qemu.
You can't.
Passing a different e820 map requires patching qemu (or seabios, which
mangles the e820 table to add reservations for acpi etc).
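On the format question, a minimal sketch of how an e820 table is commonly laid out, assuming 20-byte entries (64-bit address, 64-bit length, 32-bit type) stored little-endian. The struct and function names are illustrative and do not come from this thread, and the exact layout qemu exposes as "etc/e820" should be checked against the qemu source. It only illustrates the data; as stated above, qemu has no option for loading such a table from a file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional e820 type codes. */
enum { E820_RAM = 1, E820_RESERVED = 2 };

/* Assumed on-the-wire size of one entry: 8 + 8 + 4 bytes, no padding. */
#define E820_ENTRY_SIZE 20u

/* In-memory form of one entry (illustrative, not qemu's definition). */
struct e820_entry {
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

/* Store the low n bytes of v at p, least significant byte first. */
static void put_le(uint8_t *p, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/*
 * Serialize count entries into out as packed little-endian records.
 * Returns the number of bytes written, or 0 if out is too small
 * (or the table is empty).
 */
static size_t e820_serialize(const struct e820_entry *tab, size_t count,
                             uint8_t *out, size_t out_len)
{
    assert(tab != NULL && out != NULL);
    if (count > out_len / E820_ENTRY_SIZE) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        uint8_t *p = out + i * E820_ENTRY_SIZE;
        put_le(p, tab[i].address, 8);
        put_le(p + 8, tab[i].length, 8);
        put_le(p + 16, tab[i].type, 4);
    }
    return count * E820_ENTRY_SIZE;
}
```

Serializing field by field avoids depending on compiler-specific struct packing or on host byte order.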
What exactly do you need?
cheers,
Gerd
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gordan Bobic @ 2015-03-04 19:12 UTC
On 2015-03-04 13:20, Gerd Hoffmann wrote:
> On Tue, 2015-03-03 at 10:32 +0000, Gordan Bobic wrote:
>> I need to pass a custom e820 map to a virtual machine for
>> troubleshooting purposes and working around IOMMU hardware
>> bugs.
>>
>> I have found references to a custom map being providable
>> via an external file, mentioned as "etc/e820" and "fw_cfg".
>
> That is the (filesystem-like) interface between qemu and firmware
> (seabios usually), it doesn't refer to an on-disk file.
>
>> Unfortunately, I have not found any documentation that
>> explains how to use this from userspace when invoking
>> qemu.
>
> You can't.
>
> Passing a different e820 map requires patching qemu (or seabios, which
> mangles the e820 table to add reservations for acpi etc).
>
> What exactly do you need?
Thank you for responding. The situation I have is that my PCIe
bridges are buggy and they seem to bypass the upstream PCIe hub
IOMMU. The problem with this is that when the guest accesses
RAM within its emulated address space that overlaps with
PCI I/O memory ranges in the host's address space, what should
have ended up in RAM in the guest ends up trampling over the
IOMEM on the host. This typically results in crashing the
host (or worse, if it happens to trample any IOMEM regions
mapped to disk controllers).
The solution seems to be to prevent the guest from accessing
the areas of memory that are mapped as something other than
RAM on the host.
So what I need to be able to do is set a baseline e820 map
that marks all areas as reserved if they are not marked
as usable on the host.
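A minimal sketch of that idea, assuming the host's usable ranges are already available as a sorted, non-overlapping list: everything inside [0, limit) that is not usable on the host becomes a candidate reserved range. The type and function names are hypothetical, and how the host ranges are obtained is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Given host-usable ranges (sorted by start, non-overlapping), write the
 * gaps inside [0, limit) to out. These gaps are the ranges a guest map
 * would mark as reserved. Returns the number of gaps written; gaps beyond
 * out_cap are silently dropped.
 */
static size_t reserved_gaps(const struct range *usable, size_t n,
                            uint64_t limit,
                            struct range *out, size_t out_cap)
{
    size_t count = 0;
    uint64_t cursor = 0;   /* lowest address not yet classified */

    assert(out != NULL);
    for (size_t i = 0; i < n && cursor < limit; i++) {
        uint64_t s = usable[i].start < limit ? usable[i].start : limit;
        uint64_t e = usable[i].end < limit ? usable[i].end : limit;

        if (s > cursor && count < out_cap) {
            out[count++] = (struct range){ cursor, s };
        }
        if (e > cursor) {
            cursor = e;
        }
    }
    if (cursor < limit && count < out_cap) {
        out[count++] = (struct range){ cursor, limit };
    }
    return count;
}
```

With hypothetical usable ranges [0, 0x9F000) and [0x100000, 0x60000000) and a limit of 4 GiB, this yields two gaps: [0x9F000, 0x100000) and [0x60000000, 0x100000000).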
I wrote a prototype patch (an ugly bodge not for public
consumption) for Xen to test the theory of whether this
would fix the problem, and it did. But I would like to
use KVM now instead. I tried using the max-ram-below-4g
option to --machine, and that fixes a part of the problem,
but because it doesn't mark the memory between the set
value and 4GB as reserved, it ends up mapping the PCI
devices passed through to the guest into that area, which
similarly ends up trampling over the host's IOMEM area
and crashing the machine. So I need a way to explicitly
reserve certain memory ranges in the map.
What is the most sensible way to do this with QEMU?
Gordan
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gerd Hoffmann @ 2015-03-05 8:08 UTC
On Wed, 2015-03-04 at 19:12 +0000, Gordan Bobic wrote:
> On 2015-03-04 13:20, Gerd Hoffmann wrote:
> > On Tue, 2015-03-03 at 10:32 +0000, Gordan Bobic wrote:
> >> I need to pass a custom e820 map to a virtual machine for
> >> troubleshooting purposes and working around IOMMU hardware
> >> bugs.
> >>
> >> I have found references to a custom map being providable
> >> via an external file, mentioned as "etc/e820" and "fw_cfg".
> >
> > That is the (filesystem-like) interface between qemu and firmware
> > (seabios usually), it doesn't refer to an on-disk file.
> >
> >> Unfortunately, I have not found any documentation that
> >> explains how to use this from userspace when invoking
> >> qemu.
> >
> > You can't.
> >
> > Passing a different e820 map requires patching qemu (or seabios, which
> > mangles the e820 table to add reservations for acpi etc).
> >
> > What exactly do you need?
>
> Thank you for responding. The situation I have is that my PCIe
> bridges are buggy and they seem to bypass the upstream PCIe hub
> IOMMU. The problem with this is that when the guest accesses
> RAM within its emulated address space that overlaps with
> PCI I/O memory ranges in the host's address space, what should
> have ended up in RAM in the guest ends up trampling over the
> IOMEM on the host.
The iommu isn't involved here at all. When the pci devices are
accessing host ram via busmaster dma, *this* goes through the iommu.
And unless you are trying to use pci device assignment the iommu should
not matter at all.
What you describe sounds more like a bug in ept/npt/softmmu (either
kernel driver or hardware). What machine is this? Intel? Has it ept?
What happens if you turn off ept?
I'd also suggest to go to the kvm list with this issue.
cheers,
Gerd
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gordan Bobic @ 2015-03-05 10:18 UTC
On 2015-03-05 08:08, Gerd Hoffmann wrote:
> On Wed, 2015-03-04 at 19:12 +0000, Gordan Bobic wrote:
>> On 2015-03-04 13:20, Gerd Hoffmann wrote:
>> > On Tue, 2015-03-03 at 10:32 +0000, Gordan Bobic wrote:
>> >> I need to pass a custom e820 map to a virtual machine for
>> >> troubleshooting purposes and working around IOMMU hardware
>> >> bugs.
>> >>
>> >> I have found references to a custom map being providable
>> >> via an external file, mentioned as "etc/e820" and "fw_cfg".
>> >
>> > That is the (filesystem-like) interface between qemu and firmware
>> > (seabios usually), it doesn't refer to an on-disk file.
>> >
>> >> Unfortunately, I have not found any documentation that
>> >> explains how to use this from userspace when invoking
>> >> qemu.
>> >
>> > You can't.
>> >
>> > Passing a different e820 map requires patching qemu (or seabios, which
>> > mangles the e820 table to add reservations for acpi etc).
>> >
>> > What exactly do you need?
>>
>> Thank you for responding. The situation I have is that my PCIe
>> bridges are buggy and they seem to bypass the upstream PCIe hub
>> IOMMU. The problem with this is that when the guest accesses
>> RAM within its emulated address space that overlaps with
>> PCI I/O memory ranges in the host's address space, what should
>> have ended up in RAM in the guest ends up trampling over the
>> IOMEM on the host.
>
> The iommu isn't involved here at all. When the pci devices are
> accessing host ram via busmaster dma, *this* goes through the iommu.
> And unless you are trying to use pci device assignment the iommu should
> not matter at all.
I am using PCI device assignment. I'm passing PCI devices to the
guest VM.
> What you describe sounds more like a bug in ept/npt/softmmu (either
> kernel driver or hardware). What machine is this? Intel? Has it ept?
> What happens if you turn off ept?
It's an EVGA SR-2 (Intel Nehalem, 5520 NB), and I am 99% certain
the problem is related to the Nvidia NF200 PCIe multiplexer bridges.
Similar problems seem to have been reported by other people with
different motherboards that have NF200 bridges. The workaround is
usually to put the passthrough GPU on a slot that isn't behind the
NF200, but in my case that is not possible because all 7 PCIe slots
are behind the NF200 bridges.
> I'd also suggest to go to the kvm list with this issue.
I'm pretty sure I am dealing with a hardware bug here. I have
a workaround that I know works (mark the host's IOMEM areas
as reserved) - I just need a way to get QEMU to adjust the
exposed e820 map accordingly. I will try disabling EPT and
see if that helps, but my understanding is that there is a
hefty penalty involved, which wouldn't be incurred if I
were to simply have reserved holes in the memory at the
appropriate ranges, hence why the latter solution would
be greatly preferable.
My bodge test patch for Xen's hvmloader simply marked the
entire memory range between the first and last IOMEM mapped
address on the host as reserved (essentially everything between
1.5GB and 4GB). This results in a fully working system but
because this wasn't plumbed in everywhere else it needs to be
plumbed in, the net result is that up to 2.5GB of RAM go missing
in each VM (i.e. memory is marked as reserved but not being made
into a hole - like I said it was a quick and dirty bodge to prove
that it would fix the problem).
I am currently using OVMF for the guest, and I have a bootable
system (Windows 7 guest) that works OK initially, but any access
to the indirect BARs (as soon as anything requiring DirectX
happens) results in the entire host locking up solid (I suspect
that one of the virtual BARs overlaps a physical BAR).
The question is - if a convenient hook for e820 reservations
functionality does not currently exist, would the best place to
add the e820 reservations be to patch it into QEMU, or OVMF/EDK2,
or somewhere else entirely?
Gordan
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gerd Hoffmann @ 2015-03-05 10:42 UTC
Hi,
> >> Thank you for responding. The situation I have is that my PCIe
> >> bridges are buggy and they seem to bypass the upstream PCIe hub
> >> IOMMU. The problem with this is that when the guest accesses
> >> RAM within its emulated address space that overlaps with
> >> PCI I/O memory ranges in the host's address space, what should
> >> have ended up in RAM in the guest ends up trampling over the
> >> IOMEM on the host.
> >
> > The iommu isn't involved here at all. When the pci devices are
> > accessing host ram via busmaster dma, *this* goes through the iommu.
> > And unless you are trying to use pci device assignment the iommu should
> > not matter at all.
>
> I am using PCI device assignment. I'm passing PCI devices to the
> guest VM.
Oh. I didn't expect someone trying to use device assign with a
known-broken iommu. /me looks surprised.
> I'm pretty sure I am dealing with a hardware bug here. I have
> a workaround that I know works (mark the host's IOMEM areas
> as reserved) - I just need a way to get QEMU to adjust the
> exposed e820 map accordingly.
Add "e820_add_entry(start, size, E820_RESERVED)" calls in qemu.
Also make sure the firmware doesn't use those ranges, which may need
firmware patching. At least seabios should happily add those
reservations to the e820 map, but will not look at them otherwise, so
you could end up with pci bars being mapped within the reserved regions.
The linux kernel might fix it up at boot though.
Not fully sure how OVMF behaves here.
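A minimal sketch of the check this implies, assuming the reserved regions and the BAR placements chosen by the firmware are known as (start, size) pairs: a BAR is only safe if it overlaps none of the reservations. The names here are hypothetical and this is not seabios or OVMF code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address region given as start and size in bytes. */
struct region {
    uint64_t start;
    uint64_t size;
};

/* True if the two regions share at least one byte. Assumes that
 * start + size does not wrap around 2^64. */
static bool regions_overlap(struct region a, struct region b)
{
    if (a.size == 0 || b.size == 0) {
        return false;
    }
    return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* Returns the index of the first reserved region the BAR overlaps,
 * or -1 if the BAR lies entirely outside all reservations. */
static int bar_hits_reserved(struct region bar,
                             const struct region *reserved, size_t n)
{
    assert(reserved != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (regions_overlap(bar, reserved[i])) {
            return (int)i;
        }
    }
    return -1;
}
```

For example, with a hypothetical reservation covering 0x60000000 to 4 GiB, a BAR at 0xE0000000 would be flagged, while one placed at or above 4 GiB would not.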
cheers,
Gerd
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gordan Bobic @ 2015-03-05 11:01 UTC
On 2015-03-05 10:42, Gerd Hoffmann wrote:
> Hi,
>
>> >> Thank you for responding. The situation I have is that my PCIe
>> >> bridges are buggy and they seem to bypass the upstream PCIe hub
>> >> IOMMU. The problem with this is that when the guest accesses
>> >> RAM within its emulated address space that overlaps with
>> >> PCI I/O memory ranges in the host's address space, what should
>> >> have ended up in RAM in the guest ends up trampling over the
>> >> IOMEM on the host.
>> >
>> > The iommu isn't involved here at all. When the pci devices are
>> > accessing host ram via busmaster dma, *this* goes through the iommu.
>> > And unless you are trying to use pci device assignment the iommu should
>> > not matter at all.
>>
>> I am using PCI device assignment. I'm passing PCI devices to the
>> guest VM.
>
> Oh. I didn't expect someone trying to use device assign with a
> known-broken iommu. /me looks surprised.
Since all I have is lemons I'm trying to make lemonade. :)
>> I'm pretty sure I am dealing with a hardware bug here. I have
>> a workaround that I know works (mark the host's IOMEM areas
>> as reserved) - I just need a way to get QEMU to adjust the
>> exposed e820 map accordingly.
>
> Add "e820_add_entry(start, size, E820_RESERVED)" calls in qemu.
Could you please point me at the correct file/function to add
the relevant block into?
I would probably look to add these based on a config file
in /etc/qemu/. Happy to forward a patch for inclusion if I
manage to make it work.
> Also make sure the firmware doesn't use those ranges, which may need
> firmware patching. At least seabios should happily add those
> reservations to the e820 map, but will not look at them otherwise, so
> you could end up with pci bars being mapped within the reserved
> regions.
Are you saying that seabios will find reserved areas in the e820
map and despite that map a BAR into a reserved block? That's pretty
broken...
> The linux kernel might fix it up at boot though.
If you mean inside the VM, Linux-on-Linux isn't my intended use case,
though.
> Not fully sure how OVMF behaves here.
Thanks for your input. I'll find an appropriate place to ask
about OVMF once I have the QEMU patched appropriately.
Gordan
* Re: [Qemu-devel] QEMU etc/e820 and fw_cfg
From: Gerd Hoffmann @ 2015-03-05 11:17 UTC
Hi,
> > Add "e820_add_entry(start, size, E820_RESERVED)" calls in qemu.
>
> Could you please point me at the correct file/function to add
> the relevant block into?
There are already calls (in hw/i386/pc.c I think), adding
entries for RAM. I'd try to place the code nearby, especially as you
might change the ram code too to avoid ram being allocated for the
reserved areas.
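A rough sketch of what such a change could look like, outside of qemu. The e820_add_entry() below is a stand-in that just records entries in an array; its signature is assumed to resemble the qemu function, and add_ram_with_hole() is a hypothetical helper, not existing qemu code. It reports RAM only outside the reserved hole, so no RAM entry covers the reservation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2
#define MAX_ENTRIES   16

struct entry {
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

static struct entry table[MAX_ENTRIES];
static size_t table_len;

/* Stand-in for qemu's e820_add_entry(): appends to a local table.
 * Returns the new entry count, or -1 if the table is full. */
static int e820_add_entry(uint64_t address, uint64_t length, uint32_t type)
{
    if (table_len == MAX_ENTRIES) {
        return -1;
    }
    table[table_len++] = (struct entry){ address, length, type };
    return (int)table_len;
}

/* Hypothetical helper: describe RAM in [ram_start, ram_end) with the
 * hole [hole_start, hole_end) cut out and reported as reserved. */
static void add_ram_with_hole(uint64_t ram_start, uint64_t ram_end,
                              uint64_t hole_start, uint64_t hole_end)
{
    assert(ram_start <= ram_end && hole_start < hole_end);

    /* RAM below the hole, if any. */
    uint64_t low_end = hole_start < ram_end ? hole_start : ram_end;
    if (low_end > ram_start) {
        e820_add_entry(ram_start, low_end - ram_start, E820_RAM);
    }

    /* The hole itself is always reported as reserved. */
    e820_add_entry(hole_start, hole_end - hole_start, E820_RESERVED);

    /* RAM above the hole, if any. */
    uint64_t high_start = hole_end > ram_start ? hole_end : ram_start;
    if (ram_end > high_start) {
        e820_add_entry(high_start, ram_end - high_start, E820_RAM);
    }
}
```

Note that this only shapes the map. The RAM that would have fallen inside the hole is simply not described, so the real RAM allocation code would still have to be changed to avoid backing that range.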
> > Also make sure the firmware doesn't use those ranges, which may need
> > firmware patching. At least seabios should happily add those
> > reservations to the e820 map, but will not look at them otherwise, so
> > you could end up with pci bars being mapped within the reserved
> > regions.
>
> Are you saying that seabios will find reserved areas in the e820
> map and despite that map a BAR into a reserved block?
It just copies over the entries, from qemu firmware interface to guest
ram, so the OS (linux/windows/whatever) can see the reservations.
> That's pretty
> broken...
There was no need so far to implement something more advanced in
seabios.
Another option is using coreboot as firmware. coreboot resource
management is a lot more powerful. It has to run on real hardware not
only qemu, so it needs to be able to deal with all sorts of quirks. It
should handle this just fine and place all pci bars outside the
reservations.
> > The linux kernel might fix it up at boot though.
>
> If you mean inside the VM, Linux-on-Linux isn't my intended use case,
> though.
Might be useful for testing though as you can easily check stuff in the
kernel boot log and /proc/iomem.
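For that kind of check in a Linux test guest, a small sketch that parses lines of the form "start-end : name" as printed in /proc/iomem (hexadecimal, inclusive end address). The struct and function names are hypothetical, and reading the file itself is left to the caller.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* One parsed /proc/iomem line; end is inclusive, as in the file. */
struct iomem_line {
    uint64_t start;
    uint64_t end;
    char name[64];
};

/* Parse a line such as "  60000000-ffffffff : Reserved".
 * Leading indentation (used for nested resources) is skipped.
 * Returns true if all three fields were found. */
static bool parse_iomem_line(const char *line, struct iomem_line *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]",
                  &out->start, &out->end, out->name) == 3;
}

/* True if the entry is labelled as reserved; the capitalisation of the
 * label has varied between kernel versions, so both are accepted. */
static bool iomem_is_reserved(const struct iomem_line *l)
{
    return strcmp(l->name, "Reserved") == 0 ||
           strcmp(l->name, "reserved") == 0;
}
```

Feeding each line read with fgets() through parse_iomem_line() and comparing the reserved entries with the ranges added in qemu shows whether the reservations actually reached the guest.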
cheers,
Gerd