From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: Some questions regarding QEMU, UEFI,
 PCI/VGA Passthrough, and other things
Date: Mon, 8 Dec 2014 13:46:18 +0000
Message-ID: <5485ABAA.8050803@citrix.com>
References: <SNT151-W418A2CEE673F81B3E75E99F3660@phx.gbl>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <SNT151-W418A2CEE673F81B3E75E99F3660@phx.gbl>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Zir Blazer <zir_blazer@hotmail.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 06/12/14 00:45, Zir Blazer wrote:

Replying somewhat out of order:

> I hope someone finds my questions interesing to answer.

It is certainly an interesting read - I will answer what I can.

> While I am not a developer myself (I always sucked hard when it comes
> to read and write code), there are several capabilities of Xen and its
> supporting Software which I'm always interesed in how they progress,
> more out of curiosity than anything else. However, usually,
> documentation seems to backtrack a lot what its currently implemented
> in code, and sometimes you catch a mail here with some useful data
> regarding a topic but later you don't hear about that any more,
> missing any progress, or because the whole topic was inconclusive. So,
> this mail is pretty much a compilation of small questions of things I
> came across but didn't popped up later, but can serve to brainstorm
> someone, which is why I believe it to be more useful for xen-devel
> than xen-users.

This is indeed more appropriate for xen-devel.  Documentation is
certainly something we are poor at.  The monthly documentation days are
helping to counter this, but it is slow going.

>    QEMU
> Because as a VGA Passthrough user I'm currently forced to use qemu-xen-tr=
aditional (Through I hear some success about some users using qemu-xen in X=
en 4.4, but I myself didn't had any luck with it), I'm stuck with an old QE=
MU version. However, looking at changelog from latest versions I always see=
 some interesing features, which as far that I know Xen doesn't currently i=
ncorporate.
>
>
> 1a - One of the things that newer QEMU versions seems to be capable of do=
ing, is emulating the much newer Intel Q35 Chipset, instead of only the cur=
rent 440FX from the P5 Pentium era. Some data from Q35 emulation here:
> www.linux-kvm.org/wiki/images/0/06/2012-forum-Q35.pdf
> wiki.qemu.org/Features/Q35
>
> I'm aware that newer doesn't neccesarily means better, specially because =
the practical advantages of Q35 vs 440FX aren't very clear. There are sever=
al new emulated features like an AHCI Controller and a PCIe Bus, which soun=
ds interesing on paper, but I don't know if they add any useful feature or =
increases performance/compatibility. Some comments I read about the matter =
wrongly stated that Q35 would be needed to do PCIe Passthrough, but this is=
 currently possible on 440FX, through I don't know about the low level impl=
ementation differences. I think most of the idea about Q35 is to make the V=
M look more closely to real Hardware, instead of looking like a ridiculous =
obvious emulated platform.
> In the case of the AHCI Controller, I suppose than the OS would need to i=
nclude Drivers for the controller during installation time, which if I reca=
ll correctly both Windows Vista/7/8 and Linux should have, through for a Wi=
ndows XP install the Q35 AHCI Controller Drivers should probabily need to b=
e slipstreamed with nLite to an install ISO for it to work.

Qemu traditional with PCI passthrough of a PCIe device makes a PCI
topology which couldn't possibly work electrically speaking.  It ends up
with a PCIe device on a PCI bus with other PCI devices.  It works well
enough because operating systems have to cope with completely bogus
firmware information.

Q35 is certainly newer, and offers a different set of devices which will
be far more commonly found in more modern systems.  Whether this
constitutes "better" is purely subjective.

> A curious thing is that if I check /sys/kernel/iommu_groups/ as stated
> on the blog I find the folder empty (This is on Dom0, with a DomU with
> 2 passthroughed devices). I suppose it may be VFIO exclusive or
> something. Point is, after some googling I couldn't find a way to
> check for IOMMU groups, through Xen doesn't seem to manage that
> anyways. I think that it may be useful to get a layout of IOMMU groups
> to at least identify if passthrough issues could be related to that.
> Anyone can imagine current scenarios where this may break something or
> limit possible passthrough, why I have my IOMMU groups listing empty,
> and how to get such list?

Xen has no concept of iommu_groups, and the dom0 kernel doesn't have
blanket permissions to poke around in the system topology, which is
probably why the dom0 kernel doesn't list any.  (In particular, dom0
can't see the ACPI DMAR table as Xen hides it from dom0.)  Try booting
your dom0 kernel as native and seeing whether the groups become populated.
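
For reference, on a native boot the groups can be read straight out of
sysfs.  A minimal Python sketch, assuming the usual
/sys/kernel/iommu_groups/<group>/devices/<address> layout; the function
name and the "root" parameter are made up for illustration:

```python
import os

def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [PCI addresses]} read from sysfs.

    An empty dict means the kernel exposes no groups, which is what
    a dom0 kernel running under Xen is expected to show.
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        # Each group is a numbered directory holding one entry per device.
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups

if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        print(group, " ".join(devices))
```

If this prints nothing under Xen but prints groups when the same kernel
is booted natively, that matches the explanation above.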

Having read up on iommu_groups, they are a concept Xen should gain, and
"passing through a device" would turn into "passing through an
iommu_group".  Currently, the toolstack (and by implication, admin) can
set up anything it wants, even when it doesn't make sense in the
slightest, and the results are a subtly or completely broken device.

There are also several errata which would cause lots of PCIe devices to
consolidate into the same iommu_group.

IOMMU groups certainly wouldn't fix all passthrough issues, but they
would remove one avenue for setting up known-invalid configurations.
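
To illustrate the kind of check a toolstack could make if it knew about
groups, here is a rough Python sketch.  Nothing like this exists in Xen
today; the function and both data structures are hypothetical:

```python
def check_assignment(groups, assignment):
    """Return a list of problems with a proposed passthrough assignment.

    groups:     {group_id: [device, ...]}  (hypothetical table)
    assignment: {device: domain_name}; devices staying with the host
                are simply left out
    """
    problems = []
    for group_id, devices in sorted(groups.items()):
        # Every device in a group must have the same owner; None means
        # the device is not passed through to any domain.
        owners = {assignment.get(dev) for dev in devices}
        if len(owners) > 1:
            problems.append(
                "group %d is split across %s"
                % (group_id, sorted(str(o) for o in owners))
            )
    return problems
```

A group which is only partly assigned (one function passed through, its
sibling left behind) is reported in the same way as a group split
between two domains.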

> 1b - Another experimental feature that recently popped in QEMU is
> IOMMU emulation. Info here:
> www.mulix.org/pubs/iommu/viommu.pdf
> www.linux-kvm.org/wiki/images/4/4a/2010-forum-joro-pv-iommu.pdf
>
> IOMMU emulation usefulness seems to be so you can do PCI Passthrough
> in a Nested Virtualization enviroment. At first sight this looked a
> bit useless, cause using a DomU to do PCI Passthrough with an emulated
> IOMMU sounds rather too much overhead if you can simply emulate that
> device in the nested DomU. However, I also read about the possibility
> of Xen using Hardware virtualization for Dom0 instead of it being
> Paravirtualized. In that case, would it be possible to provide the
> IOMMU emulation layer to Dom0 so you could do PCI Passthrough in
> platforms without proper support for it? It seems a rather interesing
> idea.
> I think it would also be useful to serve as an standarized debug
> platform for IOMMU virtualization and passthrough, cause some years
> ago missing or malformed ACPI DMAR/IVRS tables were all over the place
> and getting IOMMU virtualization working was pretty much random luck
> and at the mercy of the goodwill of the Motherboard maker to fix their
> BIOSes.

IOMMU emulation without IOMMU hardware can only possibly work in
combination with completely emulated devices.

IOMMU emulation in combination with IOMMU hardware could be made to work
if Xen changes its current model of only having a single IOMMU root per
domain.

The IOMMU architecture is basically just some sets of pagetables, and
each device gets a "cr3".  Currently, Xen has one single set of
pagetables for each domain needing the IOMMU, and every device assigned
to that domain gets the same set of tables.  It is perfectly possible to
have each device assigned to a domain using a different set of tables,
for intra-vm isolation, or nested pci passthrough, but this would
require a change in Xen's interface (and a reasonably large quantity of
tuits).
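
As a toy model of the difference (this is not Xen's code; the class and
attribute names are invented for illustration), the current scheme hands
every device in a domain the same table root, while the alternative
gives each device its own:

```python
class IommuTables:
    """One set of I/O pagetables: device-visible frame -> machine frame."""

    def __init__(self):
        self.map = {}

    def translate(self, frame):
        if frame not in self.map:
            # A real IOMMU would raise a fault and block the access.
            raise PermissionError("IOMMU fault on frame %#x" % frame)
        return self.map[frame]


class Domain:
    def __init__(self, per_device=False):
        self.per_device = per_device
        self.shared = IommuTables()   # the single per-domain set of tables
        self.roots = {}               # device -> tables its "cr3" points at

    def assign(self, device):
        # Current model: every device shares the domain's tables.
        # Alternative model: each device gets a private set.
        tables = IommuTables() if self.per_device else self.shared
        self.roots[device] = tables
        return tables
```

With per_device=False a mapping added for one device is visible to all
of the domain's devices; with per_device=True it is not, which is the
property intra-vm isolation would need.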

>    UEFI for DomUs
> I managed to get this one working, but it seems to need some clarificatio=
ns here and there.
>
> 2a - As far that I know, if you add --enable-ovmf to ./configure before b=
uilding Xen, it downloads and builds some extra code from a OVMF repository=
 which Xen maintains, through I don't know if its a snapshop of whatever th=
e edk2 repository had at that time, or if it does includes custom patchs fo=
r the OVMF Firmware to work in Xen. Xen also has another ./configure option=
, --with-system-ovmf, which is supposed to be used to specify a path to pro=
vide an OVMF Firmware binary. However, when I tried that option some months=
 ago, I never managed to get it working, either using a package with a prec=
ompiled ovmf.bin from Arch Linux User Repository, or using another package =
with the source to compile it myself. Both binaries worked with standalone =
QEMU, through.
> Besides than that parameter itself was quite hidden, there is absolutely =
no info regarding if the provided OVMF binary has to comply with some speci=
al requeriments, be it some custom patchs for OVMF so it works with Xen, if=
 it has to be a binary that only includes TianoCore, or the unified one tha=
t includes the NVRAM in a single file. In Arch Linux, for the Xen 4.4 packa=
ge, the maintainer decided that the way to go for including OVMF support to=
 Xen was to use --enable-ovmf, cause at least it was possible to get it wor=
king with some available patches. However, for both download and build time=
s, it would be better to simply distribute a working binary. Any ideas of w=
hy --with-system-ovmf didn't worked for us?
>
>
> 2b - On successful Xen builds with OVMF support, something which I looked=
 for is the actual ovmf.bin file. So far, the only thing which I noticed is=
 that the hvmloader is 2 MiB bigger that on non-OVMF builds. Is there any r=
eason why OVMF is build into the hvmloader instead of what happens to the o=
ther Firmware binaries, which are usually sitting in a directory as standal=
one files?

(answering 2a and 2b together)

ovmf is currently unconditionally compiled into hvmloader, which is why
it gets 2MB bigger.  I believe --with-system-ovmf= (and
--with-system-seabios= for that matter) needs the system ovmf available in
the build environment to be linked into hvmloader.

For a separate project, I have a usecase for hvmloader itself being a
multiboot image.  This would allow the use of multiboot modules, which
would be far more flexible than compiling all the binaries into
hvmloader itself.  In particular, when a system qemu updates its system
seabios/ovmf, hvmloader could use the updated bioses rather than the
linked bioses.

> 2c - Something which I'm aware is that an OVMF binary can be in two
> formats: A unified binary that has both OVMF and NVRAM, or a OVMF
> binary with a separate NVRAM (1.87 MiB + 128 KiB respectively).
> According to what I read about using OVMF with QEMU, it seems that if
> using a unified binary, you need one per VM, cause the NVRAM content
> is different. I suppose than with the second option you have one OVMF
> Firmware binary and a 128 KB NVRAM per UEFI VM. How does Xen handles
> this? If I recall correctly, I heared than it is currently volatile
> (NVRAM contents aren't saved on DomU shutdown).

Currently nothing is saved.  With multiboot modules and in particular,
separate multiboot modules for the main OVMF binary and a small nvram,
it would be possible to specify "nvram = /path/to/nvram.bin" in your
vm.cfg and gain proper nvram which persists across reboots.

> 2d - Is there any recorded experience or info regarding how a UEFI
> DomU would behave with something like, say, Windows 8 with Fast Boot,
> or other UEFI features for native systems? This is pretty much a
> "what if..." scenario than something that I could really use.

I believe Anthony has managed to get this working with a Xenified OVMF?

>    PCI/VGA Passthrough
> It was four years ago when I learned about IOMMU virtualization making po=
ssible gaming in a VM via VGA Passthrough (First time I heared about that w=
as with some of Teo En Ming videos on Youtube), something which was quite e=
xperimental back at that time. Even currently, the only other Hypervisor or=
 VMM that can compete with Xen in this area is QEMU with KVM VFIO, which al=
so has decent VGA Passthrough capabilities. While I'm aware that Xen is pre=
tty much enterprise oriented, it was also the first to allow a power user t=
o make a system based on Xen as Hypervisor and everything else virtualized,=
 getting nearly all the functionality of running native with the flexibilit=
y than virtualization offers, at the cost of some overhead, quircks and com=
plexity on usage. Its a pain to configure it the first time, but if you man=
age to get it working, its wonderful. So far, this feature has created a sm=
all niche of power users that uses either Xen or QEMU KVM VFIO for virtuali=
zed gaming, and I consider VGA Passthrough a quite major feature because it=
 is what allows such setups on the first place.

I wouldn't necessarily say that Xen is specifically enterprise
orientated.  However, Xen is certainly harder to set up and use than
alternatives, which does raise the bar to start using it.

>
>
> 3a - On some of the Threads of the original guides I read about how to us=
e Xen to do VGA Passthrough, you usually see the author and others users sa=
ying that they didn't manage to get VGA Passthrough working on newer versio=
ns. This usually affected people that was doing the migration from the xm t=
o xl toolstack, but also between some Xen versions (I reported a regression=
 on Xen 4.4 vs a fully working 4.3). Passthrough compatibility previously u=
sed to be a Hardware-related pain cause it was extremely Motherboard and BI=
OS dependant on an era where consumer Motherboards makers didn't paid atten=
tion to the IOMMU, but at least on the Intel Haswell platform support for I=
OMMU is starting to get more mainstream.
> Considering than PCI/VGA Passthrough compatibility with a system or regre=
ssions of it between Xen versions is pretty much a hit-or-miss, would it be=
 possible to do something to get this feature under control? It seems like =
this isn't deeply tested, or at least not with too many variables involved =
(Hard to do, cause they're A LOT). I believe that it should be possible to =
have a few systems at hand which are know to work and representative of a H=
ardware platform tested against regression with different Video Cards, but =
it sounds extremely time consuming to switch cards, reboot, test with diffe=
rent DomUs OSes/Drivers, etc. At the moment, once you get a Computer/Distri=
bution/Kernel/Xen/Toolstack/DomU OS/Drivers combination that works, you sim=
ply stick to it, so many early adopters of VGA Passthrough are still using =
extremely outdated versions. Even worse, for users of versions like 4.2 wit=
h xm, if they want to upgrade to 4.4 with xl and want to figure out why it =
doesn't work, it will be a royal pain in the butt to figure out what patch =
was introduced that breaks compatibility for them, so those early adopters =
are pretty much out of luck if they have to go through years worth of code =
and version testing.

PCI Passthrough is in an awkward position.  I am not aware of any
dedicated testing that the stable/master branches get, and it is
surprisingly difficult to automate.  It would certainly be nice for
passthrough to get some form of dedicated testing, but currently the
best we have is users like yourself complaining when it breaks.  This is
certainly a situation which needs improving.

In XenServer, we support passthrough in a very restricted set of
circumstances, because there are simply too many system quirks (that we
know about, let alone those we don't) for us to be comfortable
supporting it in general.  Furthermore, our testing only covers the
version of Xen we are using in trunk, which is generally the latest stable.

>
>
> 3b - Do someone knows what is the actual difference on Intel platforms re=
garding VT-d support? As far that I know, the VT-d specification allows for=
 multiple "DMA Remapping Engines", of which a Haswell Processor has two, on=
e for its Integrated PCIe Controller and another for the Integrated GPU. Yo=
u also have Chipsets, some of which according to Intel Ark support VT-d (Wh=
ich I believe should be in the form of a third DMA Remapping Engine), like =
the Q87 and C226, and those that don't like the H87 and Z87. Based on worki=
ng samples I have been lead to believe than a Processor supporting VT-d wil=
l provide the IOMMU capabilities for the devices connected to its own PCIe =
Slots regardless of what Chipset you're using (That's the reason why you ca=
n do Passthrough with only Processor VT-d support), I would believe the sam=
e holds true with a VT-d Chipset with a non VT-d Processor, through I didn'=
t saw any working example of this.
> When I was researching about this one year ago, Supermicro support said t=
his to me:
>
> Since Z87 chipset does not support VT-d,  onboard LAN will not support it=
 either because it is connected to PCH PCIe port.  One workaround is to use=
 a VT-d enabled PCIe device and plug it into CPU based PCIe-port on board. =
 Along with a VT-d enabled CPU the above workaround should work per Intel.
>
> Based on this, there should be a not-very-well-documented quirck. The mos=
t common configuration for VGA Passthrough users is a VT-d supporting Proce=
ssor with a consumer Motherboard, so basically, if you have a VT-d supporti=
ng Processor like a Core i7 4790K, you can do Passthrough of the devices co=
nnected to the Processor PCIe Slots, and also of the ones connected to the =
Chipset if you apply that workaround (I don't know what does "VT-d enabled =
PCIe device" means exactly). I recall seeing some people using VMWare ESXi =
commenting that they couldn't passthrough the integrated NIC even through s=
ome a RAID Controller connected to the Processor could in such setups. Don'=
t have link at hand about the matter, but I believe that reelevant for the =
question.
> Considering that if workarounded you would be using the Processor DMA Rem=
apping Engine for Chipset devices, is there any potential bottleneck or per=
formance degradation there?

The only reasonable interpretation of "VT-d enabled PCIe device" that
stands a chance of working is a PCIe device with an IOMMU on it, but I
am not aware of any such device, or whether it would actually work.

It is certainly possible to have more than one IOMMU.  Servers typically
have one per socket and one for the chipset.  This doesn't necessarily
mean that all devices are covered by IOMMUs.

>
>
> 3c - There is a feature that enhances VT-d called ACS (Access Control
> Service), related to IOMMU groups isolation. This feature seems to be
> excluded from consumer platforms, and support for it seems to already
> be on Xen wishlist based on comments. Info here:
> vfio.blogspot.com.ar/2014/08/iommu-groups-inside-and-out.html
> comments.gmane.org/gmane.comp.emulators.xen.devel/212561

ACS is required to fix issues caused by optimisations permitted under the
PCIe spec, which are invalid in combination with an IOMMU.  The main one is
peer-to-peer DMA which permits a switch to complete peer-to-peer traffic
without forwarding it upstream.  This is wrong between two devices with
different IOMMU mappings, and ACS provides an override to say "forward
everything upstream - the IOMMU will make it go in different directions".

Presence or lack of ACS certainly does affect whether devices behind a
PCIe switch can safely be isolated into different IOMMU contexts.
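
A toy model of that routing decision, in Python.  It is a deliberate
simplification and every name in it is invented for illustration; real
switches and IOMMUs are considerably more involved:

```python
def deliver(src, address, peers, acs_enabled, iommu_tables):
    """Return where a DMA write from `src` to `address` ends up.

    peers:        {device: range of bus addresses claimed by its BAR},
                  all sitting behind the same switch
    iommu_tables: {device: {bus_address: destination}}, per-source mappings
    """
    if not acs_enabled:
        for dev, bar in peers.items():
            if dev != src and address in bar:
                # Completed inside the switch; the IOMMU never sees it.
                return dev
    # Forwarded upstream: the IOMMU translates with the source's own
    # tables, and blocks anything that has no mapping.
    return iommu_tables.get(src, {}).get(address, "blocked")
```

With ACS off, an address which happens to fall in a peer's BAR lands on
the peer regardless of what the source's IOMMU mappings say.  With ACS
on, the same write goes upstream and is translated or blocked.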

>
>
> 3d - The new SATA Express and M.2 connectors combines SATA and some
> PCI Express lanes on the same connector. Depending on implementation,
> the PCI Express lanes could come from either the Chipset or the
> Processor. Considering than some people likes to passthrough the
> entire SATA Controller, how does it interacts with this frankenstein
> connector with the PCIe lanes coming from elsewhere? I'm curious.

No idea, but I suspect it would appear as a different device, separate
to the SATA controller.

>
>
>    Miscelaneous Virtualization stuff
>
>
> 4a - There are several instances where the Software is trying to check if=
 it is under a virtualized enviroment or not. Examples which I recall havin=
g read about are some malware, which tries to hide if it detects that it is=
 running virtualized (Cause it means that it is not your exploitable Averag=
e Joe computer), or according some comments I read, some Drivers like those=
 of NVIDIA to force you to use a Quadro for VGA Passthrough instead of a co=
nsumer based GeForce. Is the goal of virtualization to reproduce the exact =
behaviator in a VM of a system running native, or just be functionally equi=
valent? This is because as more Software appears that makes a distinction b=
etween native and a VM, it seems that in the end it will be forcing VMs to =
look and behave like a native system to maintain compatibility. While curre=
ntly such Software is pretty much a specific niche, it exist the possibilit=
y than it becomes a trend with the growing popularity of the cloud.
> For example, one of the things that pretty much tells the whole history, =
is the 440FX Chipset, because if you see that Chipset running anything but =
a P5 Pentium, you know you're running either emulated or virtualized. Also,=
 if I use an application like CPU-Z, it says than the BIOS Brand is Xen, Ve=
rsion 4.3.3, which makes the status of the system as inside a VM also obvio=
us. I think that based on the rare but existant Software pieces that attemp=
ts to check if its running on a VM or not to decide behavior, at some point=
 in time a part of the virtualization segment will be playing a catching up=
 game to mask being a VM from these types of applications. I suppose that a=
 possible endgame for this topic would be where you have a VM that tries to=
 represent accurately as possible the PCI Layout of a commercial Chipset (W=
hich I believe was one of the aims of QEMU Q35 emulation), and deliberately=
 lying and/or masking the Processor CPUID data, BIOS vendor, and other reco=
gnizable things, to try to match what you would expect from a native system=
 of that Hardware generation.
> This point could be questionable, cause making a perfect VM that is indis=
tinguishable from a native system could harm some vendors that may rely on =
identifying if its running on a VM or not for enforcing licensing and the l=
ike.

I would go so far as to say that the majority of people using
virtualisation want something which works (for varying definitions of
'works'), and is as fast as possible.  Making an HVM guest
indistinguishable from a real computer is a very difficult task, and one
which I don't believe is practical to achieve.  An OS which is really
trying to identify a virtualised environment can even make a guess by
timing certain operations which would vmexit for emulation purposes.

>
>
> 4b - The only feature which I feel that Xen is missing from a home user p=
erspective, is sound. As far that I know you can currently tell QEMU to emu=
late a Sound Card in a DomU, but there is no way to easily get the sound ou=
t of a DomU like other VMMs do. Some of the solutions I saw relied on eithe=
r multiple passthroughed Sound Cards, or a PulseAudio Server adding massive=
 sound latency. While Xen is enterprise oriented where sound is unneeded, I=
 recall hearing that this feature was getting considered, but didn't see an=
y mention about it for months. How hard or complex it would be to add sound=
 support to Xen? Is the way to do it decided? Could it take the form of usi=
ng Dom0 Drivers for the Sound Card to act as a mixer and some PV Drivers fo=
r the DomU like the ones currently available for the NIC and storage?
>

Sorry, I don't have any useful input here, other than "that would be nice".

~Andrew