From: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: <linux-pci@vger.kernel.org>, Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH] Documentation: PCI: add vmd documentation
Date: Thu, 18 Apr 2024 14:51:19 -0700
Message-ID: <68b92dca-2e51-4604-99e8-58a42459218a@intel.com>
In-Reply-To: <20240418182653.GA216968@bhelgaas>
On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> [+cc Keith]
>
> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>> Adding documentation for the Intel VMD driver and updating the index
>> file to include it.
>>
>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
>> ---
>> Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>> Documentation/PCI/index.rst | 1 +
>> 2 files changed, 52 insertions(+)
>> create mode 100644 Documentation/PCI/controller/vmd.rst
>>
>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
>> new file mode 100644
>> index 000000000000..e1a019035245
>> --- /dev/null
>> +++ b/Documentation/PCI/controller/vmd.rst
>> @@ -0,0 +1,51 @@
>> +.. SPDX-License-Identifier: GPL-2.0+
>> +
>> +=================================================================
>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>> +=================================================================
>> +
>> +Intel vmd Linux driver.
>> +
>> +Contents
>> +========
>> +
>> +- Overview
>> +- Features
>> +- Limitations
>> +
>> +The Intel VMD provides the means to provide volume management across separate
>> +PCI Express HBAs and SSDs without requiring operating system support or
>> +communication between drivers. It does this by obscuring each storage
>> +controller from the OS, but allowing a single driver to be loaded that would
>> +control each storage controller. A Volume Management Device (VMD) provides a
>> +single device for a single storage driver. The VMD resides in the IIO root
>
> I'm not sure IIO (and PCH below) are really relevant to this. I think
I'm trying to describe where in the CPU architecture VMD exists because
it's not like other devices. It's not like a storage device or
networking device that is plugged in somewhere; it exists as part of the
CPU (in the IIO). I'm ok removing it, but it might be confusing to
someone looking at the documentation. I'm also close to this so it may
be clear to me, but confusing to others (which I know it is) so any help
making it clearer would be appreciated.
> we really just care about the PCI topology enumerable by the OS. If
> they are relevant, expand them on first use as you did for VMD so we
> have a hint about how to learn more about it.
>
I don't fully understand this comment. The PCI topology behind VMD is
not enumerable by the OS unless we are considering the vmd driver the
OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded,
then the OS never sees the devices behind VMD.
The only reason the devices are seen by the OS is that the VMD driver
does some mapping when the VMD driver loads during boot.
>> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
>
> I suspect "root bus integrated endpoint" means the same as "Root
> Complex Integrated Endpoint" as defined by the PCIe spec? If so,
> please use that term and capitalize it so there's no confusion.
>
OK, will fix.
>> +the VMD is in a central location to manipulate access to storage devices which
>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>> +allowing individual storage devices to be detected by the OS and allow it to
>> +load a separate driver instance for each, the VMD provides configuration
>> +settings to allow specific devices and root ports on the root bus to be
>> +invisible to the OS.
>
> How are these settings configured? BIOS setup menu?
>
I believe there are 2 ways this is done:
The first is that the system designer creates a design such that some
root ports and endpoints are behind VMD. If VMD is enabled in the BIOS
then these devices don't show up to the OS and require a driver to use
them (the vmd driver). If VMD is disabled in the BIOS then the devices
are seen by the OS at boot time.
The second way is that there are settings in the BIOS for VMD. I don't
think there are many settings... it's mostly enable/disable VMD.
>> +VMD works by creating separate PCI domains for each VMD device in the system.
>> +This makes VMD look more like a host bridge than an endpoint so VMD must try
>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
>
> As Keith pointed out, I think this needs more details about how the
> hardware itself works. I don't think there's enough information here
> to maintain the OS/platform interface on an ongoing basis.
>
> I think "creating a separate PCI domain" is a consequence of providing
> a new config access mechanism, e.g., a new ECAM region, for devices
> below the VMD bridge. That hardware mechanism is important to
> understand because it means those downstream devices are unknown to
> anything that doesn't grok the config access mechanism. For example,
> firmware wouldn't know anything about them unless it had a VMD driver.
>
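For reference, here is a rough sketch of what an ECAM-style config
access behind a VMD-owned window could look like, assuming the window
really is laid out like ECAM as suggested above. The struct and function
names are hypothetical and only model the address arithmetic; this is
not the vmd driver's code.

```c
#include <stdint.h>

/* Standard PCIe ECAM layout: 4 KiB of config space per function. */
#define ECAM_BUS_SHIFT   20
#define ECAM_DEVFN_SHIFT 12

/* Hypothetical model of a driver-owned config window (illustrative names). */
struct cfg_window {
	uint64_t size;       /* bytes of config space behind the window */
	uint8_t  busn_start; /* first bus number decoded by the window */
};

/*
 * Return the byte offset of (bus, devfn, reg) inside the window, or -1 if
 * the access falls outside it. Bus numbers are taken relative to
 * busn_start, so the first bus behind the window maps to offset 0.
 */
static int64_t cfg_window_offset(const struct cfg_window *w, uint8_t bus,
				 uint8_t devfn, uint16_t reg)
{
	uint64_t off;

	if (bus < w->busn_start || reg >= 4096)
		return -1;

	off = ((uint64_t)(bus - w->busn_start) << ECAM_BUS_SHIFT) |
	      ((uint64_t)devfn << ECAM_DEVFN_SHIFT) | reg;

	return off < w->size ? (int64_t)off : -1;
}
```

The point of the sketch is that only software that knows where the
window lives and how to index it can reach the devices behind it, which
is why anything without a VMD driver never sees them.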
> Some of the pieces that might help figure this out:
>
I'll add some details to answer these in the documentation, but I'll
give a brief answer here as well.
> - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
> Ports) are enumerated in the host?
>
Only the VMD device (as a PCI endpoint) is seen by the OS without the
vmd driver.
> - Which devices are passed through to a virtual guest and enumerated
> there?
>
All devices under VMD are passed to a virtual guest.
> - Where does the vmd driver run (host or guest or both)?
>
I believe the answer is both.
> - Who (host or guest) runs the _OSC for the new VMD domain?
>
I believe the answer here is neither :) This has been an issue since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've
submitted this patch
(https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/)
to attempt to fix the issue.
You are much more of an expert in this area than I am, but as far as I
can tell the only way the _OSC bits get cleared is via ACPI
(specifically this code
https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038).
Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get
set properly for them.
Ultimately the only _OSC bits that VMD cares about are the hotplug bits
because that is a feature of our device; it enables hotplug in guests
where there is otherwise no way to enable it. That's why my patch sets
the hotplug bits unconditionally and copies the other _OSC bits: there
is no other way to enable this feature (i.e. there is no user space tool
to enable/disable it).
> - What happens to interrupts generated by devices downstream from
> VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
> from VMD Root Ports or switch downstream ports? Who fields them?
> In general firmware would field them unless it grants ownership
> via _OSC. If firmware grants ownership (or the OS forcibly takes
> it by overriding it for hotplug), I guess the OS that requested
> ownership would field them?
>
The interrupts are passed through VMD to the OS. This was the AER issue
that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
features"). IIRC AER was disabled in the BIOS, but it was enabled in the
VMD host bridge because pci_init_host_bridge() sets all the bits to 1
and that generated an AER interrupt storm.
In bare metal scenarios the _OSC bits are correct, but in a hypervisor
scenario the bits are wrong because they are all 0 regardless of what
the ACPI tables indicate. The challenge is that the VMD driver has no
way to know it's in a hypervisor to set the hotplug bits correctly.
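To pin down the three behaviours discussed in this thread, here is a
small user-space model: the all-ones default, mirroring the root bridge
(the behaviour since 04b12ef163d1), and mirroring the root bridge while
always owning hotplug (what my patch proposes). The field names loosely
follow the native_* flags in struct pci_host_bridge, but the struct and
helper names are made up for illustration, and I'm assuming "the hotplug
bits" means both the PCIe and SHPC hotplug flags.

```c
#include <stdbool.h>

/*
 * Simplified stand-in for the per-host-bridge feature ownership flags.
 * This is a model for discussion, not kernel code.
 */
struct bridge_flags {
	bool native_aer;
	bool native_pcie_hotplug;
	bool native_shpc_hotplug;
	bool native_pme;
	bool native_ltr;
	bool native_dpc;
};

/* All-ones default: the OS assumes it owns every feature. */
static struct bridge_flags flags_all_native(void)
{
	return (struct bridge_flags){ true, true, true, true, true, true };
}

/* Honor _OSC: mirror whatever was negotiated for the root bridge. */
static struct bridge_flags flags_honor_osc(const struct bridge_flags *root)
{
	return *root;
}

/* Proposed: mirror the root bridge, but always own hotplug (assumption:
 * both hotplug flags are forced). */
static struct bridge_flags flags_force_hotplug(const struct bridge_flags *root)
{
	struct bridge_flags f = *root;

	f.native_pcie_hotplug = true;
	f.native_shpc_hotplug = true;
	return f;
}
```

With a root bridge where firmware kept AER, the first policy still
claims AER for the VMD domain (the interrupt storm case), the second
leaves it alone. With an all-zero root bridge, as seen in guests, the
second policy loses hotplug and the third keeps it while leaving every
other flag at 0.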
> - How do interrupts (hotplug, AER, etc) for things below VMD work?
> Assuming the OS owns the feature, how does the OS discover them?
I feel like this is the same question as above? Or maybe I'm missing a
subtlety about this...
> I guess probably the usual PCIe Capability and MSI/MSI-X
> Capabilities? Which OS (host or guest) fields them?
>
>> +A couple of the _OSC flags regard hotplug support. Hotplug is a feature that
>> +is always enabled when using VMD regardless of the _OSC flags.
>
> We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
> for hotplug, maybe that should be made explicit in the logging
> somehow?
>
That's a really good idea and something I can add to
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/
Would a message like this help from the VMD driver?
"VMD enabled, hotplug enabled by VMD"
>> +Features
>> +========
>> +
>> +- Virtualization
>> +- MSIX interrupts
>> +- Power Management
>> +- Hotplug
>
> s/MSIX/MSI-X/ to match spec usage.
>
> I'm not sure what this list is telling us.
>
Will fix
>> +Limitations
>> +===========
>> +
>> +When VMD is enabled and used in a hypervisor the _OSC flags provided by the
>> +hypervisor BIOS may not be correct. The most critical of these flags are the
>> +hotplug bits. If these bits are incorrect then the storage devices behind the
>> +VMD will not be able to be hotplugged. The driver always supports hotplug for
>> +the devices behind it so the hotplug bits reported by the OS are not used.
>
> "_OSC may not be correct" sounds kind of problematic. How does the
> OS deal with this? How does the OS know whether to pay attention to
> _OSC or ignore it because it tells us garbage?
>
That's the $64K question, lol. We've been trying to solve that since
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :)
> If we ignore _OSC hotplug bits because "we know what we want, and we
> know we won't conflict with firmware," how do we deal with other _OSC
> bits? AER? PME? What about bits that may be added in the future?
> Is there some kind of roadmap to help answer these questions?
>
As I mentioned earlier, VMD only really cares about hotplug because that
is the feature we enable for guests (and hosts).
I believe the solution is to use the root bridge settings for all other
bits (which is what is happening currently). In practice this means that
on bare metal the bits will be correct for all the features (AER et al).
In a guest, all the bits other than hotplug (which we will always
enable) will be 0, which is what we see in every hypervisor scenario
we've tested. That is fine for us because we don't care about any of the
other bits.
That's why I think it's ok for us to set the hotplug bits to 1 when the
VMD driver loads; we aren't harming any other devices, we are enabling a
feature that we know our users want and we are setting all the other
_OSC bits "correctly" (for some values of correctly :) )
I appreciate your feedback and I'll start working on updating the
documentation to make it clearer. I'll wait to send a v2 until I feel
like we've finished our discussion from this one.
Paul
> Bjorn
>