Re: [PATCH] Documentation: PCI: add vmd documentation

Linux PCI subsystem development
 help / color / mirror / Atom feed

From: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: <linux-pci@vger.kernel.org>, Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH] Documentation: PCI: add vmd documentation
Date: Thu, 18 Apr 2024 14:51:19 -0700	[thread overview]
Message-ID: <68b92dca-2e51-4604-99e8-58a42459218a@intel.com> (raw)
In-Reply-To: <20240418182653.GA216968@bhelgaas>

On 4/18/2024 11:26 AM, Bjorn Helgaas wrote:
> [+cc Keith]
> 
> On Wed, Apr 17, 2024 at 01:15:42PM -0700, Paul M Stillwell Jr wrote:
>> Adding documentation for the Intel VMD driver and updating the index
>> file to include it.
>>
>> Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
>> ---
>>   Documentation/PCI/controller/vmd.rst | 51 ++++++++++++++++++++++++++++
>>   Documentation/PCI/index.rst          |  1 +
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 Documentation/PCI/controller/vmd.rst
>>
>> diff --git a/Documentation/PCI/controller/vmd.rst b/Documentation/PCI/controller/vmd.rst
>> new file mode 100644
>> index 000000000000..e1a019035245
>> --- /dev/null
>> +++ b/Documentation/PCI/controller/vmd.rst
>> @@ -0,0 +1,51 @@
>> +.. SPDX-License-Identifier: GPL-2.0+
>> +
>> +=================================================================
>> +Linux Base Driver for the Intel(R) Volume Management Device (VMD)
>> +=================================================================
>> +
>> +Intel vmd Linux driver.
>> +
>> +Contents
>> +========
>> +
>> +- Overview
>> +- Features
>> +- Limitations
>> +
>> +The Intel VMD provides the means to provide volume management across separate
>> +PCI Express HBAs and SSDs without requiring operating system support or
>> +communication between drivers. It does this by obscuring each storage
>> +controller from the OS, but allowing a single driver to be loaded that would
>> +control each storage controller. A Volume Management Device (VMD) provides a
>> +single device for a single storage driver. The VMD resides in the IIO root
> 
> I'm not sure IIO (and PCH below) are really relevant to this.  I think

I'm trying to describe where in the CPU architecture VMD exists because 
it's not like other devices. It's not like a storage device or 
networking device that is plugged in somewhere; it exists as part of the 
CPU (in the IIO). I'm ok removing it, but it might be confusing to 
someone looking at the documentation. I'm also close to this so it may 
be clear to me, but confusing to others (which I know it is) so any help 
making it clearer would be appreciated.

> we really just care about the PCI topology enumerable by the OS.  If
> they are relevant, expand them on first use as you did for VMD so we
> have a hint about how to learn more about it.
> 

I don't fully understand this comment. The PCI topology behind VMD is 
not enumerable by the OS unless we are considering the vmd driver the 
OS. If the user enables VMD in the BIOS and the vmd driver isn't loaded, 
then the OS never sees the devices behind VMD.

The only reason the devices are seen by the OS is that the VMD driver 
does some mapping when the VMD driver loads during boot.

>> +complex and it appears to the OS as a root bus integrated endpoint. In the IIO,
> 
> I suspect "root bus integrated endpoint" means the same as "Root
> Complex Integrated Endpoint" as defined by the PCIe spec?  If so,
> please use that term and capitalize it so there's no confusion.
> 

OK, will fix.

>> +the VMD is in a central location to manipulate access to storage devices which
>> +may be attached directly to the IIO or indirectly through the PCH. Instead of
>> +allowing individual storage devices to be detected by the OS and allow it to
>> +load a separate driver instance for each, the VMD provides configuration
>> +settings to allow specific devices and root ports on the root bus to be
>> +invisible to the OS.
> 
> How are these settings configured?  BIOS setup menu?
> 

I believe there are 2 ways this is done:

The first is that the system designer creates a design such that some 
root ports and end points are behind VMD. If VMD is enabled in the BIOS 
then these devices don't show up to the OS and require a driver to use 
them (the vmd driver). If VMD is disabled in the BIOS then the devices 
are seen by the OS at boot time.

The second way is that there are settings in the BIOS for VMD. I don't 
think there are many settings... it's mostly enable/disable VMD

>> +VMD works by creating separate PCI domains for each VMD device in the system.
>> +This makes VMD look more like a host bridge than an endpoint so VMD must try
>> +to adhere to the ACPI Operating System Capabilities (_OSC) flags of the system.
> 
> As Keith pointed out, I think this needs more details about how the
> hardware itself works.  I don't think there's enough information here
> to maintain the OS/platform interface on an ongoing basis.
> 
> I think "creating a separate PCI domain" is a consequence of providing
> a new config access mechanism, e.g., a new ECAM region, for devices
> below the VMD bridge.  That hardware mechanism is important to
> understand because it means those downstream devices are unknown to
> anything that doesn't grok the config access mechanism.  For example,
> firmware wouldn't know anything about them unless it had a VMD driver.
> 
> Some of the pieces that might help figure this out:
>

I'll add some details to answer these in the documentation, but I'll 
give a brief answer here as well

>    - Which devices (VMD bridge, VMD Root Ports, devices below VMD Root
>      Ports) are enumerated in the host?
> 

Only the VMD device (as a PCI end point) are seen by the OS without the 
vmd driver

>    - Which devices are passed through to a virtual guest and enumerated
>      there?
> 

All devices under VMD are passed to a virtual guest

>    - Where does the vmd driver run (host or guest or both)?
> 

I believe the answer is both.

>    - Who (host or guest) runs the _OSC for the new VMD domain?
> 

I believe the answer here is neither :) This has been an issue since 
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features"). I've 
submitted this patch 
(https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/) 
to attempt to fix the issue.

You are much more of an expert in this area than I am, but as far as I 
can tell the only way the _OSC bits get cleared is via ACPI 
(specifically this code 
https://elixir.bootlin.com/linux/latest/source/drivers/acpi/pci_root.c#L1038). 
Since ACPI doesn't run on the devices behind VMD the _OSC bits don't get 
set properly for them.

Ultimately the only _OSC bits that VMD cares about are the hotplug bits 
because that is a feature of our device; it enables hotplug in guests 
where there is no way to enable it. That's why my patch is to set them 
all the time and copy the other _OSC bits because there is no other way 
to enable this feature (i.e. there is no user space tool to 
enable/disable it).

>    - What happens to interrupts generated by devices downstream from
>      VMD, e.g., AER interrupts from VMD Root Ports, hotplug interrupts
>      from VMD Root Ports or switch downstream ports?  Who fields them?
>      In general firmware would field them unless it grants ownership
>      via _OSC.  If firmware grants ownership (or the OS forcibly takes
>      it by overriding it for hotplug), I guess the OS that requested
>      ownership would field them?
> 

The interrupts are passed through VMD to the OS. This was the AER issue 
that resulted in commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe 
features"). IIRC AER was disabled in the BIOS, but is was enabled in the 
VMD host bridge because pci_init_host_bridge() sets all the bits to 1 
and that generated an AER interrupt storm.

In bare metal scenarios the _OSC bits are correct, but in a hypervisor 
scenario the bits are wrong because they are all 0 regardless of what 
the ACPI tables indicate. The challenge is that the VMD driver has no 
way to know it's in a hypervisor to set the hotplug bits correctly.

>    - How do interrupts (hotplug, AER, etc) for things below VMD work?
>      Assuming the OS owns the feature, how does the OS discover them?

I feel like this is the same question as above? Or maybe I'm missing a 
subtlety about this...

>      I guess probably the usual PCIe Capability and MSI/MSI-X
>      Capabilities?  Which OS (host or guest) fields them?
>
>> +A couple of the _OSC flags regard hotplug support.  Hotplug is a feature that
>> +is always enabled when using VMD regardless of the _OSC flags.
> 
> We log the _OSC negotiation in dmesg, so if we ignore or override _OSC
> for hotplug, maybe that should be made explicit in the logging
> somehow?
> 

That's a really good idea and something I can add to 
https://lore.kernel.org/linux-pci/20240408183927.135-1-paul.m.stillwell.jr@intel.com/

Would a message like this help from the VMD driver?

"VMD enabled, hotplug enabled by VMD"

>> +Features
>> +========
>> +
>> +- Virtualization
>> +- MSIX interrupts
>> +- Power Management
>> +- Hotplug
> 
> s/MSIX/MSI-X/ to match spec usage.
> 
> I'm not sure what this list is telling us.
> 

Will fix

>> +Limitations
>> +===========
>> +
>> +When VMD is enabled and used in a hypervisor the _OSC flags provided by the
>> +hypervisor BIOS may not be correct. The most critical of these flags are the
>> +hotplug bits. If these bits are incorrect then the storage devices behind the
>> +VMD will not be able to be hotplugged. The driver always supports hotplug for
>> +the devices behind it so the hotplug bits reported by the OS are not used.
> 
> "_OSC may not be correct" sounds kind of problematic.  How does the
> OS deal with this?  How does the OS know whether to pay attention to
> _OSC or ignore it because it tells us garbage?
> 

That's the $64K question, lol. We've been trying to solve that since 
commit 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") :)

> If we ignore _OSC hotplug bits because "we know what we want, and we
> know we won't conflict with firmware," how do we deal with other _OSC
> bits?  AER?  PME?  What about bits that may be added in the future?
> Is there some kind of roadmap to help answer these questions?
> 

As I mentioned earlier, VMD only really cares about hotplug because that 
is the feature we enable for guests (and hosts).

I believe the solution is to use the root bridge settings for all other 
bits (which is what is happening currently). What this will mean in 
practice is that in a bare metal scenario the bits will be correct for 
all the features (AER et al) and that in a guest scenario all the bits 
other than hotplug (which we will enable always) will be 0 (that's what 
we see in all hypervisor scenarios we've tested) which is fine for us 
because we don't care about any of the other bits.

That's why I think it's ok for us to set the hotplug bits to 1 when the 
VMD driver loads; we aren't harming any other devices, we are enabling a 
feature that we know our users want and we are setting all the other 
_OSC bits "correctly" (for some values of correctly :) )

I appreciate your feedback and I'll start working on updating the 
documentation to make it clearer. I'll wait to send a v2 until I feel 
like we've finished our discussion from this one.

Paul

> Bjorn
>

next prev parent reply	other threads:[~2024-04-18 21:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-17 20:15 [PATCH] Documentation: PCI: add vmd documentation Paul M Stillwell Jr
2024-04-17 23:51 ` Keith Busch
2024-04-18 15:07   ` Paul M Stillwell Jr
2024-04-18 23:34     ` Keith Busch
2024-04-18 18:26 ` Bjorn Helgaas
2024-04-18 21:51   ` Paul M Stillwell Jr [this message]
2024-04-19 21:14     ` Bjorn Helgaas
2024-04-19 22:18       ` Paul M Stillwell Jr
2024-04-22 20:27         ` Bjorn Helgaas
2024-04-22 21:39           ` Paul M Stillwell Jr
2024-04-22 22:52             ` Bjorn Helgaas
2024-04-22 23:39               ` Paul M Stillwell Jr
2024-04-23 21:26                 ` Bjorn Helgaas
2024-04-23 23:10                   ` Paul M Stillwell Jr
2024-04-24  0:47                     ` Bjorn Helgaas
2024-04-24 21:29                       ` Paul M Stillwell Jr
2024-04-25 17:24                         ` Bjorn Helgaas
2024-04-25 21:43                           ` Paul M Stillwell Jr
2024-04-25 22:32                             ` Bjorn Helgaas
2024-04-25 23:32                               ` Paul M Stillwell Jr
2024-04-26 21:36                                 ` Bjorn Helgaas
2024-04-26 21:46                                   ` Paul M Stillwell Jr
2024-06-12 21:52                                     ` Paul M Stillwell Jr
2024-06-12 22:25                                       ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=68b92dca-2e51-4604-99e8-58a42459218a@intel.com \
    --to=paul.m.stillwell.jr@intel.com \
    --cc=helgaas@kernel.org \
    --cc=kbusch@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox