From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Oliver O'Halloran <oohall@gmail.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 06/15] powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV
Date: Wed, 15 Jul 2020 10:40:44 +1000
Message-ID: <2d0aadde-be32-2cfb-fa77-9ba8cccb7746@ozlabs.ru>
In-Reply-To: <20200710052340.737567-7-oohall@gmail.com>
On 10/07/2020 15:23, Oliver O'Halloran wrote:
> SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
> how anyone is supposed to know how it works except through a lot of
> suffering. Write up some docs about the overall story to help out
> the next sucker^Wperson who needs to tinker with it.
Sounds about right :)
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> arch/powerpc/platforms/powernv/pci-sriov.c | 130 +++++++++++++++++++++
> 1 file changed, 130 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 080ea39f5a83..f4c74ab1284d 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -12,6 +12,136 @@
> /* for pci_dev_is_added() */
> #include "../../../../drivers/pci/pci.h"
>
> +/*
> + * The majority of the complexity in supporting SR-IOV on PowerNV comes from
> + * the need to put the MMIO space for each VF into a separate PE. Internally
> + * the PHB maps MMIO addresses to a specific PE using the "Memory BAR Table".
> + * The MBT historically only applied to the 64bit MMIO window of the PHB
> + * so it's common to see it referred to as the "M64BT".
> + *
> + * An MBT entry stores the mapped range as an <base>,<mask> pair. This forces
> + * the address range that we want to map to be power-of-two sized and aligned.
> + * For conventional PCI devices this isn't really an issue since PCI device BARs
> + * have the same requirement.
> + *
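
To make the <base>,<mask> constraint concrete, here is a minimal sketch of
how such an entry could be modelled. struct mbt_entry and the mbt_*()
helpers are hypothetical names for illustration only; they are not the real
PHB register layout or anything in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one MBT entry, not the real hardware format. */
struct mbt_entry {
	uint64_t base;	/* start of the mapped range, aligned to its size */
	uint64_t mask;	/* ~(size - 1), where size is a power of two */
};

/* True if size is a non-zero power of two and base is aligned to it. */
static bool mbt_range_ok(uint64_t base, uint64_t size)
{
	return size != 0 && (size & (size - 1)) == 0 &&
	       (base & (size - 1)) == 0;
}

/* Build an entry for a range that already passed mbt_range_ok(). */
static struct mbt_entry mbt_make(uint64_t base, uint64_t size)
{
	struct mbt_entry e = { .base = base, .mask = ~(size - 1) };

	return e;
}

/* An address hits the entry when its masked bits equal the base. */
static bool mbt_match(const struct mbt_entry *e, uint64_t addr)
{
	return (addr & e->mask) == e->base;
}
```

A 3MB range fails mbt_range_ok() no matter where it is placed, which is
the problem the SR-IOV BAR sizing has to work around.
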
> + * For a SR-IOV BAR things are a little more awkward since size and alignment
> + * are not coupled. The alignment is set based on the per-VF BAR size, but
> + * the total BAR area is: number-of-vfs * per-vf-size. The number of VFs
> + * isn't necessarily a power of two, so neither is the total size. To fix that
> + * we need to finesse (read: hack) the Linux BAR allocator so that it will
> + * allocate the SR-IOV BARs in a way that lets us map them using the MBT.
> + *
> + * The changes to size and alignment that we need to do depend on the "mode"
> + * of MBT entry that we use. We only support SR-IOV on PHB3 (IODA2) and above,
> + * so as a baseline we can assume that we have the following BAR modes
> + * available:
> + *
> + * NB: $PE_COUNT is the number of PEs that the PHB supports.
> + *
> + * a) A segmented BAR that splits the mapped range into $PE_COUNT equally sized
> + * segments. The n'th segment is mapped to the n'th PE.
> + * b) An un-segmented BAR that maps the whole address range to a specific PE.
> + *
> + *
> + * We prefer to use mode a) since it only requires one MBT entry per SR-IOV BAR.
> + * For comparison b) requires one entry per-VF per-BAR, or:
> + * (num-vfs * num-sriov-bars) in total. To use a) we need the size of each segment
> + * to equal the size of the per-VF BAR area. So:
> + *
> + * new_size = per-vf-size * number-of-PEs
> + *
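
A small sketch of the arithmetic, for anyone following along. The helper
names are made up for this example and the numbers in the note below are
illustrative values, not taken from a real device.

```c
#include <stdint.h>

/* Default SR-IOV BAR size: one per-VF slot for each VF the device supports. */
static uint64_t iov_default_size(uint64_t per_vf_size, unsigned int total_vfs)
{
	return per_vf_size * total_vfs;
}

/*
 * Mode a) size: one segment per PE, each segment the size of one VF BAR,
 * i.e. new_size = per-vf-size * number-of-PEs.
 */
static uint64_t iov_segmented_size(uint64_t per_vf_size, unsigned int pe_count)
{
	return per_vf_size * pe_count;
}
```

With a 64K per-VF BAR and 48 VFs the default size is 3MB, which is not a
power of two. Assuming a $PE_COUNT of 256, the expanded size is 16MB, which
can be covered by a single segmented MBT entry.
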
> + * The alignment for the SR-IOV BAR also needs to be changed from per-vf-size
> + * to "new_size", calculated above. Implementing this is a convoluted process
> + * which requires several hooks in the PCI core:
> + *
> + * 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
> + *
> + * At this point the device has been probed and the device's BARs are sized,
> + * but no resource allocations have been done. The SR-IOV BARs are sized
> + * based on the maximum number of VFs supported by the device and we need
> + * to increase that to new_size.
> + *
> + * 2. Later, when Linux actually assigns resources it tries to make the resource
> + * allocations for each PCI bus as compact as possible. As a part of that it
> + * sorts the BARs on a bus by their required alignment, which is calculated
> + * using pci_resource_alignment().
> + *
> + * For IOV resources this goes:
> + * pci_resource_alignment()
> + * pci_sriov_resource_alignment()
> + * pcibios_sriov_resource_alignment()
> + * pnv_pci_iov_resource_alignment()
> + *
> + * Our hook overrides the default alignment, equal to the per-vf-size, with
> + * new_size computed above.
> + *
> + * 3. When userspace enables VFs for a device:
> + *
> + * sriov_enable()
> + * pcibios_sriov_enable()
> + * pnv_pcibios_sriov_enable()
> + *
> + * This is where we actually allocate PE numbers for each VF and set up the
> + * MBT mapping for each SR-IOV BAR. In steps 1) and 2) we set up an "arena"
> + * where each MBT segment is equal in size to the VF BAR so we can shift
> + * around the actual SR-IOV BAR location within this arena. We need this
> + * ability because the PE space is shared by all devices on the same PHB.
> + * When using mode a) described above, segment 0 maps to PE#0 which might
> + * already be in use by another device on the PHB.
> + *
> + * As a result we need to allocate a contiguous range of PE numbers, then shift
> + * the address programmed into the SR-IOV BAR of the PF so that the address
> + * of VF0 matches up with the segment corresponding to the first allocated
> + * PE number. This is handled in pnv_pci_vf_resource_shift().
> + *
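
The shift is easier to see with the arithmetic written out. This is only a
sketch of the idea, assuming the arena starts at segment 0; the function
names are hypothetical and this is not what pnv_pci_vf_resource_shift()
literally does.

```c
#include <stdint.h>

/*
 * Address to program into the PF's SR-IOV BAR so that VF0 lands in the
 * segment belonging to the first allocated PE.
 */
static uint64_t iov_shifted_bar(uint64_t arena_start, uint64_t per_vf_size,
				unsigned int first_pe)
{
	return arena_start + (uint64_t)first_pe * per_vf_size;
}

/* Segment (and therefore PE number) that a given VF's BAR falls into. */
static unsigned int iov_vf_to_pe(uint64_t arena_start, uint64_t per_vf_size,
				 uint64_t bar_start, unsigned int vf)
{
	uint64_t vf_addr = bar_start + (uint64_t)vf * per_vf_size;

	return (unsigned int)((vf_addr - arena_start) / per_vf_size);
}
```

If the first allocated PE is 4, the BAR moves up by four per-VF slots and
VF n ends up in the segment for PE 4 + n.
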
> + * Once all that is done we return to the PCI core which then enables VFs,
> + * scans them and creates pci_devs for each. The init process for a VF is
> + * largely the same as a normal device, but the VF is inserted into the IODA
> + * PE that we allocated for it rather than the PE associated with the bus.
> + *
> + * 4. When userspace disables VFs we unwind the above in
> + * pnv_pcibios_sriov_disable(). Fortunately this is relatively simple since
> + * we don't need to validate anything, just tear down the mappings and
> + * move the SR-IOV resource back to its "proper" location.
> + *
> + * That's how mode a) works. In theory mode b) (single PE mapping) is less work
> + * since we can map each individual VF with a separate BAR. However, there are a
> + * few limitations:
> + *
> + * 1) For IODA2 mode b) has a minimum alignment requirement of 32MB. This makes
> + * it only usable for devices with very large per-VF BARs. Such devices are
> + * similar to Big Foot. They definitely exist, but I've never seen one.
> + *
> + * 2) The number of MBT entries that we have is limited. PHB3 and PHB4 only
> + * have 16 total and some are needed for. Most SR-IOV capable network cards
> + * can support more than 16 VFs on each port.
> + *
> + * We use b) when using a) would use more than 1/4 of the entire 64 bit MMIO
> + * window of the PHB.
> + *
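
The mode selection rule described here boils down to a single comparison.
A sketch, with a hypothetical helper name and parameters; the actual code
may apply further conditions that the comment doesn't spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick mode b) (single PE per MBT entry) when the segmented arena for
 * mode a) would take up more than a quarter of the 64bit MMIO window.
 */
static bool iov_use_single_pe_mode(uint64_t per_vf_size, unsigned int pe_count,
				   uint64_t m64_window_size)
{
	uint64_t segmented_size = per_vf_size * pe_count;

	return segmented_size > m64_window_size / 4;
}
```

For example, with an assumed 64GB window and 256 PEs, a 1MB per-VF BAR
stays in mode a) (256MB arena) while a 128MB per-VF BAR would need a 32GB
arena and falls back to mode b).
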
> + *
> + *
> + * PHB4 (IODA3) added a few new features that would be useful for SR-IOV. It
> + * allowed the MBT to map 32bit MMIO space in addition to 64bit which allows
> + * us to support SR-IOV BARs in the 32bit MMIO window. This is useful since
> + * the Linux BAR allocation will place any BAR marked as non-prefetchable into
> + * the non-prefetchable bridge window, which is 32bit only. It also added two
> + * new modes:
> + *
> + * c) A segmented BAR similar to a), but each segment can be individually
> + * mapped to any PE. This matches how the 32bit MMIO window worked on
> + * IODA1&2.
> + *
> + * d) A segmented BAR with 8, 64, or 128 segments. This works similarly to a),
> + * but with fewer segments and configurable base PE.
> + *
> + * i.e. The n'th segment maps to the (n + base)'th PE.
> + *
> + * The base PE is also required to be a multiple of the window size.
> + *
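
For mode d), the mapping rule could be sketched as below. I'm assuming
"window size" here means the number of segments in the BAR; that reading
and the helper names are my own, not something the comment or OPAL
confirms.

```c
#include <stdbool.h>

/*
 * Check a mode d) configuration: 8, 64 or 128 segments, with the base PE
 * a multiple of the segment count (assumed meaning of "window size").
 */
static bool mode_d_config_ok(unsigned int base_pe, unsigned int nr_segments)
{
	bool count_ok = nr_segments == 8 || nr_segments == 64 ||
			nr_segments == 128;

	return count_ok && base_pe % nr_segments == 0;
}

/* The n'th segment maps to the (n + base)'th PE. */
static unsigned int mode_d_segment_to_pe(unsigned int base_pe,
					 unsigned int segment)
{
	return base_pe + segment;
}
```

Under that reading an 8 segment BAR could start at PE 16 but not at PE 20.
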
> + * Unfortunately, the OPAL API doesn't currently (as of skiboot v6.6) allow us
> + * to exploit any of the IODA3 features.
> + */
>
> static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> {
>
--
Alexey