* [PATCH v3 0/8] PCI: Align small BARs
@ 2024-08-07 15:17 Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
` (8 more replies)
0 siblings, 9 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Naveen N. Rao, Thomas Zimmermann, Arnd Bergmann,
Sam Ravnborg, Yongji Xie, Ilpo Järvinen, Philipp Stanner
Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel, linuxppc-dev
In this context, "small" is defined as max(SZ_4K, PAGE_SIZE).
This series sets the default minimum resource alignment to
max(SZ_4K, PAGE_SIZE) for memory BARs. In preparation, it makes an
optimization and addresses some corner cases observed when reallocating
BARs. I consider the prepapatory patches to be prerequisites to changing
the default BAR alignment.
I considered introducing checks for the specific scenarios described,
but chose not to pursue this. A check such as "if (xen_domain())" may be
pretty simple, but that doesn't account for other hypervisors. If other
hypervisors are to be considered, or if we try to dynamically reallocate
BARs for devices being marked for passthrough, such a check may quickly
grow unwieldy. Further, checking for the MSI-X tables residing in a
small (<4k) BAR is unlikely to be a one-liner. Making 4k alignment the
default seems more robust. Lastly, when using IORESOURCE_STARTALIGN, all
resources in the system need to be aligned.
I considered alternatively adding new functionality to the
pci=resource_alignment= option, but that approach was already attempted
and decided against [1].
[1] https://lore.kernel.org/linux-pci/1473757234-5284-4-git-send-email-xyjxie@linux.vnet.ibm.com/
v2->v3:
* clarify 4k vs PAGE_SIZE
* rename ("x86/PCI: Move some logic to new function") to
("x86/PCI: Improve code readability")
* rename ("PCI: Align small (<4k) BARs") to
("PCI: Align small BARs")
v1->v2:
* rename ("PCI: don't clear already cleared bit") to
("PCI: Don't unnecessarily disable memory decoding")
* new patch: ("x86/PCI: Move some logic to new function")
* new patch: ("powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment")
Stewart Hildebrand (8):
x86/PCI: Improve code readability
PCI: Don't unnecessarily disable memory decoding
PCI: Restore resource alignment
PCI: Restore memory decoding after reallocation
x86/PCI: Preserve IORESOURCE_STARTALIGN alignment
powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment
PCI: Don't reassign resources that are already aligned
PCI: Align small BARs
arch/powerpc/kernel/pci-common.c | 6 +++--
arch/x86/pci/i386.c | 38 +++++++++++++++------------
drivers/pci/pci.c | 43 +++++++++++++++++++++++--------
drivers/pci/setup-bus.c | 44 ++++++++++++++++++++++++++++++++
include/linux/pci.h | 2 ++
5 files changed, 103 insertions(+), 30 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 21+ messages in thread* [PATCH v3 1/8] x86/PCI: Improve code readability 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-08 8:55 ` Ilpo Järvinen 2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand ` (7 subsequent siblings) 8 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Ilpo Järvinen, Philipp Stanner Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel The indentation in pcibios_allocate_dev_resources() is unusually deep. Improve that by moving some of its code to a new function, alloc_resource(). Take the opportunity to remove redundant information from dev_dbg(). Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * new subject (was: "x86/PCI: Move some logic to new function") * reword commit message (thanks Philipp) v1->v2: * new patch --- arch/x86/pci/i386.c | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-) diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index f2f4a5d50b27..3abd55902dbc 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -246,6 +246,25 @@ struct pci_check_idx_range { int end; }; +static void alloc_resource(struct pci_dev *dev, int idx, int pass) +{ + struct resource *r = &dev->resource[idx]; + + dev_dbg(&dev->dev, "BAR %d: reserving %pr (p=%d)\n", idx, r, pass); + + if (pci_claim_resource(dev, idx) < 0) { + if (r->flags & IORESOURCE_PCI_FIXED) { + dev_info(&dev->dev, "BAR %d %pR is immovable\n", + idx, r); + } else { + /* We'll assign a new address later */ + pcibios_save_fw_addr(dev, idx, r->start); + r->end -= r->start; + r->start = 0; + } + } +} + static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass) { int idx, disabled, i; @@ -271,23 +290,8 @@ static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass) disabled = !(command & PCI_COMMAND_IO); else disabled = !(command & PCI_COMMAND_MEMORY); - if (pass == disabled) { - dev_dbg(&dev->dev, - "BAR %d: reserving %pr (d=%d, p=%d)\n", - idx, r, disabled, pass); - if (pci_claim_resource(dev, idx) < 0) { - if (r->flags & IORESOURCE_PCI_FIXED) { - dev_info(&dev->dev, "BAR %d %pR is immovable\n", - idx, r); - } else { - /* We'll assign a new address later */ - pcibios_save_fw_addr(dev, - idx, r->start); - r->end -= r->start; - r->start = 0; - } - } - } + if (pass == disabled) + alloc_resource(dev, idx, pass); } if (!pass) { r = &dev->resource[PCI_ROM_RESOURCE]; -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 1/8] x86/PCI: Improve code readability 2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand @ 2024-08-08 8:55 ` Ilpo Järvinen 0 siblings, 0 replies; 21+ messages in thread From: Ilpo Järvinen @ 2024-08-08 8:55 UTC (permalink / raw) To: Stewart Hildebrand Cc: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Philipp Stanner, x86, linux-pci, LKML [-- Attachment #1: Type: text/plain, Size: 2507 bytes --] On Wed, 7 Aug 2024, Stewart Hildebrand wrote: > The indentation in pcibios_allocate_dev_resources() is unusually deep. > Improve that by moving some of its code to a new function, > alloc_resource(). > > Take the opportunity to remove redundant information from dev_dbg(). > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > v2->v3: > * new subject (was: "x86/PCI: Move some logic to new function") > * reword commit message (thanks Philipp) > > v1->v2: > * new patch > --- > arch/x86/pci/i386.c | 38 +++++++++++++++++++++----------------- > 1 file changed, 21 insertions(+), 17 deletions(-) > > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c > index f2f4a5d50b27..3abd55902dbc 100644 > --- a/arch/x86/pci/i386.c > +++ b/arch/x86/pci/i386.c > @@ -246,6 +246,25 @@ struct pci_check_idx_range { > int end; > }; > > +static void alloc_resource(struct pci_dev *dev, int idx, int pass) > +{ > + struct resource *r = &dev->resource[idx]; > + > + dev_dbg(&dev->dev, "BAR %d: reserving %pr (p=%d)\n", idx, r, pass); > + > + if (pci_claim_resource(dev, idx) < 0) { > + if (r->flags & IORESOURCE_PCI_FIXED) { > + dev_info(&dev->dev, "BAR %d %pR is immovable\n", > + idx, r); > + } else { > + /* We'll assign a new address later */ > + pcibios_save_fw_addr(dev, idx, r->start); > + r->end -= r->start; > + r->start = 0; > + } > + } > +} > + > static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass) > { > int idx, disabled, i; > @@ -271,23 +290,8 @@ static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass) > disabled = !(command & PCI_COMMAND_IO); > else > disabled = !(command & PCI_COMMAND_MEMORY); > - if (pass == disabled) { > - dev_dbg(&dev->dev, > - "BAR %d: reserving %pr (d=%d, p=%d)\n", > - idx, r, disabled, pass); > - if (pci_claim_resource(dev, idx) < 0) { > - if (r->flags & IORESOURCE_PCI_FIXED) { > - dev_info(&dev->dev, "BAR %d %pR is immovable\n", > - idx, r); > - } else { > - /* We'll assign a new address later */ > - pcibios_save_fw_addr(dev, > - idx, r->start); > - r->end -= r->start; > - r->start = 0; > - } > - } > - } > + if (pass == disabled) > + alloc_resource(dev, idx, pass); > } > if (!pass) { > r = &dev->resource[PCI_ROM_RESOURCE]; > Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> -- i. ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand ` (6 subsequent siblings) 8 siblings, 0 replies; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel If alignment is requested for a device and all the BARs are already sufficiently aligned, there is no need to disable memory decoding for the device. Add a check for this scenario to save a PCI config space access. Also, there is no need to disable memory decoding if already disabled, since the bit in question (PCI_COMMAND_MEMORY) is a RW bit. Make the write conditional to save a PCI config space access. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * new subject (was: "PCI: don't clear already cleared bit") * don't disable memory decoding if alignment is not needed --- drivers/pci/pci.c | 31 +++++++++++++++++++++---------- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index e3a49f66982d..acecdd6edd5a 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -6564,7 +6564,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev, return align; } -static void pci_request_resource_alignment(struct pci_dev *dev, int bar, +static bool pci_request_resource_alignment(struct pci_dev *dev, int bar, resource_size_t align, bool resize) { struct resource *r = &dev->resource[bar]; @@ -6572,17 +6572,17 @@ static void pci_request_resource_alignment(struct pci_dev *dev, int bar, resource_size_t size; if (!(r->flags & IORESOURCE_MEM)) - return; + return false; if (r->flags & IORESOURCE_PCI_FIXED) { pci_info(dev, "%s %pR: ignoring requested alignment %#llx\n", r_name, r, (unsigned long long)align); - return; + return false; } size = resource_size(r); if (size >= align) - return; + return false; /* * Increase the alignment of the resource. There are two ways we @@ -6625,6 +6625,8 @@ static void pci_request_resource_alignment(struct pci_dev *dev, int bar, r->end = r->start + size - 1; } r->flags |= IORESOURCE_UNSET; + + return true; } /* @@ -6640,7 +6642,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) struct resource *r; resource_size_t align; u16 command; - bool resize = false; + bool resize = false, align_needed = false; /* * VF BARs are read-only zero according to SR-IOV spec r1.1, sec @@ -6662,12 +6664,21 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) return; } - pci_read_config_word(dev, PCI_COMMAND, &command); - command &= ~PCI_COMMAND_MEMORY; - pci_write_config_word(dev, PCI_COMMAND, command); + for (i = 0; i <= PCI_ROM_RESOURCE; i++) { + bool ret; - for (i = 0; i <= PCI_ROM_RESOURCE; i++) - pci_request_resource_alignment(dev, i, align, resize); + ret = pci_request_resource_alignment(dev, i, align, resize); + align_needed = align_needed || ret; + } + + if (!align_needed) + return; + + pci_read_config_word(dev, PCI_COMMAND, &command); + if (command & PCI_COMMAND_MEMORY) { + command &= ~PCI_COMMAND_MEMORY; + pci_write_config_word(dev, PCI_COMMAND, command); + } /* * Need to disable bridge's resource window, -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 3/8] PCI: Restore resource alignment 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-08 19:28 ` Bjorn Helgaas 2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand ` (5 subsequent siblings) 8 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel Devices with alignment specified will lose their alignment in cases when the bridge resources have been released, e.g. due to insufficient bridge window size. Restore the alignment. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * capitalize subject text --- drivers/pci/setup-bus.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 23082bc0ca37..ab7510ce6917 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge, } } +static void restore_child_resource_alignment(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + struct pci_bus *b; + + pci_reassigndev_resource_alignment(dev); + + b = dev->subordinate; + if (!b) + continue; + + restore_child_resource_alignment(b); + } +} + #define PCI_RES_TYPE_MASK \ (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\ IORESOURCE_MEM_64) @@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus, r->start = 0; r->flags = 0; + restore_child_resource_alignment(bus); + /* Avoiding touch the one without PREF */ if (type & IORESOURCE_PREFETCH) type = IORESOURCE_PREFETCH; -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment 2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand @ 2024-08-08 19:28 ` Bjorn Helgaas 2024-08-08 20:28 ` Stewart Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-08 19:28 UTC (permalink / raw) To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote: > Devices with alignment specified will lose their alignment in cases when > the bridge resources have been released, e.g. due to insufficient bridge > window size. Restore the alignment. I guess this fixes a problem when the user has specified "pci=resource_alignment=..." and we've decided to release and reallocate a bridge window? Just looking for a bit more concrete description of what this problem would look like to a user. > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > v2->v3: > * no change > > v1->v2: > * capitalize subject text > --- > drivers/pci/setup-bus.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > index 23082bc0ca37..ab7510ce6917 100644 > --- a/drivers/pci/setup-bus.c > +++ b/drivers/pci/setup-bus.c > @@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge, > } > } > > +static void restore_child_resource_alignment(struct pci_bus *bus) > +{ > + struct pci_dev *dev; > + > + list_for_each_entry(dev, &bus->devices, bus_list) { > + struct pci_bus *b; > + > + pci_reassigndev_resource_alignment(dev); > + > + b = dev->subordinate; > + if (!b) > + continue; > + > + restore_child_resource_alignment(b); > + } > +} > + > #define PCI_RES_TYPE_MASK \ > (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\ > IORESOURCE_MEM_64) > @@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus, > r->start = 0; > r->flags = 0; > > + restore_child_resource_alignment(bus); > + > /* Avoiding touch the one without PREF */ > if (type & IORESOURCE_PREFETCH) > type = IORESOURCE_PREFETCH; > -- > 2.46.0 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment 2024-08-08 19:28 ` Bjorn Helgaas @ 2024-08-08 20:28 ` Stewart Hildebrand 2024-08-08 21:54 ` Bjorn Helgaas 0 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-08 20:28 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On 8/8/24 15:28, Bjorn Helgaas wrote: > On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote: >> Devices with alignment specified will lose their alignment in cases when >> the bridge resources have been released, e.g. due to insufficient bridge >> window size. Restore the alignment. > > I guess this fixes a problem when the user has specified > "pci=resource_alignment=..." and we've decided to release and > reallocate a bridge window? Just looking for a bit more concrete > description of what this problem would look like to a user. Yes. When alignment has been specified via pcibios_default_alignment() or by the user with "pci=resource_alignment=...", and the bridge window is being reallocated, the specified alignment is lost and the resource may not be sufficiently aligned after reallocation. I can expand the commit description. > >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> --- >> v2->v3: >> * no change >> >> v1->v2: >> * capitalize subject text >> --- >> drivers/pci/setup-bus.c | 19 +++++++++++++++++++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c >> index 23082bc0ca37..ab7510ce6917 100644 >> --- a/drivers/pci/setup-bus.c >> +++ b/drivers/pci/setup-bus.c >> @@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge, >> } >> } >> >> +static void restore_child_resource_alignment(struct pci_bus *bus) >> +{ >> + struct pci_dev *dev; >> + >> + list_for_each_entry(dev, &bus->devices, bus_list) { >> + struct pci_bus *b; >> + >> + pci_reassigndev_resource_alignment(dev); >> + >> + b = dev->subordinate; >> + if (!b) >> + continue; >> + >> + restore_child_resource_alignment(b); >> + } >> +} >> + >> #define PCI_RES_TYPE_MASK \ >> (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\ >> IORESOURCE_MEM_64) >> @@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus, >> r->start = 0; >> r->flags = 0; >> >> + restore_child_resource_alignment(bus); >> + >> /* Avoiding touch the one without PREF */ >> if (type & IORESOURCE_PREFETCH) >> type = IORESOURCE_PREFETCH; >> -- >> 2.46.0 >> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment 2024-08-08 20:28 ` Stewart Hildebrand @ 2024-08-08 21:54 ` Bjorn Helgaas 2024-08-14 18:12 ` Stewart Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-08 21:54 UTC (permalink / raw) To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On Thu, Aug 08, 2024 at 04:28:50PM -0400, Stewart Hildebrand wrote: > On 8/8/24 15:28, Bjorn Helgaas wrote: > > On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote: > >> Devices with alignment specified will lose their alignment in cases when > >> the bridge resources have been released, e.g. due to insufficient bridge > >> window size. Restore the alignment. > > > > I guess this fixes a problem when the user has specified > > "pci=resource_alignment=..." and we've decided to release and > > reallocate a bridge window? Just looking for a bit more concrete > > description of what this problem would look like to a user. > > Yes. When alignment has been specified via pcibios_default_alignment() > or by the user with "pci=resource_alignment=...", and the bridge window > is being reallocated, the specified alignment is lost and the resource > may not be sufficiently aligned after reallocation. > > I can expand the commit description. I think a hint about where the alignment gets lost would be helpful, too. This seems like a problem users could be seeing today, even independent of the device passthrough issue that I think is the main thrust of this series. If there's a problem report or an easy way to reproduce this problem, that would be nice, too. Bjorn ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment 2024-08-08 21:54 ` Bjorn Helgaas @ 2024-08-14 18:12 ` Stewart Hildebrand 0 siblings, 0 replies; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-14 18:12 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On 8/8/24 17:54, Bjorn Helgaas wrote: > On Thu, Aug 08, 2024 at 04:28:50PM -0400, Stewart Hildebrand wrote: >> On 8/8/24 15:28, Bjorn Helgaas wrote: >>> On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote: >>>> Devices with alignment specified will lose their alignment in cases when >>>> the bridge resources have been released, e.g. due to insufficient bridge >>>> window size. Restore the alignment. >>> >>> I guess this fixes a problem when the user has specified >>> "pci=resource_alignment=..." and we've decided to release and >>> reallocate a bridge window? Just looking for a bit more concrete >>> description of what this problem would look like to a user. >> >> Yes. When alignment has been specified via pcibios_default_alignment() >> or by the user with "pci=resource_alignment=...", and the bridge window >> is being reallocated, the specified alignment is lost and the resource >> may not be sufficiently aligned after reallocation. >> >> I can expand the commit description. > > I think a hint about where the alignment gets lost would be helpful, > too. > > This seems like a problem users could be seeing today, even > independent of the device passthrough issue that I think is the main > thrust of this series. If there's a problem report or an easy way to > reproduce this problem, that would be nice, too. Oh, sorry, I just realized that it's only alignments with IORESOURCE_STARTALIGN that get lost during bridge window realloc. Specifically, r->start gets overwritten in __release_child_resources(). There are a few code paths, such as the one in __release_child_resources(), that assume IORESOURCE_SIZEALIGN. In case of IORESOURCE_SIZEALIGN, the alignment is restored by a simple calculation. However, with IORESOURCE_STARTALIGN, we can't use a simple calculation, instead we need to consult pci_reassigndev_resource_alignment() to restore the alignment. I'll update the commit message. An alternative approach might be to store the alignment in a new member in struct resource, thus saving us from having to call pci_reassigndev_resource_alignment() for each PCI device on the bridge undergoing reallocation. BTW, this patch and the two "[powerpc,x86]/pci: Preserve IORESOURCE_STARTALIGN alignment" patches could potentially be folded into a single patch as they're all dealing with fixing IORESOURCE_STARTALIGN. To repro the issue on x86, we will need to apply this series except the current patch [3/8]. I ran into the issue with the following device. Let's call it example 1. Scrubbed/partial output from lspci -vv: Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256] Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512] Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K] Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K] Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K] Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M] Capabilities: [b0] MSI-X: Enable- Count=2 Masked- Vector table: BAR=0 offset=00000080 PBA: BAR=0 offset=00000090 Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV) IOVCap: Migration-, Interrupt Message Number: 000 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ IOVSta: Migration- Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00 VF offset: 6, stride: 1, Device ID: 0100 Supported Page Size: 00000553, System Page Size: 00000001 Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable) VF Migration: offset: 00000000, BIR: 0 Kernel driver in use: pci-endpoint-test Kernel modules: pci_endpoint_test BARs 0, 1, and 2 are small, and the host firmware placed them on the same page. The host firmware did not page align the BARs. There is also an SR-IOV BAR that the firmware couldn't fit in the bridge window. In example 1, I did not specify pci=realloc=on. I used a kernel with CONFIG_PCI_REALLOC_ENABLE_AUTO=y, and the SR-IOV BAR triggered the bridge window realloc. Alignment was requested for BARs 0, 1, and 2 of this device, using IORESOURCE_STARTALIGN, because of patch [8/8]. The alignment was lost during realloc. Such a device from example 1 may be hard to come by, so here's a way to reproduce with qemu. Example 2: 01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe800000 (32-bit, non-prefetchable) [size=4K] I/O ports at c000 [size=256] Memory at 7050000000 (64-bit, prefetchable) [size=32] 01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe801000 (32-bit, non-prefetchable) [size=4K] I/O ports at c100 [size=256] Memory at 7050000020 (64-bit, prefetchable) [size=32] 01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe802000 (32-bit, non-prefetchable) [size=4K] I/O ports at c200 [size=256] Memory at 7050000040 (64-bit, prefetchable) [size=32] 01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe803000 (32-bit, non-prefetchable) [size=4K] I/O ports at c300 [size=256] Memory at 7040000000 (64-bit, prefetchable) [size=256M] This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack. In this case we will need to specify pci=realloc=on to trigger the realloc because there's no SR-IOV BAR to trigger it automatically. Add this to your usual qemu-system-x86_64 args: -append "console=ttyS0 ignore_loglevel pci=realloc=on" \ -device pcie-pci-bridge,id=pcie.1 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=268435456 Apply this SeaBIOS hack: diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index b3e359d7..769007a4 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -25,7 +25,7 @@ #include "util.h" // pci_setup #include "x86.h" // outb -#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size +#define PCI_DEVICE_MEM_MIN (0) #define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size #define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec @@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr) pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT); } if (entry->type == PCI_REGION_TYPE_PREFMEM) { + limit = addr + PCI_BRIDGE_MEM_MIN - 1; pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT); pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT); pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32); ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 4/8] PCI: Restore memory decoding after reallocation 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (2 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-08 19:37 ` Bjorn Helgaas 2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand ` (4 subsequent siblings) 8 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel Currently, the PCI subsystem unconditionally clears the memory decoding bit of devices with resource alignment specified. While drivers should call pci_enable_device() to enable memory decoding, some platforms have devices that expect memory decoding to be enabled even when no driver is bound to the device. It is assumed firmware enables memory decoding in these cases. For example, the vgacon driver uses the 0xb8000 buffer to display a console without any PCI involvement, yet, on some platforms, memory decoding must be enabled on the PCI VGA device in order for the console to be displayed properly. Restore the memory decoding bit after the resource has been reallocated. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * capitalize subject text * reword commit message Testing notes / how to check if your platform needs memory decoding enabled in order to use vgacon: 1. Boot your system with a monitor attached, without any driver bound to your PCI VGA device. You should see a console on your display. 2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this example). 3. Disable memory decoding (replace 00:01.0 with your SBDF): $ sudo setpci -v -s 00:01.0 0x04.W=0x05 4. Type something to see if it appears on the console display 5. Re-enable memory decoding: $ sudo setpci -v -s 00:01.0 0x04.W=0x07 Relevant prior discussion at [1] [1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/ --- drivers/pci/pci.c | 1 + drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++ include/linux/pci.h | 2 ++ 3 files changed, 28 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index acecdd6edd5a..4b97d8d5c2d8 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) pci_read_config_word(dev, PCI_COMMAND, &command); if (command & PCI_COMMAND_MEMORY) { + dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE; command &= ~PCI_COMMAND_MEMORY; pci_write_config_word(dev, PCI_COMMAND, command); } diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index ab7510ce6917..6847b251e19a 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus, } } +static void restore_memory_decoding(struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + struct pci_bus *b; + + if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) { + u16 command; + + pci_read_config_word(dev, PCI_COMMAND, &command); + command |= PCI_COMMAND_MEMORY; + pci_write_config_word(dev, PCI_COMMAND, command); + } + + b = dev->subordinate; + if (!b) + continue; + + restore_memory_decoding(b); + } +} + /* * First try will not touch PCI bridge res. * Second and later try will clear small leaf bridge res. @@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) goto again; dump: + restore_memory_decoding(bus); + /* Dump the resource on buses */ pci_bus_dump_resources(bus); } diff --git a/include/linux/pci.h b/include/linux/pci.h index 4246cb790c7b..74636acf152f 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -245,6 +245,8 @@ enum pci_dev_flags { PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11), /* Device does honor MSI masking despite saying otherwise */ PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12), + /* Firmware enabled memory decoding, to be restored if BAR is updated */ + PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13), }; enum pci_irq_reroute_variant { -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 4/8] PCI: Restore memory decoding after reallocation 2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand @ 2024-08-08 19:37 ` Bjorn Helgaas 2024-08-14 18:31 ` Stewart Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-08 19:37 UTC (permalink / raw) To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On Wed, Aug 07, 2024 at 11:17:13AM -0400, Stewart Hildebrand wrote: > Currently, the PCI subsystem unconditionally clears the memory decoding > bit of devices with resource alignment specified. While drivers should > call pci_enable_device() to enable memory decoding, some platforms have > devices that expect memory decoding to be enabled even when no driver is > bound to the device. It is assumed firmware enables memory decoding in > these cases. For example, the vgacon driver uses the 0xb8000 buffer to > display a console without any PCI involvement, yet, on some platforms, > memory decoding must be enabled on the PCI VGA device in order for the > console to be displayed properly. > > Restore the memory decoding bit after the resource has been reallocated. > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > v2->v3: > * no change > > v1->v2: > * capitalize subject text > * reword commit message > > Testing notes / how to check if your platform needs memory decoding > enabled in order to use vgacon: > 1. Boot your system with a monitor attached, without any driver bound to > your PCI VGA device. You should see a console on your display. > 2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this > example). > 3. Disable memory decoding (replace 00:01.0 with your SBDF): > $ sudo setpci -v -s 00:01.0 0x04.W=0x05 > 4. Type something to see if it appears on the console display > 5. Re-enable memory decoding: > $ sudo setpci -v -s 00:01.0 0x04.W=0x07 > > Relevant prior discussion at [1] > > [1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/ > --- > drivers/pci/pci.c | 1 + > drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++ > include/linux/pci.h | 2 ++ > 3 files changed, 28 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index acecdd6edd5a..4b97d8d5c2d8 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) > > pci_read_config_word(dev, PCI_COMMAND, &command); > if (command & PCI_COMMAND_MEMORY) { > + dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE; > command &= ~PCI_COMMAND_MEMORY; > pci_write_config_word(dev, PCI_COMMAND, command); > } > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > index ab7510ce6917..6847b251e19a 100644 > --- a/drivers/pci/setup-bus.c > +++ b/drivers/pci/setup-bus.c > @@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus, > } > } > > +static void restore_memory_decoding(struct pci_bus *bus) > +{ > + struct pci_dev *dev; > + > + list_for_each_entry(dev, &bus->devices, bus_list) { > + struct pci_bus *b; > + > + if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) { > + u16 command; > + > + pci_read_config_word(dev, PCI_COMMAND, &command); > + command |= PCI_COMMAND_MEMORY; > + pci_write_config_word(dev, PCI_COMMAND, command); > + } > + > + b = dev->subordinate; > + if (!b) > + continue; > + > + restore_memory_decoding(b); > + } > +} > + > /* > * First try will not touch PCI bridge res. > * Second and later try will clear small leaf bridge res. > @@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) > goto again; > > dump: > + restore_memory_decoding(bus); The fact that we traverse the whole hierarchy here to restore PCI_COMMAND_MEMORY makes me suspect there's a window between point A (where we clear PCI_COMMAND_MEMORY and update a BAR) and point B (where we restore PCI_COMMAND_MEMORY) where VGA console output will not work. This tickles my memory like we might have talked about this before and there's some reason for having to wait. If so, sorry, and maybe we need a comment in the code about that reason. > /* Dump the resource on buses */ > pci_bus_dump_resources(bus); > } > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 4246cb790c7b..74636acf152f 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -245,6 +245,8 @@ enum pci_dev_flags { > PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11), > /* Device does honor MSI masking despite saying otherwise */ > PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12), > + /* Firmware enabled memory decoding, to be restored if BAR is updated */ > + PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13), > }; > > enum pci_irq_reroute_variant { > -- > 2.46.0 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 4/8] PCI: Restore memory decoding after reallocation 2024-08-08 19:37 ` Bjorn Helgaas @ 2024-08-14 18:31 ` Stewart Hildebrand 0 siblings, 0 replies; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-14 18:31 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel On 8/8/24 15:37, Bjorn Helgaas wrote: > On Wed, Aug 07, 2024 at 11:17:13AM -0400, Stewart Hildebrand wrote: >> Currently, the PCI subsystem unconditionally clears the memory decoding >> bit of devices with resource alignment specified. While drivers should >> call pci_enable_device() to enable memory decoding, some platforms have >> devices that expect memory decoding to be enabled even when no driver is >> bound to the device. It is assumed firmware enables memory decoding in >> these cases. For example, the vgacon driver uses the 0xb8000 buffer to >> display a console without any PCI involvement, yet, on some platforms, >> memory decoding must be enabled on the PCI VGA device in order for the >> console to be displayed properly. >> >> Restore the memory decoding bit after the resource has been reallocated. >> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> --- >> v2->v3: >> * no change >> >> v1->v2: >> * capitalize subject text >> * reword commit message >> >> Testing notes / how to check if your platform needs memory decoding >> enabled in order to use vgacon: >> 1. Boot your system with a monitor attached, without any driver bound to >> your PCI VGA device. You should see a console on your display. >> 2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this >> example). >> 3. Disable memory decoding (replace 00:01.0 with your SBDF): >> $ sudo setpci -v -s 00:01.0 0x04.W=0x05 >> 4. Type something to see if it appears on the console display >> 5. Re-enable memory decoding: >> $ sudo setpci -v -s 00:01.0 0x04.W=0x07 >> >> Relevant prior discussion at [1] >> >> [1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/ >> --- >> drivers/pci/pci.c | 1 + >> drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++ >> include/linux/pci.h | 2 ++ >> 3 files changed, 28 insertions(+) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index acecdd6edd5a..4b97d8d5c2d8 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev) >> >> pci_read_config_word(dev, PCI_COMMAND, &command); >> if (command & PCI_COMMAND_MEMORY) { >> + dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE; >> command &= ~PCI_COMMAND_MEMORY; >> pci_write_config_word(dev, PCI_COMMAND, command); >> } >> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c >> index ab7510ce6917..6847b251e19a 100644 >> --- a/drivers/pci/setup-bus.c >> +++ b/drivers/pci/setup-bus.c >> @@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus, >> } >> } >> >> +static void restore_memory_decoding(struct pci_bus *bus) >> +{ >> + struct pci_dev *dev; >> + >> + list_for_each_entry(dev, &bus->devices, bus_list) { >> + struct pci_bus *b; >> + >> + if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) { >> + u16 command; >> + >> + pci_read_config_word(dev, PCI_COMMAND, &command); >> + command |= PCI_COMMAND_MEMORY; >> + pci_write_config_word(dev, PCI_COMMAND, command); >> + } >> + >> + b = dev->subordinate; >> + if (!b) >> + continue; >> + >> + restore_memory_decoding(b); >> + } >> +} >> + >> /* >> * First try will not touch PCI bridge res. >> * Second and later try will clear small leaf bridge res. >> @@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) >> goto again; >> >> dump: >> + restore_memory_decoding(bus); > > The fact that we traverse the whole hierarchy here to restore > PCI_COMMAND_MEMORY makes me suspect there's a window between point A > (where we clear PCI_COMMAND_MEMORY and update a BAR) and point B > (where we restore PCI_COMMAND_MEMORY) where VGA console output will > not work. Yes, there is a brief moment of garbled junk on the display, but it is not fatal. The VGA console display returns to normal after the bit is restored. > This tickles my memory like we might have talked about this before and > there's some reason for having to wait. If so, sorry, and maybe we > need a comment in the code about that reason. I don't have a strong opinion on which way to go with this, but my understanding is that we want memory decoding disabled while r->start doesn't match the actual BAR value. If you agree with this rationale, I'll add something to that effect in a comment. See the prior discussion at [1] (link above), and discussion from v1 of this patch [2]. [2] https://lore.kernel.org/linux-pci/1d0b52bd-1654-4510-92dc-bd48447ff41d@amd.com/ >> /* Dump the resource on buses */ >> pci_bus_dump_resources(bus); >> } >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index 4246cb790c7b..74636acf152f 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -245,6 +245,8 @@ enum pci_dev_flags { >> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11), >> /* Device does honor MSI masking despite saying otherwise */ >> PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12), >> + /* Firmware enabled memory decoding, to be restored if BAR is updated */ >> + PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13), >> }; >> >> enum pci_irq_reroute_variant { >> -- >> 2.46.0 >> ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (3 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-08 20:10 ` Bjorn Helgaas 2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand ` (3 subsequent siblings) 8 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Yongji Xie, Ilpo Järvinen Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel There is a corner case in pcibios_allocate_dev_resources() where the IORESOURCE_STARTALIGN alignment of memory BAR resources gets overwritten. This does not affect bridge resources. The corner case is not yet possible to trigger on x86, but it will be possible once the default resource alignment changes, and memory BAR resources will start to use IORESOURCE_STARTALIGN. Account for IORESOURCE_STARTALIGN in preparation for changing the default resource alignment. Skip the pcibios_save_fw_addr() call since the resource doesn't contain a valid address when alignment has been requested. The impact of this is that we won't be able to restore the firmware allocated BAR, which does not meet alignment requirements anyway. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * capitalize subject text * clarify commit message * skip pcibios_save_fw_addr() call --- arch/x86/pci/i386.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c index 3abd55902dbc..13d7f7ac3bde 100644 --- a/arch/x86/pci/i386.c +++ b/arch/x86/pci/i386.c @@ -256,7 +256,7 @@ static void alloc_resource(struct pci_dev *dev, int idx, int pass) if (r->flags & IORESOURCE_PCI_FIXED) { dev_info(&dev->dev, "BAR %d %pR is immovable\n", idx, r); - } else { + } else if (!(r->flags & IORESOURCE_STARTALIGN)) { /* We'll assign a new address later */ pcibios_save_fw_addr(dev, idx, r->start); r->end -= r->start; -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment 2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand @ 2024-08-08 20:10 ` Bjorn Helgaas 0 siblings, 0 replies; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-08 20:10 UTC (permalink / raw) To: Stewart Hildebrand Cc: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Yongji Xie, Ilpo Järvinen, x86, linux-pci, linux-kernel On Wed, Aug 07, 2024 at 11:17:14AM -0400, Stewart Hildebrand wrote: > There is a corner case in pcibios_allocate_dev_resources() where the > IORESOURCE_STARTALIGN alignment of memory BAR resources gets > overwritten. This does not affect bridge resources. The corner case is > not yet possible to trigger on x86, but it will be possible once the > default resource alignment changes, and memory BAR resources will start > to use IORESOURCE_STARTALIGN. I see from [8/8] that "Changing pcibios_default_alignment() results in the second method of alignment with IORESOURCE_STARTALIGN", but that connection is not at all obvious, and there's no patch in this series that sets IORESOURCE_STARTALIGN, so it's kind of hard to connect all the dots here. The only caller of pcibios_default_alignment() is pci_specified_resource_alignment(), and that doesn't mention IORESOURCE_STARTALIGN either. Neither does pcibios_allocate_dev_resources(). I feel like we've had this conversation before; apologies if so. But we need to figure out to make this more explicit and less magic. > Account for IORESOURCE_STARTALIGN in > preparation for changing the default resource alignment. > > Skip the pcibios_save_fw_addr() call since the resource doesn't contain > a valid address when alignment has been requested. The impact of this is > that we won't be able to restore the firmware allocated BAR, which does > not meet alignment requirements anyway. > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > v2->v3: > * no change > > v1->v2: > * capitalize subject text > * clarify commit message > * skip pcibios_save_fw_addr() call > --- > arch/x86/pci/i386.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c > index 3abd55902dbc..13d7f7ac3bde 100644 > --- a/arch/x86/pci/i386.c > +++ b/arch/x86/pci/i386.c > @@ -256,7 +256,7 @@ static void alloc_resource(struct pci_dev *dev, int idx, int pass) > if (r->flags & IORESOURCE_PCI_FIXED) { > dev_info(&dev->dev, "BAR %d %pR is immovable\n", > idx, r); > - } else { > + } else if (!(r->flags & IORESOURCE_STARTALIGN)) { > /* We'll assign a new address later */ > pcibios_save_fw_addr(dev, idx, r->start); > r->end -= r->start; > -- > 2.46.0 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v3 6/8] powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (4 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand ` (2 subsequent siblings) 8 siblings, 0 replies; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas, Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N. Rao, Thomas Zimmermann, Arnd Bergmann, Sam Ravnborg Cc: Stewart Hildebrand, linux-pci, linux-kernel, linuxppc-dev There is a corner case in pcibios_allocate_resources()/alloc_resource() where the IORESOURCE_STARTALIGN alignment of memory BAR resources gets overwritten. This does not affect bridge resources. Account for IORESOURCE_STARTALIGN. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * new patch --- arch/powerpc/kernel/pci-common.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index eac84d687b53..3f346ad641e0 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1303,8 +1303,10 @@ static inline void alloc_resource(struct pci_dev *dev, int idx) pr_debug("PCI: parent is %p: %pR\n", pr, pr); /* We'll assign a new address later */ r->flags |= IORESOURCE_UNSET; - r->end -= r->start; - r->start = 0; + if (!(r->flags & IORESOURCE_STARTALIGN)) { + r->end -= r->start; + r->start = 0; + } } } -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (5 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand 2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann 8 siblings, 0 replies; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel As a follow-up to commit 0dde1c08d1b9 ("PCI: Don't reassign resources that are already aligned"), don't reassign already aligned resources that have requested alignment without resizing. Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- v2->v3: * no change v1->v2: * capitalize subject text --- drivers/pci/pci.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 4b97d8d5c2d8..af34407f2fb9 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -6584,6 +6584,9 @@ static bool pci_request_resource_alignment(struct pci_dev *dev, int bar, if (size >= align) return false; + if (!resize && (r->start > align) && !(r->start & (align - 1))) + return false; + /* * Increase the alignment of the resource. There are two ways we * can do this: -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 8/8] PCI: Align small BARs 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (6 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand @ 2024-08-07 15:17 ` Stewart Hildebrand 2024-08-08 21:53 ` Bjorn Helgaas 2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann 8 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw) To: Bjorn Helgaas, Ilpo Järvinen Cc: Stewart Hildebrand, linux-pci, linux-kernel In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE). Issues observed when small BARs are not sufficiently aligned are: 1. Devices to be passed through (to e.g. a Xen HVM guest) with small BARs require each memory BAR to be page aligned. Currently, the only way to guarantee this alignment from a user perspective is to fake the size of the BARs using the pci=resource_alignment= option. This is a bad user experience, and faking the BAR size is not always desirable. For example, pcitest is a tool that is useful for PCI passthrough validation with Xen, but pcitest fails with a fake BAR size. 2. Devices with multiple small BARs could have the MSI-X tables located in one of its small BARs. This may lead to the MSI-X tables being mapped in the same 4k region as other data. The PCIe 6.1 specification (section 7.7.2 MSI-X Capability and Table Structure) says we probably should avoid that. To improve the user experience (i.e. don't require the user to specify pci=resource_alignment=), and increase conformance to PCIe spec, set the default minimum resource alignment of memory BARs to the greater of 4k or PAGE_SIZE. Quoting the comment in drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways we can increase the resource alignment: 1) Increase the size of the resource. BARs are aligned on their size, so when we reallocate space for this resource, we'll allocate it with the larger alignment. This also prevents assignment of any other BARs inside the alignment region, so if we're requesting page alignment, this means no other BARs will share the page. The disadvantage is that this makes the resource larger than the hardware BAR, which may break drivers that compute things based on the resource size, e.g., to find registers at a fixed offset before the end of the BAR. 2) Retain the resource size, but use IORESOURCE_STARTALIGN and set r->start to the desired alignment. By itself this doesn't prevent other BARs being put inside the alignment region, but if we realign *every* resource of every device in the system, none of them will share an alignment region. Changing pcibios_default_alignment() results in the second method of alignment with IORESOURCE_STARTALIGN. The new default alignment may be overridden by arches by implementing pcibios_default_alignment(), or by the user on a per-device basis with the pci=resource_alignment= option (although this reverts to using IORESOURCE_SIZEALIGN). Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> --- Preparatory patches in this series are prerequisites to this patch. v2->v3: * new subject (was: "PCI: Align small (<4k) BARs") * clarify 4k vs PAGE_SIZE in commit message v1->v2: * capitalize subject text * s/4 * 1024/SZ_4K/ * #include <linux/sizes.h> * update commit message * use max(SZ_4K, PAGE_SIZE) for alignment value --- drivers/pci/pci.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index af34407f2fb9..efdd5b85ea8c 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -31,6 +31,7 @@ #include <asm/dma.h> #include <linux/aer.h> #include <linux/bitfield.h> +#include <linux/sizes.h> #include "pci.h" DEFINE_MUTEX(pci_slot_mutex); @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev) resource_size_t __weak pcibios_default_alignment(void) { - return 0; + /* + * Avoid MSI-X tables being mapped in the same 4k region as other data + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability + * and Table Structure. + */ + return max(SZ_4K, PAGE_SIZE); } /* -- 2.46.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs 2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand @ 2024-08-08 21:53 ` Bjorn Helgaas 2024-08-14 13:55 ` Stewart Hildebrand 0 siblings, 1 reply; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-08 21:53 UTC (permalink / raw) To: Stewart Hildebrand Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote: > In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE). > > Issues observed when small BARs are not sufficiently aligned are: > > 1. Devices to be passed through (to e.g. a Xen HVM guest) with small > BARs require each memory BAR to be page aligned. Currently, the only way > to guarantee this alignment from a user perspective is to fake the size > of the BARs using the pci=resource_alignment= option. This is a bad user > experience, and faking the BAR size is not always desirable. For > example, pcitest is a tool that is useful for PCI passthrough validation > with Xen, but pcitest fails with a fake BAR size. I guess this is the "money" patch for the main problem you're solving, i.e., passthrough to a guest doesn't work as you want? Is it the case that if you have two BARs in the same page, a device can't be passed through to a guest at all? Or is it just that all devices with BARs that share a page have to be passed through to the same guest, sort of like how lack of ACS can force several devices to be in the same IOMMU isolation group? I think the subject should mention the problem to help motivate this. The fact that we address this by potentially reassigning every BAR of every device, regardless of whether the admin even wants to pass through a device to a guest, seems a bit aggressive to me. Previously we haven't trusted our reassignment machinery enough to enable it all the time, so we still have the "pci=realloc" parameter. By default, I don't think we even move devices around to make space for a BAR that we failed to allocate. I agree "pci=resource_alignment=" is a bit user-unfriendly, and I don't think it solves the problem unless we apply it to every device in the system. > 2. Devices with multiple small BARs could have the MSI-X tables located > in one of its small BARs. This may lead to the MSI-X tables being mapped > in the same 4k region as other data. The PCIe 6.1 specification (section > 7.7.2 MSI-X Capability and Table Structure) says we probably should > avoid that. If you're referring to this: If a Base Address Register or entry in the Enhanced Allocation capability that maps address space for the MSI-X Table or MSI-X PBA also maps other usable address space that is not associated with MSI-X structures, locations (e.g., for CSRs) used in the other address space must not share any naturally aligned 4-KB address range with one where either MSI-X structure resides. This allows system software where applicable to use different processor attributes for MSI-X structures and the other address space. I think this is technically a requirement about how space within a single BAR should be organized, not about how multiple BARs should be assigned. I don't think this really adds to the case for what you're doing, so we could just drop it. > To improve the user experience (i.e. don't require the user to specify > pci=resource_alignment=), and increase conformance to PCIe spec, set the > default minimum resource alignment of memory BARs to the greater of 4k > or PAGE_SIZE. > > Quoting the comment in > drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways > we can increase the resource alignment: > > 1) Increase the size of the resource. BARs are aligned on their > size, so when we reallocate space for this resource, we'll > allocate it with the larger alignment. This also prevents > assignment of any other BARs inside the alignment region, so > if we're requesting page alignment, this means no other BARs > will share the page. > > The disadvantage is that this makes the resource larger than > the hardware BAR, which may break drivers that compute things > based on the resource size, e.g., to find registers at a > fixed offset before the end of the BAR. > > 2) Retain the resource size, but use IORESOURCE_STARTALIGN and > set r->start to the desired alignment. By itself this > doesn't prevent other BARs being put inside the alignment > region, but if we realign *every* resource of every device in > the system, none of them will share an alignment region. > > Changing pcibios_default_alignment() results in the second method of > alignment with IORESOURCE_STARTALIGN. > > The new default alignment may be overridden by arches by implementing > pcibios_default_alignment(), or by the user on a per-device basis with > the pci=resource_alignment= option (although this reverts to using > IORESOURCE_SIZEALIGN). > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > Preparatory patches in this series are prerequisites to this patch. > > v2->v3: > * new subject (was: "PCI: Align small (<4k) BARs") > * clarify 4k vs PAGE_SIZE in commit message > > v1->v2: > * capitalize subject text > * s/4 * 1024/SZ_4K/ > * #include <linux/sizes.h> > * update commit message > * use max(SZ_4K, PAGE_SIZE) for alignment value > --- > drivers/pci/pci.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index af34407f2fb9..efdd5b85ea8c 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -31,6 +31,7 @@ > #include <asm/dma.h> > #include <linux/aer.h> > #include <linux/bitfield.h> > +#include <linux/sizes.h> > #include "pci.h" > > DEFINE_MUTEX(pci_slot_mutex); > @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev) > > resource_size_t __weak pcibios_default_alignment(void) > { > - return 0; > + /* > + * Avoid MSI-X tables being mapped in the same 4k region as other data > + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability > + * and Table Structure. > + */ I think this is sort of a "spec compliance" comment that is not the *real* reason we want to do this, i.e., it doesn't say that by doing this we can pass through more devices to guests. Doing this in pcibios_default_alignment() ends up being a very non-obvious way to make this happen. We have to: - Know what the purpose of this is, and the current comment doesn't point to that. - Look at all the implementations of pcibios_default_alignment() (thanks, powerpc). - Trace up through pci_specified_resource_alignment(), which contains a bunch of code that is not relevant to this case and always just returns PAGE_SIZE. - Trace up again to pci_reassigndev_resource_alignment() to see where this finally applies to the resources we care about. The comment here about "check if specified PCI is target device" is actively misleading for the passthrough usage. I hate adding new kernel parameters, but I kind of think this would be easier if we added one that mentioned passthrough or guests and tested it directly in pci_reassigndev_resource_alignment(). This would also be a way to avoid the "Can't reassign resources to host bridge" warning that I think we're going to see all the time. > + return max(SZ_4K, PAGE_SIZE); > } > > /* > -- > 2.46.0 > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs 2024-08-08 21:53 ` Bjorn Helgaas @ 2024-08-14 13:55 ` Stewart Hildebrand 2024-08-14 22:05 ` Bjorn Helgaas 0 siblings, 1 reply; 21+ messages in thread From: Stewart Hildebrand @ 2024-08-14 13:55 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel Hi Bjorn, Thanks for the feedback! On 8/8/24 17:53, Bjorn Helgaas wrote: > On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote: >> In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE). >> >> Issues observed when small BARs are not sufficiently aligned are: >> >> 1. Devices to be passed through (to e.g. a Xen HVM guest) with small >> BARs require each memory BAR to be page aligned. Currently, the only way >> to guarantee this alignment from a user perspective is to fake the size >> of the BARs using the pci=resource_alignment= option. This is a bad user >> experience, and faking the BAR size is not always desirable. For >> example, pcitest is a tool that is useful for PCI passthrough validation >> with Xen, but pcitest fails with a fake BAR size. > > I guess this is the "money" patch for the main problem you're solving, > i.e., passthrough to a guest doesn't work as you want? Haha, yup! > Is it the case that if you have two BARs in the same page, a device > can't be passed through to a guest at all? If the conditions are just right, passing through such a device could maybe work, but in practice it's problematic and unlikely to work reliably across different configurations. Let me show example 1, from a real device that I'm working with. Scrubbed/partial output from lspci -vv, from the host's point of view: Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256] Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512] Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K] Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K] Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K] Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M] Capabilities: [b0] MSI-X: Enable- Count=2 Masked- Vector table: BAR=0 offset=00000080 PBA: BAR=0 offset=00000090 Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV) IOVCap: Migration-, Interrupt Message Number: 000 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ IOVSta: Migration- Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00 VF offset: 6, stride: 1, Device ID: 0100 Supported Page Size: 00000553, System Page Size: 00000001 Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable) VF Migration: offset: 00000000, BIR: 0 Kernel driver in use: pci-endpoint-test Kernel modules: pci_endpoint_test BARs 0, 1, and 2 are small, and the host firmware placed them on the same page. The host firmware did not page align the BARs. The hypervisor can only map full pages. The hypervisor cannot map partial pages. It cannot map a guest page offset from a host page where the offset is smaller than PAGE_SIZE. To pass this device through (physfn) as-is, the hypervisor would need to preserve the page offsets of each BAR and propagate them to the guest, taking translation into account. The guest (both firmware + OS) necessarily has to preserve the offsets as well. If the page offsets aren't preserved, the guest would be accessing the wrong data. We can't reliably predict what the guest behavior will be. SeaBIOS aligns BARs to 4k [1]. [1] https://review.coreboot.org/plugins/gitiles/seabios/+/refs/tags/rel-1.16.3/src/fw/pciinit.c#28 Xen's hvmloader does not align BARs to 4k. A patch was submitted to fix this, but it wasn't merged upstream [2]. [2] https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/ Arm guests don't usually have firmware to initialize BARs, so it's usually up to the OS (which may or may not be Linux). The point is that there is not a consistent BAR initialization strategy/convention in the ecosystem when it comes to small BARs. The host doesn't have a way to enforce the guest always map the small BARs at the required offsets. IMO the most sensible thing to do is not impose any sort of arbitrary page offset requirements on guests because it happened to suit the host. If the host were to use fake BAR sizes via the current pci=resource_alignment=... option, the fake BAR size would propagate to the guest (lying to the guest), pcitest would break, and the guest can't do anything about it. To avoid these problems, small BARs should be predictably page aligned in both host and guest. > Or is it just that all > devices with BARs that share a page have to be passed through to the > same guest, sort of like how lack of ACS can force several devices to > be in the same IOMMU isolation group? This case is much worse. If two devices have BARs sharing a page in a passthrough scenario, it's a security issue because guest can access data of another device. See XSA-461 / CVE-2024-31146 [3]. Aside: I was unaware that there was a XSA/CVE associated with this until after the embargo was lifted. [3] https://lore.kernel.org/xen-devel/E1seE0f-0001zO-Nj@xenbits.xenproject.org/ For completeness, see example 2: 01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe800000 (32-bit, non-prefetchable) [size=4K] I/O ports at c000 [size=256] Memory at 7050000000 (64-bit, prefetchable) [size=32] 01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe801000 (32-bit, non-prefetchable) [size=4K] I/O ports at c100 [size=256] Memory at 7050000020 (64-bit, prefetchable) [size=32] 01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe802000 (32-bit, non-prefetchable) [size=4K] I/O ports at c200 [size=256] Memory at 7050000040 (64-bit, prefetchable) [size=32] 01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device Subsystem: Red Hat, Inc. QEMU Virtual Machine Flags: fast devsel Memory at fe803000 (32-bit, non-prefetchable) [size=4K] I/O ports at c300 [size=256] Memory at 7040000000 (64-bit, prefetchable) [size=256M] This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack. Add this to your usual qemu-system-x86_64 args: -device pcie-pci-bridge,id=pcie.1 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=32 \ -device pci-testdev,bus=pcie.1,membar=268435456 Apply this SeaBIOS hack: diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c index b3e359d7..769007a4 100644 --- a/src/fw/pciinit.c +++ b/src/fw/pciinit.c @@ -25,7 +25,7 @@ #include "util.h" // pci_setup #include "x86.h" // outb -#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size +#define PCI_DEVICE_MEM_MIN (0) #define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size #define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec If you want to trigger the bridge window realloc (where BAR alignments currently get lost), also apply this hack to SeaBIOS in the same file: @@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr) pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT); } if (entry->type == PCI_REGION_TYPE_PREFMEM) { + limit = addr + PCI_BRIDGE_MEM_MIN - 1; pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT); pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT); pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32); > I think the subject should mention the problem to help motivate this. > > The fact that we address this by potentially reassigning every BAR of > every device, regardless of whether the admin even wants to pass > through a device to a guest, seems a bit aggressive to me. Patch [7/8] should limit the impact somewhat, but yes, it's quite aggressive... Perhaps such a change in default should be paired with the ability to turn it off via pci=realloc=off (or similar), and/or Kconfig. > Previously we haven't trusted our reassignment machinery enough to > enable it all the time, so we still have the "pci=realloc" parameter. > By default, I don't think we even move devices around to make space > for a BAR that we failed to allocate. One exception is SR-IOV device resources when CONFIG_PCI_REALLOC_ENABLE_AUTO=y. > I agree "pci=resource_alignment=" is a bit user-unfriendly, and I > don't think it solves the problem unless we apply it to every device > in the system. Right. >> 2. Devices with multiple small BARs could have the MSI-X tables located >> in one of its small BARs. This may lead to the MSI-X tables being mapped >> in the same 4k region as other data. The PCIe 6.1 specification (section >> 7.7.2 MSI-X Capability and Table Structure) says we probably should >> avoid that. > > If you're referring to this: > > If a Base Address Register or entry in the Enhanced Allocation > capability that maps address space for the MSI-X Table or MSI-X PBA > also maps other usable address space that is not associated with > MSI-X structures, locations (e.g., for CSRs) used in the other > address space must not share any naturally aligned 4-KB address > range with one where either MSI-X structure resides. This allows > system software where applicable to use different processor > attributes for MSI-X structures and the other address space. Yes, that's the correct reference. > I think this is technically a requirement about how space within a > single BAR should be organized, not about how multiple BARs should be > assigned. I don't think this really adds to the case for what you're > doing, so we could just drop it. I'm OK to drop the reference to the spec. For completeness, example 1 above was what led me to mention it: This device has the MSI-X tables located in BAR 0, which is mapped in the same 4k region as other data. >> To improve the user experience (i.e. don't require the user to specify >> pci=resource_alignment=), and increase conformance to PCIe spec, set the >> default minimum resource alignment of memory BARs to the greater of 4k >> or PAGE_SIZE. >> >> Quoting the comment in >> drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways >> we can increase the resource alignment: >> >> 1) Increase the size of the resource. BARs are aligned on their >> size, so when we reallocate space for this resource, we'll >> allocate it with the larger alignment. This also prevents >> assignment of any other BARs inside the alignment region, so >> if we're requesting page alignment, this means no other BARs >> will share the page. >> >> The disadvantage is that this makes the resource larger than >> the hardware BAR, which may break drivers that compute things >> based on the resource size, e.g., to find registers at a >> fixed offset before the end of the BAR. >> >> 2) Retain the resource size, but use IORESOURCE_STARTALIGN and >> set r->start to the desired alignment. By itself this >> doesn't prevent other BARs being put inside the alignment >> region, but if we realign *every* resource of every device in >> the system, none of them will share an alignment region. >> >> Changing pcibios_default_alignment() results in the second method of >> alignment with IORESOURCE_STARTALIGN. >> >> The new default alignment may be overridden by arches by implementing >> pcibios_default_alignment(), or by the user on a per-device basis with >> the pci=resource_alignment= option (although this reverts to using >> IORESOURCE_SIZEALIGN). >> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> --- >> Preparatory patches in this series are prerequisites to this patch. >> >> v2->v3: >> * new subject (was: "PCI: Align small (<4k) BARs") >> * clarify 4k vs PAGE_SIZE in commit message >> >> v1->v2: >> * capitalize subject text >> * s/4 * 1024/SZ_4K/ >> * #include <linux/sizes.h> >> * update commit message >> * use max(SZ_4K, PAGE_SIZE) for alignment value >> --- >> drivers/pci/pci.c | 8 +++++++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index af34407f2fb9..efdd5b85ea8c 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -31,6 +31,7 @@ >> #include <asm/dma.h> >> #include <linux/aer.h> >> #include <linux/bitfield.h> >> +#include <linux/sizes.h> >> #include "pci.h" >> >> DEFINE_MUTEX(pci_slot_mutex); >> @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev) >> >> resource_size_t __weak pcibios_default_alignment(void) >> { >> - return 0; >> + /* >> + * Avoid MSI-X tables being mapped in the same 4k region as other data >> + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability >> + * and Table Structure. >> + */ > > I think this is sort of a "spec compliance" comment that is not the > *real* reason we want to do this, i.e., it doesn't say that by doing > this we can pass through more devices to guests. > > Doing this in pcibios_default_alignment() ends up being a very > non-obvious way to make this happen. We have to: > > - Know what the purpose of this is, and the current comment doesn't > point to that. > > - Look at all the implementations of pcibios_default_alignment() > (thanks, powerpc). > > - Trace up through pci_specified_resource_alignment(), which > contains a bunch of code that is not relevant to this case and > always just returns PAGE_SIZE. > > - Trace up again to pci_reassigndev_resource_alignment() to see > where this finally applies to the resources we care about. The > comment here about "check if specified PCI is target device" is > actively misleading for the passthrough usage. > > I hate adding new kernel parameters, but I kind of think this would be > easier if we added one that mentioned passthrough or guests and tested > it directly in pci_reassigndev_resource_alignment(). > > This would also be a way to avoid the "Can't reassign resources to > host bridge" warning that I think we're going to see all the time. I did actually prepare a pci=resource_alignment=all patch, but I hesitated to send it because of the discussion at [4]. I'll send it with the next revision of the series. [4] https://lore.kernel.org/linux-pci/20160929115422.GA31048@localhost/ I'd like to also propose introducing a Kconfig option, e.g. CONFIG_PCI_PAGE_ALIGN_BARS, selectable by menuconfig or other usual means. >> + return max(SZ_4K, PAGE_SIZE); >> } >> >> /* >> -- >> 2.46.0 >> ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs 2024-08-14 13:55 ` Stewart Hildebrand @ 2024-08-14 22:05 ` Bjorn Helgaas 0 siblings, 0 replies; 21+ messages in thread From: Bjorn Helgaas @ 2024-08-14 22:05 UTC (permalink / raw) To: Stewart Hildebrand Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel, Alex Williamson, kvm [+cc Alex, kvm list since this topic is of general interest for passthrough; thread begins at https://lore.kernel.org/r/20240807151723.613742-1-stewart.hildebrand@amd.com] On Wed, Aug 14, 2024 at 09:55:26AM -0400, Stewart Hildebrand wrote: > On 8/8/24 17:53, Bjorn Helgaas wrote: > > On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote: > >> In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE). > >> > >> Issues observed when small BARs are not sufficiently aligned are: > >> > >> 1. Devices to be passed through (to e.g. a Xen HVM guest) with small > >> BARs require each memory BAR to be page aligned. Currently, the only way > >> to guarantee this alignment from a user perspective is to fake the size > >> of the BARs using the pci=resource_alignment= option. This is a bad user > >> experience, and faking the BAR size is not always desirable. For > >> example, pcitest is a tool that is useful for PCI passthrough validation > >> with Xen, but pcitest fails with a fake BAR size. > > > > I guess this is the "money" patch for the main problem you're solving, > > i.e., passthrough to a guest doesn't work as you want? > > Haha, yup! > > > Is it the case that if you have two BARs in the same page, a device > > can't be passed through to a guest at all? > > If the conditions are just right, passing through such a device could > maybe work, but in practice it's problematic and unlikely to work > reliably across different configurations. > > Let me show example 1, from a real device that I'm working with. > Scrubbed/partial output from lspci -vv, from the host's point of view: > > Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256] > Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512] > Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K] > Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K] > Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K] > Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M] > Capabilities: [b0] MSI-X: Enable- Count=2 Masked- > Vector table: BAR=0 offset=00000080 > PBA: BAR=0 offset=00000090 > Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV) > IOVCap: Migration-, Interrupt Message Number: 000 > IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+ > IOVSta: Migration- > Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00 > VF offset: 6, stride: 1, Device ID: 0100 > Supported Page Size: 00000553, System Page Size: 00000001 > Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable) > VF Migration: offset: 00000000, BIR: 0 > Kernel driver in use: pci-endpoint-test > Kernel modules: pci_endpoint_test > > BARs 0, 1, and 2 are small, and the host firmware placed them on the > same page. The host firmware did not page align the BARs. > > The hypervisor can only map full pages. The hypervisor cannot map > partial pages. It cannot map a guest page offset from a host page where > the offset is smaller than PAGE_SIZE. > > To pass this device through (physfn) as-is, the hypervisor would need to > preserve the page offsets of each BAR and propagate them to the guest, > taking translation into account. The guest (both firmware + OS) > necessarily has to preserve the offsets as well. If the page offsets > aren't preserved, the guest would be accessing the wrong data. > > We can't reliably predict what the guest behavior will be. > > SeaBIOS aligns BARs to 4k [1]. > > [1] https://review.coreboot.org/plugins/gitiles/seabios/+/refs/tags/rel-1.16.3/src/fw/pciinit.c#28 > > Xen's hvmloader does not align BARs to 4k. A patch was submitted to fix > this, but it wasn't merged upstream [2]. > > [2] https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/ > > Arm guests don't usually have firmware to initialize BARs, so it's > usually up to the OS (which may or may not be Linux). > > The point is that there is not a consistent BAR initialization > strategy/convention in the ecosystem when it comes to small BARs. > > The host doesn't have a way to enforce the guest always map the small > BARs at the required offsets. IMO the most sensible thing to do is not > impose any sort of arbitrary page offset requirements on guests because > it happened to suit the host. > > If the host were to use fake BAR sizes via the current > pci=resource_alignment=... option, the fake BAR size would propagate to > the guest (lying to the guest), pcitest would break, and the guest can't > do anything about it. > > To avoid these problems, small BARs should be predictably page aligned > in both host and guest. Right, all the above makes sense to me. I was fishing to see if vfio, etc., actually checked for two BARs in the same page and disallowed passthrough if that happened, but it doesn't sound like it. It sounds like things will just break, e.g., guest accesses data at incorrect offsets, etc. > > Or is it just that all > > devices with BARs that share a page have to be passed through to the > > same guest, sort of like how lack of ACS can force several devices to > > be in the same IOMMU isolation group? > > This case is much worse. If two devices have BARs sharing a page in a > passthrough scenario, it's a security issue because guest can access > data of another device. See XSA-461 / CVE-2024-31146 [3]. Aside: I was > unaware that there was a XSA/CVE associated with this until after the > embargo was lifted. Yes. IIUC, vfio etc *could* check for devices sharing a page and force them to be passed through together, which I suppose would mitigate this CVE. But they don't in fact check. > [3] https://lore.kernel.org/xen-devel/E1seE0f-0001zO-Nj@xenbits.xenproject.org/ > > For completeness, see example 2: > > 01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device > Subsystem: Red Hat, Inc. QEMU Virtual Machine > Flags: fast devsel > Memory at fe800000 (32-bit, non-prefetchable) [size=4K] > I/O ports at c000 [size=256] > Memory at 7050000000 (64-bit, prefetchable) [size=32] > > 01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device > Subsystem: Red Hat, Inc. QEMU Virtual Machine > Flags: fast devsel > Memory at fe801000 (32-bit, non-prefetchable) [size=4K] > I/O ports at c100 [size=256] > Memory at 7050000020 (64-bit, prefetchable) [size=32] > > 01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device > Subsystem: Red Hat, Inc. QEMU Virtual Machine > Flags: fast devsel > Memory at fe802000 (32-bit, non-prefetchable) [size=4K] > I/O ports at c200 [size=256] > Memory at 7050000040 (64-bit, prefetchable) [size=32] > > 01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device > Subsystem: Red Hat, Inc. QEMU Virtual Machine > Flags: fast devsel > Memory at fe803000 (32-bit, non-prefetchable) [size=4K] > I/O ports at c300 [size=256] > Memory at 7040000000 (64-bit, prefetchable) [size=256M] > > This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack. > Add this to your usual qemu-system-x86_64 args: > > -device pcie-pci-bridge,id=pcie.1 \ > -device pci-testdev,bus=pcie.1,membar=32 \ > -device pci-testdev,bus=pcie.1,membar=32 \ > -device pci-testdev,bus=pcie.1,membar=32 \ > -device pci-testdev,bus=pcie.1,membar=268435456 > > Apply this SeaBIOS hack: > > diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c > index b3e359d7..769007a4 100644 > --- a/src/fw/pciinit.c > +++ b/src/fw/pciinit.c > @@ -25,7 +25,7 @@ > #include "util.h" // pci_setup > #include "x86.h" // outb > > -#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size > +#define PCI_DEVICE_MEM_MIN (0) > #define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size > #define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec > > > If you want to trigger the bridge window realloc (where BAR alignments > currently get lost), also apply this hack to SeaBIOS in the same file: > > @@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr) > pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT); > } > if (entry->type == PCI_REGION_TYPE_PREFMEM) { > + limit = addr + PCI_BRIDGE_MEM_MIN - 1; > pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT); > pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT); > pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32); > > > I think the subject should mention the problem to help motivate this. > > > > The fact that we address this by potentially reassigning every BAR of > > every device, regardless of whether the admin even wants to pass > > through a device to a guest, seems a bit aggressive to me. > > Patch [7/8] should limit the impact somewhat, but yes, it's quite > aggressive... Perhaps such a change in default should be paired with the > ability to turn it off via pci=realloc=off (or similar), and/or Kconfig. I'm frankly pretty scared about reassigning every BAR by default. Maybe my fear is unfounded, but I suspect things will break. I'd be more comfortable if users had to specify a new option to get this behavior, because then they would be aware of what they changed and would be able to change back if it didn't work. Or maybe if we checked for page sharing and prevented passthrough in that case. > > Previously we haven't trusted our reassignment machinery enough to > > enable it all the time, so we still have the "pci=realloc" parameter. > > By default, I don't think we even move devices around to make space > > for a BAR that we failed to allocate. > > One exception is SR-IOV device resources when > CONFIG_PCI_REALLOC_ENABLE_AUTO=y. > > > I agree "pci=resource_alignment=" is a bit user-unfriendly, and I > > don't think it solves the problem unless we apply it to every device > > in the system. > > Right. > > >> 2. Devices with multiple small BARs could have the MSI-X tables located > >> in one of its small BARs. This may lead to the MSI-X tables being mapped > >> in the same 4k region as other data. The PCIe 6.1 specification (section > >> 7.7.2 MSI-X Capability and Table Structure) says we probably should > >> avoid that. > > > > If you're referring to this: > > > > If a Base Address Register or entry in the Enhanced Allocation > > capability that maps address space for the MSI-X Table or MSI-X PBA > > also maps other usable address space that is not associated with > > MSI-X structures, locations (e.g., for CSRs) used in the other > > address space must not share any naturally aligned 4-KB address > > range with one where either MSI-X structure resides. This allows > > system software where applicable to use different processor > > attributes for MSI-X structures and the other address space. > > Yes, that's the correct reference. > > > I think this is technically a requirement about how space within a > > single BAR should be organized, not about how multiple BARs should be > > assigned. I don't think this really adds to the case for what you're > > doing, so we could just drop it. > > I'm OK to drop the reference to the spec. For completeness, example 1 > above was what led me to mention it: This device has the MSI-X tables > located in BAR 0, which is mapped in the same 4k region as other data. > > >> To improve the user experience (i.e. don't require the user to specify > >> pci=resource_alignment=), and increase conformance to PCIe spec, set the > >> default minimum resource alignment of memory BARs to the greater of 4k > >> or PAGE_SIZE. > >> > >> Quoting the comment in > >> drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways > >> we can increase the resource alignment: > >> > >> 1) Increase the size of the resource. BARs are aligned on their > >> size, so when we reallocate space for this resource, we'll > >> allocate it with the larger alignment. This also prevents > >> assignment of any other BARs inside the alignment region, so > >> if we're requesting page alignment, this means no other BARs > >> will share the page. > >> > >> The disadvantage is that this makes the resource larger than > >> the hardware BAR, which may break drivers that compute things > >> based on the resource size, e.g., to find registers at a > >> fixed offset before the end of the BAR. > >> > >> 2) Retain the resource size, but use IORESOURCE_STARTALIGN and > >> set r->start to the desired alignment. By itself this > >> doesn't prevent other BARs being put inside the alignment > >> region, but if we realign *every* resource of every device in > >> the system, none of them will share an alignment region. > >> > >> Changing pcibios_default_alignment() results in the second method of > >> alignment with IORESOURCE_STARTALIGN. > >> > >> The new default alignment may be overridden by arches by implementing > >> pcibios_default_alignment(), or by the user on a per-device basis with > >> the pci=resource_alignment= option (although this reverts to using > >> IORESOURCE_SIZEALIGN). > >> > >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > >> --- > >> Preparatory patches in this series are prerequisites to this patch. > >> > >> v2->v3: > >> * new subject (was: "PCI: Align small (<4k) BARs") > >> * clarify 4k vs PAGE_SIZE in commit message > >> > >> v1->v2: > >> * capitalize subject text > >> * s/4 * 1024/SZ_4K/ > >> * #include <linux/sizes.h> > >> * update commit message > >> * use max(SZ_4K, PAGE_SIZE) for alignment value > >> --- > >> drivers/pci/pci.c | 8 +++++++- > >> 1 file changed, 7 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > >> index af34407f2fb9..efdd5b85ea8c 100644 > >> --- a/drivers/pci/pci.c > >> +++ b/drivers/pci/pci.c > >> @@ -31,6 +31,7 @@ > >> #include <asm/dma.h> > >> #include <linux/aer.h> > >> #include <linux/bitfield.h> > >> +#include <linux/sizes.h> > >> #include "pci.h" > >> > >> DEFINE_MUTEX(pci_slot_mutex); > >> @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev) > >> > >> resource_size_t __weak pcibios_default_alignment(void) > >> { > >> - return 0; > >> + /* > >> + * Avoid MSI-X tables being mapped in the same 4k region as other data > >> + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability > >> + * and Table Structure. > >> + */ > > > > I think this is sort of a "spec compliance" comment that is not the > > *real* reason we want to do this, i.e., it doesn't say that by doing > > this we can pass through more devices to guests. > > > > Doing this in pcibios_default_alignment() ends up being a very > > non-obvious way to make this happen. We have to: > > > > - Know what the purpose of this is, and the current comment doesn't > > point to that. > > > > - Look at all the implementations of pcibios_default_alignment() > > (thanks, powerpc). > > > > - Trace up through pci_specified_resource_alignment(), which > > contains a bunch of code that is not relevant to this case and > > always just returns PAGE_SIZE. > > > > - Trace up again to pci_reassigndev_resource_alignment() to see > > where this finally applies to the resources we care about. The > > comment here about "check if specified PCI is target device" is > > actively misleading for the passthrough usage. > > > > I hate adding new kernel parameters, but I kind of think this would be > > easier if we added one that mentioned passthrough or guests and tested > > it directly in pci_reassigndev_resource_alignment(). > > > > This would also be a way to avoid the "Can't reassign resources to > > host bridge" warning that I think we're going to see all the time. > > I did actually prepare a pci=resource_alignment=all patch, but I > hesitated to send it because of the discussion at [4]. I'll send it with > the next revision of the series. > > [4] https://lore.kernel.org/linux-pci/20160929115422.GA31048@localhost/ > > I'd like to also propose introducing a Kconfig option, e.g. > CONFIG_PCI_PAGE_ALIGN_BARS, selectable by menuconfig or other usual > means. > > >> + return max(SZ_4K, PAGE_SIZE); > >> } > >> > >> /* > >> -- > >> 2.46.0 > >> > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 0/8] PCI: Align small BARs 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand ` (7 preceding siblings ...) 2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand @ 2024-08-07 15:42 ` Arnd Bergmann 8 siblings, 0 replies; 21+ messages in thread From: Arnd Bergmann @ 2024-08-07 15:42 UTC (permalink / raw) To: Stewart Hildebrand, Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin, Michael Ellerman, Nicholas Piggin, Christophe Leroy, Naveen N. Rao, Thomas Zimmermann, Sam Ravnborg, Yongji Xie, Ilpo Järvinen, Philipp Stanner Cc: x86, linux-pci, linux-kernel, linuxppc-dev On Wed, Aug 7, 2024, at 17:17, Stewart Hildebrand wrote: > In this context, "small" is defined as max(SZ_4K, PAGE_SIZE). > > This series sets the default minimum resource alignment to > max(SZ_4K, PAGE_SIZE) for memory BARs. In preparation, it makes an > optimization and addresses some corner cases observed when reallocating > BARs. I consider the prepapatory patches to be prerequisites to changing > the default BAR alignment. It's probably worth noting that Linux does not support any architectures with software page sizes smaller than 4KB, and it would likely break a lot of assumptions, so max(SZ_4K, PAGE_SIZE) is really the same as PAGE_SIZE in practice. Arnd ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-08-14 22:05 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand 2024-08-08 8:55 ` Ilpo Järvinen 2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand 2024-08-08 19:28 ` Bjorn Helgaas 2024-08-08 20:28 ` Stewart Hildebrand 2024-08-08 21:54 ` Bjorn Helgaas 2024-08-14 18:12 ` Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand 2024-08-08 19:37 ` Bjorn Helgaas 2024-08-14 18:31 ` Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand 2024-08-08 20:10 ` Bjorn Helgaas 2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand 2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand 2024-08-08 21:53 ` Bjorn Helgaas 2024-08-14 13:55 ` Stewart Hildebrand 2024-08-14 22:05 ` Bjorn Helgaas 2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox