* [PATCH v3 0/8] PCI: Align small BARs
@ 2024-08-07 15:17 Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
` (8 more replies)
0 siblings, 9 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Naveen N. Rao, Thomas Zimmermann, Arnd Bergmann,
Sam Ravnborg, Yongji Xie, Ilpo Järvinen, Philipp Stanner
Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel, linuxppc-dev
In this context, "small" is defined as max(SZ_4K, PAGE_SIZE).
This series sets the default minimum resource alignment to
max(SZ_4K, PAGE_SIZE) for memory BARs. In preparation, it makes an
optimization and addresses some corner cases observed when reallocating
BARs. I consider the prepapatory patches to be prerequisites to changing
the default BAR alignment.
I considered introducing checks for the specific scenarios described,
but chose not to pursue this. A check such as "if (xen_domain())" may be
pretty simple, but that doesn't account for other hypervisors. If other
hypervisors are to be considered, or if we try to dynamically reallocate
BARs for devices being marked for passthrough, such a check may quickly
grow unwieldy. Further, checking for the MSI-X tables residing in a
small (<4k) BAR is unlikely to be a one-liner. Making 4k alignment the
default seems more robust. Lastly, when using IORESOURCE_STARTALIGN, all
resources in the system need to be aligned.
I considered alternatively adding new functionality to the
pci=resource_alignment= option, but that approach was already attempted
and decided against [1].
[1] https://lore.kernel.org/linux-pci/1473757234-5284-4-git-send-email-xyjxie@linux.vnet.ibm.com/
v2->v3:
* clarify 4k vs PAGE_SIZE
* rename ("x86/PCI: Move some logic to new function") to
("x86/PCI: Improve code readability")
* rename ("PCI: Align small (<4k) BARs") to
("PCI: Align small BARs")
v1->v2:
* rename ("PCI: don't clear already cleared bit") to
("PCI: Don't unnecessarily disable memory decoding")
* new patch: ("x86/PCI: Move some logic to new function")
* new patch: ("powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment")
Stewart Hildebrand (8):
x86/PCI: Improve code readability
PCI: Don't unnecessarily disable memory decoding
PCI: Restore resource alignment
PCI: Restore memory decoding after reallocation
x86/PCI: Preserve IORESOURCE_STARTALIGN alignment
powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment
PCI: Don't reassign resources that are already aligned
PCI: Align small BARs
arch/powerpc/kernel/pci-common.c | 6 +++--
arch/x86/pci/i386.c | 38 +++++++++++++++------------
drivers/pci/pci.c | 43 +++++++++++++++++++++++--------
drivers/pci/setup-bus.c | 44 ++++++++++++++++++++++++++++++++
include/linux/pci.h | 2 ++
5 files changed, 103 insertions(+), 30 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v3 1/8] x86/PCI: Improve code readability
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-08 8:55 ` Ilpo Järvinen
2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand
` (7 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Ilpo Järvinen, Philipp Stanner
Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel
The indentation in pcibios_allocate_dev_resources() is unusually deep.
Improve that by moving some of its code to a new function,
alloc_resource().
Take the opportunity to remove redundant information from dev_dbg().
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* new subject (was: "x86/PCI: Move some logic to new function")
* reword commit message (thanks Philipp)
v1->v2:
* new patch
---
arch/x86/pci/i386.c | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index f2f4a5d50b27..3abd55902dbc 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -246,6 +246,25 @@ struct pci_check_idx_range {
int end;
};
+static void alloc_resource(struct pci_dev *dev, int idx, int pass)
+{
+ struct resource *r = &dev->resource[idx];
+
+ dev_dbg(&dev->dev, "BAR %d: reserving %pr (p=%d)\n", idx, r, pass);
+
+ if (pci_claim_resource(dev, idx) < 0) {
+ if (r->flags & IORESOURCE_PCI_FIXED) {
+ dev_info(&dev->dev, "BAR %d %pR is immovable\n",
+ idx, r);
+ } else {
+ /* We'll assign a new address later */
+ pcibios_save_fw_addr(dev, idx, r->start);
+ r->end -= r->start;
+ r->start = 0;
+ }
+ }
+}
+
static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass)
{
int idx, disabled, i;
@@ -271,23 +290,8 @@ static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass)
disabled = !(command & PCI_COMMAND_IO);
else
disabled = !(command & PCI_COMMAND_MEMORY);
- if (pass == disabled) {
- dev_dbg(&dev->dev,
- "BAR %d: reserving %pr (d=%d, p=%d)\n",
- idx, r, disabled, pass);
- if (pci_claim_resource(dev, idx) < 0) {
- if (r->flags & IORESOURCE_PCI_FIXED) {
- dev_info(&dev->dev, "BAR %d %pR is immovable\n",
- idx, r);
- } else {
- /* We'll assign a new address later */
- pcibios_save_fw_addr(dev,
- idx, r->start);
- r->end -= r->start;
- r->start = 0;
- }
- }
- }
+ if (pass == disabled)
+ alloc_resource(dev, idx, pass);
}
if (!pass) {
r = &dev->resource[PCI_ROM_RESOURCE];
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand
` (6 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel
If alignment is requested for a device and all the BARs are already
sufficiently aligned, there is no need to disable memory decoding for
the device. Add a check for this scenario to save a PCI config space
access.
Also, there is no need to disable memory decoding if already disabled,
since the bit in question (PCI_COMMAND_MEMORY) is a RW bit. Make the
write conditional to save a PCI config space access.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* new subject (was: "PCI: don't clear already cleared bit")
* don't disable memory decoding if alignment is not needed
---
drivers/pci/pci.c | 31 +++++++++++++++++++++----------
1 file changed, 21 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e3a49f66982d..acecdd6edd5a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6564,7 +6564,7 @@ static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
return align;
}
-static void pci_request_resource_alignment(struct pci_dev *dev, int bar,
+static bool pci_request_resource_alignment(struct pci_dev *dev, int bar,
resource_size_t align, bool resize)
{
struct resource *r = &dev->resource[bar];
@@ -6572,17 +6572,17 @@ static void pci_request_resource_alignment(struct pci_dev *dev, int bar,
resource_size_t size;
if (!(r->flags & IORESOURCE_MEM))
- return;
+ return false;
if (r->flags & IORESOURCE_PCI_FIXED) {
pci_info(dev, "%s %pR: ignoring requested alignment %#llx\n",
r_name, r, (unsigned long long)align);
- return;
+ return false;
}
size = resource_size(r);
if (size >= align)
- return;
+ return false;
/*
* Increase the alignment of the resource. There are two ways we
@@ -6625,6 +6625,8 @@ static void pci_request_resource_alignment(struct pci_dev *dev, int bar,
r->end = r->start + size - 1;
}
r->flags |= IORESOURCE_UNSET;
+
+ return true;
}
/*
@@ -6640,7 +6642,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
struct resource *r;
resource_size_t align;
u16 command;
- bool resize = false;
+ bool resize = false, align_needed = false;
/*
* VF BARs are read-only zero according to SR-IOV spec r1.1, sec
@@ -6662,12 +6664,21 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
return;
}
- pci_read_config_word(dev, PCI_COMMAND, &command);
- command &= ~PCI_COMMAND_MEMORY;
- pci_write_config_word(dev, PCI_COMMAND, command);
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+ bool ret;
- for (i = 0; i <= PCI_ROM_RESOURCE; i++)
- pci_request_resource_alignment(dev, i, align, resize);
+ ret = pci_request_resource_alignment(dev, i, align, resize);
+ align_needed = align_needed || ret;
+ }
+
+ if (!align_needed)
+ return;
+
+ pci_read_config_word(dev, PCI_COMMAND, &command);
+ if (command & PCI_COMMAND_MEMORY) {
+ command &= ~PCI_COMMAND_MEMORY;
+ pci_write_config_word(dev, PCI_COMMAND, command);
+ }
/*
* Need to disable bridge's resource window,
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 3/8] PCI: Restore resource alignment
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-08 19:28 ` Bjorn Helgaas
2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand
` (5 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel
Devices with alignment specified will lose their alignment in cases when
the bridge resources have been released, e.g. due to insufficient bridge
window size. Restore the alignment.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* capitalize subject text
---
drivers/pci/setup-bus.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 23082bc0ca37..ab7510ce6917 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
}
}
+static void restore_child_resource_alignment(struct pci_bus *bus)
+{
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ struct pci_bus *b;
+
+ pci_reassigndev_resource_alignment(dev);
+
+ b = dev->subordinate;
+ if (!b)
+ continue;
+
+ restore_child_resource_alignment(b);
+ }
+}
+
#define PCI_RES_TYPE_MASK \
(IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
IORESOURCE_MEM_64)
@@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
r->start = 0;
r->flags = 0;
+ restore_child_resource_alignment(bus);
+
/* Avoiding touch the one without PREF */
if (type & IORESOURCE_PREFETCH)
type = IORESOURCE_PREFETCH;
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 4/8] PCI: Restore memory decoding after reallocation
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (2 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-08 19:37 ` Bjorn Helgaas
2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand
` (4 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel
Currently, the PCI subsystem unconditionally clears the memory decoding
bit of devices with resource alignment specified. While drivers should
call pci_enable_device() to enable memory decoding, some platforms have
devices that expect memory decoding to be enabled even when no driver is
bound to the device. It is assumed firmware enables memory decoding in
these cases. For example, the vgacon driver uses the 0xb8000 buffer to
display a console without any PCI involvement, yet, on some platforms,
memory decoding must be enabled on the PCI VGA device in order for the
console to be displayed properly.
Restore the memory decoding bit after the resource has been reallocated.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* capitalize subject text
* reword commit message
Testing notes / how to check if your platform needs memory decoding
enabled in order to use vgacon:
1. Boot your system with a monitor attached, without any driver bound to
your PCI VGA device. You should see a console on your display.
2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this
example).
3. Disable memory decoding (replace 00:01.0 with your SBDF):
$ sudo setpci -v -s 00:01.0 0x04.W=0x05
4. Type something to see if it appears on the console display
5. Re-enable memory decoding:
$ sudo setpci -v -s 00:01.0 0x04.W=0x07
Relevant prior discussion at [1]
[1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/
---
drivers/pci/pci.c | 1 +
drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++
include/linux/pci.h | 2 ++
3 files changed, 28 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index acecdd6edd5a..4b97d8d5c2d8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
pci_read_config_word(dev, PCI_COMMAND, &command);
if (command & PCI_COMMAND_MEMORY) {
+ dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE;
command &= ~PCI_COMMAND_MEMORY;
pci_write_config_word(dev, PCI_COMMAND, command);
}
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ab7510ce6917..6847b251e19a 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus,
}
}
+static void restore_memory_decoding(struct pci_bus *bus)
+{
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ struct pci_bus *b;
+
+ if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) {
+ u16 command;
+
+ pci_read_config_word(dev, PCI_COMMAND, &command);
+ command |= PCI_COMMAND_MEMORY;
+ pci_write_config_word(dev, PCI_COMMAND, command);
+ }
+
+ b = dev->subordinate;
+ if (!b)
+ continue;
+
+ restore_memory_decoding(b);
+ }
+}
+
/*
* First try will not touch PCI bridge res.
* Second and later try will clear small leaf bridge res.
@@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
goto again;
dump:
+ restore_memory_decoding(bus);
+
/* Dump the resource on buses */
pci_bus_dump_resources(bus);
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4246cb790c7b..74636acf152f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -245,6 +245,8 @@ enum pci_dev_flags {
PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
/* Device does honor MSI masking despite saying otherwise */
PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
+ /* Firmware enabled memory decoding, to be restored if BAR is updated */
+ PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13),
};
enum pci_irq_reroute_variant {
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (3 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-08 20:10 ` Bjorn Helgaas
2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand
` (3 subsequent siblings)
8 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Yongji Xie, Ilpo Järvinen
Cc: Stewart Hildebrand, x86, linux-pci, linux-kernel
There is a corner case in pcibios_allocate_dev_resources() where the
IORESOURCE_STARTALIGN alignment of memory BAR resources gets
overwritten. This does not affect bridge resources. The corner case is
not yet possible to trigger on x86, but it will be possible once the
default resource alignment changes, and memory BAR resources will start
to use IORESOURCE_STARTALIGN. Account for IORESOURCE_STARTALIGN in
preparation for changing the default resource alignment.
Skip the pcibios_save_fw_addr() call since the resource doesn't contain
a valid address when alignment has been requested. The impact of this is
that we won't be able to restore the firmware allocated BAR, which does
not meet alignment requirements anyway.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* capitalize subject text
* clarify commit message
* skip pcibios_save_fw_addr() call
---
arch/x86/pci/i386.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 3abd55902dbc..13d7f7ac3bde 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -256,7 +256,7 @@ static void alloc_resource(struct pci_dev *dev, int idx, int pass)
if (r->flags & IORESOURCE_PCI_FIXED) {
dev_info(&dev->dev, "BAR %d %pR is immovable\n",
idx, r);
- } else {
+ } else if (!(r->flags & IORESOURCE_STARTALIGN)) {
/* We'll assign a new address later */
pcibios_save_fw_addr(dev, idx, r->start);
r->end -= r->start;
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 6/8] powerpc/pci: Preserve IORESOURCE_STARTALIGN alignment
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (4 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand
` (2 subsequent siblings)
8 siblings, 0 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Naveen N. Rao, Thomas Zimmermann, Arnd Bergmann,
Sam Ravnborg
Cc: Stewart Hildebrand, linux-pci, linux-kernel, linuxppc-dev
There is a corner case in pcibios_allocate_resources()/alloc_resource()
where the IORESOURCE_STARTALIGN alignment of memory BAR resources gets
overwritten. This does not affect bridge resources. Account for
IORESOURCE_STARTALIGN.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* new patch
---
arch/powerpc/kernel/pci-common.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index eac84d687b53..3f346ad641e0 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1303,8 +1303,10 @@ static inline void alloc_resource(struct pci_dev *dev, int idx)
pr_debug("PCI: parent is %p: %pR\n", pr, pr);
/* We'll assign a new address later */
r->flags |= IORESOURCE_UNSET;
- r->end -= r->start;
- r->start = 0;
+ if (!(r->flags & IORESOURCE_STARTALIGN)) {
+ r->end -= r->start;
+ r->start = 0;
+ }
}
}
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (5 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand
2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann
8 siblings, 0 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Stewart Hildebrand, linux-pci, linux-kernel
As a follow-up to commit 0dde1c08d1b9 ("PCI: Don't reassign resources
that are already aligned"), don't reassign already aligned resources
that have requested alignment without resizing.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* no change
v1->v2:
* capitalize subject text
---
drivers/pci/pci.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4b97d8d5c2d8..af34407f2fb9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6584,6 +6584,9 @@ static bool pci_request_resource_alignment(struct pci_dev *dev, int bar,
if (size >= align)
return false;
+ if (!resize && (r->start > align) && !(r->start & (align - 1)))
+ return false;
+
/*
* Increase the alignment of the resource. There are two ways we
* can do this:
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v3 8/8] PCI: Align small BARs
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (6 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand
@ 2024-08-07 15:17 ` Stewart Hildebrand
2024-08-08 21:53 ` Bjorn Helgaas
2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann
8 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-07 15:17 UTC (permalink / raw)
To: Bjorn Helgaas, Ilpo Järvinen
Cc: Stewart Hildebrand, linux-pci, linux-kernel
In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE).
Issues observed when small BARs are not sufficiently aligned are:
1. Devices to be passed through (to e.g. a Xen HVM guest) with small
BARs require each memory BAR to be page aligned. Currently, the only way
to guarantee this alignment from a user perspective is to fake the size
of the BARs using the pci=resource_alignment= option. This is a bad user
experience, and faking the BAR size is not always desirable. For
example, pcitest is a tool that is useful for PCI passthrough validation
with Xen, but pcitest fails with a fake BAR size.
2. Devices with multiple small BARs could have the MSI-X tables located
in one of its small BARs. This may lead to the MSI-X tables being mapped
in the same 4k region as other data. The PCIe 6.1 specification (section
7.7.2 MSI-X Capability and Table Structure) says we probably should
avoid that.
To improve the user experience (i.e. don't require the user to specify
pci=resource_alignment=), and increase conformance to PCIe spec, set the
default minimum resource alignment of memory BARs to the greater of 4k
or PAGE_SIZE.
Quoting the comment in
drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways
we can increase the resource alignment:
1) Increase the size of the resource. BARs are aligned on their
size, so when we reallocate space for this resource, we'll
allocate it with the larger alignment. This also prevents
assignment of any other BARs inside the alignment region, so
if we're requesting page alignment, this means no other BARs
will share the page.
The disadvantage is that this makes the resource larger than
the hardware BAR, which may break drivers that compute things
based on the resource size, e.g., to find registers at a
fixed offset before the end of the BAR.
2) Retain the resource size, but use IORESOURCE_STARTALIGN and
set r->start to the desired alignment. By itself this
doesn't prevent other BARs being put inside the alignment
region, but if we realign *every* resource of every device in
the system, none of them will share an alignment region.
Changing pcibios_default_alignment() results in the second method of
alignment with IORESOURCE_STARTALIGN.
The new default alignment may be overridden by arches by implementing
pcibios_default_alignment(), or by the user on a per-device basis with
the pci=resource_alignment= option (although this reverts to using
IORESOURCE_SIZEALIGN).
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Preparatory patches in this series are prerequisites to this patch.
v2->v3:
* new subject (was: "PCI: Align small (<4k) BARs")
* clarify 4k vs PAGE_SIZE in commit message
v1->v2:
* capitalize subject text
* s/4 * 1024/SZ_4K/
* #include <linux/sizes.h>
* update commit message
* use max(SZ_4K, PAGE_SIZE) for alignment value
---
drivers/pci/pci.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index af34407f2fb9..efdd5b85ea8c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -31,6 +31,7 @@
#include <asm/dma.h>
#include <linux/aer.h>
#include <linux/bitfield.h>
+#include <linux/sizes.h>
#include "pci.h"
DEFINE_MUTEX(pci_slot_mutex);
@@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev)
resource_size_t __weak pcibios_default_alignment(void)
{
- return 0;
+ /*
+ * Avoid MSI-X tables being mapped in the same 4k region as other data
+ * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability
+ * and Table Structure.
+ */
+ return max(SZ_4K, PAGE_SIZE);
}
/*
--
2.46.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 0/8] PCI: Align small BARs
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
` (7 preceding siblings ...)
2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand
@ 2024-08-07 15:42 ` Arnd Bergmann
8 siblings, 0 replies; 21+ messages in thread
From: Arnd Bergmann @ 2024-08-07 15:42 UTC (permalink / raw)
To: Stewart Hildebrand, Bjorn Helgaas, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Michael Ellerman,
Nicholas Piggin, Christophe Leroy, Naveen N. Rao,
Thomas Zimmermann, Sam Ravnborg, Yongji Xie, Ilpo Järvinen,
Philipp Stanner
Cc: x86, linux-pci, linux-kernel, linuxppc-dev
On Wed, Aug 7, 2024, at 17:17, Stewart Hildebrand wrote:
> In this context, "small" is defined as max(SZ_4K, PAGE_SIZE).
>
> This series sets the default minimum resource alignment to
> max(SZ_4K, PAGE_SIZE) for memory BARs. In preparation, it makes an
> optimization and addresses some corner cases observed when reallocating
> BARs. I consider the prepapatory patches to be prerequisites to changing
> the default BAR alignment.
It's probably worth noting that Linux does not support any
architectures with software page sizes smaller than 4KB,
and it would likely break a lot of assumptions, so
max(SZ_4K, PAGE_SIZE) is really the same as PAGE_SIZE
in practice.
Arnd
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 1/8] x86/PCI: Improve code readability
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
@ 2024-08-08 8:55 ` Ilpo Järvinen
0 siblings, 0 replies; 21+ messages in thread
From: Ilpo Järvinen @ 2024-08-08 8:55 UTC (permalink / raw)
To: Stewart Hildebrand
Cc: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Philipp Stanner, x86, linux-pci,
LKML
[-- Attachment #1: Type: text/plain, Size: 2507 bytes --]
On Wed, 7 Aug 2024, Stewart Hildebrand wrote:
> The indentation in pcibios_allocate_dev_resources() is unusually deep.
> Improve that by moving some of its code to a new function,
> alloc_resource().
>
> Take the opportunity to remove redundant information from dev_dbg().
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> v2->v3:
> * new subject (was: "x86/PCI: Move some logic to new function")
> * reword commit message (thanks Philipp)
>
> v1->v2:
> * new patch
> ---
> arch/x86/pci/i386.c | 38 +++++++++++++++++++++-----------------
> 1 file changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index f2f4a5d50b27..3abd55902dbc 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -246,6 +246,25 @@ struct pci_check_idx_range {
> int end;
> };
>
> +static void alloc_resource(struct pci_dev *dev, int idx, int pass)
> +{
> + struct resource *r = &dev->resource[idx];
> +
> + dev_dbg(&dev->dev, "BAR %d: reserving %pr (p=%d)\n", idx, r, pass);
> +
> + if (pci_claim_resource(dev, idx) < 0) {
> + if (r->flags & IORESOURCE_PCI_FIXED) {
> + dev_info(&dev->dev, "BAR %d %pR is immovable\n",
> + idx, r);
> + } else {
> + /* We'll assign a new address later */
> + pcibios_save_fw_addr(dev, idx, r->start);
> + r->end -= r->start;
> + r->start = 0;
> + }
> + }
> +}
> +
> static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass)
> {
> int idx, disabled, i;
> @@ -271,23 +290,8 @@ static void pcibios_allocate_dev_resources(struct pci_dev *dev, int pass)
> disabled = !(command & PCI_COMMAND_IO);
> else
> disabled = !(command & PCI_COMMAND_MEMORY);
> - if (pass == disabled) {
> - dev_dbg(&dev->dev,
> - "BAR %d: reserving %pr (d=%d, p=%d)\n",
> - idx, r, disabled, pass);
> - if (pci_claim_resource(dev, idx) < 0) {
> - if (r->flags & IORESOURCE_PCI_FIXED) {
> - dev_info(&dev->dev, "BAR %d %pR is immovable\n",
> - idx, r);
> - } else {
> - /* We'll assign a new address later */
> - pcibios_save_fw_addr(dev,
> - idx, r->start);
> - r->end -= r->start;
> - r->start = 0;
> - }
> - }
> - }
> + if (pass == disabled)
> + alloc_resource(dev, idx, pass);
> }
> if (!pass) {
> r = &dev->resource[PCI_ROM_RESOURCE];
>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
--
i.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment
2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand
@ 2024-08-08 19:28 ` Bjorn Helgaas
2024-08-08 20:28 ` Stewart Hildebrand
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-08 19:28 UTC (permalink / raw)
To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote:
> Devices with alignment specified will lose their alignment in cases when
> the bridge resources have been released, e.g. due to insufficient bridge
> window size. Restore the alignment.
I guess this fixes a problem when the user has specified
"pci=resource_alignment=..." and we've decided to release and
reallocate a bridge window? Just looking for a bit more concrete
description of what this problem would look like to a user.
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> v2->v3:
> * no change
>
> v1->v2:
> * capitalize subject text
> ---
> drivers/pci/setup-bus.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 23082bc0ca37..ab7510ce6917 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
> }
> }
>
> +static void restore_child_resource_alignment(struct pci_bus *bus)
> +{
> + struct pci_dev *dev;
> +
> + list_for_each_entry(dev, &bus->devices, bus_list) {
> + struct pci_bus *b;
> +
> + pci_reassigndev_resource_alignment(dev);
> +
> + b = dev->subordinate;
> + if (!b)
> + continue;
> +
> + restore_child_resource_alignment(b);
> + }
> +}
> +
> #define PCI_RES_TYPE_MASK \
> (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
> IORESOURCE_MEM_64)
> @@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
> r->start = 0;
> r->flags = 0;
>
> + restore_child_resource_alignment(bus);
> +
> /* Avoiding touch the one without PREF */
> if (type & IORESOURCE_PREFETCH)
> type = IORESOURCE_PREFETCH;
> --
> 2.46.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 4/8] PCI: Restore memory decoding after reallocation
2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand
@ 2024-08-08 19:37 ` Bjorn Helgaas
2024-08-14 18:31 ` Stewart Hildebrand
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-08 19:37 UTC (permalink / raw)
To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On Wed, Aug 07, 2024 at 11:17:13AM -0400, Stewart Hildebrand wrote:
> Currently, the PCI subsystem unconditionally clears the memory decoding
> bit of devices with resource alignment specified. While drivers should
> call pci_enable_device() to enable memory decoding, some platforms have
> devices that expect memory decoding to be enabled even when no driver is
> bound to the device. It is assumed firmware enables memory decoding in
> these cases. For example, the vgacon driver uses the 0xb8000 buffer to
> display a console without any PCI involvement, yet, on some platforms,
> memory decoding must be enabled on the PCI VGA device in order for the
> console to be displayed properly.
>
> Restore the memory decoding bit after the resource has been reallocated.
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> v2->v3:
> * no change
>
> v1->v2:
> * capitalize subject text
> * reword commit message
>
> Testing notes / how to check if your platform needs memory decoding
> enabled in order to use vgacon:
> 1. Boot your system with a monitor attached, without any driver bound to
> your PCI VGA device. You should see a console on your display.
> 2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this
> example).
> 3. Disable memory decoding (replace 00:01.0 with your SBDF):
> $ sudo setpci -v -s 00:01.0 0x04.W=0x05
> 4. Type something to see if it appears on the console display
> 5. Re-enable memory decoding:
> $ sudo setpci -v -s 00:01.0 0x04.W=0x07
>
> Relevant prior discussion at [1]
>
> [1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/
> ---
> drivers/pci/pci.c | 1 +
> drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++
> include/linux/pci.h | 2 ++
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index acecdd6edd5a..4b97d8d5c2d8 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
>
> pci_read_config_word(dev, PCI_COMMAND, &command);
> if (command & PCI_COMMAND_MEMORY) {
> + dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE;
> command &= ~PCI_COMMAND_MEMORY;
> pci_write_config_word(dev, PCI_COMMAND, command);
> }
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index ab7510ce6917..6847b251e19a 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus,
> }
> }
>
> +static void restore_memory_decoding(struct pci_bus *bus)
> +{
> + struct pci_dev *dev;
> +
> + list_for_each_entry(dev, &bus->devices, bus_list) {
> + struct pci_bus *b;
> +
> + if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) {
> + u16 command;
> +
> + pci_read_config_word(dev, PCI_COMMAND, &command);
> + command |= PCI_COMMAND_MEMORY;
> + pci_write_config_word(dev, PCI_COMMAND, command);
> + }
> +
> + b = dev->subordinate;
> + if (!b)
> + continue;
> +
> + restore_memory_decoding(b);
> + }
> +}
> +
> /*
> * First try will not touch PCI bridge res.
> * Second and later try will clear small leaf bridge res.
> @@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
> goto again;
>
> dump:
> + restore_memory_decoding(bus);
The fact that we traverse the whole hierarchy here to restore
PCI_COMMAND_MEMORY makes me suspect there's a window between point A
(where we clear PCI_COMMAND_MEMORY and update a BAR) and point B
(where we restore PCI_COMMAND_MEMORY) where VGA console output will
not work.
This tickles my memory like we might have talked about this before and
there's some reason for having to wait. If so, sorry, and maybe we
need a comment in the code about that reason.
> /* Dump the resource on buses */
> pci_bus_dump_resources(bus);
> }
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 4246cb790c7b..74636acf152f 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -245,6 +245,8 @@ enum pci_dev_flags {
> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> /* Device does honor MSI masking despite saying otherwise */
> PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
> + /* Firmware enabled memory decoding, to be restored if BAR is updated */
> + PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13),
> };
>
> enum pci_irq_reroute_variant {
> --
> 2.46.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment
2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand
@ 2024-08-08 20:10 ` Bjorn Helgaas
0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-08 20:10 UTC (permalink / raw)
To: Stewart Hildebrand
Cc: Bjorn Helgaas, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Yongji Xie, Ilpo Järvinen, x86,
linux-pci, linux-kernel
On Wed, Aug 07, 2024 at 11:17:14AM -0400, Stewart Hildebrand wrote:
> There is a corner case in pcibios_allocate_dev_resources() where the
> IORESOURCE_STARTALIGN alignment of memory BAR resources gets
> overwritten. This does not affect bridge resources. The corner case is
> not yet possible to trigger on x86, but it will be possible once the
> default resource alignment changes, and memory BAR resources will start
> to use IORESOURCE_STARTALIGN.
I see from [8/8] that "Changing pcibios_default_alignment() results in
the second method of alignment with IORESOURCE_STARTALIGN", but that
connection is not at all obvious, and there's no patch in this series
that sets IORESOURCE_STARTALIGN, so it's kind of hard to connect all
the dots here.
The only caller of pcibios_default_alignment() is
pci_specified_resource_alignment(), and that doesn't mention
IORESOURCE_STARTALIGN either.
Neither does pcibios_allocate_dev_resources().
I feel like we've had this conversation before; apologies if so. But
we need to figure out to make this more explicit and less magic.
> Account for IORESOURCE_STARTALIGN in
> preparation for changing the default resource alignment.
>
> Skip the pcibios_save_fw_addr() call since the resource doesn't contain
> a valid address when alignment has been requested. The impact of this is
> that we won't be able to restore the firmware allocated BAR, which does
> not meet alignment requirements anyway.
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> v2->v3:
> * no change
>
> v1->v2:
> * capitalize subject text
> * clarify commit message
> * skip pcibios_save_fw_addr() call
> ---
> arch/x86/pci/i386.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index 3abd55902dbc..13d7f7ac3bde 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -256,7 +256,7 @@ static void alloc_resource(struct pci_dev *dev, int idx, int pass)
> if (r->flags & IORESOURCE_PCI_FIXED) {
> dev_info(&dev->dev, "BAR %d %pR is immovable\n",
> idx, r);
> - } else {
> + } else if (!(r->flags & IORESOURCE_STARTALIGN)) {
> /* We'll assign a new address later */
> pcibios_save_fw_addr(dev, idx, r->start);
> r->end -= r->start;
> --
> 2.46.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment
2024-08-08 19:28 ` Bjorn Helgaas
@ 2024-08-08 20:28 ` Stewart Hildebrand
2024-08-08 21:54 ` Bjorn Helgaas
0 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-08 20:28 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On 8/8/24 15:28, Bjorn Helgaas wrote:
> On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote:
>> Devices with alignment specified will lose their alignment in cases when
>> the bridge resources have been released, e.g. due to insufficient bridge
>> window size. Restore the alignment.
>
> I guess this fixes a problem when the user has specified
> "pci=resource_alignment=..." and we've decided to release and
> reallocate a bridge window? Just looking for a bit more concrete
> description of what this problem would look like to a user.
Yes. When alignment has been specified via pcibios_default_alignment()
or by the user with "pci=resource_alignment=...", and the bridge window
is being reallocated, the specified alignment is lost and the resource
may not be sufficiently aligned after reallocation.
I can expand the commit description.
>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> v2->v3:
>> * no change
>>
>> v1->v2:
>> * capitalize subject text
>> ---
>> drivers/pci/setup-bus.c | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index 23082bc0ca37..ab7510ce6917 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -1594,6 +1594,23 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
>> }
>> }
>>
>> +static void restore_child_resource_alignment(struct pci_bus *bus)
>> +{
>> + struct pci_dev *dev;
>> +
>> + list_for_each_entry(dev, &bus->devices, bus_list) {
>> + struct pci_bus *b;
>> +
>> + pci_reassigndev_resource_alignment(dev);
>> +
>> + b = dev->subordinate;
>> + if (!b)
>> + continue;
>> +
>> + restore_child_resource_alignment(b);
>> + }
>> +}
>> +
>> #define PCI_RES_TYPE_MASK \
>> (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
>> IORESOURCE_MEM_64)
>> @@ -1648,6 +1665,8 @@ static void pci_bridge_release_resources(struct pci_bus *bus,
>> r->start = 0;
>> r->flags = 0;
>>
>> + restore_child_resource_alignment(bus);
>> +
>> /* Avoiding touch the one without PREF */
>> if (type & IORESOURCE_PREFETCH)
>> type = IORESOURCE_PREFETCH;
>> --
>> 2.46.0
>>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs
2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand
@ 2024-08-08 21:53 ` Bjorn Helgaas
2024-08-14 13:55 ` Stewart Hildebrand
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-08 21:53 UTC (permalink / raw)
To: Stewart Hildebrand
Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel
On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote:
> In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE).
>
> Issues observed when small BARs are not sufficiently aligned are:
>
> 1. Devices to be passed through (to e.g. a Xen HVM guest) with small
> BARs require each memory BAR to be page aligned. Currently, the only way
> to guarantee this alignment from a user perspective is to fake the size
> of the BARs using the pci=resource_alignment= option. This is a bad user
> experience, and faking the BAR size is not always desirable. For
> example, pcitest is a tool that is useful for PCI passthrough validation
> with Xen, but pcitest fails with a fake BAR size.
I guess this is the "money" patch for the main problem you're solving,
i.e., passthrough to a guest doesn't work as you want?
Is it the case that if you have two BARs in the same page, a device
can't be passed through to a guest at all? Or is it just that all
devices with BARs that share a page have to be passed through to the
same guest, sort of like how lack of ACS can force several devices to
be in the same IOMMU isolation group?
I think the subject should mention the problem to help motivate this.
The fact that we address this by potentially reassigning every BAR of
every device, regardless of whether the admin even wants to pass
through a device to a guest, seems a bit aggressive to me.
Previously we haven't trusted our reassignment machinery enough to
enable it all the time, so we still have the "pci=realloc" parameter.
By default, I don't think we even move devices around to make space
for a BAR that we failed to allocate.
I agree "pci=resource_alignment=" is a bit user-unfriendly, and I
don't think it solves the problem unless we apply it to every device
in the system.
> 2. Devices with multiple small BARs could have the MSI-X tables located
> in one of its small BARs. This may lead to the MSI-X tables being mapped
> in the same 4k region as other data. The PCIe 6.1 specification (section
> 7.7.2 MSI-X Capability and Table Structure) says we probably should
> avoid that.
If you're referring to this:
If a Base Address Register or entry in the Enhanced Allocation
capability that maps address space for the MSI-X Table or MSI-X PBA
also maps other usable address space that is not associated with
MSI-X structures, locations (e.g., for CSRs) used in the other
address space must not share any naturally aligned 4-KB address
range with one where either MSI-X structure resides. This allows
system software where applicable to use different processor
attributes for MSI-X structures and the other address space.
I think this is technically a requirement about how space within a
single BAR should be organized, not about how multiple BARs should be
assigned. I don't think this really adds to the case for what you're
doing, so we could just drop it.
> To improve the user experience (i.e. don't require the user to specify
> pci=resource_alignment=), and increase conformance to PCIe spec, set the
> default minimum resource alignment of memory BARs to the greater of 4k
> or PAGE_SIZE.
>
> Quoting the comment in
> drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways
> we can increase the resource alignment:
>
> 1) Increase the size of the resource. BARs are aligned on their
> size, so when we reallocate space for this resource, we'll
> allocate it with the larger alignment. This also prevents
> assignment of any other BARs inside the alignment region, so
> if we're requesting page alignment, this means no other BARs
> will share the page.
>
> The disadvantage is that this makes the resource larger than
> the hardware BAR, which may break drivers that compute things
> based on the resource size, e.g., to find registers at a
> fixed offset before the end of the BAR.
>
> 2) Retain the resource size, but use IORESOURCE_STARTALIGN and
> set r->start to the desired alignment. By itself this
> doesn't prevent other BARs being put inside the alignment
> region, but if we realign *every* resource of every device in
> the system, none of them will share an alignment region.
>
> Changing pcibios_default_alignment() results in the second method of
> alignment with IORESOURCE_STARTALIGN.
>
> The new default alignment may be overridden by arches by implementing
> pcibios_default_alignment(), or by the user on a per-device basis with
> the pci=resource_alignment= option (although this reverts to using
> IORESOURCE_SIZEALIGN).
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> Preparatory patches in this series are prerequisites to this patch.
>
> v2->v3:
> * new subject (was: "PCI: Align small (<4k) BARs")
> * clarify 4k vs PAGE_SIZE in commit message
>
> v1->v2:
> * capitalize subject text
> * s/4 * 1024/SZ_4K/
> * #include <linux/sizes.h>
> * update commit message
> * use max(SZ_4K, PAGE_SIZE) for alignment value
> ---
> drivers/pci/pci.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index af34407f2fb9..efdd5b85ea8c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -31,6 +31,7 @@
> #include <asm/dma.h>
> #include <linux/aer.h>
> #include <linux/bitfield.h>
> +#include <linux/sizes.h>
> #include "pci.h"
>
> DEFINE_MUTEX(pci_slot_mutex);
> @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev)
>
> resource_size_t __weak pcibios_default_alignment(void)
> {
> - return 0;
> + /*
> + * Avoid MSI-X tables being mapped in the same 4k region as other data
> + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability
> + * and Table Structure.
> + */
I think this is sort of a "spec compliance" comment that is not the
*real* reason we want to do this, i.e., it doesn't say that by doing
this we can pass through more devices to guests.
Doing this in pcibios_default_alignment() ends up being a very
non-obvious way to make this happen. We have to:
- Know what the purpose of this is, and the current comment doesn't
point to that.
- Look at all the implementations of pcibios_default_alignment()
(thanks, powerpc).
- Trace up through pci_specified_resource_alignment(), which
contains a bunch of code that is not relevant to this case and
always just returns PAGE_SIZE.
- Trace up again to pci_reassigndev_resource_alignment() to see
where this finally applies to the resources we care about. The
comment here about "check if specified PCI is target device" is
actively misleading for the passthrough usage.
I hate adding new kernel parameters, but I kind of think this would be
easier if we added one that mentioned passthrough or guests and tested
it directly in pci_reassigndev_resource_alignment().
This would also be a way to avoid the "Can't reassign resources to
host bridge" warning that I think we're going to see all the time.
> + return max(SZ_4K, PAGE_SIZE);
> }
>
> /*
> --
> 2.46.0
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment
2024-08-08 20:28 ` Stewart Hildebrand
@ 2024-08-08 21:54 ` Bjorn Helgaas
2024-08-14 18:12 ` Stewart Hildebrand
0 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-08 21:54 UTC (permalink / raw)
To: Stewart Hildebrand; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On Thu, Aug 08, 2024 at 04:28:50PM -0400, Stewart Hildebrand wrote:
> On 8/8/24 15:28, Bjorn Helgaas wrote:
> > On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote:
> >> Devices with alignment specified will lose their alignment in cases when
> >> the bridge resources have been released, e.g. due to insufficient bridge
> >> window size. Restore the alignment.
> >
> > I guess this fixes a problem when the user has specified
> > "pci=resource_alignment=..." and we've decided to release and
> > reallocate a bridge window? Just looking for a bit more concrete
> > description of what this problem would look like to a user.
>
> Yes. When alignment has been specified via pcibios_default_alignment()
> or by the user with "pci=resource_alignment=...", and the bridge window
> is being reallocated, the specified alignment is lost and the resource
> may not be sufficiently aligned after reallocation.
>
> I can expand the commit description.
I think a hint about where the alignment gets lost would be helpful,
too.
This seems like a problem users could be seeing today, even
independent of the device passthrough issue that I think is the main
thrust of this series. If there's a problem report or an easy way to
reproduce this problem, that would be nice, too.
Bjorn
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs
2024-08-08 21:53 ` Bjorn Helgaas
@ 2024-08-14 13:55 ` Stewart Hildebrand
2024-08-14 22:05 ` Bjorn Helgaas
0 siblings, 1 reply; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-14 13:55 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel
Hi Bjorn,
Thanks for the feedback!
On 8/8/24 17:53, Bjorn Helgaas wrote:
> On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote:
>> In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE).
>>
>> Issues observed when small BARs are not sufficiently aligned are:
>>
>> 1. Devices to be passed through (to e.g. a Xen HVM guest) with small
>> BARs require each memory BAR to be page aligned. Currently, the only way
>> to guarantee this alignment from a user perspective is to fake the size
>> of the BARs using the pci=resource_alignment= option. This is a bad user
>> experience, and faking the BAR size is not always desirable. For
>> example, pcitest is a tool that is useful for PCI passthrough validation
>> with Xen, but pcitest fails with a fake BAR size.
>
> I guess this is the "money" patch for the main problem you're solving,
> i.e., passthrough to a guest doesn't work as you want?
Haha, yup!
> Is it the case that if you have two BARs in the same page, a device
> can't be passed through to a guest at all?
If the conditions are just right, passing through such a device could
maybe work, but in practice it's problematic and unlikely to work
reliably across different configurations.
Let me show example 1, from a real device that I'm working with.
Scrubbed/partial output from lspci -vv, from the host's point of view:
Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256]
Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512]
Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K]
Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K]
Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K]
Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M]
Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
Vector table: BAR=0 offset=00000080
PBA: BAR=0 offset=00000090
Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00
VF offset: 6, stride: 1, Device ID: 0100
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: pci-endpoint-test
Kernel modules: pci_endpoint_test
BARs 0, 1, and 2 are small, and the host firmware placed them on the
same page. The host firmware did not page align the BARs.
The hypervisor can only map full pages. The hypervisor cannot map
partial pages. It cannot map a guest page offset from a host page where
the offset is smaller than PAGE_SIZE.
To pass this device through (physfn) as-is, the hypervisor would need to
preserve the page offsets of each BAR and propagate them to the guest,
taking translation into account. The guest (both firmware + OS)
necessarily has to preserve the offsets as well. If the page offsets
aren't preserved, the guest would be accessing the wrong data.
We can't reliably predict what the guest behavior will be.
SeaBIOS aligns BARs to 4k [1].
[1] https://review.coreboot.org/plugins/gitiles/seabios/+/refs/tags/rel-1.16.3/src/fw/pciinit.c#28
Xen's hvmloader does not align BARs to 4k. A patch was submitted to fix
this, but it wasn't merged upstream [2].
[2] https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/
Arm guests don't usually have firmware to initialize BARs, so it's
usually up to the OS (which may or may not be Linux).
The point is that there is not a consistent BAR initialization
strategy/convention in the ecosystem when it comes to small BARs.
The host doesn't have a way to enforce the guest always map the small
BARs at the required offsets. IMO the most sensible thing to do is not
impose any sort of arbitrary page offset requirements on guests because
it happened to suit the host.
If the host were to use fake BAR sizes via the current
pci=resource_alignment=... option, the fake BAR size would propagate to
the guest (lying to the guest), pcitest would break, and the guest can't
do anything about it.
To avoid these problems, small BARs should be predictably page aligned
in both host and guest.
> Or is it just that all
> devices with BARs that share a page have to be passed through to the
> same guest, sort of like how lack of ACS can force several devices to
> be in the same IOMMU isolation group?
This case is much worse. If two devices have BARs sharing a page in a
passthrough scenario, it's a security issue because guest can access
data of another device. See XSA-461 / CVE-2024-31146 [3]. Aside: I was
unaware that there was a XSA/CVE associated with this until after the
embargo was lifted.
[3] https://lore.kernel.org/xen-devel/E1seE0f-0001zO-Nj@xenbits.xenproject.org/
For completeness, see example 2:
01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe800000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c000 [size=256]
Memory at 7050000000 (64-bit, prefetchable) [size=32]
01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe801000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c100 [size=256]
Memory at 7050000020 (64-bit, prefetchable) [size=32]
01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe802000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c200 [size=256]
Memory at 7050000040 (64-bit, prefetchable) [size=32]
01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe803000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c300 [size=256]
Memory at 7040000000 (64-bit, prefetchable) [size=256M]
This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack.
Add this to your usual qemu-system-x86_64 args:
-device pcie-pci-bridge,id=pcie.1 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=268435456
Apply this SeaBIOS hack:
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index b3e359d7..769007a4 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -25,7 +25,7 @@
#include "util.h" // pci_setup
#include "x86.h" // outb
-#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size
+#define PCI_DEVICE_MEM_MIN (0)
#define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size
#define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec
If you want to trigger the bridge window realloc (where BAR alignments
currently get lost), also apply this hack to SeaBIOS in the same file:
@@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr)
pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT);
}
if (entry->type == PCI_REGION_TYPE_PREFMEM) {
+ limit = addr + PCI_BRIDGE_MEM_MIN - 1;
pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT);
pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT);
pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32);
> I think the subject should mention the problem to help motivate this.
>
> The fact that we address this by potentially reassigning every BAR of
> every device, regardless of whether the admin even wants to pass
> through a device to a guest, seems a bit aggressive to me.
Patch [7/8] should limit the impact somewhat, but yes, it's quite
aggressive... Perhaps such a change in default should be paired with the
ability to turn it off via pci=realloc=off (or similar), and/or Kconfig.
> Previously we haven't trusted our reassignment machinery enough to
> enable it all the time, so we still have the "pci=realloc" parameter.
> By default, I don't think we even move devices around to make space
> for a BAR that we failed to allocate.
One exception is SR-IOV device resources when
CONFIG_PCI_REALLOC_ENABLE_AUTO=y.
> I agree "pci=resource_alignment=" is a bit user-unfriendly, and I
> don't think it solves the problem unless we apply it to every device
> in the system.
Right.
>> 2. Devices with multiple small BARs could have the MSI-X tables located
>> in one of its small BARs. This may lead to the MSI-X tables being mapped
>> in the same 4k region as other data. The PCIe 6.1 specification (section
>> 7.7.2 MSI-X Capability and Table Structure) says we probably should
>> avoid that.
>
> If you're referring to this:
>
> If a Base Address Register or entry in the Enhanced Allocation
> capability that maps address space for the MSI-X Table or MSI-X PBA
> also maps other usable address space that is not associated with
> MSI-X structures, locations (e.g., for CSRs) used in the other
> address space must not share any naturally aligned 4-KB address
> range with one where either MSI-X structure resides. This allows
> system software where applicable to use different processor
> attributes for MSI-X structures and the other address space.
Yes, that's the correct reference.
> I think this is technically a requirement about how space within a
> single BAR should be organized, not about how multiple BARs should be
> assigned. I don't think this really adds to the case for what you're
> doing, so we could just drop it.
I'm OK to drop the reference to the spec. For completeness, example 1
above was what led me to mention it: This device has the MSI-X tables
located in BAR 0, which is mapped in the same 4k region as other data.
>> To improve the user experience (i.e. don't require the user to specify
>> pci=resource_alignment=), and increase conformance to PCIe spec, set the
>> default minimum resource alignment of memory BARs to the greater of 4k
>> or PAGE_SIZE.
>>
>> Quoting the comment in
>> drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways
>> we can increase the resource alignment:
>>
>> 1) Increase the size of the resource. BARs are aligned on their
>> size, so when we reallocate space for this resource, we'll
>> allocate it with the larger alignment. This also prevents
>> assignment of any other BARs inside the alignment region, so
>> if we're requesting page alignment, this means no other BARs
>> will share the page.
>>
>> The disadvantage is that this makes the resource larger than
>> the hardware BAR, which may break drivers that compute things
>> based on the resource size, e.g., to find registers at a
>> fixed offset before the end of the BAR.
>>
>> 2) Retain the resource size, but use IORESOURCE_STARTALIGN and
>> set r->start to the desired alignment. By itself this
>> doesn't prevent other BARs being put inside the alignment
>> region, but if we realign *every* resource of every device in
>> the system, none of them will share an alignment region.
>>
>> Changing pcibios_default_alignment() results in the second method of
>> alignment with IORESOURCE_STARTALIGN.
>>
>> The new default alignment may be overridden by arches by implementing
>> pcibios_default_alignment(), or by the user on a per-device basis with
>> the pci=resource_alignment= option (although this reverts to using
>> IORESOURCE_SIZEALIGN).
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> Preparatory patches in this series are prerequisites to this patch.
>>
>> v2->v3:
>> * new subject (was: "PCI: Align small (<4k) BARs")
>> * clarify 4k vs PAGE_SIZE in commit message
>>
>> v1->v2:
>> * capitalize subject text
>> * s/4 * 1024/SZ_4K/
>> * #include <linux/sizes.h>
>> * update commit message
>> * use max(SZ_4K, PAGE_SIZE) for alignment value
>> ---
>> drivers/pci/pci.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index af34407f2fb9..efdd5b85ea8c 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -31,6 +31,7 @@
>> #include <asm/dma.h>
>> #include <linux/aer.h>
>> #include <linux/bitfield.h>
>> +#include <linux/sizes.h>
>> #include "pci.h"
>>
>> DEFINE_MUTEX(pci_slot_mutex);
>> @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev)
>>
>> resource_size_t __weak pcibios_default_alignment(void)
>> {
>> - return 0;
>> + /*
>> + * Avoid MSI-X tables being mapped in the same 4k region as other data
>> + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability
>> + * and Table Structure.
>> + */
>
> I think this is sort of a "spec compliance" comment that is not the
> *real* reason we want to do this, i.e., it doesn't say that by doing
> this we can pass through more devices to guests.
>
> Doing this in pcibios_default_alignment() ends up being a very
> non-obvious way to make this happen. We have to:
>
> - Know what the purpose of this is, and the current comment doesn't
> point to that.
>
> - Look at all the implementations of pcibios_default_alignment()
> (thanks, powerpc).
>
> - Trace up through pci_specified_resource_alignment(), which
> contains a bunch of code that is not relevant to this case and
> always just returns PAGE_SIZE.
>
> - Trace up again to pci_reassigndev_resource_alignment() to see
> where this finally applies to the resources we care about. The
> comment here about "check if specified PCI is target device" is
> actively misleading for the passthrough usage.
>
> I hate adding new kernel parameters, but I kind of think this would be
> easier if we added one that mentioned passthrough or guests and tested
> it directly in pci_reassigndev_resource_alignment().
>
> This would also be a way to avoid the "Can't reassign resources to
> host bridge" warning that I think we're going to see all the time.
I did actually prepare a pci=resource_alignment=all patch, but I
hesitated to send it because of the discussion at [4]. I'll send it with
the next revision of the series.
[4] https://lore.kernel.org/linux-pci/20160929115422.GA31048@localhost/
I'd like to also propose introducing a Kconfig option, e.g.
CONFIG_PCI_PAGE_ALIGN_BARS, selectable by menuconfig or other usual
means.
>> + return max(SZ_4K, PAGE_SIZE);
>> }
>>
>> /*
>> --
>> 2.46.0
>>
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 3/8] PCI: Restore resource alignment
2024-08-08 21:54 ` Bjorn Helgaas
@ 2024-08-14 18:12 ` Stewart Hildebrand
0 siblings, 0 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-14 18:12 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On 8/8/24 17:54, Bjorn Helgaas wrote:
> On Thu, Aug 08, 2024 at 04:28:50PM -0400, Stewart Hildebrand wrote:
>> On 8/8/24 15:28, Bjorn Helgaas wrote:
>>> On Wed, Aug 07, 2024 at 11:17:12AM -0400, Stewart Hildebrand wrote:
>>>> Devices with alignment specified will lose their alignment in cases when
>>>> the bridge resources have been released, e.g. due to insufficient bridge
>>>> window size. Restore the alignment.
>>>
>>> I guess this fixes a problem when the user has specified
>>> "pci=resource_alignment=..." and we've decided to release and
>>> reallocate a bridge window? Just looking for a bit more concrete
>>> description of what this problem would look like to a user.
>>
>> Yes. When alignment has been specified via pcibios_default_alignment()
>> or by the user with "pci=resource_alignment=...", and the bridge window
>> is being reallocated, the specified alignment is lost and the resource
>> may not be sufficiently aligned after reallocation.
>>
>> I can expand the commit description.
>
> I think a hint about where the alignment gets lost would be helpful,
> too.
>
> This seems like a problem users could be seeing today, even
> independent of the device passthrough issue that I think is the main
> thrust of this series. If there's a problem report or an easy way to
> reproduce this problem, that would be nice, too.
Oh, sorry, I just realized that it's only alignments with
IORESOURCE_STARTALIGN that get lost during bridge window realloc.
Specifically, r->start gets overwritten in __release_child_resources().
There are a few code paths, such as the one in
__release_child_resources(), that assume IORESOURCE_SIZEALIGN. In case
of IORESOURCE_SIZEALIGN, the alignment is restored by a simple
calculation. However, with IORESOURCE_STARTALIGN, we can't use a simple
calculation, instead we need to consult
pci_reassigndev_resource_alignment() to restore the alignment. I'll
update the commit message.
An alternative approach might be to store the alignment in a new member
in struct resource, thus saving us from having to call
pci_reassigndev_resource_alignment() for each PCI device on the bridge
undergoing reallocation.
BTW, this patch and the two "[powerpc,x86]/pci: Preserve
IORESOURCE_STARTALIGN alignment" patches could potentially be folded
into a single patch as they're all dealing with fixing
IORESOURCE_STARTALIGN.
To repro the issue on x86, we will need to apply this series except the
current patch [3/8].
I ran into the issue with the following device. Let's call it example 1.
Scrubbed/partial output from lspci -vv:
Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256]
Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512]
Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K]
Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K]
Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K]
Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M]
Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
Vector table: BAR=0 offset=00000080
PBA: BAR=0 offset=00000090
Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00
VF offset: 6, stride: 1, Device ID: 0100
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: pci-endpoint-test
Kernel modules: pci_endpoint_test
BARs 0, 1, and 2 are small, and the host firmware placed them on the
same page. The host firmware did not page align the BARs. There is also
an SR-IOV BAR that the firmware couldn't fit in the bridge window.
In example 1, I did not specify pci=realloc=on. I used a kernel with
CONFIG_PCI_REALLOC_ENABLE_AUTO=y, and the SR-IOV BAR triggered the
bridge window realloc. Alignment was requested for BARs 0, 1, and 2 of
this device, using IORESOURCE_STARTALIGN, because of patch [8/8]. The
alignment was lost during realloc.
Such a device from example 1 may be hard to come by, so here's a way to
reproduce with qemu. Example 2:
01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe800000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c000 [size=256]
Memory at 7050000000 (64-bit, prefetchable) [size=32]
01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe801000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c100 [size=256]
Memory at 7050000020 (64-bit, prefetchable) [size=32]
01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe802000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c200 [size=256]
Memory at 7050000040 (64-bit, prefetchable) [size=32]
01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
Subsystem: Red Hat, Inc. QEMU Virtual Machine
Flags: fast devsel
Memory at fe803000 (32-bit, non-prefetchable) [size=4K]
I/O ports at c300 [size=256]
Memory at 7040000000 (64-bit, prefetchable) [size=256M]
This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack.
In this case we will need to specify pci=realloc=on to trigger the
realloc because there's no SR-IOV BAR to trigger it automatically. Add
this to your usual qemu-system-x86_64 args:
-append "console=ttyS0 ignore_loglevel pci=realloc=on" \
-device pcie-pci-bridge,id=pcie.1 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=32 \
-device pci-testdev,bus=pcie.1,membar=268435456
Apply this SeaBIOS hack:
diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index b3e359d7..769007a4 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -25,7 +25,7 @@
#include "util.h" // pci_setup
#include "x86.h" // outb
-#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size
+#define PCI_DEVICE_MEM_MIN (0)
#define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size
#define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec
@@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr)
pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT);
}
if (entry->type == PCI_REGION_TYPE_PREFMEM) {
+ limit = addr + PCI_BRIDGE_MEM_MIN - 1;
pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT);
pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT);
pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32);
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v3 4/8] PCI: Restore memory decoding after reallocation
2024-08-08 19:37 ` Bjorn Helgaas
@ 2024-08-14 18:31 ` Stewart Hildebrand
0 siblings, 0 replies; 21+ messages in thread
From: Stewart Hildebrand @ 2024-08-14 18:31 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: Bjorn Helgaas, linux-pci, linux-kernel
On 8/8/24 15:37, Bjorn Helgaas wrote:
> On Wed, Aug 07, 2024 at 11:17:13AM -0400, Stewart Hildebrand wrote:
>> Currently, the PCI subsystem unconditionally clears the memory decoding
>> bit of devices with resource alignment specified. While drivers should
>> call pci_enable_device() to enable memory decoding, some platforms have
>> devices that expect memory decoding to be enabled even when no driver is
>> bound to the device. It is assumed firmware enables memory decoding in
>> these cases. For example, the vgacon driver uses the 0xb8000 buffer to
>> display a console without any PCI involvement, yet, on some platforms,
>> memory decoding must be enabled on the PCI VGA device in order for the
>> console to be displayed properly.
>>
>> Restore the memory decoding bit after the resource has been reallocated.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> v2->v3:
>> * no change
>>
>> v1->v2:
>> * capitalize subject text
>> * reword commit message
>>
>> Testing notes / how to check if your platform needs memory decoding
>> enabled in order to use vgacon:
>> 1. Boot your system with a monitor attached, without any driver bound to
>> your PCI VGA device. You should see a console on your display.
>> 2. Find the SBDF of your VGA device with lspci -v (00:01.0 in this
>> example).
>> 3. Disable memory decoding (replace 00:01.0 with your SBDF):
>> $ sudo setpci -v -s 00:01.0 0x04.W=0x05
>> 4. Type something to see if it appears on the console display
>> 5. Re-enable memory decoding:
>> $ sudo setpci -v -s 00:01.0 0x04.W=0x07
>>
>> Relevant prior discussion at [1]
>>
>> [1] https://lore.kernel.org/linux-pci/20160906165652.GE1214@localhost/
>> ---
>> drivers/pci/pci.c | 1 +
>> drivers/pci/setup-bus.c | 25 +++++++++++++++++++++++++
>> include/linux/pci.h | 2 ++
>> 3 files changed, 28 insertions(+)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index acecdd6edd5a..4b97d8d5c2d8 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -6676,6 +6676,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
>>
>> pci_read_config_word(dev, PCI_COMMAND, &command);
>> if (command & PCI_COMMAND_MEMORY) {
>> + dev->dev_flags |= PCI_DEV_FLAGS_MEMORY_ENABLE;
>> command &= ~PCI_COMMAND_MEMORY;
>> pci_write_config_word(dev, PCI_COMMAND, command);
>> }
>> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> index ab7510ce6917..6847b251e19a 100644
>> --- a/drivers/pci/setup-bus.c
>> +++ b/drivers/pci/setup-bus.c
>> @@ -2131,6 +2131,29 @@ pci_root_bus_distribute_available_resources(struct pci_bus *bus,
>> }
>> }
>>
>> +static void restore_memory_decoding(struct pci_bus *bus)
>> +{
>> + struct pci_dev *dev;
>> +
>> + list_for_each_entry(dev, &bus->devices, bus_list) {
>> + struct pci_bus *b;
>> +
>> + if (dev->dev_flags & PCI_DEV_FLAGS_MEMORY_ENABLE) {
>> + u16 command;
>> +
>> + pci_read_config_word(dev, PCI_COMMAND, &command);
>> + command |= PCI_COMMAND_MEMORY;
>> + pci_write_config_word(dev, PCI_COMMAND, command);
>> + }
>> +
>> + b = dev->subordinate;
>> + if (!b)
>> + continue;
>> +
>> + restore_memory_decoding(b);
>> + }
>> +}
>> +
>> /*
>> * First try will not touch PCI bridge res.
>> * Second and later try will clear small leaf bridge res.
>> @@ -2229,6 +2252,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
>> goto again;
>>
>> dump:
>> + restore_memory_decoding(bus);
>
> The fact that we traverse the whole hierarchy here to restore
> PCI_COMMAND_MEMORY makes me suspect there's a window between point A
> (where we clear PCI_COMMAND_MEMORY and update a BAR) and point B
> (where we restore PCI_COMMAND_MEMORY) where VGA console output will
> not work.
Yes, there is a brief moment of garbled junk on the display, but it is
not fatal. The VGA console display returns to normal after the bit is
restored.
> This tickles my memory like we might have talked about this before and
> there's some reason for having to wait. If so, sorry, and maybe we
> need a comment in the code about that reason.
I don't have a strong opinion on which way to go with this, but my
understanding is that we want memory decoding disabled while r->start
doesn't match the actual BAR value. If you agree with this rationale,
I'll add something to that effect in a comment.
See the prior discussion at [1] (link above), and discussion from v1 of
this patch [2].
[2] https://lore.kernel.org/linux-pci/1d0b52bd-1654-4510-92dc-bd48447ff41d@amd.com/
>> /* Dump the resource on buses */
>> pci_bus_dump_resources(bus);
>> }
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 4246cb790c7b..74636acf152f 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -245,6 +245,8 @@ enum pci_dev_flags {
>> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
>> /* Device does honor MSI masking despite saying otherwise */
>> PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
>> + /* Firmware enabled memory decoding, to be restored if BAR is updated */
>> + PCI_DEV_FLAGS_MEMORY_ENABLE = (__force pci_dev_flags_t) (1 << 13),
>> };
>>
>> enum pci_irq_reroute_variant {
>> --
>> 2.46.0
>>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3 8/8] PCI: Align small BARs
2024-08-14 13:55 ` Stewart Hildebrand
@ 2024-08-14 22:05 ` Bjorn Helgaas
0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2024-08-14 22:05 UTC (permalink / raw)
To: Stewart Hildebrand
Cc: Bjorn Helgaas, Ilpo Järvinen, linux-pci, linux-kernel,
Alex Williamson, kvm
[+cc Alex, kvm list since this topic is of general interest for
passthrough; thread begins at
https://lore.kernel.org/r/20240807151723.613742-1-stewart.hildebrand@amd.com]
On Wed, Aug 14, 2024 at 09:55:26AM -0400, Stewart Hildebrand wrote:
> On 8/8/24 17:53, Bjorn Helgaas wrote:
> > On Wed, Aug 07, 2024 at 11:17:17AM -0400, Stewart Hildebrand wrote:
> >> In this context, "small" is defined as less than max(SZ_4K, PAGE_SIZE).
> >>
> >> Issues observed when small BARs are not sufficiently aligned are:
> >>
> >> 1. Devices to be passed through (to e.g. a Xen HVM guest) with small
> >> BARs require each memory BAR to be page aligned. Currently, the only way
> >> to guarantee this alignment from a user perspective is to fake the size
> >> of the BARs using the pci=resource_alignment= option. This is a bad user
> >> experience, and faking the BAR size is not always desirable. For
> >> example, pcitest is a tool that is useful for PCI passthrough validation
> >> with Xen, but pcitest fails with a fake BAR size.
> >
> > I guess this is the "money" patch for the main problem you're solving,
> > i.e., passthrough to a guest doesn't work as you want?
>
> Haha, yup!
>
> > Is it the case that if you have two BARs in the same page, a device
> > can't be passed through to a guest at all?
>
> If the conditions are just right, passing through such a device could
> maybe work, but in practice it's problematic and unlikely to work
> reliably across different configurations.
>
> Let me show example 1, from a real device that I'm working with.
> Scrubbed/partial output from lspci -vv, from the host's point of view:
>
> Region 0: Memory at d1924600 (32-bit, non-prefetchable) [size=256]
> Region 1: Memory at d1924400 (32-bit, non-prefetchable) [size=512]
> Region 2: Memory at d1924000 (32-bit, non-prefetchable) [size=1K]
> Region 3: Memory at d1920000 (32-bit, non-prefetchable) [size=16K]
> Region 4: Memory at d1900000 (32-bit, non-prefetchable) [size=128K]
> Region 5: Memory at d1800000 (32-bit, non-prefetchable) [size=1M]
> Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
> Vector table: BAR=0 offset=00000080
> PBA: BAR=0 offset=00000090
> Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
> IOVCap: Migration-, Interrupt Message Number: 000
> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
> IOVSta: Migration-
> Initial VFs: 4, Total VFs: 4, Number of VFs: 0, Function Dependency Link: 00
> VF offset: 6, stride: 1, Device ID: 0100
> Supported Page Size: 00000553, System Page Size: 00000001
> Region 0: Memory at 00000000d0800000 (64-bit, non-prefetchable)
> VF Migration: offset: 00000000, BIR: 0
> Kernel driver in use: pci-endpoint-test
> Kernel modules: pci_endpoint_test
>
> BARs 0, 1, and 2 are small, and the host firmware placed them on the
> same page. The host firmware did not page align the BARs.
>
> The hypervisor can only map full pages. The hypervisor cannot map
> partial pages. It cannot map a guest page offset from a host page where
> the offset is smaller than PAGE_SIZE.
>
> To pass this device through (physfn) as-is, the hypervisor would need to
> preserve the page offsets of each BAR and propagate them to the guest,
> taking translation into account. The guest (both firmware + OS)
> necessarily has to preserve the offsets as well. If the page offsets
> aren't preserved, the guest would be accessing the wrong data.
>
> We can't reliably predict what the guest behavior will be.
>
> SeaBIOS aligns BARs to 4k [1].
>
> [1] https://review.coreboot.org/plugins/gitiles/seabios/+/refs/tags/rel-1.16.3/src/fw/pciinit.c#28
>
> Xen's hvmloader does not align BARs to 4k. A patch was submitted to fix
> this, but it wasn't merged upstream [2].
>
> [2] https://lore.kernel.org/xen-devel/20200117110811.43321-1-roger.pau@citrix.com/
>
> Arm guests don't usually have firmware to initialize BARs, so it's
> usually up to the OS (which may or may not be Linux).
>
> The point is that there is not a consistent BAR initialization
> strategy/convention in the ecosystem when it comes to small BARs.
>
> The host doesn't have a way to enforce the guest always map the small
> BARs at the required offsets. IMO the most sensible thing to do is not
> impose any sort of arbitrary page offset requirements on guests because
> it happened to suit the host.
>
> If the host were to use fake BAR sizes via the current
> pci=resource_alignment=... option, the fake BAR size would propagate to
> the guest (lying to the guest), pcitest would break, and the guest can't
> do anything about it.
>
> To avoid these problems, small BARs should be predictably page aligned
> in both host and guest.
Right, all the above makes sense to me. I was fishing to see if vfio,
etc., actually checked for two BARs in the same page and disallowed
passthrough if that happened, but it doesn't sound like it. It sounds
like things will just break, e.g., guest accesses data at incorrect
offsets, etc.
> > Or is it just that all
> > devices with BARs that share a page have to be passed through to the
> > same guest, sort of like how lack of ACS can force several devices to
> > be in the same IOMMU isolation group?
>
> This case is much worse. If two devices have BARs sharing a page in a
> passthrough scenario, it's a security issue because guest can access
> data of another device. See XSA-461 / CVE-2024-31146 [3]. Aside: I was
> unaware that there was a XSA/CVE associated with this until after the
> embargo was lifted.
Yes. IIUC, vfio etc *could* check for devices sharing a page and
force them to be passed through together, which I suppose would
mitigate this CVE. But they don't in fact check.
> [3] https://lore.kernel.org/xen-devel/E1seE0f-0001zO-Nj@xenbits.xenproject.org/
>
> For completeness, see example 2:
>
> 01:00.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
> Subsystem: Red Hat, Inc. QEMU Virtual Machine
> Flags: fast devsel
> Memory at fe800000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at c000 [size=256]
> Memory at 7050000000 (64-bit, prefetchable) [size=32]
>
> 01:01.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
> Subsystem: Red Hat, Inc. QEMU Virtual Machine
> Flags: fast devsel
> Memory at fe801000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at c100 [size=256]
> Memory at 7050000020 (64-bit, prefetchable) [size=32]
>
> 01:02.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
> Subsystem: Red Hat, Inc. QEMU Virtual Machine
> Flags: fast devsel
> Memory at fe802000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at c200 [size=256]
> Memory at 7050000040 (64-bit, prefetchable) [size=32]
>
> 01:03.0 Unclassified device [00ff]: Red Hat, Inc. QEMU PCI Test Device
> Subsystem: Red Hat, Inc. QEMU Virtual Machine
> Flags: fast devsel
> Memory at fe803000 (32-bit, non-prefetchable) [size=4K]
> I/O ports at c300 [size=256]
> Memory at 7040000000 (64-bit, prefetchable) [size=256M]
>
> This example can reproduced with Qemu's pci-testdev and a SeaBIOS hack.
> Add this to your usual qemu-system-x86_64 args:
>
> -device pcie-pci-bridge,id=pcie.1 \
> -device pci-testdev,bus=pcie.1,membar=32 \
> -device pci-testdev,bus=pcie.1,membar=32 \
> -device pci-testdev,bus=pcie.1,membar=32 \
> -device pci-testdev,bus=pcie.1,membar=268435456
>
> Apply this SeaBIOS hack:
>
> diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
> index b3e359d7..769007a4 100644
> --- a/src/fw/pciinit.c
> +++ b/src/fw/pciinit.c
> @@ -25,7 +25,7 @@
> #include "util.h" // pci_setup
> #include "x86.h" // outb
>
> -#define PCI_DEVICE_MEM_MIN (1<<12) // 4k == page size
> +#define PCI_DEVICE_MEM_MIN (0)
> #define PCI_BRIDGE_MEM_MIN (1<<21) // 2M == hugepage size
> #define PCI_BRIDGE_IO_MIN 0x1000 // mandated by pci bridge spec
>
>
> If you want to trigger the bridge window realloc (where BAR alignments
> currently get lost), also apply this hack to SeaBIOS in the same file:
>
> @@ -1089,6 +1089,7 @@ pci_region_map_one_entry(struct pci_region_entry *entry, u64 addr)
> pci_config_writew(bdf, PCI_MEMORY_LIMIT, limit >> PCI_MEMORY_SHIFT);
> }
> if (entry->type == PCI_REGION_TYPE_PREFMEM) {
> + limit = addr + PCI_BRIDGE_MEM_MIN - 1;
> pci_config_writew(bdf, PCI_PREF_MEMORY_BASE, addr >> PCI_PREF_MEMORY_SHIFT);
> pci_config_writew(bdf, PCI_PREF_MEMORY_LIMIT, limit >> PCI_PREF_MEMORY_SHIFT);
> pci_config_writel(bdf, PCI_PREF_BASE_UPPER32, addr >> 32);
>
> > I think the subject should mention the problem to help motivate this.
> >
> > The fact that we address this by potentially reassigning every BAR of
> > every device, regardless of whether the admin even wants to pass
> > through a device to a guest, seems a bit aggressive to me.
>
> Patch [7/8] should limit the impact somewhat, but yes, it's quite
> aggressive... Perhaps such a change in default should be paired with the
> ability to turn it off via pci=realloc=off (or similar), and/or Kconfig.
I'm frankly pretty scared about reassigning every BAR by default.
Maybe my fear is unfounded, but I suspect things will break.
I'd be more comfortable if users had to specify a new option to get
this behavior, because then they would be aware of what they changed
and would be able to change back if it didn't work. Or maybe if we
checked for page sharing and prevented passthrough in that case.
> > Previously we haven't trusted our reassignment machinery enough to
> > enable it all the time, so we still have the "pci=realloc" parameter.
> > By default, I don't think we even move devices around to make space
> > for a BAR that we failed to allocate.
>
> One exception is SR-IOV device resources when
> CONFIG_PCI_REALLOC_ENABLE_AUTO=y.
>
> > I agree "pci=resource_alignment=" is a bit user-unfriendly, and I
> > don't think it solves the problem unless we apply it to every device
> > in the system.
>
> Right.
>
> >> 2. Devices with multiple small BARs could have the MSI-X tables located
> >> in one of its small BARs. This may lead to the MSI-X tables being mapped
> >> in the same 4k region as other data. The PCIe 6.1 specification (section
> >> 7.7.2 MSI-X Capability and Table Structure) says we probably should
> >> avoid that.
> >
> > If you're referring to this:
> >
> > If a Base Address Register or entry in the Enhanced Allocation
> > capability that maps address space for the MSI-X Table or MSI-X PBA
> > also maps other usable address space that is not associated with
> > MSI-X structures, locations (e.g., for CSRs) used in the other
> > address space must not share any naturally aligned 4-KB address
> > range with one where either MSI-X structure resides. This allows
> > system software where applicable to use different processor
> > attributes for MSI-X structures and the other address space.
>
> Yes, that's the correct reference.
>
> > I think this is technically a requirement about how space within a
> > single BAR should be organized, not about how multiple BARs should be
> > assigned. I don't think this really adds to the case for what you're
> > doing, so we could just drop it.
>
> I'm OK to drop the reference to the spec. For completeness, example 1
> above was what led me to mention it: This device has the MSI-X tables
> located in BAR 0, which is mapped in the same 4k region as other data.
>
> >> To improve the user experience (i.e. don't require the user to specify
> >> pci=resource_alignment=), and increase conformance to PCIe spec, set the
> >> default minimum resource alignment of memory BARs to the greater of 4k
> >> or PAGE_SIZE.
> >>
> >> Quoting the comment in
> >> drivers/pci/pci.c:pci_request_resource_alignment(), there are two ways
> >> we can increase the resource alignment:
> >>
> >> 1) Increase the size of the resource. BARs are aligned on their
> >> size, so when we reallocate space for this resource, we'll
> >> allocate it with the larger alignment. This also prevents
> >> assignment of any other BARs inside the alignment region, so
> >> if we're requesting page alignment, this means no other BARs
> >> will share the page.
> >>
> >> The disadvantage is that this makes the resource larger than
> >> the hardware BAR, which may break drivers that compute things
> >> based on the resource size, e.g., to find registers at a
> >> fixed offset before the end of the BAR.
> >>
> >> 2) Retain the resource size, but use IORESOURCE_STARTALIGN and
> >> set r->start to the desired alignment. By itself this
> >> doesn't prevent other BARs being put inside the alignment
> >> region, but if we realign *every* resource of every device in
> >> the system, none of them will share an alignment region.
> >>
> >> Changing pcibios_default_alignment() results in the second method of
> >> alignment with IORESOURCE_STARTALIGN.
> >>
> >> The new default alignment may be overridden by arches by implementing
> >> pcibios_default_alignment(), or by the user on a per-device basis with
> >> the pci=resource_alignment= option (although this reverts to using
> >> IORESOURCE_SIZEALIGN).
> >>
> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >> ---
> >> Preparatory patches in this series are prerequisites to this patch.
> >>
> >> v2->v3:
> >> * new subject (was: "PCI: Align small (<4k) BARs")
> >> * clarify 4k vs PAGE_SIZE in commit message
> >>
> >> v1->v2:
> >> * capitalize subject text
> >> * s/4 * 1024/SZ_4K/
> >> * #include <linux/sizes.h>
> >> * update commit message
> >> * use max(SZ_4K, PAGE_SIZE) for alignment value
> >> ---
> >> drivers/pci/pci.c | 8 +++++++-
> >> 1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index af34407f2fb9..efdd5b85ea8c 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -31,6 +31,7 @@
> >> #include <asm/dma.h>
> >> #include <linux/aer.h>
> >> #include <linux/bitfield.h>
> >> +#include <linux/sizes.h>
> >> #include "pci.h"
> >>
> >> DEFINE_MUTEX(pci_slot_mutex);
> >> @@ -6484,7 +6485,12 @@ struct pci_dev __weak *pci_real_dma_dev(struct pci_dev *dev)
> >>
> >> resource_size_t __weak pcibios_default_alignment(void)
> >> {
> >> - return 0;
> >> + /*
> >> + * Avoid MSI-X tables being mapped in the same 4k region as other data
> >> + * according to PCIe 6.1 specification section 7.7.2 MSI-X Capability
> >> + * and Table Structure.
> >> + */
> >
> > I think this is sort of a "spec compliance" comment that is not the
> > *real* reason we want to do this, i.e., it doesn't say that by doing
> > this we can pass through more devices to guests.
> >
> > Doing this in pcibios_default_alignment() ends up being a very
> > non-obvious way to make this happen. We have to:
> >
> > - Know what the purpose of this is, and the current comment doesn't
> > point to that.
> >
> > - Look at all the implementations of pcibios_default_alignment()
> > (thanks, powerpc).
> >
> > - Trace up through pci_specified_resource_alignment(), which
> > contains a bunch of code that is not relevant to this case and
> > always just returns PAGE_SIZE.
> >
> > - Trace up again to pci_reassigndev_resource_alignment() to see
> > where this finally applies to the resources we care about. The
> > comment here about "check if specified PCI is target device" is
> > actively misleading for the passthrough usage.
> >
> > I hate adding new kernel parameters, but I kind of think this would be
> > easier if we added one that mentioned passthrough or guests and tested
> > it directly in pci_reassigndev_resource_alignment().
> >
> > This would also be a way to avoid the "Can't reassign resources to
> > host bridge" warning that I think we're going to see all the time.
>
> I did actually prepare a pci=resource_alignment=all patch, but I
> hesitated to send it because of the discussion at [4]. I'll send it with
> the next revision of the series.
>
> [4] https://lore.kernel.org/linux-pci/20160929115422.GA31048@localhost/
>
> I'd like to also propose introducing a Kconfig option, e.g.
> CONFIG_PCI_PAGE_ALIGN_BARS, selectable by menuconfig or other usual
> means.
>
> >> + return max(SZ_4K, PAGE_SIZE);
> >> }
> >>
> >> /*
> >> --
> >> 2.46.0
> >>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-08-14 22:05 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-07 15:17 [PATCH v3 0/8] PCI: Align small BARs Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 1/8] x86/PCI: Improve code readability Stewart Hildebrand
2024-08-08 8:55 ` Ilpo Järvinen
2024-08-07 15:17 ` [PATCH v3 2/8] PCI: Don't unnecessarily disable memory decoding Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 3/8] PCI: Restore resource alignment Stewart Hildebrand
2024-08-08 19:28 ` Bjorn Helgaas
2024-08-08 20:28 ` Stewart Hildebrand
2024-08-08 21:54 ` Bjorn Helgaas
2024-08-14 18:12 ` Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 4/8] PCI: Restore memory decoding after reallocation Stewart Hildebrand
2024-08-08 19:37 ` Bjorn Helgaas
2024-08-14 18:31 ` Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 5/8] x86/PCI: Preserve IORESOURCE_STARTALIGN alignment Stewart Hildebrand
2024-08-08 20:10 ` Bjorn Helgaas
2024-08-07 15:17 ` [PATCH v3 6/8] powerpc/pci: " Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 7/8] PCI: Don't reassign resources that are already aligned Stewart Hildebrand
2024-08-07 15:17 ` [PATCH v3 8/8] PCI: Align small BARs Stewart Hildebrand
2024-08-08 21:53 ` Bjorn Helgaas
2024-08-14 13:55 ` Stewart Hildebrand
2024-08-14 22:05 ` Bjorn Helgaas
2024-08-07 15:42 ` [PATCH v3 0/8] " Arnd Bergmann
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox