* [PATCH v3 00/10] Add PCIe Bandwidth Controller
@ 2023-09-29 11:57 Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
` (9 more replies)
0 siblings, 10 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm
Cc: LKML, Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
Hi all,
This series adds the PCIe bandwidth controller (bwctrl) and an
associated PCIe cooling driver on the thermal core side for limiting
PCIe Link Speed due to thermal reasons. The PCIe bandwidth controller
is a PCI Express bus port service driver. A cooling device is created
for each port the service driver finds, provided the port supports
changing Link Speeds.
This series only adds support for controlling PCIe Link Speed.
Controlling PCIe Link Width might also be useful, but AFAIK there is
no mechanism for that before PCIe 6.0 (L0p). Based on feedback on v1,
the thermal/cooling device side prefers Link Speed and Link Width to
be separate cooling devices [1], which is taken into account in the
naming of the Link Speed cooling device. The Link Width one is not
added yet as it would not be able to control anything at the moment.
bwctrl is built on top of a revert of the BW notifications removal.
I'm not sure what the best practice is when re-adding old
functionality in a modified form, so please let me know if I need to
alter that patch somehow.
[1] https://lore.kernel.org/linux-pci/f35db90cd67adf4b0f48cd6f2a6ad8fbd0c1a679.camel@linux.intel.com/
v3:
- Correct hfi1 shortlog prefix
- Improve error prints in hfi1
- Add L: linux-pci to the MAINTAINERS entry
v2:
- Add LNKCTL2 to the RMW-safe list in Documentation/PCI/pciebus-howto.rst
- Rename cooling devices from PCIe_Port_* to PCIe_Port_Link_Speed_* in
order to plan for the possibility of adding Link Width cooling devices
later on
- Move struct thermal_cooling_device declaration to the correct patch
- Small tweaks to Kconfig texts
- Rebase the series to resolve a conflict (in the selftest list)
Ilpo Järvinen (10):
PCI: Protect Link Control 2 Register with RMW locking
drm/radeon: Use RMW accessors for changing LNKCTL2
drm/amdgpu: Use RMW accessors for changing LNKCTL2
RDMA/hfi1: Use RMW accessors for changing LNKCTL2
PCI: Store all PCIe Supported Link Speeds
PCI: Cache PCIe device's Supported Speed Vector
PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
PCI/bwctrl: Add "controller" part into PCIe bwctrl
thermal: Add PCIe cooling driver
selftests/pcie_bwctrl: Create selftests
Documentation/PCI/pciebus-howto.rst | 8 +-
MAINTAINERS | 9 +
drivers/gpu/drm/amd/amdgpu/cik.c | 41 +--
drivers/gpu/drm/amd/amdgpu/si.c | 41 +--
drivers/gpu/drm/radeon/cik.c | 40 +--
drivers/gpu/drm/radeon/si.c | 40 +--
drivers/infiniband/hw/hfi1/pcie.c | 30 +-
drivers/pci/pcie/Kconfig | 9 +
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/bwctrl.c | 309 ++++++++++++++++++
drivers/pci/pcie/portdrv.c | 9 +-
drivers/pci/pcie/portdrv.h | 10 +-
drivers/pci/probe.c | 38 ++-
drivers/pci/remove.c | 2 +
drivers/thermal/Kconfig | 10 +
drivers/thermal/Makefile | 2 +
drivers/thermal/pcie_cooling.c | 107 ++++++
include/linux/pci-bwctrl.h | 33 ++
include/linux/pci.h | 3 +
include/uapi/linux/pci_regs.h | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/pcie_bwctrl/Makefile | 2 +
.../pcie_bwctrl/set_pcie_cooling_state.sh | 122 +++++++
.../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++
24 files changed, 800 insertions(+), 135 deletions(-)
create mode 100644 drivers/pci/pcie/bwctrl.c
create mode 100644 drivers/thermal/pcie_cooling.c
create mode 100644 include/linux/pci-bwctrl.h
create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
--
2.30.2
* [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 10:33 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 02/10] drm/radeon: Use RMW accessors for changing LNKCTL2 Ilpo Järvinen
` (8 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Jonathan Corbet,
linux-doc, linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
The PCIe Bandwidth Controller performs RMW accesses to the Link
Control 2 Register, which can occur concurrently with writes to that
register from other sources. Therefore, add the Link Control 2
Register to the PCI Express Capability Registers that need RMW
locking.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
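For illustration (not part of the diff below; pdev and lnkctl2 are
placeholders), the access pattern this protects looks like:

	/* Racy: a concurrent writer can clobber LNKCTL2 between the
	 * read and the write.
	 */
	pcie_capability_read_word(pdev, PCI_EXP_LNKCTL2, &lnkctl2);
	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_8_0GT;
	pcie_capability_write_word(pdev, PCI_EXP_LNKCTL2, lnkctl2);

	/* Safe: the RMW accessor performs the update under the RMW lock */
	pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL2,
					   PCI_EXP_LNKCTL2_TLS,
					   PCI_EXP_LNKCTL2_TLS_8_0GT);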
Documentation/PCI/pciebus-howto.rst | 8 ++++----
include/linux/pci.h | 1 +
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/Documentation/PCI/pciebus-howto.rst b/Documentation/PCI/pciebus-howto.rst
index a0027e8fb0d0..3ba322ca1ce1 100644
--- a/Documentation/PCI/pciebus-howto.rst
+++ b/Documentation/PCI/pciebus-howto.rst
@@ -218,7 +218,7 @@ that is shared between many drivers including the service drivers.
RMW Capability accessors (pcie_capability_clear_and_set_word(),
pcie_capability_set_word(), and pcie_capability_clear_word()) protect
a selected set of PCI Express Capability Registers (Link Control
-Register and Root Control Register). Any change to those registers
-should be performed using RMW accessors to avoid problems due to
-concurrent updates. For the up-to-date list of protected registers,
-see pcie_capability_clear_and_set_word().
+Register, Root Control Register, and Link Control 2 Register). Any
+change to those registers should be performed using RMW accessors to
+avoid problems due to concurrent updates. For the up-to-date list of
+protected registers, see pcie_capability_clear_and_set_word().
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8c7c2c3c6c65..16db80f8b15c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1243,6 +1243,7 @@ static inline int pcie_capability_clear_and_set_word(struct pci_dev *dev,
{
switch (pos) {
case PCI_EXP_LNKCTL:
+ case PCI_EXP_LNKCTL2:
case PCI_EXP_RTCTL:
return pcie_capability_clear_and_set_word_locked(dev, pos,
clear, set);
--
2.30.2
* [PATCH v3 02/10] drm/radeon: Use RMW accessors for changing LNKCTL2
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 03/10] drm/amdgpu: " Ilpo Järvinen
` (7 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Alex Deucher, Christian König,
Pan, Xinhui, David Airlie, Daniel Vetter, amd-gfx, dri-devel,
linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
Don't assume that only the driver would be accessing LNKCTL2. In the
case of the upstream bridge (parent), the driver does not even own the
device whose registers it is changing.
Use RMW capability accessors which do proper locking to avoid losing
concurrent updates to the register value. This change is also useful as
a cleanup.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
drivers/gpu/drm/radeon/cik.c | 40 ++++++++++++++----------------------
drivers/gpu/drm/radeon/si.c | 40 ++++++++++++++----------------------
2 files changed, 30 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
index 10be30366c2b..b5e96a8fc2c1 100644
--- a/drivers/gpu/drm/radeon/cik.c
+++ b/drivers/gpu/drm/radeon/cik.c
@@ -9592,28 +9592,18 @@ static void cik_pcie_gen3_enable(struct radeon_device *rdev)
PCI_EXP_LNKCTL_HAWD);
/* linkctl2 */
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(rdev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(rdev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(rdev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE_PORT(PCIE_LC_CNTL4);
tmp &= ~LC_SET_QUIESCE;
@@ -9627,15 +9617,15 @@ static void cik_pcie_gen3_enable(struct radeon_device *rdev)
speed_cntl &= ~LC_FORCE_DIS_SW_SPEED_CHANGE;
WREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(rdev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
+ tmp16 = 0;
if (speed_cap == PCIE_SPEED_8_0GT)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (speed_cap == PCIE_SPEED_5_0GT)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(rdev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(rdev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL);
speed_cntl |= LC_INITIATE_LINK_SPEED_CHANGE;
diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
index a91012447b56..32871ca09a0f 100644
--- a/drivers/gpu/drm/radeon/si.c
+++ b/drivers/gpu/drm/radeon/si.c
@@ -7189,28 +7189,18 @@ static void si_pcie_gen3_enable(struct radeon_device *rdev)
PCI_EXP_LNKCTL_HAWD);
/* linkctl2 */
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(rdev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(rdev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(rdev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE_PORT(PCIE_LC_CNTL4);
tmp &= ~LC_SET_QUIESCE;
@@ -7224,15 +7214,15 @@ static void si_pcie_gen3_enable(struct radeon_device *rdev)
speed_cntl &= ~LC_FORCE_DIS_SW_SPEED_CHANGE;
WREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(rdev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
+ tmp16 = 0;
if (speed_cap == PCIE_SPEED_8_0GT)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (speed_cap == PCIE_SPEED_5_0GT)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(rdev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(rdev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL);
speed_cntl |= LC_INITIATE_LINK_SPEED_CHANGE;
--
2.30.2
* [PATCH v3 03/10] drm/amdgpu: Use RMW accessors for changing LNKCTL2
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 02/10] drm/radeon: Use RMW accessors for changing LNKCTL2 Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 04/10] RDMA/hfi1: " Ilpo Järvinen
` (6 subsequent siblings)
9 siblings, 0 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Alex Deucher, Christian König,
Pan, Xinhui, David Airlie, Daniel Vetter, amd-gfx, dri-devel,
linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
Don't assume that only the driver would be accessing LNKCTL2. In the
case of the upstream bridge (parent), the driver does not even own the
device whose registers it is changing.
Use RMW capability accessors which do proper locking to avoid losing
concurrent updates to the register value. This change is also useful as
a cleanup.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
drivers/gpu/drm/amd/amdgpu/cik.c | 41 ++++++++++++--------------------
drivers/gpu/drm/amd/amdgpu/si.c | 41 ++++++++++++--------------------
2 files changed, 30 insertions(+), 52 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index e63abdf52b6c..7bcd41996927 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1638,28 +1638,18 @@ static void cik_pcie_gen3_enable(struct amdgpu_device *adev)
PCI_EXP_LNKCTL_HAWD);
/* linkctl2 */
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE(ixPCIE_LC_CNTL4);
tmp &= ~PCIE_LC_CNTL4__LC_SET_QUIESCE_MASK;
@@ -1674,16 +1664,15 @@ static void cik_pcie_gen3_enable(struct amdgpu_device *adev)
speed_cntl &= ~PCIE_LC_SPEED_CNTL__LC_FORCE_DIS_SW_SPEED_CHANGE_MASK;
WREG32_PCIE(ixPCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(adev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
-
+ tmp16 = 0;
if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(adev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE(ixPCIE_LC_SPEED_CNTL);
speed_cntl |= PCIE_LC_SPEED_CNTL__LC_INITIATE_LINK_SPEED_CHANGE_MASK;
diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 4b81f29e5fd5..8ea60fdd1b1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -2331,28 +2331,18 @@ static void si_pcie_gen3_enable(struct amdgpu_device *adev)
gpu_cfg &
PCI_EXP_LNKCTL_HAWD);
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE_PORT(PCIE_LC_CNTL4);
tmp &= ~LC_SET_QUIESCE;
@@ -2365,16 +2355,15 @@ static void si_pcie_gen3_enable(struct amdgpu_device *adev)
speed_cntl &= ~LC_FORCE_DIS_SW_SPEED_CHANGE;
WREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(adev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
-
+ tmp16 = 0;
if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(adev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL);
speed_cntl |= LC_INITIATE_LINK_SPEED_CHANGE;
--
2.30.2
* [PATCH v3 04/10] RDMA/hfi1: Use RMW accessors for changing LNKCTL2
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (2 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 03/10] drm/amdgpu: " Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-09-29 13:03 ` Dean Luick
2023-09-29 11:57 ` [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
` (5 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Dennis Dalessandro, Jason Gunthorpe,
Leon Romanovsky, linux-rdma, linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
Don't assume that only the driver would be accessing LNKCTL2. In the
case of the upstream bridge (parent), the driver does not even own the
device whose registers it is changing.
Use RMW capability accessors which do proper locking to avoid losing
concurrent updates to the register value. This change is also useful as
a cleanup.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
drivers/infiniband/hw/hfi1/pcie.c | 30 ++++++++----------------------
1 file changed, 8 insertions(+), 22 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index 08732e1ac966..4487a05bea04 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -1212,14 +1212,11 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
(u32)lnkctl2);
/* only write to parent if target is not as high as ours */
if ((lnkctl2 & PCI_EXP_LNKCTL2_TLS) < target_vector) {
- lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
- lnkctl2 |= target_vector;
- dd_dev_info(dd, "%s: ..new link control2: 0x%x\n", __func__,
- (u32)lnkctl2);
- ret = pcie_capability_write_word(parent,
- PCI_EXP_LNKCTL2, lnkctl2);
+ ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS,
+ target_vector);
if (ret) {
- dd_dev_err(dd, "Unable to write to PCI config\n");
+ dd_dev_err(dd, "Unable to change parent PCI target speed\n");
return_error = 1;
goto done;
}
@@ -1228,22 +1225,11 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
}
dd_dev_info(dd, "%s: setting target link speed\n", __func__);
- ret = pcie_capability_read_word(dd->pcidev, PCI_EXP_LNKCTL2, &lnkctl2);
+ ret = pcie_capability_clear_and_set_word(dd->pcidev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS,
+ target_vector);
if (ret) {
- dd_dev_err(dd, "Unable to read from PCI config\n");
- return_error = 1;
- goto done;
- }
-
- dd_dev_info(dd, "%s: ..old link control2: 0x%x\n", __func__,
- (u32)lnkctl2);
- lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
- lnkctl2 |= target_vector;
- dd_dev_info(dd, "%s: ..new link control2: 0x%x\n", __func__,
- (u32)lnkctl2);
- ret = pcie_capability_write_word(dd->pcidev, PCI_EXP_LNKCTL2, lnkctl2);
- if (ret) {
- dd_dev_err(dd, "Unable to write to PCI config\n");
+ dd_dev_err(dd, "Unable to change device PCI target speed\n");
return_error = 1;
goto done;
}
--
2.30.2
* [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (3 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 04/10] RDMA/hfi1: " Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 11:45 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector Ilpo Järvinen
` (4 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
sec 7.5.3.18, however, recommends determining supported Link Speeds
using the Supported Link Speeds Vector in the Link Capabilities 2
Register (when available).
Add pcie_bus_speeds into struct pci_bus, which caches the Supported
Link Speeds. The value is taken directly from the Supported Link
Speeds Vector, or synthesized from the Max Link Speed in the Link
Capabilities Register when the Link Capabilities 2 Register is not
available. The pcie_bus_speeds field keeps the extra reserved zero at
the least significant bit to match the Link Capabilities 2 Register
layout.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
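For illustration (values assumed, not taken from real hardware): a
port whose Max Link Speed is 8 GT/s and which implements the Link
Capabilities 2 Register ends up with

	bus->pcie_bus_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB |
			       PCI_EXP_LNKCAP2_SLS_5_0GB |
			       PCI_EXP_LNKCAP2_SLS_8_0GB;	/* 0x0e */

while an old 5 GT/s device without the Link Capabilities 2 Register
gets the synthesized value PCI_EXP_LNKCAP2_SLS_2_5GB |
PCI_EXP_LNKCAP2_SLS_5_0GB (0x06).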
drivers/pci/probe.c | 28 +++++++++++++++++++++++++++-
include/linux/pci.h | 1 +
include/uapi/linux/pci_regs.h | 1 +
3 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 795534589b98..ca1d797a30cb 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -767,6 +767,29 @@ static enum pci_bus_speed agp_speed(int agp3, int agpstat)
return agp_speeds[index];
}
+/*
+ * Implementation Note in PCIe r6.0.1 sec 7.5.3.18 recommends determining
+ * supported link speeds using the Supported Link Speeds Vector in the Link
+ * Capabilities 2 Register (when available).
+ */
+static u8 pcie_get_supported_speeds(u32 linkcap, u32 linkcap2)
+{
+ u8 speeds;
+
+ speeds = linkcap2 & PCI_EXP_LNKCAP2_SLS;
+ if (speeds)
+ return speeds;
+
+ /*
+ * Synthesize supported link speeds from the Max Link Speed in the
+ * Link Capabilities Register.
+ */
+ speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
+ if ((linkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
+ speeds |= PCI_EXP_LNKCAP2_SLS_5_0GB;
+ return speeds;
+}
+
static void pci_set_bus_speed(struct pci_bus *bus)
{
struct pci_dev *bridge = bus->self;
@@ -814,12 +837,15 @@ static void pci_set_bus_speed(struct pci_bus *bus)
}
if (pci_is_pcie(bridge)) {
- u32 linkcap;
+ u32 linkcap, linkcap2;
u16 linksta;
pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
+ pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP2, &linkcap2);
+ bus->pcie_bus_speeds = pcie_get_supported_speeds(linkcap, linkcap2);
+
pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
pcie_update_link_speed(bus, linksta);
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 16db80f8b15c..cb03f3ff9d23 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -664,6 +664,7 @@ struct pci_bus {
unsigned char primary; /* Number of primary bridge */
unsigned char max_bus_speed; /* enum pci_bus_speed */
unsigned char cur_bus_speed; /* enum pci_bus_speed */
+ u8 pcie_bus_speeds;/* Supported Link Speeds Vector (+ reserved 0 at LSB) */
#ifdef CONFIG_PCI_DOMAINS_GENERIC
int domain_nr;
#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index e5f558d96493..2b27e4f6854a 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -674,6 +674,7 @@
#define PCI_EXP_DEVSTA2 0x2a /* Device Status 2 */
#define PCI_CAP_EXP_RC_ENDPOINT_SIZEOF_V2 0x2c /* end of v2 EPs w/o link */
#define PCI_EXP_LNKCAP2 0x2c /* Link Capabilities 2 */
+#define PCI_EXP_LNKCAP2_SLS 0x000000fe /* Supported Link Speeds Vector */
#define PCI_EXP_LNKCAP2_SLS_2_5GB 0x00000002 /* Supported Speed 2.5GT/s */
#define PCI_EXP_LNKCAP2_SLS_5_0GB 0x00000004 /* Supported Speed 5GT/s */
#define PCI_EXP_LNKCAP2_SLS_8_0GB 0x00000008 /* Supported Speed 8GT/s */
--
2.30.2
* [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (4 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 15:19 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
` (3 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui,
Ilpo Järvinen
The Supported Link Speeds Vector in the Link Capabilities 2 Register
corresponds to the bus below on Root Ports and Downstream Ports,
whereas it corresponds to the bus above on Upstream Ports and
Endpoints. Only the former is currently cached in pcie_bus_speeds in
the struct pci_bus. The set of link speeds supported by the link is
the intersection of these two.
Store the device's Supported Link Speeds Vector into the struct
pci_bus when Function 0 is enumerated (Multi-Function Devices must
have the same speeds for all Functions) to be easily able to calculate
the intersection of Supported Link Speeds.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
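For illustration (a sketch only; a later patch in this series does the
equivalent inside bwctrl), the intersection is then a plain AND of the
two cached vectors:

	/* Speeds usable on the link: both ends must support them */
	u8 link_speeds = bus->pcie_bus_speeds & bus->pcie_dev_speeds;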
drivers/pci/probe.c | 10 ++++++++++
drivers/pci/remove.c | 2 ++
include/linux/pci.h | 1 +
3 files changed, 13 insertions(+)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ca1d797a30cb..a9408f2420e5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2564,6 +2564,7 @@ static void pci_set_msi_domain(struct pci_dev *dev)
void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
{
+ u8 dev_speeds = 0;
int ret;
pci_configure_device(dev);
@@ -2590,11 +2591,20 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
pci_init_capabilities(dev);
+ if (pci_is_pcie(dev) && PCI_FUNC(dev->devfn) == 0) {
+ u32 linkcap, linkcap2;
+
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
+ pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &linkcap2);
+ dev_speeds = pcie_get_supported_speeds(linkcap, linkcap2);
+ }
/*
* Add the device to our list of discovered devices
* and the bus list for fixup functions, etc.
*/
down_write(&pci_bus_sem);
+ if (dev_speeds)
+ bus->pcie_dev_speeds = dev_speeds;
list_add_tail(&dev->bus_list, &bus->devices);
up_write(&pci_bus_sem);
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index d749ea8250d6..656784cfb291 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -36,6 +36,8 @@ static void pci_destroy_dev(struct pci_dev *dev)
device_del(&dev->dev);
down_write(&pci_bus_sem);
+ if (pci_is_pcie(dev) && PCI_FUNC(dev->devfn) == 0)
+ dev->bus->pcie_dev_speeds = 0;
list_del(&dev->bus_list);
up_write(&pci_bus_sem);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cb03f3ff9d23..b8bd3dc92032 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -665,6 +665,7 @@ struct pci_bus {
unsigned char max_bus_speed; /* enum pci_bus_speed */
unsigned char cur_bus_speed; /* enum pci_bus_speed */
u8 pcie_bus_speeds;/* Supported Link Speeds Vector (+ reserved 0 at LSB) */
+ u8 pcie_dev_speeds;/* Device's Supported Link Speeds Vector (+ 0 at LSB) */
#ifdef CONFIG_PCI_DOMAINS_GENERIC
int domain_nr;
#endif
--
2.30.2
* [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (5 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 15:58 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl Ilpo Järvinen
` (2 subsequent siblings)
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Ilpo Järvinen,
linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
This mostly reverts b4c7d2076b4e ("PCI/LINK: Remove bandwidth
notification"); however, there are small tweaks:
1) Call it PCIe bwctrl (bandwidth controller) instead of just
bandwidth notifications.
2) Don't print the notifications into the kernel log; just keep the
current link speed updated.
3) Use concurrency-safe LNKCTL RMW operations.
4) Read the link speed after enabling the notification to ensure the
current link speed is correct from the start.
5) Add a local variable in probe for srv->port.
6) Handle the race between the link speed read and the LBMS write in
pcie_bw_notification_irq().
The reason for 1) is to indicate the increased scope of the driver. A
subsequent commit extends the driver to allow controlling PCIe
bandwidth from user space upon crossing thermal thresholds.
While 2) is somewhat unfortunate, the log spam was the source of
complaints that eventually led to the removal of the bandwidth
notification driver (see the links below for further information).
After this driver is re-added, userspace can, if it wishes to, observe
link speed changes using the current bus speed files under sysfs.
Link: https://lore.kernel.org/all/20190429185611.121751-1-helgaas@kernel.org/
Link: https://lore.kernel.org/linux-pci/20190501142942.26972-1-keith.busch@intel.com/
Link: https://lore.kernel.org/linux-pci/20200115221008.GA191037@google.com/
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
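For illustration (the BDF is made up; current_link_speed is the
existing sysfs attribute the last paragraph refers to):

	$ cat /sys/bus/pci/devices/0000:00:1c.0/current_link_speed
	8.0 GT/s PCIe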
drivers/pci/pcie/Kconfig | 8 +++
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/bwctrl.c | 131 +++++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.c | 9 +--
drivers/pci/pcie/portdrv.h | 10 ++-
5 files changed, 153 insertions(+), 6 deletions(-)
create mode 100644 drivers/pci/pcie/bwctrl.c
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 228652a59f27..1ef8073fa89a 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -137,6 +137,14 @@ config PCIE_PTM
This is only useful if you have devices that support PTM, but it
is safe to enable even if you don't.
+config PCIE_BW
+ bool "PCI Express Bandwidth Change Notification"
+ depends on PCIEPORTBUS
+ help
+ This enables PCI Express Bandwidth Change Notification. If
+ you know link width or rate changes occur to correct unreliable
+ links, you may answer Y.
+
config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
depends on PCIE_DPC && ACPI
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 8de4ed5f98f1..175065a495cf 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -12,4 +12,5 @@ obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
obj-$(CONFIG_PCIE_PME) += pme.o
obj-$(CONFIG_PCIE_DPC) += dpc.o
obj-$(CONFIG_PCIE_PTM) += ptm.o
+obj-$(CONFIG_PCIE_BW) += bwctrl.o
obj-$(CONFIG_PCIE_EDR) += edr.o
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
new file mode 100644
index 000000000000..4fc6718fc0e5
--- /dev/null
+++ b/drivers/pci/pcie/bwctrl.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * PCI Express Link Bandwidth Notification services driver
+ * Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
+ *
+ * Copyright (C) 2019, Dell Inc
+ *
+ * The PCIe Link Bandwidth Notification provides a way to notify the
+ * operating system when the link width or data rate changes. This
+ * capability is required for all root ports and downstream ports
+ * supporting links wider than x1 and/or multiple link speeds.
+ *
+ * This service port driver hooks into the bandwidth notification interrupt
+ * watching for link speed changes or links becoming degraded in operation
+ * and updates the cached link speed exposed to user space.
+ */
+
+#define dev_fmt(fmt) "bwctrl: " fmt
+
+#include "../pci.h"
+#include "portdrv.h"
+
+static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
+{
+ int ret;
+ u32 lnk_cap;
+
+ ret = pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnk_cap);
+ return (ret == PCIBIOS_SUCCESSFUL) && (lnk_cap & PCI_EXP_LNKCAP_LBNC);
+}
+
+static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
+{
+ u16 link_status;
+ int ret;
+
+ pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
+ pcie_capability_set_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);
+
+ /* Read after enabling notifications to ensure link speed is up to date */
+ ret = pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &link_status);
+ if (ret == PCIBIOS_SUCCESSFUL)
+ pcie_update_link_speed(dev->subordinate, link_status);
+}
+
+static void pcie_disable_link_bandwidth_notification(struct pci_dev *dev)
+{
+ pcie_capability_clear_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);
+}
+
+static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
+{
+ struct pcie_device *srv = context;
+ struct pci_dev *port = srv->port;
+ u16 link_status, events;
+ int ret;
+
+ ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+ events = link_status & PCI_EXP_LNKSTA_LBMS;
+
+ if (ret != PCIBIOS_SUCCESSFUL || !events)
+ return IRQ_NONE;
+
+ pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+
+ /*
+ * The write to clear LBMS prevents getting interrupt from the
+ * latest link speed when the link speed changes between the above
+ * LNKSTA read and write. Therefore, re-read the speed before
+ * updating it.
+ */
+ ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+ if (ret != PCIBIOS_SUCCESSFUL)
+ return IRQ_HANDLED;
+ pcie_update_link_speed(port->subordinate, link_status);
+
+ return IRQ_HANDLED;
+}
+
+static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
+{
+ struct pci_dev *port = srv->port;
+ int ret;
+
+ /* Single-width or single-speed ports do not have to support this. */
+ if (!pcie_link_bandwidth_notification_supported(port))
+ return -ENODEV;
+
+ ret = request_irq(srv->irq, pcie_bw_notification_irq,
+ IRQF_SHARED, "PCIe BW ctrl", srv);
+ if (ret)
+ return ret;
+
+ pcie_enable_link_bandwidth_notification(port);
+ pci_info(port, "enabled with IRQ %d\n", srv->irq);
+
+ return 0;
+}
+
+static void pcie_bandwidth_notification_remove(struct pcie_device *srv)
+{
+ pcie_disable_link_bandwidth_notification(srv->port);
+ free_irq(srv->irq, srv);
+}
+
+static int pcie_bandwidth_notification_suspend(struct pcie_device *srv)
+{
+ pcie_disable_link_bandwidth_notification(srv->port);
+ return 0;
+}
+
+static int pcie_bandwidth_notification_resume(struct pcie_device *srv)
+{
+ pcie_enable_link_bandwidth_notification(srv->port);
+ return 0;
+}
+
+static struct pcie_port_service_driver pcie_bandwidth_notification_driver = {
+ .name = "pcie_bwctrl",
+ .port_type = PCIE_ANY_PORT,
+ .service = PCIE_PORT_SERVICE_BWCTRL,
+ .probe = pcie_bandwidth_notification_probe,
+ .suspend = pcie_bandwidth_notification_suspend,
+ .resume = pcie_bandwidth_notification_resume,
+ .remove = pcie_bandwidth_notification_remove,
+};
+
+int __init pcie_bwctrl_init(void)
+{
+ return pcie_port_service_register(&pcie_bandwidth_notification_driver);
+}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 46fad0d813b2..ed33049bffd6 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -67,7 +67,7 @@ static int pcie_message_numbers(struct pci_dev *dev, int mask,
*/
if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
- PCIE_PORT_SERVICE_BWNOTIF)) {
+ PCIE_PORT_SERVICE_BWCTRL)) {
pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
*pme = (reg16 & PCI_EXP_FLAGS_IRQ) >> 9;
nvec = *pme + 1;
@@ -149,11 +149,11 @@ static int pcie_port_enable_irq_vec(struct pci_dev *dev, int *irqs, int mask)
/* PME, hotplug and bandwidth notification share an MSI/MSI-X vector */
if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
- PCIE_PORT_SERVICE_BWNOTIF)) {
+ PCIE_PORT_SERVICE_BWCTRL)) {
pcie_irq = pci_irq_vector(dev, pme);
irqs[PCIE_PORT_SERVICE_PME_SHIFT] = pcie_irq;
irqs[PCIE_PORT_SERVICE_HP_SHIFT] = pcie_irq;
- irqs[PCIE_PORT_SERVICE_BWNOTIF_SHIFT] = pcie_irq;
+ irqs[PCIE_PORT_SERVICE_BWCTRL_SHIFT] = pcie_irq;
}
if (mask & PCIE_PORT_SERVICE_AER)
@@ -270,7 +270,7 @@ static int get_port_device_capability(struct pci_dev *dev)
pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
if (linkcap & PCI_EXP_LNKCAP_LBNC)
- services |= PCIE_PORT_SERVICE_BWNOTIF;
+ services |= PCIE_PORT_SERVICE_BWCTRL;
}
return services;
@@ -828,6 +828,7 @@ static void __init pcie_init_services(void)
pcie_pme_init();
pcie_dpc_init();
pcie_hp_init();
+ pcie_bwctrl_init();
}
static int __init pcie_portdrv_init(void)
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 58a2b1a1cae4..f622c8a02a5b 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -20,8 +20,8 @@
#define PCIE_PORT_SERVICE_HP (1 << PCIE_PORT_SERVICE_HP_SHIFT)
#define PCIE_PORT_SERVICE_DPC_SHIFT 3 /* Downstream Port Containment */
#define PCIE_PORT_SERVICE_DPC (1 << PCIE_PORT_SERVICE_DPC_SHIFT)
-#define PCIE_PORT_SERVICE_BWNOTIF_SHIFT 4 /* Bandwidth notification */
-#define PCIE_PORT_SERVICE_BWNOTIF (1 << PCIE_PORT_SERVICE_BWNOTIF_SHIFT)
+#define PCIE_PORT_SERVICE_BWCTRL_SHIFT 4 /* Bandwidth Controller (notifications) */
+#define PCIE_PORT_SERVICE_BWCTRL (1 << PCIE_PORT_SERVICE_BWCTRL_SHIFT)
#define PCIE_PORT_DEVICE_MAXSERVICES 5
@@ -53,6 +53,12 @@ int pcie_dpc_init(void);
static inline int pcie_dpc_init(void) { return 0; }
#endif
+#ifdef CONFIG_PCIE_BW
+int pcie_bwctrl_init(void);
+#else
+static inline int pcie_bwctrl_init(void) { return 0; }
+#endif
+
/* Port Type */
#define PCIE_ANY_PORT (~0)
--
2.30.2
* [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (6 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 18:49 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 09/10] thermal: Add PCIe cooling driver Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 10/10] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Ilpo Järvinen,
linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
Add "controller" parts into PCIe bwctrl for limiting PCIe Link Speed
(due to thermal reasons).
PCIe bandwidth controller introduces an in-kernel API to set PCIe Link
Speed. This new API is intended to be used in an upcoming commit that
adds a thermal cooling device to throttle PCIe bandwidth when thermal
thresholds are reached. No users are introduced in this commit yet.
The PCIe bandwidth control procedure is as follows. The requested
speed is validated against the Link Speeds supported by the port and
the downstream device. Then the bandwidth controller sets the Target
Link Speed in the Link Control 2 Register and retrains the PCIe Link.
Bandwidth notifications enable the cur_bus_speed in the struct pci_bus
to keep track of PCIe Link Speed changes. This keeps the link speed
seen through sysfs correct (both for the PCI device and the thermal
cooling device). While bandwidth notifications should also be
generated when the bandwidth controller alters the PCIe Link Speed, a
few platforms do not deliver the LBMS interrupt after Link Training as
expected. Thus, after changing the Link Speed, the bandwidth
controller makes an additional read of the Link Status Register to
ensure cur_bus_speed is consistent with the new PCIe Link Speed.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
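A hypothetical caller sketch of the new API (the real user is the
thermal cooling device added later in this series):

	int ret;

	/* Limit the link below this port to at most 5 GT/s */
	ret = bwctrl_set_current_speed(srv, PCIE_SPEED_5_0GT);
	/* -EAGAIN: link retrained but the requested speed was not reached */
	if (ret && ret != -EAGAIN)
		return ret;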
MAINTAINERS | 7 ++
drivers/pci/pcie/Kconfig | 9 +-
drivers/pci/pcie/bwctrl.c | 177 +++++++++++++++++++++++++++++++++++--
include/linux/pci-bwctrl.h | 17 ++++
4 files changed, 201 insertions(+), 9 deletions(-)
create mode 100644 include/linux/pci-bwctrl.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 90f13281d297..2e4ad074eaf3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16569,6 +16569,13 @@ F: include/linux/pci*
F: include/uapi/linux/pci*
F: lib/pci*
+PCIE BANDWIDTH CONTROLLER
+M: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
+L: linux-pci@vger.kernel.org
+S: Supported
+F: drivers/pci/pcie/bwctrl.c
+F: include/linux/pci-bwctrl.h
+
PCIE DRIVER FOR AMAZON ANNAPURNA LABS
M: Jonathan Chocron <jonnyc@amazon.com>
L: linux-pci@vger.kernel.org
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 1ef8073fa89a..1c6509cf169a 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -138,12 +138,13 @@ config PCIE_PTM
is safe to enable even if you don't.
config PCIE_BW
- bool "PCI Express Bandwidth Change Notification"
+ bool "PCI Express Bandwidth Controller"
depends on PCIEPORTBUS
help
- This enables PCI Express Bandwidth Change Notification. If
- you know link width or rate changes occur to correct unreliable
- links, you may answer Y.
+ This enables the PCI Express Bandwidth Controller. The Bandwidth
+ Controller allows controlling PCIe link speed and listens for Link
+ Speed Change Notifications. If you know link width or rate changes
+ occur to correct unreliable links, you may answer Y.
config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 4fc6718fc0e5..e3172d69476f 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -1,14 +1,16 @@
// SPDX-License-Identifier: GPL-2.0+
/*
- * PCI Express Link Bandwidth Notification services driver
+ * PCIe bandwidth controller
+ *
* Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
*
* Copyright (C) 2019, Dell Inc
+ * Copyright (C) 2023 Intel Corporation.
*
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes. This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
+ * The PCIe Bandwidth Controller provides a way to alter PCIe link speeds
+ * and notify the operating system when the link width or data rate changes.
+ * The notification capability is required for all Root Ports and Downstream
+ * Ports supporting links wider than x1 and/or multiple link speeds.
*
* This service port driver hooks into the bandwidth notification interrupt
* watching for link speed changes or links becoming degraded in operation
@@ -17,9 +19,48 @@
#define dev_fmt(fmt) "bwctrl: " fmt
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/types.h>
+
+#include <asm/rwonce.h>
+
#include "../pci.h"
#include "portdrv.h"
+/**
+ * struct bwctrl_service_data - PCIe Port Bandwidth Controller
+ * @set_speed_mutex: serializes link speed changes
+ */
+struct bwctrl_service_data {
+ struct mutex set_speed_mutex;
+};
+
+static bool bwctrl_valid_pcie_speed(enum pci_bus_speed speed)
+{
+ return (speed >= PCIE_SPEED_2_5GT) && (speed <= PCIE_SPEED_64_0GT);
+}
+
+static u16 speed2lnkctl2(enum pci_bus_speed speed)
+{
+ static const u16 speed_conv[] = {
+ [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
+ [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
+ [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
+ [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
+ [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
+ [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
+ };
+
+ if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
+ return 0;
+
+ return speed_conv[speed];
+}
+
static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
{
int ret;
@@ -77,8 +118,118 @@ static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
return IRQ_HANDLED;
}
+/* Configure target speed to the requested speed and set train link */
+static int bwctrl_set_speed(struct pci_dev *port, u16 lnkctl2_speed)
+{
+ int ret;
+
+ ret = pcie_capability_clear_and_set_word(port, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, lnkctl2_speed);
+ if (ret != PCIBIOS_SUCCESSFUL)
+ return pcibios_err_to_errno(ret);
+
+ return 0;
+}
+
+static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
+{
+ struct pci_bus *bus = srv->port->subordinate;
+ u8 speeds, dev_speeds;
+ int i;
+
+ if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
+ return -EINVAL;
+
+ dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
+ /* Only the lowest speed can be set when there are no devices */
+ if (!dev_speeds)
+ dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
+
+ /*
+ * Implementation Note in PCIe r6.0.1 sec 7.5.3.18 recommends that the
+ * OS utilize the Supported Link Speeds Vector for determining which
+ * link speeds are supported.
+ *
+ * Take into account Supported Link Speeds both from the Root Port
+ * and the device.
+ */
+ speeds = bus->pcie_bus_speeds & dev_speeds;
+ i = BIT(fls(speeds));
+ while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
+ enum pci_bus_speed candidate;
+
+ if (speeds & i) {
+ candidate = PCIE_LNKCAP2_SLS2SPEED(i);
+ if (candidate <= *speed) {
+ *speed = candidate;
+ return 0;
+ }
+ }
+ i >>= 1;
+ }
+
+ return -EINVAL;
+}
+
+/**
+ * bwctrl_set_current_speed - Set downstream link speed for PCIe port
+ * @srv: PCIe port
+ * @speed: PCIe bus speed to set
+ *
+ * Attempts to set PCIe port link speed to @speed. As long as @speed is less
+ * than the maximum of what is supported by @srv, the speed is adjusted
+ * downwards to the best speed supported by both the port and device
+ * underneath it.
+ *
+ * Return:
+ * * 0 - on success
+ * * -EINVAL - @speed is higher than the maximum @srv supports
+ * * -ETIMEDOUT - changing link speed took too long
+ * * -EAGAIN - link speed was changed but @speed was not achieved
+ */
+int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed)
+{
+ struct bwctrl_service_data *data = get_service_data(srv);
+ struct pci_dev *port = srv->port;
+ u16 link_status;
+ int ret;
+
+ if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
+ return -EINVAL;
+
+ ret = bwctrl_select_speed(srv, &speed);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&data->set_speed_mutex);
+ ret = bwctrl_set_speed(port, speed2lnkctl2(speed));
+ if (ret < 0)
+ goto unlock;
+
+ ret = pcie_retrain_link(port, true);
+ if (ret < 0)
+ goto unlock;
+
+ /*
+ * Ensure link speed updates also with platforms that have problems
+ * with notifications
+ */
+ ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+ if (ret == PCIBIOS_SUCCESSFUL)
+ pcie_update_link_speed(port->subordinate, link_status);
+
+ if (port->subordinate->cur_bus_speed != speed)
+ ret = -EAGAIN;
+
+unlock:
+ mutex_unlock(&data->set_speed_mutex);
+
+ return ret;
+}
+
static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
{
+ struct bwctrl_service_data *data;
struct pci_dev *port = srv->port;
int ret;
@@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
if (ret)
return ret;
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data) {
+ ret = -ENOMEM;
+ goto free_irq;
+ }
+ mutex_init(&data->set_speed_mutex);
+ set_service_data(srv, data);
+
pcie_enable_link_bandwidth_notification(port);
pci_info(port, "enabled with IRQ %d\n", srv->irq);
return 0;
+
+free_irq:
+ free_irq(srv->irq, srv);
+ return ret;
}
static void pcie_bandwidth_notification_remove(struct pcie_device *srv)
{
+ struct bwctrl_service_data *data = get_service_data(srv);
+
pcie_disable_link_bandwidth_notification(srv->port);
free_irq(srv->irq, srv);
+ mutex_destroy(&data->set_speed_mutex);
+ kfree(data);
}
static int pcie_bandwidth_notification_suspend(struct pcie_device *srv)
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
new file mode 100644
index 000000000000..8eae09bd03b5
--- /dev/null
+++ b/include/linux/pci-bwctrl.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCIe bandwidth controller
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ */
+
+#ifndef LINUX_PCI_BWCTRL_H
+#define LINUX_PCI_BWCTRL_H
+
+#include <linux/pci.h>
+
+struct pcie_device;
+
+int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed);
+
+#endif
--
2.30.2
* [PATCH v3 09/10] thermal: Add PCIe cooling driver
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (7 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
2023-12-30 19:08 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 10/10] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen
9 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Ilpo Järvinen, Bjorn Helgaas,
Daniel Lezcano, Amit Kucheria, Zhang Rui, linux-kernel
Cc: Alex Deucher
Add a thermal cooling driver to provide a path for accessing the PCIe
bandwidth controller through the usual thermal interfaces. A cooling
device is instantiated by the bwctrl service driver for each
controllable PCIe port.
The thermal side state 0 means no throttling, i.e., maximum supported
PCIe speed.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
---
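From user space the cooling device then looks like any other thermal
cooling device (the cooling device number and BDF below are made up):

	$ cat /sys/class/thermal/cooling_device4/type
	PCIe_Port_Link_Speed_0000:00:1c.0
	$ cat /sys/class/thermal/cooling_device4/max_state
	2
	$ echo 2 > /sys/class/thermal/cooling_device4/cur_state	# 2.5 GT/s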
MAINTAINERS | 1 +
drivers/pci/pcie/bwctrl.c | 11 ++++
drivers/thermal/Kconfig | 10 +++
drivers/thermal/Makefile | 2 +
drivers/thermal/pcie_cooling.c | 107 +++++++++++++++++++++++++++++++++
include/linux/pci-bwctrl.h | 16 +++++
6 files changed, 147 insertions(+)
create mode 100644 drivers/thermal/pcie_cooling.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e4ad074eaf3..40dd7c0b154d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16574,6 +16574,7 @@ M: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
L: linux-pci@vger.kernel.org
S: Supported
F: drivers/pci/pcie/bwctrl.c
+F: drivers/thermal/pcie_cooling.c
F: include/linux/pci-bwctrl.h
PCIE DRIVER FOR AMAZON ANNAPURNA LABS
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index e3172d69476f..13c73546244e 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -34,9 +34,11 @@
/**
* struct bwctrl_service_data - PCIe Port Bandwidth Controller
* @set_speed_mutex: serializes link speed changes
+ * @cdev: thermal cooling device associated with the port
*/
struct bwctrl_service_data {
struct mutex set_speed_mutex;
+ struct thermal_cooling_device *cdev;
};
static bool bwctrl_valid_pcie_speed(enum pci_bus_speed speed)
@@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
pcie_enable_link_bandwidth_notification(port);
pci_info(port, "enabled with IRQ %d\n", srv->irq);
+ data->cdev = pcie_cooling_device_register(port, srv);
+ if (IS_ERR(data->cdev)) {
+ ret = PTR_ERR(data->cdev);
+ goto disable_notifications;
+ }
return 0;
+disable_notifications:
+ pcie_disable_link_bandwidth_notification(srv->port);
+ kfree(data);
free_irq:
free_irq(srv->irq, srv);
return ret;
@@ -264,6 +274,7 @@ static void pcie_bandwidth_notification_remove(struct pcie_device *srv)
{
struct bwctrl_service_data *data = get_service_data(srv);
+ pcie_cooling_device_unregister(data->cdev);
pcie_disable_link_bandwidth_notification(srv->port);
free_irq(srv->irq, srv);
mutex_destroy(&data->set_speed_mutex);
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index c81a00fbca7d..3a071396f1c6 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -219,6 +219,16 @@ config DEVFREQ_THERMAL
If you want this support, you should say Y here.
+config PCIE_THERMAL
+ bool "PCIe cooling support"
+ depends on PCIEPORTBUS
+ select PCIE_BW
+ help
+ This implements PCIe cooling mechanism through bandwidth reduction
+ for PCIe devices.
+
+ If you want this support, you should say Y here.
+
config THERMAL_EMULATION
bool "Thermal emulation mode support"
help
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index c934cab309ae..a0b25a2742b7 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -30,6 +30,8 @@ thermal_sys-$(CONFIG_CPU_IDLE_THERMAL) += cpuidle_cooling.o
# devfreq cooling
thermal_sys-$(CONFIG_DEVFREQ_THERMAL) += devfreq_cooling.o
+thermal_sys-$(CONFIG_PCIE_THERMAL) += pcie_cooling.o
+
obj-$(CONFIG_K3_THERMAL) += k3_bandgap.o k3_j72xx_bandgap.o
# platform thermal drivers
obj-y += broadcom/
diff --git a/drivers/thermal/pcie_cooling.c b/drivers/thermal/pcie_cooling.c
new file mode 100644
index 000000000000..c23b59dd0331
--- /dev/null
+++ b/drivers/thermal/pcie_cooling.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCIe cooling device
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ */
+
+#include <linux/build_bug.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/thermal.h>
+
+#define COOLING_DEV_TYPE_PREFIX "PCIe_Port_Link_Speed_"
+
+struct pcie_cooling_device {
+ struct pci_dev *port;
+ struct pcie_device *pdev;
+};
+
+static int pcie_cooling_get_max_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ /* cooling state 0 is the same as the maximum PCIe speed */
+ *state = pcie_cdev->port->subordinate->max_bus_speed - PCIE_SPEED_2_5GT;
+
+ return 0;
+}
+
+static int pcie_cooling_get_cur_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ /* cooling state 0 corresponds to the maximum PCIe speed */
+ *state = cdev->max_state -
+ (pcie_cdev->port->subordinate->cur_bus_speed - PCIE_SPEED_2_5GT);
+
+ return 0;
+}
+
+static int pcie_cooling_set_cur_level(struct thermal_cooling_device *cdev, unsigned long state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+ enum pci_bus_speed speed;
+
+ /* cooling state 0 corresponds to the maximum PCIe speed */
+ speed = (cdev->max_state - state) + PCIE_SPEED_2_5GT;
+
+ return bwctrl_set_current_speed(pcie_cdev->pdev, speed);
+}
+
+static struct thermal_cooling_device_ops pcie_cooling_ops = {
+ .get_max_state = pcie_cooling_get_max_level,
+ .get_cur_state = pcie_cooling_get_cur_level,
+ .set_cur_state = pcie_cooling_set_cur_level,
+};
+
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev)
+{
+ struct pcie_cooling_device *pcie_cdev;
+ struct thermal_cooling_device *cdev;
+ size_t name_len;
+ char *name;
+
+ pcie_cdev = kzalloc(sizeof(*pcie_cdev), GFP_KERNEL);
+ if (!pcie_cdev)
+ return ERR_PTR(-ENOMEM);
+
+ pcie_cdev->port = port;
+ pcie_cdev->pdev = pdev;
+
+ name_len = strlen(COOLING_DEV_TYPE_PREFIX) + strlen(pci_name(port)) + 1;
+ name = kzalloc(name_len, GFP_KERNEL);
+ if (!name) {
+ kfree(pcie_cdev);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ snprintf(name, name_len, COOLING_DEV_TYPE_PREFIX "%s", pci_name(port));
+ cdev = thermal_cooling_device_register(name, pcie_cdev, &pcie_cooling_ops);
+ kfree(name);
+
+ return cdev;
+}
+
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ thermal_cooling_device_unregister(cdev);
+ kfree(pcie_cdev);
+}
+
+/* For bus_speed <-> state arithmetic */
+static_assert(PCIE_SPEED_2_5GT + 1 == PCIE_SPEED_5_0GT);
+static_assert(PCIE_SPEED_5_0GT + 1 == PCIE_SPEED_8_0GT);
+static_assert(PCIE_SPEED_8_0GT + 1 == PCIE_SPEED_16_0GT);
+static_assert(PCIE_SPEED_16_0GT + 1 == PCIE_SPEED_32_0GT);
+static_assert(PCIE_SPEED_32_0GT + 1 == PCIE_SPEED_64_0GT);
+
+MODULE_AUTHOR("Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>");
+MODULE_DESCRIPTION("PCIe cooling driver");
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
index 8eae09bd03b5..366445517b72 100644
--- a/include/linux/pci-bwctrl.h
+++ b/include/linux/pci-bwctrl.h
@@ -11,7 +11,23 @@
#include <linux/pci.h>
struct pcie_device;
+struct thermal_cooling_device;
int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed);
+#ifdef CONFIG_PCIE_THERMAL
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev);
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev);
+#else
+static inline struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev)
+{
+ return NULL;
+}
+static inline void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+}
+#endif
+
#endif
--
2.30.2
* [PATCH v3 10/10] selftests/pcie_bwctrl: Create selftests
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
` (8 preceding siblings ...)
2023-09-29 11:57 ` [PATCH v3 09/10] thermal: Add PCIe cooling driver Ilpo Järvinen
@ 2023-09-29 11:57 ` Ilpo Järvinen
9 siblings, 0 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2023-09-29 11:57 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Lukas Wunner, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Shuah Khan, Ilpo Järvinen,
linux-kernel, linux-kselftest
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
Create selftests for PCIe BW control through the PCIe cooling device
sysfs interface.
First, the BW control selftest finds the PCIe port to test with. By
default, the PCIe port with the highest bus speed is selected but
another PCIe port can be provided with the -d parameter.
The actual test steps the cur_state of the cooling device one-by-one
from max_state to what the cur_state was initially. The speed change
is confirmed by observing the current_link_speed for the corresponding
PCIe port.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
---
MAINTAINERS | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/pcie_bwctrl/Makefile | 2 +
.../pcie_bwctrl/set_pcie_cooling_state.sh | 122 ++++++++++++++++++
.../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++++++++
5 files changed, 193 insertions(+)
create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
diff --git a/MAINTAINERS b/MAINTAINERS
index 40dd7c0b154d..1a35f75323fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16576,6 +16576,7 @@ S: Supported
F: drivers/pci/pcie/bwctrl.c
F: drivers/thermal/pcie_cooling.c
F: include/linux/pci-bwctrl.h
+F: tools/testing/selftests/pcie_bwctrl/
PCIE DRIVER FOR AMAZON ANNAPURNA LABS
M: Jonathan Chocron <jonnyc@amazon.com>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 42806add0114..18ad9acd440a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -59,6 +59,7 @@ TARGETS += net/mptcp
TARGETS += net/openvswitch
TARGETS += netfilter
TARGETS += nsfs
+TARGETS += pcie_bwctrl
TARGETS += perf_events
TARGETS += pidfd
TARGETS += pid_namespace
diff --git a/tools/testing/selftests/pcie_bwctrl/Makefile b/tools/testing/selftests/pcie_bwctrl/Makefile
new file mode 100644
index 000000000000..3e84e26341d1
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/Makefile
@@ -0,0 +1,2 @@
+TEST_PROGS = set_pcie_cooling_state.sh
+include ../lib.mk
diff --git a/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh b/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
new file mode 100755
index 000000000000..3a8f91f0309e
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
@@ -0,0 +1,122 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+SYSFS=
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+retval=0
+skipmsg="skip all tests:"
+
+PCIEPORTTYPE="PCIe_Port_Link_Speed"
+
+prerequisite()
+{
+ local ports
+
+ if [ $UID != 0 ]; then
+ echo $skipmsg must be run as root >&2
+ exit $ksft_skip
+ fi
+
+ SYSFS=`mount -t sysfs | head -1 | awk '{ print $3 }'`
+
+ if [ ! -d "$SYSFS" ]; then
+ echo $skipmsg sysfs is not mounted >&2
+ exit $ksft_skip
+ fi
+
+ if ! ls $SYSFS/class/thermal/cooling_device* > /dev/null 2>&1; then
+ echo $skipmsg thermal cooling devices missing >&2
+ exit $ksft_skip
+ fi
+
+ ports=`grep -e "^$PCIEPORTTYPE" $SYSFS/class/thermal/cooling_device*/type | wc -l`
+ if [ $ports -eq 0 ]; then
+ echo $skipmsg pcie cooling devices missing >&2
+ exit $ksft_skip
+ fi
+}
+
+testport=
+find_pcie_port()
+{
+ local patt="$1"
+ local pcieports
+ local max
+ local cur
+ local delta
+ local bestdelta=-1
+
+ pcieports=`grep -l -F -e "$patt" /sys/class/thermal/cooling_device*/type`
+ if [ -z "$pcieports" ]; then
+ return
+ fi
+ pcieports=${pcieports//\/type/}
+ # Find the port with the highest PCIe Link Speed
+ for port in $pcieports; do
+ max=`cat $port/max_state`
+ cur=`cat $port/cur_state`
+ delta=$((max-cur))
+ if [ $delta -gt $bestdelta ]; then
+ testport="$port"
+ bestdelta=$delta
+ fi
+ done
+}
+
+sysfspcidev=
+find_sysfs_pci_dev()
+{
+ local typefile="$1/type"
+ local pcidir
+
+ pcidir="$SYSFS/bus/pci/devices/`sed -e "s|^${PCIEPORTTYPE}_||g" $typefile`"
+
+ if [ -r "$pcidir/current_link_speed" ]; then
+ sysfspcidev="$pcidir/current_link_speed"
+ fi
+}
+
+usage()
+{
+ echo "Usage $0 [ -d dev ]"
+ echo -e "\t-d: PCIe port BDF string (e.g., 0000:00:04.0)"
+}
+
+pattern="$PCIEPORTTYPE"
+parse_arguments()
+{
+ while getopts d:h opt; do
+ case $opt in
+ h)
+ usage "$0"
+ exit 0
+ ;;
+ d)
+ pattern="$PCIEPORTTYPE_$OPTARG"
+ ;;
+ *)
+ usage "$0"
+ exit 0
+ ;;
+ esac
+ done
+}
+
+parse_arguments "$@"
+prerequisite
+find_pcie_port "$pattern"
+if [ -z "$testport" ]; then
+ echo $skipmsg "pcie cooling device not found from sysfs" >&2
+ exit $ksft_skip
+fi
+find_sysfs_pci_dev "$testport"
+if [ -z "$sysfspcidev" ]; then
+ echo $skipmsg "PCIe port device not found from sysfs" >&2
+ exit $ksft_skip
+fi
+
+./set_pcie_speed.sh "$testport" "$sysfspcidev"
+retval=$?
+
+exit $retval
diff --git a/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh b/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
new file mode 100755
index 000000000000..584596949312
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+set -e
+
+TESTNAME=set_pcie_speed
+
+declare -a PCIELINKSPEED=(
+ "2.5 GT/s PCIe"
+ "5.0 GT/s PCIe"
+ "8.0 GT/s PCIe"
+ "16.0 GT/s PCIe"
+ "32.0 GT/s PCIe"
+ "64.0 GT/s PCIe"
+)
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+retval=0
+
+coolingdev="$1"
+statefile="$coolingdev/cur_state"
+maxfile="$coolingdev/max_state"
+linkspeedfile="$2"
+
+oldstate=`cat $statefile`
+maxstate=`cat $maxfile`
+
+set_state()
+{
+ local state=$1
+ local linkspeed
+ local expected_linkspeed
+
+ echo $state > $statefile
+
+ sleep 1
+
+ linkspeed="`cat $linkspeedfile`"
+ expected_linkspeed=$((maxstate-state))
+ expected_str="${PCIELINKSPEED[$expected_linkspeed]}"
+ if [ ! "${expected_str}" = "${linkspeed}" ]; then
+ echo "$TESTNAME failed: expected: ${expected_str}; got ${linkspeed}"
+ retval=1
+ fi
+}
+
+cleanup_skip ()
+{
+ set_state $oldstate
+ exit $ksft_skip
+}
+
+trap cleanup_skip EXIT
+
+echo "$TESTNAME: testing states $maxstate .. $oldstate with $coolingdev"
+for i in $(seq $maxstate -1 $oldstate); do
+ set_state "$i"
+done
+
+trap EXIT
+if [ $retval -eq 0 ]; then
+ echo "$TESTNAME [PASS]"
+else
+ echo "$TESTNAME [FAIL]"
+fi
+exit $retval
--
2.30.2
* Re: [PATCH v3 04/10] RDMA/hfi1: Use RMW accessors for changing LNKCTL2
2023-09-29 11:57 ` [PATCH v3 04/10] RDMA/hfi1: " Ilpo Järvinen
@ 2023-09-29 13:03 ` Dean Luick
0 siblings, 0 replies; 29+ messages in thread
From: Dean Luick @ 2023-09-29 13:03 UTC (permalink / raw)
To: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
Rob Herring, Krzysztof Wilczyński, Lukas Wunner,
Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Dennis Dalessandro, Jason Gunthorpe,
Leon Romanovsky, linux-rdma, linux-kernel
Cc: Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On 9/29/2023 6:57 AM, Ilpo Järvinen wrote:
> Don't assume that only the driver would be accessing LNKCTL2. In the
> case of upstream (parent), the driver does not even own the device it's
> changing the registers for.
>
> Use RMW capability accessors which do proper locking to avoid losing
> concurrent updates to the register value. This change is also useful as
> a cleanup.
>
> Suggested-by: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> ---
> drivers/infiniband/hw/hfi1/pcie.c | 30 ++++++++----------------------
> 1 file changed, 8 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index 08732e1ac966..4487a05bea04 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -1212,14 +1212,11 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
> (u32)lnkctl2);
> /* only write to parent if target is not as high as ours */
> if ((lnkctl2 & PCI_EXP_LNKCTL2_TLS) < target_vector) {
> - lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> - lnkctl2 |= target_vector;
> - dd_dev_info(dd, "%s: ..new link control2: 0x%x\n", __func__,
> - (u32)lnkctl2);
> - ret = pcie_capability_write_word(parent,
> - PCI_EXP_LNKCTL2, lnkctl2);
> + ret = pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL2,
> + PCI_EXP_LNKCTL2_TLS,
> + target_vector);
> if (ret) {
> - dd_dev_err(dd, "Unable to write to PCI config\n");
> + dd_dev_err(dd, "Unable to change parent PCI target speed\n");
> return_error = 1;
> goto done;
> }
> @@ -1228,22 +1225,11 @@ int do_pcie_gen3_transition(struct hfi1_devdata *dd)
> }
>
> dd_dev_info(dd, "%s: setting target link speed\n", __func__);
> - ret = pcie_capability_read_word(dd->pcidev, PCI_EXP_LNKCTL2, &lnkctl2);
> + ret = pcie_capability_clear_and_set_word(dd->pcidev, PCI_EXP_LNKCTL2,
> + PCI_EXP_LNKCTL2_TLS,
> + target_vector);
> if (ret) {
> - dd_dev_err(dd, "Unable to read from PCI config\n");
> - return_error = 1;
> - goto done;
> - }
> -
> - dd_dev_info(dd, "%s: ..old link control2: 0x%x\n", __func__,
> - (u32)lnkctl2);
> - lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> - lnkctl2 |= target_vector;
> - dd_dev_info(dd, "%s: ..new link control2: 0x%x\n", __func__,
> - (u32)lnkctl2);
> - ret = pcie_capability_write_word(dd->pcidev, PCI_EXP_LNKCTL2, lnkctl2);
> - if (ret) {
> - dd_dev_err(dd, "Unable to write to PCI config\n");
> + dd_dev_err(dd, "Unable to change device PCI target speed\n");
> return_error = 1;
> goto done;
> }
Looks good. Thank you for incorporating my suggested changes.
Reviewed-by: Dean Luick <dean.luick@cornelisnetworks.com>
-Dean
* Re: [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking
2023-09-29 11:57 ` [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
@ 2023-12-30 10:33 ` Lukas Wunner
0 siblings, 0 replies; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 10:33 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Jonathan Corbet,
linux-doc, linux-kernel, Alex Deucher, Daniel Lezcano,
Amit Kucheria, Zhang Rui
On Fri, Sep 29, 2023 at 02:57:14PM +0300, Ilpo Järvinen wrote:
> PCIe Bandwidth Controller performs RMW accesses the Link Control 2
^
to
> Register which can occur concurrently to other sources of Link Control
> 2 Register writes. Therefore, add Link Control 2 Register among the PCI
> Express Capability Registers that need RMW locking.
[...]
> --- a/Documentation/PCI/pciebus-howto.rst
> +++ b/Documentation/PCI/pciebus-howto.rst
> @@ -218,7 +218,7 @@ that is shared between many drivers including the service drivers.
> RMW Capability accessors (pcie_capability_clear_and_set_word(),
> pcie_capability_set_word(), and pcie_capability_clear_word()) protect
> a selected set of PCI Express Capability Registers (Link Control
> -Register and Root Control Register). Any change to those registers
> -should be performed using RMW accessors to avoid problems due to
> -concurrent updates. For the up-to-date list of protected registers,
> -see pcie_capability_clear_and_set_word().
> +Register, Root Control Register, and Link Control 2 Register). Any
> +change to those registers should be performed using RMW accessors to
> +avoid problems due to concurrent updates. For the up-to-date list of
> +protected registers, see pcie_capability_clear_and_set_word().
Maybe use a list of bullet points of the affected registers so that
this can be extended more easily in the future.
Otherwise,
Reviewed-by: Lukas Wunner <lukas@wunner.de>
* Re: [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2023-09-29 11:57 ` [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
@ 2023-12-30 11:45 ` Lukas Wunner
2023-12-30 19:30 ` Lukas Wunner
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 11:45 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel,
Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On Fri, Sep 29, 2023 at 02:57:18PM +0300, Ilpo Järvinen wrote:
> struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
> sec 7.5.3.18, however, recommends determining supported Link Speeds
> using the Supported Link Speeds Vector in the Link Capabilities 2
> Register (when available).
>
> Add pcie_bus_speeds into struct pci_bus which caches the Supported Link
> Speeds. The value is taken directly from the Supported Link Speeds
> Vector or synthesized from the Max Link Speed in the Link Capabilities
> Register when the Link Capabilities 2 Register is not available.
Remind me, what's the reason again to cache this and why is
max_bus_speed not sufficient? Is the point that there may be
"gaps" in the supported link speeds, i.e. not every bit below
the maximum supported speed may be set? And you need to skip
over those gaps when throttling to a lower speed?
Maybe this becomes apparent in a later patch but from a reviewer's
perspective starting at patch 1 and working one's way forward through
the series, it's a bit puzzling, so an explanation in the commit
message would be beneficial.
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
[...]
> +static u8 pcie_get_supported_speeds(u32 linkcap, u32 linkcap2)
> +{
> + u8 speeds;
> +
> + speeds = linkcap2 & PCI_EXP_LNKCAP2_SLS;
> + if (speeds)
> + return speeds;
> +
> + /*
> + * Synthesize supported link speeds from the Max Link Speed in the
> + * Link Capabilities Register.
> + */
> + speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
> + if ((linkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
> + speeds |= PCI_EXP_LNKCAP2_SLS_5_0GB;
> + return speeds;
> +}
This seems to duplicate portions of pcie_get_speed_cap().
Can you refactor that function to take advantage of the cached value,
i.e. basically return PCIE_LNKCAP2_SLS2SPEED(dev->bus->pcie_bus_speeds)?
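Something like this, perhaps (just a sketch, not tested; for Endpoints
and Upstream Ports the cached vector would presumably be the device one
added in the next patch rather than pcie_bus_speeds):

	enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
	{
		/* Reuse the vector cached at enumeration time instead
		 * of re-reading LNKCAP/LNKCAP2 on every call.
		 */
		return PCIE_LNKCAP2_SLS2SPEED(dev->bus->pcie_bus_speeds);
	}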
Also, I note that pci_set_bus_speed() doesn't use LNKCAP2.
Presumably that's a historic artefact but maybe it can be
converted to use LNKCAP2 as well. Granted, it's not directly
related to this series, but always nice to clean up and
rationalize the code.
Thanks,
Lukas
* Re: [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector
2023-09-29 11:57 ` [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector Ilpo Järvinen
@ 2023-12-30 15:19 ` Lukas Wunner
2024-01-01 18:31 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 15:19 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel,
Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On Fri, Sep 29, 2023 at 02:57:19PM +0300, Ilpo Järvinen wrote:
> The Supported Link Speeds Vector in the Link Capabilities Register 2
> corresponds to the bus below on Root Ports and Downstream Ports,
> whereas it corresponds to the bus above on Upstream Ports and
> Endpoints.
It would be good to add a pointer to the spec here. I think the
relevant section is PCIe r6.1 sec 7.5.3.18 which says:
"Supported Link Speeds Vector - This field indicates the supported
Link speed(s) of the associated Port."
^^^^^^^^^^^^^^^
Obviously the associated port is upstream on a Switch Upstream Port
or Endpoint, whereas it is downstream on a Switch Downstream Port
or Root Port.
Come to think of it, what about edge cases such as RCiEPs?
> Only the former is currently cached in pcie_bus_speeds in
> the struct pci_bus. The link speeds that are supported is the
> intersection of these two.
I'm wondering if caching both is actually necessary. Why not cache
just the intersection? Do we need either of the two somewhere?
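E.g. something like this at enumeration time (a rough sketch; the
combined field name here is made up, it's not in this series):

	/* Hypothetical: keep only the intersection if neither vector
	 * is consumed individually anywhere else.
	 */
	bus->pcie_link_speeds = bus->pcie_bus_speeds & fn0_speeds;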
> Store the device's Supported Link Speeds Vector into the struct pci_bus
> when Function 0 is enumerated (Multi-Function Devices must have the
> same speeds for all Functions) to be easily able to calculate
> the intersection of Supported Link Speeds.
Might want to add an explanation what you're going to need this for,
I assume it's accessed frequently by the bandwidth throttling driver
in a subsequent patch?
Thanks,
Lukas
* Re: [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
2023-09-29 11:57 ` [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
@ 2023-12-30 15:58 ` Lukas Wunner
2024-01-01 17:37 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 15:58 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel,
Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On Fri, Sep 29, 2023 at 02:57:20PM +0300, Ilpo Järvinen wrote:
> This mostly reverts b4c7d2076b4e ("PCI/LINK: Remove bandwidth
> notification"), however, there are small tweaks:
>
> 1) Call it PCIe bwctrl (bandwidth controller) instead of just
> bandwidth notifications.
> 2) Don't print the notifications into kernel log, just keep the current
> link speed updated.
> 3) Use concurrency safe LNKCTL RMW operations.
> 4) Read link speed after enabling the notification to ensure the
> current link speed is correct from the start.
> 5) Add local variable in probe for srv->port.
> 6) Handle link speed read and LBMS write race in
> pcie_bw_notification_irq().
>
> The reason for 1) is to indicate the increased scope of the driver. A
> subsequent commit extends the driver to allow controlling PCIe
> bandwidths from user space upon crossing thermal thresholds.
>
> While 2) is somewhat unfortunate, the log spam was the source of
> complaints that eventually lead to the removal of the bandwidth
> notifications driver (see the links below for further information).
> After re-adding this driver back the userspace can, if it wishes to,
> observe the link speed changes using the current bus speed files under
> sysfs.
Good commit message.
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -137,6 +137,14 @@ config PCIE_PTM
> This is only useful if you have devices that support PTM, but it
> is safe to enable even if you don't.
>
> +config PCIE_BW
> + bool "PCI Express Bandwidth Change Notification"
> + depends on PCIEPORTBUS
> + help
> + This enables PCI Express Bandwidth Change Notification. If
> + you know link width or rate changes occur to correct unreliable
> + links, you may answer Y.
> +
For an end user browsing Kconfig entries, this isn't as helpful as it
could be. Maybe mention that autonomous link changes are automatically
picked up and observable through sysfs (name the relevant attributes).
> --- /dev/null
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -0,0 +1,131 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * PCI Express Link Bandwidth Notification services driver
> + * Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> + *
> + * Copyright (C) 2019, Dell Inc
> + *
> + * The PCIe Link Bandwidth Notification provides a way to notify the
> + * operating system when the link width or data rate changes. This
> + * capability is required for all root ports and downstream ports
> + * supporting links wider than x1 and/or multiple link speeds.
Capitalize Root Ports and Downstream Ports.
Reference the spec section prescribing this.
> +static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
> +{
> + int ret;
> + u32 lnk_cap;
Inverse Christmas tree?
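(That is, ordering the local variable declarations from the longest
line to the shortest:

	u32 lnk_cap;
	int ret;

rather than the other way around.)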
> +static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
> +{
> + u16 link_status;
> + int ret;
> +
> + pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
> + pcie_capability_set_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);
I'm wondering why we're not enabling LABIE as well?
(And clear LABS.)
Can't it happen that we miss bandwidth changes unless we enable that
as well?
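I'd have expected something along these lines (an untested sketch:
clear both stale status bits, then enable both interrupt sources):

	pcie_capability_write_word(dev, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS);
	pcie_capability_set_word(dev, PCI_EXP_LNKCTL,
				 PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);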
> +static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> +{
> + struct pci_dev *port = srv->port;
> + int ret;
> +
> + /* Single-width or single-speed ports do not have to support this. */
> + if (!pcie_link_bandwidth_notification_supported(port))
> + return -ENODEV;
I'm wondering if this should be checked in get_port_device_capability()
instead?
> + ret = request_irq(srv->irq, pcie_bw_notification_irq,
> + IRQF_SHARED, "PCIe BW ctrl", srv);
Is there a reason to run the IRQ handler in hardirq context
or would it work to run it in an IRQ thread? Usually on systems
that enable PREEMPT_RT, a threaded IRQ handler is preferred,
so unless hardirq context is necessary, I'd recommend using
an IRQ thread.
Thanks,
Lukas
* Re: [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl
2023-09-29 11:57 ` [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl Ilpo Järvinen
@ 2023-12-30 18:49 ` Lukas Wunner
2024-01-01 18:12 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 18:49 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel,
Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16569,6 +16569,13 @@ F: include/linux/pci*
> F: include/uapi/linux/pci*
> F: lib/pci*
>
> +PCIE BANDWIDTH CONTROLLER
> +M: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> +L: linux-pci@vger.kernel.org
> +S: Supported
> +F: drivers/pci/pcie/bwctrl.c
> +F: include/linux/pci-bwctrl.h
> +
Maybe add this already in the preceding patch.
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -138,12 +138,13 @@ config PCIE_PTM
> is safe to enable even if you don't.
>
> config PCIE_BW
> - bool "PCI Express Bandwidth Change Notification"
> + bool "PCI Express Bandwidth Controller"
I'd fold this change into the preceding patch.
Deleting a line introduced by the preceding patch isn't great.
Never mind that the patch isn't a clean revert that way.
Your approach of making changes to the original version
of the bandwidth monitoring driver and calling out those
changes in the commit message seems fine to me.
> depends on PCIEPORTBUS
> help
> - This enables PCI Express Bandwidth Change Notification. If
> - you know link width or rate changes occur to correct unreliable
> - links, you may answer Y.
> + This enables the PCI Express Bandwidth Controller. The Bandwidth
> + Controller allows controlling the PCIe link speed and listens for Link
> + Speed Change Notifications. If you know link width or rate changes
> + occur to correct unreliable links, you may answer Y.
It would be neater if you could avoid deleting lines introduced by
the preceding patch. Ideally you'd add a new paragraph, thus only
add new lines but not delete any existing ones.
> --- a/drivers/pci/pcie/bwctrl.c
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -1,14 +1,16 @@
> // SPDX-License-Identifier: GPL-2.0+
> /*
> - * PCI Express Link Bandwidth Notification services driver
> + * PCIe bandwidth controller
> + *
Again, fold this change into the preceding patch please.
> * Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> *
> * Copyright (C) 2019, Dell Inc
> + * Copyright (C) 2023 Intel Corporation.
For extra neatness, drop the comma in the preceding patch and
the period in this patch.
> - * The PCIe Link Bandwidth Notification provides a way to notify the
> - * operating system when the link width or data rate changes. This
> - * capability is required for all root ports and downstream ports
> - * supporting links wider than x1 and/or multiple link speeds.
> + * The PCIe Bandwidth Controller provides a way to alter PCIe link speeds
> + * and notify the operating system when the link width or data rate changes.
> + * The notification capability is required for all Root Ports and Downstream
> + * Ports supporting links wider than x1 and/or multiple link speeds.
Again, adding a new paragraph and not deleting lines introduced by
the preceding patch would be much nicer.
> +#include <linux/bitops.h>
> +#include <linux/errno.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-bwctrl.h>
> +#include <linux/types.h>
> +
> +#include <asm/rwonce.h>
> +
Which of these are necessary for functionality introduced by this
patch and which are necessary anyway and should be moved to the
preceding patch?
> +/**
> + * struct bwctrl_service_data - PCIe Port Bandwidth Controller
Code comment at the top of this file says "PCIe bandwidth controller",
use here as well for neatness.
> +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> +{
> + static const u16 speed_conv[] = {
> + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> + };
Looks like this could be a u8 array to save a little bit of memory.
Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
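Given the contiguity static_asserts added later in the series, the
table could arguably even be replaced with arithmetic (a sketch, not
what the patch does):

	static u16 speed2lnkctl2(enum pci_bus_speed speed)
	{
		/* Both enums are contiguous from their 2.5 GT/s value */
		return speed - PCIE_SPEED_2_5GT + PCI_EXP_LNKCTL2_TLS_2_5GT;
	}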
> +static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
> +{
> + struct pci_bus *bus = srv->port->subordinate;
> + u8 speeds, dev_speeds;
> + int i;
> +
> + if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
> + return -EINVAL;
> +
> + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
Hm, why is the compiler barrier needed?
> + /* Only the lowest speed can be set when there are no devices */
> + if (!dev_speeds)
> + dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
Maybe move this to patch [06/10], i.e. set dev->bus->pcie_dev_speeds to
PCI_EXP_LNKCAP2_SLS_2_5GB on removal (instead of 0).
> + speeds = bus->pcie_bus_speeds & dev_speeds;
> + i = BIT(fls(speeds));
> + while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
> + enum pci_bus_speed candidate;
> +
> + if (speeds & i) {
> + candidate = PCIE_LNKCAP2_SLS2SPEED(i);
> + if (candidate <= *speed) {
> + *speed = candidate;
> + return 0;
> + }
> + }
> + i >>= 1;
> + }
Can't we just do something like
supported_speeds = bus->pcie_bus_speeds & dev_speeds;
desired_speeds = GENMASK(pcie_link_speed[*speed], 1);
*speed = BIT(fls(supported_speeds & desired_speeds));
and thus avoid the loop altogether?
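Spelled out a bit more defensively (still only a sketch; fls() returns
a 1-based bit position and the result needs converting back to an
enum pci_bus_speed, hence the PCIE_LNKCAP2_SLS2SPEED() at the end):

	u8 supported = bus->pcie_bus_speeds & dev_speeds;
	u8 desired = GENMASK(*speed - PCIE_SPEED_2_5GT + 1, 1);
	u8 avail = supported & desired;

	if (!avail)
		return -EINVAL;
	*speed = PCIE_LNKCAP2_SLS2SPEED(BIT(fls(avail) - 1));
	return 0;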
> +/**
> + * bwctrl_set_current_speed - Set downstream link speed for PCIe port
> + * @srv: PCIe port
> + * @speed: PCIe bus speed to set
> + *
> + * Attempts to set PCIe port link speed to @speed. As long as @speed is less
> + * than the maximum of what is supported by @srv, the speed is adjusted
> + * downwards to the best speed supported by both the port and device
> + * underneath it.
> + *
> + * Return:
> + * * 0 - on success
> + * * -EINVAL - @speed is higher than the maximum @srv supports
> + * * -ETIMEDOUT - changing link speed took too long
> + * * -EAGAIN - link speed was changed but @speed was not achieved
> + */
So passing a higher speed than what's supported by the Endpoint is fine
but passing a higher speed than what's supported by the Downstream Port
is not? Why the distinction? Maybe it would be more logical to just
pick the maximum of what either supports and use that in lieu of a
too-high speed?
> +int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed)
> +{
> + struct bwctrl_service_data *data = get_service_data(srv);
> + struct pci_dev *port = srv->port;
> + u16 link_status;
> + int ret;
> +
> + if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
> + return -EINVAL;
> +
> + ret = bwctrl_select_speed(srv, &speed);
> + if (ret < 0)
> + return ret;
How about checking here if the selected speed is equivalent to what's
being used right now and bailing out if so? No need to retrain if
we're already at the desired speed.
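E.g. something like this early in the function (sketch; relies on
cur_bus_speed being kept up to date by the notification handler):

	/* Nothing to do if the link already runs at the target speed */
	if (port->subordinate->cur_bus_speed == speed)
		return 0;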
> @@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> if (ret)
> return ret;
>
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data) {
> + ret = -ENOMEM;
> + goto free_irq;
> + }
> + mutex_init(&data->set_speed_mutex);
> + set_service_data(srv, data);
> +
I think you should move that further up so that you allocate and populate
the data struct before requesting the IRQ. Otherwise if BIOS has already
enabled link bandwidth notification for some reason, the IRQ handler might
be invoked without the data struct being allocated.
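I.e. roughly this ordering (sketch):

	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;
	mutex_init(&data->set_speed_mutex);
	set_service_data(srv, data);

	ret = request_irq(srv->irq, pcie_bw_notification_irq,
			  IRQF_SHARED, "PCIe BW ctrl", srv);
	if (ret)
		goto free_data;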
> --- /dev/null
> +++ b/include/linux/pci-bwctrl.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * PCIe bandwidth controller
> + *
> + * Copyright (C) 2023 Intel Corporation.
Another trailing period I'd remove.
Thanks,
Lukas
* Re: [PATCH v3 09/10] thermal: Add PCIe cooling driver
2023-09-29 11:57 ` [PATCH v3 09/10] thermal: Add PCIe cooling driver Ilpo Järvinen
@ 2023-12-30 19:08 ` Lukas Wunner
2024-01-01 16:39 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 19:08 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Daniel Lezcano,
Amit Kucheria, Zhang Rui, linux-kernel, Alex Deucher
On Fri, Sep 29, 2023 at 02:57:22PM +0300, Ilpo Järvinen wrote:
> @@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> pcie_enable_link_bandwidth_notification(port);
> pci_info(port, "enabled with IRQ %d\n", srv->irq);
>
> + data->cdev = pcie_cooling_device_register(port, srv);
> + if (IS_ERR(data->cdev)) {
> + ret = PTR_ERR(data->cdev);
> + goto disable_notifications;
> + }
> return 0;
Now wait a minute, if you can't register the cooling device,
you still want to provide accurate link speeds to the user
in sysfs, right? At least that's what you promise in Kconfig.
> --- /dev/null
> +++ b/drivers/thermal/pcie_cooling.c
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * PCIe cooling device
> + *
> + * Copyright (C) 2023 Intel Corporation.
Another trailing period I'd remove.
I take it this patch (only) allows manual bandwidth throttling
through sysfs, right? And emergency throttling is introduced
separately on top of this?
Thanks,
Lukas
* Re: [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2023-12-30 11:45 ` Lukas Wunner
@ 2023-12-30 19:30 ` Lukas Wunner
2024-01-01 16:26 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2023-12-30 19:30 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, linux-kernel,
Alex Deucher, Daniel Lezcano, Amit Kucheria, Zhang Rui
On Sat, Dec 30, 2023 at 12:45:49PM +0100, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:18PM +0300, Ilpo Järvinen wrote:
> > struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
> > sec 7.5.3.18, however, recommends determining supported Link Speeds
> > using the Supported Link Speeds Vector in the Link Capabilities 2
> > Register (when available).
> >
> > Add pcie_bus_speeds into struct pci_bus which caches the Supported Link
> > Speeds. The value is taken directly from the Supported Link Speeds
> > Vector or synthesized from the Max Link Speed in the Link Capabilities
> > Register when the Link Capabilities 2 Register is not available.
>
> Remind me, what's the reason again to cache this and why is
> max_bus_speed not sufficient? Is the point that there may be
> "gaps" in the supported link speeds, i.e. not every bit below
> the maximum supported speed may be set? And you need to skip
> over those gaps when throttling to a lower speed?
FWIW I went and re-read the internal review I provided on May 18.
Turns out I already mentioned back then that gaps aren't permitted:
"Per PCIe r6.0.1 sec 8.2.1, the bitfield in the Link Capabilities 2
register is not permitted to contain gaps between maximum supported
speed and lowest possible speed (2.5 GT/s Gen1)."
> Also, I note that pci_set_bus_speed() doesn't use LNKCAP2.
About that, I wrote in May:
"Actually, scratch that. pci_set_bus_speed() is fine. Since it's only
interested in the *maximum* link speed, reading just LnkCap is correct.
LnkCap2 only needs to be read to determine if a certain speed is
*supported*. E.g., even though 32 GT/s are supported, perhaps 16 GT/s
are not.
It's rather pcie_get_speed_cap() which should be changed. There's
no need for it to read LnkCap2. The commit which introduced this,
6cf57be0f78e, was misguided and had to be fixed up with f1f90e254e46.
It could be simplified to just read LnkCap and return
pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS]. If the device is a
Root Port or Downstream Port, it doesn't even have to do that but
could return the cached value in subordinate->max_bus_speed.
If you add another attribute to struct pci_bus for the downstream
device's maximum speed, the maximum speed for Endpoints and Upstream
Ports could be returned directly as well from that attribute."
Thanks,
Lukas
* Re: [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2023-12-30 19:30 ` Lukas Wunner
@ 2024-01-01 16:26 ` Ilpo Järvinen
2024-01-01 16:40 ` Lukas Wunner
0 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 16:26 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Sat, Dec 30, 2023 at 12:45:49PM +0100, Lukas Wunner wrote:
> > On Fri, Sep 29, 2023 at 02:57:18PM +0300, Ilpo Järvinen wrote:
> > > struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
> > > sec 7.5.3.18, however, recommends determining supported Link Speeds
> > > using the Supported Link Speeds Vector in the Link Capabilities 2
> > > Register (when available).
> > >
> > > Add pcie_bus_speeds into struct pci_bus which caches the Supported Link
> > > Speeds. The value is taken directly from the Supported Link Speeds
> > > Vector or synthesized from the Max Link Speed in the Link Capabilities
> > > Register when the Link Capabilities 2 Register is not available.
> >
> > Remind me, what's the reason again to cache this and why is
> > max_bus_speed not sufficient? Is the point that there may be
> > "gaps" in the supported link speeds, i.e. not every bit below
> > the maximum supported speed may be set? And you need to skip
> > over those gaps when throttling to a lower speed?
>
> FWIW I went and re-read the internal review I provided on May 18.
> Turns out I already mentioned back then that gaps aren't permitted:
>
> "Per PCIe r6.0.1 sec 8.2.1, the bitfield in the Link Capabilities 2
> register is not permitted to contain gaps between maximum supported
> speed and lowest possible speed (2.5 GT/s Gen1)."
>
>
> > Also, I note that pci_set_bus_speed() doesn't use LNKCAP2.
>
> About that, I wrote in May:
>
> "Actually, scratch that. pci_set_bus_speed() is fine. Since it's only
> interested in the *maximum* link speed, reading just LnkCap is correct.
> LnkCap2 only needs to be read to determine if a certain speed is
> *supported*. E.g., even though 32 GT/s are supported, perhaps 16 GT/s
> are not.
>
> It's rather pcie_get_speed_cap() which should be changed. There's
> no need for it to read LnkCap2. The commit which introduced this,
> 6cf57be0f78e, was misguided and had to be fixed up with f1f90e254e46.
> It could be simplified to just read LnkCap and return
> pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS]. If the device is a
> Root Port or Downstream Port, it doesn't even have to do that but
> could return the cached value in subordinate->max_bus_speed.
> If you add another attribute to struct pci_bus for the downstream
> device's maximum speed, the maximum speed for Endpoints and Upstream
> Ports could be returned directly as well from that attribute."
I know it's quite far back so it's understandable to forget :-),
but already by May 23rd your position had changed and you wrote this:
'Per the Implementation Note at the end of PCIe r6.0.1 sec 7.5.3.18,
"It is strongly encouraged that software primarily utilize the
Supported Link Speeds Vector instead of the Max Link Speed field,
so that software can determine the exact set of supported speeds on
current and future hardware. This can avoid software being confused
if a future specification defines Links that do not require support
for all slower speeds."
This means that it's not sufficient if you just check that the desired
speed is lower than the maximum. Instead, you should check if the bit
corresponding to the desired speed is set in the LnkCap2 register's
Supported Link Speeds Vector.
PCIe r6.0.1 sec 8.2.1 stipulates that the bitfield is not permitted to
contain gaps between maximum supported speed and lowest possible speed
(2.5 GT/s Gen1). However the Implementation Note suggests that rule may
no longer apply in future revisions of the PCIe Base Spec.'
So I'd assume I should still follow the way the spec recommends, not the "old
method" that may not function correctly after some future version of the
spec, or have you really changed position once again on this?
--
i.
* Re: [PATCH v3 09/10] thermal: Add PCIe cooling driver
2023-12-30 19:08 ` Lukas Wunner
@ 2024-01-01 16:39 ` Ilpo Järvinen
0 siblings, 0 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 16:39 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, Daniel Lezcano,
Amit Kucheria, Zhang Rui, LKML, Alex Deucher
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:22PM +0300, Ilpo Järvinen wrote:
> > @@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> > pcie_enable_link_bandwidth_notification(port);
> > pci_info(port, "enabled with IRQ %d\n", srv->irq);
> >
> > + data->cdev = pcie_cooling_device_register(port, srv);
> > + if (IS_ERR(data->cdev)) {
> > + ret = PTR_ERR(data->cdev);
> > + goto disable_notifications;
> > + }
> > return 0;
>
> Now wait a minute, if you can't register the cooling device,
> you still want to provide accurate link speeds to the user
> in sysfs, right? At least that's what you promise in Kconfig.
When the thermal side is not even configured, it returns NULL, which is
not an ERR pointer.
I guess I can change the behavior for the real ERR cases (I was a bit
on the borderline about what to do with those failures).
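Something like the following would keep the notification side alive on
a real error (a sketch; the remove path would then need a NULL check
before unregistering the cooling device):

	data->cdev = pcie_cooling_device_register(port, srv);
	if (IS_ERR(data->cdev)) {
		pci_warn(port, "failed to register cooling device: %ld\n",
			 PTR_ERR(data->cdev));
		data->cdev = NULL;
	}
	return 0;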
> > --- /dev/null
> > +++ b/drivers/thermal/pcie_cooling.c
> > @@ -0,0 +1,107 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * PCIe cooling device
> > + *
> > + * Copyright (C) 2023 Intel Corporation.
>
> Another trailing period I'd remove.
>
> I take it this patch (only) allows manual bandwidth throttling
> through sysfs, right? And emergency throttling is introduced
> separately on top of this?
Only sysfs throttling is introduced by this series; there's no emergency
throttling in it. Also, many things have been simplified in this series
because the more complex alternatives would only be justified by
emergency throttling and would just raise 'why is this and that being
done' questions (e.g., the critical section was enlarged as per your
request, whereas with emergency throttling it wouldn't make sense to
wait until the end of the link training before attempting to lower the
link speed).
--
i.
* Re: [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2024-01-01 16:26 ` Ilpo Järvinen
@ 2024-01-01 16:40 ` Lukas Wunner
2024-01-01 16:53 ` Ilpo Järvinen
0 siblings, 1 reply; 29+ messages in thread
From: Lukas Wunner @ 2024-01-01 16:40 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Mon, Jan 01, 2024 at 06:26:40PM +0200, Ilpo Järvinen wrote:
> On Sat, 30 Dec 2023, Lukas Wunner wrote:
>
> > On Sat, Dec 30, 2023 at 12:45:49PM +0100, Lukas Wunner wrote:
> > > On Fri, Sep 29, 2023 at 02:57:18PM +0300, Ilpo Järvinen wrote:
> > > > struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
> > > > sec 7.5.3.18, however, recommends determining supported Link Speeds
> > > > using the Supported Link Speeds Vector in the Link Capabilities 2
> > > > Register (when available).
> > > >
> > > > Add pcie_bus_speeds into struct pci_bus which caches the Supported Link
> > > > Speeds. The value is taken directly from the Supported Link Speeds
> > > > Vector or synthesized from the Max Link Speed in the Link Capabilities
> > > > Register when the Link Capabilities 2 Register is not available.
> > >
> > > Remind me, what's the reason again to cache this and why is
> > > max_bus_speed not sufficient? Is the point that there may be
> > > "gaps" in the supported link speeds, i.e. not every bit below
> > > the maximum supported speed may be set? And you need to skip
> > > over those gaps when throttling to a lower speed?
> >
> > FWIW I went and re-read the internal review I provided on May 18.
> > Turns out I already mentioned back then that gaps aren't permitted:
> >
> > "Per PCIe r6.0.1 sec 8.2.1, the bitfield in the Link Capabilities 2
> > register is not permitted to contain gaps between maximum supported
> > speed and lowest possible speed (2.5 GT/s Gen1)."
> >
> >
> > > Also, I note that pci_set_bus_speed() doesn't use LNKCAP2.
> >
> > About that, I wrote in May:
> >
> > "Actually, scratch that. pci_set_bus_speed() is fine. Since it's only
> > interested in the *maximum* link speed, reading just LnkCap is correct.
> > LnkCap2 only needs to be read to determine if a certain speed is
> > *supported*. E.g., even though 32 GT/s are supported, perhaps 16 GT/s
> > are not.
> >
> > It's rather pcie_get_speed_cap() which should be changed. There's
> > no need for it to read LnkCap2. The commit which introduced this,
> > 6cf57be0f78e, was misguided and had to be fixed up with f1f90e254e46.
> > It could be simplified to just read LnkCap and return
> > pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS]. If the device is a
> > Root Port or Downstream Port, it doesn't even have to do that but
> > could return the cached value in subordinate->max_bus_speed.
> > If you add another attribute to struct pci_bus for the downstream
> > device's maximum speed, the maximum speed for Endpoints and Upstream
> > Ports could be returned directly as well from that attribute."
>
> I know it's quite far back so it's understandable to forget :-),
> but already by May 23rd your position had changed and you wrote this:
>
> 'Per the Implementation Note at the end of PCIe r6.0.1 sec 7.5.3.18,
>
> "It is strongly encouraged that software primarily utilize the
> Supported Link Speeds Vector instead of the Max Link Speed field,
> so that software can determine the exact set of supported speeds on
> current and future hardware. This can avoid software being confused
> if a future specification defines Links that do not require support
> for all slower speeds."
>
> This means that it's not sufficient if you just check that the desired
> speed is lower than the maximum. Instead, you should check if the bit
> corresponding to the desired speed is set in the LnkCap2 register's
> Supported Link Speeds Vector.
>
> PCIe r6.0.1 sec 8.2.1 stipulates that the bitfield is not permitted to
> contain gaps between maximum supported speed and lowest possible speed
> (2.5 GT/s Gen1). However the Implementation Note suggests that rule may
> no longer apply in future revisions of the PCIe Base Spec.'
>
> So I'd assume I should still follow the way spec recommends, not the "old
> method" that may not function correctly after some future version of the
> spec, or have you really changed position once again on this?
I haven't, you're right, I forgot about all those details.
Thanks for that blast from the past. ;)
But it would be good to extend the commit message because without all
that context, it's difficult to understand why the max_bus_speed isn't
sufficient.
Lukas
* Re: [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds
2024-01-01 16:40 ` Lukas Wunner
@ 2024-01-01 16:53 ` Ilpo Järvinen
0 siblings, 0 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 16:53 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Mon, 1 Jan 2024, Lukas Wunner wrote:
> On Mon, Jan 01, 2024 at 06:26:40PM +0200, Ilpo Järvinen wrote:
> > On Sat, 30 Dec 2023, Lukas Wunner wrote:
> >
> > > On Sat, Dec 30, 2023 at 12:45:49PM +0100, Lukas Wunner wrote:
> > > > On Fri, Sep 29, 2023 at 02:57:18PM +0300, Ilpo Järvinen wrote:
> > > > > struct pci_bus stores max_bus_speed. Implementation Note in PCIe r6.0.1
> > > > > sec 7.5.3.18, however, recommends determining supported Link Speeds
> > > > > using the Supported Link Speeds Vector in the Link Capabilities 2
> > > > > Register (when available).
> > > > >
> > > > > Add pcie_bus_speeds into struct pci_bus which caches the Supported Link
> > > > > Speeds. The value is taken directly from the Supported Link Speeds
> > > > > Vector or synthesized from the Max Link Speed in the Link Capabilities
> > > > > Register when the Link Capabilities 2 Register is not available.
> > > >
> > > > Remind me, what's the reason again to cache this and why is
> > > > max_bus_speed not sufficient? Is the point that there may be
> > > > "gaps" in the supported link speeds, i.e. not every bit below
> > > > the maximum supported speed may be set? And you need to skip
> > > > over those gaps when throttling to a lower speed?
> > >
> > > FWIW I went and re-read the internal review I provided on May 18.
> > > Turns out I already mentioned back then that gaps aren't permitted:
> > >
> > > "Per PCIe r6.0.1 sec 8.2.1, the bitfield in the Link Capabilities 2
> > > register is not permitted to contain gaps between maximum supported
> > > speed and lowest possible speed (2.5 GT/s Gen1)."
> > >
> > >
> > > > Also, I note that pci_set_bus_speed() doesn't use LNKCAP2.
> > >
> > > About that, I wrote in May:
> > >
> > > "Actually, scratch that. pci_set_bus_speed() is fine. Since it's only
> > > interested in the *maximum* link speed, reading just LnkCap is correct.
> > > LnkCap2 only needs to be read to determine if a certain speed is
> > > *supported*. E.g., even though 32 GT/s are supported, perhaps 16 GT/s
> > > are not.
> > >
> > > It's rather pcie_get_speed_cap() which should be changed. There's
> > > no need for it to read LnkCap2. The commit which introduced this,
> > > 6cf57be0f78e, was misguided and had to be fixed up with f1f90e254e46.
> > > It could be simplified to just read LnkCap and return
> > > pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS]. If the device is a
> > > Root Port or Downstream Port, it doesn't even have to do that but
> > > could return the cached value in subordinate->max_bus_speed.
> > > If you add another attribute to struct pci_bus for the downstream
> > > device's maximum speed, the maximum speed for Endpoints and Upstream
> > > Ports could be returned directly as well from that attribute."
> >
> > I know it's quite far back so it's understandable to forget :-),
> > but already by May 23rd your position had changed and you wrote this:
> >
> > 'Per the Implementation Note at the end of PCIe r6.0.1 sec 7.5.3.18,
> >
> > "It is strongly encouraged that software primarily utilize the
> > Supported Link Speeds Vector instead of the Max Link Speed field,
> > so that software can determine the exact set of supported speeds on
> > current and future hardware. This can avoid software being confused
> > if a future specification defines Links that do not require support
> > for all slower speeds."
> >
> > This means that it's not sufficient if you just check that the desired
> > speed is lower than the maximum. Instead, you should check if the bit
> > corresponding to the desired speed is set in the LnkCap2 register's
> > Supported Link Speeds Vector.
> >
> > PCIe r6.0.1 sec 8.2.1 stipulates that the bitfield is not permitted to
> > contain gaps between maximum supported speed and lowest possible speed
> > (2.5 GT/s Gen1). However the Implementation Note suggests that rule may
> > no longer apply in future revisions of the PCIe Base Spec.'
> >
> > So I'd assume I should still follow the way spec recommends, not the "old
> > method" that may not function correctly after some future version of the
> > spec, or have you really changed position once again on this?
>
> I haven't, you're right, I forgot about all those details.
> Thanks for that blast from the past. ;)
>
> But it would be good to extend the commit message because without all
> that context, it's difficult to understand why the max_bus_speed isn't
> sufficient.
Thanks. I'll extend the commit message.
--
i.
* Re: [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
2023-12-30 15:58 ` Lukas Wunner
@ 2024-01-01 17:37 ` Ilpo Järvinen
2024-01-01 18:11 ` Lukas Wunner
0 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 17:37 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:20PM +0300, Ilpo Järvinen wrote:
> > This mostly reverts b4c7d2076b4e ("PCI/LINK: Remove bandwidth
> > notification"), however, there are small tweaks:
> >
> > 1) Call it PCIe bwctrl (bandwidth controller) instead of just
> > bandwidth notifications.
> > 2) Don't print the notifications into kernel log, just keep the current
> > link speed updated.
> > 3) Use concurrency safe LNKCTL RMW operations.
> > 4) Read link speed after enabling the notification to ensure the
> > current link speed is correct from the start.
> > 5) Add local variable in probe for srv->port.
> > 6) Handle link speed read and LBMS write race in
> > pcie_bw_notification_irq().
> >
> > The reason for 1) is to indicate the increased scope of the driver. A
> > subsequent commit extends the driver to allow controlling PCIe
> > bandwidths from user space upon crossing thermal thresholds.
> >
> > While 2) is somewhat unfortunate, the log spam was the source of
> > complaints that eventually lead to the removal of the bandwidth
> > notifications driver (see the links below for further information).
> > After re-adding this driver back the userspace can, if it wishes to,
> > observe the link speed changes using the current bus speed files under
> > sysfs.
>
> Good commit message.
>
> > --- /dev/null
> > +++ b/drivers/pci/pcie/bwctrl.c
>
> > +static void pcie_enable_link_bandwidth_notification(struct pci_dev *dev)
> > +{
> > + u16 link_status;
> > + int ret;
> > +
> > + pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
> > + pcie_capability_set_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);
>
> I'm wondering why we're not enabling LABIE as well?
> (And clear LABS.)
>
> Can't it happen that we miss bandwidth changes unless we enable that
> as well?
Thanks. Reading the spec, it sounds like both are necessary to not miss
changes.
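So something like this, perhaps (a sketch only, clearing and enabling
both the Link Bandwidth Management and Link Autonomous Bandwidth bits):

	/* Clear stale status so old events don't fire immediately */
	pcie_capability_write_word(dev, PCI_EXP_LNKSTA,
				   PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS);
	/* Enable both interrupt sources to avoid missing speed changes */
	pcie_capability_set_word(dev, PCI_EXP_LNKCTL,
				 PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);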
> > +static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> > +{
> > + struct pci_dev *port = srv->port;
> > + int ret;
> > +
> > + /* Single-width or single-speed ports do not have to support this. */
> > + if (!pcie_link_bandwidth_notification_supported(port))
> > + return -ENODEV;
>
> I'm wondering if this should be checked in get_port_device_capability()
> instead?
I can move the check there.
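Roughly like this, I suppose (a sketch; it assumes the supported-check
is made visible to portdrv and that the service bit keeps the
PCIE_PORT_SERVICE_BWNOTIF name the old driver used):

	/* In get_port_device_capability() */
	if (pcie_link_bandwidth_notification_supported(dev))
		services |= PCIE_PORT_SERVICE_BWNOTIF;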
> > + ret = request_irq(srv->irq, pcie_bw_notification_irq,
> > + IRQF_SHARED, "PCIe BW ctrl", srv);
>
> Is there a reason to run the IRQ handler in hardirq context
> or would it work to run it in an IRQ thread? Usually on systems
> that enable PREEMPT_RT, a threaded IRQ handler is preferred,
> so unless hardirq context is necessary, I'd recommend using
> an IRQ thread.
Can I somehow postpone the decision between IRQ_NONE / IRQ_HANDLED
to the thread_fn? One LNKSTA read is necessary to decide that.
I suppose the other write + reread of LNKSTA could be moved into
thread_fn even if the first read would not be movable.
--
i.
* Re: [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
2024-01-01 17:37 ` Ilpo Järvinen
@ 2024-01-01 18:11 ` Lukas Wunner
0 siblings, 0 replies; 29+ messages in thread
From: Lukas Wunner @ 2024-01-01 18:11 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Mon, Jan 01, 2024 at 07:37:25PM +0200, Ilpo Järvinen wrote:
> On Sat, 30 Dec 2023, Lukas Wunner wrote:
> > On Fri, Sep 29, 2023 at 02:57:20PM +0300, Ilpo Järvinen wrote:
> > > + pcie_capability_write_word(dev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
> > > + pcie_capability_set_word(dev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);
> >
> > I'm wondering why we're not enabling LABIE as well?
> > (And clear LABS.)
> >
> > Can't it happen that we miss bandwidth changes unless we enable that
> > as well?
>
> Thanks. Reading the spec, it sounds like both are necessary to not miss
> changes.
I guess this is an artefact of Alex' original patch.
I don't know why he enabled one but not the other.
> > > + ret = request_irq(srv->irq, pcie_bw_notification_irq,
> > > + IRQF_SHARED, "PCIe BW ctrl", srv);
> >
> > Is there a reason to run the IRQ handler in hardirq context
> > or would it work to run it in an IRQ thread? Usually on systems
> > that enable PREEMPT_RT, a threaded IRQ handler is preferred,
> > so unless hardirq context is necessary, I'd recommend using
> > an IRQ thread.
>
> Can I somehow postpone the decision between IRQ_NONE / IRQ_HANDLED
> to the thread_fn? One LNKSTA read is necessary to decide
> that.
>
> I suppose the other write + reread of LNKSTA could be moved into
> thread_fn even if the first read would not be movable.
You can just use request_threaded_irq(), pass NULL for the "handler"
argument and pcie_bw_notification_irq for the "thread_fn" argument.
Because of the NULL argument for "handler", the hardirq handler will
then become irq_default_primary_handler(), which does nothing but
return IRQ_WAKE_THREAD. And the decision between IRQ_NONE and
IRQ_HANDLED is then indeed postponed to the IRQ thread.
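That is (a sketch; IRQF_ONESHOT is included here because the core
demands it for a NULL primary handler unless the irqchip is
oneshot-safe, as MSI chips usually are):

	ret = request_threaded_irq(srv->irq, NULL, pcie_bw_notification_irq,
				   IRQF_SHARED | IRQF_ONESHOT,
				   "PCIe BW ctrl", srv);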
Alternatively you can split the IRQ handler, move the check whether
PCI_EXP_LNKSTA_LBMS is set to the hardirq handler and keep the rest
in the IRQ thread. Means you won't have unnecessary wakeups of the
IRQ thread if the interrupt is caused by something else (I understand
it's always shared with PME and hotplug). But you'll spend more time
in hardirq context. In practice bandwidth notifications may be more
frequent than PME and hotplug interrupts, so unnecessary wakeups of
the IRQ thread will be rare. Hence not splitting the IRQ handler
may be better. Dunno. Ask Thomas Gleixner or Sebastian Siewior. :)
Thanks,
Lukas
* Re: [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl
2023-12-30 18:49 ` Lukas Wunner
@ 2024-01-01 18:12 ` Ilpo Järvinen
2024-01-03 16:40 ` Lukas Wunner
0 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 18:12 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
>
>
> > --- a/drivers/pci/pcie/bwctrl.c
> > +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> > +{
> > + static const u16 speed_conv[] = {
> > + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> > + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> > + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> > + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> > + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> > + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> > + };
>
> Looks like this could be a u8 array to save a little bit of memory.
>
> Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
It's not the same. pcie_link_speed[] is indexed by a different thing,
unless you're also suggesting I subtract a constant from the index?
There is plenty of arithmetic possible when converting between the
types, but the existing conversion functions don't seem to take
advantage of those properties, so I've been hesitant to add such
assumptions myself.
I suppose I used u16 because the reg is u16 but you're of course correct
that the values don't take up more than u8.
> > +static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
> > +{
> > + struct pci_bus *bus = srv->port->subordinate;
> > + u8 speeds, dev_speeds;
> > + int i;
> > +
> > + if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
> > + return -EINVAL;
> > +
> > + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
>
> Hm, why is the compiler barrier needed?
It's probably overkill, but there could be a checker which finds that
this read is not protected by anything while the value could get updated
concurrently (there's probably already such a checker, as I've seen
patches for data races found with some tool). I suppose the alternative
would be to mark the data race as harmless (because if endpoint removal
clears pcie_dev_speeds, bwctrl will be pretty moot). I just chose
READ_ONCE() because it prevents rereading the same value later in this
function, making the function behave consistently regardless of what
happens to the endpoints in parallel.
If the value selected cannot be set because the endpoint is no longer
there, the other parts of the code will detect that.
So should I just add a comment to this line explaining the data race and
keep it as is?
> > + /* Only the lowest speed can be set when there are no devices */
> > + if (!dev_speeds)
> > + dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
>
> Maybe move this to patch [06/10], i.e. set dev->bus->pcie_dev_speeds to
> PCI_EXP_LNKCAP2_SLS_2_5GB on removal (instead of 0).
Okay, I'll set it to 2_5GB on init and removal.
> > + speeds = bus->pcie_bus_speeds & dev_speeds;
> > + i = BIT(fls(speeds));
> > + while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
> > + enum pci_bus_speed candidate;
> > +
> > + if (speeds & i) {
> > + candidate = PCIE_LNKCAP2_SLS2SPEED(i);
> > + if (candidate <= *speed) {
> > + *speed = candidate;
> > + return 0;
> > + }
> > + }
> > + i >>= 1;
> > + }
>
> Can't we just do something like
>
> supported_speeds = bus->pcie_bus_speeds & dev_speeds;
> desired_speeds = GENMASK(pcie_link_speed[*speed], 1);
> *speed = BIT(fls(supported_speeds & desired_speeds) - 1);
>
> and thus avoid the loop altogether?
Yes, I can change to a loopless version.
I'll try to create functions for the speed format conversions though
rather than open coding them into the logic.
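Something in this direction, perhaps (a sketch; speed2sls() is a
hypothetical helper converting enum pci_bus_speed to the raw LNKCAP2
SLS bit number, since pcie_link_speed[] maps in the opposite direction):

	u8 speeds = bus->pcie_bus_speeds & dev_speeds;
	/* Keep only the speeds at or below the requested one */
	u8 mask = speeds & GENMASK(speed2sls(*speed), 1);

	if (!mask)
		return -EINVAL;
	*speed = PCIE_LNKCAP2_SLS2SPEED(BIT(fls(mask) - 1));
	return 0;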
> > @@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> > if (ret)
> > return ret;
> >
> > + data = kzalloc(sizeof(*data), GFP_KERNEL);
> > + if (!data) {
> > + ret = -ENOMEM;
> > + goto free_irq;
> > + }
> > + mutex_init(&data->set_speed_mutex);
> > + set_service_data(srv, data);
> > +
>
> I think you should move that further up so that you allocate and populate
> the data struct before requesting the IRQ. Otherwise if BIOS has already
> enabled link bandwidth notification for some reason, the IRQ handler might
> be invoked without the data struct being allocated.
Sure, I don't know why I missed that possibility.
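The reordered probe would then look something like this (a sketch with
abbreviated error handling):

	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;
	mutex_init(&data->set_speed_mutex);
	set_service_data(srv, data);

	/* Request the IRQ only once the service data is in place */
	ret = request_irq(srv->irq, pcie_bw_notification_irq,
			  IRQF_SHARED, "PCIe BW ctrl", srv);
	if (ret) {
		set_service_data(srv, NULL);
		kfree(data);
		return ret;
	}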
--
i.
* Re: [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector
2023-12-30 15:19 ` Lukas Wunner
@ 2024-01-01 18:31 ` Ilpo Järvinen
2024-01-03 16:51 ` Lukas Wunner
0 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2024-01-01 18:31 UTC (permalink / raw)
To: Lukas Wunner
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:19PM +0300, Ilpo Järvinen wrote:
> > The Supported Link Speeds Vector in the Link Capabilities Register 2
> > corresponds to the bus below on Root Ports and Downstream Ports,
> > whereas it corresponds to the bus above on Upstream Ports and
> > Endpoints.
>
> It would be good to add a pointer to the spec here. I think the
> relevant section is PCIe r6.1 sec 7.5.3.18 which says:
>
> "Supported Link Speeds Vector - This field indicates the supported
> Link speed(s) of the associated Port."
> ^^^^^^^^^^^^^^^
>
> Obviously the associated port is upstream on a Switch Upstream Port
> or Endpoint, whereas it is downstream on a Switch Downstream Port
> or Root Port.
>
> Come to think of it, what about edge cases such as RCiEPs?
On real HW I've seen, RCiEPs don't seem to have these speeds at all
(PCIe r6.1, sec 7.5.3):
"The Link Capabilities, Link Status, and Link Control registers are
required for all Root Ports, Switch Ports, Bridges, and Endpoints that are
not RCiEPs. For Functions that do not implement the Link Capabilities,
Link Status, and Link Contro registers, these spaces must be hardwired to
0. Link Capabilities 2, Link Status 2, and Link Control 2 registers are
required for all Root Ports, Switch Ports, Bridges, and Endpoints (except
for RCiEPs) that implement capabilities requiring those registers. For
Functions that do not implement the Link Capabilities 2, Link Status 2,
and Link Control 2 registers, these spaces must be hardwired to 0b."
> > Only the former is currently cached in pcie_bus_speeds in
> > the struct pci_bus. The link speeds that are supported are the
> > intersection of these two.
>
> I'm wondering if caching both is actually necessary. Why not cache
> just the intersection? Do we need either of the two somewhere?
The intersection is enough, at least for bwctrl. The only downside,
barely worth mentioning, is that the bus SLSV has to be re-read when
function 0 sets the intersection.
I can think of somebody wanting to expose both lists of supported speeds
to userspace through sysfs (not done by this patch series), but they could
be read from the registers in that case, so that use case doesn't really
matter much, IMO.
> > Store the device's Supported Link Speeds Vector into the struct pci_bus
> > when Function 0 is enumerated (Multi-Function Devices must have the
> > same speeds for all Functions) to be easily able to calculate
> > the intersection of Supported Link Speeds.
>
> Might want to add an explanation what you're going to need this for,
> I assume it's accessed frequently by the bandwidth throttling driver
> in a subsequent patch?
Yes. I tend to avoid forward references because some maintainers
complain about them (leading to minimal commit messages where the true
motivation has to be hidden, because the "future" cannot be used to
motivate a change even if it's often the truest motivation within a
patch series). But I'll add a forward reference here to make it more
obvious. :-)
--
i.
* Re: [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl
2024-01-01 18:12 ` Ilpo Järvinen
@ 2024-01-03 16:40 ` Lukas Wunner
0 siblings, 0 replies; 29+ messages in thread
From: Lukas Wunner @ 2024-01-03 16:40 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Mon, Jan 01, 2024 at 08:12:43PM +0200, Ilpo Järvinen wrote:
> On Sat, 30 Dec 2023, Lukas Wunner wrote:
> > On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
> > > --- a/drivers/pci/pcie/bwctrl.c
> > > +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> > > +{
> > > + static const u16 speed_conv[] = {
> > > + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> > > + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> > > + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> > > + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> > > + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> > > + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> > > + };
> >
> > Looks like this could be a u8 array to save a little bit of memory.
> >
> > Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
>
> It's not the same. pcie_link_speed[] is indexed by a different thing
Ah, missed that. Those various conversion functions are confusing.
> > > + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
> >
> > Hm, why is the compiler barrier needed?
>
> It's probably overkill, but there could be a checker which finds that
> this read is not protected by anything while the value could get updated
> concurrently
Why should it be updated? It's a static value cached on enumeration
and never changed AFAICS.
> If the value selected cannot be set because of endpoint no longer being
> there, the other parts of the code will detect that.
If the endpoint is hot-unplugged, the link is down, so retraining
the link will fail.
If the device is replaced concurrently with retraining, you may
try to retrain to a speed which is too fast or too slow for the new
device. Maybe it's necessary to cope with that? Basically find the
pci_dev on the bus with the device/function id and check if the pci_dev
pointer still points to the same location. Another idea would be to
hook up bwctrl with pciehp so that pciehp tells bwctrl the device is
gone and any speed changes should be aborted. We've hooked up DPC
with pciehp in a similar way.
> So should I just add a comment to this line explaining the data race and
> keep it as is?
I'd just drop the READ_ONCE() here if there's not a good reason for it.
Thanks,
Lukas
* Re: [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector
2024-01-01 18:31 ` Ilpo Järvinen
@ 2024-01-03 16:51 ` Lukas Wunner
0 siblings, 0 replies; 29+ messages in thread
From: Lukas Wunner @ 2024-01-03 16:51 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
Krzysztof Wilczyński, Alexandru Gagniuc,
Krishna chaitanya chundru, Srinivas Pandruvada,
Rafael J . Wysocki, linux-pm, Bjorn Helgaas, LKML, Alex Deucher,
Daniel Lezcano, Amit Kucheria, Zhang Rui
On Mon, Jan 01, 2024 at 08:31:39PM +0200, Ilpo Järvinen wrote:
> On Sat, 30 Dec 2023, Lukas Wunner wrote:
> > On Fri, Sep 29, 2023 at 02:57:19PM +0300, Ilpo Järvinen wrote:
> > > Only the former is currently cached in pcie_bus_speeds in
> > > the struct pci_bus. The link speeds that are supported are the
> > > intersection of these two.
> >
> > I'm wondering if caching both is actually necessary. Why not cache
> > just the intersection? Do we need either of the two somewhere?
>
> The intersection is enough, at least for bwctrl. The only downside,
> barely worth mentioning, is that the bus SLSV has to be re-read when
> function 0 sets the intersection.
>
> I can think of somebody wanting to expose both lists of supported speeds
> to userspace through sysfs (not done by this patch series), but they could
> be read from the registers in that case, so that use case doesn't really
> matter much, IMO.
Yes, that would be a reasonable argument to keep both values instead
of storing just the intersection.
> > > Store the device's Supported Link Speeds Vector into the struct pci_bus
> > > when Function 0 is enumerated (Multi-Function Devices must have the
> > > same speeds for all Functions) to be easily able to calculate
> > > the intersection of Supported Link Speeds.
> >
> > Might want to add an explanation what you're going to need this for,
> > I assume it's accessed frequently by the bandwidth throttling driver
> > in a subsequent patch?
>
> Yes. I tend to avoid forward references because some maintainers
> complain about them (leading to minimal commit messages where the true
> motivation has to be hidden, because the "future" cannot be used to
> motivate a change even if it's often the truest motivation within a
> patch series). But I'll add a forward reference here to make it more
> obvious. :-)
Bjorn has used phrases such as "We're about to ..." a couple of times
in commit messages to convey that a particular change in the present
patch will be taken advantage of by a subsequent patch.
I've used the same phrase but got criticized (in other subsystems)
for using "we".
So I use phrases such as:
"An upcoming commit will create DOE mailboxes upon device enumeration by
the PCI core. Their lifetime shall not be limited by a driver.
Therefore rework..." (see 022b66f38195)
Can also reference the past:
"The PCI core has just been amended to create a pci_doe_mb struct for
every DOE instance on device enumeration. [...] That leaves [...]
without any callers, so drop them." (see 74e491e5d1bc)
If someone finds your commit e.g. through git blame, it may help them
enormously if you provide context in the commit message. If maintainers
in other subsystems tell you otherwise, they're wrong. ;)
Thanks,
Lukas
end of thread, other threads: [~2024-01-03 16:51 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29 11:57 [PATCH v3 00/10] Add PCIe Bandwidth Controller Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 01/10] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
2023-12-30 10:33 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 02/10] drm/radeon: Use RMW accessors for changing LNKCTL2 Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 03/10] drm/amdgpu: " Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 04/10] RDMA/hfi1: " Ilpo Järvinen
2023-09-29 13:03 ` Dean Luick
2023-09-29 11:57 ` [PATCH v3 05/10] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
2023-12-30 11:45 ` Lukas Wunner
2023-12-30 19:30 ` Lukas Wunner
2024-01-01 16:26 ` Ilpo Järvinen
2024-01-01 16:40 ` Lukas Wunner
2024-01-01 16:53 ` Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 06/10] PCI: Cache PCIe device's Supported Speed Vector Ilpo Järvinen
2023-12-30 15:19 ` Lukas Wunner
2024-01-01 18:31 ` Ilpo Järvinen
2024-01-03 16:51 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 07/10] PCI/LINK: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
2023-12-30 15:58 ` Lukas Wunner
2024-01-01 17:37 ` Ilpo Järvinen
2024-01-01 18:11 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 08/10] PCI/bwctrl: Add "controller" part into PCIe bwctrl Ilpo Järvinen
2023-12-30 18:49 ` Lukas Wunner
2024-01-01 18:12 ` Ilpo Järvinen
2024-01-03 16:40 ` Lukas Wunner
2023-09-29 11:57 ` [PATCH v3 09/10] thermal: Add PCIe cooling driver Ilpo Järvinen
2023-12-30 19:08 ` Lukas Wunner
2024-01-01 16:39 ` Ilpo Järvinen
2023-09-29 11:57 ` [PATCH v3 10/10] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen