linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
@ 2024-10-18 14:47 Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 1/9] Documentation PCI: Reformat RMW ops documentation Ilpo Järvinen
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli
  Cc: linux-kernel, Daniel Lezcano, Amit Kucheria, Zhang Rui,
	Christophe JAILLET, Ilpo Järvinen

Hi all,

This series adds PCIe bandwidth controller (bwctrl) and associated PCIe
cooling driver to the thermal core side for limiting PCIe Link Speed
due to thermal reasons. PCIe bandwidth controller is a PCI express bus
port service driver. A cooling device is created for each port the
service driver finds to support changing speeds.

This series only adds support for controlling PCIe Link Speed.
Controlling PCIe Link Width might also be useful but there is no
mechanism for that until PCIe 6.0 (L0p) so Link Width throttling is not
added by this series.


v9:
- Split RMW ops doc reformat into own patch before adding LNKCTL2.
- Comment reserved 0 LSB even better than it already was.
- Consider portdrv future plans:
	- Use devm helpers for mem alloc, IRQ, and mutex init.
	- Don't use get/set_service_data().
- Split rwsem into two to avoid recursive locking splat through
  pcie_retrain_link().
- Small wording improvements to commit messages (from Jonathan)

v8:
- Removed CONFIG_PCIE_BWCTRL (use CONFIG_PCIEPORTBUS)
- Removed locking wrappers that dealt with the CONFIG variations
- Protect macro parameter with parenthesis to be on the safe side

v7:
- Rebased on top of Maciej's latest Target Speed quirk patches
- Target Speed quirk runs very early, w/o ->subordinate existing yet.
  This required adapting logic:
	- Move Supported Link Speeds back to pci_dev
	- Check for ->subordinate == NULL where necessary
	- Cannot always take bwctrl's per port mutex (in pcie_bwctrl_data)
- Cleaned up locking in pcie_set_target_speed() using wrappers
	- Allowed removing confusing __pcie_set_target_speed()
- Fix building with CONFIG_PCI=n
- Correct error check in pcie_lbms_seen()
- Don't return error for an empty bus that remains at 2.5GT
- Use rwsem to protect ->link_bwctrl setup and bwnotif enable
- Clear LBMS in remove_board()
- Adding export for pcie_get_supported_speeds() was unnecessary
- Call bwctrl's init before hotplug.
- Added local variable 'bus' into a few functions

v6:
- Removed unnecessary PCI_EXP_LNKCAP_SLS mask from PCIE_LNKCAP_SLS2SPEED()
- Split error handling in pcie_bwnotif_irq_thread()
- pci_info() -> pci_dbg() on bwctrl probe success path
- Handle cooling device pointer -Exx codes in bwctrl probe
- Reorder port->link_bwctrl setup / bwnotif enable for symmetry
- Handle LBMS count == 0 in PCIe quirk by checking LBMS (avoids a race
  between quirk and bwctrl)
- Use cleanup.h in PCIe cooling device's register

v5:
- Removed patches: LNKCTL2 RMW driver patches went in separately
- Refactor pcie_update_link_speed() to read LNKSTA + add __ variant
  for hotplug that has LNKSTA value at hand
- Make series fully compatible with the Target Speed quirk
	- LBMS counter added, quirk falls back to LBMS bit when bwctrl =n
	- Separate LBMS patch from set target speed patches
- Always provide pcie_bwctrl_change_speed() even if bwctrl =n so drivers
  don't need to come up their own version (also required by the Target
  Speed quirk)
- Remove devm_* (based on Lukas' comment on some other service
  driver patch)
- Convert to use cleanup.h
- Renamed functions/struct to have shorter names

v4:
- Merge Port's and Endpoint's Supported Link Speeds Vectors into
  supported_speeds in the struct pci_bus
- Reuse pcie_get_speed_cap()'s code for pcie_get_supported_speeds()
- Setup supported_speeds with PCI_EXP_LNKCAP2_SLS_2_5GB when no
  Endpoint exists
- Squash revert + add bwctrl patches into one
- Change to use threaded IRQ + IRQF_ONESHOT
- Enable also LABIE / LABS
- Convert Link Speed selection to use bit logic instead of loop
- Allocate before requesting IRQ during probe
- Use devm_*()
- Use u8 for speed_conv array instead of u16
- Removed READ_ONCE()
- Improve changelogs, comments, and Kconfig
- Name functions slightly more consistently
- Use bullet list for RMW protected registers in docs

v3:
- Correct hfi1 shortlog prefix
- Improve error prints in hfi1
- Add L: linux-pci to the MAINTAINERS entry

v2:
- Adds LNKCTL2 to RMW safe list in Documentation/PCI/pciebus-howto.rst
- Renamed cooling devices from PCIe_Port_* to PCIe_Port_Link_Speed_* in
  order to plan for possibility of adding Link Width cooling devices
  later on
- Moved struct thermal_cooling_device declaration to the correct patch
- Small tweaks to Kconfig texts
- Series rebased to resolve conflict (in the selftest list)

Ilpo Järvinen (9):
  Documentation PCI: Reformat RMW ops documentation
  PCI: Protect Link Control 2 Register with RMW locking
  PCI: Store all PCIe Supported Link Speeds
  PCI: Refactor pcie_update_link_speed()
  PCI/quirks: Abstract LBMS seen check into own function
  PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  PCI/bwctrl: Add API to set PCIe Link Speed
  thermal: Add PCIe cooling driver
  selftests/pcie_bwctrl: Create selftests

 Documentation/PCI/pciebus-howto.rst           |  14 +-
 MAINTAINERS                                   |   9 +
 drivers/pci/hotplug/pciehp_ctrl.c             |   5 +
 drivers/pci/hotplug/pciehp_hpc.c              |   2 +-
 drivers/pci/pci.c                             |  62 ++-
 drivers/pci/pci.h                             |  38 +-
 drivers/pci/pcie/Makefile                     |   2 +-
 drivers/pci/pcie/bwctrl.c                     | 366 ++++++++++++++++++
 drivers/pci/pcie/portdrv.c                    |   9 +-
 drivers/pci/pcie/portdrv.h                    |   6 +-
 drivers/pci/probe.c                           |  15 +-
 drivers/pci/quirks.c                          |  32 +-
 drivers/thermal/Kconfig                       |   9 +
 drivers/thermal/Makefile                      |   2 +
 drivers/thermal/pcie_cooling.c                |  80 ++++
 include/linux/pci-bwctrl.h                    |  28 ++
 include/linux/pci.h                           |  24 +-
 include/uapi/linux/pci_regs.h                 |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
 .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++
 .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++
 22 files changed, 843 insertions(+), 53 deletions(-)
 create mode 100644 drivers/pci/pcie/bwctrl.c
 create mode 100644 drivers/thermal/pcie_cooling.c
 create mode 100644 include/linux/pci-bwctrl.h
 create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
 create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
 create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh

-- 
2.39.5


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v9 1/9] Documentation PCI: Reformat RMW ops documentation
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 2/9] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Jonathan Corbet, linux-doc, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Ilpo Järvinen, Jonathan Cameron

Extract the list of RMW protected PCIe Capability registers into a
bullet list to make them easier to pick up on a glance. An upcoming
change is going to add one more register among them so it will be much
cleaner to have them as bullets.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Documentation/PCI/pciebus-howto.rst | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/PCI/pciebus-howto.rst b/Documentation/PCI/pciebus-howto.rst
index f344452651e1..e48d01422efc 100644
--- a/Documentation/PCI/pciebus-howto.rst
+++ b/Documentation/PCI/pciebus-howto.rst
@@ -217,8 +217,11 @@ capability structure except the PCI Express capability structure,
 that is shared between many drivers including the service drivers.
 RMW Capability accessors (pcie_capability_clear_and_set_word(),
 pcie_capability_set_word(), and pcie_capability_clear_word()) protect
-a selected set of PCI Express Capability Registers (Link Control
-Register and Root Control Register). Any change to those registers
-should be performed using RMW accessors to avoid problems due to
-concurrent updates. For the up-to-date list of protected registers,
-see pcie_capability_clear_and_set_word().
+a selected set of PCI Express Capability Registers:
+
+* Link Control Register
+* Root Control Register
+
+Any change to those registers should be performed using RMW accessors to
+avoid problems due to concurrent updates. For the up-to-date list of
+protected registers, see pcie_capability_clear_and_set_word().
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 2/9] PCI: Protect Link Control 2 Register with RMW locking
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 1/9] Documentation PCI: Reformat RMW ops documentation Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Jonathan Corbet, linux-doc, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Ilpo Järvinen, Jonathan Cameron

PCIe Bandwidth Controller performs RMW accesses the Link Control 2
Register which can occur concurrently to other sources of Link Control
2 Register writes. Therefore, add Link Control 2 Register among the PCI
Express Capability Registers that need RMW locking.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Documentation/PCI/pciebus-howto.rst | 1 +
 include/linux/pci.h                 | 1 +
 2 files changed, 2 insertions(+)

diff --git a/Documentation/PCI/pciebus-howto.rst b/Documentation/PCI/pciebus-howto.rst
index e48d01422efc..375d9ce171f6 100644
--- a/Documentation/PCI/pciebus-howto.rst
+++ b/Documentation/PCI/pciebus-howto.rst
@@ -221,6 +221,7 @@ a selected set of PCI Express Capability Registers:
 
 * Link Control Register
 * Root Control Register
+* Link Control 2 Register
 
 Any change to those registers should be performed using RMW accessors to
 avoid problems due to concurrent updates. For the up-to-date list of
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 573b4c4c2be6..be5ed534c39c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1274,6 +1274,7 @@ static inline int pcie_capability_clear_and_set_word(struct pci_dev *dev,
 {
 	switch (pos) {
 	case PCI_EXP_LNKCTL:
+	case PCI_EXP_LNKCTL2:
 	case PCI_EXP_RTCTL:
 		return pcie_capability_clear_and_set_word_locked(dev, pos,
 								 clear, set);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 1/9] Documentation PCI: Reformat RMW ops documentation Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 2/9] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-11-11 13:23   ` Lukas Wunner
  2024-10-18 14:47 ` [PATCH v9 4/9] PCI: Refactor pcie_update_link_speed() Ilpo Järvinen
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Ilpo Järvinen, Jonathan Cameron

PCIe bandwidth controller added by a subsequent commit will require
selecting PCIe Link Speeds that are lower than the Maximum Link Speed.

The struct pci_bus only stores max_bus_speed. Even if PCIe r6.1 sec
8.2.1 currently disallows gaps in supported Link Speeds, the
Implementation Note in PCIe r6.1 sec 7.5.3.18, recommends determining
supported Link Speeds using the Supported Link Speeds Vector in the
Link Capabilities 2 Register (when available) to "avoid software being
confused if a future specification defines Links that do not require
support for all slower speeds."

Reuse code in pcie_get_speed_cap() to add pcie_get_supported_speeds()
to query the Supported Link Speeds Vector of a PCIe device. The value
is taken directly from the Supported Link Speeds Vector or synthetized
from the Max Link Speed in the Link Capabilities Register when the Link
Capabilities 2 Register is not available.

The Supported Link Speeds Vector in the Link Capabilities Register 2
corresponds to the bus below on Root Ports and Downstream Ports,
whereas it corresponds to the bus above on Upstream Ports and
Endpoints (PCIe r6.1 sec 7.5.3.18):

	"Supported Link Speeds Vector - This field indicates the
	supported Link speed(s) of the associated Port."

Add supported_speeds into the struct pci_dev that caches the
Supported Link Speeds Vector.

supported_speeds contains a set of Link Speeds only in the case where
PCIe Link Speed can be determined. The Root Complex Integrated
Endpoints do not have a well-defined Link Speed because they do not
seem to implement either of the Link Capabilities Registers, which is
allowed by PCIe r6.1 sec 7.5.3 (the same limitation applies to
determining cur_bus_speed and max_bus_speed that are PCI_SPEED_UNKNOWN
in such case). This is of no concern from PCIe bandwidth controller
point of view because such devices are not attached into a PCIe Root
Port that could be controlled.

supported_speeds field keeps the extra reserved zero at the least
significant bit to match the Link Capabilities 2 Register layout.

An attempt was made to store supported_speeds field into the struct
pci_bus as an intersection of the both ends of the Link, however, the
subordinate struct pci_bus is not available early enough. The Target
Speed quirk (in pcie_failed_link_retrain()) can run either during
initial scan or later requiring it to use the API PCIe bandwidth
controller provides to set the Target Link Speed in order to co-exist
with the bandwidth controller. When the Target Speed quirk is calling
the bandwidth controller during initial scan, the struct pci_bus is not
yet initialized. As such, storing supported_speeds into the struct
pci_bus is not viable.

Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/pci.c             | 60 +++++++++++++++++++++++++----------
 drivers/pci/probe.c           |  3 ++
 include/linux/pci.h           | 11 ++++++-
 include/uapi/linux/pci_regs.h |  1 +
 4 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7d85c04fbba2..6230bb8e9f06 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6189,38 +6189,64 @@ u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 EXPORT_SYMBOL(pcie_bandwidth_available);
 
 /**
- * pcie_get_speed_cap - query for the PCI device's link speed capability
+ * pcie_get_supported_speeds - query Supported Link Speed Vector
  * @dev: PCI device to query
  *
- * Query the PCI device speed capability.  Return the maximum link speed
- * supported by the device.
+ * Query @dev supported link speeds.
+ *
+ * Implementation Note in PCIe r6.0.1 sec 7.5.3.18 recommends determining
+ * supported link speeds using the Supported Link Speeds Vector in the Link
+ * Capabilities 2 Register (when available).
+ *
+ * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.
+ *
+ * Without Link Capabilities 2, i.e., prior to PCIe r3.0, Supported Link
+ * Speeds field in Link Capabilities is used and only 2.5 GT/s and 5.0 GT/s
+ * speeds were defined.
+ *
+ * For @dev without Supported Link Speed Vector, the field is synthetized
+ * from the Max Link Speed field in the Link Capabilities Register.
+ *
+ * Return: Supported Link Speeds Vector (+ reserved 0 at LSB).
  */
-enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
+u8 pcie_get_supported_speeds(struct pci_dev *dev)
 {
 	u32 lnkcap2, lnkcap;
+	u8 speeds;
 
+	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
 	/*
-	 * Link Capabilities 2 was added in PCIe r3.0, sec 7.8.18.  The
-	 * implementation note there recommends using the Supported Link
-	 * Speeds Vector in Link Capabilities 2 when supported.
-	 *
-	 * Without Link Capabilities 2, i.e., prior to PCIe r3.0, software
-	 * should use the Supported Link Speeds field in Link Capabilities,
-	 * where only 2.5 GT/s and 5.0 GT/s speeds were defined.
+	 * Speeds retain the reserved 0 at LSB before PCIe Supported Link
+	 * Speeds Vector to allow using SLS Vector bit defines directly.
 	 */
-	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP2, &lnkcap2);
+	speeds = lnkcap2 & PCI_EXP_LNKCAP2_SLS;
 
 	/* PCIe r3.0-compliant */
-	if (lnkcap2)
-		return PCIE_LNKCAP2_SLS2SPEED(lnkcap2);
+	if (speeds)
+		return speeds;
 
 	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+
+	/* Synthetize from the Max Link Speed field */
 	if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_5_0GB)
-		return PCIE_SPEED_5_0GT;
+		speeds = PCI_EXP_LNKCAP2_SLS_5_0GB | PCI_EXP_LNKCAP2_SLS_2_5GB;
 	else if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
-		return PCIE_SPEED_2_5GT;
+		speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
+
+	return speeds;
+}
 
-	return PCI_SPEED_UNKNOWN;
+/**
+ * pcie_get_speed_cap - query for the PCI device's link speed capability
+ * @dev: PCI device to query
+ *
+ * Query the PCI device speed capability.
+ *
+ * Return: the maximum link speed supported by the device.
+ */
+enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev)
+{
+	return PCIE_LNKCAP2_SLS2SPEED(dev->supported_speeds);
 }
 EXPORT_SYMBOL(pcie_get_speed_cap);
 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4f68414c3086..af153a8e8225 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1947,6 +1947,9 @@ int pci_setup_device(struct pci_dev *dev)
 
 	set_pcie_untrusted(dev);
 
+	if (pci_is_pcie(dev))
+		dev->supported_speeds = pcie_get_supported_speeds(dev);
+
 	/* "Unknown power state" */
 	dev->current_state = PCI_UNKNOWN;
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index be5ed534c39c..a02b77fe7865 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -303,6 +303,7 @@ enum pci_bus_speed {
 	PCI_SPEED_UNKNOWN		= 0xff,
 };
 
+u8 pcie_get_supported_speeds(struct pci_dev *dev);
 enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
 enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
 
@@ -318,7 +319,14 @@ struct pci_sriov;
 struct pci_p2pdma;
 struct rcec_ea;
 
-/* The pci_dev structure describes PCI devices */
+/* struct pci_dev - describes a PCI device
+ *
+ * @supported_speeds:	PCIe Supported Link Speeds Vector (+ reserved 0 at
+ *			LSB). 0 when the supported speeds cannot be
+ *			determined (e.g., for Root Complex Integrated
+ *			Endpoints without the relevant Capability
+ *			Registers).
+ */
 struct pci_dev {
 	struct list_head bus_list;	/* Node in per-bus list */
 	struct pci_bus	*bus;		/* Bus this device is on */
@@ -522,6 +530,7 @@ struct pci_dev {
 	struct npem	*npem;		/* Native PCIe Enclosure Management */
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
+	u8		supported_speeds; /* Supported Link Speeds Vector */
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	/*
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 12323b3334a9..f3c9de0a497c 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -678,6 +678,7 @@
 #define PCI_EXP_DEVSTA2		0x2a	/* Device Status 2 */
 #define PCI_CAP_EXP_RC_ENDPOINT_SIZEOF_V2 0x2c	/* end of v2 EPs w/o link */
 #define PCI_EXP_LNKCAP2		0x2c	/* Link Capabilities 2 */
+#define  PCI_EXP_LNKCAP2_SLS		0x000000fe /* Supported Link Speeds Vector */
 #define  PCI_EXP_LNKCAP2_SLS_2_5GB	0x00000002 /* Supported Speed 2.5GT/s */
 #define  PCI_EXP_LNKCAP2_SLS_5_0GB	0x00000004 /* Supported Speed 5GT/s */
 #define  PCI_EXP_LNKCAP2_SLS_8_0GB	0x00000008 /* Supported Speed 8GT/s */
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 4/9] PCI: Refactor pcie_update_link_speed()
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (2 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 5/9] PCI/quirks: Abstract LBMS seen check into own function Ilpo Järvinen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Ilpo Järvinen, Jonathan Cameron

pcie_update_link_speed() is passed the Link Status register but not all
callers have that value at hand nor need the value.

Refactor pcie_update_link_speed() to include reading the Link Status
register and create __pcie_update_link_speed() which can be used by the
hotplug code that has the register value at hand beforehand (and needs
the value for other purposes).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/hotplug/pciehp_hpc.c |  2 +-
 drivers/pci/pci.h                |  7 ++++++-
 drivers/pci/probe.c              | 12 +++++++-----
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 736ad8baa2a5..bb5a8d9f03ad 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -319,7 +319,7 @@ int pciehp_check_link_status(struct controller *ctrl)
 		return -1;
 	}
 
-	pcie_update_link_speed(ctrl->pcie->port->subordinate, lnk_status);
+	__pcie_update_link_speed(ctrl->pcie->port->subordinate, lnk_status);
 
 	if (!found) {
 		ctrl_info(ctrl, "Slot(%s): No device found\n",
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 14d00ce45bfa..6e827d47aa85 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -378,7 +378,12 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
 enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
 void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
 void pcie_report_downtraining(struct pci_dev *dev);
-void pcie_update_link_speed(struct pci_bus *bus, u16 link_status);
+
+static inline void __pcie_update_link_speed(struct pci_bus *bus, u16 linksta)
+{
+	bus->cur_bus_speed = pcie_link_speed[linksta & PCI_EXP_LNKSTA_CLS];
+}
+void pcie_update_link_speed(struct pci_bus *bus);
 
 /* Single Root I/O Virtualization */
 struct pci_sriov {
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index af153a8e8225..c138daf78961 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -742,9 +742,13 @@ const char *pci_speed_string(enum pci_bus_speed speed)
 }
 EXPORT_SYMBOL_GPL(pci_speed_string);
 
-void pcie_update_link_speed(struct pci_bus *bus, u16 linksta)
+void pcie_update_link_speed(struct pci_bus *bus)
 {
-	bus->cur_bus_speed = pcie_link_speed[linksta & PCI_EXP_LNKSTA_CLS];
+	struct pci_dev *bridge = bus->self;
+	u16 linksta;
+
+	pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
+	__pcie_update_link_speed(bus, linksta);
 }
 EXPORT_SYMBOL_GPL(pcie_update_link_speed);
 
@@ -827,13 +831,11 @@ static void pci_set_bus_speed(struct pci_bus *bus)
 
 	if (pci_is_pcie(bridge)) {
 		u32 linkcap;
-		u16 linksta;
 
 		pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
 		bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
 
-		pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
-		pcie_update_link_speed(bus, linksta);
+		pcie_update_link_speed(bus);
 	}
 }
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 5/9] PCI/quirks: Abstract LBMS seen check into own function
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (3 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 4/9] PCI: Refactor pcie_update_link_speed() Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-10-18 14:47 ` [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Ilpo Järvinen, Jonathan Cameron

The Target Speed quirk in pcie_failed_link_retrain() uses the presence
of LBMS bit as one of the triggering conditions effectively
monopolizing the use of that bit. An upcoming change will introduce a
PCIe bandwidth controller which sets up an interrupt to track LBMS. As
LBMS will be cleared by the interrupt handler, the Target Speed quirk
will no longer be able to observe LBMS directly.

As a preparatory step for the change, extract the LBMS seen check from
pcie_failed_link_retrain() into a new function pcie_lmbs_seen().

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/quirks.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index dccb60c1d9cc..a560ea403b8e 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -33,6 +33,11 @@
 #include <linux/switchtec.h>
 #include "pci.h"
 
+static bool pcie_lbms_seen(struct pci_dev *dev, u16 lnksta)
+{
+	return lnksta & PCI_EXP_LNKSTA_LBMS;
+}
+
 /*
  * Retrain the link of a downstream PCIe port by hand if necessary.
  *
@@ -96,8 +101,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
 
 	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
 	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
-	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
-	    PCI_EXP_LNKSTA_LBMS) {
+	if (!(lnksta & PCI_EXP_LNKSTA_DLLLA) && pcie_lbms_seen(dev, lnksta)) {
 		u16 oldlnkctl2 = lnkctl2;
 
 		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (4 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 5/9] PCI/quirks: Abstract LBMS seen check into own function Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-12-06 18:12   ` Niklas Schnelle
  2024-10-18 14:47 ` [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed Ilpo Järvinen
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Ilpo Järvinen, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Jonathan Cameron

This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
bandwidth notification"). An upcoming commit extends this driver
building PCIe bandwidth controller on top of it.

The PCIe bandwidth notification were first added in the commit
e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
notification") but later had to be removed. The significant changes
compared with the old bandwidth notification driver include:

1) Don't print the notifications into kernel log, just keep the Link
   Speed cached in struct pci_bus updated. While somewhat unfortunate,
   the log spam was the source of complaints that eventually lead to
   the removal of the bandwidth notifications driver (see the links
   below for further information).

2) Besides the Link Bandwidth Management Interrupt, enable also Link
   Autonomous Bandwidth Interrupt to cover the other source of
   bandwidth changes.

3) Use threaded IRQ with IRQF_ONESHOT to handle Bandwidth Notification
   Interrupts to address the problem fixed in the commit 3e82a7f9031f
   ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked")).

4) Handle Link Speed updates robustly. Refresh the cached Link Speed
   when enabling Bandwidth Notification Interrupts, and solve the race
   between Link Speed read and LBMS/LABS update in
   pcie_bwnotif_irq_thread().

5) Use concurrency safe LNKCTL RMW operations.

6) The driver is now called PCIe bwctrl (bandwidth controller) instead
   of just bandwidth notifications because of increased scope and
   functionality within the driver.

7) Coexist with the Target Link Speed quirk in
   pcie_failed_link_retrain(). Provide LBMS counting API for it.

8) Tweaks to variable/functions names for consistency and length
   reasons.

Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus
to keep track PCIe Link Speed changes.

Link: https://lore.kernel.org/all/20190429185611.121751-1-helgaas@kernel.org/
Link: https://lore.kernel.org/linux-pci/20190501142942.26972-1-keith.busch@intel.com/
Link: https://lore.kernel.org/linux-pci/20200115221008.GA191037@google.com/
Suggested-by: Lukas Wunner <lukas@wunner.de> # Building bwctrl on top of bwnotif
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 MAINTAINERS                       |   6 +
 drivers/pci/hotplug/pciehp_ctrl.c |   5 +
 drivers/pci/pci.c                 |   2 +-
 drivers/pci/pci.h                 |  11 ++
 drivers/pci/pcie/Makefile         |   2 +-
 drivers/pci/pcie/bwctrl.c         | 187 ++++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.c        |   9 +-
 drivers/pci/pcie/portdrv.h        |   6 +-
 drivers/pci/quirks.c              |   9 +-
 include/linux/pci.h               |   2 +
 10 files changed, 230 insertions(+), 9 deletions(-)
 create mode 100644 drivers/pci/pcie/bwctrl.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c27f3190737f..8c555b3325d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17933,6 +17933,12 @@ F:	include/linux/of_pci.h
 F:	include/linux/pci*
 F:	include/uapi/linux/pci*
 
+PCIE BANDWIDTH CONTROLLER
+M:	Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
+L:	linux-pci@vger.kernel.org
+S:	Supported
+F:	drivers/pci/pcie/bwctrl.c
+
 PCIE DRIVER FOR AMAZON ANNAPURNA LABS
 M:	Jonathan Chocron <jonnyc@amazon.com>
 L:	linux-pci@vger.kernel.org
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index dcdbfcf404dd..d603a7aa7483 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -19,6 +19,8 @@
 #include <linux/types.h>
 #include <linux/pm_runtime.h>
 #include <linux/pci.h>
+
+#include "../pci.h"
 #include "pciehp.h"
 
 /* The following routines constitute the bulk of the
@@ -127,6 +129,9 @@ static void remove_board(struct controller *ctrl, bool safe_removal)
 
 	pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF,
 			      INDICATOR_NOOP);
+
+	/* Don't carry LBMS indications across */
+	pcie_reset_lbms_count(ctrl->pcie->port);
 }
 
 static int pciehp_enable_slot(struct controller *ctrl);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 6230bb8e9f06..05c9db199105 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4740,7 +4740,7 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
 	 * to track link speed or width changes made by hardware itself
 	 * in attempt to correct unreliable link operation.
 	 */
-	pcie_capability_write_word(pdev, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
+	pcie_reset_lbms_count(pdev);
 	return rc;
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e827d47aa85..33ed324d1953 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -697,6 +697,17 @@ static inline void pcie_set_ecrc_checking(struct pci_dev *dev) { }
 static inline void pcie_ecrc_get_policy(char *str) { }
 #endif
 
+#ifdef CONFIG_PCIEPORTBUS
+void pcie_reset_lbms_count(struct pci_dev *port);
+int pcie_lbms_count(struct pci_dev *port, unsigned long *val);
+#else
+static inline void pcie_reset_lbms_count(struct pci_dev *port) {}
+static inline int pcie_lbms_count(struct pci_dev *port, unsigned long *val)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 struct pci_dev_reset_methods {
 	u16 vendor;
 	u16 device;
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 6461aa93fe76..53ccab62314d 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -4,7 +4,7 @@
 
 pcieportdrv-y			:= portdrv.o rcec.o
 
-obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
+obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
 
 obj-y				+= aspm.o
 obj-$(CONFIG_PCIEAER)		+= aer.o err.o
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
new file mode 100644
index 000000000000..b31a31453872
--- /dev/null
+++ b/drivers/pci/pcie/bwctrl.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * PCIe bandwidth controller
+ *
+ * Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
+ *
+ * Copyright (C) 2019 Dell Inc
+ * Copyright (C) 2023-2024 Intel Corporation
+ *
+ * This service port driver hooks into the Bandwidth Notification interrupt
+ * watching for changes or links becoming degraded in operation. It updates
+ * the cached Current Link Speed that is exposed to user space through sysfs.
+ */
+
+#define dev_fmt(fmt) "bwctrl: " fmt
+
+#include <linux/atomic.h>
+#include <linux/cleanup.h>
+#include <linux/errno.h>
+#include <linux/interrupt.h>
+#include <linux/pci.h>
+#include <linux/rwsem.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "../pci.h"
+#include "portdrv.h"
+
+/**
+ * struct pcie_bwctrl_data - PCIe bandwidth controller
+ * @lbms_count:		Count for LBMS (since last reset)
+ */
+struct pcie_bwctrl_data {
+	atomic_t lbms_count;
+};
+
+/* Prevents port removal during LBMS count accessors */
+static DECLARE_RWSEM(pcie_bwctrl_lbms_rwsem);
+
+static void pcie_bwnotif_enable(struct pcie_device *srv)
+{
+	struct pcie_bwctrl_data *data = srv->port->link_bwctrl;
+	struct pci_dev *port = srv->port;
+	u16 link_status;
+	int ret;
+
+	/* Count LBMS seen so far as one */
+	ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+	if (ret == PCIBIOS_SUCCESSFUL && link_status & PCI_EXP_LNKSTA_LBMS)
+		atomic_inc(&data->lbms_count);
+
+	pcie_capability_set_word(port, PCI_EXP_LNKCTL,
+				 PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);
+	pcie_capability_write_word(port, PCI_EXP_LNKSTA,
+				   PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS);
+
+	/*
+	 * Update after enabling notifications & clearing status bits ensures
+	 * link speed is up to date.
+	 */
+	pcie_update_link_speed(port->subordinate);
+}
+
+static void pcie_bwnotif_disable(struct pci_dev *port)
+{
+	pcie_capability_clear_word(port, PCI_EXP_LNKCTL,
+				   PCI_EXP_LNKCTL_LBMIE | PCI_EXP_LNKCTL_LABIE);
+}
+
+static irqreturn_t pcie_bwnotif_irq_thread(int irq, void *context)
+{
+	struct pcie_device *srv = context;
+	struct pcie_bwctrl_data *data = srv->port->link_bwctrl;
+	struct pci_dev *port = srv->port;
+	u16 link_status, events;
+	int ret;
+
+	ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+	if (ret != PCIBIOS_SUCCESSFUL)
+		return IRQ_NONE;
+
+	events = link_status & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_LABS);
+	if (!events)
+		return IRQ_NONE;
+
+	if (events & PCI_EXP_LNKSTA_LBMS)
+		atomic_inc(&data->lbms_count);
+
+	pcie_capability_write_word(port, PCI_EXP_LNKSTA, events);
+
+	/*
+	 * Interrupts will not be triggered from any further Link Speed
+	 * change until LBMS is cleared by the write. Therefore, re-read the
+	 * speed (inside pcie_update_link_speed()) after LBMS has been
+	 * cleared to avoid missing link speed changes.
+	 */
+	pcie_update_link_speed(port->subordinate);
+
+	return IRQ_HANDLED;
+}
+
+void pcie_reset_lbms_count(struct pci_dev *port)
+{
+	struct pcie_bwctrl_data *data;
+
+	guard(rwsem_read)(&pcie_bwctrl_lbms_rwsem);
+	data = port->link_bwctrl;
+	if (data)
+		atomic_set(&data->lbms_count, 0);
+	else
+		pcie_capability_write_word(port, PCI_EXP_LNKSTA, PCI_EXP_LNKSTA_LBMS);
+}
+
+int pcie_lbms_count(struct pci_dev *port, unsigned long *val)
+{
+	struct pcie_bwctrl_data *data;
+
+	guard(rwsem_read)(&pcie_bwctrl_lbms_rwsem);
+	data = port->link_bwctrl;
+	if (!data)
+		return -ENOTTY;
+
+	*val = atomic_read(&data->lbms_count);
+
+	return 0;
+}
+
+static int pcie_bwnotif_probe(struct pcie_device *srv)
+{
+	struct pci_dev *port = srv->port;
+	int ret;
+
+	struct pcie_bwctrl_data *data = devm_kzalloc(&srv->device, sizeof(*data),
+						     GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
+					pcie_bwnotif_irq_thread,
+					IRQF_SHARED | IRQF_ONESHOT,
+					"PCIe bwctrl", srv);
+	if (ret)
+		return ret;
+
+	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
+		port->link_bwctrl = no_free_ptr(data);
+		pcie_bwnotif_enable(srv);
+	}
+
+	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
+
+	return 0;
+}
+
+static void pcie_bwnotif_remove(struct pcie_device *srv)
+{
+	pcie_bwnotif_disable(srv->port);
+	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+		srv->port->link_bwctrl = NULL;
+}
+
+static int pcie_bwnotif_suspend(struct pcie_device *srv)
+{
+	pcie_bwnotif_disable(srv->port);
+	return 0;
+}
+
+static int pcie_bwnotif_resume(struct pcie_device *srv)
+{
+	pcie_bwnotif_enable(srv);
+	return 0;
+}
+
+static struct pcie_port_service_driver pcie_bwctrl_driver = {
+	.name		= "pcie_bwctrl",
+	.port_type	= PCIE_ANY_PORT,
+	.service	= PCIE_PORT_SERVICE_BWCTRL,
+	.probe		= pcie_bwnotif_probe,
+	.suspend	= pcie_bwnotif_suspend,
+	.resume		= pcie_bwnotif_resume,
+	.remove		= pcie_bwnotif_remove,
+};
+
+int __init pcie_bwctrl_init(void)
+{
+	return pcie_port_service_register(&pcie_bwctrl_driver);
+}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 6af5e0425872..5e10306b6308 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -68,7 +68,7 @@ static int pcie_message_numbers(struct pci_dev *dev, int mask,
 	 */
 
 	if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
-		    PCIE_PORT_SERVICE_BWNOTIF)) {
+		    PCIE_PORT_SERVICE_BWCTRL)) {
 		pcie_capability_read_word(dev, PCI_EXP_FLAGS, &reg16);
 		*pme = FIELD_GET(PCI_EXP_FLAGS_IRQ, reg16);
 		nvec = *pme + 1;
@@ -150,11 +150,11 @@ static int pcie_port_enable_irq_vec(struct pci_dev *dev, int *irqs, int mask)
 
 	/* PME, hotplug and bandwidth notification share an MSI/MSI-X vector */
 	if (mask & (PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP |
-		    PCIE_PORT_SERVICE_BWNOTIF)) {
+		    PCIE_PORT_SERVICE_BWCTRL)) {
 		pcie_irq = pci_irq_vector(dev, pme);
 		irqs[PCIE_PORT_SERVICE_PME_SHIFT] = pcie_irq;
 		irqs[PCIE_PORT_SERVICE_HP_SHIFT] = pcie_irq;
-		irqs[PCIE_PORT_SERVICE_BWNOTIF_SHIFT] = pcie_irq;
+		irqs[PCIE_PORT_SERVICE_BWCTRL_SHIFT] = pcie_irq;
 	}
 
 	if (mask & PCIE_PORT_SERVICE_AER)
@@ -271,7 +271,7 @@ static int get_port_device_capability(struct pci_dev *dev)
 
 		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &linkcap);
 		if (linkcap & PCI_EXP_LNKCAP_LBNC)
-			services |= PCIE_PORT_SERVICE_BWNOTIF;
+			services |= PCIE_PORT_SERVICE_BWCTRL;
 	}
 
 	return services;
@@ -828,6 +828,7 @@ static void __init pcie_init_services(void)
 	pcie_aer_init();
 	pcie_pme_init();
 	pcie_dpc_init();
+	pcie_bwctrl_init();
 	pcie_hp_init();
 }
 
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 12c89ea0313b..bd29d1cc7b8b 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -20,8 +20,8 @@
 #define PCIE_PORT_SERVICE_HP		(1 << PCIE_PORT_SERVICE_HP_SHIFT)
 #define PCIE_PORT_SERVICE_DPC_SHIFT	3	/* Downstream Port Containment */
 #define PCIE_PORT_SERVICE_DPC		(1 << PCIE_PORT_SERVICE_DPC_SHIFT)
-#define PCIE_PORT_SERVICE_BWNOTIF_SHIFT	4	/* Bandwidth notification */
-#define PCIE_PORT_SERVICE_BWNOTIF	(1 << PCIE_PORT_SERVICE_BWNOTIF_SHIFT)
+#define PCIE_PORT_SERVICE_BWCTRL_SHIFT	4	/* Bandwidth Controller (notifications) */
+#define PCIE_PORT_SERVICE_BWCTRL	(1 << PCIE_PORT_SERVICE_BWCTRL_SHIFT)
 
 #define PCIE_PORT_DEVICE_MAXSERVICES   5
 
@@ -51,6 +51,8 @@ int pcie_dpc_init(void);
 static inline int pcie_dpc_init(void) { return 0; }
 #endif
 
+int pcie_bwctrl_init(void);
+
 /* Port Type */
 #define PCIE_ANY_PORT			(~0)
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index a560ea403b8e..e6d502dca939 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -35,7 +35,14 @@
 
 static bool pcie_lbms_seen(struct pci_dev *dev, u16 lnksta)
 {
-	return lnksta & PCI_EXP_LNKSTA_LBMS;
+	unsigned long count;
+	int ret;
+
+	ret = pcie_lbms_count(dev, &count);
+	if (ret < 0)
+		return lnksta & PCI_EXP_LNKSTA_LBMS;
+
+	return count > 0;
 }
 
 /*
diff --git a/include/linux/pci.h b/include/linux/pci.h
index a02b77fe7865..5f9de226be13 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -314,6 +314,7 @@ struct pci_vpd {
 };
 
 struct irq_affinity;
+struct pcie_bwctrl_data;
 struct pcie_link_state;
 struct pci_sriov;
 struct pci_p2pdma;
@@ -503,6 +504,7 @@ struct pci_dev {
 	unsigned int	dpc_rp_extensions:1;
 	u8		dpc_rp_log_size;
 #endif
+	struct pcie_bwctrl_data		*link_bwctrl;
 #ifdef CONFIG_PCI_ATS
 	union {
 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (5 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-11-12 15:47   ` Lukas Wunner
  2025-01-02 10:38   ` Lukas Wunner
  2024-10-18 14:47 ` [PATCH v9 8/9] thermal: Add PCIe cooling driver Ilpo Järvinen
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Ilpo Järvinen, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Jonathan Cameron

Currently, PCIe Link Speeds are adjusted by custom code rather than in
a common function provided in PCI core. PCIe bandwidth controller
(bwctrl) introduces an in-kernel API to set PCIe Link Speed.

Convert Target Speed quirk to use the new API. The Target Speed quirk
runs very early when bwctrl is not yet probed for a Port and can also
run later when bwctrl is already setup for the Port, which requires the
per port mutex (set_speed_mutex) to be only taken if the bwctrl setup
is already complete.

The new API is also intended to be used in an upcoming commit that adds
a thermal cooling device to throttle PCIe bandwidth when thermal
thresholds are reached.

The PCIe bandwidth control procedure is as follows. The highest speed
supported by the Port and the PCIe device which is not higher than the
requested speed is selected and written into the Target Link Speed in
the Link Control 2 Register. Then bandwidth controller retrains the
PCIe Link.

Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus
to keep track PCIe Link Speed changes. While Bandwidth Notifications
should also be generated when bandwidth controller alters the PCIe Link
Speed, a few platforms do not deliver LMBS interrupt after Link
Training as expected. Thus, after changing the Link Speed, bandwidth
controller makes additional read for the Link Status Register to ensure
cur_bus_speed is consistent with the new PCIe Link Speed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/pci.h         |  20 +++++
 drivers/pci/pcie/bwctrl.c | 178 ++++++++++++++++++++++++++++++++++++--
 drivers/pci/quirks.c      |  17 +---
 include/linux/pci.h       |  10 +++
 4 files changed, 206 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 33ed324d1953..c8ea672c1892 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -331,6 +331,17 @@ void pci_disable_bridge_window(struct pci_dev *dev);
 struct pci_bus *pci_bus_get(struct pci_bus *bus);
 void pci_bus_put(struct pci_bus *bus);
 
+#define PCIE_LNKCAP_SLS2SPEED(lnkcap)					\
+({									\
+	((lnkcap) == PCI_EXP_LNKCAP_SLS_64_0GB ? PCIE_SPEED_64_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_32_0GB ? PCIE_SPEED_32_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_16_0GB ? PCIE_SPEED_16_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_8_0GB ? PCIE_SPEED_8_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_5_0GB ? PCIE_SPEED_5_0GT :	\
+	 (lnkcap) == PCI_EXP_LNKCAP_SLS_2_5GB ? PCIE_SPEED_2_5GT :	\
+	 PCI_SPEED_UNKNOWN);						\
+})
+
 /* PCIe link information from Link Capabilities 2 */
 #define PCIE_LNKCAP2_SLS2SPEED(lnkcap2) \
 	((lnkcap2) & PCI_EXP_LNKCAP2_SLS_64_0GB ? PCIE_SPEED_64_0GT : \
@@ -341,6 +352,15 @@ void pci_bus_put(struct pci_bus *bus);
 	 (lnkcap2) & PCI_EXP_LNKCAP2_SLS_2_5GB ? PCIE_SPEED_2_5GT : \
 	 PCI_SPEED_UNKNOWN)
 
+#define PCIE_LNKCTL2_TLS2SPEED(lnkctl2) \
+	((lnkctl2) == PCI_EXP_LNKCTL2_TLS_64_0GT ? PCIE_SPEED_64_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_32_0GT ? PCIE_SPEED_32_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_16_0GT ? PCIE_SPEED_16_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_8_0GT ? PCIE_SPEED_8_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_5_0GT ? PCIE_SPEED_5_0GT : \
+	 (lnkctl2) == PCI_EXP_LNKCTL2_TLS_2_5GT ? PCIE_SPEED_2_5GT : \
+	 PCI_SPEED_UNKNOWN)
+
 /* PCIe speed to Mb/s reduced by encoding overhead */
 #define PCIE_SPEED2MBS_ENC(speed) \
 	((speed) == PCIE_SPEED_64_0GT ? 64000*1/1 : \
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index b31a31453872..8a2bd1e887e2 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -7,6 +7,11 @@
  * Copyright (C) 2019 Dell Inc
  * Copyright (C) 2023-2024 Intel Corporation
  *
+ * The PCIe bandwidth controller provides a way to alter PCIe Link Speeds
+ * and notify the operating system when the Link Width or Speed changes. The
+ * notification capability is required for all Root Ports and Downstream
+ * Ports supporting Link Width wider than x1 and/or multiple Link Speeds.
+ *
  * This service port driver hooks into the Bandwidth Notification interrupt
  * watching for changes or links becoming degraded in operation. It updates
  * the cached Current Link Speed that is exposed to user space through sysfs.
@@ -15,9 +20,12 @@
 #define dev_fmt(fmt) "bwctrl: " fmt
 
 #include <linux/atomic.h>
+#include <linux/bitops.h>
+#include <linux/bits.h>
 #include <linux/cleanup.h>
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/mutex.h>
 #include <linux/pci.h>
 #include <linux/rwsem.h>
 #include <linux/slab.h>
@@ -28,14 +36,167 @@
 
 /**
  * struct pcie_bwctrl_data - PCIe bandwidth controller
+ * @set_speed_mutex:	Serializes link speed changes
  * @lbms_count:		Count for LBMS (since last reset)
  */
 struct pcie_bwctrl_data {
+	struct mutex set_speed_mutex;
 	atomic_t lbms_count;
 };
 
-/* Prevents port removal during LBMS count accessors */
+/*
+ * Prevent port removal during LBMS count accessors and Link Speed changes.
+ *
+ * These have to be differentiated because pcie_bwctrl_change_speed() calls
+ * pcie_retrain_link() which uses LBMS count reset accessor on success
+ * (using just one rwsem triggers "possible recursive locking detected"
+ * warning).
+ */
 static DECLARE_RWSEM(pcie_bwctrl_lbms_rwsem);
+static DECLARE_RWSEM(pcie_bwctrl_setspeed_rwsem);
+
+static bool pcie_valid_speed(enum pci_bus_speed speed)
+{
+	return (speed >= PCIE_SPEED_2_5GT) && (speed <= PCIE_SPEED_64_0GT);
+}
+
+static u16 pci_bus_speed2lnkctl2(enum pci_bus_speed speed)
+{
+	static const u8 speed_conv[] = {
+		[PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
+		[PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
+		[PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
+		[PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
+		[PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
+		[PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
+	};
+
+	if (WARN_ON_ONCE(!pcie_valid_speed(speed)))
+		return 0;
+
+	return speed_conv[speed];
+}
+
+static inline u16 pcie_supported_speeds2target_speed(u8 supported_speeds)
+{
+	return __fls(supported_speeds);
+}
+
+/**
+ * pcie_bwctrl_select_speed - Select Target Link Speed
+ * @port:	PCIe Port
+ * @speed_req:	requested PCIe Link Speed
+ *
+ * Select Target Link Speed by take into account Supported Link Speeds of
+ * both the Root Port and the Endpoint.
+ *
+ * Return: Target Link Speed (1=2.5GT/s, 2=5GT/s, 3=8GT/s, etc.)
+ */
+static u16 pcie_bwctrl_select_speed(struct pci_dev *port, enum pci_bus_speed speed_req)
+{
+	struct pci_bus *bus = port->subordinate;
+	u8 desired_speeds, supported_speeds;
+	struct pci_dev *dev;
+
+	desired_speeds = GENMASK(pci_bus_speed2lnkctl2(speed_req),
+				 __fls(PCI_EXP_LNKCAP2_SLS_2_5GB));
+
+	supported_speeds = port->supported_speeds;
+	if (bus) {
+		down_read(&pci_bus_sem);
+		dev = list_first_entry_or_null(&bus->devices, struct pci_dev, bus_list);
+		if (dev)
+			supported_speeds &= dev->supported_speeds;
+		up_read(&pci_bus_sem);
+	}
+	if (!supported_speeds)
+		return PCI_EXP_LNKCAP2_SLS_2_5GB;
+
+	return pcie_supported_speeds2target_speed(supported_speeds & desired_speeds);
+}
+
+static int pcie_bwctrl_change_speed(struct pci_dev *port, u16 target_speed, bool use_lt)
+{
+	int ret;
+
+	ret = pcie_capability_clear_and_set_word(port, PCI_EXP_LNKCTL2,
+						 PCI_EXP_LNKCTL2_TLS, target_speed);
+	if (ret != PCIBIOS_SUCCESSFUL)
+		return pcibios_err_to_errno(ret);
+
+	ret = pcie_retrain_link(port, use_lt);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Ensure link speed updates also with platforms that have problems
+	 * with notifications.
+	 */
+	if (port->subordinate)
+		pcie_update_link_speed(port->subordinate);
+
+	return 0;
+}
+
+/**
+ * pcie_set_target_speed - Set downstream Link Speed for PCIe Port
+ * @port:	PCIe Port
+ * @speed_req:	requested PCIe Link Speed
+ * @use_lt:	Wait for the LT or DLLLA bit to detect the end of link training
+ *
+ * Attempts to set PCIe Port Link Speed to @speed_req. @speed_req may be
+ * adjusted downwards to the best speed supported by both the Port and PCIe
+ * Device underneath it.
+ *
+ * Return:
+ * * 0		- on success
+ * * -EINVAL	- @speed_req is not a PCIe Link Speed
+ * * -ENODEV	- @port is not controllable
+ * * -ETIMEDOUT	- changing Link Speed took too long
+ * * -EAGAIN	- Link Speed was changed but @speed_req was not achieved
+ */
+int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
+			  bool use_lt)
+{
+	struct pci_bus *bus = port->subordinate;
+	u16 target_speed;
+	int ret;
+
+	if (WARN_ON_ONCE(!pcie_valid_speed(speed_req)))
+		return -EINVAL;
+
+	if (bus && bus->cur_bus_speed == speed_req)
+		return 0;
+
+	target_speed = pcie_bwctrl_select_speed(port, speed_req);
+
+	scoped_guard(rwsem_read, &pcie_bwctrl_setspeed_rwsem) {
+		struct pcie_bwctrl_data *data = port->link_bwctrl;
+
+		/*
+		 * port->link_bwctrl is NULL during initial scan when called
+		 * e.g. from the Target Speed quirk.
+		 */
+		if (data)
+			mutex_lock(&data->set_speed_mutex);
+
+		ret = pcie_bwctrl_change_speed(port, target_speed, use_lt);
+
+		if (data)
+			mutex_unlock(&data->set_speed_mutex);
+	}
+
+	/*
+	 * Despite setting higher speed into the Target Link Speed, empty
+	 * bus won't train to 5GT+ speeds.
+	 */
+	if (!ret && bus && bus->cur_bus_speed != speed_req &&
+	    !list_empty(&bus->devices))
+		ret = -EAGAIN;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcie_set_target_speed);
 
 static void pcie_bwnotif_enable(struct pcie_device *srv)
 {
@@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
 	if (!data)
 		return -ENOMEM;
 
+	devm_mutex_init(&srv->device, &data->set_speed_mutex);
 	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
 					pcie_bwnotif_irq_thread,
 					IRQF_SHARED | IRQF_ONESHOT,
@@ -142,9 +304,11 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
 	if (ret)
 		return ret;
 
-	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
-		port->link_bwctrl = no_free_ptr(data);
-		pcie_bwnotif_enable(srv);
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
+			port->link_bwctrl = no_free_ptr(data);
+			pcie_bwnotif_enable(srv);
+		}
 	}
 
 	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
@@ -155,8 +319,10 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
 static void pcie_bwnotif_remove(struct pcie_device *srv)
 {
 	pcie_bwnotif_disable(srv->port);
-	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
-		srv->port->link_bwctrl = NULL;
+
+	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+			srv->port->link_bwctrl = NULL;
 }
 
 static int pcie_bwnotif_suspend(struct pcie_device *srv)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index e6d502dca939..dcf1c86a5488 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -113,16 +113,11 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
 
 		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
 
-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
-
-		ret = pcie_retrain_link(dev, false);
+		ret = pcie_set_target_speed(dev, PCIE_SPEED_2_5GT, false);
 		if (ret) {
 			pci_info(dev, "retraining failed\n");
-			pcie_capability_write_word(dev, PCI_EXP_LNKCTL2,
-						   oldlnkctl2);
-			pcie_retrain_link(dev, true);
+			pcie_set_target_speed(dev, PCIE_LNKCTL2_TLS2SPEED(oldlnkctl2),
+					      true);
 			return ret;
 		}
 
@@ -136,11 +131,7 @@ int pcie_failed_link_retrain(struct pci_dev *dev)
 
 		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
 		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
-		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
-		lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
-		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
-
-		ret = pcie_retrain_link(dev, false);
+		ret = pcie_set_target_speed(dev, PCIE_LNKCAP_SLS2SPEED(lnkcap), false);
 		if (ret) {
 			pci_info(dev, "retraining failed\n");
 			return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 5f9de226be13..b5ce9513b06f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1798,9 +1798,19 @@ static inline int pci_irqd_intx_xlate(struct irq_domain *d,
 #ifdef CONFIG_PCIEPORTBUS
 extern bool pcie_ports_disabled;
 extern bool pcie_ports_native;
+
+int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
+			  bool use_lt);
 #else
 #define pcie_ports_disabled	true
 #define pcie_ports_native	false
+
+static inline int pcie_set_target_speed(struct pci_dev *port,
+					enum pci_bus_speed speed_req,
+					bool use_lt)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 8/9] thermal: Add PCIe cooling driver
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (6 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-11-13  8:44   ` Lukas Wunner
  2024-10-18 14:47 ` [PATCH v9 9/9] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen
  2024-10-23 22:19 ` [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Bjorn Helgaas
  9 siblings, 1 reply; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Ilpo Järvinen, Daniel Lezcano, Zhang Rui,
	Lukasz Luba, linux-kernel
  Cc: Amit Kucheria, Christophe JAILLET, Jonathan Cameron

Add a thermal cooling driver to provide path to access PCIe bandwidth
controller using the usual thermal interfaces.

A cooling device is instantiated for controllable PCIe Ports from the
bwctrl service driver.

If registering the cooling device fails, allow bwctrl's probe to
succeed regardless. As cdev in that case contains IS_ERR() pseudo
"pointer", clean that up inside the probe function so the remove side
doesn't need to suddenly make an odd looking IS_ERR() check.

The thermal side state 0 means no throttling, i.e., maximum supported
PCIe Link Speed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
---
 MAINTAINERS                    |  2 +
 drivers/pci/pcie/bwctrl.c      | 13 ++++++
 drivers/thermal/Kconfig        |  9 ++++
 drivers/thermal/Makefile       |  2 +
 drivers/thermal/pcie_cooling.c | 80 ++++++++++++++++++++++++++++++++++
 include/linux/pci-bwctrl.h     | 28 ++++++++++++
 6 files changed, 134 insertions(+)
 create mode 100644 drivers/thermal/pcie_cooling.c
 create mode 100644 include/linux/pci-bwctrl.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 8c555b3325d6..393ed7ce5ea1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17938,6 +17938,8 @@ M:	Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
 L:	linux-pci@vger.kernel.org
 S:	Supported
 F:	drivers/pci/pcie/bwctrl.c
+F:	drivers/thermal/pcie_cooling.c
+F:	include/linux/pci-bwctrl.h
 
 PCIE DRIVER FOR AMAZON ANNAPURNA LABS
 M:	Jonathan Chocron <jonnyc@amazon.com>
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 8a2bd1e887e2..d00675e33ad9 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -27,6 +27,7 @@
 #include <linux/interrupt.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
 #include <linux/rwsem.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -38,10 +39,12 @@
  * struct pcie_bwctrl_data - PCIe bandwidth controller
  * @set_speed_mutex:	Serializes link speed changes
  * @lbms_count:		Count for LBMS (since last reset)
+ * @cdev:		thermal cooling device associated with the port
  */
 struct pcie_bwctrl_data {
 	struct mutex set_speed_mutex;
 	atomic_t lbms_count;
+	struct thermal_cooling_device *cdev;
 };
 
 /*
@@ -313,11 +316,21 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
 
 	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
 
+	/* Don't fail on errors. Don't leave IS_ERR() "pointer" into ->cdev */
+	port->link_bwctrl->cdev = pcie_cooling_device_register(port);
+	if (IS_ERR(port->link_bwctrl->cdev))
+		port->link_bwctrl->cdev = NULL;
+
 	return 0;
 }
 
 static void pcie_bwnotif_remove(struct pcie_device *srv)
 {
+	struct pcie_bwctrl_data *data = srv->port->link_bwctrl;
+
+	if (data->cdev)
+		pcie_cooling_device_unregister(data->cdev);
+
 	pcie_bwnotif_disable(srv->port);
 
 	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 61e7ae524b1f..d3f9686e26e7 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
 
 	  If you want this support, you should say Y here.
 
+config PCIE_THERMAL
+	bool "PCIe cooling support"
+	depends on PCIEPORTBUS
+	help
+	  This implements PCIe cooling mechanism through bandwidth reduction
+	  for PCIe devices.
+
+	  If you want this support, you should say Y here.
+
 config THERMAL_EMULATION
 	bool "Thermal emulation mode support"
 	help
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 41c4d56beb40..210c16c91461 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -31,6 +31,8 @@ thermal_sys-$(CONFIG_CPU_IDLE_THERMAL)	+= cpuidle_cooling.o
 # devfreq cooling
 thermal_sys-$(CONFIG_DEVFREQ_THERMAL) += devfreq_cooling.o
 
+thermal_sys-$(CONFIG_PCIE_THERMAL) += pcie_cooling.o
+
 obj-$(CONFIG_K3_THERMAL)	+= k3_bandgap.o k3_j72xx_bandgap.o
 # platform thermal drivers
 obj-y				+= broadcom/
diff --git a/drivers/thermal/pcie_cooling.c b/drivers/thermal/pcie_cooling.c
new file mode 100644
index 000000000000..a876d64f1582
--- /dev/null
+++ b/drivers/thermal/pcie_cooling.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCIe cooling device
+ *
+ * Copyright (C) 2023-2024 Intel Corporation
+ */
+
+#include <linux/build_bug.h>
+#include <linux/cleanup.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/slab.h>
+#include <linux/sprintf.h>
+#include <linux/thermal.h>
+
+#define COOLING_DEV_TYPE_PREFIX		"PCIe_Port_Link_Speed_"
+
+static int pcie_cooling_get_max_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+	struct pci_dev *port = cdev->devdata;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	*state = port->subordinate->max_bus_speed - PCIE_SPEED_2_5GT;
+
+	return 0;
+}
+
+static int pcie_cooling_get_cur_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+	struct pci_dev *port = cdev->devdata;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	*state = cdev->max_state - (port->subordinate->cur_bus_speed - PCIE_SPEED_2_5GT);
+
+	return 0;
+}
+
+static int pcie_cooling_set_cur_level(struct thermal_cooling_device *cdev, unsigned long state)
+{
+	struct pci_dev *port = cdev->devdata;
+	enum pci_bus_speed speed;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	speed = (cdev->max_state - state) + PCIE_SPEED_2_5GT;
+
+	return pcie_set_target_speed(port, speed, true);
+}
+
+static struct thermal_cooling_device_ops pcie_cooling_ops = {
+	.get_max_state = pcie_cooling_get_max_level,
+	.get_cur_state = pcie_cooling_get_cur_level,
+	.set_cur_state = pcie_cooling_set_cur_level,
+};
+
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port)
+{
+	char *name __free(kfree) =
+		kasprintf(GFP_KERNEL, COOLING_DEV_TYPE_PREFIX "%s", pci_name(port));
+	if (!name)
+		return ERR_PTR(-ENOMEM);
+
+	return thermal_cooling_device_register(name, port, &pcie_cooling_ops);
+}
+
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+	thermal_cooling_device_unregister(cdev);
+}
+
+/* For bus_speed <-> state arithmetic */
+static_assert(PCIE_SPEED_2_5GT + 1 == PCIE_SPEED_5_0GT);
+static_assert(PCIE_SPEED_5_0GT + 1 == PCIE_SPEED_8_0GT);
+static_assert(PCIE_SPEED_8_0GT + 1 == PCIE_SPEED_16_0GT);
+static_assert(PCIE_SPEED_16_0GT + 1 == PCIE_SPEED_32_0GT);
+static_assert(PCIE_SPEED_32_0GT + 1 == PCIE_SPEED_64_0GT);
+
+MODULE_AUTHOR("Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>");
+MODULE_DESCRIPTION("PCIe cooling driver");
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
new file mode 100644
index 000000000000..cee07127455b
--- /dev/null
+++ b/include/linux/pci-bwctrl.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCIe bandwidth controller
+ *
+ * Copyright (C) 2023-2024 Intel Corporation
+ */
+
+#ifndef LINUX_PCI_BWCTRL_H
+#define LINUX_PCI_BWCTRL_H
+
+#include <linux/pci.h>
+
+struct thermal_cooling_device;
+
+#ifdef CONFIG_PCIE_THERMAL
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port);
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev);
+#else
+static inline struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port)
+{
+	return NULL;
+}
+static inline void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+}
+#endif
+
+#endif
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v9 9/9] selftests/pcie_bwctrl: Create selftests
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (7 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 8/9] thermal: Add PCIe cooling driver Ilpo Järvinen
@ 2024-10-18 14:47 ` Ilpo Järvinen
  2024-10-23 22:19 ` [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Bjorn Helgaas
  9 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-10-18 14:47 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Shuah Khan, Ilpo Järvinen, linux-kernel,
	linux-kselftest
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Jonathan Cameron

Create selftests for PCIe BW control through the PCIe cooling device
sysfs interface.

First, the BW control selftest finds the PCIe Port to test with. By
default, the PCIe Port with the highest Link Speed is selected but
another PCIe Port can be provided with -d parameter.

The actual test steps the cur_state of the cooling device one-by-one
from max_state to what the cur_state was initially. The speed change
is confirmed by observing the current_link_speed for the corresponding
PCIe Port.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
 .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++++++++++++++
 .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++++++++
 5 files changed, 193 insertions(+)
 create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
 create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
 create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 393ed7ce5ea1..d7ffef4382df 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17940,6 +17940,7 @@ S:	Supported
 F:	drivers/pci/pcie/bwctrl.c
 F:	drivers/thermal/pcie_cooling.c
 F:	include/linux/pci-bwctrl.h
+F:	tools/testing/selftests/pcie_bwctrl/
 
 PCIE DRIVER FOR AMAZON ANNAPURNA LABS
 M:	Jonathan Chocron <jonnyc@amazon.com>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b38199965f99..7181756f47ff 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -72,6 +72,7 @@ TARGETS += net/packetdrill
 TARGETS += net/rds
 TARGETS += net/tcp_ao
 TARGETS += nsfs
+TARGETS += pcie_bwctrl
 TARGETS += perf_events
 TARGETS += pidfd
 TARGETS += pid_namespace
diff --git a/tools/testing/selftests/pcie_bwctrl/Makefile b/tools/testing/selftests/pcie_bwctrl/Makefile
new file mode 100644
index 000000000000..3e84e26341d1
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/Makefile
@@ -0,0 +1,2 @@
+TEST_PROGS = set_pcie_cooling_state.sh
+include ../lib.mk
diff --git a/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh b/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
new file mode 100755
index 000000000000..9df606552af3
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
@@ -0,0 +1,122 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+SYSFS=
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+retval=0
+skipmsg="skip all tests:"
+
+PCIEPORTTYPE="PCIe_Port_Link_Speed"
+
+prerequisite()
+{
+	local ports
+
+	if [ $UID != 0 ]; then
+		echo $skipmsg must be run as root >&2
+		exit $ksft_skip
+	fi
+
+	SYSFS=`mount -t sysfs | head -1 | awk '{ print $3 }'`
+
+	if [ ! -d "$SYSFS" ]; then
+		echo $skipmsg sysfs is not mounted >&2
+		exit $ksft_skip
+	fi
+
+	if ! ls $SYSFS/class/thermal/cooling_device* > /dev/null 2>&1; then
+		echo $skipmsg thermal cooling devices missing >&2
+		exit $ksft_skip
+	fi
+
+	ports=`grep -e "^$PCIEPORTTYPE" $SYSFS/class/thermal/cooling_device*/type | wc -l`
+	if [ $ports -eq 0 ]; then
+		echo $skipmsg pcie cooling devices missing >&2
+		exit $ksft_skip
+	fi
+}
+
+testport=
+find_pcie_port()
+{
+	local patt="$1"
+	local pcieports
+	local max
+	local cur
+	local delta
+	local bestdelta=-1
+
+	pcieports=`grep -l -F -e "$patt" /sys/class/thermal/cooling_device*/type`
+	if [ -z "$pcieports" ]; then
+		return
+	fi
+	pcieports=${pcieports//\/type/}
+	# Find the port with the highest PCIe Link Speed
+	for port in $pcieports; do
+		max=`cat $port/max_state`
+		cur=`cat $port/cur_state`
+		delta=$((max-cur))
+		if [ $delta -gt $bestdelta ]; then
+			testport="$port"
+			bestdelta=$delta
+		fi
+	done
+}
+
+sysfspcidev=
+find_sysfs_pci_dev()
+{
+	local typefile="$1/type"
+	local pcidir
+
+	pcidir="$SYSFS/bus/pci/devices/`sed -e "s|^${PCIEPORTTYPE}_||g" $typefile`"
+
+	if [ -r "$pcidir/current_link_speed" ]; then
+		sysfspcidev="$pcidir/current_link_speed"
+	fi
+}
+
+usage()
+{
+	echo "Usage $0 [ -d dev ]"
+	echo -e "\t-d: PCIe port BDF string (e.g., 0000:00:04.0)"
+}
+
+pattern="$PCIEPORTTYPE"
+parse_arguments()
+{
+	while getopts d:h opt; do
+		case $opt in
+			h)
+				usage "$0"
+				exit 0
+				;;
+			d)
+				pattern="$PCIEPORTTYPE_$OPTARG"
+				;;
+			*)
+				usage "$0"
+				exit 0
+				;;
+		esac
+	done
+}
+
+parse_arguments "$@"
+prerequisite
+find_pcie_port "$pattern"
+if [ -z "$testport" ]; then
+	echo $skipmsg "pcie cooling device not found from sysfs" >&2
+	exit $ksft_skip
+fi
+find_sysfs_pci_dev "$testport"
+if [ -z "$sysfspcidev" ]; then
+	echo $skipmsg "PCIe port device not found from sysfs" >&2
+	exit $ksft_skip
+fi
+
+./set_pcie_speed.sh "$testport" "$sysfspcidev"
+retval=$?
+
+exit $retval
diff --git a/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh b/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
new file mode 100755
index 000000000000..584596949312
--- /dev/null
+++ b/tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+set -e
+
+TESTNAME=set_pcie_speed
+
+declare -a PCIELINKSPEED=(
+	"2.5 GT/s PCIe"
+	"5.0 GT/s PCIe"
+	"8.0 GT/s PCIe"
+	"16.0 GT/s PCIe"
+	"32.0 GT/s PCIe"
+	"64.0 GT/s PCIe"
+)
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+retval=0
+
+coolingdev="$1"
+statefile="$coolingdev/cur_state"
+maxfile="$coolingdev/max_state"
+linkspeedfile="$2"
+
+oldstate=`cat $statefile`
+maxstate=`cat $maxfile`
+
+set_state()
+{
+	local state=$1
+	local linkspeed
+	local expected_linkspeed
+
+	echo $state > $statefile
+
+	sleep 1
+
+	linkspeed="`cat $linkspeedfile`"
+	expected_linkspeed=$((maxstate-state))
+	expected_str="${PCIELINKSPEED[$expected_linkspeed]}"
+	if [ ! "${expected_str}" = "${linkspeed}" ]; then
+		echo "$TESTNAME failed: expected: ${expected_str}; got ${linkspeed}"
+		retval=1
+	fi
+}
+
+cleanup_skip ()
+{
+	set_state $oldstate
+	exit $ksft_skip
+}
+
+trap cleanup_skip EXIT
+
+echo "$TESTNAME: testing states $maxstate .. $oldstate with $coolingdev"
+for i in $(seq $maxstate -1 $oldstate); do
+	set_state "$i"
+done
+
+trap EXIT
+if [ $retval -eq 0 ]; then
+	echo "$TESTNAME [PASS]"
+else
+	echo "$TESTNAME [FAIL]"
+fi
+exit $retval
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
  2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
                   ` (8 preceding siblings ...)
  2024-10-18 14:47 ` [PATCH v9 9/9] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen
@ 2024-10-23 22:19 ` Bjorn Helgaas
  2024-11-13 21:48   ` Bjorn Helgaas
  9 siblings, 1 reply; 35+ messages in thread
From: Bjorn Helgaas @ 2024-10-23 22:19 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET

On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote:
> Hi all,
> 
> This series adds PCIe bandwidth controller (bwctrl) and associated PCIe
> cooling driver to the thermal core side for limiting PCIe Link Speed
> due to thermal reasons. PCIe bandwidth controller is a PCI express bus
> port service driver. A cooling device is created for each port the
> service driver finds to support changing speeds.
> 
> This series only adds support for controlling PCIe Link Speed.
> Controlling PCIe Link Width might also be useful but there is no
> mechanism for that until PCIe 6.0 (L0p) so Link Width throttling is not
> added by this series.
> 
> 
> v9:
> - Split RMW ops doc reformat into own patch before adding LNKCTL2.
> - Comment reserved 0 LSB even better than it already was.
> - Consider portdrv future plans:
> 	- Use devm helpers for mem alloc, IRQ, and mutex init.
> 	- Don't use get/set_service_data().
> - Split rwsem into two to avoid recursive locking splat through
>   pcie_retrain_link().
> - Small wording improvements to commit messages (from Jonathan)
> 
> v8:
> - Removed CONFIG_PCIE_BWCTRL (use CONFIG_PCIEPORTBUS)
> - Removed locking wrappers that dealt with the CONFIG variations
> - Protect macro parameter with parenthesis to be on the safe side
> 
> v7:
> - Rebased on top of Maciej's latest Target Speed quirk patches
> - Target Speed quirk runs very early, w/o ->subordinate existing yet.
>   This required adapting logic:
> 	- Move Supported Link Speeds back to pci_dev
> 	- Check for ->subordinate == NULL where necessary
> 	- Cannot always take bwctrl's per port mutex (in pcie_bwctrl_data)
> - Cleaned up locking in pcie_set_target_speed() using wrappers
> 	- Allowed removing confusing __pcie_set_target_speed()
> - Fix building with CONFIG_PCI=n
> - Correct error check in pcie_lbms_seen()
> - Don't return error for an empty bus that remains at 2.5GT
> - Use rwsem to protect ->link_bwctrl setup and bwnotif enable
> - Clear LBMS in remove_board()
> - Adding export for pcie_get_supported_speeds() was unnecessary
> - Call bwctrl's init before hotplug.
> - Added local variable 'bus' into a few functions
> 
> v6:
> - Removed unnecessary PCI_EXP_LNKCAP_SLS mask from PCIE_LNKCAP_SLS2SPEED()
> - Split error handling in pcie_bwnotif_irq_thread()
> - pci_info() -> pci_dbg() on bwctrl probe success path
> - Handle cooling device pointer -Exx codes in bwctrl probe
> - Reorder port->link_bwctrl setup / bwnotif enable for symmetry
> - Handle LBMS count == 0 in PCIe quirk by checking LBMS (avoids a race
>   between quirk and bwctrl)
> - Use cleanup.h in PCIe cooling device's register
> 
> v5:
> - Removed patches: LNKCTL2 RMW driver patches went in separately
> - Refactor pcie_update_link_speed() to read LNKSTA + add __ variant
>   for hotplug that has LNKSTA value at hand
> - Make series fully compatible with the Target Speed quirk
> 	- LBMS counter added, quirk falls back to LBMS bit when bwctrl =n
> 	- Separate LBMS patch from set target speed patches
> - Always provide pcie_bwctrl_change_speed() even if bwctrl =n so drivers
>   don't need to come up their own version (also required by the Target
>   Speed quirk)
> - Remove devm_* (based on Lukas' comment on some other service
>   driver patch)
> - Convert to use cleanup.h
> - Renamed functions/struct to have shorter names
> 
> v4:
> - Merge Port's and Endpoint's Supported Link Speeds Vectors into
>   supported_speeds in the struct pci_bus
> - Reuse pcie_get_speed_cap()'s code for pcie_get_supported_speeds()
> - Setup supported_speeds with PCI_EXP_LNKCAP2_SLS_2_5GB when no
>   Endpoint exists
> - Squash revert + add bwctrl patches into one
> - Change to use threaded IRQ + IRQF_ONESHOT
> - Enable also LABIE / LABS
> - Convert Link Speed selection to use bit logic instead of loop
> - Allocate before requesting IRQ during probe
> - Use devm_*()
> - Use u8 for speed_conv array instead of u16
> - Removed READ_ONCE()
> - Improve changelogs, comments, and Kconfig
> - Name functions slightly more consistently
> - Use bullet list for RMW protected registers in docs
> 
> v3:
> - Correct hfi1 shortlog prefix
> - Improve error prints in hfi1
> - Add L: linux-pci to the MAINTAINERS entry
> 
> v2:
> - Adds LNKCTL2 to RMW safe list in Documentation/PCI/pciebus-howto.rst
> - Renamed cooling devices from PCIe_Port_* to PCIe_Port_Link_Speed_* in
>   order to plan for possibility of adding Link Width cooling devices
>   later on
> - Moved struct thermal_cooling_device declaration to the correct patch
> - Small tweaks to Kconfig texts
> - Series rebased to resolve conflict (in the selftest list)
> 
> Ilpo Järvinen (9):
>   Documentation PCI: Reformat RMW ops documentation
>   PCI: Protect Link Control 2 Register with RMW locking
>   PCI: Store all PCIe Supported Link Speeds
>   PCI: Refactor pcie_update_link_speed()
>   PCI/quirks: Abstract LBMS seen check into own function
>   PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
>   PCI/bwctrl: Add API to set PCIe Link Speed
>   thermal: Add PCIe cooling driver
>   selftests/pcie_bwctrl: Create selftests
> 
>  Documentation/PCI/pciebus-howto.rst           |  14 +-
>  MAINTAINERS                                   |   9 +
>  drivers/pci/hotplug/pciehp_ctrl.c             |   5 +
>  drivers/pci/hotplug/pciehp_hpc.c              |   2 +-
>  drivers/pci/pci.c                             |  62 ++-
>  drivers/pci/pci.h                             |  38 +-
>  drivers/pci/pcie/Makefile                     |   2 +-
>  drivers/pci/pcie/bwctrl.c                     | 366 ++++++++++++++++++
>  drivers/pci/pcie/portdrv.c                    |   9 +-
>  drivers/pci/pcie/portdrv.h                    |   6 +-
>  drivers/pci/probe.c                           |  15 +-
>  drivers/pci/quirks.c                          |  32 +-
>  drivers/thermal/Kconfig                       |   9 +
>  drivers/thermal/Makefile                      |   2 +
>  drivers/thermal/pcie_cooling.c                |  80 ++++
>  include/linux/pci-bwctrl.h                    |  28 ++
>  include/linux/pci.h                           |  24 +-
>  include/uapi/linux/pci_regs.h                 |   1 +
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
>  .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++
>  .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++
>  22 files changed, 843 insertions(+), 53 deletions(-)
>  create mode 100644 drivers/pci/pcie/bwctrl.c
>  create mode 100644 drivers/thermal/pcie_cooling.c
>  create mode 100644 include/linux/pci-bwctrl.h
>  create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
>  create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
>  create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh

Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the
bandwidth notification interrupt support)!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds
  2024-10-18 14:47 ` [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
@ 2024-11-11 13:23   ` Lukas Wunner
  2024-11-11 20:30     ` Bjorn Helgaas
  0 siblings, 1 reply; 35+ messages in thread
From: Lukas Wunner @ 2024-11-11 13:23 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Jonathan Cameron,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

On Fri, Oct 18, 2024 at 05:47:49PM +0300, Ilpo Järvinen wrote:
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index be5ed534c39c..a02b77fe7865 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -303,6 +303,7 @@ enum pci_bus_speed {
>  	PCI_SPEED_UNKNOWN		= 0xff,
>  };
>  
> +u8 pcie_get_supported_speeds(struct pci_dev *dev);
>  enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
>  enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
>  

I realize this is now already queued as commit 73ee11953294 on pci/bwctrl,
nevertheless one belated comment:

Since there are no callers of pcie_get_supported_speeds() outside the
PCI core, the above declaration should probably rather live in
drivers/pci/pci.h.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds
  2024-11-11 13:23   ` Lukas Wunner
@ 2024-11-11 20:30     ` Bjorn Helgaas
  2024-11-12  9:20       ` Ilpo Järvinen
  0 siblings, 1 reply; 35+ messages in thread
From: Bjorn Helgaas @ 2024-11-11 20:30 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET

On Mon, Nov 11, 2024 at 02:23:35PM +0100, Lukas Wunner wrote:
> On Fri, Oct 18, 2024 at 05:47:49PM +0300, Ilpo Järvinen wrote:
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index be5ed534c39c..a02b77fe7865 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -303,6 +303,7 @@ enum pci_bus_speed {
> >  	PCI_SPEED_UNKNOWN		= 0xff,
> >  };
> >  
> > +u8 pcie_get_supported_speeds(struct pci_dev *dev);
> >  enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
> >  enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
> 
> I realize this is now already queued as commit 73ee11953294 on pci/bwctrl,
> nevertheless one belated comment:
> 
> Since there are no callers of pcie_get_supported_speeds() outside the
> PCI core, the above declaration should probably rather live in
> drivers/pci/pci.h.

I moved them, thanks!

I noticed duplicate declarations for pcie_get_speed_cap() and
pcie_get_width_cap(), so I'll add a patch to drop them from
drivers/pci/pci.h.

Bjorn

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds
  2024-11-11 20:30     ` Bjorn Helgaas
@ 2024-11-12  9:20       ` Ilpo Järvinen
  0 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-11-12  9:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, LKML, Daniel Lezcano, Amit Kucheria, Zhang Rui,
	Christophe JAILLET

[-- Attachment #1: Type: text/plain, Size: 1306 bytes --]

On Mon, 11 Nov 2024, Bjorn Helgaas wrote:

> On Mon, Nov 11, 2024 at 02:23:35PM +0100, Lukas Wunner wrote:
> > On Fri, Oct 18, 2024 at 05:47:49PM +0300, Ilpo Järvinen wrote:
> > > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > > index be5ed534c39c..a02b77fe7865 100644
> > > --- a/include/linux/pci.h
> > > +++ b/include/linux/pci.h
> > > @@ -303,6 +303,7 @@ enum pci_bus_speed {
> > >  	PCI_SPEED_UNKNOWN		= 0xff,
> > >  };
> > >  
> > > +u8 pcie_get_supported_speeds(struct pci_dev *dev);
> > >  enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
> > >  enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
> > 
> > I realize this is now already queued as commit 73ee11953294 on pci/bwctrl,
> > nevertheless one belated comment:
> > 
> > Since there are no callers of pcie_get_supported_speeds() outside the
> > PCI core, the above declaration should probably rather live in
> > drivers/pci/pci.h.
> 
> I moved them, thanks!

Thanks for taking care of that.

> I noticed duplicate declarations for pcie_get_speed_cap() and
> pcie_get_width_cap(), so I'll add a patch to drop them from
> drivers/pci/pci.h.

I also noticed the same. FWIW,

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

for that patch of yours.

-- 
 i.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-10-18 14:47 ` [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed Ilpo Järvinen
@ 2024-11-12 15:47   ` Lukas Wunner
  2024-11-12 16:01     ` Ilpo Järvinen
  2024-11-12 20:43     ` Bjorn Helgaas
  2025-01-02 10:38   ` Lukas Wunner
  1 sibling, 2 replies; 35+ messages in thread
From: Lukas Wunner @ 2024-11-12 15:47 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Jonathan Cameron,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> +EXPORT_SYMBOL_GPL(pcie_set_target_speed);

My apologies for another belated comment on this series.
This patch is now a688ab21eb72 on pci/bwctrl:

I note that pcie_set_target_speed() is not called my a modular user
(CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
isn't really necessary right now.  I don't know if it was added
intentionally because some modular user is expected to show up
in the near future.


> @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  	if (!data)
>  		return -ENOMEM;
>  
> +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
>  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
>  					pcie_bwnotif_irq_thread,
>  					IRQF_SHARED | IRQF_ONESHOT,

We generally try to avoid devm_*() functions in port service drivers
because if we later on move them into the PCI core (which is the plan),
we'll have to unroll them.  Not the end of the world that they're used
here, just not ideal.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-11-12 15:47   ` Lukas Wunner
@ 2024-11-12 16:01     ` Ilpo Järvinen
  2024-11-18 13:03       ` Jonathan Cameron
  2024-11-12 20:43     ` Bjorn Helgaas
  1 sibling, 1 reply; 35+ messages in thread
From: Ilpo Järvinen @ 2024-11-12 16:01 UTC (permalink / raw)
  To: Lukas Wunner, Jonathan Cameron
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, LKML,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

[-- Attachment #1: Type: text/plain, Size: 1903 bytes --]

On Tue, 12 Nov 2024, Lukas Wunner wrote:

> On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);
> 
> My apologies for another belated comment on this series.
> This patch is now a688ab21eb72 on pci/bwctrl:
> 
> I note that pcie_set_target_speed() is not called my a modular user
> (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> isn't really necessary right now.  I don't know if it was added
> intentionally because some modular user is expected to show up
> in the near future.

Its probably a thinko to add it at all but then there have been talk about 
other users interested in the API too so it's not far fetched we could see 
a user. No idea about timelines though.

There are some AMD GPU drivers tweaking the TLS field on their own but 
they also touch some HW specific registers (although, IIRC, they only 
touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
that yields something very straightforward and ends up producing a working 
conversion or not (without ability to test with the HW). But TBH, not on 
my highest priority item.

> > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> >  	if (!data)
> >  		return -ENOMEM;
> >  
> > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> >  					pcie_bwnotif_irq_thread,
> >  					IRQF_SHARED | IRQF_ONESHOT,
> 
> We generally try to avoid devm_*() functions in port service drivers
> because if we later on move them into the PCI core (which is the plan),
> we'll have to unroll them.  Not the end of the world that they're used
> here, just not ideal.

I think Jonathan disagrees with you on that:

https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/

:-)

-- 
 i.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-11-12 15:47   ` Lukas Wunner
  2024-11-12 16:01     ` Ilpo Järvinen
@ 2024-11-12 20:43     ` Bjorn Helgaas
  1 sibling, 0 replies; 35+ messages in thread
From: Bjorn Helgaas @ 2024-11-12 20:43 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET

On Tue, Nov 12, 2024 at 04:47:48PM +0100, Lukas Wunner wrote:
> On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);
> 
> My apologies for another belated comment on this series.
> This patch is now a688ab21eb72 on pci/bwctrl:
> 
> I note that pcie_set_target_speed() is not called my a modular user
> (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> isn't really necessary right now.  I don't know if it was added
> intentionally because some modular user is expected to show up
> in the near future.

Dropped the export for now, we can add it back if/when needed.

Bjorn

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 8/9] thermal: Add PCIe cooling driver
  2024-10-18 14:47 ` [PATCH v9 8/9] thermal: Add PCIe cooling driver Ilpo Järvinen
@ 2024-11-13  8:44   ` Lukas Wunner
  2024-11-15 17:28     ` Bjorn Helgaas
  0 siblings, 1 reply; 35+ messages in thread
From: Lukas Wunner @ 2024-11-13  8:44 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Jonathan Cameron,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, linux-kernel, Amit Kucheria,
	Christophe JAILLET

On Fri, Oct 18, 2024 at 05:47:54PM +0300, Ilpo Järvinen wrote:
>  static void pcie_bwnotif_remove(struct pcie_device *srv)
>  {
> +	struct pcie_bwctrl_data *data = srv->port->link_bwctrl;
> +
> +	if (data->cdev)
> +		pcie_cooling_device_unregister(data->cdev);
> +

Just noting a minor nit here in what is now commit 7206400cda87
on pci/bwctrl:  The NULL pointer check for data->cdev seems
redundant as pcie_cooling_device_unregister() just calls
thermal_cooling_device_unregister(), which in turn contains
this at the top:

	if (!cdev)
		return;

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
  2024-10-23 22:19 ` [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Bjorn Helgaas
@ 2024-11-13 21:48   ` Bjorn Helgaas
  2024-11-14  8:46     ` Bartosz Golaszewski
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Bjorn Helgaas @ 2024-11-13 21:48 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET, Bartosz Golaszewski, Amit Pundir,
	Neil Armstrong, Caleb Connolly

[+cc Bart, Amit, Neil, Caleb]

On Wed, Oct 23, 2024 at 05:19:04PM -0500, Bjorn Helgaas wrote:
> On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote:
> > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe
> > cooling driver to the thermal core side for limiting PCIe Link Speed
> > due to thermal reasons. PCIe bandwidth controller is a PCI express bus
> > port service driver. A cooling device is created for each port the
> > service driver finds to support changing speeds.

> >  drivers/pci/pcie/bwctrl.c                     | 366 ++++++++++++++++++
> >  include/linux/pci-bwctrl.h                    |  28 ++
> >  tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
> >  .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++
> >  .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++

> Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the
> bandwidth notification interrupt support)!

How attached are we to "bwctrl" and "pwrctl" as the names?

I just noticed that we have "ctrl" for bandwidth control but "ctl" for
power control, which is slightly annoying to keep straight.

"ctrl" is more common in the tree:

  $ find -name \*ctrl\*[ch] | wc -l
  691
  $ find -name \*ctl\*[ch] | wc -l
  291

so I would prefer that, although "pwrctl" is already in the tree (as
of v6.11), so it would be more disruptive to change it than to rename
"bwctrl".

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
  2024-11-13 21:48   ` Bjorn Helgaas
@ 2024-11-14  8:46     ` Bartosz Golaszewski
  2024-11-14 12:32     ` Ilpo Järvinen
  2024-11-14 17:36     ` Krzysztof Wilczyński
  2 siblings, 0 replies; 35+ messages in thread
From: Bartosz Golaszewski @ 2024-11-14  8:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, Maciej W . Rozycki,
	Jonathan Cameron, Lukas Wunner, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Amit Pundir, Neil Armstrong, Caleb Connolly

On Wed, 13 Nov 2024 at 22:48, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Bart, Amit, Neil, Caleb]
>
> On Wed, Oct 23, 2024 at 05:19:04PM -0500, Bjorn Helgaas wrote:
> > On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote:
> > > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe
> > > cooling driver to the thermal core side for limiting PCIe Link Speed
> > > due to thermal reasons. PCIe bandwidth controller is a PCI express bus
> > > port service driver. A cooling device is created for each port the
> > > service driver finds to support changing speeds.
>
> > >  drivers/pci/pcie/bwctrl.c                     | 366 ++++++++++++++++++
> > >  include/linux/pci-bwctrl.h                    |  28 ++
> > >  tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
> > >  .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++
> > >  .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++
>
> > Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the
> > bandwidth notification interrupt support)!
>
> How attached are we to "bwctrl" and "pwrctl" as the names?
>
> I just noticed that we have "ctrl" for bandwidth control but "ctl" for
> power control, which is slightly annoying to keep straight.
>
> "ctrl" is more common in the tree:
>
>   $ find -name \*ctrl\*[ch] | wc -l
>   691
>   $ find -name \*ctl\*[ch] | wc -l
>   291
>
> so I would prefer that, although "pwrctl" is already in the tree (as
> of v6.11), so it would be more disruptive to change it than to rename
> "bwctrl".

"pwrctl" is shorter and it was the only reason for me to choose it over pwrctrl.

Bart

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
  2024-11-13 21:48   ` Bjorn Helgaas
  2024-11-14  8:46     ` Bartosz Golaszewski
@ 2024-11-14 12:32     ` Ilpo Järvinen
  2024-11-14 17:36     ` Krzysztof Wilczyński
  2 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2024-11-14 12:32 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczyński, Maciej W . Rozycki, Jonathan Cameron,
	Lukas Wunner, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, LKML, Daniel Lezcano, Amit Kucheria, Zhang Rui,
	Christophe JAILLET, Bartosz Golaszewski, Amit Pundir,
	Neil Armstrong, Caleb Connolly

[-- Attachment #1: Type: text/plain, Size: 1814 bytes --]

On Wed, 13 Nov 2024, Bjorn Helgaas wrote:

> [+cc Bart, Amit, Neil, Caleb]
> 
> On Wed, Oct 23, 2024 at 05:19:04PM -0500, Bjorn Helgaas wrote:
> > On Fri, Oct 18, 2024 at 05:47:46PM +0300, Ilpo Järvinen wrote:
> > > This series adds PCIe bandwidth controller (bwctrl) and associated PCIe
> > > cooling driver to the thermal core side for limiting PCIe Link Speed
> > > due to thermal reasons. PCIe bandwidth controller is a PCI express bus
> > > port service driver. A cooling device is created for each port the
> > > service driver finds to support changing speeds.
> 
> > >  drivers/pci/pcie/bwctrl.c                     | 366 ++++++++++++++++++
> > >  include/linux/pci-bwctrl.h                    |  28 ++
> > >  tools/testing/selftests/pcie_bwctrl/Makefile  |   2 +
> > >  .../pcie_bwctrl/set_pcie_cooling_state.sh     | 122 ++++++
> > >  .../selftests/pcie_bwctrl/set_pcie_speed.sh   |  67 ++++
> 
> > Applied to pci/bwctrl for v6.13, thanks Ilpo (and Alexandru, for the
> > bandwidth notification interrupt support)!
> 
> How attached are we to "bwctrl" and "pwrctl" as the names?
>
> I just noticed that we have "ctrl" for bandwidth control but "ctl" for
> power control, which is slightly annoying to keep straight.
> 
> "ctrl" is more common in the tree:
> 
>   $ find -name \*ctrl\*[ch] | wc -l
>   691
>   $ find -name \*ctl\*[ch] | wc -l
>   291
> 
> so I would prefer that, although "pwrctl" is already in the tree (as
> of v6.11), so it would be more disruptive to change it than to rename
> "bwctrl".

I personally don't mind what the name is.

However, what I've noted is that visually those two are so similar that my 
brains will mix them on a glance anyway. Maybe the length difference would 
help on that confusion too, remains to be seen.

-- 
 i.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 0/9] PCI: Add PCIe bandwidth controller
  2024-11-13 21:48   ` Bjorn Helgaas
  2024-11-14  8:46     ` Bartosz Golaszewski
  2024-11-14 12:32     ` Ilpo Järvinen
@ 2024-11-14 17:36     ` Krzysztof Wilczyński
  2 siblings, 0 replies; 35+ messages in thread
From: Krzysztof Wilczyński @ 2024-11-14 17:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Maciej W . Rozycki, Jonathan Cameron, Lukas Wunner,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	Bartosz Golaszewski, Amit Pundir, Neil Armstrong, Caleb Connolly

Hello,

[...]
> How attached are we to "bwctrl" and "pwrctl" as the names?
> 
> I just noticed that we have "ctrl" for bandwidth control but "ctl" for
> power control, which is slightly annoying to keep straight.
> 
> "ctrl" is more common in the tree:
> 
>   $ find -name \*ctrl\*[ch] | wc -l
>   691
>   $ find -name \*ctl\*[ch] | wc -l
>   291
> 
> so I would prefer that, although "pwrctl" is already in the tree (as
> of v6.11), so it would be more disruptive to change it than to rename
> "bwctrl".

If I may, I would also lean towards the "pwrctrl" name.  The "ctl" suffix
makes me think of a command-line utility, a CLI, so to speak, where such
suffix is commonly used e.g., sysctl, etcdctl, kubectl; also, all the
systemd binaries, etc.

	Krzysztof

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 8/9] thermal: Add PCIe cooling driver
  2024-11-13  8:44   ` Lukas Wunner
@ 2024-11-15 17:28     ` Bjorn Helgaas
  0 siblings, 0 replies; 35+ messages in thread
From: Bjorn Helgaas @ 2024-11-15 17:28 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	linux-kernel, Amit Kucheria, Christophe JAILLET

On Wed, Nov 13, 2024 at 09:44:05AM +0100, Lukas Wunner wrote:
> On Fri, Oct 18, 2024 at 05:47:54PM +0300, Ilpo Järvinen wrote:
> >  static void pcie_bwnotif_remove(struct pcie_device *srv)
> >  {
> > +	struct pcie_bwctrl_data *data = srv->port->link_bwctrl;
> > +
> > +	if (data->cdev)
> > +		pcie_cooling_device_unregister(data->cdev);
> > +
> 
> Just noting a minor nit here in what is now commit 7206400cda87
> on pci/bwctrl:  The NULL pointer check for data->cdev seems
> redundant as pcie_cooling_device_unregister() just calls
> thermal_cooling_device_unregister(), which in turn contains
> this at the top:
> 
> 	if (!cdev)
> 		return;

Thanks, I dropped this NULL pointer check locally.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-11-12 16:01     ` Ilpo Järvinen
@ 2024-11-18 13:03       ` Jonathan Cameron
  2024-11-18 13:17         ` Ilpo Järvinen
  0 siblings, 1 reply; 35+ messages in thread
From: Jonathan Cameron @ 2024-11-18 13:03 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, LKML,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Tue, 12 Nov 2024, Lukas Wunner wrote:
> 
> > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:  
> > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);  
> > 
> > My apologies for another belated comment on this series.
> > This patch is now a688ab21eb72 on pci/bwctrl:
> > 
> > I note that pcie_set_target_speed() is not called my a modular user
> > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > isn't really necessary right now.  I don't know if it was added
> > intentionally because some modular user is expected to show up
> > in the near future.  
> 
> Its probably a thinko to add it at all but then there have been talk about 
> other users interested in the API too so it's not far fetched we could see 
> a user. No idea about timelines though.
> 
> There are some AMD GPU drivers tweaking the TLS field on their own but 
> they also touch some HW specific registers (although, IIRC, they only 
> touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> that yields something very straightforward and ends up producing a working 
> conversion or not (without ability to test with the HW). But TBH, not on 
> my highest priority item.
> 
> > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > >  	if (!data)
> > >  		return -ENOMEM;
> > >  
> > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > >  					pcie_bwnotif_irq_thread,
> > >  					IRQF_SHARED | IRQF_ONESHOT,  
> > 
> > We generally try to avoid devm_*() functions in port service drivers
> > because if we later on move them into the PCI core (which is the plan),
> > we'll have to unroll them.  Not the end of the world that they're used
> > here, just not ideal.  
> 
> I think Jonathan disagrees with you on that:
> 
> https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/

Indeed - you beat me to it ;)

There is no practical way to move most of the port driver code into the PCI
core and definitely not interrupts. It is a shame though as I'd much prefer
if we could do so.  At LPC other issues some as power management were called
out as being very hard to handle, but to me the interrupt bit is a single
relatively easy to understand blocker.

I've been very slow on getting back to this, but my current plan would

1) Split up the bits of portdrv subdrivers that are actually core code
   (things like the AER counters etc) The current mix in the same files
   makes it hard to reason about lifetimes etc.

2) Squash all the portdrv sub drivers into simple library style calls so
   no pretend devices, everything registered directly.  That cleans up
   a lot of the layering and still provides reusable code if it makes
   sense to have multiple drivers for ports or to reuse this code for
   something else. Note that along the way I'll check we can build the
   portdrv as a module - I don't plan to actually do that, but it shows
   the layering / abstractions all work if it is possible.  That will
   probably make use of devm_ where appropriate as it simplifies a lot
   of paths.

3) Add the MSIX stuff to support 'new' child drivers but only where
   dynamic MSIX is supported.

4) Register new child devices where the layering does make sense.  So
   where we are actually binding drivers that can be unbound etc.
   Only for cases where dynamic MSI-X is available. 

I hope to get back to this in a week or so.

Jonathan

> 
> :-)
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-11-18 13:03       ` Jonathan Cameron
@ 2024-11-18 13:17         ` Ilpo Järvinen
  2024-11-20 15:36           ` Jonathan Cameron
  0 siblings, 1 reply; 35+ messages in thread
From: Ilpo Järvinen @ 2024-11-18 13:17 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, LKML,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

[-- Attachment #1: Type: text/plain, Size: 4238 bytes --]

On Mon, 18 Nov 2024, Jonathan Cameron wrote:

> On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> 
> > On Tue, 12 Nov 2024, Lukas Wunner wrote:
> > 
> > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:  
> > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);  
> > > 
> > > My apologies for another belated comment on this series.
> > > This patch is now a688ab21eb72 on pci/bwctrl:
> > > 
> > > I note that pcie_set_target_speed() is not called my a modular user
> > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > > isn't really necessary right now.  I don't know if it was added
> > > intentionally because some modular user is expected to show up
> > > in the near future.  
> > 
> > Its probably a thinko to add it at all but then there have been talk about 
> > other users interested in the API too so it's not far fetched we could see 
> > a user. No idea about timelines though.
> > 
> > There are some AMD GPU drivers tweaking the TLS field on their own but 
> > they also touch some HW specific registers (although, IIRC, they only 
> > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> > that yields something very straightforward and ends up producing a working 
> > conversion or not (without ability to test with the HW). But TBH, not on 
> > my highest priority item.
> > 
> > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > > >  	if (!data)
> > > >  		return -ENOMEM;
> > > >  
> > > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > > >  					pcie_bwnotif_irq_thread,
> > > >  					IRQF_SHARED | IRQF_ONESHOT,  
> > > 
> > > We generally try to avoid devm_*() functions in port service drivers
> > > because if we later on move them into the PCI core (which is the plan),
> > > we'll have to unroll them.  Not the end of the world that they're used
> > > here, just not ideal.  
> > 
> > I think Jonathan disagrees with you on that:
> > 
> > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/
> 
> Indeed - you beat me to it ;)
> 
> There is no practical way to move most of the port driver code into the PCI
> core and definitely not interrupts. It is a shame though as I'd much prefer
> if we could do so.  At LPC other issues some as power management were called
> out as being very hard to handle, but to me the interrupt bit is a single
> relatively easy to understand blocker.
> 
> I've been very slow on getting back to this, but my current plan would
> 
> 1) Split up the bits of portdrv subdrivers that are actually core code
>    (things like the AER counters etc) The current mix in the same files
>    makes it hard to reason about lifetimes etc.
> 
> 2) Squash all the portdrv sub drivers into simple library style calls so
>    no pretend devices, everything registered directly.  That cleans up
>    a lot of the layering and still provides reusable code if it makes
>    sense to have multiple drivers for ports or to reuse this code for
>    something else. Note that along the way I'll check we can build the
>    portdrv as a module - I don't plan to actually do that, but it shows
>    the layering / abstractions all work if it is possible.  That will
>    probably make use of devm_ where appropriate as it simplifies a lot
>    of paths.

I'm sorry to be a bit of a spoilsport here but quirks make calls to ASPM 
code and now also to bwctrl set Link Speed API so I'm not entire sure if 
you can actual succeed in that module test.

(It's just something that again indicates both would belong to PCI core
but sadly that direction is out of options).

-- 
 i.

> 3) Add the MSIX stuff to support 'new' child drivers but only where
>    dynamic MSIX is supported.
> 
> 4) Register new child devices where the layering does make sense.  So
>    where we are actually binding drivers that can be unbound etc.
>    Only for cases where dynamic MSI-X is available. 
> 
> I hope to get back to this in a week or so.
> 
> Jonathan
> 
> > 
> > :-)
> > 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-11-18 13:17         ` Ilpo Järvinen
@ 2024-11-20 15:36           ` Jonathan Cameron
  0 siblings, 0 replies; 35+ messages in thread
From: Jonathan Cameron @ 2024-11-20 15:36 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, LKML,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

On Mon, 18 Nov 2024 15:17:53 +0200 (EET)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Mon, 18 Nov 2024, Jonathan Cameron wrote:
> 
> > On Tue, 12 Nov 2024 18:01:50 +0200 (EET)
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> >   
> > > On Tue, 12 Nov 2024, Lukas Wunner wrote:
> > >   
> > > > On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:    
> > > > > +EXPORT_SYMBOL_GPL(pcie_set_target_speed);    
> > > > 
> > > > My apologies for another belated comment on this series.
> > > > This patch is now a688ab21eb72 on pci/bwctrl:
> > > > 
> > > > I note that pcie_set_target_speed() is not called my a modular user
> > > > (CONFIG_PCIE_THERMAL is bool, not tristate), so the above-quoted export
> > > > isn't really necessary right now.  I don't know if it was added
> > > > intentionally because some modular user is expected to show up
> > > > in the near future.    
> > > 
> > > Its probably a thinko to add it at all but then there have been talk about 
> > > other users interested in the API too so it's not far fetched we could see 
> > > a user. No idea about timelines though.
> > > 
> > > There are some AMD GPU drivers tweaking the TLS field on their own but 
> > > they also touch some HW specific registers (although, IIRC, they only 
> > > touch Endpoint'sTLS). I was thinking of converting them but I'm unsure if 
> > > that yields something very straightforward and ends up producing a working 
> > > conversion or not (without ability to test with the HW). But TBH, not on 
> > > my highest priority item.
> > >   
> > > > > @@ -135,6 +296,7 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> > > > >  	if (!data)
> > > > >  		return -ENOMEM;
> > > > >  
> > > > > +	devm_mutex_init(&srv->device, &data->set_speed_mutex);
> > > > >  	ret = devm_request_threaded_irq(&srv->device, srv->irq, NULL,
> > > > >  					pcie_bwnotif_irq_thread,
> > > > >  					IRQF_SHARED | IRQF_ONESHOT,    
> > > > 
> > > > We generally try to avoid devm_*() functions in port service drivers
> > > > because if we later on move them into the PCI core (which is the plan),
> > > > we'll have to unroll them.  Not the end of the world that they're used
> > > > here, just not ideal.    
> > > 
> > > I think Jonathan disagrees with you on that:
> > > 
> > > https://lore.kernel.org/linux-pci/20241017114812.00005e67@Huawei.com/  
> > 
> > Indeed - you beat me to it ;)
> > 
> > There is no practical way to move most of the port driver code into the PCI
> > core and definitely not interrupts. It is a shame though as I'd much prefer
> > if we could do so.  At LPC other issues some as power management were called
> > out as being very hard to handle, but to me the interrupt bit is a single
> > relatively easy to understand blocker.
> > 
> > I've been very slow on getting back to this, but my current plan would
> > 
> > 1) Split up the bits of portdrv subdrivers that are actually core code
> >    (things like the AER counters etc) The current mix in the same files
> >    makes it hard to reason about lifetimes etc.
> > 
> > 2) Squash all the portdrv sub drivers into simple library style calls so
> >    no pretend devices, everything registered directly.  That cleans up
> >    a lot of the layering and still provides reusable code if it makes
> >    sense to have multiple drivers for ports or to reuse this code for
> >    something else. Note that along the way I'll check we can build the
> >    portdrv as a module - I don't plan to actually do that, but it shows
> >    the layering / abstractions all work if it is possible.  That will
> >    probably make use of devm_ where appropriate as it simplifies a lot
> >    of paths.  
> 
> I'm sorry to be a bit of a spoilsport here but quirks make calls to ASPM 
> code and now also to bwctrl set Link Speed API so I'm not entire sure if 
> you can actual succeed in that module test.
> 
> (It's just something that again indicates both would belong to PCI core
> but sadly that direction is out of options).
> 
It may involve some bodges, but it is still a path to checking the
layer splits at least make 'some' sense.  Also that the resulting
library style code is suitable for reuse.  Possibly with an exception
for a few parts.

Thanks for the pointers to where the pitfalls lie!

Jonathan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-10-18 14:47 ` [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
@ 2024-12-06 18:12   ` Niklas Schnelle
  2024-12-06 19:31     ` Niklas Schnelle
  2025-01-02 21:30     ` Lukas Wunner
  0 siblings, 2 replies; 35+ messages in thread
From: Niklas Schnelle @ 2024-12-06 18:12 UTC (permalink / raw)
  To: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, Maciej W . Rozycki,
	Jonathan Cameron, Lukas Wunner, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	linux-pci

On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> bandwidth notification"). An upcoming commit extends this driver
> building PCIe bandwidth controller on top of it.
> 
> The PCIe bandwidth notification were first added in the commit
> e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> notification") but later had to be removed. The significant changes
> compared with the old bandwidth notification driver include:
> 
> 1) Don't print the notifications into kernel log, just keep the Link
>    Speed cached in struct pci_bus updated. While somewhat
> unfortunate,
>    the log spam was the source of complaints that eventually lead to
>    the removal of the bandwidth notifications driver (see the links
>    below for further information).
> 
> 2) Besides the Link Bandwidth Management Interrupt, enable also Link
>    Autonomous Bandwidth Interrupt to cover the other source of
>    bandwidth changes.
> 
> 3) Use threaded IRQ with IRQF_ONESHOT to handle Bandwidth
> Notification
>    Interrupts to address the problem fixed in the commit 3e82a7f9031f
>    ("PCI/LINK: Supply IRQ handler so level-triggered IRQs are
> acked")).
> 
> 4) Handle Link Speed updates robustly. Refresh the cached Link Speed
>    when enabling Bandwidth Notification Interrupts, and solve the
> race
>    between Link Speed read and LBMS/LABS update in
>    pcie_bwnotif_irq_thread().
> 
> 5) Use concurrency safe LNKCTL RMW operations.
> 
> 6) The driver is now called PCIe bwctrl (bandwidth controller)
> instead
>    of just bandwidth notifications because of increased scope and
>    functionality within the driver.
> 
> 7) Coexist with the Target Link Speed quirk in
>    pcie_failed_link_retrain(). Provide LBMS counting API for it.
> 
> 8) Tweaks to variable/functions names for consistency and length
>    reasons.
> 
> Bandwidth Notifications enable the cur_bus_speed in the struct
> pci_bus
> to keep track PCIe Link Speed changes.
> 
> Link:
> https://lore.kernel.org/all/20190429185611.121751-1-helgaas@kernel.org/
> Link:
> https://lore.kernel.org/linux-pci/20190501142942.26972-1-keith.busch@intel.com/
> Link:
> https://lore.kernel.org/linux-pci/20200115221008.GA191037@google.com/
> Suggested-by: Lukas Wunner <lukas@wunner.de> # Building bwctrl on top
> of bwnotif
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---

Hi Ilpo,

I bisected a v6.13-rc1 boot hang on my personal workstation to this
patch. Sadly I don't have much details like a panic or so because the
boot hangs before any kernel messages, or at least they're not visible
long enough to see. I haven't yet looked into the code as I wanted to
raise awareness first. Since the commit doesn't revert cleanly on
v6.13-rc1 I also haven't tried that yet.

Here are some details on my system:
- AMD Ryzen 9 3900X 
- ASRock X570 Creator Motherboard
- Radeon RX 5600 XT
- Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
- Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
- Intel n I211 Gigabit NIC
- Intel Wi-Fi 6 AX200
- Aquantia AQtion AQC107 NIC

If you have patches or things to try just ask.

Thanks,
Niklas


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-06 18:12   ` Niklas Schnelle
@ 2024-12-06 19:31     ` Niklas Schnelle
  2024-12-06 20:07       ` Niklas Schnelle
  2025-01-02 21:30     ` Lukas Wunner
  1 sibling, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2024-12-06 19:31 UTC (permalink / raw)
  To: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, Maciej W . Rozycki,
	Jonathan Cameron, Lukas Wunner, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	niks

On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> > bandwidth notification"). An upcoming commit extends this driver
> > building PCIe bandwidth controller on top of it.
> > 
> > The PCIe bandwidth notification were first added in the commit
> > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> > notification") but later had to be removed. The significant changes
> > compared with the old bandwidth notification driver include:
> > 
---8<---
> > ---
> 
> Hi Ilpo,
> 
> I bisected a v6.13-rc1 boot hang on my personal workstation to this
> patch. Sadly I don't have much details like a panic or so because the
> boot hangs before any kernel messages, or at least they're not visible
> long enough to see. I haven't yet looked into the code as I wanted to
> raise awareness first. Since the commit doesn't revert cleanly on
> v6.13-rc1 I also haven't tried that yet.
> 
> Here are some details on my system:
> - AMD Ryzen 9 3900X 
> - ASRock X570 Creator Motherboard
> - Radeon RX 5600 XT
> - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> - Intel n I211 Gigabit NIC
> - Intel Wi-Fi 6 AX200
> - Aquantia AQtion AQC107 NIC
> 
> If you have patches or things to try just ask.
> 
> Thanks,
> Niklas
> 

Ok I can now at least confirm that bluntly disabling the new bwctrl
driver with the below diff on top of v6.13-rc1 circumvents the boot
hang I'm seeing. So it's definitely this.

diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 5e10306b6308..6fa54480444a 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -828,7 +828,7 @@ static void __init pcie_init_services(void)
        pcie_aer_init();
        pcie_pme_init();
        pcie_dpc_init();
-       pcie_bwctrl_init();
+       /* pcie_bwctrl_init(); */
        pcie_hp_init();
 }


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-06 19:31     ` Niklas Schnelle
@ 2024-12-06 20:07       ` Niklas Schnelle
  2024-12-06 23:06         ` Niklas Schnelle
  0 siblings, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2024-12-06 20:07 UTC (permalink / raw)
  To: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, Maciej W . Rozycki,
	Jonathan Cameron, Lukas Wunner, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	niks

On Fri, 2024-12-06 at 20:31 +0100, Niklas Schnelle wrote:
> On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> > On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> > > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> > > bandwidth notification"). An upcoming commit extends this driver
> > > building PCIe bandwidth controller on top of it.
> > > 
> > > The PCIe bandwidth notification were first added in the commit
> > > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> > > notification") but later had to be removed. The significant changes
> > > compared with the old bandwidth notification driver include:
> > > 
> ---8<---
> > > ---
> > 
> > Hi Ilpo,
> > 
> > I bisected a v6.13-rc1 boot hang on my personal workstation to this
> > patch. Sadly I don't have much details like a panic or so because the
> > boot hangs before any kernel messages, or at least they're not visible
> > long enough to see. I haven't yet looked into the code as I wanted to
> > raise awareness first. Since the commit doesn't revert cleanly on
> > v6.13-rc1 I also haven't tried that yet.
> > 
> > Here are some details on my system:
> > - AMD Ryzen 9 3900X 
> > - ASRock X570 Creator Motherboard
> > - Radeon RX 5600 XT
> > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> > - Intel n I211 Gigabit NIC
> > - Intel Wi-Fi 6 AX200
> > - Aquantia AQtion AQC107 NIC
> > 
> > If you have patches or things to try just ask.
> > 
> > Thanks,
> > Niklas
> > 
> 
> Ok I can now at least confirm that bluntly disabling the new bwctrl
> driver with the below diff on top of v6.13-rc1 circumvents the boot
> hang I'm seeing. So it's definitely this.
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 5e10306b6308..6fa54480444a 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -828,7 +828,7 @@ static void __init pcie_init_services(void)
>         pcie_aer_init();
>         pcie_pme_init();
>         pcie_dpc_init();
> -       pcie_bwctrl_init();
> +       /* pcie_bwctrl_init(); */
>         pcie_hp_init();
>  }
> 

Also here is the full lspci -vvv output running the above on v6.13-rc1:
https://paste.js.org/9UwQIMp7eSgp

Also note that I have CONFIG_PCIE_THERMAL unset so it's also not the
cooling device portion that's causing the issue. Next I guess I should
narrow it down to the specific port where enabling the bandwidth
monitoring is causing trouble, not yet sure how best to do this with
this many devices.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-06 20:07       ` Niklas Schnelle
@ 2024-12-06 23:06         ` Niklas Schnelle
  2024-12-07 16:31           ` Lukas Wunner
  0 siblings, 1 reply; 35+ messages in thread
From: Niklas Schnelle @ 2024-12-06 23:06 UTC (permalink / raw)
  To: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczyński, Maciej W . Rozycki,
	Jonathan Cameron, Lukas Wunner, Alexandru Gagniuc,
	Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel
  Cc: Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET,
	niks

On Fri, 2024-12-06 at 21:07 +0100, Niklas Schnelle wrote:
> On Fri, 2024-12-06 at 20:31 +0100, Niklas Schnelle wrote:
> > On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> > > On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> > > > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> > > > bandwidth notification"). An upcoming commit extends this driver
> > > > building PCIe bandwidth controller on top of it.
> > > > 
> > > > The PCIe bandwidth notification were first added in the commit
> > > > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> > > > notification") but later had to be removed. The significant changes
> > > > compared with the old bandwidth notification driver include:
> > > > 
> > ---8<---
> > > > ---
> > > 
> > > Hi Ilpo,
> > > 
> > > I bisected a v6.13-rc1 boot hang on my personal workstation to this
> > > patch. Sadly I don't have much details like a panic or so because the
> > > boot hangs before any kernel messages, or at least they're not visible
> > > long enough to see. I haven't yet looked into the code as I wanted to
> > > raise awareness first. Since the commit doesn't revert cleanly on
> > > v6.13-rc1 I also haven't tried that yet.
> > > 
> > > Here are some details on my system:
> > > - AMD Ryzen 9 3900X 
> > > - ASRock X570 Creator Motherboard
> > > - Radeon RX 5600 XT
> > > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> > > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> > > - Intel n I211 Gigabit NIC
> > > - Intel Wi-Fi 6 AX200
> > > - Aquantia AQtion AQC107 NIC
> > > 
> > > If you have patches or things to try just ask.
> > > 
> > > Thanks,
> > > Niklas
> > > 
> > 
> > Ok I can now at least confirm that bluntly disabling the new bwctrl
> > driver with the below diff on top of v6.13-rc1 circumvents the boot
> > hang I'm seeing. So it's definitely this.
> > 
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 5e10306b6308..6fa54480444a 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -828,7 +828,7 @@ static void __init pcie_init_services(void)
> >         pcie_aer_init();
> >         pcie_pme_init();
> >         pcie_dpc_init();
> > -       pcie_bwctrl_init();
> > +       /* pcie_bwctrl_init(); */
> >         pcie_hp_init();
> >  }
> > 
> 
> Also here is the full lspci -vvv output running the above on v6.13-rc1:
> https://paste.js.org/9UwQIMp7eSgp
> 
> Also note that I have CONFIG_PCIE_THERMAL unset so it's also not the
> cooling device portion that's causing the issue. Next I guess I should
> narrow it down to the specific port where enabling the bandwidth
> monitoring is causing trouble, not yet sure how best to do this with
> this many devices.
> 
> Thanks,
> Niklas

Ok did some fiddeling and it's the thunderbolt ports. The below diff
works around the issue. That said I guess for a proper fix this would
should get filtered by the port service matching? Also as can be seen
in lspci the port still claims to support bandwidth management so maybe
other thunderbolt ports actually do.

Thanks,
Niklas

diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index b59cacc740fa..76a14f959c7f 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -294,6 +294,9 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
        struct pci_dev *port = srv->port;
        int ret;

+       if (srv->port->is_thunderbolt)
+               return -EOPNOTSUPP;
+
        struct pcie_bwctrl_data *data = devm_kzalloc(&srv->device,
                                                     sizeof(*data), GFP_KERNEL);
        if (!data)


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-06 23:06         ` Niklas Schnelle
@ 2024-12-07 16:31           ` Lukas Wunner
  2024-12-07 18:18             ` Niklas Schnelle
  0 siblings, 1 reply; 35+ messages in thread
From: Lukas Wunner @ 2024-12-07 16:31 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET, Mika Westerberg

[cc += Mika, start of thread:
https://lore.kernel.org/all/db8e457fcd155436449b035e8791a8241b0df400.camel@kernel.org/
]

On Sat, Dec 07, 2024 at 12:06:49AM +0100, Niklas Schnelle wrote:
> > > On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> > > > I bisected a v6.13-rc1 boot hang on my personal workstation to this
> > > > patch. Sadly I don't have much details like a panic or so because the
> > > > boot hangs before any kernel messages, or at least they're not visible
> > > > long enough to see. I haven't yet looked into the code as I wanted to
> > > > raise awareness first. Since the commit doesn't revert cleanly on
> > > > v6.13-rc1 I also haven't tried that yet.
> > > > 
> > > > Here are some details on my system:
> > > > - AMD Ryzen 9 3900X 
> > > > - ASRock X570 Creator Motherboard
> > > > - Radeon RX 5600 XT
> > > > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> > > > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> > > > - Intel n I211 Gigabit NIC
> > > > - Intel Wi-Fi 6 AX200
> > > > - Aquantia AQtion AQC107 NIC
> 
> Ok did some fiddeling and it's the thunderbolt ports. The below diff
> works around the issue. That said I guess for a proper fix this would
> should get filtered by the port service matching? Also as can be seen
> in lspci the port still claims to support bandwidth management so maybe
> other thunderbolt ports actually do.
[...]
> --- a/drivers/pci/pcie/bwctrl.c
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -294,6 +294,9 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>         struct pci_dev *port = srv->port;
>         int ret;
> 
> +       if (srv->port->is_thunderbolt)
> +               return -EOPNOTSUPP;
> +

Thanks for reporting and analyzing this.

The PCIe bandwidth controller is only instantiated on Downstream Ports.
Per the spec, Thunderbolt PCIe Downstream Ports are just tunnel endpoints
with a fake Max Link Speed of 2.5 GT/s (USB4 v2 sec 11.2.1):

   "Max Link Speed field in the Link Capabilities Register set to 0001b
    (data rate of 2.5 GT/s only).
    Note: These settings do not represent actual throughput.
    Throughput is implementation specific and based on the USB4 Fabric
    performance."

So your patch does make sense in so far as the link speed of Thunderbolt
PCIe Downstream Ports is fixed to 2.5 GT/s and cannot be throttled because
that's already the lowest possible PCIe speed.  The actual speed is
determined by the Thunderbolt links.

The check for the is_thunderbolt bit should be moved to the if-clause in
get_port_device_capability() which sets the PCIE_PORT_SERVICE_BWCTRL bit
in the services mask.

Alternatively, it may be worth considering not to instantiate the
bandwidth controller if the only link speed supported is 2.5 GT/s.

We should try to find out what actually causes the boot hang
(some interrupt storm maybe?), but that can hopefully be done
internally at Intel if the boot hang is reproducible.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-07 16:31           ` Lukas Wunner
@ 2024-12-07 18:18             ` Niklas Schnelle
  0 siblings, 0 replies; 35+ messages in thread
From: Niklas Schnelle @ 2024-12-07 18:18 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET, Mika Westerberg

On Sat, 2024-12-07 at 17:31 +0100, Lukas Wunner wrote:
> [cc += Mika, start of thread:
> https://lore.kernel.org/all/db8e457fcd155436449b035e8791a8241b0df400.camel@kernel.org/
> ]
> 
> On Sat, Dec 07, 2024 at 12:06:49AM +0100, Niklas Schnelle wrote:
> > > > On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> > > > > I bisected a v6.13-rc1 boot hang on my personal workstation to this
> > > > > patch. Sadly I don't have much details like a panic or so because the
> > > > > boot hangs before any kernel messages, or at least they're not visible
> > > > > long enough to see. I haven't yet looked into the code as I wanted to
> > > > > raise awareness first. Since the commit doesn't revert cleanly on
> > > > > v6.13-rc1 I also haven't tried that yet.
> > > > > 
> > > > > Here are some details on my system:
> > > > > - AMD Ryzen 9 3900X 
> > > > > - ASRock X570 Creator Motherboard
> > > > > - Radeon RX 5600 XT
> > > > > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> > > > > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> > > > > - Intel n I211 Gigabit NIC
> > > > > - Intel Wi-Fi 6 AX200
> > > > > - Aquantia AQtion AQC107 NIC
> > 
> > Ok did some fiddeling and it's the thunderbolt ports. The below diff
> > works around the issue. That said I guess for a proper fix this would
> > should get filtered by the port service matching? Also as can be seen
> > in lspci the port still claims to support bandwidth management so maybe
> > other thunderbolt ports actually do.
> [...]
> > --- a/drivers/pci/pcie/bwctrl.c
> > +++ b/drivers/pci/pcie/bwctrl.c
> > @@ -294,6 +294,9 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> >         struct pci_dev *port = srv->port;
> >         int ret;
> > 
> > +       if (srv->port->is_thunderbolt)
> > +               return -EOPNOTSUPP;
> > +
> 
> Thanks for reporting and analyzing this.
> 
> The PCIe bandwidth controller is only instantiated on Downstream Ports.
> Per the spec, Thunderbolt PCIe Downstream Ports are just tunnel endpoints
> with a fake Max Link Speed of 2.5 GT/s (USB4 v2 sec 11.2.1):
> 
>    "Max Link Speed field in the Link Capabilities Register set to 0001b
>     (data rate of 2.5 GT/s only).
>     Note: These settings do not represent actual throughput.
>     Throughput is implementation specific and based on the USB4 Fabric
>     performance."
> 
> So your patch does make sense in so far as the link speed of Thunderbolt
> PCIe Downstream Ports is fixed to 2.5 GT/s and cannot be throttled because
> that's already the lowest possible PCIe speed.  The actual speed is
> determined by the Thunderbolt links.
> 
> The check for the is_thunderbolt bit should be moved to the if-clause in
> get_port_device_capability() which sets the PCIE_PORT_SERVICE_BWCTRL bit
> in the services mask.
> 
> Alternatively, it may be worth considering not to instantiate the
> bandwidth controller if the only link speed supported is 2.5 GT/s.

Thanks for the great explanation. I think your last point is also a
good one, if 2.5 GT/s is the only possible link speed why even
instantiate the bandwidth controller. As get_port_device_capability()
already reads PCI_EXP_LNKCAP we also have that information right there
already. Since I put time and effort in already I've gone ahead and
cooked up a patch with this approach and will send that shortly.

> 
> We should try to find out what actually causes the boot hang
> (some interrupt storm maybe?), but that can hopefully be done
> internally at Intel if the boot hang is reproducible.

Makes sense, I do remember that this was one of very few motherboards
for AMD CPUs with Thunderbolt.

Thanks,
Niklas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2024-10-18 14:47 ` [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed Ilpo Järvinen
  2024-11-12 15:47   ` Lukas Wunner
@ 2025-01-02 10:38   ` Lukas Wunner
  2025-01-05 16:43     ` Ilpo Järvinen
  1 sibling, 1 reply; 35+ messages in thread
From: Lukas Wunner @ 2025-01-02 10:38 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Jonathan Cameron,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, linux-kernel,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> @@ -142,9 +304,11 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
>  	if (ret)
>  		return ret;
>  
> -	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> -		port->link_bwctrl = no_free_ptr(data);
> -		pcie_bwnotif_enable(srv);
> +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
> +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> +			port->link_bwctrl = no_free_ptr(data);
> +			pcie_bwnotif_enable(srv);
> +		}
>  	}

The "data" pointer is allocated with devm_kzalloc().
There's no __free(kfree) anywhere.

So what's the motivation for the no_free_ptr()?
Is this a remnant of an earlier version of the patch set that can be deleted
or is there actually a purpose to it?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller
  2024-12-06 18:12   ` Niklas Schnelle
  2024-12-06 19:31     ` Niklas Schnelle
@ 2025-01-02 21:30     ` Lukas Wunner
  1 sibling, 0 replies; 35+ messages in thread
From: Lukas Wunner @ 2025-01-02 21:30 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: Ilpo Järvinen, linux-pci, Bjorn Helgaas, Lorenzo Pieralisi,
	Rob Herring, Krzysztof Wilczy??ski, Maciej W . Rozycki,
	Jonathan Cameron, Alexandru Gagniuc, Krishna chaitanya chundru,
	Srinivas Pandruvada, Rafael J . Wysocki, linux-pm,
	Smita Koralahalli, linux-kernel, Daniel Lezcano, Amit Kucheria,
	Zhang Rui, Christophe JAILLET, Evert Vorster

[cc += Evert]

On Fri, Dec 06, 2024 at 07:12:37PM +0100, Niklas Schnelle wrote:
> On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> > bandwidth notification"). An upcoming commit extends this driver
> > building PCIe bandwidth controller on top of it.
> > 
> > The PCIe bandwidth notification were first added in the commit
> > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> > notification") but later had to be removed. The significant changes
> > compared with the old bandwidth notification driver include:
[...]
> I bisected a v6.13-rc1 boot hang on my personal workstation to this
> patch. Sadly I don't have much details like a panic or so because the
> boot hangs before any kernel messages, or at least they're not visible
> long enough to see. I haven't yet looked into the code as I wanted to
> raise awareness first. Since the commit doesn't revert cleanly on
> v6.13-rc1 I also haven't tried that yet.

Hi Niklas,

another regression caused by bandwidth control was reported,
this time it's a hang on shutdown (instead of on boot) and
this time we root-caused it instead of just working around it.

Here's the proposed fix:
https://lore.kernel.org/r/0ee5faf5395cad8d29fb66e1ec444c8d882a4201.1735852688.git.lukas@wunner.de/

Could you test this on top of v6.13-rc1 (or any other rc which
does not include 774c71c52aa4 ("PCI/bwctrl: Enable only if more
than one speed is supported")?

I suspect that it may fix the issue you were seeing as well and
I'm curious whether that suspicion is correct.

Thanks!

Lukas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed
  2025-01-02 10:38   ` Lukas Wunner
@ 2025-01-05 16:43     ` Ilpo Järvinen
  0 siblings, 0 replies; 35+ messages in thread
From: Ilpo Järvinen @ 2025-01-05 16:43 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Bjorn Helgaas, Lorenzo Pieralisi, Rob Herring,
	Krzysztof Wilczy??ski, Maciej W . Rozycki, Jonathan Cameron,
	Alexandru Gagniuc, Krishna chaitanya chundru, Srinivas Pandruvada,
	Rafael J . Wysocki, linux-pm, Smita Koralahalli, LKML,
	Daniel Lezcano, Amit Kucheria, Zhang Rui, Christophe JAILLET

[-- Attachment #1: Type: text/plain, Size: 943 bytes --]

On Thu, 2 Jan 2025, Lukas Wunner wrote:

> On Fri, Oct 18, 2024 at 05:47:53PM +0300, Ilpo Järvinen wrote:
> > @@ -142,9 +304,11 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
> >  	if (ret)
> >  		return ret;
> >  
> > -	scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> > -		port->link_bwctrl = no_free_ptr(data);
> > -		pcie_bwnotif_enable(srv);
> > +	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem) {
> > +		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem) {
> > +			port->link_bwctrl = no_free_ptr(data);
> > +			pcie_bwnotif_enable(srv);
> > +		}
> >  	}
> 
> The "data" pointer is allocated with devm_kzalloc().
> There's no __free(kfree) anywhere.
> 
> So what's the motivation for the no_free_ptr()?
> Is this a remnant of an earlier version of the patch set that can be deleted
> or is there actually a purpose to it?

Remnant it was so it is okay to remove it in your fix.

-- 
 i.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-01-05 16:43 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-18 14:47 [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 1/9] Documentation PCI: Reformat RMW ops documentation Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 2/9] PCI: Protect Link Control 2 Register with RMW locking Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 3/9] PCI: Store all PCIe Supported Link Speeds Ilpo Järvinen
2024-11-11 13:23   ` Lukas Wunner
2024-11-11 20:30     ` Bjorn Helgaas
2024-11-12  9:20       ` Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 4/9] PCI: Refactor pcie_update_link_speed() Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 5/9] PCI/quirks: Abstract LBMS seen check into own function Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller Ilpo Järvinen
2024-12-06 18:12   ` Niklas Schnelle
2024-12-06 19:31     ` Niklas Schnelle
2024-12-06 20:07       ` Niklas Schnelle
2024-12-06 23:06         ` Niklas Schnelle
2024-12-07 16:31           ` Lukas Wunner
2024-12-07 18:18             ` Niklas Schnelle
2025-01-02 21:30     ` Lukas Wunner
2024-10-18 14:47 ` [PATCH v9 7/9] PCI/bwctrl: Add API to set PCIe Link Speed Ilpo Järvinen
2024-11-12 15:47   ` Lukas Wunner
2024-11-12 16:01     ` Ilpo Järvinen
2024-11-18 13:03       ` Jonathan Cameron
2024-11-18 13:17         ` Ilpo Järvinen
2024-11-20 15:36           ` Jonathan Cameron
2024-11-12 20:43     ` Bjorn Helgaas
2025-01-02 10:38   ` Lukas Wunner
2025-01-05 16:43     ` Ilpo Järvinen
2024-10-18 14:47 ` [PATCH v9 8/9] thermal: Add PCIe cooling driver Ilpo Järvinen
2024-11-13  8:44   ` Lukas Wunner
2024-11-15 17:28     ` Bjorn Helgaas
2024-10-18 14:47 ` [PATCH v9 9/9] selftests/pcie_bwctrl: Create selftests Ilpo Järvinen
2024-10-23 22:19 ` [PATCH v9 0/9] PCI: Add PCIe bandwidth controller Bjorn Helgaas
2024-11-13 21:48   ` Bjorn Helgaas
2024-11-14  8:46     ` Bartosz Golaszewski
2024-11-14 12:32     ` Ilpo Järvinen
2024-11-14 17:36     ` Krzysztof Wilczyński

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).