* [PATCH 0/2] pci/switch_discovery: Add new module to discover inter-switch P2P links
From: Shivasharan S @ 2024-06-12 11:27 UTC
To: linux-pci, bhelgaas
Cc: linux-kernel, sumanesh.samanta, sathya.prakash, Shivasharan S
A. Introductory definitions:
Virtual Switch: Broadcom (PLX) switches have a capability where a
single physical switch can be divided into N virtual switches at SOD.
For example, a single physical switch with 64 ports can be configured
to appear to the host as 2 switches with 32 ports each. This is a
static configuration that must be done before the switch boots, and it
generally cannot be changed on the fly. Now consider a GPU on Virtual
Switch 1 and a NIC on Virtual Switch 2. The key point is that both sit
on the same physical switch, and IF P2P is enabled between the two
virtual switches, there is almost infinite bandwidth between the GPU
and the NIC. However, today the host has no way to know that, and host
applications assume that any data exchange between the GPU and the NIC
must go through the host root port and is therefore slow. Note: Any
such P2P must follow ACS/IOMMU rules, and has to be enabled in the
Broadcom switches.
Inter Switch Link: While the current use case is the virtual switch
configuration above, this could also extend to physical switches,
where two physical switches have, say, a x16 PCIe connection between
them.
B: Goal/Problem statement:
Goal 1: Summary: Provide user applications a means to discover that
two virtual switches are part of the same physical switch, or that two
physical switches are directly connected to each other, so that they
can find an optimized data path for HPC/AI applications.
With the rapid progression of High Performance Computing (HPC) and
Artificial Intelligence (AI), it is becoming more and more common to
have complex topologies with multiple GPUs, NICs, NVMe devices etc.
interconnected using multiple switches. HPC and AI libraries like
MPI, UCC, NCCL, RCCL, HCCL etc. analyze this topology to build a
topology tree and optimize the data path for collective operations
like all-reduce.
Example:
                          Host root bridge
                ---------------------------------------
                |                  |                  |
  NIC1 --- PCI Switch1        PCI Switch2        PCI Switch3 --- NIC2
                |                  |                  |
               GPU1 ------------- GPU2 ------------- GPU3

                              SERVER 1
In the simple picture above in Server1, Switch1, Switch2, Switch3
are all connected to the host bridge and each switch has a GPU
connected, and Switch1/3 each has a NIC connected.
In a typical AI setup, there are many such servers, each connected by
an upper level network switch, and "rail optimized", i.e., NIC1 of all
servers is connected to Ethernet Switch1, NIC2 to Ethernet Switch2 etc.
(Ethernet switches are not shown in the picture above.)
The GPUs are connected among themselves by some backend fabric, like
NVLINK (NVIDIA).
Assume that in the above diagram, PCI Switch1 and PCI Switch3 are
virtual switches belonging to the same physical switch and thus a very
high speed data link exists between them, but today host applications
have no knowledge about that.
(This is a very simple example, and modern AI infrastructure can be
way more complex than that.)
Now for collective operations like all-reduce, the HPC/AI libraries
analyze the topology above and typically decide on a data path like
this: NIC1->GPU1->GPU2->GPU3->NIC2, which is suboptimal, because
ideally data should go in and out through the same NIC because of the
"rail optimized" topology.
Some libraries do this: NIC1->GPU1->GPU2->GPU3->GPU1->NIC1.
The applications do the above because they think data from GPU3 to
NIC1 needs to go through the host root port, which is very
inefficient. What they do not know is that Switch1 and Switch3 are the
same physical entity with virtually infinite bandwidth between them.
With that knowledge, they would have chosen a path like
NIC1->GPU1->GPU2->GPU3->NIC1, which is the most optimized in the above
example.
Goal 2: Extend Linux P2PDMA distance function pci_p2pdma_distance to
account for Virtual Switch and physical switches connected by inter
switch link. The current implementation of the function has no
knowledge of Virtual switch and inter switch link.
Consider the example below:
     -+  Root Port
      \+ Switch1 Upstream Port
       +-+ Switch1 Downstream Port 0
         \- Device A
      \+ Switch2 Upstream Port
       +-+ Switch2 Downstream Port 0
         \- Device B
Suppose Switch1 and Switch2 are virtual switches belonging to the
same physical switch. Today the P2PDMA distance between Device A and
Device B will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, as the kernel
has no idea that Switch1 and Switch2 are actually physically connected
to each other.
We intend to fix that, so that pci_p2pdma_distance takes switch
connectivity information into account.
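
The difference can be sketched with a small userspace model. This is
only an illustration under stated assumptions, not the kernel code and
not what patch 2/2 does line by line: `struct node`, `top_bridge()`,
`same_physical_switch()`, `model_distance()` and the `MAP_*` values are
hypothetical names, and the serial-number comparison stands in for
whatever the switch driver reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one node in the PCI hierarchy (not a kernel type). */
struct node {
	const struct node *parent;	/* NULL for the root port */
	const char *serial;		/* switch serial number, NULL if none */
};

/* Hypothetical result values; they only mirror the idea of the kernel enum. */
enum map_type { MAP_THRU_HOST_BRIDGE, MAP_BUS_ADDR };

/* Walk up from a device to the bridge directly below the root port. */
static const struct node *top_bridge(const struct node *n)
{
	assert(n && n->parent);
	while (n->parent->parent)
		n = n->parent;
	return n;
}

/* Assumption: two upstream ports with the same serial share a physical switch. */
static int same_physical_switch(const struct node *a, const struct node *b)
{
	return a->serial && b->serial && strcmp(a->serial, b->serial) == 0;
}

static enum map_type model_distance(const struct node *a, const struct node *b,
				    int use_switch_info)
{
	const struct node *ua = top_bridge(a);
	const struct node *ub = top_bridge(b);

	if (ua == ub)
		return MAP_BUS_ADDR;		/* common upstream port */
	if (use_switch_info && same_physical_switch(ua, ub))
		return MAP_BUS_ADDR;		/* inter-switch link found */
	return MAP_THRU_HOST_BRIDGE;		/* only the root port is shared */
}
```

With the tree above and both upstream ports reporting the same serial,
the model returns MAP_THRU_HOST_BRIDGE when switch information is
ignored and MAP_BUS_ADDR when it is used.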
C. FAQs
FAQ 1: How does this feature work with ACS/IOMMU?
This feature does NOT add any new connectivity. The inter-switch
/virtual switch connections already follow all ACS/IOMMU rules, and
only if allowed by ACS settings, they allow for data to follow a
shortcut connection between switches and bypass the root port. The
only thing this module does is provide the switch connection
information to application software and pci_p2pdma_distance clients,
so that they can make intelligent decisions for the data path.
FAQ 2: Is this feature Broadcom specific and will it work for other
vendors?
The current implementation of the kernel module looks at Broadcom
vendor specific extensions to determine if switch P2P is enabled.
Thus, the current implementation works only on Broadcom switches. That
being said, other vendors are free to extend/modify the code to
support their switches. The function names, code structure and sysfs
path that exposes the PCI switch P2P links are intentionally generic,
to allow support to be extended to other vendors.
FAQ 3: Why can't applications read the Broadcom vendor specific
information directly from the config space? Why do we need the sysfs
path?
The vendor specific section of PCIe config space is not readable by
applications running in non-root mode, as such applications can only
read the first 64 bytes of the config space. Besides, reading the
vendor specific config space would not make the solution generic.
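
As a rough illustration of what an unprivileged consumer could do with
the sysfs tree instead, here is a minimal userspace sketch. The
directory layout is the one shown in patch 1/2 of this series and may
change in later revisions; `count_linked_switches()` is a hypothetical
helper, not part of any library.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Path as posted in patch 1/2; treat it as an assumption, not a stable ABI. */
#define SWD_LINKS_DIR "/sys/kernel/pci_switch_link/virtual_switch_links"

/*
 * Count the peer switches listed under <links_dir>/<bdf>.
 * Returns -1 if the directory cannot be opened, for example when the
 * module is not loaded or the bridge has no inter-switch link.
 */
static int count_linked_switches(const char *links_dir, const char *bdf)
{
	char path[512];
	struct dirent *de;
	DIR *dir;
	int n = 0;

	assert(links_dir && bdf);

	if (snprintf(path, sizeof(path), "%s/%s", links_dir, bdf) >=
	    (int)sizeof(path))
		return -1;	/* path would have been truncated */

	dir = opendir(path);
	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		n++;			/* each remaining entry is one peer */
	}
	closedir(dir);
	return n;
}
```

A caller would pass SWD_LINKS_DIR and a bridge address such as
"0000:ad:00.0". As far as the posted code shows, the tree is only
built when the refresh_switch_toplogy attribute in
/sys/kernel/pci_switch_link is read, so a consumer would read that
file first.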
FAQ 4: Will applications still use the standard P2P model of
registering the provider, client etc?
Absolutely. All existing P2P APIs will work as is. All that this module
provides is the information that a fast connection exists between
switches and/or PCI endpoints. To perform the actual P2P DMA,
applications need to use the existing P2P API and follow the existing
ACS/IOMMU rules.
FAQ 5: Why can't we only modify the existing pci_p2pdma_distance
function, and expose a p2pdistance to userspace? Why do we need the
new sysfs entries for pci switch connectivity?
The existing HPC/AI libraries like MPI, UCC, NCCL, RCCL, HCCL etc. work
not only with PCIe switches, but also with other kinds of connectivity,
like TCP, network switches, InfiniBand and backend inter-GPU
connectivity like NVLINK and AFL. Because of that, the libraries have
mature code that analyzes all the connections and the entire topology
to determine the optimal data path among nodes. Just using
pci_p2pdma_distance does not work for them, because there might be a
shorter path between two nodes using NVLINK or a network switch. In
theory those libraries could be modified to use pci_p2pdma_distance
for PCIe connections and other methods for other connections, but in
practice that is nearly impossible, as those changes are very intrusive
and those libraries have matured over a long time. Their respective
maintainers are highly reluctant to make such a big change and would
rather get only the missing information, that is, whether two switches
are connected together. Broadcom has received such feedback first hand.
Forcing everyone to use p2pdistance only would defeat the whole purpose
of this module. However, we do want to support those libraries that
want to use pci_p2pdma_distance, and that is why we are extending the
pci_p2pdma_distance function too. Thus, our goal here is to enable
existing libraries to get only the information they need, while
providing a means for new or more flexible code to use
pci_p2pdma_distance as needed.
Shivasharan S (2):
  switch_discovery: Add new module to discover inter switch links
    between PCI-to-PCI bridges
  pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links

 .../driver-api/pci/switch_discovery.rst       |  52 +++
 MAINTAINERS                                   |  13 +
 drivers/pci/Kconfig                           |   1 +
 drivers/pci/p2pdma.c                          |  18 +-
 drivers/pci/switch/Kconfig                    |   9 +
 drivers/pci/switch/Makefile                   |   1 +
 drivers/pci/switch/switch_discovery.c         | 375 ++++++++++++++++++
 drivers/pci/switch/switch_discovery.h         |  44 ++
 8 files changed, 512 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/driver-api/pci/switch_discovery.rst
 create mode 100644 drivers/pci/switch/switch_discovery.c
 create mode 100644 drivers/pci/switch/switch_discovery.h
--
2.43.0

* [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges
From: Shivasharan S @ 2024-06-12 11:27 UTC
To: linux-pci, bhelgaas
Cc: linux-kernel, sumanesh.samanta, sathya.prakash, Shivasharan S

This kernel module discovers the virtual inter-switch P2P links present
between two PCI-to-PCI bridges that allows an optimal data path for data
movement. The module creates sysfs entries for upstream PCI-to-PCI
bridges which supports the inter switch P2P links as below:

                         Host root bridge
                ---------------------------------------
                |                                     |
  NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
(af:00.0)   (ad:00.0)                            (8b:00.0)   (8d:00.0)
                |                                     |
               GPU1                                  GPU2
            (b0:00.0)                             (8e:00.0)
                               SERVER 1

/sys/kernel/pci_switch_link/virtual_switch_links
├── 0000:8b:00.0
│   └── 0000:ad:00.0 -> ../0000:ad:00.0
└── 0000:ad:00.0
    └── 0000:8b:00.0 -> ../0000:8b:00.0

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
---
 .../driver-api/pci/switch_discovery.rst       |  52 +++
 MAINTAINERS                                   |  13 +
 drivers/pci/switch/Kconfig                    |   9 +
 drivers/pci/switch/Makefile                   |   1 +
 drivers/pci/switch/switch_discovery.c         | 375 ++++++++++++++++++
 drivers/pci/switch/switch_discovery.h         |  44 ++
 6 files changed, 494 insertions(+)
 create mode 100644 Documentation/driver-api/pci/switch_discovery.rst
 create mode 100644 drivers/pci/switch/switch_discovery.c
 create mode 100644 drivers/pci/switch/switch_discovery.h

diff --git a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst
new file mode 100644
index 000000000000..7c1476260e5e
--- /dev/null
+++ b/Documentation/driver-api/pci/switch_discovery.rst
@@ -0,0 +1,52 @@
+=================================
+Linux PCI Switch discovery module
+=================================
+
+Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer
+without using host resources. For example, Broadcom(PLX) PCIe Switches have a
+capability where a single physical switch can be divided up into multiple
+virtual switches at SOD. PCIe switch discovery module detects the virtual links
+between the switches that belong to the same physical switch.
+This allows user space applications to discover these virtual links that belong
+to the same physical switch and configure optimized data paths.
+
+Userspace Interface
+===================
+
+The module exposes sysfs entries for user space applications like MPI, NCCL,
+UCC, RCCL, HCCL, etc to discover the virtual switch links.
+
+Consider the below topology
+
+                         Host root bridge
+                ---------------------------------------
+                |                                     |
+  NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
+(af:00.0)   (ad:00.0)                            (8b:00.0)   (8d:00.0)
+                |                                     |
+               GPU1                                  GPU2
+            (b0:00.0)                             (8e:00.0)
+                               SERVER 1
+
+The simple topology above shows SERVER1, has Switch1 and Switch2 which are
+virtual switches that belong to the same physical switch that support
+Inter switch P2P.
+Switch1 and Switch2 have a GPU and NIC each connected.
+The module will detect the virtual P2P link existing between the two switches
+and create the sysfs entries as below.
+
+/sys/kernel/pci_switch_link/virtual_switch_links
+├── 0000:8b:00.0
+│   └── 0000:ad:00.0 -> ../0000:ad:00.0
+└── 0000:ad:00.0
+    └── 0000:8b:00.0 -> ../0000:8b:00.0
+
+The HPC/AI libraries that analyze the topology can decide the optimal data
+path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a
+non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1.
+
+Enable P2P DMA to discover virtual links
+----------------------------------------
+The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual
+link between the upstream PCI-to-PCI bridges of the devices and detect optimal
+path for applications using P2P DMA API.
diff --git a/MAINTAINERS b/MAINTAINERS
index 823387766a0c..b1bf3533ea6f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17359,6 +17359,19 @@ F:	Documentation/driver-api/pci/p2pdma.rst
 F:	drivers/pci/p2pdma.c
 F:	include/linux/pci-p2pdma.h
 
+PCI SWITCH DISCOVERY
+M:	Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
+M:	Sumanesh Samanta <sumanesh.samanta@broadcom.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+Q:	https://patchwork.kernel.org/project/linux-pci/list/
+B:	https://bugzilla.kernel.org
+C:	irc://irc.oftc.net/linux-pci
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
+F:	Documentation/driver-api/pci/switch_discovery.rst
+F:	drivers/pci/switch/switch_discovery.c
+F:	drivers/pci/switch/switch_discovery.h
+
 PCI SUBSYSTEM
 M:	Bjorn Helgaas <bhelgaas@google.com>
 L:	linux-pci@vger.kernel.org
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
index d370f4ce0492..fb4410153950 100644
--- a/drivers/pci/switch/Kconfig
+++ b/drivers/pci/switch/Kconfig
@@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC
 	 devices. See <file:Documentation/driver-api/switchtec.rst> for more
 	 information.
 
+config PCI_SW_DISCOVERY
+	depends on PCI
+	tristate "PCI Switch discovery module"
+	help
+	 This kernel module discovers the PCI-to-PCI bridges of PCIe switches
+	 and forms the virtual switch links if the bridges belong to the same
+	 Physical switch. The switch links help to identify shorter distances
+	 for P2P configurations.
+
 endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
index acd56d3b4a35..a3584b5146af 100644
--- a/drivers/pci/switch/Makefile
+++ b/drivers/pci/switch/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
+obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o
diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c
new file mode 100644
index 000000000000..a427d3885b1f
--- /dev/null
+++ b/drivers/pci/switch/switch_discovery.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Switch Discovery module
+ *
+ * Copyright (c) 2024 Broadcom Inc.
+ *
+ * Authors: Broadcom Inc.
+ *          Sumanesh Samanta <sumanesh.samanta@broadcom.com>
+ *          Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/pci.h>
+#include <linux/vmalloc.h>
+#include "switch_discovery.h"
+
+static DECLARE_RWSEM(sw_disc_rwlock);
+static struct kobject *sw_disc_kobj, *sw_link_kobj;
+static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH];
+static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH);
+
+static struct switch_data *swdata;
+
+static int sw_disc_probe(void);
+static int sw_disc_create_sysfs_files(void);
+static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num);
+
+static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev)
+{
+	if ((pdev->vendor == PCI_VENDOR_ID_LSI) &&
+	    ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) ||
+	     (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC)))
+		return true;
+
+	return false;
+}
+
+static ssize_t sw_disc_show(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    char *buf)
+{
+	int retval;
+
+	down_write(&sw_disc_rwlock);
+	retval = sw_disc_probe();
+	if (!retval) {
+		pr_debug("No new switch found\n");
+		goto exit_success;
+	}
+
+	retval = sw_disc_create_sysfs_files();
+	if (retval < 0) {
+		pr_err("Failed to create the sysfs entries, retval %d\n",
+		       retval);
+	}
+
+exit_success:
+	up_write(&sw_disc_rwlock);
+	return sysfs_emit(buf, SWD_SCAN_DONE);
+}
+
+/* This function probes the PCIe devices for virtual links */
+static int sw_disc_probe(void)
+{
+	int i, bit;
+	struct pci_dev *pdev = NULL;
+	int topology_changed = 0;
+	DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH);
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) {
+		int sw_found;
+
+		/* Currently this function only traverses Broadcom
+		 * PEX switches and determines the virtual SW links.
+		 * Other Switch vendors can add their specific logic
+		 * determine the virtual links.
+		 */
+		if (!sw_disc_is_supported_pdev(pdev))
+			continue;
+
+		sw_found = -1;
+
+		for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) {
+			if (swdata[bit].devfn == pdev->devfn &&
+			    swdata[bit].bus == pdev->bus) {
+				sw_found = bit;
+				set_bit(sw_found, sw_found_map);
+				break;
+			}
+		}
+
+		if (sw_found != -1)
+			continue;
+
+		for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
+			if (!swdata[i].bus)
+				break;
+
+		if (i >= SWD_MAX_VIRT_SWITCH) {
+			pr_err("Max switch exceeded\n");
+			break;
+		}
+
+		sw_found = i;
+
+		if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num))
+			continue;
+
+		/* Found a new switch which supports P2P */
+		swdata[sw_found].devfn = pdev->devfn;
+		swdata[sw_found].bus = pdev->bus;
+
+		topology_changed = 1;
+		set_bit(sw_found, sw_found_map);
+		set_bit(sw_found, swdata_valid);
+	}
+
+	/* handle device removal */
+	for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) {
+		if (test_bit(bit, swdata_valid)) {
+			memset(&swdata[bit], 0, sizeof(swdata[i]));
+			clear_bit(bit, swdata_valid);
+			topology_changed = 1;
+		}
+	}
+
+	return topology_changed;
+}
+
+/* Check the various config space registers of the Broadcom PCI device and
+ * return true if the device supports inter switch P2P.
+ * If P2P is supported, return the device serial number back to
+ * caller.
+ */
+bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num)
+{
+	int base;
+	u32 cap_data1, cap_data2;
+	u16 vsec;
+	u32 vsec_data;
+
+	base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN);
+	if (!base) {
+		pr_debug("Failed to get extended capability bus %x devfn %x\n",
+			 pdev->bus->number, pdev->devfn);
+		return false;
+	}
+
+	vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1);
+	if (!vsec) {
+		pr_debug("Failed to get VSEC bus %x devfn %x\n",
+			 pdev->bus->number, pdev->devfn);
+		return false;
+	}
+
+	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM)
+		return false;
+
+	pci_read_config_dword(pdev, base + 8, &cap_data1);
+	pci_read_config_dword(pdev, base + 4, &cap_data2);
+
+	pci_read_config_dword(pdev, vsec + 12, &vsec_data);
+
+	pr_debug("Found Broadcom device bus 0x%x devfn 0x%x "
+		 "Serial Number: 0x%x 0x%x, VSEC 0x%x\n",
+		 pdev->bus->number, pdev->devfn,
+		 cap_data1, cap_data2, vsec_data);
+
+	if (!SECURE_PART(cap_data1))
+		return false;
+
+	if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK))
+		return false;
+
+	if (serial_num)
+		snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2);
+
+	return true;
+}
+
+static int sw_disc_create_sysfs_files(void)
+{
+	int i, j, retval;
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		if (sw_kobj[i]) {
+			kobject_put(sw_kobj[i]);
+			sw_kobj[i] = NULL;
+		}
+	}
+
+	if (sw_link_kobj) {
+		kobject_put(sw_link_kobj);
+		sw_link_kobj = NULL;
+	}
+
+	sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj);
+	if (!sw_link_kobj) {
+		pr_err("Failed to create pci link object\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		int segment, bus, device, function;
+		char bdf_i[SWD_MAX_CHAR];
+
+		if (!test_bit(i, swdata_valid))
+			continue;
+
+		segment = pci_domain_nr(swdata[i].bus);
+		bus = swdata[i].bus->number;
+		device = pci_ari_enabled(swdata[i].bus) ?
+				0 : PCI_SLOT(swdata[i].devfn);
+		function = pci_ari_enabled(swdata[i].bus) ?
+				swdata[i].devfn : PCI_FUNC(swdata[i].devfn);
+		sprintf(bdf_i, "%04x:%02x:%02x.%x",
+			segment, bus, device, function);
+
+		for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) {
+			char bdf_j[SWD_MAX_CHAR];
+
+			if (!test_bit(j, swdata_valid))
+				continue;
+			segment = pci_domain_nr(swdata[j].bus);
+			bus = swdata[j].bus->number;
+			device = pci_ari_enabled(swdata[j].bus) ?
+					0 : PCI_SLOT(swdata[j].devfn);
+			function = pci_ari_enabled(swdata[j].bus) ?
+					swdata[j].devfn : PCI_FUNC(swdata[j].devfn);
+			sprintf(bdf_j, "%04x:%02x:%02x.%x",
+				segment, bus, device, function);
+
+			if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) {
+				if (!sw_kobj[i]) {
+					sw_kobj[i] = kobject_create_and_add(bdf_i,
+									    sw_link_kobj);
+					if (!sw_kobj[i]) {
+						pr_err("Failed to create sysfs entry for switch %s\n",
+						       bdf_i);
+					}
+				}
+
+				if (!sw_kobj[j]) {
+					sw_kobj[j] = kobject_create_and_add(bdf_j,
+									    sw_link_kobj);
+					if (!sw_kobj[j]) {
+						pr_err("Failed to create sysfs entry for switch %s\n",
+						       bdf_j);
+					}
+				}
+
+				retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j);
+				if (retval)
+					pr_err("Error creating symlink %s and %s\n",
+					       bdf_i, bdf_j);
+
+				retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i);
+				if (retval)
+					pr_err("Error creating symlink %s and %s\n",
+					       bdf_j, bdf_i);
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Check if the two pci devices have virtual P2P link available.
+ * This function is used by the p2pdma to determine virtual
+ * links between the PCI-to-PCI bridges
+ */
+bool sw_disc_check_virtual_link(struct pci_dev *a,
+				struct pci_dev *b)
+{
+	char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR];
+
+	/*
+	 * Check if the PCIe devices support Virtual P2P links
+	 */
+	if (!sw_disc_is_supported_pdev(a))
+		return false;
+
+	if (!sw_disc_is_supported_pdev(b))
+		return false;
+
+	if (brcm_sw_is_p2p_supported(a, serial_num_a) &&
+	    brcm_sw_is_p2p_supported(b, serial_num_b))
+		if (!strcmp(serial_num_a, serial_num_b))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link);
+
+static struct kobj_attribute sw_disc_attribute =
+	__ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL);
+
+// Create attribute group
+static struct attribute *attrs[] = {
+	&sw_disc_attribute.attr,
+	NULL,
+};
+
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+static int __init sw_discovery_init(void)
+{
+	int i, retval;
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
+		sw_kobj[i] = NULL;
+
+	// Create "sw_disc" kobject
+	sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj);
+	if (!sw_disc_kobj) {
+		pr_err("Failed to create sw_disc_kobj\n");
+		return -ENOMEM;
+	}
+
+	retval = sysfs_create_group(sw_disc_kobj, &attr_group);
+	if (retval) {
+		pr_err("Cannot register sysfs attribute group\n");
+		kobject_put(sw_disc_kobj);
+	}
+
+	swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL);
+	if (!swdata) {
+		sysfs_remove_group(sw_disc_kobj, &attr_group);
+		kobject_put(sw_disc_kobj);
+		return 0;
+	}
+
+	pr_info("Loading PCIe switch discovery module, version %s\n",
+		SWITCH_DISC_VERSION);
+
+	return 0;
+}
+
+static void __exit sw_discovery_exit(void)
+{
+	int i;
+
+	if (!swdata)
+		kfree(swdata);
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		if (sw_kobj[i])
+			kobject_put(sw_kobj[i]);
+	}
+
+	// Remove kobject
+	if (sw_link_kobj)
+		kobject_put(sw_link_kobj);
+
+	sysfs_remove_group(sw_disc_kobj, &attr_group);
+	kobject_put(sw_disc_kobj);
+}
+
+module_init(sw_discovery_init);
+module_exit(sw_discovery_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Broadcom Inc.");
+MODULE_VERSION(SWITCH_DISC_VERSION);
+MODULE_DESCRIPTION("PCIe Switch Discovery Module");
diff --git a/drivers/pci/switch/switch_discovery.h b/drivers/pci/switch/switch_discovery.h
new file mode 100644
index 000000000000..b84f5d2e29ac
--- /dev/null
+++ b/drivers/pci/switch/switch_discovery.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Switch Discovery module
+ *
+ * Copyright (c) 2024 Broadcom Inc.
+ *
+ * Authors: Broadcom Inc.
+ *          Sumanesh Samanta <sumanesh.samanta@broadcom.com>
+ *          Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
+ */
+
+#ifndef PCI_SWITCH_DISC_H
+#define PCI_SWITCH_DISC_H
+
+#define SWD_MAX_SWITCH		32
+#define SWD_MAX_VER_PER_SWITCH	8
+
+#define SWD_MAX_VIRT_SWITCH	(SWD_MAX_SWITCH * SWD_MAX_VER_PER_SWITCH)
+#define SWD_MAX_CHAR		16
+#define SWITCH_DISC_VERSION	"0.1.1"
+#define SWD_DIR_STRING		"pci_switch_link"
+#define SWD_LINK_DIR_STRING	"virtual_switch_links"
+#define SWD_SCAN_DONE		"done\n"
+
+#define SWD_FILE_NAME_STRING	refresh_switch_toplogy
+
+/* Broadcom Vendor Specific definitions */
+#define PCI_VENDOR_ID_LSI			0x1000
+#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC	0xC030
+#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC	0xC034
+
+#define P2PMASK(x)		(((x) & 0x300) >> 8)
+#define SECURE_PART(x)		((x) & 0x8)
+#define INTER_SWITCH_LINK	0x2
+
+struct switch_data {
+	int devfn;
+	struct pci_bus *bus;
+	char serial_num[SWD_MAX_CHAR];
+};
+
+bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b);
+
+#endif /* PCI_SWITCH_DISC_H */
--
2.43.0
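
The register checks in brcm_sw_is_p2p_supported() come down to a few
mask operations, which can be tried out in userspace. The sketch below
copies the macros from switch_discovery.h in this patch; the function
`decode_p2p()` is a hypothetical name, and the register values used
with it are made up for illustration, not taken from real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Copied from switch_discovery.h in this patch. */
#define P2PMASK(x)		(((x) & 0x300) >> 8)
#define SECURE_PART(x)		((x) & 0x8)
#define INTER_SWITCH_LINK	0x2
#define SWD_MAX_CHAR		16

/*
 * Mirror of the decision logic only. cap_data1 and cap_data2 stand for the
 * dwords the patch reads at DSN base + 8 and base + 4, vsec_data for the
 * dword at VSEC offset 12. Returns 1 and fills serial[] when the
 * inter-switch link bit is set, 0 otherwise.
 */
static int decode_p2p(uint32_t cap_data1, uint32_t cap_data2,
		      uint32_t vsec_data, char serial[SWD_MAX_CHAR])
{
	assert(serial);

	if (!SECURE_PART(cap_data1))
		return 0;
	if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK))
		return 0;

	/* Same format string as the patch: two hex values back to back. */
	snprintf(serial, SWD_MAX_CHAR, "%x%x",
		 (unsigned int)cap_data1, (unsigned int)cap_data2);
	return 1;
}
```

Two upstream ports would then be treated as belonging to the same
physical switch when the strings produced for them compare equal, which
is what sw_disc_check_virtual_link() does with strcmp().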
* Re: [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges 2024-06-12 11:27 ` [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges Shivasharan S @ 2024-06-13 12:40 ` Jonathan Cameron 2024-06-14 15:36 ` Sumanesh Samanta 2024-06-21 20:53 ` Bjorn Helgaas 1 sibling, 1 reply; 10+ messages in thread From: Jonathan Cameron @ 2024-06-13 12:40 UTC (permalink / raw) To: Shivasharan S Cc: linux-pci, bhelgaas, linux-kernel, sumanesh.samanta, sathya.prakash On Wed, 12 Jun 2024 04:27:35 -0700 Shivasharan S <shivasharan.srikanteshwara@broadcom.com> wrote: > This kernel module discovers the virtual inter-switch P2P links present > between two PCI-to-PCI bridges that allows an optimal data path for data > movement. The module creates sysfs entries for upstream PCI-to-PCI > bridges which supports the inter switch P2P links as below: > > Host root bridge > --------------------------------------- > | | > NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > (af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > | | > GPU1 GPU2 > (b0:00.0) (8e:00.0) > SERVER 1 > > /sys/kernel/pci_switch_link/virtual_switch_links > ├── 0000:8b:00.0 > │ └── 0000:ad:00.0 -> ../0000:ad:00.0 > └── 0000:ad:00.0 > └── 0000:8b:00.0 -> ../0000:8b:00.0 It think the functionality is useful in general. Not sure that's an appropriate location though. I'd rather see something in in each of the USP devices that references to others they share with. I also don't think we actually care if these are virtual or real inter switch links (which are hidden from the host anyway I think?) The discovery means might be different in those case (large 'switch' made up of multiple connected smaller switches). We may want a way to discover the bandwidth though. 
Needs a formal sysfs doc under Documentation/ABI/testing/sysfs* (not totally sure where for this interface, but I think that location will change anyway) The comments below are mostly superficial. I need to think a bit more on how this might fit better with the linux driver model as I really don't like magic things that cross many devices. > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > --- > .../driver-api/pci/switch_discovery.rst | 52 +++ > MAINTAINERS | 13 + > drivers/pci/switch/Kconfig | 9 + > drivers/pci/switch/Makefile | 1 + > drivers/pci/switch/switch_discovery.c | 375 ++++++++++++++++++ > drivers/pci/switch/switch_discovery.h | 44 ++ > 6 files changed, 494 insertions(+) > create mode 100644 Documentation/driver-api/pci/switch_discovery.rst > create mode 100644 drivers/pci/switch/switch_discovery.c > create mode 100644 drivers/pci/switch/switch_discovery.h > > diff --git a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst > new file mode 100644 > index 000000000000..7c1476260e5e > --- /dev/null > +++ b/Documentation/driver-api/pci/switch_discovery.rst > @@ -0,0 +1,52 @@ > +================================= > +Linux PCI Switch discovery module > +================================= > + > +Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer > +without using host resources. For example, Broadcom(PLX) PCIe Switches have a > +capability where a single physical switch can be divided up into multiple > +virtual switches at SOD. PCIe switch discovery module detects the virtual links > +between the switches that belong to the same physical switch. > +This allows user space applications to discover these virtual links that belong > +to the same physical switch and configure optimized data paths. 
> + > +Userspace Interface > +=================== > + > +The module exposes sysfs entries for user space applications like MPI, NCCL, > +UCC, RCCL, HCCL, etc to discover the virtual switch links. > + > +Consider the below topology > + > + Host root bridge > + --------------------------------------- > + | | > + NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > +(af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > + | | > + GPU1 GPU2 > + (b0:00.0) (8e:00.0) > + SERVER 1 > + > +The simple topology above shows SERVER1, has Switch1 and Switch2 which are > +virtual switches that belong to the same physical switch that support > +Inter switch P2P. > +Switch1 and Switch2 have a GPU and NIC each connected. > +The module will detect the virtual P2P link existing between the two switches > +and create the sysfs entries as below. > + > +/sys/kernel/pci_switch_link/virtual_switch_links > +├── 0000:8b:00.0 > +│ └── 0000:ad:00.0 -> ../0000:ad:00.0 > +└── 0000:ad:00.0 > + └── 0000:8b:00.0 -> ../0000:8b:00.0 > + > +The HPC/AI libraries that analyze the topology can decide the optimal data > +path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a > +non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1. > + > +Enable P2P DMA to discover virtual links > +---------------------------------------- > +The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual > +link between the upstream PCI-to-PCI bridges of the devices and detect optimal > +path for applications using P2P DMA API. 
> diff --git a/MAINTAINERS b/MAINTAINERS > index 823387766a0c..b1bf3533ea6f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -17359,6 +17359,19 @@ F: Documentation/driver-api/pci/p2pdma.rst > F: drivers/pci/p2pdma.c > F: include/linux/pci-p2pdma.h > > +PCI SWITCH DISCOVERY > +M: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > +M: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > +L: linux-pci@vger.kernel.org > +S: Maintained > +Q: https://patchwork.kernel.org/project/linux-pci/list/ > +B: https://bugzilla.kernel.org > +C: irc://irc.oftc.net/linux-pci > +T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git > +F: Documentation/driver-api/pci/switch_discovery.rst > +F: drivers/pci/switch/switch_discovery.c > +F: drivers/pci/switch/switch_discovery.h > + > PCI SUBSYSTEM > M: Bjorn Helgaas <bhelgaas@google.com> > L: linux-pci@vger.kernel.org > diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig > index d370f4ce0492..fb4410153950 100644 > --- a/drivers/pci/switch/Kconfig > +++ b/drivers/pci/switch/Kconfig > @@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC > devices. See <file:Documentation/driver-api/switchtec.rst> for more > information. > > +config PCI_SW_DISCOVERY > + depends on PCI > + tristate "PCI Switch discovery module" > + help > + This kernel module discovers the PCI-to-PCI bridges of PCIe switches > + and forms the virtual switch links if the bridges belong to the same > + Physical switch. The switch links help to identify shorter distances > + for P2P configurations. 
> +
>  endmenu
> diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
> index acd56d3b4a35..a3584b5146af 100644
> --- a/drivers/pci/switch/Makefile
> +++ b/drivers/pci/switch/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
> +obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o
> diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c
> new file mode 100644
> index 000000000000..a427d3885b1f
> --- /dev/null
> +++ b/drivers/pci/switch/switch_discovery.c
> @@ -0,0 +1,375 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Switch Discovery module
> + *
> + * Copyright (c) 2024 Broadcom Inc.
> + *
> + * Authors: Broadcom Inc.
> + *          Sumanesh Samanta <sumanesh.samanta@broadcom.com>
> + *          Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/sysfs.h>
> +#include <linux/slab.h>
> +#include <linux/rwsem.h>
> +#include <linux/pci.h>
> +#include <linux/vmalloc.h>

Pick an ordering scheme for headers.  Can't remember which PCI uses
but alphabetical is always a good starting point unless there is
a local standard.

> +#include "switch_discovery.h"
> +
> +static DECLARE_RWSEM(sw_disc_rwlock);
> +static struct kobject *sw_disc_kobj, *sw_link_kobj;
> +static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH];

Why can't this be dynamically sized? Use a list.

> +static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH);
> +
> +static struct switch_data *swdata;
> +
> +static int sw_disc_probe(void);
> +static int sw_disc_create_sysfs_files(void);
> +static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num);

Can you reorder the code to avoid the need for these forward declarations?
> +
> +static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev)
> +{
> +	if ((pdev->vendor == PCI_VENDOR_ID_LSI) &&
> +	    ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) ||
> +	     (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC)))
> +		return true;
> +
> +	return false;
> +}
> +
> +static ssize_t sw_disc_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    char *buf)
> +{
> +	int retval;
> +
> +	down_write(&sw_disc_rwlock);
> +	retval = sw_disc_probe();
> +	if (!retval) {
> +		pr_debug("No new switch found\n");
> +		goto exit_success;
> +	}
> +
> +	retval = sw_disc_create_sysfs_files();
> +	if (retval < 0) {
> +		pr_err("Failed to create the sysfs entries, retval %d\n",
> +		       retval);
> +	}
> +
> +exit_success:
> +	up_write(&sw_disc_rwlock);
> +	return sysfs_emit(buf, SWD_SCAN_DONE);

Don't have side effects on a read.  Write 1 to the file to scan and when
it is done, return len;

> +}
> +
> +/* This function probes the PCIe devices for virtual links */

I'm not sure if a bus walk and search is the right way to do this.

I need to think on this more, but options that occur are:
1) Do it in the PCI core (so without a driver binding).
   /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0
   Controversial perhaps because PCIe provides no 'standard' way to discover
   this but if it is slim enough, maybe?

2) Do it in portdrv as that will currently bind to the USP anyway.

There are other discussions ongoing on refactoring the pcie portdrv
and this usecase might well fit in there. Doesn't seem very invasive
to add this.

> +static int sw_disc_probe(void)
> +{
> +	int i, bit;
> +	struct pci_dev *pdev = NULL;
> +	int topology_changed = 0;
> +	DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH);

As above, I'd use a list of found virtual switches then removal
is dropping an entry from middle of that list.
Probe finds what is there and moves things to a new temporary list.
Delete anything left on the old list.
> +
> +	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) {

Not using the port class code?  Feels like every switch will if this isn't
in a different function? (I've been assuming it is vsec on the USP function)

> +		int sw_found;
> +
> +		/* Currently this function only traverses Broadcom
> +		 * PEX switches and determines the virtual SW links.
> +		 * Other Switch vendors can add their specific logic
> +		 * determine the virtual links.
> +		 */

I'd move this comment to the supported query.  As you observe, it is
general in principle.

> +		if (!sw_disc_is_supported_pdev(pdev))

It's not really about discovering switches. So I'd call it
sw_might_be_virtual_switch() or something like that.

I'm sure we'll eventually have to handle multiple physical switches
with a real interswitchlink at some point, but that can be addressed
separately.

> +			continue;
> +
> +		sw_found = -1;

int sw_found = -1; above

> +
> +		for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) {
> +			if (swdata[bit].devfn == pdev->devfn &&
> +			    swdata[bit].bus == pdev->bus) {

Can we use an xarray or similar to do this lookup?
> +				sw_found = bit;
> +				set_bit(sw_found, sw_found_map);
> +				break;
> +			}
> +		}
> +
> +		if (sw_found != -1)
> +			continue;
> +
> +		for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
> +			if (!swdata[i].bus)
> +				break;
> +
> +		if (i >= SWD_MAX_VIRT_SWITCH) {
> +			pr_err("Max switch exceeded\n");
> +			break;
> +		}
> +
> +		sw_found = i;
> +
> +		if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num))
> +			continue;
> +
> +		/* Found a new switch which supports P2P */
> +		swdata[sw_found].devfn = pdev->devfn;
> +		swdata[sw_found].bus = pdev->bus;
> +
> +		topology_changed = 1;
> +		set_bit(sw_found, sw_found_map);
> +		set_bit(sw_found, swdata_valid);
> +	}
> +
> +	/* handle device removal */
> +	for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) {
> +		if (test_bit(bit, swdata_valid)) {
> +			memset(&swdata[bit], 0, sizeof(swdata[i]));
> +			clear_bit(bit, swdata_valid);
> +			topology_changed = 1;
> +		}
> +	}
> +
> +	return topology_changed;
> +}
> +
> +/* Check the various config space registers of the Broadcom PCI device and
> + * return true if the device supports inter switch P2P.
> + * If P2P is supported, return the device serial number back to
> + * caller.
> + */
> +bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num)
> +{
> +	int base;
> +	u32 cap_data1, cap_data2;
> +	u16 vsec;
> +	u32 vsec_data;
> +
> +	base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN);
> +	if (!base) {
> +		pr_debug("Failed to get extended capability bus %x devfn %x\n",
> +			 pdev->bus->number, pdev->devfn);
> +		return false;
> +	}

I'd just call pci_get_dsn(). If it doesn't return 0 the cap is there
and we get the value and just use it.

> +
> +	vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1);
> +	if (!vsec) {
> +		pr_debug("Failed to get VSEC bus %x devfn %x\n",
> +			 pdev->bus->number, pdev->devfn);
> +		return false;
> +	}
> +
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM)
> +		return false;

I'd do this first.
Will apply to a lot of matches and this is much cheaper than finding
capabilities.

> +
> +	pci_read_config_dword(pdev, base + 8, &cap_data1);
> +	pci_read_config_dword(pdev, base + 4, &cap_data2);
> +
> +	pci_read_config_dword(pdev, vsec + 12, &vsec_data);

Use a define for that vsec offset that gives some indication
of its purpose in the LSI VSEC.

> +
> +	pr_debug("Found Broadcom device bus 0x%x devfn 0x%x "
> +		 "Serial Number: 0x%x 0x%x, VSEC 0x%x\n",
> +		 pdev->bus->number, pdev->devfn,
> +		 cap_data1, cap_data2, vsec_data);
> +
> +	if (!SECURE_PART(cap_data1))
> +		return false;

FIELD_GET()

> +
> +	if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK))

FIELD_GET() for the relevant bits in each.

> +		return false;
> +
> +	if (serial_num)
> +		snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2);

Just use the u64.

> +
> +	return true;
> +}
> +
> +static int sw_disc_create_sysfs_files(void)
> +{
> +	int i, j, retval;
> +
> +	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
> +		if (sw_kobj[i]) {
> +			kobject_put(sw_kobj[i]);
> +			sw_kobj[i] = NULL;

If you are freeing kobjects in a creation path something went wrong.
Don't do this - if it makes sense free them before calling this create function.

> +		}
> +	}
> +
> +	if (sw_link_kobj) {
> +		kobject_put(sw_link_kobj);
> +		sw_link_kobj = NULL;
> +	}
> +
> +	sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj);

Don't use defines for file names.  We want to see them inline as
much more readable!

> +	if (!sw_link_kobj) {
> +		pr_err("Failed to create pci link object\n");
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
> +		int segment, bus, device, function;
> +		char bdf_i[SWD_MAX_CHAR];

No obvious reason why this is the same length as serial numbers?
Use an appropriate define for each.  We print the bdf in
various places, maybe there is already a suitable define and if
not perhaps worth adding one.
> +
> +		if (!test_bit(i, swdata_valid))
> +			continue;
> +
> +		segment = pci_domain_nr(swdata[i].bus);
> +		bus = swdata[i].bus->number;
> +		device = pci_ari_enabled(swdata[i].bus) ?
> +				0 : PCI_SLOT(swdata[i].devfn);
> +		function = pci_ari_enabled(swdata[i].bus) ?
> +				swdata[i].devfn : PCI_FUNC(swdata[i].devfn);
> +		sprintf(bdf_i, "%04x:%02x:%02x.%x",
> +			segment, bus, device, function);
> +
> +		for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) {
> +			char bdf_j[SWD_MAX_CHAR];
> +
> +			if (!test_bit(j, swdata_valid))
> +				continue;
> +			segment = pci_domain_nr(swdata[j].bus);
> +			bus = swdata[j].bus->number;
> +			device = pci_ari_enabled(swdata[j].bus) ?
> +					0 : PCI_SLOT(swdata[j].devfn);
> +			function = pci_ari_enabled(swdata[j].bus) ?
> +					swdata[j].devfn : PCI_FUNC(swdata[j].devfn);
> +			sprintf(bdf_j, "%04x:%02x:%02x.%x",
> +				segment, bus, device, function);
> +
> +			if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) {
> +				if (!sw_kobj[i]) {
> +					sw_kobj[i] = kobject_create_and_add(bdf_i,
> +									    sw_link_kobj);
> +					if (!sw_kobj[i]) {
> +						pr_err("Failed to create sysfs entry for switch %s\n",
> +						       bdf_i);
> +					}
> +				}
> +
> +				if (!sw_kobj[j]) {
> +					sw_kobj[j] = kobject_create_and_add(bdf_j,
> +									    sw_link_kobj);
> +					if (!sw_kobj[j]) {
> +						pr_err("Failed to create sysfs entry for switch %s\n",
> +						       bdf_j);
> +					}
> +				}
> +
> +				retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j);
> +				if (retval)
> +					pr_err("Error creating symlink %s and %s\n",
> +					       bdf_i, bdf_j);
> +
> +				retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i);
> +				if (retval)
> +					pr_err("Error creating symlink %s and %s\n",
> +					       bdf_j, bdf_i);
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Check if the two pci devices have virtual P2P link available.
> + * This function is used by the p2pdma to determine virtual
> + * links between the PCI-to-PCI bridges
> + */
> +bool sw_disc_check_virtual_link(struct pci_dev *a,
> +				struct pci_dev *b)

No need to wrap line.
> +{
> +	char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR];
> +
> +	/*
> +	 * Check if the PCIe devices support Virtual P2P links
> +	 */

Single line comment
	/* Check if the PCIe devices support Virtual P2P links */

> +	if (!sw_disc_is_supported_pdev(a))
> +		return false;
> +
> +	if (!sw_disc_is_supported_pdev(b))
> +		return false;
> +
> +	if (brcm_sw_is_p2p_supported(a, serial_num_a) &&
> +	    brcm_sw_is_p2p_supported(b, serial_num_b))
> +		if (!strcmp(serial_num_a, serial_num_b))
> +			return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link);
> +
> +static struct kobj_attribute sw_disc_attribute =
> +	__ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL);

As below. Use string directly for file names, don't hide it behind a define.

> +
> +// Create attribute group

Drop comment + if it was here /* */

> +static struct attribute *attrs[] = {
> +	&sw_disc_attribute.attr,
> +	NULL,

No comma on NULL terminators as we won't add anything after them.

> +};
> +
> +static struct attribute_group attr_group = {
> +	.attrs = attrs,
> +};
> +
> +static int __init sw_discovery_init(void)
> +{
> +	int i, retval;
> +
> +	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
> +		sw_kobj[i] = NULL;
> +
> +	// Create "sw_disc" kobject

Drop any 'obvious' comments.

> +	sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj);
> +	if (!sw_disc_kobj) {
> +		pr_err("Failed to create sw_disc_kobj\n");
> +		return -ENOMEM;
> +	}
> +
> +	retval = sysfs_create_group(sw_disc_kobj, &attr_group);
> +	if (retval) {
> +		pr_err("Cannot register sysfs attribute group\n");
> +		kobject_put(sw_disc_kobj);

return an error.

> +	}
> +
> +	swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL);
> +	if (!swdata) {
> +		sysfs_remove_group(sw_disc_kobj, &attr_group);
> +		kobject_put(sw_disc_kobj);

return an error.
> +		return 0;
> +	}
> +
> +	pr_info("Loading PCIe switch discovery module, version %s\n",
> +		SWITCH_DISC_VERSION);
> +
> +	return 0;
> +}
> +
> +static void __exit sw_discovery_exit(void)
> +{
> +	int i;
> +
> +	if (!swdata)

I'm fairly sure that if you return an error in failure above (which
shouldn't fail anyway) you won't need this protection as for exit() to
be called init() must have succeeded and the data must have been
allocated.

> +		kfree(swdata);
> +
> +	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
> +		if (sw_kobj[i])
> +			kobject_put(sw_kobj[i]);
> +	}
> +
> +	// Remove kobject

/* Remove kobject */
but that's pretty obvious anyway so better to just drop the comment.

> +	if (sw_link_kobj)
> +		kobject_put(sw_link_kobj);
> +
> +	sysfs_remove_group(sw_disc_kobj, &attr_group);
> +	kobject_put(sw_disc_kobj);
> +}
> +
> +module_init(sw_discovery_init);
> +module_exit(sw_discovery_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Broadcom Inc.");
> +MODULE_VERSION(SWITCH_DISC_VERSION);
> +MODULE_DESCRIPTION("PCIe Switch Discovery Module");
> diff --git a/drivers/pci/switch/switch_discovery.h b/drivers/pci/switch/switch_discovery.h
> new file mode 100644
> index 000000000000..b84f5d2e29ac
> --- /dev/null
> +++ b/drivers/pci/switch/switch_discovery.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI Switch Discovery module
> + *
> + * Copyright (c) 2024 Broadcom Inc.
> + *
> + * Authors: Broadcom Inc.
> + *          Sumanesh Samanta <sumanesh.samanta@broadcom.com>
> + *          Shivasharan S <shivasharan.srikanteshwara@broadcom.com>

Why is the header needed?  Only seems to be used from one c file.
Move everything down there and drop this file.

> + */
> +
> +#ifndef PCI_SWITCH_DISC_H
> +#define PCI_SWITCH_DISC_H
> +
> +#define SWD_MAX_SWITCH		32
> +#define SWD_MAX_VER_PER_SWITCH	8
> +
> +#define SWD_MAX_VIRT_SWITCH	(SWD_MAX_SWITCH * SWD_MAX_VER_PER_SWITCH)
> +#define SWD_MAX_CHAR		16

Name this so it's clearer what it is sizing.
> +#define SWITCH_DISC_VERSION	"0.1.1"

Whilst there are module versions in the kernel etc, they are meaningless
as we must support backwards compatibility anyway. So don't give it a
version (this is basically ancient legacy no one uses any more)

> +#define SWD_DIR_STRING		"pci_switch_link"

All these better inline. Defines just make your code harder to read.

> +#define SWD_LINK_DIR_STRING	"virtual_switch_links"
> +#define SWD_SCAN_DONE		"done\n"

Definitely inline!

> +
> +#define SWD_FILE_NAME_STRING	refresh_switch_toplogy

Just use the string directly inline. This doesn't belong in a header.

> +
> +/* Broadcom Vendor Specific definitions */
> +#define PCI_VENDOR_ID_LSI			0x1000
> +#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC	0xC030
> +#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC	0xC034
> +
> +#define P2PMASK(x)		(((x) & 0x300) >> 8)

Use FIELD_GET() on the mask alone and make sure it's clear from naming
what register this applies to.

> +#define SECURE_PART(x)		((x) & 0x8)
> +#define INTER_SWITCH_LINK	0x2

Give this a name that matches with a register name or similar.

> +
> +struct switch_data {

More specific name needed as this will clash with something at some point
in the future

> +	int  devfn;

extra space before devfn.

> +	struct pci_bus *bus;
> +	char serial_num[SWD_MAX_CHAR];
> +};
> +
> +bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b);
> +
> +#endif /* PCI_SWITCH_DISC_H */

^ permalink raw reply [flat|nested] 10+ messages in thread
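
The FIELD_GET() and "just use the u64" comments in the review above can be
illustrated with a minimal userspace sketch. It is not taken from the patch
or from the review: the SKETCH_* macros are local stand-ins for the kernel's
GENMASK() and FIELD_GET(), the BRCM_* names and both helper functions are
hypothetical, and only the bit positions (0x300, 0x2, 0x8) come from the
P2PMASK(), INTER_SWITCH_LINK and SECURE_PART() definitions in the patch. It
also assumes the 64-bit serial number is laid out the way pci_get_dsn()
returns it, with the dword at offset 8 in the upper half.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace stand-ins for the kernel's GENMASK() and FIELD_GET().
 * A real driver would use the kernel macros instead of these.
 */
#define SKETCH_GENMASK(h, l)        ((~0u << (l)) & (~0u >> (31 - (h))))
#define SKETCH_FIELD_GET(mask, reg) (((reg) & (mask)) / ((mask) & -(mask)))

/* Hypothetical names; bit positions mirror the macros in the patch. */
#define BRCM_VSEC_P2P_MASK     SKETCH_GENMASK(9, 8) /* P2PMASK(): (x & 0x300) >> 8 */
#define BRCM_VSEC_P2P_INTER_SW 0x2u                 /* INTER_SWITCH_LINK */
#define BRCM_DSN_SECURE_PART   SKETCH_GENMASK(3, 3) /* SECURE_PART(): x & 0x8 */

/*
 * Decide whether inter-switch P2P is advertised, given the 64-bit serial
 * number and the VSEC dword that the patch reads at vsec + 12.
 */
static bool sketch_p2p_supported(uint64_t dsn, uint32_t vsec_data)
{
	uint32_t dsn_hi = (uint32_t)(dsn >> 32); /* cap_data1 in the patch */

	if (!SKETCH_FIELD_GET(BRCM_DSN_SECURE_PART, dsn_hi))
		return false;

	return (SKETCH_FIELD_GET(BRCM_VSEC_P2P_MASK, vsec_data) &
		BRCM_VSEC_P2P_INTER_SW) != 0;
}

/*
 * Compare serial numbers as integers rather than formatted strings.
 * Zero is treated as "no serial number", matching pci_get_dsn() returning 0
 * when the capability is absent.
 */
static bool sketch_same_physical_switch(uint64_t dsn_a, uint64_t dsn_b)
{
	return dsn_a != 0 && dsn_a == dsn_b;
}
```

Keeping the mask as a single named constant, and the serial number as one
integer, removes both the string buffer and the SWD_MAX_CHAR sizing question
raised earlier in the review.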
* Re: [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges
2024-06-13 12:40 ` Jonathan Cameron
@ 2024-06-14 15:36 ` Sumanesh Samanta
2024-06-23 14:44 ` Manivannan Sadhasivam
0 siblings, 1 reply; 10+ messages in thread
From: Sumanesh Samanta @ 2024-06-14 15:36 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Shivasharan S, linux-pci, bhelgaas, linux-kernel, sathya.prakash
[-- Attachment #1: Type: text/plain, Size: 32312 bytes --]
Hi Jonathan and others,

Thanks for the feedback.

> It think the functionality is useful in general.
>
> Not sure that's an appropriate location though.  I'd rather
> see something in in each of the USP devices that references
> to others they share with.

Yes, we did think of using /sys/bus/pci/devices/, but that would of
course mean changing the PCIe driver (as you noted later), and we were
not sure whether that would be acceptable to the community. There is
also a race condition during discovery; more on that below.

> I also don't think we actually care
> if these are virtual or real inter switch links (which are hidden
> from the host anyway I think?)

Yes, correct. The discovery of whether the switches are connected will
be different for a Virtual Switch and a physical Inter Switch Link, but
the sysfs entries will be identical, and applications/consumers need not
care.

> We may want a way to discover the bandwidth though.

Agreed, that would be a good enhancement, but it will need more
vendor-specific discovery, I think. It is very useful, I agree, and we
can take it up later once this set of patches is approved.

> I'm not sure if a bus walk and search is the right way to do this.
>
> I need to think on this more, but options that occur are:
> 1) Do it in the PCI core (so without a driver binding).
>    /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0
>    Controversial perhaps because PCIe provides no 'standard' way to discover
>    this but it it is slim enough, maybe?
>
> 2) Do it in portdrv as that will currently bind to the USP anyway.
>

One issue with using the PCI core and/or portdrv is that when those
drivers load, they discover devices one by one. When a given device
tries to find its connected peers, it may not be able to do so, because
its peers might not have been discovered yet. We solved that problem by
having a "refresh_link" sysfs entry; when that entry is triggered, we
rediscover all the P2P switch links.

If we move the code to the PCI core/portdrv, then we shall probably need
a "refresh" entry under each device node, and when that is triggered, we
shall need to discover the links for that device. But we shall still
need to do the bus walk, as we shall still need to find the peer devices
and check whether they are connected to the given device.

So, basically, what I am saying is something like this:

/sys/bus/pci/devices/0000:0c:00.0/isl/refresh_link
/sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0

Will that be acceptable? If so, we shall incorporate that change in the
next patch. We shall also incorporate the rest of your feedback.

sincerely,
Sumanesh

On Thu, Jun 13, 2024 at 6:40 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 12 Jun 2024 04:27:35 -0700
> Shivasharan S <shivasharan.srikanteshwara@broadcom.com> wrote:
>
> > This kernel module discovers the virtual inter-switch P2P links present
> > between two PCI-to-PCI bridges that allows an optimal data path for data
> > movement.
The module creates sysfs entries for upstream PCI-to-PCI > > bridges which supports the inter switch P2P links as below: > > > > Host root bridge > > --------------------------------------- > > | | > > NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > > (af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > > | | > > GPU1 GPU2 > > (b0:00.0) (8e:00.0) > > SERVER 1 > > > > /sys/kernel/pci_switch_link/virtual_switch_links > > ├── 0000:8b:00.0 > > │ └── 0000:ad:00.0 -> ../0000:ad:00.0 > > └── 0000:ad:00.0 > > └── 0000:8b:00.0 -> ../0000:8b:00.0 > > It think the functionality is useful in general. > > Not sure that's an appropriate location though. I'd rather > see something in in each of the USP devices that references > to others they share with. I also don't think we actually care > if these are virtual or real inter switch links (which are hidden > from the host anyway I think?) The discovery means might be different > in those case (large 'switch' made up of multiple connected smaller > switches). We may want a way to discover the bandwidth though. > > > Needs a formal sysfs doc under > Documentation/ABI/testing/sysfs* (not totally sure where for this > interface, but I think that location will change anyway) > > The comments below are mostly superficial. I need to think a bit > more on how this might fit better with the linux driver model > as I really don't like magic things that cross many devices. 
> > > > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > --- > > .../driver-api/pci/switch_discovery.rst | 52 +++ > > MAINTAINERS | 13 + > > drivers/pci/switch/Kconfig | 9 + > > drivers/pci/switch/Makefile | 1 + > > drivers/pci/switch/switch_discovery.c | 375 ++++++++++++++++++ > > drivers/pci/switch/switch_discovery.h | 44 ++ > > 6 files changed, 494 insertions(+) > > create mode 100644 Documentation/driver-api/pci/switch_discovery.rst > > create mode 100644 drivers/pci/switch/switch_discovery.c > > create mode 100644 drivers/pci/switch/switch_discovery.h > > > > diff --git a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst > > new file mode 100644 > > index 000000000000..7c1476260e5e > > --- /dev/null > > +++ b/Documentation/driver-api/pci/switch_discovery.rst > > @@ -0,0 +1,52 @@ > > +================================= > > +Linux PCI Switch discovery module > > +================================= > > + > > +Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer > > +without using host resources. For example, Broadcom(PLX) PCIe Switches have a > > +capability where a single physical switch can be divided up into multiple > > +virtual switches at SOD. PCIe switch discovery module detects the virtual links > > +between the switches that belong to the same physical switch. > > +This allows user space applications to discover these virtual links that belong > > +to the same physical switch and configure optimized data paths. > > + > > +Userspace Interface > > +=================== > > + > > +The module exposes sysfs entries for user space applications like MPI, NCCL, > > +UCC, RCCL, HCCL, etc to discover the virtual switch links. 
> > + > > +Consider the below topology > > + > > + Host root bridge > > + --------------------------------------- > > + | | > > + NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > > +(af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > > + | | > > + GPU1 GPU2 > > + (b0:00.0) (8e:00.0) > > + SERVER 1 > > + > > +The simple topology above shows SERVER1, has Switch1 and Switch2 which are > > +virtual switches that belong to the same physical switch that support > > +Inter switch P2P. > > +Switch1 and Switch2 have a GPU and NIC each connected. > > +The module will detect the virtual P2P link existing between the two switches > > +and create the sysfs entries as below. > > + > > +/sys/kernel/pci_switch_link/virtual_switch_links > > +├── 0000:8b:00.0 > > +│ └── 0000:ad:00.0 -> ../0000:ad:00.0 > > +└── 0000:ad:00.0 > > + └── 0000:8b:00.0 -> ../0000:8b:00.0 > > + > > +The HPC/AI libraries that analyze the topology can decide the optimal data > > +path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a > > +non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1. > > + > > +Enable P2P DMA to discover virtual links > > +---------------------------------------- > > +The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual > > +link between the upstream PCI-to-PCI bridges of the devices and detect optimal > > +path for applications using P2P DMA API. 
> > diff --git a/MAINTAINERS b/MAINTAINERS > > index 823387766a0c..b1bf3533ea6f 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -17359,6 +17359,19 @@ F: Documentation/driver-api/pci/p2pdma.rst > > F: drivers/pci/p2pdma.c > > F: include/linux/pci-p2pdma.h > > > > +PCI SWITCH DISCOVERY > > +M: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > +M: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > +L: linux-pci@vger.kernel.org > > +S: Maintained > > +Q: https://patchwork.kernel.org/project/linux-pci/list/ > > +B: https://bugzilla.kernel.org > > +C: irc://irc.oftc.net/linux-pci > > +T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git > > +F: Documentation/driver-api/pci/switch_discovery.rst > > +F: drivers/pci/switch/switch_discovery.c > > +F: drivers/pci/switch/switch_discovery.h > > + > > PCI SUBSYSTEM > > M: Bjorn Helgaas <bhelgaas@google.com> > > L: linux-pci@vger.kernel.org > > diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig > > index d370f4ce0492..fb4410153950 100644 > > --- a/drivers/pci/switch/Kconfig > > +++ b/drivers/pci/switch/Kconfig > > @@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC > > devices. See <file:Documentation/driver-api/switchtec.rst> for more > > information. > > > > +config PCI_SW_DISCOVERY > > + depends on PCI > > + tristate "PCI Switch discovery module" > > + help > > + This kernel module discovers the PCI-to-PCI bridges of PCIe switches > > + and forms the virtual switch links if the bridges belong to the same > > + Physical switch. The switch links help to identify shorter distances > > + for P2P configurations. 
> > + > > endmenu > > diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile > > index acd56d3b4a35..a3584b5146af 100644 > > --- a/drivers/pci/switch/Makefile > > +++ b/drivers/pci/switch/Makefile > > @@ -1,2 +1,3 @@ > > # SPDX-License-Identifier: GPL-2.0 > > obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o > > +obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o > > diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c > > new file mode 100644 > > index 000000000000..a427d3885b1f > > --- /dev/null > > +++ b/drivers/pci/switch/switch_discovery.c > > @@ -0,0 +1,375 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * PCI Switch Discovery module > > + * > > + * Copyright (c) 2024 Broadcom Inc. > > + * > > + * Authors: Broadcom Inc. > > + * Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > + * Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > + */ > > + > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > + > > +#include <linux/init.h> > > +#include <linux/kernel.h> > > +#include <linux/module.h> > > +#include <linux/sysfs.h> > > +#include <linux/slab.h> > > +#include <linux/rwsem.h> > > +#include <linux/pci.h> > > +#include <linux/vmalloc.h> > > Pick an ordering scheme for headers. Can't remember which PCI uses > but alphabetical is always a good starting point unless there is > a local standard. > > > +#include "switch_discovery.h" > > + > > +static DECLARE_RWSEM(sw_disc_rwlock); > > +static struct kobject *sw_disc_kobj, *sw_link_kobj; > > +static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH]; > Why can't this be dynamically sized? Use a list. 
> > > +static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH); > > + > > +static struct switch_data *swdata; > > + > > +static int sw_disc_probe(void); > > +static int sw_disc_create_sysfs_files(void); > > +static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num); > > Can you reorder the code to avoid the need for these forwards definitions? > > > + > > +static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev) > > +{ > > + if ((pdev->vendor == PCI_VENDOR_ID_LSI) && > > + ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) || > > + (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC))) > > + return true; > > + > > + return false; > > +} > > + > > +static ssize_t sw_disc_show(struct kobject *kobj, > > + struct kobj_attribute *attr, > > + char *buf) > > +{ > > + int retval; > > + > > + down_write(&sw_disc_rwlock); > > + retval = sw_disc_probe(); > > + if (!retval) { > > + pr_debug("No new switch found\n"); > > + goto exit_success; > > + } > > + > > + retval = sw_disc_create_sysfs_files(); > > + if (retval < 0) { > > + pr_err("Failed to create the sysfs entries, retval %d\n", > > + retval); > > + } > > + > > +exit_success: > > + up_write(&sw_disc_rwlock); > > + return sysfs_emit(buf, SWD_SCAN_DONE); > Don't have side effects on a read. Write 1 to the file to scan and when > it is done, return len; > > > +} > > + > > +/* This function probes the PCIe devices for virtual links */ > > I'm not sure if a bus walk and search is the right way to do this. > > I need to think on this more, but options that occur are: > 1) Do it in the PCI core (so without a driver binding). > /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0 > Controversial perhaps because PCIe provides no 'standard' way to discover > this but it it is slim enough, maybe? > > 2) Do it in portdrv as that will currently bind to the USP anyway. > > There are other discussions on going on refactoring the pcie portdrv > and this usecase might well fit in there. 
Doesn't seem very invasive > to add this. > > > +static int sw_disc_probe(void) > > +{ > > + int i, bit; > > + struct pci_dev *pdev = NULL; > > + int topology_changed = 0; > > + DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH); > > As above, I'd use a list of found virtual switches then removal > is dropping an entry from middle of that list. > Probe finds what is there and moves things to a new temporary list. > Delete anything left on the old list. > > > + > > + while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) { > > Not using the port class code? Feels like every switch will if this isn't > in a different function? (I've been assuming it is vsec on the USP function) > > > + int sw_found; > > + > > + /* Currently this function only traverses Broadcom > > + * PEX switches and determines the virtual SW links. > > + * Other Switch vendors can add their specific logic > > + * determine the virtual links. > > + */ > > I'd move this comment to the supported query. As you observe, it is > general in principle. > > > + if (!sw_disc_is_supported_pdev(pdev)) > > It's not really about discovering switches. So I'd call it > sw_might_be_virtual_switch() or something like that. > > I'm sure we'll eventually have to handle multiple physical switches > with a real interswitchlink at some point, but that can be addressed > separately. > > > > + continue; > > + > > + sw_found = -1; > > int sw_found = -1; above > > > + > > + for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) { > > + if (swdata[bit].devfn == pdev->devfn && > > + swdata[bit].bus == pdev->bus) { > > Can we use an xarray or similar to do this lookup? 
> > > + sw_found = bit; > > + set_bit(sw_found, sw_found_map); > > + break; > > + } > > + } > > + > > + if (sw_found != -1) > > + continue; > > + > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > > + if (!swdata[i].bus) > > + break; > > + > > + if (i >= SWD_MAX_VIRT_SWITCH) { > > + pr_err("Max switch exceeded\n"); > > + break; > > + } > > + > > + sw_found = i; > > + > > + if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num)) > > + continue; > > + > > + /* Found a new switch which supports P2P */ > > + swdata[sw_found].devfn = pdev->devfn; > > + swdata[sw_found].bus = pdev->bus; > > + > > + topology_changed = 1; > > + set_bit(sw_found, sw_found_map); > > + set_bit(sw_found, swdata_valid); > > + } > > + > > + /* handle device removal */ > > + for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) { > > + if (test_bit(bit, swdata_valid)) { > > + memset(&swdata[bit], 0, sizeof(swdata[i])); > > + clear_bit(bit, swdata_valid); > > + topology_changed = 1; > > + } > > + } > > + > > + return topology_changed; > > +} > > + > > +/* Check the various config space registers of the Broadcom PCI device and > > + * return true if the device supports inter switch P2P. > > + * If P2P is supported, return the device serial number back to > > + * caller. > > + */ > > +bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num) > > +{ > > + int base; > > + u32 cap_data1, cap_data2; > > + u16 vsec; > > + u32 vsec_data; > > + > > + base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN); > > + if (!base) { > > + pr_debug("Failed to get extended capability bus %x devfn %x\n", > > + pdev->bus->number, pdev->devfn); > > + return false; > > + } > > I'd just call pci_get_dsn() If it doesn't return 0 the cap is there > and we get the value and just use it. 
> > > > + > > + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1); > > + if (!vsec) { > > + pr_debug("Failed to get VSEC bus %x devfn %x\n", > > + pdev->bus->number, pdev->devfn); > > + return false; > > + } > > + > > + if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) > > + return false; > > I'd do this first. Will apply to a lot of matches and this > is much cheaper than finding capabilities. > > > + > > + pci_read_config_dword(pdev, base + 8, &cap_data1); > > + pci_read_config_dword(pdev, base + 4, &cap_data2); > > + > > + pci_read_config_dword(pdev, vsec + 12, &vsec_data); > > Use a define for that vsec offset that gives some indication > of it's purpose in the LSI VSEC. > > > + > > + pr_debug("Found Broadcom device bus 0x%x devfn 0x%x " > > + "Serial Number: 0x%x 0x%x, VSEC 0x%x\n", > > + pdev->bus->number, pdev->devfn, > > + cap_data1, cap_data2, vsec_data); > > + > > + if (!SECURE_PART(cap_data1)) > > + return false; > FIELD_GET() > > > + > > + if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK)) > > FIELD_GET() for the relevant bits in each. > > > + return false; > > + > > + if (serial_num) > > + snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2); > Just use the u64. > > + > > + return true; > > +} > > + > > +static int sw_disc_create_sysfs_files(void) > > +{ > > + int i, j, retval; > > + > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > + if (sw_kobj[i]) { > > + kobject_put(sw_kobj[i]); > > + sw_kobj[i] = NULL; > If you are freeing kobjects in a creation path something went wrong. > Don't do this - if it makes sense free them before calling this create function. > > > + } > > + } > > + > > + if (sw_link_kobj) { > > + kobject_put(sw_link_kobj); > > + sw_link_kobj = NULL; > > + } > > + > > + sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj); > > Don't use defines for file names. We want to see them inline as > much more readable! 
> > > + if (!sw_link_kobj) { > > + pr_err("Failed to create pci link object\n"); > > + return -ENOMEM; > > + } > > + > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > + int segment, bus, device, function; > > + char bdf_i[SWD_MAX_CHAR]; > > No obvious reason why this is the same length as serial numbers? > Use an appropriate define for each. We print the bdf in > various places, maybe there is already a suitable define and if > not perhaps worth adding one. > > > + > > + if (!test_bit(i, swdata_valid)) > > + continue; > > + > > + segment = pci_domain_nr(swdata[i].bus); > > + bus = swdata[i].bus->number; > > + device = pci_ari_enabled(swdata[i].bus) ? > > + 0 : PCI_SLOT(swdata[i].devfn); > > + function = pci_ari_enabled(swdata[i].bus) ? > > + swdata[i].devfn : PCI_FUNC(swdata[i].devfn); > > + sprintf(bdf_i, "%04x:%02x:%02x.%x", > > + segment, bus, device, function); > > + > > + for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) { > > + char bdf_j[SWD_MAX_CHAR]; > > + > > + if (!test_bit(j, swdata_valid)) > > + continue; > > + segment = pci_domain_nr(swdata[j].bus); > > + bus = swdata[j].bus->number; > > + device = pci_ari_enabled(swdata[j].bus) ? > > + 0 : PCI_SLOT(swdata[j].devfn); > > + function = pci_ari_enabled(swdata[j].bus) ? 
> > + swdata[j].devfn : PCI_FUNC(swdata[j].devfn); > > + sprintf(bdf_j, "%04x:%02x:%02x.%x", > > + segment, bus, device, function); > > + > > + if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) { > > + if (!sw_kobj[i]) { > > + sw_kobj[i] = kobject_create_and_add(bdf_i, > > + sw_link_kobj); > > + if (!sw_kobj[i]) { > > + pr_err("Failed to create sysfs entry for switch %s\n", > > + bdf_i); > > + } > > + } > > + > > + if (!sw_kobj[j]) { > > + sw_kobj[j] = kobject_create_and_add(bdf_j, > > + sw_link_kobj); > > + if (!sw_kobj[j]) { > > + pr_err("Failed to create sysfs entry for switch %s\n", > > + bdf_j); > > + } > > + } > > + > > + retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j); > > + if (retval) > > + pr_err("Error creating symlink %s and %s\n", > > + bdf_i, bdf_j); > > + > > + retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i); > > + if (retval) > > + pr_err("Error creating symlink %s and %s\n", > > + bdf_j, bdf_i); > > + } > > + } > > + } > > + > > + return 0; > > +} > > + > > +/* > > + * Check if the two pci devices have virtual P2P link available. > > + * This function is used by the p2pdma to determine virtual > > + * links between the PCI-to-PCI bridges > > + */ > > +bool sw_disc_check_virtual_link(struct pci_dev *a, > > + struct pci_dev *b) > No need to wrrap line. 
> > > +{ > > + char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR]; > > + > > + /* > > + * Check if the PCIe devices support Virtual P2P links > > + */ > > Single line comment > /* Check if the PCIe devices support Virtual P2P links */ > > > + if (!sw_disc_is_supported_pdev(a)) > > + return false; > > + > > + if (!sw_disc_is_supported_pdev(b)) > > + return false; > > + > > + if (brcm_sw_is_p2p_supported(a, serial_num_a) && > > + brcm_sw_is_p2p_supported(b, serial_num_b)) > > + if (!strcmp(serial_num_a, serial_num_b)) > > + return true; > > + > > + return false; > > +} > > +EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link); > > + > > +static struct kobj_attribute sw_disc_attribute = > > + __ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL); > > As below. Use string directly for file names, don't hide it behind > a define. > > > + > > +// Create attribute group > Drop comment + if it was here /* */ > > > +static struct attribute *attrs[] = { > > + &sw_disc_attribute.attr, > > + NULL, > No comma on NULL terminators as we won't add anything after them. > > > +}; > > + > > +static struct attribute_group attr_group = { > > + .attrs = attrs, > > +}; > > + > > +static int __init sw_discovery_init(void) > > +{ > > + int i, retval; > > + > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > > + sw_kobj[i] = NULL; > > + > > + // Create "sw_disc" kobject > > Drop any 'obvious' comments. > > > + sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj); > > + if (!sw_disc_kobj) { > > + pr_err("Failed to create sw_disc_kobj\n"); > > + return -ENOMEM; > > + } > > + > > + retval = sysfs_create_group(sw_disc_kobj, &attr_group); > > + if (retval) { > > + pr_err("Cannot register sysfs attribute group\n"); > > + kobject_put(sw_disc_kobj); > return an error. > > + } > > + > > + swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL); > > + if (!swdata) { > > + sysfs_remove_group(sw_disc_kobj, &attr_group); > > + kobject_put(sw_disc_kobj); > return an error. 
> > + return 0; > > + } > > + > > + pr_info("Loading PCIe switch discovery module, version %s\n", > > + SWITCH_DISC_VERSION); > > + > > + return 0; > > +} > > + > > +static void __exit sw_discovery_exit(void) > > +{ > > + int i; > > + > > + if (!swdata) > I'm fairly sure that if you return an error in failure above (which shouldn't > fail anyway) you won't need this protection as for exit() to be called init() > must have succeeded and the data must have been allocated. > > > + kfree(swdata); > > + > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > + if (sw_kobj[i]) > > + kobject_put(sw_kobj[i]); > > + } > > + > > + // Remove kobject > > /* Remove kobject */ but that's pretty obvious anyway so better to just drop the > comment. > > > > + if (sw_link_kobj) > > + kobject_put(sw_link_kobj); > > + > > + sysfs_remove_group(sw_disc_kobj, &attr_group); > > + kobject_put(sw_disc_kobj); > > +} > > + > > +module_init(sw_discovery_init); > > +module_exit(sw_discovery_exit); > > + > > +MODULE_LICENSE("GPL"); > > +MODULE_AUTHOR("Broadcom Inc."); > > +MODULE_VERSION(SWITCH_DISC_VERSION); > > +MODULE_DESCRIPTION("PCIe Switch Discovery Module"); > > diff --git a/drivers/pci/switch/switch_discovery.h b/drivers/pci/switch/switch_discovery.h > > new file mode 100644 > > index 000000000000..b84f5d2e29ac > > --- /dev/null > > +++ b/drivers/pci/switch/switch_discovery.h > > @@ -0,0 +1,44 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * PCI Switch Discovery module > > + * > > + * Copyright (c) 2024 Broadcom Inc. > > + * > > + * Authors: Broadcom Inc. > > + * Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > + * Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > Why is the header needed? Only seems to be used from one c file. > Move everything down there and drop this file. 
> > > + */ > > + > > +#ifndef PCI_SWITCH_DISC_H > > +#define PCI_SWITCH_DISC_H > > + > > +#define SWD_MAX_SWITCH 32 > > +#define SWD_MAX_VER_PER_SWITCH 8 > > + > > +#define SWD_MAX_VIRT_SWITCH (SWD_MAX_SWITCH * SWD_MAX_VER_PER_SWITCH) > > +#define SWD_MAX_CHAR 16 > > Name this so it's clearer what it is sizing. > > > +#define SWITCH_DISC_VERSION "0.1.1" > > Whilst there are module versions in the kernel etc, they are meaningless > as we must support backwards compatibility anyway. So don't give > it a version (this is basically ancient legacy no one uses any more) > > > +#define SWD_DIR_STRING "pci_switch_link" > All these better inline. Defines just make yoru code harder to read. > > +#define SWD_LINK_DIR_STRING "virtual_switch_links" > > +#define SWD_SCAN_DONE "done\n" > Definitely inline! > > > + > > +#define SWD_FILE_NAME_STRING refresh_switch_toplogy > Just use the string directly inline. This doesn't belong in > a header. > > + > > +/* Broadcom Vendor Specific definitions */ > > +#define PCI_VENDOR_ID_LSI 0x1000 > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC 0xC030 > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC 0xC034 > > > + > > +#define P2PMASK(x) (((x) & 0x300) >> 8) > > Use FIELD_GET() on the mask alone and make sure it's clear from > naming what register this applies to. > > > +#define SECURE_PART(x) ((x) & 0x8) > > +#define INTER_SWITCH_LINK 0x2 > Give this a name that matches with a register name or smilar. > > > + > > +struct switch_data { > > More specific name needed as this will clash with something at somepoint > in the future > > > + int devfn; > extra space before devfn. 
> > > + struct pci_bus *bus; > > + char serial_num[SWD_MAX_CHAR]; > > +}; > > + > > +bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b); > > + > > +#endif /* PCI_SWITCH_DISC_H */
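
Editor's sketch, not part of the thread: two of the review comments above (use FIELD_GET() on a plain mask, and "Just use the u64" for the serial number) can be illustrated in plain userspace C. In the kernel the real helpers would be FIELD_GET() from <linux/bitfield.h> and pci_get_dsn(). Everything prefixed sketch_/SKETCH_ below is a hypothetical name invented for this illustration. The bit positions are taken from the patch's P2PMASK(), INTER_SWITCH_LINK and SECURE_PART() macros, and the sketch assumes the dword the patch reads at DSN offset 8 ends up in the upper 32 bits of the 64-bit serial number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names. The values mirror the patch's P2PMASK() (0x300),
 * INTER_SWITCH_LINK (0x2) and SECURE_PART() (0x8 in the upper DSN dword).
 */
#define SKETCH_VSEC_P2P_MODE_MASK	UINT64_C(0x300)
#define SKETCH_VSEC_P2P_MODE_ISL	UINT64_C(0x2)
#define SKETCH_DSN_SECURE_PART		(UINT64_C(0x8) << 32)

/*
 * Userspace stand-in for the kernel's FIELD_GET(): keep the masked bits
 * and shift them down by the position of the mask's lowest set bit.
 */
static uint64_t sketch_field_get(uint64_t mask, uint64_t val)
{
	assert(mask != 0);
	return (val & mask) / (mask & ~(mask - 1));
}

/* Build one 64-bit serial number: upper dword in bits 63:32. */
static uint64_t sketch_make_dsn(uint32_t upper, uint32_t lower)
{
	return ((uint64_t)upper << 32) | lower;
}

/* True if the part is "secure" and advertises the inter-switch-link mode. */
static bool sketch_p2p_supported(uint64_t dsn, uint32_t vsec_data)
{
	if (!sketch_field_get(SKETCH_DSN_SECURE_PART, dsn))
		return false;

	return (sketch_field_get(SKETCH_VSEC_P2P_MODE_MASK, vsec_data) &
		SKETCH_VSEC_P2P_MODE_ISL) != 0;
}

/* Two USPs share a physical switch if both qualify and the DSNs match. */
static bool sketch_same_physical_switch(uint64_t dsn_a, uint32_t vsec_a,
					uint64_t dsn_b, uint32_t vsec_b)
{
	return sketch_p2p_supported(dsn_a, vsec_a) &&
	       sketch_p2p_supported(dsn_b, vsec_b) &&
	       dsn_a == dsn_b;
}
```

Comparing the two 64-bit values directly also sidesteps a likely problem in the string form: "%x%x" is not zero-padded, so different dword pairs can print identically, and two full 8-digit values need 17 bytes including the terminator, one more than SWD_MAX_CHAR (16) allows.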
* Re: [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges
2024-06-14 15:36 ` Sumanesh Samanta
@ 2024-06-23 14:44 ` Manivannan Sadhasivam
0 siblings, 0 replies; 10+ messages in thread
From: Manivannan Sadhasivam @ 2024-06-23 14:44 UTC (permalink / raw)
To: Sumanesh Samanta
Cc: Jonathan Cameron, Shivasharan S, linux-pci, bhelgaas, linux-kernel, sathya.prakash

On Fri, Jun 14, 2024 at 09:36:58AM -0600, Sumanesh Samanta wrote:
> Hi Jonathan and others,
>
> Thanks for the feedback.
>
> > I think the functionality is useful in general.
> >
> > Not sure that's an appropriate location though. I'd rather
> > see something in each of the USP devices that references
> > to others they share with.
>
> Yes, we did think of using /sys/bus/pci/devices/, but that would of
> course mean changing the PCIe driver (as you too noted later), and we
> were not sure if that would be acceptable to the community.
> There is also an issue with race condition during discovery, more on that below.
>
> > I also don't think we actually care
> > if these are virtual or real inter switch links (which are hidden
> > from the host anyway I think?)
>
> Yes, correct. The discovery of whether the switches are connected will
> be different for Virtual Switch and physical Inter switch link, but
> the sysfs entries will be identical, and applications/consumers need
> not care.
>
> > We may want a way to discover the bandwidth though.
>
> Agreed, that would be a good enhancement, but will need more vendor
> specific discovery, I think. But that is very useful, I agree and we
> can take that later once this set of patches is approved
>
> > I'm not sure if a bus walk and search is the right way to do this.
> >
> > I need to think on this more, but options that occur are:
> > 1) Do it in the PCI core (so without a driver binding).
> > /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0
> > Controversial perhaps because PCIe provides no 'standard' way to discover
> > this but it is slim enough, maybe?
> >
> > 2) Do it in portdrv as that will currently bind to the USP anyway.
> >
>
> One issue with using the PCI core and/or portdrv is that when those
> drivers load, they will discover devices one by one, and when a given
> device tries to find its connected peers, it may not be able to do so,
> because its peers might not have been discovered yet.
> We solved that problem by having a "refresh_link" sysfs entry, and
> when that entry is triggered, we rediscover all the p2p switch links.
> If we move the code to PCI core/portdrv, then we shall probably need a
> "refresh" under each device node, and when that is triggered, we shall
> need to discover the links for that device. But we shall still
> need to do the bus walk, as we shall still need to find the peer
> devices and check whether they are connected to the given device.
> So, basically, what I am saying is, something like this:
>
> /sys/bus/pci/devices/0000:0c:00.0/isl/refresh_link
> /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0
>
> Will that be acceptable? If so, we shall incorporate that change in
> the next patch.
> We shall also incorporate the rest of your feedback.
>

I feel like moving this part of the code to portdrv/pci core would be the
correct approach as the functionality itself is generic but only the
implementation is vendor specific.

But I'm thinking that instead of discovering the link during device probe
itself, can we just create the sysfs attribute for each switch during probe,
and display the link during attribute read? This would also mean that we can
have the sysfs attribute under each bridge instead of a separate location.
Like, /sys/bus/pci/devices/0000:01:00.0/p2p_link

- Mani

> sincerely,
> Sumanesh
>
> On Thu, Jun 13, 2024 at 6:40 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Wed, 12 Jun 2024 04:27:35 -0700
> > Shivasharan S <shivasharan.srikanteshwara@broadcom.com> wrote:
> >
> > > This kernel module discovers the virtual inter-switch P2P links present
> > > between two PCI-to-PCI bridges that allow an optimal data path for data
> > > movement. The module creates sysfs entries for upstream PCI-to-PCI
> > > bridges which support the inter switch P2P links as below:
> > >
> > >                             Host root bridge
> > >                  ---------------------------------------
> > >                  |                                     |
> > >    NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
> > > (af:00.0)    (ad:00.0)                             (8b:00.0)   (8d:00.0)
> > >                  |                                     |
> > >                GPU1                                  GPU2
> > >              (b0:00.0)                             (8e:00.0)
> > >                                 SERVER 1
> > >
> > > /sys/kernel/pci_switch_link/virtual_switch_links
> > > ├── 0000:8b:00.0
> > > │   └── 0000:ad:00.0 -> ../0000:ad:00.0
> > > └── 0000:ad:00.0
> > >     └── 0000:8b:00.0 -> ../0000:8b:00.0
> >
> > I think the functionality is useful in general.
> >
> > Not sure that's an appropriate location though. I'd rather
> > see something in each of the USP devices that references
> > to others they share with. I also don't think we actually care
> > if these are virtual or real inter switch links (which are hidden
> > from the host anyway I think?) The discovery means might be different
> > in those cases (large 'switch' made up of multiple connected smaller
> > switches). We may want a way to discover the bandwidth though.
> >
> >
> > Needs a formal sysfs doc under
> > Documentation/ABI/testing/sysfs* (not totally sure where for this
> > interface, but I think that location will change anyway)
> >
> > The comments below are mostly superficial. I need to think a bit
> > more on how this might fit better with the linux driver model
> > as I really don't like magic things that cross many devices.
> > > > > > > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > > --- > > > .../driver-api/pci/switch_discovery.rst | 52 +++ > > > MAINTAINERS | 13 + > > > drivers/pci/switch/Kconfig | 9 + > > > drivers/pci/switch/Makefile | 1 + > > > drivers/pci/switch/switch_discovery.c | 375 ++++++++++++++++++ > > > drivers/pci/switch/switch_discovery.h | 44 ++ > > > 6 files changed, 494 insertions(+) > > > create mode 100644 Documentation/driver-api/pci/switch_discovery.rst > > > create mode 100644 drivers/pci/switch/switch_discovery.c > > > create mode 100644 drivers/pci/switch/switch_discovery.h > > > > > > diff --git a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst > > > new file mode 100644 > > > index 000000000000..7c1476260e5e > > > --- /dev/null > > > +++ b/Documentation/driver-api/pci/switch_discovery.rst > > > @@ -0,0 +1,52 @@ > > > +================================= > > > +Linux PCI Switch discovery module > > > +================================= > > > + > > > +Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer > > > +without using host resources. For example, Broadcom(PLX) PCIe Switches have a > > > +capability where a single physical switch can be divided up into multiple > > > +virtual switches at SOD. PCIe switch discovery module detects the virtual links > > > +between the switches that belong to the same physical switch. > > > +This allows user space applications to discover these virtual links that belong > > > +to the same physical switch and configure optimized data paths. > > > + > > > +Userspace Interface > > > +=================== > > > + > > > +The module exposes sysfs entries for user space applications like MPI, NCCL, > > > +UCC, RCCL, HCCL, etc to discover the virtual switch links. 
> > > + > > > +Consider the below topology > > > + > > > + Host root bridge > > > + --------------------------------------- > > > + | | > > > + NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > > > +(af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > > > + | | > > > + GPU1 GPU2 > > > + (b0:00.0) (8e:00.0) > > > + SERVER 1 > > > + > > > +The simple topology above shows SERVER1, has Switch1 and Switch2 which are > > > +virtual switches that belong to the same physical switch that support > > > +Inter switch P2P. > > > +Switch1 and Switch2 have a GPU and NIC each connected. > > > +The module will detect the virtual P2P link existing between the two switches > > > +and create the sysfs entries as below. > > > + > > > +/sys/kernel/pci_switch_link/virtual_switch_links > > > +├── 0000:8b:00.0 > > > +│ └── 0000:ad:00.0 -> ../0000:ad:00.0 > > > +└── 0000:ad:00.0 > > > + └── 0000:8b:00.0 -> ../0000:8b:00.0 > > > + > > > +The HPC/AI libraries that analyze the topology can decide the optimal data > > > +path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a > > > +non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1. > > > + > > > +Enable P2P DMA to discover virtual links > > > +---------------------------------------- > > > +The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual > > > +link between the upstream PCI-to-PCI bridges of the devices and detect optimal > > > +path for applications using P2P DMA API. 
> > > diff --git a/MAINTAINERS b/MAINTAINERS > > > index 823387766a0c..b1bf3533ea6f 100644 > > > --- a/MAINTAINERS > > > +++ b/MAINTAINERS > > > @@ -17359,6 +17359,19 @@ F: Documentation/driver-api/pci/p2pdma.rst > > > F: drivers/pci/p2pdma.c > > > F: include/linux/pci-p2pdma.h > > > > > > +PCI SWITCH DISCOVERY > > > +M: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > > +M: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > > +L: linux-pci@vger.kernel.org > > > +S: Maintained > > > +Q: https://patchwork.kernel.org/project/linux-pci/list/ > > > +B: https://bugzilla.kernel.org > > > +C: irc://irc.oftc.net/linux-pci > > > +T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git > > > +F: Documentation/driver-api/pci/switch_discovery.rst > > > +F: drivers/pci/switch/switch_discovery.c > > > +F: drivers/pci/switch/switch_discovery.h > > > + > > > PCI SUBSYSTEM > > > M: Bjorn Helgaas <bhelgaas@google.com> > > > L: linux-pci@vger.kernel.org > > > diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig > > > index d370f4ce0492..fb4410153950 100644 > > > --- a/drivers/pci/switch/Kconfig > > > +++ b/drivers/pci/switch/Kconfig > > > @@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC > > > devices. See <file:Documentation/driver-api/switchtec.rst> for more > > > information. > > > > > > +config PCI_SW_DISCOVERY > > > + depends on PCI > > > + tristate "PCI Switch discovery module" > > > + help > > > + This kernel module discovers the PCI-to-PCI bridges of PCIe switches > > > + and forms the virtual switch links if the bridges belong to the same > > > + Physical switch. The switch links help to identify shorter distances > > > + for P2P configurations. 
> > > + > > > endmenu > > > diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile > > > index acd56d3b4a35..a3584b5146af 100644 > > > --- a/drivers/pci/switch/Makefile > > > +++ b/drivers/pci/switch/Makefile > > > @@ -1,2 +1,3 @@ > > > # SPDX-License-Identifier: GPL-2.0 > > > obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o > > > +obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o > > > diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c > > > new file mode 100644 > > > index 000000000000..a427d3885b1f > > > --- /dev/null > > > +++ b/drivers/pci/switch/switch_discovery.c > > > @@ -0,0 +1,375 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * PCI Switch Discovery module > > > + * > > > + * Copyright (c) 2024 Broadcom Inc. > > > + * > > > + * Authors: Broadcom Inc. > > > + * Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > > + * Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > > + */ > > > + > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > > + > > > +#include <linux/init.h> > > > +#include <linux/kernel.h> > > > +#include <linux/module.h> > > > +#include <linux/sysfs.h> > > > +#include <linux/slab.h> > > > +#include <linux/rwsem.h> > > > +#include <linux/pci.h> > > > +#include <linux/vmalloc.h> > > > > Pick an ordering scheme for headers. Can't remember which PCI uses > > but alphabetical is always a good starting point unless there is > > a local standard. > > > > > +#include "switch_discovery.h" > > > + > > > +static DECLARE_RWSEM(sw_disc_rwlock); > > > +static struct kobject *sw_disc_kobj, *sw_link_kobj; > > > +static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH]; > > Why can't this be dynamically sized? Use a list. 
> > > > > +static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH); > > > + > > > +static struct switch_data *swdata; > > > + > > > +static int sw_disc_probe(void); > > > +static int sw_disc_create_sysfs_files(void); > > > +static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num); > > > > Can you reorder the code to avoid the need for these forwards definitions? > > > > > + > > > +static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev) > > > +{ > > > + if ((pdev->vendor == PCI_VENDOR_ID_LSI) && > > > + ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) || > > > + (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC))) > > > + return true; > > > + > > > + return false; > > > +} > > > + > > > +static ssize_t sw_disc_show(struct kobject *kobj, > > > + struct kobj_attribute *attr, > > > + char *buf) > > > +{ > > > + int retval; > > > + > > > + down_write(&sw_disc_rwlock); > > > + retval = sw_disc_probe(); > > > + if (!retval) { > > > + pr_debug("No new switch found\n"); > > > + goto exit_success; > > > + } > > > + > > > + retval = sw_disc_create_sysfs_files(); > > > + if (retval < 0) { > > > + pr_err("Failed to create the sysfs entries, retval %d\n", > > > + retval); > > > + } > > > + > > > +exit_success: > > > + up_write(&sw_disc_rwlock); > > > + return sysfs_emit(buf, SWD_SCAN_DONE); > > Don't have side effects on a read. Write 1 to the file to scan and when > > it is done, return len; > > > > > +} > > > + > > > +/* This function probes the PCIe devices for virtual links */ > > > > I'm not sure if a bus walk and search is the right way to do this. > > > > I need to think on this more, but options that occur are: > > 1) Do it in the PCI core (so without a driver binding). > > /sys/bus/pci/devices/0000:0c:00.0/isl/0000:12:00.0 -> ../../../0000:12:00.0 > > Controversial perhaps because PCIe provides no 'standard' way to discover > > this but it it is slim enough, maybe? 
> > > > 2) Do it in portdrv as that will currently bind to the USP anyway. > > > > There are other discussions on going on refactoring the pcie portdrv > > and this usecase might well fit in there. Doesn't seem very invasive > > to add this. > > > > > +static int sw_disc_probe(void) > > > +{ > > > + int i, bit; > > > + struct pci_dev *pdev = NULL; > > > + int topology_changed = 0; > > > + DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH); > > > > As above, I'd use a list of found virtual switches then removal > > is dropping an entry from middle of that list. > > Probe finds what is there and moves things to a new temporary list. > > Delete anything left on the old list. > > > > > + > > > + while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) { > > > > Not using the port class code? Feels like every switch will if this isn't > > in a different function? (I've been assuming it is vsec on the USP function) > > > > > + int sw_found; > > > + > > > + /* Currently this function only traverses Broadcom > > > + * PEX switches and determines the virtual SW links. > > > + * Other Switch vendors can add their specific logic > > > + * determine the virtual links. > > > + */ > > > > I'd move this comment to the supported query. As you observe, it is > > general in principle. > > > > > + if (!sw_disc_is_supported_pdev(pdev)) > > > > It's not really about discovering switches. So I'd call it > > sw_might_be_virtual_switch() or something like that. > > > > I'm sure we'll eventually have to handle multiple physical switches > > with a real interswitchlink at some point, but that can be addressed > > separately. > > > > > > > + continue; > > > + > > > + sw_found = -1; > > > > int sw_found = -1; above > > > > > + > > > + for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) { > > > + if (swdata[bit].devfn == pdev->devfn && > > > + swdata[bit].bus == pdev->bus) { > > > > Can we use an xarray or similar to do this lookup? 
> > > > > + sw_found = bit; > > > + set_bit(sw_found, sw_found_map); > > > + break; > > > + } > > > + } > > > + > > > + if (sw_found != -1) > > > + continue; > > > + > > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > > > + if (!swdata[i].bus) > > > + break; > > > + > > > + if (i >= SWD_MAX_VIRT_SWITCH) { > > > + pr_err("Max switch exceeded\n"); > > > + break; > > > + } > > > + > > > + sw_found = i; > > > + > > > + if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num)) > > > + continue; > > > + > > > + /* Found a new switch which supports P2P */ > > > + swdata[sw_found].devfn = pdev->devfn; > > > + swdata[sw_found].bus = pdev->bus; > > > + > > > + topology_changed = 1; > > > + set_bit(sw_found, sw_found_map); > > > + set_bit(sw_found, swdata_valid); > > > + } > > > + > > > + /* handle device removal */ > > > + for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) { > > > + if (test_bit(bit, swdata_valid)) { > > > + memset(&swdata[bit], 0, sizeof(swdata[i])); > > > + clear_bit(bit, swdata_valid); > > > + topology_changed = 1; > > > + } > > > + } > > > + > > > + return topology_changed; > > > +} > > > + > > > +/* Check the various config space registers of the Broadcom PCI device and > > > + * return true if the device supports inter switch P2P. > > > + * If P2P is supported, return the device serial number back to > > > + * caller. > > > + */ > > > +bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num) > > > +{ > > > + int base; > > > + u32 cap_data1, cap_data2; > > > + u16 vsec; > > > + u32 vsec_data; > > > + > > > + base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN); > > > + if (!base) { > > > + pr_debug("Failed to get extended capability bus %x devfn %x\n", > > > + pdev->bus->number, pdev->devfn); > > > + return false; > > > + } > > > > I'd just call pci_get_dsn() If it doesn't return 0 the cap is there > > and we get the value and just use it. 
> > > > > > > + > > > + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1); > > > + if (!vsec) { > > > + pr_debug("Failed to get VSEC bus %x devfn %x\n", > > > + pdev->bus->number, pdev->devfn); > > > + return false; > > > + } > > > + > > > + if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) > > > + return false; > > > > I'd do this first. Will apply to a lot of matches and this > > is much cheaper than finding capabilities. > > > > > + > > > + pci_read_config_dword(pdev, base + 8, &cap_data1); > > > + pci_read_config_dword(pdev, base + 4, &cap_data2); > > > + > > > + pci_read_config_dword(pdev, vsec + 12, &vsec_data); > > > > Use a define for that vsec offset that gives some indication > > of it's purpose in the LSI VSEC. > > > > > + > > > + pr_debug("Found Broadcom device bus 0x%x devfn 0x%x " > > > + "Serial Number: 0x%x 0x%x, VSEC 0x%x\n", > > > + pdev->bus->number, pdev->devfn, > > > + cap_data1, cap_data2, vsec_data); > > > + > > > + if (!SECURE_PART(cap_data1)) > > > + return false; > > FIELD_GET() > > > > > + > > > + if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK)) > > > > FIELD_GET() for the relevant bits in each. > > > > > + return false; > > > + > > > + if (serial_num) > > > + snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2); > > Just use the u64. > > > + > > > + return true; > > > +} > > > + > > > +static int sw_disc_create_sysfs_files(void) > > > +{ > > > + int i, j, retval; > > > + > > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > > + if (sw_kobj[i]) { > > > + kobject_put(sw_kobj[i]); > > > + sw_kobj[i] = NULL; > > If you are freeing kobjects in a creation path something went wrong. > > Don't do this - if it makes sense free them before calling this create function. 
> > > > > + } > > > + } > > > + > > > + if (sw_link_kobj) { > > > + kobject_put(sw_link_kobj); > > > + sw_link_kobj = NULL; > > > + } > > > + > > > + sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj); > > > > Don't use defines for file names. We want to see them inline as > > much more readable! > > > > > + if (!sw_link_kobj) { > > > + pr_err("Failed to create pci link object\n"); > > > + return -ENOMEM; > > > + } > > > + > > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > > + int segment, bus, device, function; > > > + char bdf_i[SWD_MAX_CHAR]; > > > > No obvious reason why this is the same length as serial numbers? > > Use an appropriate define for each. We print the bdf in > > various places, maybe there is already a suitable define and if > > not perhaps worth adding one. > > > > > + > > > + if (!test_bit(i, swdata_valid)) > > > + continue; > > > + > > > + segment = pci_domain_nr(swdata[i].bus); > > > + bus = swdata[i].bus->number; > > > + device = pci_ari_enabled(swdata[i].bus) ? > > > + 0 : PCI_SLOT(swdata[i].devfn); > > > + function = pci_ari_enabled(swdata[i].bus) ? > > > + swdata[i].devfn : PCI_FUNC(swdata[i].devfn); > > > + sprintf(bdf_i, "%04x:%02x:%02x.%x", > > > + segment, bus, device, function); > > > + > > > + for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) { > > > + char bdf_j[SWD_MAX_CHAR]; > > > + > > > + if (!test_bit(j, swdata_valid)) > > > + continue; > > > + segment = pci_domain_nr(swdata[j].bus); > > > + bus = swdata[j].bus->number; > > > + device = pci_ari_enabled(swdata[j].bus) ? > > > + 0 : PCI_SLOT(swdata[j].devfn); > > > + function = pci_ari_enabled(swdata[j].bus) ? 
> > > + swdata[j].devfn : PCI_FUNC(swdata[j].devfn); > > > + sprintf(bdf_j, "%04x:%02x:%02x.%x", > > > + segment, bus, device, function); > > > + > > > + if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) { > > > + if (!sw_kobj[i]) { > > > + sw_kobj[i] = kobject_create_and_add(bdf_i, > > > + sw_link_kobj); > > > + if (!sw_kobj[i]) { > > > + pr_err("Failed to create sysfs entry for switch %s\n", > > > + bdf_i); > > > + } > > > + } > > > + > > > + if (!sw_kobj[j]) { > > > + sw_kobj[j] = kobject_create_and_add(bdf_j, > > > + sw_link_kobj); > > > + if (!sw_kobj[j]) { > > > + pr_err("Failed to create sysfs entry for switch %s\n", > > > + bdf_j); > > > + } > > > + } > > > + > > > + retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j); > > > + if (retval) > > > + pr_err("Error creating symlink %s and %s\n", > > > + bdf_i, bdf_j); > > > + > > > + retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i); > > > + if (retval) > > > + pr_err("Error creating symlink %s and %s\n", > > > + bdf_j, bdf_i); > > > + } > > > + } > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > + * Check if the two pci devices have virtual P2P link available. > > > + * This function is used by the p2pdma to determine virtual > > > + * links between the PCI-to-PCI bridges > > > + */ > > > +bool sw_disc_check_virtual_link(struct pci_dev *a, > > > + struct pci_dev *b) > > No need to wrap line.
> > > > > +{ > > > + char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR]; > > > + > > > + /* > > > + * Check if the PCIe devices support Virtual P2P links > > > + */ > > > > Single line comment > > /* Check if the PCIe devices support Virtual P2P links */ > > > > > + if (!sw_disc_is_supported_pdev(a)) > > > + return false; > > > + > > > + if (!sw_disc_is_supported_pdev(b)) > > > + return false; > > > + > > > + if (brcm_sw_is_p2p_supported(a, serial_num_a) && > > > + brcm_sw_is_p2p_supported(b, serial_num_b)) > > > + if (!strcmp(serial_num_a, serial_num_b)) > > > + return true; > > > + > > > + return false; > > > +} > > > +EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link); > > > + > > > +static struct kobj_attribute sw_disc_attribute = > > > + __ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL); > > > > As below. Use string directly for file names, don't hide it behind > > a define. > > > > > + > > > +// Create attribute group > > Drop comment + if it was here /* */ > > > > > +static struct attribute *attrs[] = { > > > + &sw_disc_attribute.attr, > > > + NULL, > > No comma on NULL terminators as we won't add anything after them. > > > > > +}; > > > + > > > +static struct attribute_group attr_group = { > > > + .attrs = attrs, > > > +}; > > > + > > > +static int __init sw_discovery_init(void) > > > +{ > > > + int i, retval; > > > + > > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > > > + sw_kobj[i] = NULL; > > > + > > > + // Create "sw_disc" kobject > > > > Drop any 'obvious' comments. > > > > > + sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj); > > > + if (!sw_disc_kobj) { > > > + pr_err("Failed to create sw_disc_kobj\n"); > > > + return -ENOMEM; > > > + } > > > + > > > + retval = sysfs_create_group(sw_disc_kobj, &attr_group); > > > + if (retval) { > > > + pr_err("Cannot register sysfs attribute group\n"); > > > + kobject_put(sw_disc_kobj); > > return an error. 
> > > + } > > > + > > > + swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL); > > > + if (!swdata) { > > > + sysfs_remove_group(sw_disc_kobj, &attr_group); > > > + kobject_put(sw_disc_kobj); > > return an error. > > > + return 0; > > > + } > > > + > > > + pr_info("Loading PCIe switch discovery module, version %s\n", > > > + SWITCH_DISC_VERSION); > > > + > > > + return 0; > > > +} > > > + > > > +static void __exit sw_discovery_exit(void) > > > +{ > > > + int i; > > > + > > > + if (!swdata) > > I'm fairly sure that if you return an error in failure above (which shouldn't > > fail anyway) you won't need this protection as for exit() to be called init() > > must have succeeded and the data must have been allocated. > > > > > + kfree(swdata); > > > + > > > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > > > + if (sw_kobj[i]) > > > + kobject_put(sw_kobj[i]); > > > + } > > > + > > > + // Remove kobject > > > > /* Remove kobject */ but that's pretty obvious anyway so better to just drop the > > comment. > > > > > > > + if (sw_link_kobj) > > > + kobject_put(sw_link_kobj); > > > + > > > + sysfs_remove_group(sw_disc_kobj, &attr_group); > > > + kobject_put(sw_disc_kobj); > > > +} > > > + > > > +module_init(sw_discovery_init); > > > +module_exit(sw_discovery_exit); > > > + > > > +MODULE_LICENSE("GPL"); > > > +MODULE_AUTHOR("Broadcom Inc."); > > > +MODULE_VERSION(SWITCH_DISC_VERSION); > > > +MODULE_DESCRIPTION("PCIe Switch Discovery Module"); > > > diff --git a/drivers/pci/switch/switch_discovery.h b/drivers/pci/switch/switch_discovery.h > > > new file mode 100644 > > > index 000000000000..b84f5d2e29ac > > > --- /dev/null > > > +++ b/drivers/pci/switch/switch_discovery.h > > > @@ -0,0 +1,44 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +/* > > > + * PCI Switch Discovery module > > > + * > > > + * Copyright (c) 2024 Broadcom Inc. > > > + * > > > + * Authors: Broadcom Inc. 
> > > + * Sumanesh Samanta <sumanesh.samanta@broadcom.com> > > > + * Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > > Why is the header needed? Only seems to be used from one c file. > > Move everything down there and drop this file. > > > > > + */ > > > + > > > +#ifndef PCI_SWITCH_DISC_H > > > +#define PCI_SWITCH_DISC_H > > > + > > > +#define SWD_MAX_SWITCH 32 > > > +#define SWD_MAX_VER_PER_SWITCH 8 > > > + > > > +#define SWD_MAX_VIRT_SWITCH (SWD_MAX_SWITCH * SWD_MAX_VER_PER_SWITCH) > > > +#define SWD_MAX_CHAR 16 > > > > Name this so it's clearer what it is sizing. > > > > > +#define SWITCH_DISC_VERSION "0.1.1" > > > > Whilst there are module versions in the kernel etc, they are meaningless > > as we must support backwards compatibility anyway. So don't give > > it a version (this is basically ancient legacy no one uses any more) > > > > > +#define SWD_DIR_STRING "pci_switch_link" > > All these better inline. Defines just make your code harder to read. > > > +#define SWD_LINK_DIR_STRING "virtual_switch_links" > > > +#define SWD_SCAN_DONE "done\n" > > Definitely inline! > > > > > + > > > +#define SWD_FILE_NAME_STRING refresh_switch_toplogy > > Just use the string directly inline. This doesn't belong in > > a header. > > > + > > > +/* Broadcom Vendor Specific definitions */ > > > +#define PCI_VENDOR_ID_LSI 0x1000 > > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC 0xC030 > > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC 0xC034 > > > > > + > > > +#define P2PMASK(x) (((x) & 0x300) >> 8) > > > > Use FIELD_GET() on the mask alone and make sure it's clear from > > naming what register this applies to. > > > > > +#define SECURE_PART(x) ((x) & 0x8) > > > +#define INTER_SWITCH_LINK 0x2 > > Give this a name that matches with a register name or similar. > > > > > + > > > +struct switch_data { > > > > More specific name needed as this will clash with something at some point > > in the future > > > > > + int devfn; > > extra space before devfn.
> > > > > + struct pci_bus *bus; > > > + char serial_num[SWD_MAX_CHAR]; > > > +}; > > > + > > > +bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b); > > > + > > > +#endif /* PCI_SWITCH_DISC_H */ > > > -- மணிவண்ணன் சதாசிவம் ^ permalink raw reply [flat|nested] 10+ messages in thread
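Two of the review comments quoted above (use FIELD_GET() on named masks, and "Just use the u64" for the serial number) can be illustrated with a small userspace sketch. Everything named here is an assumption for illustration: FIELD_GET_SKETCH() is a stand-in for the kernel's FIELD_GET() from <linux/bitfield.h>, the mask names are hypothetical, and only the bit positions come from P2PMASK(), SECURE_PART() and INTER_SWITCH_LINK in the posted header. It is not the code the authors eventually wrote.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions copied from the posted header. */
#define SKETCH_VSEC_P2P_MASK     0x00000300u /* bits 9:8, was P2PMASK() */
#define SKETCH_VSEC_P2P_INTER_SW 0x2u        /* was INTER_SWITCH_LINK */
#define SKETCH_DSN_SECURE_PART   0x00000008u /* bit 3, was SECURE_PART() */

/*
 * Userspace stand-in for FIELD_GET(): mask the value, then divide by the
 * lowest set bit of the mask to shift the field down to bit 0.
 */
#define FIELD_GET_SKETCH(mask, val) \
	(((val) & (mask)) / ((mask) & (~(mask) + 1u)))

/* Combine the two DSN dwords into one comparable 64-bit serial number. */
static uint64_t sketch_dsn_to_u64(uint32_t dsn_hi, uint32_t dsn_lo)
{
	return ((uint64_t)dsn_hi << 32) | dsn_lo;
}

/*
 * Mirrors the two checks in brcm_sw_is_p2p_supported(): the secure-part
 * bit in the upper DSN dword, then the inter-switch-link bit in the
 * P2P field of the vendor-specific register.
 */
static bool sketch_p2p_supported(uint32_t dsn_hi, uint32_t vsec_data)
{
	if (!FIELD_GET_SKETCH(SKETCH_DSN_SECURE_PART, dsn_hi))
		return false;

	return (FIELD_GET_SKETCH(SKETCH_VSEC_P2P_MASK, vsec_data) &
		SKETCH_VSEC_P2P_INTER_SW) != 0;
}
```

One reason the u64 is more than a style point: "%x%x" drops leading zeros, so the dword pairs (0x1, 0x23) and (0x12, 0x3) both format as "123", and two full 8-digit dwords need 17 bytes including the terminator, which does not fit in SWD_MAX_CHAR (16).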
* Re: [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges 2024-06-12 11:27 ` [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges Shivasharan S 2024-06-13 12:40 ` Jonathan Cameron @ 2024-06-21 20:53 ` Bjorn Helgaas 2024-06-24 19:31 ` Logan Gunthorpe 1 sibling, 1 reply; 10+ messages in thread From: Bjorn Helgaas @ 2024-06-21 20:53 UTC (permalink / raw) To: Shivasharan S Cc: linux-pci, bhelgaas, linux-kernel, sumanesh.samanta, sathya.prakash, Logan Gunthorpe [+cc Logan] Surface-level comments below. On Wed, Jun 12, 2024 at 04:27:35AM -0700, Shivasharan S wrote: > This kernel module discovers the virtual inter-switch P2P links present > between two PCI-to-PCI bridges that allows an optimal data path for data > movement. The module creates sysfs entries for upstream PCI-to-PCI > bridges which supports the inter switch P2P links as below: > > Host root bridge > --------------------------------------- > | | > NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > (af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > | | > GPU1 GPU2 > (b0:00.0) (8e:00.0) > SERVER 1 > > /sys/kernel/pci_switch_link/virtual_switch_links > ├── 0000:8b:00.0 > │ └── 0000:ad:00.0 -> ../0000:ad:00.0 > └── 0000:ad:00.0 > └── 0000:8b:00.0 -> ../0000:8b:00.0 > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > --- > .../driver-api/pci/switch_discovery.rst | 52 +++ > MAINTAINERS | 13 + > drivers/pci/switch/Kconfig | 9 + > drivers/pci/switch/Makefile | 1 + > drivers/pci/switch/switch_discovery.c | 375 ++++++++++++++++++ > drivers/pci/switch/switch_discovery.h | 44 ++ > 6 files changed, 494 insertions(+) > create mode 100644 Documentation/driver-api/pci/switch_discovery.rst > create mode 100644 drivers/pci/switch/switch_discovery.c > create mode 100644 drivers/pci/switch/switch_discovery.h > > diff --git 
a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst > new file mode 100644 > index 000000000000..7c1476260e5e > --- /dev/null > +++ b/Documentation/driver-api/pci/switch_discovery.rst > @@ -0,0 +1,52 @@ > +================================= > +Linux PCI Switch discovery module "switch discovery" sounds a lot like "switch enumeration". What you're doing here is not enumeration or discovery of the *switch*, it's discovery of these special non-architected paths between switches, so I think the name isn't quite right. > +================================= > + > +Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer > +without using host resources. For example, Broadcom(PLX) PCIe Switches have a > +capability where a single physical switch can be divided up into multiple > +virtual switches at SOD. PCIe switch discovery module detects the virtual links SOD? > +between the switches that belong to the same physical switch. > +This allows user space applications to discover these virtual links that belong > +to the same physical switch and configure optimized data paths. > + > +Userspace Interface > +=================== > + > +The module exposes sysfs entries for user space applications like MPI, NCCL, > +UCC, RCCL, HCCL, etc to discover the virtual switch links. > + > +Consider the below topology > + > + Host root bridge > + --------------------------------------- > + | | > + NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2 > +(af:00.0) (ad:00.0) (8b:00.0) (8d:00.0) > + | | > + GPU1 GPU2 > + (b0:00.0) (8e:00.0) > + SERVER 1 > + > +The simple topology above shows SERVER1, has Switch1 and Switch2 which are > +virtual switches that belong to the same physical switch that support > +Inter switch P2P. Need blank lines between paragraphs. Or rewrap to fill 75 columns or so if you intend a single paragraph. > +Switch1 and Switch2 have a GPU and NIC each connected. 
> +The module will detect the virtual P2P link existing between the two switches > +and create the sysfs entries as below. > + > +/sys/kernel/pci_switch_link/virtual_switch_links > +├── 0000:8b:00.0 > +│ └── 0000:ad:00.0 -> ../0000:ad:00.0 > +└── 0000:ad:00.0 > + └── 0000:8b:00.0 -> ../0000:8b:00.0 > + > +The HPC/AI libraries that analyze the topology can decide the optimal data > +path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a > +non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1. > + > +Enable P2P DMA to discover virtual links > +---------------------------------------- > +The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual > +link between the upstream PCI-to-PCI bridges of the devices and detect optimal > +path for applications using P2P DMA API. > diff --git a/MAINTAINERS b/MAINTAINERS > index 823387766a0c..b1bf3533ea6f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -17359,6 +17359,19 @@ F: Documentation/driver-api/pci/p2pdma.rst > F: drivers/pci/p2pdma.c > F: include/linux/pci-p2pdma.h > > +PCI SWITCH DISCOVERY > +M: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > +M: Sumanesh Samanta <sumanesh.samanta@broadcom.com> > +L: linux-pci@vger.kernel.org > +S: Maintained > +Q: https://patchwork.kernel.org/project/linux-pci/list/ > +B: https://bugzilla.kernel.org > +C: irc://irc.oftc.net/linux-pci > +T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git > +F: Documentation/driver-api/pci/switch_discovery.rst > +F: drivers/pci/switch/switch_discovery.c > +F: drivers/pci/switch/switch_discovery.h > + > PCI SUBSYSTEM > M: Bjorn Helgaas <bhelgaas@google.com> > L: linux-pci@vger.kernel.org > diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig > index d370f4ce0492..fb4410153950 100644 > --- a/drivers/pci/switch/Kconfig > +++ b/drivers/pci/switch/Kconfig > @@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC > devices. See <file:Documentation/driver-api/switchtec.rst> for more > information. 
> > +config PCI_SW_DISCOVERY > + depends on PCI > + tristate "PCI Switch discovery module" > + help > + This kernel module discovers the PCI-to-PCI bridges of PCIe switches > + and forms the virtual switch links if the bridges belong to the same > + Physical switch. The switch links help to identify shorter distances > + for P2P configurations. > + > endmenu > diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile > index acd56d3b4a35..a3584b5146af 100644 > --- a/drivers/pci/switch/Makefile > +++ b/drivers/pci/switch/Makefile > @@ -1,2 +1,3 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o > +obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o > diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c > new file mode 100644 > index 000000000000..a427d3885b1f > --- /dev/null > +++ b/drivers/pci/switch/switch_discovery.c > @@ -0,0 +1,375 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * PCI Switch Discovery module > + * > + * Copyright (c) 2024 Broadcom Inc. > + * > + * Authors: Broadcom Inc. 
> + * Sumanesh Samanta <sumanesh.samanta@broadcom.com> > + * Shivasharan S <shivasharan.srikanteshwara@broadcom.com> > + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include <linux/init.h> > +#include <linux/kernel.h> > +#include <linux/module.h> > +#include <linux/sysfs.h> > +#include <linux/slab.h> > +#include <linux/rwsem.h> > +#include <linux/pci.h> > +#include <linux/vmalloc.h> > +#include "switch_discovery.h" > + > +static DECLARE_RWSEM(sw_disc_rwlock); > +static struct kobject *sw_disc_kobj, *sw_link_kobj; > +static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH]; > +static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH); > + > +static struct switch_data *swdata; > + > +static int sw_disc_probe(void); > +static int sw_disc_create_sysfs_files(void); > +static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num); > + > +static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev) > +{ > + if ((pdev->vendor == PCI_VENDOR_ID_LSI) && > + ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) || > + (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC))) > + return true; > + > + return false; > +} > + > +static ssize_t sw_disc_show(struct kobject *kobj, > + struct kobj_attribute *attr, > + char *buf) > +{ > + int retval; > + > + down_write(&sw_disc_rwlock); > + retval = sw_disc_probe(); > + if (!retval) { > + pr_debug("No new switch found\n"); > + goto exit_success; > + } > + > + retval = sw_disc_create_sysfs_files(); > + if (retval < 0) { > + pr_err("Failed to create the sysfs entries, retval %d\n", > + retval); > + } > + > +exit_success: > + up_write(&sw_disc_rwlock); > + return sysfs_emit(buf, SWD_SCAN_DONE); > +} > + > +/* This function probes the PCIe devices for virtual links */ > +static int sw_disc_probe(void) > +{ > + int i, bit; > + struct pci_dev *pdev = NULL; > + int topology_changed = 0; > + DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH); > + > + while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) 
{ We might be forced to use pci_get_device() if we can't find a better way, but every call of this interface is suspect because it leaves the hotplug case unhandled. > + int sw_found; > + > + /* Currently this function only traverses Broadcom > + * PEX switches and determines the virtual SW links. > + * Other Switch vendors can add their specific logic > + * determine the virtual links. Use drivers/pci comment style (/* by itself on first line). Factor the things that you think are Broadcom-specific out to a function with a name that is obviously Broadcom-specific. > + */ > + if (!sw_disc_is_supported_pdev(pdev)) > + continue; > + > + sw_found = -1; > + > + for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) { > + if (swdata[bit].devfn == pdev->devfn && > + swdata[bit].bus == pdev->bus) { > + sw_found = bit; > + set_bit(sw_found, sw_found_map); > + break; > + } > + } > + > + if (sw_found != -1) > + continue; > + > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > + if (!swdata[i].bus) > + break; > + > + if (i >= SWD_MAX_VIRT_SWITCH) { > + pr_err("Max switch exceeded\n"); Use pci_info/err/etc when possible. "Max switch exceeded" by itself in the dmesg log is useless. You have a pdev available here, so use that. 
> + break; > + } > + > + sw_found = i; > + > + if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num)) > + continue; > + > + /* Found a new switch which supports P2P */ > + swdata[sw_found].devfn = pdev->devfn; > + swdata[sw_found].bus = pdev->bus; > + > + topology_changed = 1; > + set_bit(sw_found, sw_found_map); > + set_bit(sw_found, swdata_valid); > + } > + > + /* handle device removal */ > + for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) { > + if (test_bit(bit, swdata_valid)) { > + memset(&swdata[bit], 0, sizeof(swdata[i])); > + clear_bit(bit, swdata_valid); > + topology_changed = 1; > + } > + } > + > + return topology_changed; > +} > + > +/* Check the various config space registers of the Broadcom PCI device and > + * return true if the device supports inter switch P2P. > + * If P2P is supported, return the device serial number back to > + * caller. > + */ > +bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num) > +{ > + int base; > + u32 cap_data1, cap_data2; > + u16 vsec; > + u32 vsec_data; > + > + base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN); > + if (!base) { > + pr_debug("Failed to get extended capability bus %x devfn %x\n", > + pdev->bus->number, pdev->devfn); Another place to use pci_dbg with pdev. More below. There should not be any pr_info(), pr_debug(), etc calls unless there's something that cannot be associated with a device. 
> + return false; > + } > + > + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1); > + if (!vsec) { > + pr_debug("Failed to get VSEC bus %x devfn %x\n", > + pdev->bus->number, pdev->devfn); > + return false; > + } > + > + if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) > + return false; > + > + pci_read_config_dword(pdev, base + 8, &cap_data1); > + pci_read_config_dword(pdev, base + 4, &cap_data2); > + > + pci_read_config_dword(pdev, vsec + 12, &vsec_data); > + > + pr_debug("Found Broadcom device bus 0x%x devfn 0x%x " > + "Serial Number: 0x%x 0x%x, VSEC 0x%x\n", > + pdev->bus->number, pdev->devfn, > + cap_data1, cap_data2, vsec_data); > + > + if (!SECURE_PART(cap_data1)) > + return false; > + > + if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK)) > + return false; > + > + if (serial_num) > + snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2); > + > + return true; > +} > + > +static int sw_disc_create_sysfs_files(void) > +{ > + int i, j, retval; > + > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > + if (sw_kobj[i]) { > + kobject_put(sw_kobj[i]); > + sw_kobj[i] = NULL; > + } > + } > + > + if (sw_link_kobj) { > + kobject_put(sw_link_kobj); > + sw_link_kobj = NULL; > + } > + > + sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj); > + if (!sw_link_kobj) { > + pr_err("Failed to create pci link object\n"); > + return -ENOMEM; > + } > + > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > + int segment, bus, device, function; > + char bdf_i[SWD_MAX_CHAR]; > + > + if (!test_bit(i, swdata_valid)) > + continue; > + > + segment = pci_domain_nr(swdata[i].bus); > + bus = swdata[i].bus->number; > + device = pci_ari_enabled(swdata[i].bus) ? > + 0 : PCI_SLOT(swdata[i].devfn); > + function = pci_ari_enabled(swdata[i].bus) ? 
> + swdata[i].devfn : PCI_FUNC(swdata[i].devfn); > + sprintf(bdf_i, "%04x:%02x:%02x.%x", > + segment, bus, device, function); > + > + for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) { > + char bdf_j[SWD_MAX_CHAR]; > + > + if (!test_bit(j, swdata_valid)) > + continue; > + segment = pci_domain_nr(swdata[j].bus); > + bus = swdata[j].bus->number; > + device = pci_ari_enabled(swdata[j].bus) ? > + 0 : PCI_SLOT(swdata[j].devfn); > + function = pci_ari_enabled(swdata[j].bus) ? > + swdata[j].devfn : PCI_FUNC(swdata[j].devfn); > + sprintf(bdf_j, "%04x:%02x:%02x.%x", > + segment, bus, device, function); > + > + if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) { This gets too deep. Needs to be factored out somehow to avoid excessive indentation. > + if (!sw_kobj[i]) { > + sw_kobj[i] = kobject_create_and_add(bdf_i, > + sw_link_kobj); > + if (!sw_kobj[i]) { > + pr_err("Failed to create sysfs entry for switch %s\n", > + bdf_i); > + } > + } > + > + if (!sw_kobj[j]) { > + sw_kobj[j] = kobject_create_and_add(bdf_j, > + sw_link_kobj); > + if (!sw_kobj[j]) { > + pr_err("Failed to create sysfs entry for switch %s\n", > + bdf_j); > + } > + } > + > + retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j); > + if (retval) > + pr_err("Error creating symlink %s and %s\n", > + bdf_i, bdf_j); > + > + retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i); > + if (retval) > + pr_err("Error creating symlink %s and %s\n", > + bdf_j, bdf_i); > + } > + } > + } > + > + return 0; > +} > + > +/* > + * Check if the two pci devices have virtual P2P link available. > + * This function is used by the p2pdma to determine virtual > + * links between the PCI-to-PCI bridges > + */ > +bool sw_disc_check_virtual_link(struct pci_dev *a, > + struct pci_dev *b) "check" is not a meaningful name. "if (check(...))" suggests nothing about what true and false return values mean. Name it so reading the caller makes sense without looking at the implementation to find out what the return values mean. 
> +{ > + char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR]; > + > + /* > + * Check if the PCIe devices support Virtual P2P links > + */ > + if (!sw_disc_is_supported_pdev(a)) > + return false; > + > + if (!sw_disc_is_supported_pdev(b)) > + return false; > + > + if (brcm_sw_is_p2p_supported(a, serial_num_a) && > + brcm_sw_is_p2p_supported(b, serial_num_b)) > + if (!strcmp(serial_num_a, serial_num_b)) > + return true; > + > + return false; > +} > +EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link); > + > +static struct kobj_attribute sw_disc_attribute = > + __ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL); > + > +// Create attribute group We don't use // comments. > +static struct attribute *attrs[] = { > + &sw_disc_attribute.attr, > + NULL, > +}; > + > +static struct attribute_group attr_group = { > + .attrs = attrs, > +}; > + > +static int __init sw_discovery_init(void) > +{ > + int i, retval; > + > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) > + sw_kobj[i] = NULL; > + > + // Create "sw_disc" kobject > + sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj); > + if (!sw_disc_kobj) { > + pr_err("Failed to create sw_disc_kobj\n"); > + return -ENOMEM; > + } > + > + retval = sysfs_create_group(sw_disc_kobj, &attr_group); > + if (retval) { > + pr_err("Cannot register sysfs attribute group\n"); > + kobject_put(sw_disc_kobj); > + } > + > + swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL); > + if (!swdata) { > + sysfs_remove_group(sw_disc_kobj, &attr_group); > + kobject_put(sw_disc_kobj); > + return 0; > + } > + > + pr_info("Loading PCIe switch discovery module, version %s\n", > + SWITCH_DISC_VERSION); > + > + return 0; > +} > + > +static void __exit sw_discovery_exit(void) > +{ > + int i; > + > + if (!swdata) > + kfree(swdata); > + > + for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) { > + if (sw_kobj[i]) > + kobject_put(sw_kobj[i]); > + } > + > + // Remove kobject > + if (sw_link_kobj) > + kobject_put(sw_link_kobj); > + > + 
sysfs_remove_group(sw_disc_kobj, &attr_group); > + kobject_put(sw_disc_kobj); > +} > + > +module_init(sw_discovery_init); > +module_exit(sw_discovery_exit); > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("Broadcom Inc."); > +MODULE_VERSION(SWITCH_DISC_VERSION); > +MODULE_DESCRIPTION("PCIe Switch Discovery Module"); > +++ b/drivers/pci/switch/switch_discovery.h > +#define PCI_VENDOR_ID_LSI 0x1000 Already exists: PCI_VENDOR_ID_LSI_LOGIC ^ permalink raw reply [flat|nested] 10+ messages in thread
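The "gets too deep" comment on sw_disc_create_sysfs_files() could be addressed in part by hoisting the duplicated segment/bus/device/function formatting out of the nested loops. The following is a userspace sketch under stated assumptions, not a proposed patch: sketch_format_bdf() is a hypothetical name, the SKETCH_PCI_SLOT()/SKETCH_PCI_FUNC() macros mirror the kernel's PCI_SLOT()/PCI_FUNC() definitions, and the ari flag stands in for the result of pci_ari_enabled() on the bus.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Same bit layout as the kernel's PCI_SLOT() and PCI_FUNC(). */
#define SKETCH_PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define SKETCH_PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Format "dddd:bb:dd.f" once, so the caller's loops only compare serial
 * numbers and create links. With ARI the device number is 0 and the
 * whole devfn is the function number, as in the posted code.
 *
 * Returns what snprintf() returns: the length the full string needs,
 * which lets the caller detect truncation.
 */
static int sketch_format_bdf(char *buf, size_t len, int domain, int bus,
			     int devfn, bool ari)
{
	int device = ari ? 0 : SKETCH_PCI_SLOT(devfn);
	int function = ari ? devfn : SKETCH_PCI_FUNC(devfn);

	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			domain, bus, device, function);
}
```

Using snprintf() with an explicit length, instead of the sprintf() in the posted code, also speaks to the earlier review question about why the BDF buffer shares its size define with the serial number.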
* Re: [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges
2024-06-21 20:53 ` Bjorn Helgaas
@ 2024-06-24 19:31 ` Logan Gunthorpe
0 siblings, 0 replies; 10+ messages in thread
From: Logan Gunthorpe @ 2024-06-24 19:31 UTC (permalink / raw)
To: Bjorn Helgaas, Shivasharan S
Cc: linux-pci, bhelgaas, linux-kernel, sumanesh.samanta, sathya.prakash

On 2024-06-21 14:53, Bjorn Helgaas wrote:
> [+cc Logan]

My apologies. I'm dealing with some health issues at the moment and I'm
not able to review this in a timely fashion. I'll try to review this when
I'm able but that may not be for a few weeks.

Thanks,

Logan
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 2/2] pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links
2024-06-12 11:27 [PATCH 0/2] pci/switch_discovery: Add new module to discover inter-switch P2P links Shivasharan S
2024-06-12 11:27 ` [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges Shivasharan S
@ 2024-06-12 11:27 ` Shivasharan S
2024-06-21 20:47 ` Bjorn Helgaas
1 sibling, 1 reply; 10+ messages in thread
From: Shivasharan S @ 2024-06-12 11:27 UTC (permalink / raw)
To: linux-pci, bhelgaas
Cc: linux-kernel, sumanesh.samanta, sathya.prakash, Shivasharan S

[-- Attachment #1: Type: text/plain, Size: 2380 bytes --]

Update the p2p_dma_distance() to determine virtual inter-switch P2P links
existing between two switches and use this to calculate the DMA distance
between two devices.

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
---
 drivers/pci/Kconfig  |  1 +
 drivers/pci/p2pdma.c | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index d35001589d88..3e6226ec91fd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -174,6 +174,7 @@ config PCI_P2PDMA
 	depends on 64BIT
 	select GENERIC_ALLOCATOR
 	select NEED_SG_DMA_FLAGS
+	select PCI_SW_DISCOVERY
 	help
 	  Enables drivers to do PCI peer-to-peer transactions to and from
 	  BARs that are exposed in other devices that are the part of
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13cb500..780e649b3a1d 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -21,6 +21,8 @@
 #include <linux/seq_buf.h>
 #include <linux/xarray.h>
 
+extern bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b);
+
 struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
@@ -576,7 +578,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 		int *dist, bool verbose)
 {
 	enum pci_p2pdma_map_type map_type = PCI_P2PDMA_MAP_THRU_HOST_BRIDGE;
-	struct pci_dev *a = provider, *b = client, *bb;
+	struct pci_dev *a = provider, *b = client, *bb, *b_virtual_link = NULL;
 	bool acs_redirects = false;
 	struct pci_p2pdma *p2pdma;
 	struct seq_buf acs_list;
@@ -606,6 +608,17 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 			if (a == bb)
 				goto check_b_path_acs;
 
+			// Physical Broadcom PEX switches can be provisioned into
+			// multiple virtual switches.
+			// if both upstream bridges belong to the same physical
+			// switch, and the switch supports P2P,
+			// p2p_dma_distance() should take into account of such
+			// scenarios.
+			if (sw_disc_check_virtual_link(a, bb)) {
+				b_virtual_link = bb;
+				goto check_b_path_acs;
+			}
+
 			bb = pci_upstream_bridge(bb);
 			dist_b++;
 		}
@@ -629,6 +642,9 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
 			acs_cnt++;
 		}
 
+		if (b_virtual_link && bb == b_virtual_link)
+			break;
+
 		bb = pci_upstream_bridge(bb);
 	}
 
-- 
2.43.0

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4251 bytes --]
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links
2024-06-12 11:27 ` [PATCH 2/2] pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links Shivasharan S
@ 2024-06-21 20:47 ` Bjorn Helgaas
2024-08-10 19:33 ` Logan Gunthorpe
0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2024-06-21 20:47 UTC (permalink / raw)
To: Shivasharan S
Cc: linux-pci, bhelgaas, linux-kernel, sumanesh.samanta, sathya.prakash, Logan Gunthorpe

[+cc Logan]

On Wed, Jun 12, 2024 at 04:27:36AM -0700, Shivasharan S wrote:
> Update the p2p_dma_distance() to determine virtual inter-switch P2P links
> existing between two switches and use this to calculate the DMA distance
> between two devices.
>
> Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> ---
>  drivers/pci/Kconfig  |  1 +
>  drivers/pci/p2pdma.c | 18 +++++++++++++++++-
>  2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index d35001589d88..3e6226ec91fd 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -174,6 +174,7 @@ config PCI_P2PDMA
>  	depends on 64BIT
>  	select GENERIC_ALLOCATOR
>  	select NEED_SG_DMA_FLAGS
> +	select PCI_SW_DISCOVERY
>  	help
>  	  Enables drivers to do PCI peer-to-peer transactions to and from
>  	  BARs that are exposed in other devices that are the part of
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13cb500..780e649b3a1d 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -21,6 +21,8 @@
>  #include <linux/seq_buf.h>
>  #include <linux/xarray.h>
>
> +extern bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b);

This isn't the way we declare external references. Probably
drivers/pci/pci.h. Would need a "pci_" or "pcie_" prefix.

>  struct pci_p2pdma {
>  	struct gen_pool *pool;
>  	bool p2pmem_published;
> @@ -576,7 +578,7 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
>  		int *dist, bool verbose)
>  {
>  	enum pci_p2pdma_map_type map_type = PCI_P2PDMA_MAP_THRU_HOST_BRIDGE;
> -	struct pci_dev *a = provider, *b = client, *bb;
> +	struct pci_dev *a = provider, *b = client, *bb, *b_virtual_link = NULL;
>  	bool acs_redirects = false;
>  	struct pci_p2pdma *p2pdma;
>  	struct seq_buf acs_list;
> @@ -606,6 +608,17 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
>  			if (a == bb)
>  				goto check_b_path_acs;
>
> +			// Physical Broadcom PEX switches can be provisioned into
> +			// multiple virtual switches.
> +			// if both upstream bridges belong to the same physical
> +			// switch, and the switch supports P2P,
> +			// p2p_dma_distance() should take into account of such
> +			// scenarios.

Copy the comment style (/* */) of the rest of the file. Rewrap to fit
in 80 columns like the rest of the file. Capitalize sentences. Add
blank line between paragraphs.

> +			if (sw_disc_check_virtual_link(a, bb)) {
> +				b_virtual_link = bb;
> +				goto check_b_path_acs;
> +			}
> +
>  			bb = pci_upstream_bridge(bb);
>  			dist_b++;
>  		}
> @@ -629,6 +642,9 @@ calc_map_type_and_dist(struct pci_dev *provider, struct pci_dev *client,
>  			acs_cnt++;
>  		}
>
> +		if (b_virtual_link && bb == b_virtual_link)
> +			break;
> +
>  		bb = pci_upstream_bridge(bb);
>  	}
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 10+ messages in thread
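The request to move the declaration out of p2pdma.c and give it a "pci_" prefix might look roughly like the header fragment below. This is only a sketch under stated assumptions: pci_has_virtual_p2p_link() is a hypothetical name (chosen so that a true return reads naturally at the call site, per the naming comment on patch 1), and the stub branch is the common kernel idiom for optional features, not something the thread asks for. With the posted "select PCI_SW_DISCOVERY" the stub branch would never be compiled; it would only matter if that select were dropped.

```c
#include <stdbool.h>

struct pci_dev; /* opaque here; the kernel defines it in <linux/pci.h> */

#ifdef CONFIG_PCI_SW_DISCOVERY
/* Real implementation would live in the switch link discovery code. */
bool pci_has_virtual_p2p_link(struct pci_dev *a, struct pci_dev *b);
#else
/* Stub: without the feature built in, no virtual link is ever reported. */
static inline bool pci_has_virtual_p2p_link(struct pci_dev *a,
					    struct pci_dev *b)
{
	(void)a;
	(void)b;
	return false;
}
#endif
```

With a declaration like this in a shared header, calc_map_type_and_dist() would need no extern line of its own, and the call would read as "if (pci_has_virtual_p2p_link(a, bb))".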
* Re: [PATCH 2/2] pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links
2024-06-21 20:47 ` Bjorn Helgaas
@ 2024-08-10 19:33 ` Logan Gunthorpe
0 siblings, 0 replies; 10+ messages in thread
From: Logan Gunthorpe @ 2024-08-10 19:33 UTC (permalink / raw)
To: Bjorn Helgaas, Shivasharan S
Cc: linux-pci, bhelgaas, linux-kernel, sumanesh.samanta, sathya.prakash

Hi,

On 2024-06-21 14:47, Bjorn Helgaas wrote:
> [+cc Logan]
>
> On Wed, Jun 12, 2024 at 04:27:36AM -0700, Shivasharan S wrote:
>> Update the p2p_dma_distance() to determine virtual inter-switch P2P links
>> existing between two switches and use this to calculate the DMA distance
>> between two devices.
>>

I've reviewed this and the previous patch to the best of my current
ability and I haven't seen anything major to object to. The changes to the
P2P code specifically look good (once Bjorn's comments are addressed).

Please copy me on future postings and I'll do another pass with a
Reviewed-By tag.

Thanks,

Logan
^ permalink raw reply [flat|nested] 10+ messages in thread
Thread overview: 10+ messages
2024-06-12 11:27 [PATCH 0/2] pci/switch_discovery: Add new module to discover inter-switch P2P links Shivasharan S
2024-06-12 11:27 ` [PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges Shivasharan S
2024-06-13 12:40   ` Jonathan Cameron
2024-06-14 15:36     ` Sumanesh Samanta
2024-06-23 14:44       ` Manivannan Sadhasivam
2024-06-21 20:53   ` Bjorn Helgaas
2024-06-24 19:31     ` Logan Gunthorpe
2024-06-12 11:27 ` [PATCH 2/2] pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links Shivasharan S
2024-06-21 20:47   ` Bjorn Helgaas
2024-08-10 19:33     ` Logan Gunthorpe