LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 11/24] alpha/PCI: Add security_locked_down() check to pci_mmap_resource()
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Currently, Alpha's pci_mmap_resource() does not check
security_locked_down(LOCKDOWN_PCI_ACCESS) before allowing
userspace to mmap PCI BARs.

The generic version has had this check since commit eb627e17727e
("PCI: Lock down BAR access when the kernel is locked down") to
prevent DMA attacks when the kernel is locked down.

Add the same check to Alpha's pci_mmap_resource().

Fixes: eb627e17727e ("PCI: Lock down BAR access when the kernel is locked down")
Tested-by: Magnus Lindholm <linmag7@gmail.com>
Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Acked-by: Magnus Lindholm <linmag7@gmail.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 arch/alpha/kernel/pci-sysfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/pci-sysfs.c b/arch/alpha/kernel/pci-sysfs.c
index 3048758304b5..2324720c3e83 100644
--- a/arch/alpha/kernel/pci-sysfs.c
+++ b/arch/alpha/kernel/pci-sysfs.c
@@ -11,6 +11,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/security.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
 #include <linux/pci.h>
@@ -71,7 +72,11 @@ static int pci_mmap_resource(struct kobject *kobj,
 	struct resource *res = attr->private;
 	enum pci_mmap_state mmap_type;
 	struct pci_bus_region bar;
-	int i;
+	int i, ret;
+
+	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
+	if (ret)
+		return ret;
 
 	for (i = 0; i < PCI_STD_NUM_BARS; i++)
 		if (res == &pdev->resource[i])
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 10/24] PCI/sysfs: Limit pci_sysfs_init() late_initcall compile scope
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Currently, pci_sysfs_init() and sysfs_initialized compile
unconditionally, even on platforms where static attribute
groups handle all resource file creation.

Thus, place them behind a new HAVE_PCI_SYSFS_INIT macro,
especially as the late_initcall is only needed when:

  - HAVE_PCI_LEGACY is set, to iterate buses and create legacy
    I/O and memory files.

  - Neither HAVE_PCI_MMAP nor ARCH_GENERIC_PCI_MMAP_RESOURCE is
    set, to iterate devices and create resource files via the
    __weak pci_create_resource_files() stub override (this is
    how the Alpha architecture handles this currently).

On most systems both conditions are false and the entire
late_initcall compiles away.

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index b2a2de622f39..225c3d0db74e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -37,7 +37,14 @@
 #define ARCH_PCI_DEV_GROUPS
 #endif
 
+#if defined(HAVE_PCI_LEGACY) || \
+	!defined(HAVE_PCI_MMAP) && !defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
+#define HAVE_PCI_SYSFS_INIT
+#endif
+
+#ifdef HAVE_PCI_SYSFS_INIT
 static int sysfs_initialized;	/* = 0 */
+#endif
 
 /* show configuration fields */
 #define pci_config_attr(field, format_string)				\
@@ -1768,6 +1775,7 @@ void pci_remove_sysfs_dev_files(struct pci_dev *pdev)
 }
 #endif
 
+#ifdef HAVE_PCI_SYSFS_INIT
 static int __init pci_sysfs_init(void)
 {
 #if defined(HAVE_PCI_MMAP) || defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
@@ -1794,6 +1802,7 @@ static int __init pci_sysfs_init(void)
 	return 0;
 }
 late_initcall(pci_sysfs_init);
+#endif
 
 static struct attribute *pci_dev_dev_attrs[] = {
 	&dev_attr_boot_vga.attr,
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 09/24] PCI/sysfs: Add stubs for pci_{create,remove}_sysfs_dev_files()
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

On platforms with HAVE_PCI_MMAP or ARCH_GENERIC_PCI_MMAP_RESOURCE,
resource files are now handled by static attribute groups registered
via pci_dev_groups[].

Thus, the pci_create_sysfs_dev_files() and pci_remove_sysfs_dev_files()
can now be stubbed out, as the dynamic resource file creation is no
longer needed.

Also, simplify pci_sysfs_init() on these platforms to only iterate
buses for legacy attributes creation, skipping the per-device loop.

Move the __weak stubs for pci_create_resource_files() and
pci_remove_resource_files() into the #else branch since only platforms
without HAVE_PCI_MMAP (such as Alpha architecture) still need them.
Guard the res_attr[] and res_attr_wc[] fields in struct pci_dev the
same way.

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 15 ++++++++++++---
 include/linux/pci.h     |  4 ++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index f34415a401ec..b2a2de622f39 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1370,10 +1370,9 @@ static const struct attribute_group *pci_dev_resource_attr_groups[] = {
 };
 #else
 #define pci_dev_resource_attr_groups NULL
-#endif
-
 int __weak pci_create_resource_files(struct pci_dev *dev) { return 0; }
 void __weak pci_remove_resource_files(struct pci_dev *dev) { }
+#endif
 
 /**
  * pci_write_rom - used to enable access to the PCI ROM display
@@ -1742,6 +1741,10 @@ static const struct attribute_group pci_dev_resource_resize_attr_group = {
 	.is_visible = resource_resize_attr_is_visible,
 };
 
+#if defined(HAVE_PCI_MMAP) || defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
+int pci_create_sysfs_dev_files(struct pci_dev *pdev) { return 0; }
+void pci_remove_sysfs_dev_files(struct pci_dev *pdev) { }
+#else
 int __must_check pci_create_sysfs_dev_files(struct pci_dev *pdev)
 {
 	if (!sysfs_initialized)
@@ -1763,9 +1766,15 @@ void pci_remove_sysfs_dev_files(struct pci_dev *pdev)
 
 	pci_remove_resource_files(pdev);
 }
+#endif
 
 static int __init pci_sysfs_init(void)
 {
+#if defined(HAVE_PCI_MMAP) || defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
+	struct pci_bus *pbus = NULL;
+
+	sysfs_initialized = 1;
+#else
 	struct pci_dev *pdev = NULL;
 	struct pci_bus *pbus = NULL;
 	int retval;
@@ -1778,7 +1787,7 @@ static int __init pci_sysfs_init(void)
 			return retval;
 		}
 	}
-
+#endif
 	while ((pbus = pci_find_next_bus(pbus)))
 		pci_create_legacy_files(pbus);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 974605c9fce2..b998a56f6010 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -515,8 +515,10 @@ struct pci_dev {
 	spinlock_t	pcie_cap_lock;		/* Protects RMW ops in capability accessors */
 	u32		saved_config_space[16]; /* Config space saved at suspend time */
 	struct hlist_head saved_cap_space;
+#if !defined(HAVE_PCI_MMAP) && !defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
 	struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of resources */
+#endif
 
 #ifdef CONFIG_HOTPLUG_PCI_PCIE
 	unsigned int	broken_cmd_compl:1;	/* No compl for some cmds */
@@ -2531,8 +2533,10 @@ int pcibios_alloc_irq(struct pci_dev *dev);
 void pcibios_free_irq(struct pci_dev *dev);
 resource_size_t pcibios_default_alignment(void);
 
+#if !defined(HAVE_PCI_MMAP) && !defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
 extern int pci_create_resource_files(struct pci_dev *dev);
 extern void pci_remove_resource_files(struct pci_dev *dev);
+#endif
 
 #if defined(CONFIG_PCI_MMCONFIG) || defined(CONFIG_ACPI_MCFG)
 void __init pci_mmcfg_early_init(void);
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 08/24] PCI/sysfs: Warn about BAR resize failure in __resource_resize_store()
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Add a pci_warn() to __resource_resize_store(), so that BAR resize
failures are visible to the user, which can help troubleshoot any
potential resource resize issues.

While at it, rename the resource_resize_is_visible() to
resource_resize_attr_is_visible() along with the corresponding
group variable to align with the naming convention used by the
resource attribute groups.

Also, change the order of pci_dev_groups[] such that the resize
group is now located alongside the other resource groups.

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 0fa126d46e35..f34415a401ec 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1679,6 +1679,9 @@ static ssize_t __resource_resize_store(struct device *dev, int n,
 	sysfs_remove_groups(&pdev->dev.kobj, pci_dev_resource_attr_groups);
 
 	ret = pci_resize_resource(pdev, n, size, 0);
+	if (ret)
+		pci_warn(pdev, "Failed to resize BAR %d: %pe\n",
+			 n, ERR_PTR(ret));
 
 	pci_assign_unassigned_bus_resources(bus);
 
@@ -1726,7 +1729,7 @@ static struct attribute *resource_resize_attrs[] = {
 	NULL,
 };
 
-static umode_t resource_resize_is_visible(struct kobject *kobj,
+static umode_t resource_resize_attr_is_visible(struct kobject *kobj,
 					  struct attribute *a, int n)
 {
 	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
@@ -1734,9 +1737,9 @@ static umode_t resource_resize_is_visible(struct kobject *kobj,
 	return pci_rebar_get_current_size(pdev, n) < 0 ? 0 : a->mode;
 }
 
-static const struct attribute_group pci_dev_resource_resize_group = {
+static const struct attribute_group pci_dev_resource_resize_attr_group = {
 	.attrs = resource_resize_attrs,
-	.is_visible = resource_resize_is_visible,
+	.is_visible = resource_resize_attr_is_visible,
 };
 
 int __must_check pci_create_sysfs_dev_files(struct pci_dev *pdev)
@@ -1857,6 +1860,7 @@ const struct attribute_group *pci_dev_groups[] = {
 	&pci_dev_resource_uc_attr_group,
 	&pci_dev_resource_wc_attr_group,
 #endif
+	&pci_dev_resource_resize_attr_group,
 	&pci_dev_config_attr_group,
 	&pci_dev_rom_attr_group,
 	&pci_dev_reset_attr_group,
@@ -1868,7 +1872,6 @@ const struct attribute_group *pci_dev_groups[] = {
 #ifdef CONFIG_ACPI
 	&pci_dev_acpi_attr_group,
 #endif
-	&pci_dev_resource_resize_group,
 	ARCH_PCI_DEV_GROUPS
 	NULL,
 };
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 07/24] PCI/sysfs: Convert PCI resource files to static attributes
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Currently, the PCI resource files (resourceN, resourceN_wc) are
dynamically created by pci_create_sysfs_dev_files(), called from
both pci_bus_add_device() and the pci_sysfs_init() late_initcall,
with only a sysfs_initialized flag for synchronisation.  This has
caused "duplicate filename" warnings and boot panics when both
paths race on the same device.

This is especially likely on Devicetree-based platforms, where the
PCI host controllers are platform drivers that probe via the driver
model, which can happen during or after the late_initcall.  As such,
pci_bus_add_device() and pci_sysfs_init() are more likely to overlap.

Thus, convert to static const attributes with three attribute groups
(I/O, UC, WC), each with an .is_bin_visible callback that checks
resource flags, BAR length, and non_mappable_bars.  A .bin_size
callback provides pci_resource_len() to the kernfs node for correct
stat and lseek behaviour.

As part of this conversion:

  - Rename pci_read_resource_io() and pci_write_resource_io() to
    pci_read_resource() and pci_write_resource() since the callbacks
    are no longer I/O-specific in the static attribute context.

  - Update __resource_resize_store() to use sysfs_create_groups() and
    sysfs_remove_groups(), which re-evaluates visibility and runs the
    .bin_size callback for the static resource attribute groups.

  - Remove pci_create_resource_files(), pci_remove_resource_files(),
    and pci_create_attr() which are no longer needed.

  - Move the __weak stubs outside the #if guard so they remain
    available for callers converted in subsequent commits.

Platforms that do not define the HAVE_PCI_MMAP macro or the
ARCH_GENERIC_PCI_MMAP_RESOURCE macro, such as Alpha architecture,
continue using their platform-specific resource file creation.

For reference, the dynamic creation dates back to the pre-Git era:

  https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/drivers/pci/pci-sysfs.c?id=42298be0eeb5ae98453b3374c36161b05a46c5dc

The write-combine support was added in commit 45aec1ae72fc ("x86: PAT
export resource_wc in pci sysfs").

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 285 +++++++++++++++++++++-------------------
 include/linux/pci.h     |   2 -
 2 files changed, 152 insertions(+), 135 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d50b8fe1498c..0fa126d46e35 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -890,19 +890,6 @@ pci_llseek_resource_legacy(struct file *filep,
 	return fixed_size_llseek(filep, offset, whence, attr->size);
 }
 
-static __maybe_unused loff_t
-pci_llseek_resource(struct file *filep,
-		    struct kobject *kobj,
-		    const struct bin_attribute *attr,
-		    loff_t offset, int whence)
-{
-	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
-	int bar = (unsigned long)attr->private;
-
-	return fixed_size_llseek(filep, offset, whence,
-				 pci_resource_len(pdev, bar));
-}
-
 #ifdef HAVE_PCI_LEGACY
 /**
  * pci_read_legacy_io - read byte(s) from legacy I/O port space
@@ -1177,14 +1164,14 @@ static ssize_t pci_resource_io(struct file *filp, struct kobject *kobj,
 #endif
 }
 
-static ssize_t pci_read_resource_io(struct file *filp, struct kobject *kobj,
+static ssize_t pci_read_resource(struct file *filp, struct kobject *kobj,
 				    const struct bin_attribute *attr, char *buf,
 				    loff_t off, size_t count)
 {
 	return pci_resource_io(filp, kobj, attr, buf, off, count, false);
 }
 
-static ssize_t pci_write_resource_io(struct file *filp, struct kobject *kobj,
+static ssize_t pci_write_resource(struct file *filp, struct kobject *kobj,
 				     const struct bin_attribute *attr, char *buf,
 				     loff_t off, size_t count)
 {
@@ -1197,6 +1184,18 @@ static ssize_t pci_write_resource_io(struct file *filp, struct kobject *kobj,
 	return pci_resource_io(filp, kobj, attr, buf, off, count, true);
 }
 
+static loff_t pci_llseek_resource(struct file *filep,
+				  struct kobject *kobj,
+				  const struct bin_attribute *attr,
+				  loff_t offset, int whence)
+{
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+	int bar = (unsigned long)attr->private;
+
+	return fixed_size_llseek(filep, offset, whence,
+				 pci_resource_len(pdev, bar));
+}
+
 /*
  * generic_file_llseek() consults f_mapping->host to determine
  * the file size. As iomem_inode knows nothing about the
@@ -1215,8 +1214,8 @@ static ssize_t pci_write_resource_io(struct file *filp, struct kobject *kobj,
 static const struct bin_attribute dev_resource##_bar##_io_attr = {	\
 	.attr = { .name = "resource" __stringify(_bar), .mode = 0600 },	\
 	.private = (void *)(unsigned long)(_bar),			\
-	.read = pci_read_resource_io,					\
-	.write = pci_write_resource_io,					\
+	.read = pci_read_resource,					\
+	.write = pci_write_resource,					\
 	__PCI_RESOURCE_IO_MMAP_ATTRS					\
 }
 
@@ -1238,129 +1237,144 @@ static const struct bin_attribute dev_resource##_bar##_wc_attr = {		\
 	.mmap = pci_mmap_resource_wc,						\
 }
 
-/**
- * pci_remove_resource_files - cleanup resource files
- * @pdev: dev to cleanup
- *
- * If we created resource files for @pdev, remove them from sysfs and
- * free their resources.
- */
-static void pci_remove_resource_files(struct pci_dev *pdev)
+static inline umode_t
+__pci_resource_attr_is_visible(struct kobject *kobj,
+			       const struct bin_attribute *a,
+			       int bar, bool write_combine,
+			       unsigned long flags)
 {
-	int i;
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
 
-	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
-		struct bin_attribute *res_attr;
-
-		res_attr = pdev->res_attr[i];
-		if (res_attr) {
-			sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
-			kfree(res_attr);
-		}
-
-		res_attr = pdev->res_attr_wc[i];
-		if (res_attr) {
-			sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
-			kfree(res_attr);
-		}
-	}
-}
-
-static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
-{
-	/* allocate attribute structure, piggyback attribute name */
-	int name_len = write_combine ? 13 : 10;
-	struct bin_attribute *res_attr;
-	char *res_attr_name;
-	int retval;
-
-	res_attr = kzalloc(sizeof(*res_attr) + name_len, GFP_ATOMIC);
-	if (!res_attr)
-		return -ENOMEM;
-
-	res_attr_name = (char *)(res_attr + 1);
-
-	sysfs_bin_attr_init(res_attr);
-	if (write_combine) {
-		sprintf(res_attr_name, "resource%d_wc", num);
-		res_attr->mmap = pci_mmap_resource_wc;
-	} else {
-		sprintf(res_attr_name, "resource%d", num);
-		if (pci_resource_flags(pdev, num) & IORESOURCE_IO) {
-			res_attr->read = pci_read_resource_io;
-			res_attr->write = pci_write_resource_io;
-			if (arch_can_pci_mmap_io())
-				res_attr->mmap = pci_mmap_resource_uc;
-		} else {
-			res_attr->mmap = pci_mmap_resource_uc;
-		}
-	}
-	if (res_attr->mmap) {
-		res_attr->f_mapping = iomem_get_mapping;
-		/*
-		 * generic_file_llseek() consults f_mapping->host to determine
-		 * the file size. As iomem_inode knows nothing about the
-		 * attribute, it's not going to work, so override it as well.
-		 */
-		res_attr->llseek = pci_llseek_resource;
-	}
-	res_attr->attr.name = res_attr_name;
-	res_attr->attr.mode = 0600;
-	res_attr->size = pci_resource_len(pdev, num);
-	res_attr->private = (void *)(unsigned long)num;
-	retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
-	if (retval) {
-		kfree(res_attr);
-		return retval;
-	}
-
-	if (write_combine)
-		pdev->res_attr_wc[num] = res_attr;
-	else
-		pdev->res_attr[num] = res_attr;
-
-	return 0;
-}
-
-/**
- * pci_create_resource_files - create resource files in sysfs for @dev
- * @pdev: dev in question
- *
- * Walk the resources in @pdev creating files for each resource available.
- */
-static int pci_create_resource_files(struct pci_dev *pdev)
-{
-	int i;
-	int retval;
-
-	/* Skip devices with non-mappable BARs */
 	if (pdev->non_mappable_bars)
 		return 0;
 
-	/* Expose the PCI resources from this device as files */
-	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+	if (!pci_resource_len(pdev, bar))
+		return 0;
 
-		/* skip empty resources */
-		if (!pci_resource_len(pdev, i))
-			continue;
+	if ((pci_resource_flags(pdev, bar) & flags) != flags)
+		return 0;
 
-		retval = pci_create_attr(pdev, i, 0);
-		/* for prefetchable resources, create a WC mappable file */
-		if (!retval && arch_can_pci_mmap_wc() &&
-		    pci_resource_flags(pdev, i) & IORESOURCE_PREFETCH)
-			retval = pci_create_attr(pdev, i, 1);
-		if (retval) {
-			pci_remove_resource_files(pdev);
-			return retval;
-		}
-	}
-	return 0;
+	if (write_combine && !arch_can_pci_mmap_wc())
+		return 0;
+
+	return a->attr.mode;
 }
-#else /* !(defined(HAVE_PCI_MMAP) || defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)) */
-int __weak pci_create_resource_files(struct pci_dev *dev) { return 0; }
-void __weak pci_remove_resource_files(struct pci_dev *dev) { return; }
+
+static umode_t pci_dev_resource_io_is_visible(struct kobject *kobj,
+					      const struct bin_attribute *a,
+					      int n)
+{
+	return __pci_resource_attr_is_visible(kobj, a, n, false,
+					      IORESOURCE_IO);
+}
+
+static umode_t pci_dev_resource_uc_is_visible(struct kobject *kobj,
+					      const struct bin_attribute *a,
+					      int n)
+{
+	return __pci_resource_attr_is_visible(kobj, a, n, false,
+					      IORESOURCE_MEM);
+}
+
+static umode_t pci_dev_resource_wc_is_visible(struct kobject *kobj,
+					      const struct bin_attribute *a,
+					      int n)
+{
+	return __pci_resource_attr_is_visible(kobj, a, n, true,
+					      IORESOURCE_MEM | IORESOURCE_PREFETCH);
+}
+
+static size_t pci_dev_resource_bin_size(struct kobject *kobj,
+					const struct bin_attribute *a,
+					int n)
+{
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+
+	return pci_resource_len(pdev, n);
+}
+
+pci_dev_resource_io_attr(0);
+pci_dev_resource_io_attr(1);
+pci_dev_resource_io_attr(2);
+pci_dev_resource_io_attr(3);
+pci_dev_resource_io_attr(4);
+pci_dev_resource_io_attr(5);
+
+pci_dev_resource_uc_attr(0);
+pci_dev_resource_uc_attr(1);
+pci_dev_resource_uc_attr(2);
+pci_dev_resource_uc_attr(3);
+pci_dev_resource_uc_attr(4);
+pci_dev_resource_uc_attr(5);
+
+pci_dev_resource_wc_attr(0);
+pci_dev_resource_wc_attr(1);
+pci_dev_resource_wc_attr(2);
+pci_dev_resource_wc_attr(3);
+pci_dev_resource_wc_attr(4);
+pci_dev_resource_wc_attr(5);
+
+static const struct bin_attribute *const pci_dev_resource_io_attrs[] = {
+	&dev_resource0_io_attr,
+	&dev_resource1_io_attr,
+	&dev_resource2_io_attr,
+	&dev_resource3_io_attr,
+	&dev_resource4_io_attr,
+	&dev_resource5_io_attr,
+	NULL,
+};
+
+static const struct bin_attribute *const pci_dev_resource_uc_attrs[] = {
+	&dev_resource0_uc_attr,
+	&dev_resource1_uc_attr,
+	&dev_resource2_uc_attr,
+	&dev_resource3_uc_attr,
+	&dev_resource4_uc_attr,
+	&dev_resource5_uc_attr,
+	NULL,
+};
+
+static const struct bin_attribute *const pci_dev_resource_wc_attrs[] = {
+	&dev_resource0_wc_attr,
+	&dev_resource1_wc_attr,
+	&dev_resource2_wc_attr,
+	&dev_resource3_wc_attr,
+	&dev_resource4_wc_attr,
+	&dev_resource5_wc_attr,
+	NULL,
+};
+
+static const struct attribute_group pci_dev_resource_io_attr_group = {
+	.bin_attrs = pci_dev_resource_io_attrs,
+	.is_bin_visible = pci_dev_resource_io_is_visible,
+	.bin_size = pci_dev_resource_bin_size,
+};
+
+static const struct attribute_group pci_dev_resource_uc_attr_group = {
+	.bin_attrs = pci_dev_resource_uc_attrs,
+	.is_bin_visible = pci_dev_resource_uc_is_visible,
+	.bin_size = pci_dev_resource_bin_size,
+};
+
+static const struct attribute_group pci_dev_resource_wc_attr_group = {
+	.bin_attrs = pci_dev_resource_wc_attrs,
+	.is_bin_visible = pci_dev_resource_wc_is_visible,
+	.bin_size = pci_dev_resource_bin_size,
+};
+
+static const struct attribute_group *pci_dev_resource_attr_groups[] = {
+	&pci_dev_resource_io_attr_group,
+	&pci_dev_resource_uc_attr_group,
+	&pci_dev_resource_wc_attr_group,
+	NULL,
+};
+#else
+#define pci_dev_resource_attr_groups NULL
 #endif
 
+int __weak pci_create_resource_files(struct pci_dev *dev) { return 0; }
+void __weak pci_remove_resource_files(struct pci_dev *dev) { }
+
 /**
  * pci_write_rom - used to enable access to the PCI ROM display
  * @filp: sysfs file
@@ -1662,14 +1676,14 @@ static ssize_t __resource_resize_store(struct device *dev, int n,
 	pci_write_config_word(pdev, PCI_COMMAND,
 			      cmd & ~PCI_COMMAND_MEMORY);
 
-	pci_remove_resource_files(pdev);
+	sysfs_remove_groups(&pdev->dev.kobj, pci_dev_resource_attr_groups);
 
 	ret = pci_resize_resource(pdev, n, size, 0);
 
 	pci_assign_unassigned_bus_resources(bus);
 
-	if (pci_create_resource_files(pdev))
-		pci_warn(pdev, "Failed to recreate resource files after BAR resizing\n");
+	if (sysfs_create_groups(&pdev->dev.kobj, pci_dev_resource_attr_groups))
+		pci_warn(pdev, "Failed to recreate resource groups after BAR resizing\n");
 
 	pci_write_config_word(pdev, PCI_COMMAND, cmd);
 pm_put:
@@ -1838,6 +1852,11 @@ static const struct attribute_group pci_dev_group = {
 
 const struct attribute_group *pci_dev_groups[] = {
 	&pci_dev_group,
+#if defined(HAVE_PCI_MMAP) || defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
+	&pci_dev_resource_io_attr_group,
+	&pci_dev_resource_uc_attr_group,
+	&pci_dev_resource_wc_attr_group,
+#endif
 	&pci_dev_config_attr_group,
 	&pci_dev_rom_attr_group,
 	&pci_dev_reset_attr_group,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e93fbe6b57fe..974605c9fce2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2531,10 +2531,8 @@ int pcibios_alloc_irq(struct pci_dev *dev);
 void pcibios_free_irq(struct pci_dev *dev);
 resource_size_t pcibios_default_alignment(void);
 
-#if !defined(HAVE_PCI_MMAP) && !defined(ARCH_GENERIC_PCI_MMAP_RESOURCE)
 extern int pci_create_resource_files(struct pci_dev *dev);
 extern void pci_remove_resource_files(struct pci_dev *dev);
-#endif
 
 #if defined(CONFIG_PCI_MMCONFIG) || defined(CONFIG_ACPI_MCFG)
 void __init pci_mmcfg_early_init(void);
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 06/24] PCI/sysfs: Add static PCI resource attribute macros
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Add three macros for declaring static binary attributes for PCI
resource files:

  - pci_dev_resource_io_attr(), for I/O BAR resources (read/write)
  - pci_dev_resource_uc_attr(), for memory BAR resources (mmap uncached)
  - pci_dev_resource_wc_attr(), for write-combine resources (mmap WC)

Each macro only sets the callbacks its resource type needs.  The I/O
macro conditionally includes mmap support via __PCI_RESOURCE_IO_MMAP_ATTRS
on architectures where arch_can_pci_mmap_io() is true at compile time
(such as PowerPC, SPARC, and Xtensa).

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index dac780597727..d50b8fe1498c 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1197,6 +1197,47 @@ static ssize_t pci_write_resource_io(struct file *filp, struct kobject *kobj,
 	return pci_resource_io(filp, kobj, attr, buf, off, count, true);
 }
 
+/*
+ * generic_file_llseek() consults f_mapping->host to determine
+ * the file size. As iomem_inode knows nothing about the
+ * attribute, it's not going to work, so override it as well.
+ */
+#if arch_can_pci_mmap_io()
+# define __PCI_RESOURCE_IO_MMAP_ATTRS		\
+	.f_mapping = iomem_get_mapping,		\
+	.llseek = pci_llseek_resource,		\
+	.mmap = pci_mmap_resource_uc,
+#else
+# define __PCI_RESOURCE_IO_MMAP_ATTRS
+#endif
+
+#define pci_dev_resource_io_attr(_bar)					\
+static const struct bin_attribute dev_resource##_bar##_io_attr = {	\
+	.attr = { .name = "resource" __stringify(_bar), .mode = 0600 },	\
+	.private = (void *)(unsigned long)(_bar),			\
+	.read = pci_read_resource_io,					\
+	.write = pci_write_resource_io,					\
+	__PCI_RESOURCE_IO_MMAP_ATTRS					\
+}
+
+#define pci_dev_resource_uc_attr(_bar)					\
+static const struct bin_attribute dev_resource##_bar##_uc_attr = {	\
+	.attr = { .name = "resource" __stringify(_bar), .mode = 0600 },	\
+	.private = (void *)(unsigned long)(_bar),			\
+	.f_mapping = iomem_get_mapping,					\
+	.llseek = pci_llseek_resource,					\
+	.mmap = pci_mmap_resource_uc,					\
+}
+
+#define pci_dev_resource_wc_attr(_bar)						\
+static const struct bin_attribute dev_resource##_bar##_wc_attr = {		\
+	.attr = { .name = "resource" __stringify(_bar) "_wc", .mode = 0600 },	\
+	.private = (void *)(unsigned long)(_bar),				\
+	.f_mapping = iomem_get_mapping,						\
+	.llseek = pci_llseek_resource,						\
+	.mmap = pci_mmap_resource_wc,						\
+}
+
 /**
  * pci_remove_resource_files - cleanup resource files
  * @pdev: dev to cleanup
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 05/24] PCI/sysfs: Add CAP_SYS_ADMIN check to __resource_resize_store()
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Currently, the __resource_resize_store() allows writing to the
resourceN_resize sysfs attribute to change a BAR's size without
checking for capabilities, currently relying only on the file
access check.

Resizing a BAR modifies PCI device configuration and can disrupt
active drivers.  After the upcoming conversion to static attributes,
it will also trigger resource file updates via sysfs_update_groups().

Thus, add a CAP_SYS_ADMIN check to prevent unprivileged users from
performing BAR resize operations.

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2280b7edb41f..dac780597727 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1596,6 +1596,9 @@ static ssize_t __resource_resize_store(struct device *dev, int n,
 	int ret;
 	u16 cmd;
 
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (kstrtoul(buf, 0, &size) < 0)
 		return -EINVAL;
 
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 04/24] PCI/sysfs: Split pci_llseek_resource() for device and legacy attributes
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Both legacy and resource attributes set .f_mapping = iomem_get_mapping,
so the default generic_file_llseek() would consult iomem_inode for the
file size, which knows nothing about the attribute.  That is why custom
llseek callbacks exist.

Currently, the legacy and resource attributes have .size set at creation
time, as such, using the attr->size is sufficient.  However, the upcoming
static resource attributes will have .size == 0 set, since they are const,
and the .bin_size callback will be used to provide the real size to kernfs
instead.

The legacy attributes operate on a struct pci_bus, not struct pci_dev,
so calling to_pci_dev() on them would be invalid.

Thus, split pci_llseek_resource() into two functions:

  - pci_llseek_resource(), which derives the file size from the BAR
    using pci_resource_len().

  - pci_llseek_resource_legacy(), which uses attr->size directly.

Update the dynamic legacy attribute creation to use the new
pci_llseek_resource_legacy() callback.

The original pci_llseek_resource() was added in commit 24de09c16f97
("PCI: Implement custom llseek for sysfs resource entries").

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2e4e226e78d4..2280b7edb41f 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -881,13 +881,26 @@ static const struct attribute_group pci_dev_config_attr_group = {
  * llseek operation for mmappable PCI resources.
  * May be left unused if the arch doesn't provide them.
  */
+static __maybe_unused loff_t
+pci_llseek_resource_legacy(struct file *filep,
+			   struct kobject *kobj __always_unused,
+			   const struct bin_attribute *attr,
+			   loff_t offset, int whence)
+{
+	return fixed_size_llseek(filep, offset, whence, attr->size);
+}
+
 static __maybe_unused loff_t
 pci_llseek_resource(struct file *filep,
-		    struct kobject *kobj __always_unused,
+		    struct kobject *kobj,
 		    const struct bin_attribute *attr,
 		    loff_t offset, int whence)
 {
-	return fixed_size_llseek(filep, offset, whence, attr->size);
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+	int bar = (unsigned long)attr->private;
+
+	return fixed_size_llseek(filep, offset, whence,
+				 pci_resource_len(pdev, bar));
 }
 
 #ifdef HAVE_PCI_LEGACY
@@ -1022,7 +1035,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_io->read = pci_read_legacy_io;
 	b->legacy_io->write = pci_write_legacy_io;
 	/* See pci_create_attr() for motivation */
-	b->legacy_io->llseek = pci_llseek_resource;
+	b->legacy_io->llseek = pci_llseek_resource_legacy;
 	b->legacy_io->mmap = pci_mmap_legacy_io;
 	b->legacy_io->f_mapping = iomem_get_mapping;
 	pci_adjust_legacy_attr(b, pci_mmap_io);
@@ -1038,7 +1051,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_mem->attr.mode = 0600;
 	b->legacy_mem->mmap = pci_mmap_legacy_mem;
 	/* See pci_create_attr() for motivation */
-	b->legacy_mem->llseek = pci_llseek_resource;
+	b->legacy_mem->llseek = pci_llseek_resource_legacy;
 	b->legacy_mem->f_mapping = iomem_get_mapping;
 	pci_adjust_legacy_attr(b, pci_mmap_mem);
 	error = device_create_bin_file(&b->dev, b->legacy_mem);
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 03/24] PCI/sysfs: Only allow supported resource types in I/O and MMIO helpers
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Currently, when the sysfs attributes for PCI resources are added
dynamically, the resource access callbacks are only set when the
underlying BAR type matches, using .read and .write for IORESOURCE_IO,
and .mmap for IORESOURCE_MEM or IORESOURCE_IO with arch_can_pci_mmap_io()
support.  As such, when the callback is not set, the operation inherently
fails.

After the conversion to static attributes, visibility callbacks will
control which resource files appear for each BAR, but the callbacks
themselves will always be set.

Thus, add a type check to pci_resource_io() and pci_mmap_resource()
to return -EIO for an unsupported resource type.

Use the new pci_resource_is_io() and pci_resource_is_mem() helpers
for the type checks, replacing the open-coded bitwise flag tests and
also drop the local struct resource pointer in pci_mmap_resource().

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 1fbc3daf87cc..2e4e226e78d4 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1082,20 +1082,24 @@ static int pci_mmap_resource(struct kobject *kobj, const struct bin_attribute *a
 	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
 	int bar = (unsigned long)attr->private;
 	enum pci_mmap_state mmap_type;
-	struct resource *res = pci_resource_n(pdev, bar);
 	int ret;
 
 	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
 	if (ret)
 		return ret;
 
-	if (res->flags & IORESOURCE_MEM && iomem_is_exclusive(res->start))
+	if (!pci_resource_is_mem(pdev, bar) &&
+	    !(pci_resource_is_io(pdev, bar) && arch_can_pci_mmap_io()))
+		return -EIO;
+
+	if (pci_resource_is_mem(pdev, bar) &&
+	    iomem_is_exclusive(pci_resource_start(pdev, bar)))
 		return -EINVAL;
 
 	if (!pci_mmap_fits(pdev, bar, vma, PCI_MMAP_SYSFS))
 		return -EINVAL;
 
-	mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io;
+	mmap_type = pci_resource_is_mem(pdev, bar) ? pci_mmap_mem : pci_mmap_io;
 
 	return pci_mmap_resource_range(pdev, bar, vma, mmap_type, write_combine);
 }
@@ -1123,6 +1127,9 @@ static ssize_t pci_resource_io(struct file *filp, struct kobject *kobj,
 	int bar = (unsigned long)attr->private;
 	unsigned long port = off;
 
+	if (!pci_resource_is_io(pdev, bar))
+		return -EIO;
+
 	port += pci_resource_start(pdev, bar);
 
 	if (port > pci_resource_end(pdev, bar))
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 02/24] PCI: Add pci_resource_is_io() and pci_resource_is_mem() helpers
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Add helpers to check whether a PCI resource is of I/O port or
memory type.  These replace the open-coded pci_resource_flags()
with IORESOURCE_IO and IORESOURCE_MEM pattern used across the
tree.

Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 include/linux/pci.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..e93fbe6b57fe 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2300,6 +2300,31 @@ int pci_iobar_pfn(struct pci_dev *pdev, int bar, struct vm_area_struct *vma);
 	CONCATENATE(__pci_dev_for_each_res, COUNT_ARGS(__VA_ARGS__)) 	\
 		    (dev, res, __VA_ARGS__)
 
+/**
+ * pci_resource_is_io - check if a PCI resource is of I/O port type.
+ * @dev: PCI device to check.
+ * @resno: The resource number (BAR index) to check.
+ *
+ * Returns true if the resource type is I/O port.
+ */
+static inline bool pci_resource_is_io(const struct pci_dev *dev, int resno)
+{
+	return resource_type(pci_resource_n(dev, resno)) == IORESOURCE_IO;
+}
+
+/**
+ * pci_resource_is_mem - check if a PCI resource is of memory type.
+ * @dev: PCI device to check.
+ * @resno: The resource number (BAR index) to check.
+ *
+ * Returns true if the resource type is memory, including
+ * prefetchable memory.
+ */
+static inline bool pci_resource_is_mem(const struct pci_dev *dev, int resno)
+{
+	return resource_type(pci_resource_n(dev, resno)) == IORESOURCE_MEM;
+}
+
 /*
  * Similar to the helpers above, these manipulate per-pci_dev
  * driver-specific data.  They are really just a wrapper around
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 01/24] PCI/sysfs: Use PCI resource accessor macros
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev
In-Reply-To: <20260508043543.217179-1-kwilczynski@kernel.org>

Replace direct pdev->resource[] accesses with pci_resource_n(),
and pdev->resource[].flags accesses with pci_resource_flags().

No functional changes intended.

Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
---
 drivers/pci/pci-sysfs.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d37860841260..1fbc3daf87cc 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -177,7 +177,7 @@ static ssize_t resource_show(struct device *dev, struct device_attribute *attr,
 		max = PCI_BRIDGE_RESOURCES;
 
 	for (i = 0; i < max; i++) {
-		struct resource *res =  &pci_dev->resource[i];
+		struct resource *res = pci_resource_n(pci_dev, i);
 		struct resource zerores = {};
 
 		/* For backwards compatibility */
@@ -689,7 +689,7 @@ static ssize_t boot_vga_show(struct device *dev, struct device_attribute *attr,
 		return sysfs_emit(buf, "%u\n", (pdev == vga_dev));
 
 	return sysfs_emit(buf, "%u\n",
-			  !!(pdev->resource[PCI_ROM_RESOURCE].flags &
+			  !!(pci_resource_flags(pdev, PCI_ROM_RESOURCE) &
 			     IORESOURCE_ROM_SHADOW));
 }
 static DEVICE_ATTR_RO(boot_vga);
@@ -1082,7 +1082,7 @@ static int pci_mmap_resource(struct kobject *kobj, const struct bin_attribute *a
 	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
 	int bar = (unsigned long)attr->private;
 	enum pci_mmap_state mmap_type;
-	struct resource *res = &pdev->resource[bar];
+	struct resource *res = pci_resource_n(pdev, bar);
 	int ret;
 
 	ret = security_locked_down(LOCKDOWN_PCI_ACCESS);
@@ -1286,7 +1286,7 @@ static int pci_create_resource_files(struct pci_dev *pdev)
 		retval = pci_create_attr(pdev, i, 0);
 		/* for prefetchable resources, create a WC mappable file */
 		if (!retval && arch_can_pci_mmap_wc() &&
-		    pdev->resource[i].flags & IORESOURCE_PREFETCH)
+		    pci_resource_flags(pdev, i) & IORESOURCE_PREFETCH)
 			retval = pci_create_attr(pdev, i, 1);
 		if (retval) {
 			pci_remove_resource_files(pdev);
-- 
2.54.0



^ permalink raw reply related

* [PATCH v7 00/24] PCI: Convert all dynamic sysfs attributes to static
From: Krzysztof Wilczyński @ 2026-05-08  4:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Alex Williamson, Magnus Lindholm, Matt Turner, Richard Henderson,
	Christophe Leroy, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Dexuan Cui, Krzysztof Hałasa, Lukas Wunner,
	Oliver O'Halloran, Saurabh Singh Sengar, Shuan He,
	Srivatsa Bhat, Ilpo Järvinen, linux-pci, linux-alpha,
	linuxppc-dev

Hello,

This series converts every dynamically allocated PCI sysfs attribute to
a static const definition.  After the full series, pci_sysfs_init() and
sysfs_initialized are gone, and every sysfs file is created by the
driver model at device_add() time.

Currently, the PCI resource files (resourceN, resourceN_wc) and the
legacy bus files (legacy_io, legacy_mem) are created dynamically
from two unsynchronised paths:

Path A: late_initcall

  pci_sysfs_init()                        (late_initcall)
    sysfs_initialized = 1
    for_each_pci_dev()
      pci_create_sysfs_dev_files()
        sysfs_create_bin_file()           (resourceN, resourceN_wc)
    pci_find_next_bus()
      pci_create_legacy_files()
        sysfs_create_bin_file()           (legacy_io, legacy_mem)

Path B: device registration / hotplug

  pci_bus_add_devices()
    pci_bus_add_device()
      pci_create_sysfs_dev_files()
        if (!sysfs_initialized)           <- only guard
          return
        sysfs_create_bin_file()           (resourceN, resourceN_wc)

On most ACPI systems this does not race because PCI enumeration
completes at subsys_initcall time, before pci_sysfs_init() runs:

  subsys_initcall (level 4):
    acpi_pci_root_add()
      pci_bus_add_device()
        pci_create_sysfs_dev_files()
          if (!sysfs_initialized)         <- variable not yet set
            return -EACCES

  late_initcall (level 7):
    pci_sysfs_init()
      sysfs_initialized = 1
      for_each_pci_dev()
        pci_create_sysfs_dev_files()      <- creates the files, no race

On Devicetree platforms the host controller is a platform driver that
probes via the driver model, often on a workqueue, and overlaps with the
late_initcall:

  CPU 0 (late_initcall)                CPU 1 (driver probe)
  ---------------------------          ----------------------------
  pci_sysfs_init()
    sysfs_initialized = 1
    for_each_pci_dev()                 pci_bus_add_device()
      pci_create_sysfs_dev_files()       pci_create_sysfs_dev_files()
        sysfs_create_bin_file()            sysfs_create_bin_file()
                                             -> "duplicate filename"

The same happens on ACPI when probing is asynchronous (hv_pci on
Azure, RISC-V with ACPI).

The duplicate causes sysfs_create_bin_file() to fail with -EEXIST.
pci_create_resource_files() then calls pci_remove_resource_files() in
its error unwind, tearing down files the other thread created and
still references through pdev->res_attr[].  This has caused kernel
panics on i.MX6 and boot failures on other platforms.

Several different fixes have been proposed over the years: reordering
the sysfs_initialized assignment, adding locks, checking
pci_dev_is_added(), setting pdev->res_attr[] to NULL after kfree
(which only prevents a double-free on the teardown path, not the
error unwind removing the other thread's files).  None would address the
root cause.

This has been reported a few times:

  - https://lore.kernel.org/linux-pci/20250702155112.40124-1-heshuan@bytedance.com/
  - https://lore.kernel.org/linux-pci/b51519d6-ce45-4b6d-8135-c70169bd110e@h-partners.com/
  - https://lore.kernel.org/linux-pci/1702093576-30405-1-git-send-email-ssengar@linux.microsoft.com/
  - https://lore.kernel.org/linux-pci/SY0P300MB04687548090B73E40AF97D8897B82@SY0P300MB0468.AUSP300.PROD.OUTLOOK.COM/
  - https://lore.kernel.org/linux-pci/20230105174736.GA1154719@bhelgaas/
  - https://lore.kernel.org/linux-pci/m3eebg9puj.fsf@t19.piap.pl/
  - https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/
  - https://lore.kernel.org/linux-pci/1366196798-15929-1-git-send-email-artem.savkov@gmail.com/
  - https://bugzilla.kernel.org/show_bug.cgi?id=215515
  - https://bugzilla.kernel.org/show_bug.cgi?id=216888

With static attributes the driver model creates sysfs entries once per
device at device_add() time, under the device lock, eliminating the
late_initcall iteration and the race along with it.

	Krzysztof

---
Changes in v7:
  https://lore.kernel.org/linux-pci/20260422161407.118748-1-kwilczynski@kernel.org/

  - Addded Alex Williamson (author of the resource resize sysfs
    attributes) to the list of recipients for visibility.
  - Split pci_llseek_resource() into pci_llseek_resource() and
    pci_llseek_resource_legacy() since legacy attributes operate
    on a struct pci_bus where to_pci_dev() would be invalid,
    as per Bjorn Helgaas' feedback.
  - Moved each llseek variant inside its respective #ifdef guard
    during the corresponding dynamic-to-static conversion commit,
    dropping the __maybe_unused annotations.
  - Extended the WARN macro removal to also cover __legacy_mmap_fits().
  - Updated commit message of patch 18, so that it correctly mentions
    pci_stop_dev() rather than pci_stop_bus_device().
  - Updated commit message of patch 24 to clarify the indirect
    relationship between ReBAR and the HAVE_PCI_MMAP and/or
    ARCH_GENERIC_PCI_MMAP_RESOURCE guards.

Changes in v6:
  https://lore.kernel.org/linux-pci/20260416180107.777065-1-kwilczynski@kernel.org/

   - Fixed commit message for patch 13, removing reference to
     pci_resource_flags() which was no longer changed there.
   - Added a new patch (24) to move the BAR resource resize
     (ReBAR) support behind existing PCI mmap #ifdef guard,
     so that the code is not included on architectures that
     do not support resource resizing (i.e., Alpha, etc.).

Changes in v5:
  https://lore.kernel.org/linux-pci/20260411080148.471335-1-kwilczynski@kernel.org/
  
   - Added new Tested-by, Reviewed-by, and Acked-by tags.
   - Used the existing _io function names in the static macro
     definitions, deferring the rename to the conversion commit
     where it belongs, to avoid a forward reference across
     commits. This was reported by Sashiko, see:
     https://sashiko.dev/#/patchset/20260411080148.471335-1-kwilczynski%40kernel.org?part=6
   - Folded the __resource_resize_store() conversion into the
     main static attributes commit so the resize path is never
     broken between commits. This was reported by Sashiko, see:
     https://sashiko.dev/#/patchset/20260410055040.39233-1-kwilczynski%40kernel.org?part=6
     https://sashiko.dev/#/patchset/20260411080148.471335-1-kwilczynski%40kernel.org?part=7
   - Dropped the unnecessary parentheses cleanup from the Alpha
     BAR index commit, as the line is replaced two commits later
     anyway, as per Ilpo Järvinen's feedback.
   - Squashed the Alpha accessor macro and cleanup commits into
     one, using pci_resource_is_mem() directly instead of the
     intermediate pci_resource_flags() step, as per Ilpo
     Järvinen's feedback.
   - Moved the raw literal conversion in pci_create_legacy_files()
     into the macro definition commit, so the macros and their
     usage are introduced together, as per Ilpo Järvinen's
     feedback.
   - Removed unnecessary backslash line continuation from the
     ternary in pci_mmap_legacy_page_range().
   - Kept pci_resource_len() for visibility checks instead of
     resource_assigned().  The static is_visible() callback
     runs at device_add() time during the PCI enumeration,
     before the pci_assign_unassigned_bus_resources() populates
     res->parent, as such, resource_assigned() returned false
     for every BAR, hiding all resource files.  This is related
     to review feedback from Ilpo Järvinen.

Changes in v4:
  https://lore.kernel.org/linux-pci/20260410055040.39233-1-kwilczynski@kernel.org/

   - Added new Reviewed-by tags.
   - Added pci_resource_is_io() and pci_resource_is_mem() helpers
     for resource type checks, replacing the open-coded bitwise
     flag tests in pci_mmap_resource(), pci_resource_io(), and
     Alpha's pci_mmap_resource(), as per Ilpo Järvinen's
     suggestion.
   - Split the __pci_mmap_fits() cleanup into two patches.  An
     overflow fix for zero-length BARs, which now includes a
     Fixes: tag referencing the original Alpha PCI sysfs commit,
     and the WARN macro removal is a separate cleanup as per Ilpo
     Järvinen's suggestion.
   - Added a missing Fixes: tag to the Alpha lockdown check,
     referencing the commit that added the check to the generic
     path but missed Alpha's implementation.
   - Added PCI_LEGACY_IO_SIZE and PCI_LEGACY_MEM_SIZE macros to
     replace the raw literals used for legacy address space sizes.
     These are used in both Alpha's pci_mmap_legacy_page_range()
     and the static legacy attribute definitions, as per Ilpo
     Järvinen's suggestion.
   - Replaced sysfs_update_groups() in the BAR resize path with
     sysfs_remove_groups() before the resize and sysfs_create_groups()
     after, restoring the original teardown before BAR resize
     ordering.  This was reported by Sashiko, see:
     https://sashiko.dev/#/patchset/20260410055040.39233-1-kwilczynski%40kernel.org?part=7
   - Defined pci_dev_resource_attr_groups as a NULL macro when
     HAVE_PCI_MMAP and ARCH_GENERIC_PCI_MMAP_RESOURCE are both
     absent, so the resize path compiles unconditionally without
     #ifdef guards in the function body.  This was reported by
     Sashiko, see:
     https://sashiko.dev/#/patchset/20260410055040.39233-1-kwilczynski%40kernel.org?part=7
   - Moved the pci_legacy_has_sparse() prototype into the patch
     that introduces the function, alongside the existing
     pci_adjust_legacy_attr() declaration, to fix a bisection
     issue where Alpha would warn on -Wmissing-prototypes.
     This was reported by Sashiko, see:
     https://sashiko.dev/#/patchset/20260410055040.39233-1-kwilczynski%40kernel.org?part=18

Changes in v3:
  https://lore.kernel.org/linux-pci/20210910202623.2293708-1-kw@linux.com/

  - Updated for modern kernel releases and expanded scope.  The
    v2 only covered the generic resource files.  This version
    also converts Alpha's sparse/dense resource files and the
    legacy bus attributes, removing pci_sysfs_init() entirely.
  - Split the single macro definition into three distinct ones
    (per I/O, UC, and WC), to make sure that each carries only
    the callbacks its resource type needs.
  - Updated to use the new .bin_size callback, as the attributes
    are const, to replace using a->size directly, which was not
    ideal.  This required changes to pci_llseek_resource(), to
    ensure that it would work for device and bus-level attributes.
  - Updated the __resource_resize_store() to include CAP_SYS_ADMIN
    capabilities check.
  - Added the security_locked_down() check to Alpha's
    pci_mmap_resource(), to align with other architectures.

Changes in v2:
  https://lore.kernel.org/linux-pci/20210825212255.878043-1-kw@linux.com/

  - Refactored code so that the macros, helpers and internal
    functions can be used to correctly leverage the read(),
    write() and mmap() callbacks rather than to use the
    .is_bin_visible() callback to set up sysfs objects
    internals as this is not supported.
  - Refactored some if-statements to check for a resource
    flag first, and then call either arch_can_pci_mmap_io()
    or arch_can_pci_mmap_wc(), plus store result of testing
    for IORESOURCE_MEM and IORESOURCE_PREFETCH flags into
    a boolean variable, as per Bjorn Helgaas' suggestion.
  - Renamed pci_read_resource_io() and pci_write_resource_io()
    callbacks so that these are not specifically tied to I/O
    BARs read() and write() operations also as per Bjorn
    Helgaas' suggestion.
  - Updated style for code handling bitwise operations to
    match the style that is preferred as per Bjorn Helgaas'
    suggestion.
  - Updated commit messages adding more details about the
    implementation as requested by Bjorn Helgaas.

Krzysztof Wilczyński (24):
  PCI/sysfs: Use PCI resource accessor macros
  PCI: Add pci_resource_is_io() and pci_resource_is_mem() helpers
  PCI/sysfs: Only allow supported resource types in I/O and MMIO helpers
  PCI/sysfs: Split pci_llseek_resource() for device and legacy
    attributes
  PCI/sysfs: Add CAP_SYS_ADMIN check to __resource_resize_store()
  PCI/sysfs: Add static PCI resource attribute macros
  PCI/sysfs: Convert PCI resource files to static attributes
  PCI/sysfs: Warn about BAR resize failure in __resource_resize_store()
  PCI/sysfs: Add stubs for pci_{create,remove}_sysfs_dev_files()
  PCI/sysfs: Limit pci_sysfs_init() late_initcall compile scope
  alpha/PCI: Add security_locked_down() check to pci_mmap_resource()
  alpha/PCI: Use BAR index in sysfs attr->private instead of resource
    pointer
  alpha/PCI: Use PCI resource accessor macros
  alpha/PCI: Fix __pci_mmap_fits() overflow for zero-length BARs
  alpha/PCI: Remove WARN from __pci_mmap_fits() and __legacy_mmap_fits()
  alpha/PCI: Add static PCI resource attribute macros
  alpha/PCI: Convert resource files to static attributes
  PCI/sysfs: Remove pci_{create,remove}_sysfs_dev_files()
  PCI: Add macros for legacy I/O and memory address space sizes
  alpha/PCI: Compute legacy size in pci_mmap_legacy_page_range()
  PCI/sysfs: Add __weak pci_legacy_has_sparse() helper
  PCI/sysfs: Convert legacy I/O and memory attributes to static
    definitions
  PCI/sysfs: Remove pci_create_legacy_files() and pci_sysfs_init()
  PCI/sysfs: Limit BAR resize attribute scope to platforms with PCI mmap

 arch/alpha/include/asm/pci.h   |  13 +-
 arch/alpha/kernel/pci-sysfs.c  | 385 +++++++++++----------
 arch/powerpc/include/asm/pci.h |   2 -
 drivers/pci/bus.c              |   1 -
 drivers/pci/pci-sysfs.c        | 592 +++++++++++++++++++--------------
 drivers/pci/pci.h              |  16 +-
 drivers/pci/probe.c            |   6 -
 drivers/pci/remove.c           |   3 -
 include/linux/pci.h            |  39 ++-
 9 files changed, 589 insertions(+), 468 deletions(-)

-- 
2.54.0



^ permalink raw reply

* [PATCH] powerpc/hv-gpci: fix preempt count leak in sysfs show paths
From: Aboorva Devarajan @ 2026-05-08  4:12 UTC (permalink / raw)
  To: Madhavan Srinivasan, Athira Rajeev, linuxppc-dev
  Cc: Aboorva Devarajan, Christophe Leroy, Kajol Jain, linux-kernel

Four sysfs show() callbacks in hv-gpci take get_cpu_var(hv_gpci_reqb)
(which calls preempt_disable()) but only call the matching put_cpu_var()
on the error path under the 'out:' label. Every successful read leaks
one preempt_disable():

  processor_bus_topology_show()
  processor_config_show()
  affinity_domain_via_virtual_processor_show()
  affinity_domain_via_domain_show()

(affinity_domain_via_partition_show() was already correct.)

On a CONFIG_PREEMPT=y kernel, repeated reads raise preempt_count and
eventually return to userspace with preemption still disabled. The
next user-mode page fault then hits faulthandler_disabled() == 1,
gets forced to SIGSEGV, and the resulting coredump trips
'BUG: scheduling while atomic' in call_usermodehelper_exec ->
wait_for_completion_state -> schedule:

  BUG: scheduling while atomic: <task>/<pid>/0x00000004
  ...
  __schedule_bug+0x6c/0x90
  __schedule+0x58c/0x13a0
  schedule+0x48/0x1a0
  schedule_timeout+0x104/0x170
  wait_for_completion_state+0x16c/0x330
  call_usermodehelper_exec+0x254/0x2d0
  vfs_coredump+0x1050/0x2590
  get_signal+0xb9c/0xc80
  do_notify_resume+0xf8/0x470

Add an out_success label that calls put_cpu_var() before returning
the byte count, mirroring affinity_domain_via_partition_show().

Fixes: 71f1c39647d8 ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show processor bus topology information")
Fixes: 1a160c2a13c6 ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show processor config information")
Fixes: 71a7ccb478fc ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity domain via virtual processor information")
Fixes: a69a57cac1ec ("powerpc/hv_gpci: Add sysfs file inside hv_gpci device to show affinity domain via domain information")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
 arch/powerpc/perf/hv-gpci.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 5cac2cf3bd1e5..10c82cf8f5b39 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -210,7 +210,7 @@ static ssize_t processor_bus_topology_show(struct device *dev, struct device_att
 			0, 0, buf, &n, arg);
 
 	if (!ret)
-		return n;
+		goto out_success;
 
 	if (ret != H_PARAMETER)
 		goto out;
@@ -244,12 +244,14 @@ static ssize_t processor_bus_topology_show(struct device *dev, struct device_att
 				starting_index, 0, buf, &n, arg);
 
 		if (!ret)
-			return n;
+			goto out_success;
 
 		if (ret != H_PARAMETER)
 			goto out;
 	}
 
+out_success:
+	put_cpu_var(hv_gpci_reqb);
 	return n;
 
 out:
@@ -278,7 +280,7 @@ static ssize_t processor_config_show(struct device *dev, struct device_attribute
 			0, 0, buf, &n, arg);
 
 	if (!ret)
-		return n;
+		goto out_success;
 
 	if (ret != H_PARAMETER)
 		goto out;
@@ -312,12 +314,14 @@ static ssize_t processor_config_show(struct device *dev, struct device_attribute
 				starting_index, 0, buf, &n, arg);
 
 		if (!ret)
-			return n;
+			goto out_success;
 
 		if (ret != H_PARAMETER)
 			goto out;
 	}
 
+out_success:
+	put_cpu_var(hv_gpci_reqb);
 	return n;
 
 out:
@@ -346,7 +350,7 @@ static ssize_t affinity_domain_via_virtual_processor_show(struct device *dev,
 			0, 0, buf, &n, arg);
 
 	if (!ret)
-		return n;
+		goto out_success;
 
 	if (ret != H_PARAMETER)
 		goto out;
@@ -382,12 +386,14 @@ static ssize_t affinity_domain_via_virtual_processor_show(struct device *dev,
 				starting_index, secondary_index, buf, &n, arg);
 
 		if (!ret)
-			return n;
+			goto out_success;
 
 		if (ret != H_PARAMETER)
 			goto out;
 	}
 
+out_success:
+	put_cpu_var(hv_gpci_reqb);
 	return n;
 
 out:
@@ -416,7 +422,7 @@ static ssize_t affinity_domain_via_domain_show(struct device *dev, struct device
 			0, 0, buf, &n, arg);
 
 	if (!ret)
-		return n;
+		goto out_success;
 
 	if (ret != H_PARAMETER)
 		goto out;
@@ -448,12 +454,14 @@ static ssize_t affinity_domain_via_domain_show(struct device *dev, struct device
 					starting_index, 0, buf, &n, arg);
 
 		if (!ret)
-			return n;
+			goto out_success;
 
 		if (ret != H_PARAMETER)
 			goto out;
 	}
 
+out_success:
+	put_cpu_var(hv_gpci_reqb);
 	return n;
 
 out:
-- 
2.54.0



^ permalink raw reply related

* Re: [PATCH 4/5] ibmvfc: use async sub-queue for FPIN messages
From: Dave Marquardt @ 2026-05-07 22:40 UTC (permalink / raw)
  To: Tyrel Datwyler
  Cc: James E.J. Bottomley, Martin K. Petersen, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	linux-kernel, linux-scsi, linuxppc-dev, Brian King, Greg Joyce,
	Kyle Mahlkuch
In-Reply-To: <e59242e4-f353-44eb-ad48-b76a9101d4fb@linux.ibm.com>

Tyrel Datwyler <tyreld@linux.ibm.com> writes:

> On 4/8/26 10:07 AM, Dave Marquardt via B4 Relay wrote:
>> From: Dave Marquardt <davemarq@linux.ibm.com>
>> 
>> - allocate async sub-queue
>> - allocate interrupt and set up handler
>> - negotiate use of async sub-queue with NPIV (VIOS)
>> - refactor ibmvfc_basic_fpin_to_desc() and ibmvfc_full_fpin_to_desc()
>>   into common routine
>> - add KUnit test to verify async sub-queue is allocated
>
> Again more descriptive commit log message required here. Also, this looks like a
> lot of things being implmented. Can this be broken into multiple patches? It
> sure looks like a bunch of functional changes that build on each other.

Yes, I'll break this up into more patches.

>> ---
>>  drivers/scsi/ibmvscsi/ibmvfc.c       | 325 ++++++++++++++++++++++++++++++++---
>>  drivers/scsi/ibmvscsi/ibmvfc.h       |  29 +++-
>>  drivers/scsi/ibmvscsi/ibmvfc_kunit.c |  52 +++---
>>  3 files changed, 363 insertions(+), 43 deletions(-)
>> 
>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
>> index 803fc3caa14d..26e39b367022 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
>> @@ -1471,6 +1471,13 @@ static void ibmvfc_gather_partition_info(struct ibmvfc_host *vhost)
>>  	of_node_put(rootdn);
>>  }
>>  
>> +static __be64 ibmvfc_npiv_chan_caps[] = {
>> +	cpu_to_be64(IBMVFC_CAN_USE_CHANNELS | IBMVFC_USE_ASYNC_SUBQ |
>> +		    IBMVFC_YES_SCSI | IBMVFC_CAN_HANDLE_FPIN),
>> +	cpu_to_be64(IBMVFC_CAN_USE_CHANNELS),
>> +};
>> +#define IBMVFC_NPIV_CHAN_CAPS_SIZE (sizeof(ibmvfc_npiv_chan_caps)/sizeof(__be64))
>> +
>
> I really don't understand what you are doing here? You seem to be definig
> various sets of capabilities, but how does the driver decide which set to use?
> As far as I can tell the index is increased and the capabilities decrease each
> time a transport event is received. This looks like maybe its just a testing hack.

My thought was to deal with an older VIOS that doesn't support the async
sub-queue and full FPIN. But I suppose the response should just not set the
appropriate bits. I'll go re-read the NPIV spec and figure out if this
is actually needed.

>>  /**
>>   * ibmvfc_set_login_info - Setup info for NPIV login
>>   * @vhost:	ibmvfc host struct
>> @@ -1486,6 +1493,8 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
>>  	const char *location;
>>  	u16 max_cmds;
>>  
>> +	ENTER;
>> +
>>  	max_cmds = scsi_qdepth + IBMVFC_NUM_INTERNAL_REQ;
>>  	if (mq_enabled)
>>  		max_cmds += (scsi_qdepth + IBMVFC_NUM_INTERNAL_SUBQ_REQ) *
>> @@ -1509,8 +1518,12 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
>>  		cpu_to_be64(IBMVFC_CAN_MIGRATE | IBMVFC_CAN_SEND_VF_WWPN |
>>  			    IBMVFC_CAN_USE_NOOP_CMD);
>>  
>> -	if (vhost->mq_enabled || vhost->using_channels)
>> -		login_info->capabilities |= cpu_to_be64(IBMVFC_CAN_USE_CHANNELS);
>> +	if (vhost->mq_enabled || vhost->using_channels) {
>> +		if (vhost->login_cap_index >= IBMVFC_NPIV_CHAN_CAPS_SIZE)
>> +			login_info->capabilities |= cpu_to_be64(IBMVFC_CAN_USE_CHANNELS);
>> +		else
>> +			login_info->capabilities |= ibmvfc_npiv_chan_caps[vhost->login_cap_index];
>> +	}
>>  
>>  	login_info->async.va = cpu_to_be64(vhost->async_crq.msg_token);
>>  	login_info->async.len = cpu_to_be32(async_crq->size *
>> @@ -1524,6 +1537,8 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
>>  	location = of_get_property(of_node, "ibm,loc-code", NULL);
>>  	location = location ? location : dev_name(vhost->dev);
>>  	strscpy(login_info->drc_name, location, sizeof(login_info->drc_name));
>> +
>> +	LEAVE;
>>  }
>>  
>>  /**
>> @@ -3323,7 +3338,7 @@ ibmvfc_common_fpin_to_desc(u8 fpin_status, __be64 wwpn, __be16 modifier,
>>   * non-NULL - pointer to populated struct fc_els_fpin
>>   */
>>  static struct fc_els_fpin *
>> -/*XXX*/ibmvfc_basic_fpin_to_desc(struct ibmvfc_async_crq *crq)
>
> I mentioned this /*XXX*/ in an earlier patch. This needs to be fixed in that patch.

Yes.

>> +ibmvfc_basic_fpin_to_desc(struct ibmvfc_async_crq *crq)
>>  {
>>  	return ibmvfc_common_fpin_to_desc(crq->fpin_status, crq->wwpn,
>>  					  cpu_to_be16(0),
>> @@ -3332,6 +3347,29 @@ static struct fc_els_fpin *
>>  					  cpu_to_be32(1));
>>  }
>>  
>> +/**
>> + * ibmvfc_full_fpin_to_desc(): allocate and populate a struct fc_els_fpin struct
>> + * containing a descriptor.
>> + * @ibmvfc_fpin: Pointer to async subq FPIN data
>> + *
>> + * Allocate a struct fc_els_fpin containing a descriptor and populate
>> + * based on data from *ibmvfc_fpin.
>> + *
>> + * Return:
>> + * NULL     - unable to allocate structure
>> + * non-NULL - pointer to populated struct fc_els_fpin
>> + */
>> +static struct fc_els_fpin *
>> +ibmvfc_full_fpin_to_desc(struct ibmvfc_async_subq *ibmvfc_fpin)
>> +{
>> +	return ibmvfc_common_fpin_to_desc(ibmvfc_fpin->fpin_status,
>> +					  ibmvfc_fpin->wwpn,
>> +					  cpu_to_be16(0),
>> +					  cpu_to_be32(IBMVFC_FPIN_DEFAULT_EVENT_PERIOD),
>> +					  cpu_to_be32(IBMVFC_FPIN_DEFAULT_EVENT_THRESHOLD),
>> +					  cpu_to_be32(1));
>> +}
>> +
>>  /**
>>   * ibmvfc_handle_async - Handle an async event from the adapter
>>   * @crq:	crq to process
>> @@ -3449,6 +3487,120 @@ VISIBLE_IF_KUNIT void ibmvfc_handle_async(struct ibmvfc_async_crq *crq,
>>  }
>>  EXPORT_SYMBOL_IF_KUNIT(ibmvfc_handle_async);
>>  
>> +VISIBLE_IF_KUNIT void ibmvfc_handle_asyncq(struct ibmvfc_crq *crq_instance,
>> +					   struct ibmvfc_host *vhost,
>> +					   struct list_head *evt_doneq)
>> +{
>> +	struct ibmvfc_async_subq *crq = (struct ibmvfc_async_subq *)crq_instance;
>> +	const struct ibmvfc_async_desc *desc = ibmvfc_get_ae_desc(be16_to_cpu(crq->event));
>> +	struct ibmvfc_target *tgt;
>> +	struct fc_els_fpin *fpin;
>> +
>> +	ibmvfc_log(vhost, desc->log_level,
>> +		   "%s event received. wwpn: %llx, node_name: %llx%s event 0x%x\n",
>> +		   desc->desc, be64_to_cpu(crq->wwpn), be64_to_cpu(crq->id.node_name),
>> +		   ibmvfc_get_link_state(crq->link_state), be16_to_cpu(crq->event));
>
> Was there no way to not copy/paste what looks like basically ibmvfc_handle_async
> into ibmvfc_handle_asyncq? This is a bunch of unnecessary code bloat. The major
> difference seems that crq->event is be64 on the standard CRQ and be16 on a
> sub-crq and accessing certain fields differently.

That's a good idea. I'll see what I can do. Seems like a little
refactoring should make it work.

> Again I think maybe we need to consider moving all the async work into a workqueue.

My initial thought was to just queue the FPIN processing of
ibmvfc_handle_asyncq to a work queue to resolve the problem of calling
fc_host_fpin_rcv from interrupt context, but putting all of this
processing into a work queue would work too. I'll look into it.

-Dave


^ permalink raw reply

* [powerpc:merge] BUILD SUCCESS 2e16485a3286c4e603873c40d13adc73bdbe3845
From: kernel test robot @ 2026-05-07 22:33 UTC (permalink / raw)
  To: Madhavan Srinivasan; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 2e16485a3286c4e603873c40d13adc73bdbe3845  Automatic merge of 'fixes' into merge (2026-05-07 15:41)

elapsed time: 726m

configs tested: 233
configs skipped: 9

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha                             allnoconfig    gcc-15.2.0
alpha                            allyesconfig    gcc-15.2.0
alpha                               defconfig    gcc-15.2.0
arc                              allmodconfig    clang-16
arc                              allmodconfig    gcc-15.2.0
arc                               allnoconfig    gcc-15.2.0
arc                              allyesconfig    clang-23
arc                              allyesconfig    gcc-15.2.0
arc                                 defconfig    gcc-15.2.0
arc                   randconfig-001-20260507    gcc-14.3.0
arc                   randconfig-002-20260507    gcc-14.3.0
arm                               allnoconfig    clang-23
arm                              allyesconfig    clang-16
arm                              allyesconfig    gcc-15.2.0
arm                                 defconfig    gcc-15.2.0
arm                   randconfig-002-20260507    gcc-14.3.0
arm                   randconfig-004-20260507    gcc-14.3.0
arm64                            allmodconfig    clang-19
arm64                            allmodconfig    clang-23
arm64                             allnoconfig    gcc-15.2.0
arm64                               defconfig    gcc-15.2.0
arm64                 randconfig-001-20260507    gcc-15.2.0
arm64                 randconfig-001-20260508    gcc-14.3.0
arm64                 randconfig-002-20260507    gcc-15.2.0
arm64                 randconfig-002-20260508    gcc-14.3.0
arm64                 randconfig-003-20260507    gcc-15.2.0
arm64                 randconfig-003-20260508    gcc-14.3.0
arm64                 randconfig-004-20260507    gcc-15.2.0
arm64                 randconfig-004-20260508    gcc-14.3.0
csky                             allmodconfig    gcc-15.2.0
csky                              allnoconfig    gcc-15.2.0
csky                                defconfig    gcc-15.2.0
csky                  randconfig-001-20260507    gcc-15.2.0
csky                  randconfig-001-20260508    gcc-14.3.0
csky                  randconfig-002-20260507    gcc-15.2.0
csky                  randconfig-002-20260508    gcc-14.3.0
hexagon                          allmodconfig    clang-17
hexagon                          allmodconfig    gcc-15.2.0
hexagon                           allnoconfig    clang-23
hexagon                             defconfig    gcc-15.2.0
hexagon               randconfig-001-20260507    clang-23
hexagon               randconfig-002-20260507    clang-23
i386                             allmodconfig    clang-20
i386                             allmodconfig    gcc-14
i386                              allnoconfig    gcc-14
i386                             allyesconfig    clang-20
i386                             allyesconfig    gcc-14
i386        buildonly-randconfig-001-20260507    clang-20
i386        buildonly-randconfig-001-20260507    gcc-14
i386        buildonly-randconfig-001-20260508    gcc-14
i386        buildonly-randconfig-002-20260507    clang-20
i386        buildonly-randconfig-002-20260507    gcc-14
i386        buildonly-randconfig-002-20260508    gcc-14
i386        buildonly-randconfig-003-20260507    clang-20
i386        buildonly-randconfig-003-20260507    gcc-14
i386        buildonly-randconfig-003-20260508    gcc-14
i386        buildonly-randconfig-004-20260507    clang-20
i386        buildonly-randconfig-004-20260507    gcc-14
i386        buildonly-randconfig-004-20260508    gcc-14
i386        buildonly-randconfig-005-20260507    clang-20
i386        buildonly-randconfig-005-20260508    gcc-14
i386        buildonly-randconfig-006-20260507    clang-20
i386        buildonly-randconfig-006-20260507    gcc-14
i386        buildonly-randconfig-006-20260508    gcc-14
i386                                defconfig    gcc-15.2.0
i386                  randconfig-001-20260507    gcc-14
i386                  randconfig-001-20260508    gcc-14
i386                  randconfig-002-20260507    gcc-14
i386                  randconfig-002-20260508    gcc-14
i386                  randconfig-003-20260507    gcc-14
i386                  randconfig-003-20260508    gcc-14
i386                  randconfig-004-20260507    gcc-14
i386                  randconfig-004-20260508    gcc-14
i386                  randconfig-005-20260507    gcc-14
i386                  randconfig-005-20260508    gcc-14
i386                  randconfig-006-20260507    gcc-14
i386                  randconfig-006-20260508    gcc-14
i386                  randconfig-007-20260507    gcc-14
i386                  randconfig-007-20260508    gcc-14
i386                  randconfig-011-20260507    clang-20
i386                  randconfig-011-20260507    gcc-14
i386                  randconfig-012-20260507    clang-20
i386                  randconfig-013-20260507    clang-20
i386                  randconfig-013-20260507    gcc-14
i386                  randconfig-014-20260507    clang-20
i386                  randconfig-014-20260507    gcc-14
i386                  randconfig-015-20260507    clang-20
i386                  randconfig-016-20260507    clang-20
i386                  randconfig-016-20260507    gcc-14
i386                  randconfig-017-20260507    clang-20
i386                  randconfig-017-20260507    gcc-14
loongarch                        allmodconfig    clang-19
loongarch                        allmodconfig    clang-23
loongarch                         allnoconfig    clang-23
loongarch                           defconfig    clang-19
loongarch             randconfig-001-20260507    clang-23
loongarch             randconfig-002-20260507    clang-23
m68k                             allmodconfig    gcc-15.2.0
m68k                              allnoconfig    gcc-15.2.0
m68k                             allyesconfig    clang-16
m68k                             allyesconfig    gcc-15.2.0
m68k                                defconfig    clang-19
microblaze                        allnoconfig    gcc-15.2.0
microblaze                       allyesconfig    gcc-15.2.0
microblaze                          defconfig    clang-19
mips                             allmodconfig    gcc-15.2.0
mips                              allnoconfig    gcc-15.2.0
mips                             allyesconfig    gcc-15.2.0
nios2                            allmodconfig    clang-23
nios2                            allmodconfig    gcc-11.5.0
nios2                             allnoconfig    clang-23
nios2                             allnoconfig    gcc-11.5.0
nios2                               defconfig    clang-19
nios2                 randconfig-001-20260507    clang-23
nios2                 randconfig-002-20260507    clang-23
openrisc                         allmodconfig    clang-23
openrisc                         allmodconfig    gcc-15.2.0
openrisc                          allnoconfig    clang-23
openrisc                          allnoconfig    gcc-15.2.0
openrisc                            defconfig    gcc-15.2.0
parisc                           allmodconfig    gcc-15.2.0
parisc                            allnoconfig    clang-23
parisc                            allnoconfig    gcc-15.2.0
parisc                           allyesconfig    clang-19
parisc                           allyesconfig    gcc-15.2.0
parisc                              defconfig    gcc-15.2.0
parisc                randconfig-001-20260507    gcc-8.5.0
parisc                randconfig-001-20260508    gcc-9.5.0
parisc                randconfig-002-20260507    gcc-8.5.0
parisc                randconfig-002-20260508    gcc-9.5.0
parisc64                            defconfig    clang-19
powerpc                          allmodconfig    gcc-15.2.0
powerpc                           allnoconfig    clang-23
powerpc                           allnoconfig    gcc-15.2.0
powerpc                 mpc837x_rdb_defconfig    gcc-15.2.0
powerpc               randconfig-001-20260507    gcc-8.5.0
powerpc               randconfig-001-20260508    gcc-9.5.0
powerpc               randconfig-002-20260507    gcc-8.5.0
powerpc               randconfig-002-20260508    gcc-9.5.0
powerpc64             randconfig-001-20260507    gcc-8.5.0
powerpc64             randconfig-001-20260508    gcc-9.5.0
powerpc64             randconfig-002-20260507    gcc-8.5.0
powerpc64             randconfig-002-20260508    gcc-9.5.0
riscv                            allmodconfig    clang-23
riscv                             allnoconfig    clang-23
riscv                             allnoconfig    gcc-15.2.0
riscv                            allyesconfig    clang-16
riscv                               defconfig    gcc-15.2.0
s390                             allmodconfig    clang-18
s390                             allmodconfig    clang-19
s390                              allnoconfig    clang-23
s390                             allyesconfig    gcc-15.2.0
s390                                defconfig    gcc-15.2.0
sh                               allmodconfig    gcc-15.2.0
sh                                allnoconfig    clang-23
sh                                allnoconfig    gcc-15.2.0
sh                               allyesconfig    clang-19
sh                               allyesconfig    gcc-15.2.0
sh                                  defconfig    gcc-14
sparc                             allnoconfig    clang-23
sparc                             allnoconfig    gcc-15.2.0
sparc                               defconfig    gcc-15.2.0
sparc                          randconfig-001    gcc-12.5.0
sparc                 randconfig-001-20260507    gcc-12.5.0
sparc                          randconfig-002    gcc-12.5.0
sparc                 randconfig-002-20260507    gcc-12.5.0
sparc64                          allmodconfig    clang-23
sparc64                             defconfig    gcc-14
sparc64                        randconfig-001    gcc-12.5.0
sparc64               randconfig-001-20260507    gcc-12.5.0
sparc64                        randconfig-002    gcc-12.5.0
sparc64               randconfig-002-20260507    gcc-12.5.0
um                               allmodconfig    clang-19
um                                allnoconfig    clang-23
um                               allyesconfig    gcc-14
um                               allyesconfig    gcc-15.2.0
um                                  defconfig    gcc-14
um                             i386_defconfig    gcc-14
um                             randconfig-001    gcc-12.5.0
um                    randconfig-001-20260507    gcc-12.5.0
um                             randconfig-002    gcc-12.5.0
um                    randconfig-002-20260507    gcc-12.5.0
um                           x86_64_defconfig    gcc-14
x86_64                           allmodconfig    clang-20
x86_64                            allnoconfig    clang-20
x86_64                            allnoconfig    clang-23
x86_64                           allyesconfig    clang-20
x86_64      buildonly-randconfig-001-20260507    clang-20
x86_64      buildonly-randconfig-002-20260507    clang-20
x86_64      buildonly-randconfig-003-20260507    clang-20
x86_64      buildonly-randconfig-004-20260507    clang-20
x86_64      buildonly-randconfig-005-20260507    clang-20
x86_64      buildonly-randconfig-006-20260507    clang-20
x86_64                              defconfig    gcc-14
x86_64                                  kexec    clang-20
x86_64                randconfig-001-20260507    gcc-14
x86_64                randconfig-001-20260508    clang-20
x86_64                randconfig-002-20260507    gcc-14
x86_64                randconfig-002-20260508    clang-20
x86_64                randconfig-003-20260507    gcc-14
x86_64                randconfig-003-20260508    clang-20
x86_64                randconfig-004-20260507    gcc-14
x86_64                randconfig-004-20260508    clang-20
x86_64                randconfig-005-20260507    gcc-14
x86_64                randconfig-005-20260508    clang-20
x86_64                randconfig-006-20260507    gcc-14
x86_64                randconfig-006-20260508    clang-20
x86_64                randconfig-011-20260507    gcc-14
x86_64                randconfig-012-20260507    gcc-14
x86_64                randconfig-013-20260507    gcc-14
x86_64                randconfig-014-20260507    gcc-14
x86_64                randconfig-015-20260507    gcc-14
x86_64                randconfig-016-20260507    gcc-14
x86_64                randconfig-071-20260507    clang-20
x86_64                randconfig-072-20260507    clang-20
x86_64                randconfig-073-20260507    clang-20
x86_64                randconfig-074-20260507    clang-20
x86_64                randconfig-075-20260507    clang-20
x86_64                randconfig-076-20260507    clang-20
x86_64                               rhel-9.4    clang-20
x86_64                           rhel-9.4-bpf    gcc-14
x86_64                          rhel-9.4-func    clang-20
x86_64                    rhel-9.4-kselftests    clang-20
x86_64                         rhel-9.4-kunit    gcc-14
x86_64                           rhel-9.4-ltp    gcc-14
x86_64                          rhel-9.4-rust    clang-20
xtensa                            allnoconfig    clang-23
xtensa                            allnoconfig    gcc-15.2.0
xtensa                           allyesconfig    clang-23
xtensa                         randconfig-001    gcc-12.5.0
xtensa                randconfig-001-20260507    gcc-12.5.0
xtensa                         randconfig-002    gcc-12.5.0
xtensa                randconfig-002-20260507    gcc-12.5.0

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply

* Re: [PATCH 3/5] ibmvfc: make ibmvfc login to fabric
From: Dave Marquardt @ 2026-05-07 22:34 UTC (permalink / raw)
  To: Tyrel Datwyler
  Cc: James E.J. Bottomley, Martin K. Petersen, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	linux-kernel, linux-scsi, linuxppc-dev, Brian King, Greg Joyce,
	Kyle Mahlkuch
In-Reply-To: <e8dfa3f2-2b2b-43c0-97a1-6bccb748ca6e@linux.ibm.com>

Tyrel Datwyler <tyreld@linux.ibm.com> writes:

> On 4/8/26 10:07 AM, Dave Marquardt via B4 Relay wrote:
>> From: Dave Marquardt <davemarq@linux.ibm.com>
>> 
>> Make ibmvfc login to fabric when NPIV login returns SUPPORT_SCSI or
>> SUPPORT_NVMEOF capabilities.
>
> Again better commit log message here and developer sign off tag.
>
>> ---
>>  drivers/scsi/ibmvscsi/ibmvfc.c | 100 ++++++++++++++++++++++++++++++++++++++---
>>  drivers/scsi/ibmvscsi/ibmvfc.h |  20 +++++++++
>>  2 files changed, 115 insertions(+), 5 deletions(-)
>> 
>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
>> index 808301fa452d..803fc3caa14d 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
>> @@ -5205,6 +5205,89 @@ static void ibmvfc_discover_targets(struct ibmvfc_host *vhost)
>>  		ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
>>  }
>>  
>> +static void ibmvfc_fabric_login_done(struct ibmvfc_event *evt)
>> +{
>> +	struct ibmvfc_fabric_login *rsp = &evt->xfer_iu->fabric_login;
>> +	u32 mad_status = be16_to_cpu(rsp->common.status);
>> +	struct ibmvfc_host *vhost = evt->vhost;
>> +	int level = IBMVFC_DEFAULT_LOG_LEVEL;
>> +
>> +	ENTER;
>> +
>> +	switch (mad_status) {
>> +	case IBMVFC_MAD_SUCCESS:
>> +		vhost->logged_in = 1;
>
> I'm not sure I see the point of setting logged_in here since we already set it
> in npiv_login_done.

Agreed.

>> +		vhost->fabric_capabilities = rsp->capabilities;
>
> The way this is currently spec'd out there are no linux relevant capabilities
> coming from fabric login. So, I'm not sure there is a reason to save them at
> this point.

Okay.

>> +		fc_host_port_id(vhost->host) = be64_to_cpu(rsp->nport_id);
>> +		ibmvfc_free_event(evt);
>> +		break;
>> +
>> +	case IBMVFC_MAD_FAILED:
>> +		if (ibmvfc_retry_cmd(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)))
>> +			level += ibmvfc_retry_host_init(vhost);
>> +		else
>> +			ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
>> +		ibmvfc_log(vhost, level, "Fabric Login failed: %s (%x:%x)\n",
>> +			   ibmvfc_get_cmd_error(be16_to_cpu(rsp->status), be16_to_cpu(rsp->error)),
>> +						be16_to_cpu(rsp->status), be16_to_cpu(rsp->error));
>> +		ibmvfc_free_event(evt);
>> +		LEAVE;
>> +		return;
>> +
>> +	case IBMVFC_MAD_CRQ_ERROR:
>> +		ibmvfc_retry_host_init(vhost);
>> +		fallthrough;
>> +
>> +	case IBMVFC_MAD_DRIVER_FAILED:
>> +		ibmvfc_free_event(evt);
>> +		LEAVE;
>> +		return;
>> +
>> +	default:
>> +		dev_err(vhost->dev, "Invalid fabric Login response: 0x%x\n", mad_status);
>> +		ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
>> +		ibmvfc_free_event(evt);
>> +		LEAVE;
>> +		return;
>> +	}
>> +
>> +	ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
>> +	wake_up(&vhost->work_wait_q);
>> +
>> +	LEAVE;
>> +}
>> +
>> +static void ibmvfc_fabric_login(struct ibmvfc_host *vhost)
>> +{
>> +	struct ibmvfc_fabric_login *mad;
>> +	struct ibmvfc_event *evt = ibmvfc_get_reserved_event(&vhost->crq);
>> +	int level = IBMVFC_DEFAULT_LOG_LEVEL;
>> +
>> +	if (!evt) {
>
> I think we need to hard reset here or we are dead in the water if there are no
> events.

I will add a hard reset here.

>> +		ibmvfc_log(vhost, level, "Fabric Login failed: no available events\n");
>> +		return;
>> +	}
>> +
>> +	ibmvfc_init_event(evt, ibmvfc_fabric_login_done, IBMVFC_MAD_FORMAT);
>> +	mad = &evt->iu.fabric_login;
>> +	memset(mad, 0, sizeof(*mad));
>> +	if (vhost->scsi_scrqs.protocol == IBMVFC_PROTO_SCSI)
>> +		mad->common.opcode = cpu_to_be32(IBMVFC_FABRIC_LOGIN);
>> +	else if (vhost->scsi_scrqs.protocol == IBMVFC_PROTO_NVME)
>> +		mad->common.opcode = cpu_to_be32(IBMVFC_NVMF_FABRIC_LOGIN);
>
> The VIOS won't return NVMF support unless we advertise it. So, I think its best
> to omit any NVMF releveant changes that are spec'd as they aren't being applied
> in a proper workflow here anyways. If the driver advertised both SCSI and NVMF
> support the current code would never do a NVMF fabric login as it would never
> fall through here.

Okay, I'll removed the NVMF changes.

>> +	else {
>> +		ibmvfc_log(vhost, level, "Fabric Login failed: unknown protocol\n");
>> +		return;
>> +	}
>> +	mad->common.version = cpu_to_be32(1);
>> +	mad->common.length = cpu_to_be16(sizeof(*mad));
>> +
>> +	ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_INIT_WAIT);
>> +
>> +	if (ibmvfc_send_event(evt, vhost, default_timeout))
>> +		ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN);
>> +}
>> +
>>  static void ibmvfc_channel_setup_done(struct ibmvfc_event *evt)
>>  {
>>  	struct ibmvfc_host *vhost = evt->vhost;
>> @@ -5251,8 +5334,12 @@ static void ibmvfc_channel_setup_done(struct ibmvfc_event *evt)
>>  		return;
>>  	}
>>  
>> -	ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
>> -	wake_up(&vhost->work_wait_q);
>> +	if (ibmvfc_check_caps(vhost, (IBMVFC_SUPPORT_SCSI | IBMVFC_SUPPORT_NVMEOF))) {
>> +		ibmvfc_fabric_login(vhost);
>
> Again drop the NVMEOF code.

Okay.

>> +	} else {
>> +		ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
>> +		wake_up(&vhost->work_wait_q);
>> +	}
>>  }
>>  
>>  static void ibmvfc_channel_setup(struct ibmvfc_host *vhost)
>> @@ -5443,9 +5530,12 @@ static void ibmvfc_npiv_login_done(struct ibmvfc_event *evt)
>>  	vhost->host->can_queue = be32_to_cpu(rsp->max_cmds) - IBMVFC_NUM_INTERNAL_REQ;
>>  	vhost->host->max_sectors = npiv_max_sectors;
>>  
>> -	if (ibmvfc_check_caps(vhost, IBMVFC_CAN_SUPPORT_CHANNELS) && vhost->do_enquiry) {
>> -		ibmvfc_channel_enquiry(vhost);
>> -	} else {
>> +	if (ibmvfc_check_caps(vhost, IBMVFC_CAN_SUPPORT_CHANNELS)) {
>> +		if (vhost->do_enquiry)
>> +			ibmvfc_channel_enquiry(vhost);
>
> I'm not sure I understand expanding this code to a second if block as there is
> no functional change.

Agreed.

>> +	} else if (ibmvfc_check_caps(vhost, (IBMVFC_SUPPORT_SCSI | IBMVFC_SUPPORT_NVMEOF)))
>
> Again drop NVMEOF and NVMF related changes.

Yes.

>> +		ibmvfc_fabric_login(vhost);
>> +	else {
>>  		vhost->do_enquiry = 0;
>>  		ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
>>  		wake_up(&vhost->work_wait_q);
>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
>> index cd0917f70c6d..4f680c5d9558 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvfc.h
>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.h
>> @@ -138,6 +138,8 @@ enum ibmvfc_mad_types {
>>  	IBMVFC_CHANNEL_ENQUIRY	= 0x1000,
>>  	IBMVFC_CHANNEL_SETUP	= 0x2000,
>>  	IBMVFC_CONNECTION_INFO	= 0x4000,
>> +	IBMVFC_FABRIC_LOGIN	= 0x8000,
>> +	IBMVFC_NVMF_FABRIC_LOGIN	= 0x8001,
>>  };
>>  
>>  struct ibmvfc_mad_common {
>> @@ -227,6 +229,8 @@ struct ibmvfc_npiv_login_resp {
>>  #define IBMVFC_MAD_VERSION_CAP		0x20
>>  #define IBMVFC_HANDLE_VF_WWPN		0x40
>>  #define IBMVFC_CAN_SUPPORT_CHANNELS	0x80
>> +#define IBMVFC_SUPPORT_NVMEOF		0x100
>> +#define IBMVFC_SUPPORT_SCSI		0x200
>>  #define IBMVFC_SUPPORT_NOOP_CMD		0x1000
>>  	__be32 max_cmds;
>>  	__be32 scsi_id_sz;
>> @@ -590,6 +594,19 @@ struct ibmvfc_connection_info {
>>  	__be64 reserved[16];
>>  } __packed __aligned(8);
>>  
>> +struct ibmvfc_fabric_login {
>> +	struct ibmvfc_mad_common common;
>> +	__be64 flags;
>> +#define IBMVFC_STRIP_MERGE	0x1
>> +#define IBMVFC_LINK_COMMANDS	0x2
>> +	__be64 capabilities;
>> +	__be64 nport_id;
>> +	__be16 status;
>> +	__be16 error;
>> +	__be32 pad;
>> +	__be64 reserved[16];
>> +} __packed __aligned(8);
>> +
>>  struct ibmvfc_trace_start_entry {
>>  	u32 xfer_len;
>>  } __packed;
>> @@ -709,6 +726,7 @@ union ibmvfc_iu {
>>  	struct ibmvfc_channel_enquiry channel_enquiry;
>>  	struct ibmvfc_channel_setup_mad channel_setup;
>>  	struct ibmvfc_connection_info connection_info;
>> +	struct ibmvfc_fabric_login fabric_login;
>>  } __packed __aligned(8);
>>  
>>  enum ibmvfc_target_action {
>> @@ -921,6 +939,8 @@ struct ibmvfc_host {
>>  	struct work_struct rport_add_work_q;
>>  	wait_queue_head_t init_wait_q;
>>  	wait_queue_head_t work_wait_q;
>> +	__be64 fabric_capabilities;
>> +	unsigned int login_cap_index;
>
> Lets drop these as they serve no purpose for Linux. If the spec changes to
> introduce capabilites releveant to Linux we can add it then.

Okay. You discussed fabric_capabilities. I think login_cap_index isn't
useful until patch 4 or 5, so I'll drop it from patch 4.

-Dave



^ permalink raw reply

* Re: [PATCH 2/5] ibmvfc: Add NOOP command support
From: Dave Marquardt @ 2026-05-07 22:25 UTC (permalink / raw)
  To: Tyrel Datwyler
  Cc: James E.J. Bottomley, Martin K. Petersen, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	linux-kernel, linux-scsi, linuxppc-dev, Brian King, Greg Joyce,
	Kyle Mahlkuch
In-Reply-To: <7da072cf-8774-4144-8888-b3d41af470f8@linux.ibm.com>

Tyrel Datwyler <tyreld@linux.ibm.com> writes:

> On 4/8/26 10:07 AM, Dave Marquardt via B4 Relay wrote:
>> From: Dave Marquardt <davemarq@linux.ibm.com>
>
>> - Add VFC_NOOP command support
>> - Add KUnit tests for VFC_NOOP command
>> ---
>>  drivers/scsi/ibmvscsi/ibmvfc.c       | 23 +++++++++++++----------
>>  drivers/scsi/ibmvscsi/ibmvfc.h       | 13 +++++++++++++
>>  drivers/scsi/ibmvscsi/ibmvfc_kunit.c | 27 +++++++++++++++++++++++++++
>>  3 files changed, 53 insertions(+), 10 deletions(-)
>> 
>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
>> index 3ac376ba2c62..808301fa452d 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
>> @@ -189,13 +189,6 @@ static long h_reg_sub_crq(unsigned long unit_address, unsigned long ioba,
>>  	return rc;
>>  }
>>  
>> -static int ibmvfc_check_caps(struct ibmvfc_host *vhost, unsigned long cap_flags)
>> -{
>> -	u64 host_caps = be64_to_cpu(vhost->login_buf->resp.capabilities);
>> -
>> -	return (host_caps & cap_flags) ? 1 : 0;
>> -}
>> -
>
> It appears you are moving this to ibmvfc.h? Is there reasoning outside making it
> visible to kunit?

That's exactly why I'm moving this to ibmvfc.h. Alternatively I could
export this routine. I'm okay doing that if you prefer it.

>>  static struct ibmvfc_fcp_cmd_iu *ibmvfc_get_fcp_iu(struct ibmvfc_host *vhost,
>>  						   struct ibmvfc_cmd *vfc_cmd)
>>  {
>> @@ -1512,7 +1505,9 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
>>  		login_info->flags |= cpu_to_be16(IBMVFC_CLIENT_MIGRATED);
>>  
>>  	login_info->max_cmds = cpu_to_be32(max_cmds);
>> -	login_info->capabilities = cpu_to_be64(IBMVFC_CAN_MIGRATE | IBMVFC_CAN_SEND_VF_WWPN);
>> +	login_info->capabilities =
>> +		cpu_to_be64(IBMVFC_CAN_MIGRATE | IBMVFC_CAN_SEND_VF_WWPN |
>> +			    IBMVFC_CAN_USE_NOOP_CMD);
>>  
>>  	if (vhost->mq_enabled || vhost->using_channels)
>>  		login_info->capabilities |= cpu_to_be64(IBMVFC_CAN_USE_CHANNELS);
>> @@ -3461,8 +3456,8 @@ EXPORT_SYMBOL_IF_KUNIT(ibmvfc_handle_async);
>>   * @evt_doneq:	Event done queue
>>   *
>>  **/
>> -static void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost,
>> -			      struct list_head *evt_doneq)
>> +VISIBLE_IF_KUNIT void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost,
>> +					struct list_head *evt_doneq)
>>  {
>>  	long rc;
>>  	struct ibmvfc_event *evt = (struct ibmvfc_event *)be64_to_cpu(crq->ioba);
>> @@ -3520,6 +3515,13 @@ static void ibmvfc_handle_crq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost,
>>  	if (crq->format == IBMVFC_ASYNC_EVENT)
>>  		return;
>>  
>> +	if (crq->format == IBMVFC_VFC_NOOP) {
>> +		if (!ibmvfc_check_caps(vhost, IBMVFC_SUPPORT_NOOP_CMD))
>> +			dev_err(vhost->dev,
>> +				"Received unexpected NOOP command from partner\n");
>
> If we have a misbahaved VIOS partner we may want to ratelimit this dev_err so
> that we don't flood the log. Probably a corner case, but I don't think it hurts.

Okay, I'll add rate limiting in v2.

-Dave


^ permalink raw reply

* Re: [PATCH 1/5] ibmvfc: add basic FPIN support
From: Dave Marquardt @ 2026-05-07 22:22 UTC (permalink / raw)
  To: Tyrel Datwyler
  Cc: James E.J. Bottomley, Martin K. Petersen, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	linux-kernel, linux-scsi, linuxppc-dev, Brian King, Greg Joyce,
	Kyle Mahlkuch
In-Reply-To: <291e79ea-d993-45e4-877e-4c25336c3076@linux.ibm.com>

Tyrel Datwyler <tyreld@linux.ibm.com> writes:

> On 4/8/26 10:07 AM, Dave Marquardt via B4 Relay wrote:
>> From: Dave Marquardt <davemarq@linux.ibm.com>
>> 
>> - Add FPIN event descriptor
>> - Add congestion cleared status
>> - Add code to handle basic FPIN async event
>> - Add KUnit tests
>
> You need a more detailed description of your changes here for the commit log body.
>
> You will also need a signed off tag from yourself for this to even be merged.

Odd. I used b4 to prepare the patch set, and it should have added a
Signed-off-by: tag. I'll double check it next round.

> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>
>> ---
>>  drivers/scsi/Kconfig                 |  10 ++
>>  drivers/scsi/ibmvscsi/Makefile       |   1 +
>>  drivers/scsi/ibmvscsi/ibmvfc.c       | 189 ++++++++++++++++++++++++++++++++++-
>>  drivers/scsi/ibmvscsi/ibmvfc.h       |   9 ++
>>  drivers/scsi/ibmvscsi/ibmvfc_kunit.c |  95 ++++++++++++++++++
>>  5 files changed, 302 insertions(+), 2 deletions(-)
>
> <snip>
>
>> +static struct fc_els_fpin *
>> +ibmvfc_common_fpin_to_desc(u8 fpin_status, __be64 wwpn, __be16 modifier,
>> +			   __be32 period, __be32 threshold, __be32 event_count)
>> +{
>> +	struct fc_fn_peer_congn_desc *pdesc;
>> +	struct fc_fn_congn_desc *cdesc;
>> +	struct fc_fn_li_desc *ldesc;
>> +	struct fc_els_fpin *fpin;
>> +	size_t size;
>> +
>> +	size = ibmvfc_fpin_size_helper(fpin_status);
>> +	if (size == 0)
>> +		return NULL;
>> +
>> +	fpin = kzalloc(size, GFP_KERNEL);
>
> This appears to be called by ibmvfc_handle_async() with runs in atomic context
> and cannot therefore sleep. This allocation needs to be GFP_ATOMIC. Although
> there is another issue below that might make this moot.

Noted. As to the problem below, once this is moved to running in a work
queue worker thread, this can stay as is.

>> +	if (fpin == NULL)
>> +		return NULL;
>> +
>> +	fpin->fpin_cmd = ELS_FPIN;
>> +
>> +	switch (fpin_status) {
>> +	case IBMVFC_AE_FPIN_CONGESTION_CLEARED:
>> +	case IBMVFC_AE_FPIN_LINK_CONGESTED:
>> +		fpin->desc_len = cpu_to_be32(sizeof(struct fc_fn_congn_desc));
>> +		cdesc = (struct fc_fn_congn_desc *)fpin->fpin_desc;
>> +		cdesc->desc_tag = cpu_to_be32(ELS_DTAG_CONGESTION);
>> +		cdesc->desc_len = cpu_to_be32(FC_TLV_DESC_LENGTH_FROM_SZ(*cdesc));
>> +		if (fpin_status == IBMVFC_AE_FPIN_CONGESTION_CLEARED)
>> +			cdesc->event_type = cpu_to_be16(FPIN_CONGN_CLEAR);
>> +		else
>> +			cdesc->event_type = cpu_to_be16(FPIN_CONGN_DEVICE_SPEC);
>> +		cdesc->event_modifier = modifier;
>> +		cdesc->event_period = period;
>> +		cdesc->severity = FPIN_CONGN_SEVERITY_WARNING;
>> +		break;
>> +	case IBMVFC_AE_FPIN_PORT_CONGESTED:
>> +	case IBMVFC_AE_FPIN_PORT_CLEARED:
>> +		fpin->desc_len = cpu_to_be32(sizeof(struct fc_fn_peer_congn_desc));
>> +		pdesc = (struct fc_fn_peer_congn_desc *)fpin->fpin_desc;
>> +		pdesc->desc_tag = cpu_to_be32(ELS_DTAG_PEER_CONGEST);
>> +		pdesc->desc_len = cpu_to_be32(FC_TLV_DESC_LENGTH_FROM_SZ(*pdesc));
>> +		if (fpin_status == IBMVFC_AE_FPIN_PORT_CLEARED)
>> +			pdesc->event_type = cpu_to_be16(FPIN_CONGN_CLEAR);
>> +		else
>> +			pdesc->event_type = cpu_to_be16(FPIN_CONGN_DEVICE_SPEC);
>> +		pdesc->event_modifier = modifier;
>> +		pdesc->event_period = period;
>> +		pdesc->detecting_wwpn = cpu_to_be64(0);
>> +		pdesc->attached_wwpn = wwpn;
>> +		pdesc->pname_count = cpu_to_be32(1);
>> +		pdesc->pname_list[0] = wwpn;
>> +		break;
>> +	case IBMVFC_AE_FPIN_PORT_DEGRADED:
>> +		fpin->desc_len = cpu_to_be32(sizeof(struct fc_fn_li_desc));
>> +		ldesc = (struct fc_fn_li_desc *)fpin->fpin_desc;
>> +		ldesc->desc_tag = cpu_to_be32(ELS_DTAG_LNK_INTEGRITY);
>> +		ldesc->desc_len = cpu_to_be32(FC_TLV_DESC_LENGTH_FROM_SZ(*ldesc));
>> +		ldesc->event_type = cpu_to_be16(FPIN_LI_UNKNOWN);
>> +		ldesc->event_modifier = modifier;
>> +		ldesc->event_threshold = threshold;
>> +		ldesc->event_count = event_count;
>> +		ldesc->detecting_wwpn = cpu_to_be64(0);
>> +		ldesc->attached_wwpn = wwpn;
>> +		ldesc->pname_count = cpu_to_be32(1);
>> +		ldesc->pname_list[0] = wwpn;
>> +		break;
>> +	default:
>> +		/* This should be caught above. */
>> +		kfree(fpin);
>> +		fpin = NULL;
>> +		break;
>> +	}
>> +
>> +	return fpin;
>> +}
>> +
>> +/**
>> + * ibmvfc_basic_fpin_to_desc(): allocate and populate a struct fc_els_fpin struct
>> + * containing a descriptor.
>> + * @ibmvfc_fpin: Pointer to async crq
>> + *
>> + * Allocate a struct fc_els_fpin containing a descriptor and populate
>> + * based on data from *ibmvfc_fpin.
>> + *
>> + * Return:
>> + * NULL     - unable to allocate structure
>> + * non-NULL - pointer to populated struct fc_els_fpin
>> + */
>> +static struct fc_els_fpin *
>> +/*XXX*/ibmvfc_basic_fpin_to_desc(struct ibmvfc_async_crq *crq)
>
> What is with this /*XXX*/? I can't find it once I apply the patchset so I assume
> its removed in a later patch, but it should be removed here.

An artifact I missed in squashing commits. Thanks for pointing it out.

>> +{
>> +	return ibmvfc_common_fpin_to_desc(crq->fpin_status, crq->wwpn,
>> +					  cpu_to_be16(0),
>> +					  cpu_to_be32(IBMVFC_FPIN_DEFAULT_EVENT_PERIOD),
>> +					  cpu_to_be32(IBMVFC_FPIN_DEFAULT_EVENT_THRESHOLD),
>> +					  cpu_to_be32(1));
>> +}
>> +
>>  /**
>>   * ibmvfc_handle_async - Handle an async event from the adapter
>>   * @crq:	crq to process
>>   * @vhost:	ibmvfc host struct
>>   *
>>   **/
>> -static void ibmvfc_handle_async(struct ibmvfc_async_crq *crq,
>> -				struct ibmvfc_host *vhost)
>> +VISIBLE_IF_KUNIT void ibmvfc_handle_async(struct ibmvfc_async_crq *crq,
>> +					  struct ibmvfc_host *vhost)
>>  {
>>  	const struct ibmvfc_async_desc *desc = ibmvfc_get_ae_desc(be64_to_cpu(crq->event));
>>  	struct ibmvfc_target *tgt;
>> +	struct fc_els_fpin *fpin;
>>  
>>  	ibmvfc_log(vhost, desc->log_level, "%s event received. scsi_id: %llx, wwpn: %llx,"
>>  		   " node_name: %llx%s\n", desc->desc, be64_to_cpu(crq->scsi_id),
>> @@ -3269,11 +3422,37 @@ static void ibmvfc_handle_async(struct ibmvfc_async_crq *crq,
>>  	case IBMVFC_AE_HALT:
>>  		ibmvfc_link_down(vhost, IBMVFC_HALTED);
>>  		break;
>> +	case IBMVFC_AE_FPIN:
>> +		if (!crq->scsi_id && !crq->wwpn && !crq->node_name)
>> +			break;
>> +		list_for_each_entry(tgt, &vhost->targets, queue) {
>> +			if (crq->scsi_id && cpu_to_be64(tgt->scsi_id) != crq->scsi_id)
>> +				continue;
>> +			if (crq->wwpn && cpu_to_be64(tgt->ids.port_name) != crq->wwpn)
>> +				continue;
>> +			if (crq->node_name && cpu_to_be64(tgt->ids.node_name) != crq->node_name)
>> +				continue;
>> +			if (!tgt->rport)
>> +				continue;
>> +			fpin = ibmvfc_basic_fpin_to_desc(crq);
>> +			if (fpin) {
>> +				fc_host_fpin_rcv(tgt->vhost->host,
>> +						 sizeof(*fpin) +
>> +						       be32_to_cpu(fpin->desc_len),
>> +						 (char *)fpin, 0);
>
> This call to fc_host_fpin_rcv() appears to be problematic as it assumes no locks
> are held, but ibmvfc_handle_async() is called with the scsi host lock held. We
> already do a lot more work than we probaly should in our interrupt handler. I
> think we maybe need to pass the FPIN work off to a workqueue instead to be
> handled in process context instead.

Agreed. I'm working on adding a work queue for offloading this FPIN
processing.

-Dave


^ permalink raw reply

* Re: [PATCH 0/5] ibmvfc: make ibmvfc support FPIN messages
From: Dave Marquardt @ 2026-05-07 22:15 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Dave Marquardt via B4 Relay, James E.J. Bottomley,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Tyrel Datwyler, linux-kernel,
	linux-scsi, linuxppc-dev, Brian King, Greg Joyce, Kyle Mahlkuch
In-Reply-To: <yq1a4uke7rz.fsf@ca-mkp.ca.oracle.com>

"Martin K. Petersen" <martin.petersen@oracle.com> writes:

> Dave,
>
>> This patch series adds FPIN (fabric performance impact notification)
>> support to the ibmvfc (IBM Virtual Fibre Channel) driver. This comes
>> in three flavors:
>
> https://sashiko.dev/#/patchset/20260408-ibmvfc-fpin-support-v1-0-52b06c464e03%40linux.ibm.com

Thanks for this. I'm working through the comments and fixing things up
before sending out a v2 patch series.

-Dave


^ permalink raw reply

* [PATCH v2] powerpc/pseries/iommu: export DMA window data to user space
From: Gaurav Batra @ 2026-05-07 18:06 UTC (permalink / raw)
  To: maddy; +Cc: linuxppc-dev, sbhat, vaibhav, ritesh.list, Gaurav Batra,
	Brian King

Export PowerPC DMA window information (both default 2GB and Dynamic
larger window) to user space via sysfs. Each of these DMA windows has
attributes like size of the window, page size backing the window, mode,
etc. Each of these atributes is exported for user space consumption as a
file.

PowerPC Host Bridge (PHB) can have multiple devices/functions sharing
the same DMA window. For each PHB, iommu registration creates an iommu
device under "/sys/devices/virtual/iommu".

These devices will have 2 groups created to export Default and DDW
attributes.

Reviewed-by: Brian King <brking@linux.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
---
V1 -> V2 change log:

1. Shiva: "weight" the it_map for the bitmap. This avoids using an extra
   counter in the table. Please look into how iommu_debugfs_weight_get()
   does this

   Response: Incorporated changes

2. Vaibhav: If the DMA window is not available, show function should just
   return ENOENT so that userspace know the error instantly instead of
   having to parse the sysfs contents.

   Response: Incorporated changes, returning ENODATA

3. Vaibhav: All the show functions have similar template. Please convert
   them to macros expansion to reduce code volume.

   Response: Incorporated changes

4. Vaibhav: These new attributes are PSeries specific but they are being
   setup in ppc generic iommu code at arch/powerpc/kernel/iommu.c. Can
   you move these attributes to arch/powerpc/platforms/pseries/iommu.c

   Response: I have split the attributes and moved them to pseries specific
   files. The original group "spapr-tce-iommu", is moved to PowerNV code
   base to retain the legacy functionality.

   I tested the changes both on Pseries and PowerNV.

5. Vaibhav: It would be better to use function iommu_table_inuse_tces() as
   a callback in iommu_table_ops which can be implemented by pseries and
   powernv code differently.

   Response: the function is no longer needed after changes in #1

6. Vaibhav: Since sysfs is ABI can you propose appropriate entries under
   Documentation/ABI/testing

   Response: Added documentation

 ...sfs-devices-virtual-iommu-dma_window_attrs |  21 ++
 .../arch/powerpc/dma_window_attributes.rst    |  65 +++++
 arch/powerpc/include/asm/pci-bridge.h         |   4 +
 arch/powerpc/kernel/iommu.c                   |  16 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     |  16 ++
 arch/powerpc/platforms/pseries/iommu.c        | 261 ++++++++++++++++++
 arch/powerpc/platforms/pseries/pci_dlpar.c    |   2 +
 arch/powerpc/platforms/pseries/pseries.h      |   1 +
 arch/powerpc/platforms/pseries/setup.c        |   2 +
 9 files changed, 373 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs
 create mode 100644 Documentation/arch/powerpc/dma_window_attributes.rst

diff --git a/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs
new file mode 100644
index 000000000000..18ba63874276
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs
@@ -0,0 +1,21 @@
+What:       /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-ddw/*
+Date:       Oct 2025
+Contact:    linuxppc-dev@lists.ozlabs.org
+Description:    read only
+    For each IOMMU isolation unit spapr-tce-ddw sub-directory provides
+    attributes to query information related to the bigger Dynamic DMA
+    window (DDW) in the PowerPC virtualized platforms.
+
+    See Documentation/arch/powerpc/dma_window_attributes.rst for more
+    information.
+
+What:       /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-dma/*
+Date:       Oct 2025
+Contact:    linuxppc-dev@lists.ozlabs.org
+Description:    read only
+    For each IOMMU isolation unit spapr-tce-dma sub-directory provides
+    attributes to query information related to the default 2GB DMA
+    window in the PowerPC virtualized platforms.
+
+    See Documentation/arch/powerpc/dma_window_attributes.rst for more
+    information.
diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst b/Documentation/arch/powerpc/dma_window_attributes.rst
new file mode 100644
index 000000000000..8bd9aec8539d
--- /dev/null
+++ b/Documentation/arch/powerpc/dma_window_attributes.rst
@@ -0,0 +1,65 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+DMA Window Attributes
+=====================
+
+In PowerPC architecture there are 2 types of DMA windows -
+
+1. Default 2GB DMA window which is backed by 4K page size
+2. A bigger Dynamic DMA Window (DDW) which is backed by larger page size
+   (64K or 2MB)
+
+A dedicated device will have both the DMA windows instantiated but an SR-IOV
+device will only have the bigger Dynamic DMA Window.
+
+The attributes of these 2 DMA windows are exported to user space via sysfs.
+Each IOMMU isolation unit will have its directory created under
+/sys/devices/virtual/iommu.
+
+As an exapmple, iommu-phb0001
+
+Under each IOMMU isolation unit, there will be a group of attributes for
+"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and
+spapr-tce-ddw respectively.
+
+Attributes under each group
+
+spapr-tce-ddw:
+direct_address  dynamic_address       dynamic_size  window_type
+direct_size     dynamic_pages_mapped  page_size
+
+spapr-tce-dma:
+dynamic_address  dynamic_pages_mapped  dynamic_size  page_size
+
+
+The bigger Dynamic DMA Window is configured into pre-mapped and/or dynamically
+allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct
+(pre-mapped) and Dynamic part of the DMA window will have valid values. Hybrid
+mode is valid only for SR-IOV devices.
+
+DMA Window properties:
+
+direct_address              Starting address of the pre-mapped DMA window
+direct_size                 Size of the pre-mapped DMA Window
+dynamic_address             Starting address of the dynamic allocations
+dynamic_size                Size of the dynamic allocation window
+dynamic_pages_mapped        Pages mapped for DMA by dynamic allocations
+page_size                   Page size backing the DMA window
+window_type                 Type of the DMA Window (Direct/Dynamic/Hybrid)
+
+
+An example of DDW attributes for an SR-IOV device::
+
+    $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw
+
+    $ grep . *
+
+    direct_address:0x800000000000000   <-- Starting addr of pre-mapped Window
+    direct_size:137438953472           <-- Size of pre-mapped Window (128GB)
+    dynamic_address:0x800002000000000  <-- Starting addr of Dynamic allocations
+    dynamic_size:412316860416          <-- Size of dynamic allocation window (384GB)
+    dynamic_pages_mapped:270           <-- Pages mapped by dynamic allocations
+    page_size:2097152                  <-- DMA window page size (2MB)
+    window_type:Hybrid                 <-- window has both pre-mapped and
+                                           dynamic sections
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1dae53130782..9b09178aca5e 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -124,6 +124,10 @@ struct pci_controller {
 	resource_size_t dma_window_base_cur;
 	resource_size_t dma_window_size;
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+	const struct attribute_group **iommu_groups;
+#endif
+
 #ifdef CONFIG_PPC64
 	unsigned long buid;
 	struct pci_dn *pci_data;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0ce71310b7d9..d6242e3f77da 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1269,24 +1269,10 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
 	.device_group = spapr_tce_iommu_device_group,
 };
 
-static struct attribute *spapr_tce_iommu_attrs[] = {
-	NULL,
-};
-
-static struct attribute_group spapr_tce_iommu_group = {
-	.name = "spapr-tce-iommu",
-	.attrs = spapr_tce_iommu_attrs,
-};
-
-static const struct attribute_group *spapr_tce_iommu_groups[] = {
-	&spapr_tce_iommu_group,
-	NULL,
-};
-
 void ppc_iommu_register_device(struct pci_controller *phb)
 {
 	iommu_device_sysfs_add(&phb->iommu, phb->parent,
-				spapr_tce_iommu_groups, "iommu-phb%04x",
+				phb->iommu_groups, "iommu-phb%04x",
 				phb->global_number);
 	iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops,
 				phb->parent);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1c78fdfb7b03..0887f154955e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2493,6 +2493,20 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
 	.shutdown		= pnv_pci_ioda_shutdown,
 };
 
+static struct attribute *pnv_tce_iommu_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group pnv_tce_iommu_group = {
+	.name = "spapr-tce-iommu",
+	.attrs = pnv_tce_iommu_attrs,
+};
+
+static const struct attribute_group *pnv_tce_iommu_groups[] = {
+	&pnv_tce_iommu_group,
+	NULL,
+};
+
 static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 					 u64 hub_id, int ioda_type)
 {
@@ -2697,6 +2711,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 		hose->controller_ops = pnv_pci_ioda_controller_ops;
 	}
 
+	hose->iommu_groups = pnv_tce_iommu_groups;
+
 	ppc_md.pcibios_default_alignment = pnv_pci_default_alignment;
 
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 5497b130e026..28be7a45761d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -56,6 +56,20 @@ enum {
 	DDW_EXT_LIMITED_ADDR_MODE = 3
 };
 
+/* used by sysfs when querying Dynamic/Default DMA Window data */
+struct dma_win_data {
+	u32     page_size;
+	u64     direct_address;
+	u64     direct_size;
+	u64     dynamic_address;
+	u64     dynamic_size;
+	u32     dynamic_pages_mapped;
+	char    window_type[15];
+};
+
+#define SPAPR_SUCCESS		0
+#define SPAPR_ERROR			-1
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
 	struct iommu_table *tbl;
@@ -837,6 +851,253 @@ static struct device_node *pci_dma_find(struct device_node *dn,
 	return rdn;
 }
 
+/* Get DDW information for the device */
+static int gather_ddw_info(struct device *dev, struct dma_win_data *data)
+{
+	struct iommu_device *iommu;
+	struct pci_controller *phb;
+	struct device_node *dn;
+	struct pci_dn *pci;
+	const __be32 *prop = NULL;
+	bool ddw_direct = false;
+	bool found = false;
+	struct iommu_table *tbl;
+	u32 pgshift;
+	struct dynamic_dma_window_prop *p;
+
+	memset(data, 0, sizeof(*data));
+
+	iommu = dev_get_drvdata(dev);
+	phb = container_of(iommu, struct pci_controller, iommu);
+	dn = phb->dn;
+
+	if (!dn)
+		return SPAPR_ERROR;
+
+	pci = PCI_DN(dn);
+	if (!pci || !pci->table_group)
+		return SPAPR_ERROR;
+
+	/* Find DDW */
+	prop = of_get_property(dn, DIRECT64_PROPNAME, NULL);
+	if (prop) {
+		ddw_direct = true;
+		found = true;
+	} else {
+		prop = of_get_property(dn, DMA64_PROPNAME, NULL);
+		if (prop)
+			found = true;
+	}
+
+	/* NO DDW */
+	if (!found)
+		return SPAPR_ERROR;
+
+	p = (struct dynamic_dma_window_prop *)prop;
+
+	pgshift = be32_to_cpu(p->tce_shift);
+	if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15)
+		data->page_size = 0;
+	else
+		data->page_size = 1 << pgshift;
+
+	/* Check if DDW has table associated with it. Having a table associated with
+	 * DDW is indicative that is has some dynamic TCE allocations. In this case the
+	 * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is on index 0,
+	 * for dedicated adapter on index 1.
+	 */
+	found = false;
+	for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		tbl = pci->table_group->tables[i];
+
+		if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) {
+			found = true;
+			break;
+		}
+	}
+
+	/* set the parameters depnding on the DDW type */
+	if (ddw_direct && found) {          /* Hybrid */
+		data->direct_address = be64_to_cpu(p->dma_base);
+		data->dynamic_size = (u64)(tbl->it_size << tbl->it_page_shift);
+
+		data->dynamic_address = data->direct_address
+								+ (u64)(1UL << be32_to_cpu(p->window_shift))
+								- data->dynamic_size;
+
+		data->direct_size = data->dynamic_address - data->direct_address;
+		data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+		sprintf(data->window_type, "%s", "Hybrid");
+	} else if (ddw_direct && !found) {    /* Direct */
+		data->direct_address = be64_to_cpu(p->dma_base);
+		data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift));
+
+		sprintf(data->window_type, "%s", "Direct");
+	} else {                              /* Dynamic */
+		data->dynamic_address = be64_to_cpu(p->dma_base);
+		data->dynamic_size = (u64)(1UL << be32_to_cpu(p->window_shift));
+		data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+		sprintf(data->window_type, "%s", "Dynamic");
+	}
+
+	return SPAPR_SUCCESS;
+}
+
+/* Get DDW information for the device */
+static int gather_dma_info(struct device *dev, struct dma_win_data *data)
+{
+	struct iommu_device *iommu;
+	struct pci_controller *phb;
+	struct device_node *dn;
+	struct pci_dn *pci;
+	const __be32 *prop = NULL;
+	struct iommu_table *tbl;
+	unsigned long offset, size, liobn;
+
+	memset(data, 0, sizeof(*data));
+
+	iommu = dev_get_drvdata(dev);
+	phb = container_of(iommu, struct pci_controller, iommu);
+	dn = phb->dn;
+
+	if (!dn)
+		return SPAPR_ERROR;
+
+	pci = PCI_DN(dn);
+	if (!pci || !pci->table_group)
+		return SPAPR_ERROR;
+
+	/* search for default DMA window */
+	prop = of_get_property(dn, "ibm,dma-window", NULL);
+
+	if (!prop)
+		return SPAPR_ERROR;
+
+	/* default DMA Window is always at index 0 */
+	tbl = pci->table_group->tables[0];
+	if (!tbl)
+		return SPAPR_ERROR;
+
+	of_parse_dma_window(dn, prop, &liobn, &offset, &size);
+
+	data->dynamic_address = offset;
+	data->dynamic_size = size;
+	data->page_size = 1ULL << IOMMU_PAGE_SHIFT_4K;
+	data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+	return SPAPR_SUCCESS;
+}
+
+#define DEVICE_SHOW_DDW(_name, _fmt)							\
+ssize_t ddw_##_name##_show(struct device *dev,					\
+								  struct device_attribute *attr,\
+								  char *buf)					\
+{																\
+	int rc = 0;													\
+	struct dma_win_data data;									\
+																\
+	rc = gather_ddw_info(dev, &data);							\
+																\
+	if (rc == SPAPR_SUCCESS)									\
+		return sysfs_emit(buf, _fmt, data._name);				\
+	else														\
+		return -ENODATA;										\
+}																\
+
+#define DEVICE_SHOW_DMA(_name, _fmt)							\
+ssize_t dma_##_name##_show(struct device *dev,					\
+								  struct device_attribute *attr,\
+								  char *buf)					\
+{																\
+	int rc = 0;													\
+	struct dma_win_data data;									\
+																\
+	rc = gather_dma_info(dev, &data);							\
+																\
+	if (rc == SPAPR_SUCCESS)									\
+		return sysfs_emit(buf, _fmt, data._name);				\
+	else														\
+		return -ENODATA;										\
+}																\
+
+static DEVICE_SHOW_DDW(direct_address, "%#llx\n");
+static DEVICE_SHOW_DDW(direct_size, "%lld\n");
+static DEVICE_SHOW_DDW(page_size, "%d\n");
+static DEVICE_SHOW_DDW(window_type, "%s\n");
+static DEVICE_SHOW_DDW(dynamic_address, "%#llx\n");
+static DEVICE_SHOW_DDW(dynamic_size, "%lld\n");
+static DEVICE_SHOW_DDW(dynamic_pages_mapped, "%d\n");
+static DEVICE_SHOW_DMA(dynamic_address, "%#llx\n");
+static DEVICE_SHOW_DMA(dynamic_size, "%lld\n");
+static DEVICE_SHOW_DMA(page_size, "%d\n");
+static DEVICE_SHOW_DMA(dynamic_pages_mapped, "%d\n");
+
+#define DEVICE_ATTR_DDW(_name)                              \
+		struct device_attribute dev_attr_ddw_##_name =      \
+			__ATTR(_name, 0444, ddw_##_name##_show, NULL)
+#define DEVICE_ATTR_DMA(_name)                              \
+		struct device_attribute dev_attr_dma_##_name =      \
+		__ATTR(_name, 0444, dma_##_name##_show, NULL)
+
+static DEVICE_ATTR_DDW(direct_address);
+static DEVICE_ATTR_DDW(direct_size);
+static DEVICE_ATTR_DDW(page_size);
+static DEVICE_ATTR_DDW(window_type);
+static DEVICE_ATTR_DDW(dynamic_address);
+static DEVICE_ATTR_DDW(dynamic_size);
+static DEVICE_ATTR_DDW(dynamic_pages_mapped);
+static DEVICE_ATTR_DMA(dynamic_address);
+static DEVICE_ATTR_DMA(dynamic_size);
+static DEVICE_ATTR_DMA(page_size);
+static DEVICE_ATTR_DMA(dynamic_pages_mapped);
+
+static struct attribute *spapr_tce_ddw_attrs[] = {
+	&dev_attr_ddw_direct_address.attr,
+	&dev_attr_ddw_direct_size.attr,
+	&dev_attr_ddw_page_size.attr,
+	&dev_attr_ddw_window_type.attr,
+	&dev_attr_ddw_dynamic_address.attr,
+	&dev_attr_ddw_dynamic_size.attr,
+	&dev_attr_ddw_dynamic_pages_mapped.attr,
+	NULL,
+};
+
+static struct attribute *spapr_tce_dma_attrs[] = {
+	&dev_attr_dma_dynamic_address.attr,
+	&dev_attr_dma_dynamic_size.attr,
+	&dev_attr_dma_page_size.attr,
+	&dev_attr_dma_dynamic_pages_mapped.attr,
+	NULL,
+};
+
+static struct attribute_group spapr_tce_ddw_group = {
+	.name = "spapr-tce-ddw",
+	.attrs = spapr_tce_ddw_attrs,
+};
+
+static struct attribute_group spapr_tce_dma_group = {
+	.name = "spapr-tce-dma",
+	.attrs = spapr_tce_dma_attrs,
+};
+
+static struct attribute *spapr_tce_iommu_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group spapr_tce_iommu_group = {
+	.name = "spapr-tce-iommu",
+	.attrs = spapr_tce_iommu_attrs,
+};
+
+const struct attribute_group *spapr_tce_iommu_groups[] = {
+	&spapr_tce_iommu_group,
+	&spapr_tce_ddw_group,
+	&spapr_tce_dma_group,
+	NULL,
+};
+
 static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 {
 	struct iommu_table *tbl;
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 8c77ec7980de..b457451a2814 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -45,6 +45,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
 	pci_process_bridge_OF_ranges(phb, dn, 0);
 	phb->controller_ops = pseries_pci_controller_ops;
 
+	phb->iommu_groups = spapr_tce_iommu_groups;
+
 	pci_devs_phb_init_dynamic(phb);
 
 	pseries_msi_allocate_domains(phb);
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 3968a6970fa8..4cf0b7a4e96a 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -128,4 +128,5 @@ struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
 					     struct pci_dev *pdev);
 #endif
 
+extern const struct attribute_group *spapr_tce_iommu_groups[];
 #endif /* _PSERIES_PSERIES_H */
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 50b26ed8432d..4d877aae0560 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -512,6 +512,8 @@ static void __init pSeries_discover_phbs(void)
 		isa_bridge_find_early(phb);
 		phb->controller_ops = pseries_pci_controller_ops;
 
+		phb->iommu_groups = spapr_tce_iommu_groups;
+
 		/* create pci_dn's for DT nodes under this PHB */
 		pci_devs_phb_init_dynamic(phb);
 
base-commit: 192c0159402e6bfbe13de6f8379546943297783d
-- 
2.39.3



^ permalink raw reply related

* Re: [PATCH] powerpc/pseries/iommu: export DMA window data to user space
From: Gaurav Batra @ 2026-05-07 17:22 UTC (permalink / raw)
  To: Vaibhav Jain, maddy; +Cc: linuxppc-dev, sbhat, Brian King
In-Reply-To: <878qbvfzvw.fsf@vajain21.in.ibm.com>

Hello Vaibhav/Shiva,


Thanks a lot for taking time to review the patch and for 
comments/suggestions. I have incorporated the changes. Will be sending 
out V2.


Gaurav

On 3/13/26 11:38 AM, Vaibhav Jain wrote:
> Hi Gaurav,
>
> Thanks for the patch. Few review comments inline below:
>
> Gaurav Batra <gbatra@linux.ibm.com> writes:
>
>> Export PowerPC DMA window information (both default 2GB and Dynamic
>> larger window) to user space via sysfs. Each of these DMA windows has
>> attributes like size of the window, page size backing the window, mode,
>> etc. Each of these atributes is exported for user space consumption as a
>> file.
>>
>> PowerPC Host Bridge (PHB) can have multiple devices/functions sharing
>> the same DMA window. For each PHB, iommu registration creates an iommu
>> device under "/sys/devices/virtual/iommu".
>>
>> These devices will have 2 groups created to export Default and DDW
>> attributes.
>>
>> Reviewed-by: Brian King <brking@linux.ibm.com>
>> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
>> ---
>>   .../arch/powerpc/dma_window_attributes.rst    |  65 +++++
>>   arch/powerpc/include/asm/iommu.h              |  20 ++
>>   arch/powerpc/kernel/iommu.c                   | 235 ++++++++++++++++++
>>   arch/powerpc/platforms/pseries/iommu.c        | 156 ++++++++++++
>>   4 files changed, 476 insertions(+)
>>   create mode 100644 Documentation/arch/powerpc/dma_window_attributes.rst
>>
>> diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst b/Documentation/arch/powerpc/dma_window_attributes.rst
>> new file mode 100644
>> index 000000000000..8bd9aec8539d
>> --- /dev/null
>> +++ b/Documentation/arch/powerpc/dma_window_attributes.rst
>> @@ -0,0 +1,65 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================
>> +DMA Window Attributes
>> +=====================
>> +
>> +In PowerPC architecture there are 2 types of DMA windows -
>> +
>> +1. Default 2GB DMA window which is backed by 4K page size
>> +2. A bigger Dynamic DMA Window (DDW) which is backed by larger page size
>> +   (64K or 2MB)
>> +
>> +A dedicated device will have both the DMA windows instantiated but an SR-IOV
>> +device will only have the bigger Dynamic DMA Window.
>> +
>> +The attributes of these 2 DMA windows are exported to user space via sysfs.
>> +Each IOMMU isolation unit will have its directory created under
>> +/sys/devices/virtual/iommu.
>> +
>> +As an exapmple, iommu-phb0001
>> +
>> +Under each IOMMU isolation unit, there will be a group of attributes for
>> +"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and
>> +spapr-tce-ddw respectively.
>> +
>> +Attributes under each group
>> +
>> +spapr-tce-ddw:
>> +direct_address  dynamic_address       dynamic_size  window_type
>> +direct_size     dynamic_pages_mapped  page_size
>> +
>> +spapr-tce-dma:
>> +dynamic_address  dynamic_pages_mapped  dynamic_size  page_size
>> +
>> +
>> +The bigger Dynamic DMA Window is configured into pre-mapped and/or dynamically
>> +allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct
>> +(pre-mapped) and Dynamic part of the DMA window will have valid values. Hybrid
>> +mode is valid only for SR-IOV devices.
>> +
>> +DMA Window properties:
>> +
>> +direct_address              Starting address of the pre-mapped DMA window
>> +direct_size                 Size of the pre-mapped DMA Window
>> +dynamic_address             Starting address of the dynamic allocations
>> +dynamic_size                Size of the dynamic allocation window
>> +dynamic_pages_mapped        Pages mapped for DMA by dynamic allocations
>> +page_size                   Page size backing the DMA window
>> +window_type                 Type of the DMA Window (Direct/Dynamic/Hybrid)
>> +
>> +
>> +An example of DDW attributes for an SR-IOV device::
>> +
>> +    $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw
>> +
>> +    $ grep . *
>> +
>> +    direct_address:0x800000000000000   <-- Starting addr of pre-mapped Window
>> +    direct_size:137438953472           <-- Size of pre-mapped Window (128GB)
>> +    dynamic_address:0x800002000000000  <-- Starting addr of Dynamic allocations
>> +    dynamic_size:412316860416          <-- Size of dynamic allocation window (384GB)
>> +    dynamic_pages_mapped:270           <-- Pages mapped by dynamic allocations
>> +    page_size:2097152                  <-- DMA window page size (2MB)
>> +    window_type:Hybrid                 <-- window has both pre-mapped and
>> +                                           dynamic sections
> Since sysfs is ABI can you propose appropriate entries under Documentation/ABI/testing
>
>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>> index eafdd63cd6c4..e644c6e95301 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -90,6 +90,7 @@ struct iommu_pool {
>>   	unsigned long start;
>>   	unsigned long end;
>>   	unsigned long hint;
>> +	unsigned long inuse;
>>   	spinlock_t lock;
>>   } ____cacheline_aligned_in_smp;
>>
> Review-comment from Shivaprasad:
> Instead of  maintaining a counter in iommu_pool can you just 'weigh' the it_map
> bitmap. That way you wont have to introduce a new counter. Please look
> into how iommu_debugfs_weight_get() does this.
>
>
>> @@ -319,5 +320,24 @@ extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
>>   
>>   extern const struct dma_map_ops dma_iommu_ops;
>>   
>> +/* used by sysfs when querying Dynamic/Default DMA Window data */
>> +struct dma_win_data {
>> +	u32     win_pgsize;
>> +	u64     direct_addr;
>> +	u64     direct_size;
>> +	u64     dynamic_addr;
>> +	u64     dynamic_size;
>> +	u32     dynamic_tces_inuse;
>> +	char    win_type[15];
>> +};
>> +
>> +#define SPAPR_SUCCESS       0
>> +#define SPAPR_NODMAWIN      -1
>> +#define SPAPR_NODDWWIN      -2
>> +#define SPAPR_ERROR         -3
>> +
>> +extern int gather_ddw_info(struct device *dev, struct dma_win_data *data);
>> +extern int gather_dma_info(struct device *dev, struct dma_win_data *data);
>> +
>>   #endif /* __KERNEL__ */
>>   #endif /* _ASM_IOMMU_H */
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 0ce71310b7d9..e3cf3701dd6e 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -339,6 +339,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
>>   	if (handle)
>>   		*handle = end;
>>   
>> +	/* update use count */
>> +	pool->inuse += npages;
>> +
> See the review comment above. This counter can be done away with.
>
>>   	spin_unlock_irqrestore(&(pool->lock), flags);
>>   
>>   	return n;
>> @@ -452,6 +455,7 @@ static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
>>   	tbl->it_ops->clear(tbl, entry, npages);
>>   
>>   	spin_lock_irqsave(&(pool->lock), flags);
>> +	pool->inuse -= npages;
> Ditto as above
>
>>   	bitmap_clear(tbl->it_map, free_entry, npages);
>>   	spin_unlock_irqrestore(&(pool->lock), flags);
>>   }
>> @@ -759,6 +763,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>>   		p->start = tbl->poolsize * i;
>>   		p->hint = p->start;
>>   		p->end = p->start + tbl->poolsize;
>> +		p->inuse = 0;
> Ditto as above
>
>>   	}
>>   
>>   	p = &tbl->large_pool;
>> @@ -766,6 +771,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>>   	p->start = tbl->poolsize * i;
>>   	p->hint = p->start;
>>   	p->end = tbl->it_size;
>> +	p->inuse = 0;
>>   
>>   	iommu_table_clear(tbl);
>>   
>> @@ -1269,6 +1275,233 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
>>   	.device_group = spapr_tce_iommu_device_group,
>>   };
>>   
>> +static inline const char *dma_win_error(int err)
>> +{
>> +	switch (err) {
>> +	case SPAPR_ERROR:
>> +		return "Error";
>> +	case SPAPR_NODMAWIN:
>> +		return "No Default DMA Window Found";
>> +	case SPAPR_NODDWWIN:
>> +		return "No Dynamic DMA Window Found";
>> +	default:
>> +		return "Unknown Result";
>> +	}
>> +}
>> +
>> +static ssize_t ddw_direct_address_show(struct device *dev,
>> +									   struct device_attribute *attr,
>> +									   char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%#llx\n", data.direct_addr);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
> Instead of returning success from these *_show() functions despite dma
> window not available, can you just return an error (e.g ENOENT) so that
> userspace know the error instantly instead of having to parse the sysfs
> contents.
>
>
>
>> +static ssize_t ddw_dynamic_address_show(struct device *dev,
>> +										struct device_attribute *attr,
>> +										char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%#llx\n", data.dynamic_addr);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t ddw_direct_size_show(struct device *dev,
>> +									struct device_attribute *attr,
>> +									char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%lld\n", data.direct_size);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t ddw_dynamic_size_show(struct device *dev,
>> +									 struct device_attribute *attr,
>> +									 char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%lld\n", data.dynamic_size);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t ddw_page_size_show(struct device *dev,
>> +								  struct device_attribute *attr,
>> +								  char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%d\n", data.win_pgsize);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t ddw_window_type_show(struct device *dev,
>> +									struct device_attribute *attr,
>> +									char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%s\n", data.win_type);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t ddw_dynamic_pages_mapped_show(struct device *dev,
>> +											 struct device_attribute *attr,
>> +											 char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_ddw_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%d\n", data.dynamic_tces_inuse);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t dma_dynamic_address_show(struct device *dev,
>> +										struct device_attribute *attr,
>> +										char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_dma_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%#llx\n", data.dynamic_addr);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t dma_dynamic_size_show(struct device *dev,
>> +									 struct device_attribute *attr,
>> +									 char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_dma_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%lld\n", data.dynamic_size);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t dma_page_size_show(struct device *dev,
>> +								  struct device_attribute *attr,
>> +								  char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_dma_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%d\n", data.win_pgsize);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
>> +
>> +static ssize_t dma_dynamic_pages_mapped_show(struct device *dev,
>> +											 struct device_attribute *attr,
>> +											 char *buf)
>> +{
>> +	int rc = 0;
>> +	struct dma_win_data data;
>> +
>> +	rc = gather_dma_info(dev, &data);
>> +
>> +	if (rc == SPAPR_SUCCESS)
>> +		return sysfs_emit(buf, "%d\n", data.dynamic_tces_inuse);
>> +	else
>> +		return sysfs_emit(buf, "%s\n", dma_win_error(rc));
>> +}
> All the *_show() functions above share same template. Please convert
> them to macros expansion below to reduce code volume.
>
>
>> +
>> +#define DEVICE_ATTR_DDW(_name)                              \
>> +		struct device_attribute dev_attr_ddw_##_name =      \
>> +			__ATTR(_name, 0444, ddw_##_name##_show, NULL)
>> +#define DEVICE_ATTR_DMA(_name)                              \
>> +		struct device_attribute dev_attr_dma_##_name =      \
>> +		__ATTR(_name, 0444, dma_##_name##_show, NULL)
>> +
>> +static DEVICE_ATTR_DDW(direct_address);
>> +static DEVICE_ATTR_DDW(direct_size);
>> +static DEVICE_ATTR_DDW(page_size);
>> +static DEVICE_ATTR_DDW(window_type);
>> +static DEVICE_ATTR_DDW(dynamic_address);
>> +static DEVICE_ATTR_DDW(dynamic_size);
>> +static DEVICE_ATTR_DDW(dynamic_pages_mapped);
>> +static DEVICE_ATTR_DMA(dynamic_address);
>> +static DEVICE_ATTR_DMA(dynamic_size);
>> +static DEVICE_ATTR_DMA(page_size);
>> +static DEVICE_ATTR_DMA(dynamic_pages_mapped);
>> +
>> +static struct attribute *spapr_tce_ddw_attrs[] = {
>> +	&dev_attr_ddw_direct_address.attr,
>> +	&dev_attr_ddw_direct_size.attr,
>> +	&dev_attr_ddw_page_size.attr,
>> +	&dev_attr_ddw_window_type.attr,
>> +	&dev_attr_ddw_dynamic_address.attr,
>> +	&dev_attr_ddw_dynamic_size.attr,
>> +	&dev_attr_ddw_dynamic_pages_mapped.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute *spapr_tce_dma_attrs[] = {
>> +	&dev_attr_dma_dynamic_address.attr,
>> +	&dev_attr_dma_dynamic_size.attr,
>> +	&dev_attr_dma_page_size.attr,
>> +	&dev_attr_dma_dynamic_pages_mapped.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute_group spapr_tce_ddw_group = {
>> +	.name = "spapr-tce-ddw",
>> +	.attrs = spapr_tce_ddw_attrs,
>> +};
>> +
>> +static struct attribute_group spapr_tce_dma_group = {
>> +	.name = "spapr-tce-dma",
>> +	.attrs = spapr_tce_dma_attrs,
>> +};
>> +
> These attributes are PSeries specific but they are being setup in ppc
> generic iommu code at arch/powerpc/kernel/iommu.c . Can you move these
> attributes to arch/powerpc/platforms/pseries/iommu.c
>
>>   static struct attribute *spapr_tce_iommu_attrs[] = {
>>   	NULL,
>>   };
>> @@ -1280,6 +1513,8 @@ static struct attribute_group spapr_tce_iommu_group = {
>>   
>>   static const struct attribute_group *spapr_tce_iommu_groups[] = {
>>   	&spapr_tce_iommu_group,
>> +	&spapr_tce_ddw_group,
>> +	&spapr_tce_dma_group,
>>   	NULL,
>>   };
>>   
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 5497b130e026..5d04b50ae265 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -837,6 +837,162 @@ static struct device_node *pci_dma_find(struct device_node *dn,
>>   	return rdn;
>>   }
>>   
>> +static unsigned long iommu_table_inuse_tces(struct iommu_table *tbl)
>> +{
>> +	struct iommu_pool *pool;
>> +	unsigned long ntces = 0;
>> +
>> +	/* Number of TCEs in-use */
>> +	for (int i = 0; i < tbl->nr_pools; i++) {
>> +		pool = &tbl->pools[i];
>> +		ntces += pool->inuse;
>> +	}
>> +
>> +	pool = &tbl->large_pool;
>> +	ntces += pool->inuse;
>> +
>> +	return ntces;
>> +}
> It would be better to use this functions as a callback in
> iommu_table_ops which can be implemented by pseries and powernv code
> differently.
>
>
>> +
>> +/* Get DDW information for the device */
>> +int gather_ddw_info(struct device *dev, struct dma_win_data *data)
>> +{
>> +	struct iommu_device *iommu;
>> +	struct pci_controller *phb;
>> +	struct device_node *dn;
>> +	struct pci_dn *pci;
>> +	const __be32 *prop = NULL;
>> +	bool ddw_direct = false;
>> +	bool found = false;
>> +	struct iommu_table *tbl;
>> +	u32 pgshift;
>> +	struct dynamic_dma_window_prop *p;
>> +
>> +	memset(data, 0, sizeof(*data));
>> +
>> +	iommu = dev_get_drvdata(dev);
>> +	phb = container_of(iommu, struct pci_controller, iommu);
>> +	dn = phb->dn;
>> +
>> +	if (!dn)
>> +		return SPAPR_ERROR;
>> +
>> +	pci = PCI_DN(dn);
>> +	if (!pci || !pci->table_group)
>> +		return SPAPR_ERROR;
>> +
>> +	/* Find DDW */
>> +	prop = of_get_property(dn, DIRECT64_PROPNAME, NULL);
>> +	if (prop) {
>> +		ddw_direct = true;
>> +		found = true;
>> +	} else {
>> +		prop = of_get_property(dn, DMA64_PROPNAME, NULL);
>> +		if (prop)
>> +			found = true;
>> +	}
>> +
>> +	/* NO DDW */
>> +	if (!found)
>> +		return SPAPR_NODDWWIN;
>> +
>> +	p = (struct dynamic_dma_window_prop *)prop;
>> +
>> +	pgshift = be32_to_cpu(p->tce_shift);
>> +	if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15)
>> +		data->win_pgsize = 0;
>> +	else
>> +		data->win_pgsize = 1 << pgshift;
>> +
>> +	/* Check if DDW has table associated with it. Having a table associated with
>> +	 * DDW is indicative that is has some dynamic TCE allocations. In this case the
>> +	 * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is on index 0,
>> +	 * for dedicated adapter on index 1.
>> +	 */
>> +	found = false;
>> +	for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>> +		tbl = pci->table_group->tables[i];
>> +
>> +		if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* set the parameters depnding on the DDW type */
>> +	if (ddw_direct && found) {          /* Hybrid */
>> +		data->direct_addr = be64_to_cpu(p->dma_base);
>> +		data->dynamic_size = (u64)(tbl->it_size << tbl->it_page_shift);
>> +
>> +		data->dynamic_addr = data->direct_addr
>> +								+ (u64)(1UL << be32_to_cpu(p->window_shift))
>> +								- data->dynamic_size;
>> +
>> +		data->direct_size = data->dynamic_addr - data->direct_addr;
>> +		data->dynamic_tces_inuse = iommu_table_inuse_tces(tbl);
>> +
>> +		sprintf(data->win_type, "%s", "Hybrid");
>> +	} else if (ddw_direct && !found) {    /* Direct */
>> +		data->direct_addr = be64_to_cpu(p->dma_base);
>> +		data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift));
>> +
>> +		sprintf(data->win_type, "%s", "Direct");
>> +	} else {                              /* Dynamic */
>> +		data->dynamic_addr = be64_to_cpu(p->dma_base);
>> +		data->dynamic_size = (u64)(1UL << be32_to_cpu(p->window_shift));
>> +		data->dynamic_tces_inuse = iommu_table_inuse_tces(tbl);
>> +
>> +		sprintf(data->win_type, "%s", "Dynamic");
>> +	}
>> +
>> +	return SPAPR_SUCCESS;
>> +}
>> +
>> +/* Get DDW information for the device */
>> +int gather_dma_info(struct device *dev, struct dma_win_data *data)
>> +{
>> +	struct iommu_device *iommu;
>> +	struct pci_controller *phb;
>> +	struct device_node *dn;
>> +	struct pci_dn *pci;
>> +	const __be32 *prop = NULL;
>> +	struct iommu_table *tbl;
>> +	unsigned long offset, size, liobn;
>> +
>> +	memset(data, 0, sizeof(*data));
>> +
>> +	iommu = dev_get_drvdata(dev);
>> +	phb = container_of(iommu, struct pci_controller, iommu);
>> +	dn = phb->dn;
>> +
>> +	if (!dn)
>> +		return SPAPR_ERROR;
>> +
>> +	pci = PCI_DN(dn);
>> +	if (!pci || !pci->table_group)
>> +		return SPAPR_ERROR;
>> +
>> +	/* search for default DMA window */
>> +	prop = of_get_property(dn, "ibm,dma-window", NULL);
>> +
>> +	if (!prop)
>> +		return SPAPR_NODMAWIN;
>> +
>> +	/* default DMA Window is always at index 0 */
>> +	tbl = pci->table_group->tables[0];
>> +	if (!tbl)
>> +		return SPAPR_ERROR;
>> +
>> +	of_parse_dma_window(dn, prop, &liobn, &offset, &size);
>> +
>> +	data->dynamic_addr = offset;
>> +	data->dynamic_size = size;
>> +	data->win_pgsize = 1ULL << IOMMU_PAGE_SHIFT_4K;
>> +	data->dynamic_tces_inuse = iommu_table_inuse_tces(tbl);
>> +
>> +	return SPAPR_SUCCESS;
>> +}
>> +
>>   static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
>>   {
>>   	struct iommu_table *tbl;
>>
>> base-commit: 192c0159402e6bfbe13de6f8379546943297783d
>> -- 
>> 2.39.3
>>


^ permalink raw reply

* [PATCH v3 net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
From: Holger Brunck @ 2026-05-07 15:53 UTC (permalink / raw)
  To: netdev
  Cc: linuxppc-dev, andrew+netdev, chleroy, qiang.zhao, horms, kuba,
	Holger Brunck

When the device is removed all allocated resources should be freed.
In uhdlc_memclean the netdev transmit queue was already stopped. But at
this point we may have pending skb in the transmit queue which must be
freed. Therefore iterate over the tx_skbuff pointers and free all
pending skb. The issue was discovered by sashiko.
Tested on a ls1043a board running HDLC in bus mode on kernel 6.12.

https://sashiko.dev/#/patchset/20260429114208.941011-1-holger.brunck%40hitachienergy.com
Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC")
Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
---
v3: - add test setup to commit message
v2: https://lore.kernel.org/linuxppc-dev/20260506111529.2919079-1-holger.brunck@hitachienergy.com/
    - use dev_kfree_skb instead of kfree
    - improve commit message
    - add missing paramter in for statement
v1: https://lore.kernel.org/linuxppc-dev/20260504161145.2217950-1-holger.brunck@hitachienergy.com/

 drivers/net/wan/fsl_ucc_hdlc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index bc7c2e9e6554..417e8e4c111f 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -739,6 +739,8 @@ static int uhdlc_open(struct net_device *dev)
 
 static void uhdlc_memclean(struct ucc_hdlc_private *priv)
 {
+	int i;
+
 	qe_muram_free(ioread16be(&priv->ucc_pram->riptr));
 	qe_muram_free(ioread16be(&priv->ucc_pram->tiptr));
 
@@ -769,6 +771,11 @@ static void uhdlc_memclean(struct ucc_hdlc_private *priv)
 	kfree(priv->rx_skbuff);
 	priv->rx_skbuff = NULL;
 
+	for (i = 0; i < TX_BD_RING_LEN; i++) {
+		dev_kfree_skb(priv->tx_skbuff[i]);
+		priv->tx_skbuff[i] = NULL;
+	}
+
 	kfree(priv->tx_skbuff);
 	priv->tx_skbuff = NULL;
 
-- 
2.52.0.120.gb31ab939fe



^ permalink raw reply related

* Re: [PATCH v2 net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
From: Jakub Kicinski @ 2026-05-07 15:35 UTC (permalink / raw)
  To: Holger Brunck
  Cc: netdev@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	andrew+netdev@lunn.ch, chleroy@kernel.org, qiang.zhao@nxp.com,
	horms@kernel.org
In-Reply-To: <AM0PR06MB1039657EB2FADD7BDAD4159D5F73C2@AM0PR06MB10396.eurprd06.prod.outlook.com>

On Thu, 7 May 2026 15:07:35 +0000 Holger Brunck wrote:
> > On Wed,  6 May 2026 13:15:29 +0200 Holger Brunck wrote:  
> > > When the device is removed all allocated resources should be freed.
> > > In uhdlc_memclean the netdev transmit queue was already stopped. But
> > > at this point we may have pending skb in the transmit queue which must
> > > be freed. Therefore iterate over the tx_skbuff pointers and free all
> > > pending skb. The issue was discovered by sashiko.  
> > 
> > And you tested this how?
> > 
> > Given the questionable v1 I'm highly hesitant to accept patches from you if you
> > can't test them.  
> 
> I tested the patch on a ls1043a board running HDLC in busmode on kernel 6.12

Please add this to the commit message, as previously requested.


^ permalink raw reply

* Re: [PATCH 6/6] KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
From: Amit Machhiwal @ 2026-05-07 15:20 UTC (permalink / raw)
  To: Harsh Prateek Bora
  Cc: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan, Vaibhav Jain,
	Paolo Bonzini, Jonathan Corbet, Shuah Khan, kvm, linux-kernel,
	linux-doc
In-Reply-To: <81b9d00f-7568-4bcd-9b77-2f6de3162d65@linux.ibm.com>

On 2026/05/05 03:25 PM, Harsh Prateek Bora wrote:
> 
> 
> On 30/04/26 11:19 am, Amit Machhiwal wrote:
> > Add documentation for the KVM_PPC_GET_COMPAT_CAPS ioctl to the KVM API
> > documentation.
> > 
> > The ioctl exposes host processor compatibility modes supported for
> > nested KVM guests on PowerPC systems.
> > 
> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > ---
> >   Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
> >   1 file changed, 35 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 52bbbb553ce1..7a10c3c6cbf1 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6555,6 +6555,41 @@ KVM_S390_KEYOP_SSKE
> >   .. _kvm_run:
> > +4.145 KVM_PPC_GET_COMPAT_CAPS
> > +-----------------------------
> > +:Capability: KVM_CAP_PPC_COMPAT_CAPS
> > +:Architectures: powerpc
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_ppc_compat_caps (out)
> > +:Returns:
> > +	0 on successful completion,
> > +	-EFAULT if ``struct kvm_ppc_compat_caps`` cannot be written
> 
> -EINVAL also needs to be documented?

We do not currently return -EINVAL for this ioctl, so I did not document it. As
implemented, the only explicit error documented here is -EFAULT when struct
kvm_ppc_compat_caps cannot be written. Please let me know if I am missing a case
where -EINVAL should be returned.

> 
> > +
> > +IBM POWER system server-based processors provide a compatibility mode feature
> > +where an Nth generation processor can operate in modes consistent with earlier
> > +generations such as (N-1) and (N-2).
> > +
> > +This ioctl provides userspace with information about the CPU compatibility modes
> > +supported by the current host processor for booting the nested KVM guests on
> > +PowerNV (KVM nested APIv1) and PowerVM (KVM nested APIv2) platforms.
> > +
> > +::
> > +
> > +  struct kvm_ppc_compat_caps {
> > +         __u32   flags;
> > +         __u64   compat_capabilities;    /* Capabilities supported by the host */
> > +  };
> > +
> > +The ``compat_capabilities`` bit field describes the processor compatibility
> > +modes supported by the host. For example, the following bits indicate support
> > +for specific processor modes.
> > +
> > +::
> > +
> > + bit 1: KVM guests can run in Power9 processor mode
> > + bit 2: KVM guests can run in Power10 processor mode
> > + bit 3: KVM guests can run in Power11 processor mode
> 
> May be use H_GUEST_CAP_POWER9 and friends ?

I used the bit descriptions directly since this is documenting the uAPI bitmap
layout, which felt clearer for userspace than using only H_GUEST_CAP_POWER9 and
friends. But I’m happy to mention those macros as well if you think that
improves clarity.

Thanks again for reviewing the series. I will wait a little longer for any
further feedback before posting v2.

Thanks,
Amit

> 
> > +
> >   5. The kvm_run structure
> >   ========================
> 


^ permalink raw reply

* RE: [PATCH v2 net] net: wan: fsl_ucc_hdlc: free tx_skbuff in uhdlc_memclean
From: Holger Brunck @ 2026-05-07 15:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	andrew+netdev@lunn.ch, chleroy@kernel.org, qiang.zhao@nxp.com,
	horms@kernel.org
In-Reply-To: <20260506161629.5488a643@kernel.org>

> On Wed,  6 May 2026 13:15:29 +0200 Holger Brunck wrote:
> > When the device is removed all allocated resources should be freed.
> > In uhdlc_memclean the netdev transmit queue was already stopped. But
> > at this point we may have pending skb in the transmit queue which must
> > be freed. Therefore iterate over the tx_skbuff pointers and free all
> > pending skb. The issue was discovered by sashiko.
> 
> And you tested this how?
> 
> Given the questionable v1 I'm highly hesitant to accept patches from you if you
> can't test them.

I tested the patch on a ls1043a board running HDLC in busmode on kernel 6.12

I was able to queue some packets in the TX part simply in removing the TX clock
for my setup. When I then shutdown the interface and remove the module I can
see that thel sk_buff pointers stored in the priv->tx_skbuff array, are not freed
 without the patch in question.
I am currently not able to run my setup on latest master, but the changes in the
driver compared to master are minimal.
 
Best regards
Holger  


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox