linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
@ 2025-04-26 21:22 Chathura Rajapaksha
  2025-04-26 21:22 ` [RFC PATCH 1/2] block accesses to unassigned PCI " Chathura Rajapaksha
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-04-26 21:22 UTC (permalink / raw)
  To: kvm
  Cc: Chathura Rajapaksha, Alex Williamson, Paul Moore, Eric Paris,
	Giovanni Cabiddu, Xin Zeng, Yahui Cao, Bjorn Helgaas, Kevin Tian,
	Niklas Schnelle, Yunxiang Li, Dongdong Zhang, Avihai Horon,
	linux-kernel, audit

Some PCIe devices trigger PCI bus errors when accesses are made to
unassigned regions within their PCI configuration space. On certain
platforms, this can lead to host system hangs or reboots.

The current vfio-pci driver allows guests to access unassigned regions
in the PCI configuration space. Therefore, when such a device is passed
through to a guest, the guest can induce a host system hang or reboot
through crafted configuration space accesses, posing a threat to
system availability.

This patch series introduces:
1. Support for blocking guest accesses to unassigned
   PCI configuration space, and the ability to bypass this access control
   for specific devices. The patch introduces three module parameters:

   block_pci_unassigned_write:
   Blocks write accesses to unassigned config space regions.

   block_pci_unassigned_read:
   Blocks read accesses to unassigned config space regions.

   uaccess_allow_ids:
   Specifies the devices for which the above access control is bypassed.
   The value is a comma-separated list of device IDs in
   <vendor_id>:<device_id> format.

   Example usage:
   To block guest write accesses to unassigned config regions for all
   passed through devices except for the device with vendor ID 0x1234 and
   device ID 0x5678:

   block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

2. Auditing support for config space accesses to unassigned regions.
   When enabled, this logs such accesses for all passthrough devices.
   This feature is controlled via a new Kconfig option:

     CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT

   A new audit event type, AUDIT_VFIO, has been introduced to support
   this, allowing administrators to monitor and investigate suspicious
   behavior by guests.

This proposal is intended to harden VFIO passthrough in environments
where guests are untrusted or system reliability is critical.

Any feedback and comments are greatly appreciated.

Chathura Rajapaksha (2):
  block accesses to unassigned PCI config regions
  audit accesses to unassigned PCI config regions

 drivers/vfio/pci/Kconfig           |  12 +++
 drivers/vfio/pci/vfio_pci_config.c | 164 ++++++++++++++++++++++++++++-
 include/uapi/linux/audit.h         |   1 +
 3 files changed, 176 insertions(+), 1 deletion(-)


base-commit: f1a3944c860b0615d0513110d8cf62bb94adbb41
-- 
2.34.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 1/2] block accesses to unassigned PCI config regions
  2025-04-26 21:22 [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions Chathura Rajapaksha
@ 2025-04-26 21:22 ` Chathura Rajapaksha
  2025-04-28 15:00   ` Bjorn Helgaas
  2025-04-26 21:22 ` [RFC PATCH 2/2] audit " Chathura Rajapaksha
  2025-04-28 13:24 ` [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned " Jason Gunthorpe
  2 siblings, 1 reply; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-04-26 21:22 UTC (permalink / raw)
  To: kvm
  Cc: Chathura Rajapaksha, William Wang, Alex Williamson, Paul Moore,
	Eric Paris, Xin Zeng, Yahui Cao, Giovanni Cabiddu, Bjorn Helgaas,
	Niklas Schnelle, Yunxiang Li, Dongdong Zhang, Kevin Tian,
	Avihai Horon, linux-kernel, audit

Some PCIe devices trigger PCI bus errors when accesses are made to
unassigned regions within their PCI configuration space. On certain
platforms, this can lead to host system hangs or reboots.

The current vfio-pci driver allows guests to access unassigned regions
in the PCI configuration space. Therefore, when such a device is passed
through to a guest, the guest can induce a host system hang or reboot
through crafted configuration space accesses, posing a threat to
system availability.

This patch introduces support for blocking guest accesses to unassigned
PCI configuration space, and the ability to bypass this access control
for specific devices. The patch introduces three module parameters:

   block_pci_unassigned_write:
   Blocks write accesses to unassigned config space regions.

   block_pci_unassigned_read:
   Blocks read accesses to unassigned config space regions.

   uaccess_allow_ids:
   Specifies the devices for which the above access control is bypassed.
   The value is a comma-separated list of device IDs in
   <vendor_id>:<device_id> format.

   Example usage:
   To block guest write accesses to unassigned config regions for all
   passed through devices except for the device with vendor ID 0x1234 and
   device ID 0x5678:

   block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

Co-developed by: William Wang <xwill@bu.edu>
Signed-off-by: William Wang <xwill@bu.edu>
Signed-off-by: Chathura Rajapaksha <chath@bu.edu>
---
 drivers/vfio/pci/vfio_pci_config.c | 122 ++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..cb4d11aa5598 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -120,6 +120,106 @@ struct perm_bits {
 #define	NO_WRITE	0
 #define	ALL_WRITE	0xFFFFFFFFU
 
+static bool block_pci_unassigned_write;
+module_param(block_pci_unassigned_write, bool, 0644);
+MODULE_PARM_DESC(block_pci_unassigned_write,
+		 "Block write accesses to unassigned PCI config regions.");
+
+static bool block_pci_unassigned_read;
+module_param(block_pci_unassigned_read, bool, 0644);
+MODULE_PARM_DESC(block_pci_unassigned_read,
+		 "Block read accesses from unassigned PCI config regions.");
+
+static char *uaccess_allow_ids;
+module_param(uaccess_allow_ids, charp, 0444);
+MODULE_PARM_DESC(uaccess_allow_ids, "PCI IDs to allow access to unassigned PCI config regions, format is \"vendor:device\" and multiple comma separated entries can be specified");
+
+static LIST_HEAD(allowed_device_ids);
+static DEFINE_SPINLOCK(device_ids_lock);
+
+struct uaccess_device_id {
+	struct list_head slot_list;
+	unsigned short vendor;
+	unsigned short device;
+};
+
+static void pci_uaccess_add_device(struct uaccess_device_id *new,
+				   int vendor, int device)
+{
+	struct uaccess_device_id *pci_dev_id;
+	unsigned long flags;
+	int found = 0;
+
+	spin_lock_irqsave(&device_ids_lock, flags);
+
+	list_for_each_entry(pci_dev_id, &allowed_device_ids, slot_list) {
+		if (pci_dev_id->vendor == vendor &&
+		    pci_dev_id->device == device) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found) {
+		new->vendor = vendor;
+		new->device = device;
+		list_add_tail(&new->slot_list, &allowed_device_ids);
+	}
+
+	spin_unlock_irqrestore(&device_ids_lock, flags);
+
+	if (found)
+		kfree(new);
+}
+
+static bool pci_uaccess_lookup(struct pci_dev *dev)
+{
+	struct uaccess_device_id *pdev_id;
+	unsigned long flags;
+	bool found = false;
+
+	spin_lock_irqsave(&device_ids_lock, flags);
+	list_for_each_entry(pdev_id, &allowed_device_ids, slot_list) {
+		if (pdev_id->vendor == dev->vendor &&
+		    pdev_id->device == dev->device) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&device_ids_lock, flags);
+
+	return found;
+}
+
+static int __init pci_uaccess_init(void)
+{
+	char *p, *id;
+	int fields;
+	int vendor, device;
+	struct uaccess_device_id *pci_dev_id;
+
+	/* add ids specified in the module parameter */
+	p = uaccess_allow_ids;
+	while ((id = strsep(&p, ","))) {
+		if (!strlen(id))
+			continue;
+
+		fields = sscanf(id, "%x:%x", &vendor, &device);
+
+		if (fields != 2) {
+			pr_warn("Invalid id string \"%s\"\n", id);
+			continue;
+		}
+
+		pci_dev_id = kmalloc(sizeof(*pci_dev_id), GFP_KERNEL);
+		if (!pci_dev_id)
+			return -ENOMEM;
+
+		pci_uaccess_add_device(pci_dev_id, vendor, device);
+	}
+	return 0;
+}
+
 static int vfio_user_config_read(struct pci_dev *pdev, int offset,
 				 __le32 *val, int count)
 {
@@ -335,6 +435,18 @@ static struct perm_bits unassigned_perms = {
 	.writefn = vfio_raw_config_write
 };
 
+/*
+ * Read/write access to PCI unassigned config regions can be blocked
+ * using the block_pci_unassigned_read and block_pci_unassigned_write
+ * module parameters. The uaccess_allow_ids module parameter can be used
+ * to whitelist devices (i.e., bypass the block) when either of the
+ * above parameters is specified.
+ */
+static struct perm_bits block_unassigned_perms = {
+	.readfn = NULL,
+	.writefn = NULL
+};
+
 static struct perm_bits virt_perms = {
 	.readfn = vfio_virt_config_read,
 	.writefn = vfio_virt_config_write
@@ -1108,6 +1220,9 @@ int __init vfio_pci_init_perm_bits(void)
 	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
 	ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
 
+	/* Device list allowed to access unassigned PCI regions */
+	ret |= pci_uaccess_init();
+
 	if (ret)
 		vfio_pci_uninit_perm_bits();
 
@@ -1896,7 +2011,12 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 	cap_id = vdev->pci_config_map[*ppos];
 
 	if (cap_id == PCI_CAP_ID_INVALID) {
-		perm = &unassigned_perms;
+		if (((iswrite && block_pci_unassigned_write) ||
+		     (!iswrite && block_pci_unassigned_read)) &&
+		    !pci_uaccess_lookup(pdev))
+			perm = &block_unassigned_perms;
+		else
+			perm = &unassigned_perms;
 		cap_start = *ppos;
 	} else if (cap_id == PCI_CAP_ID_INVALID_VIRT) {
 		perm = &virt_perms;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 2/2] audit accesses to unassigned PCI config regions
  2025-04-26 21:22 [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions Chathura Rajapaksha
  2025-04-26 21:22 ` [RFC PATCH 1/2] block accesses to unassigned PCI " Chathura Rajapaksha
@ 2025-04-26 21:22 ` Chathura Rajapaksha
  2025-04-28 15:05   ` Bjorn Helgaas
  2025-05-16 20:41   ` [PATCH RFC " Paul Moore
  2025-04-28 13:24 ` [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned " Jason Gunthorpe
  2 siblings, 2 replies; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-04-26 21:22 UTC (permalink / raw)
  To: kvm
  Cc: Chathura Rajapaksha, William Wang, Alex Williamson, Paul Moore,
	Eric Paris, Niklas Schnelle, Xin Zeng, Kevin Tian, Bjorn Helgaas,
	Yahui Cao, Yunxiang Li, Dongdong Zhang, Avihai Horon,
	linux-kernel, audit

Some PCIe devices trigger PCI bus errors when accesses are made to
unassigned regions within their PCI configuration space. On certain
platforms, this can lead to host system hangs or reboots.

The current vfio-pci driver allows guests to access unassigned regions
in the PCI configuration space. Therefore, when such a device is passed
through to a guest, the guest can induce a host system hang or reboot
through crafted configuration space accesses, posing a threat to
system availability.

This patch introduces auditing support for config space accesses to
unassigned regions. When enabled, this logs such accesses for all
passthrough devices. 
This feature is controlled via a new Kconfig option:

  CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT

A new audit event type, AUDIT_VFIO, has been introduced to support
this, allowing administrators to monitor and investigate suspicious
behavior by guests.

Co-developed by: William Wang <xwill@bu.edu>
Signed-off-by: William Wang <xwill@bu.edu>
Signed-off-by: Chathura Rajapaksha <chath@bu.edu>
---
 drivers/vfio/pci/Kconfig           | 12 ++++++++
 drivers/vfio/pci/vfio_pci_config.c | 46 ++++++++++++++++++++++++++++--
 include/uapi/linux/audit.h         |  1 +
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index c3bcb6911c53..7f9f16262b90 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -42,6 +42,18 @@ config VFIO_PCI_IGD
 	  and LPC bridge config space.
 
 	  To enable Intel IGD assignment through vfio-pci, say Y.
+
+config VFIO_PCI_UNASSIGNED_ACCESS_AUDIT
+	bool "Audit accesses to unassigned PCI configuration regions"
+	depends on AUDIT && VFIO_PCI_CORE
+	help
+	  Some PCIe devices are known to cause bus errors when accessing
+	  unassigned PCI configuration space, potentially leading to host
+	  system hangs on certain platforms. When enabled, this option
+	  audits accesses to unassigned PCI configuration regions.
+
+	  If you don't know what to do here, say N.
+
 endif
 
 config VFIO_PCI_ZDEV_KVM
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index cb4d11aa5598..ddd10904d60f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
+#include <linux/audit.h>
 
 #include "vfio_pci_priv.h"
 
@@ -1980,6 +1981,37 @@ static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_core_device *vdev,
 	return i;
 }
 
+enum vfio_audit {
+	VFIO_AUDIT_READ,
+	VFIO_AUDIT_WRITE,
+	VFIO_AUDIT_MAX,
+};
+
+static const char * const vfio_audit_str[VFIO_AUDIT_MAX] = {
+	[VFIO_AUDIT_READ]  = "READ",
+	[VFIO_AUDIT_WRITE] = "WRITE",
+};
+
+static void vfio_audit_access(const struct pci_dev *pdev,
+			      size_t count, loff_t *ppos, bool blocked, unsigned int op)
+{
+	struct audit_buffer *ab;
+
+	if (WARN_ON_ONCE(op >= VFIO_AUDIT_MAX))
+		return;
+	if (audit_enabled == AUDIT_OFF)
+		return;
+	ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_VFIO);
+	if (unlikely(!ab))
+		return;
+	audit_log_format(ab,
+			 "device=%04x:%02x:%02x.%d access=%s offset=0x%llx size=%ld blocked=%u\n",
+			 pci_domain_nr(pdev->bus), pdev->bus->number,
+			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+			 vfio_audit_str[op], *ppos, count, blocked);
+	audit_log_end(ab);
+}
+
 static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 				 size_t count, loff_t *ppos, bool iswrite)
 {
@@ -1989,6 +2021,7 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 	int cap_start = 0, offset;
 	u8 cap_id;
 	ssize_t ret;
+	bool blocked;
 
 	if (*ppos < 0 || *ppos >= pdev->cfg_size ||
 	    *ppos + count > pdev->cfg_size)
@@ -2011,13 +2044,22 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 	cap_id = vdev->pci_config_map[*ppos];
 
 	if (cap_id == PCI_CAP_ID_INVALID) {
-		if (((iswrite && block_pci_unassigned_write) ||
+		blocked = (((iswrite && block_pci_unassigned_write) ||
 		     (!iswrite && block_pci_unassigned_read)) &&
-		    !pci_uaccess_lookup(pdev))
+		    !pci_uaccess_lookup(pdev));
+		if (blocked)
 			perm = &block_unassigned_perms;
 		else
 			perm = &unassigned_perms;
 		cap_start = *ppos;
+		if (IS_ENABLED(CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT)) {
+			if (iswrite)
+				vfio_audit_access(pdev, count, ppos, blocked,
+						  VFIO_AUDIT_WRITE);
+			else
+				vfio_audit_access(pdev, count, ppos, blocked,
+						  VFIO_AUDIT_READ);
+		}
 	} else if (cap_id == PCI_CAP_ID_INVALID_VIRT) {
 		perm = &virt_perms;
 		cap_start = *ppos;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 9a4ecc9f6dc5..c0aace7384f3 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -122,6 +122,7 @@
 #define AUDIT_OPENAT2		1337	/* Record showing openat2 how args */
 #define AUDIT_DM_CTRL		1338	/* Device Mapper target control */
 #define AUDIT_DM_EVENT		1339	/* Device Mapper events */
+#define AUDIT_VFIO		1340	/* VFIO events */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-04-26 21:22 [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions Chathura Rajapaksha
  2025-04-26 21:22 ` [RFC PATCH 1/2] block accesses to unassigned PCI " Chathura Rajapaksha
  2025-04-26 21:22 ` [RFC PATCH 2/2] audit " Chathura Rajapaksha
@ 2025-04-28 13:24 ` Jason Gunthorpe
  2025-04-28 20:25   ` Alex Williamson
  2 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2025-04-28 13:24 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: kvm, Chathura Rajapaksha, Alex Williamson, Paul Moore, Eric Paris,
	Giovanni Cabiddu, Xin Zeng, Yahui Cao, Bjorn Helgaas, Kevin Tian,
	Niklas Schnelle, Yunxiang Li, Dongdong Zhang, Avihai Horon,
	linux-kernel, audit

On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.

Do you have an example of this? What do you mean by bus error?

I would expect the device to return some constant like 0, or to return
an error TLP. The host bridge should convert the error TLP to
0XFFFFFFF like all other read error conversions.

Is it a device problem or host bridge problem you are facing?

> 1. Support for blocking guest accesses to unassigned
>    PCI configuration space, and the ability to bypass this access control
>    for specific devices. The patch introduces three module parameters:
> 
>    block_pci_unassigned_write:
>    Blocks write accesses to unassigned config space regions.
> 
>    block_pci_unassigned_read:
>    Blocks read accesses to unassigned config space regions.
> 
>    uaccess_allow_ids:
>    Specifies the devices for which the above access control is bypassed.
>    The value is a comma-separated list of device IDs in
>    <vendor_id>:<device_id> format.
> 
>    Example usage:
>    To block guest write accesses to unassigned config regions for all
>    passed through devices except for the device with vendor ID 0x1234 and
>    device ID 0x5678:
> 
>    block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678

No module parameters please.

At worst the kernel should maintain a quirks list to control this,
maybe with a sysfs to update it.

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] block accesses to unassigned PCI config regions
  2025-04-26 21:22 ` [RFC PATCH 1/2] block accesses to unassigned PCI " Chathura Rajapaksha
@ 2025-04-28 15:00   ` Bjorn Helgaas
  0 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2025-04-28 15:00 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: kvm, Chathura Rajapaksha, William Wang, Alex Williamson,
	Paul Moore, Eric Paris, Xin Zeng, Yahui Cao, Giovanni Cabiddu,
	Bjorn Helgaas, Niklas Schnelle, Yunxiang Li, Dongdong Zhang,
	Kevin Tian, Avihai Horon, linux-kernel, audit

Hint: observe subject line conventions for the file you're changing.
See "git log --oneline drivers/vfio/pci/vfio_pci_config.c".

On Sat, Apr 26, 2025 at 09:22:48PM +0000, Chathura Rajapaksha wrote:
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.

1) Define "unassigned regions".  From the patch, I infer that the
64-byte header, capabilities, and extended capabilities are considered
"assigned," and anything else would be "unassigned."

2) Can you expand on what these certain platforms are and the details
of what these errors are and how they're reported?  You mention
"PCIe," but I suppose this is not really PCIe-specific.  I suppose the
hang or reboot would be a consequence of the error being reported as
SERR# or a platform-specific System Error (PCIe r6.0, sec 6.2.6)?

In conventional PCI, we may not have control over SERR# assertion and
the effect on the system.  In PCIe, bits in the Root Control register
should control SERR# assertion.

> The current vfio-pci driver allows guests to access unassigned regions
> in the PCI configuration space. Therefore, when such a device is passed
> through to a guest, the guest can induce a host system hang or reboot
> through crafted configuration space accesses, posing a threat to
> system availability.
> 
> This patch introduces support for blocking guest accesses to unassigned
> PCI configuration space, and the ability to bypass this access control
> for specific devices. The patch introduces three module parameters:

We already know which patch this refers to (so "this patch" is
unnecessary) and typical style is to use imperative mood:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v6.14#n94

>    block_pci_unassigned_write:
>    Blocks write accesses to unassigned config space regions.
> 
>    block_pci_unassigned_read:
>    Blocks read accesses to unassigned config space regions.
> 
>    uaccess_allow_ids:
>    Specifies the devices for which the above access control is bypassed.
>    The value is a comma-separated list of device IDs in
>    <vendor_id>:<device_id> format.
> 
>    Example usage:
>    To block guest write accesses to unassigned config regions for all
>    passed through devices except for the device with vendor ID 0x1234 and
>    device ID 0x5678:
> 
>    block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678
> 
> Co-developed by: William Wang <xwill@bu.edu>
> Signed-off-by: William Wang <xwill@bu.edu>
> Signed-off-by: Chathura Rajapaksha <chath@bu.edu>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 122 ++++++++++++++++++++++++++++-
>  1 file changed, 121 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 8f02f236b5b4..cb4d11aa5598 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -120,6 +120,106 @@ struct perm_bits {
>  #define	NO_WRITE	0
>  #define	ALL_WRITE	0xFFFFFFFFU
>  
> +static bool block_pci_unassigned_write;
> +module_param(block_pci_unassigned_write, bool, 0644);
> +MODULE_PARM_DESC(block_pci_unassigned_write,
> +		 "Block write accesses to unassigned PCI config regions.");
> +
> +static bool block_pci_unassigned_read;
> +module_param(block_pci_unassigned_read, bool, 0644);
> +MODULE_PARM_DESC(block_pci_unassigned_read,
> +		 "Block read accesses from unassigned PCI config regions.");
> +
> +static char *uaccess_allow_ids;
> +module_param(uaccess_allow_ids, charp, 0444);
> +MODULE_PARM_DESC(uaccess_allow_ids, "PCI IDs to allow access to unassigned PCI config regions, format is \"vendor:device\" and multiple comma separated entries can be specified");
> +
> +static LIST_HEAD(allowed_device_ids);
> +static DEFINE_SPINLOCK(device_ids_lock);
> +
> +struct uaccess_device_id {
> +	struct list_head slot_list;
> +	unsigned short vendor;
> +	unsigned short device;
> +};
> +
> +static void pci_uaccess_add_device(struct uaccess_device_id *new,
> +				   int vendor, int device)
> +{
> +	struct uaccess_device_id *pci_dev_id;
> +	unsigned long flags;
> +	int found = 0;
> +
> +	spin_lock_irqsave(&device_ids_lock, flags);
> +
> +	list_for_each_entry(pci_dev_id, &allowed_device_ids, slot_list) {
> +		if (pci_dev_id->vendor == vendor &&
> +		    pci_dev_id->device == device) {
> +			found = 1;
> +			break;
> +		}
> +	}
> +
> +	if (!found) {
> +		new->vendor = vendor;
> +		new->device = device;
> +		list_add_tail(&new->slot_list, &allowed_device_ids);
> +	}
> +
> +	spin_unlock_irqrestore(&device_ids_lock, flags);
> +
> +	if (found)
> +		kfree(new);
> +}
> +
> +static bool pci_uaccess_lookup(struct pci_dev *dev)
> +{
> +	struct uaccess_device_id *pdev_id;
> +	unsigned long flags;
> +	bool found = false;
> +
> +	spin_lock_irqsave(&device_ids_lock, flags);
> +	list_for_each_entry(pdev_id, &allowed_device_ids, slot_list) {
> +		if (pdev_id->vendor == dev->vendor &&
> +		    pdev_id->device == dev->device) {
> +			found = true;
> +			break;
> +		}
> +	}
> +	spin_unlock_irqrestore(&device_ids_lock, flags);
> +
> +	return found;
> +}
> +
> +static int __init pci_uaccess_init(void)
> +{
> +	char *p, *id;
> +	int fields;
> +	int vendor, device;
> +	struct uaccess_device_id *pci_dev_id;
> +
> +	/* add ids specified in the module parameter */
> +	p = uaccess_allow_ids;
> +	while ((id = strsep(&p, ","))) {
> +		if (!strlen(id))
> +			continue;
> +
> +		fields = sscanf(id, "%x:%x", &vendor, &device);
> +
> +		if (fields != 2) {
> +			pr_warn("Invalid id string \"%s\"\n", id);
> +			continue;
> +		}
> +
> +		pci_dev_id = kmalloc(sizeof(*pci_dev_id), GFP_KERNEL);
> +		if (!pci_dev_id)
> +			return -ENOMEM;
> +
> +		pci_uaccess_add_device(pci_dev_id, vendor, device);
> +	}
> +	return 0;
> +}
> +
>  static int vfio_user_config_read(struct pci_dev *pdev, int offset,
>  				 __le32 *val, int count)
>  {
> @@ -335,6 +435,18 @@ static struct perm_bits unassigned_perms = {
>  	.writefn = vfio_raw_config_write
>  };
>  
> +/*
> + * Read/write access to PCI unassigned config regions can be blocked
> + * using the block_pci_unassigned_read and block_pci_unassigned_write
> + * module parameters. The uaccess_allow_ids module parameter can be used
> + * to whitelist devices (i.e., bypass the block) when either of the
> + * above parameters is specified.
> + */
> +static struct perm_bits block_unassigned_perms = {
> +	.readfn = NULL,
> +	.writefn = NULL
> +};
> +
>  static struct perm_bits virt_perms = {
>  	.readfn = vfio_virt_config_read,
>  	.writefn = vfio_virt_config_write
> @@ -1108,6 +1220,9 @@ int __init vfio_pci_init_perm_bits(void)
>  	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
>  	ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
>  
> +	/* Device list allowed to access unassigned PCI regions */
> +	ret |= pci_uaccess_init();
> +
>  	if (ret)
>  		vfio_pci_uninit_perm_bits();
>  
> @@ -1896,7 +2011,12 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  	cap_id = vdev->pci_config_map[*ppos];
>  
>  	if (cap_id == PCI_CAP_ID_INVALID) {
> -		perm = &unassigned_perms;
> +		if (((iswrite && block_pci_unassigned_write) ||
> +		     (!iswrite && block_pci_unassigned_read)) &&
> +		    !pci_uaccess_lookup(pdev))
> +			perm = &block_unassigned_perms;
> +		else
> +			perm = &unassigned_perms;
>  		cap_start = *ppos;
>  	} else if (cap_id == PCI_CAP_ID_INVALID_VIRT) {
>  		perm = &virt_perms;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/2] audit accesses to unassigned PCI config regions
  2025-04-26 21:22 ` [RFC PATCH 2/2] audit " Chathura Rajapaksha
@ 2025-04-28 15:05   ` Bjorn Helgaas
  2025-05-16 20:41   ` [PATCH RFC " Paul Moore
  1 sibling, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2025-04-28 15:05 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: kvm, Chathura Rajapaksha, William Wang, Alex Williamson,
	Paul Moore, Eric Paris, Niklas Schnelle, Xin Zeng, Kevin Tian,
	Bjorn Helgaas, Yahui Cao, Yunxiang Li, Dongdong Zhang,
	Avihai Horon, linux-kernel, audit

On Sat, Apr 26, 2025 at 09:22:49PM +0000, Chathura Rajapaksha wrote:
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.
> 
> The current vfio-pci driver allows guests to access unassigned regions
> in the PCI configuration space. Therefore, when such a device is passed
> through to a guest, the guest can induce a host system hang or reboot
> through crafted configuration space accesses, posing a threat to
> system availability.
> 
> This patch introduces auditing support for config space accesses to
> unassigned regions. When enabled, this logs such accesses for all
> passthrough devices. 
> This feature is controlled via a new Kconfig option:

Add blank line between paragraphs.

>   CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT
> 
> A new audit event type, AUDIT_VFIO, has been introduced to support
> this, allowing administrators to monitor and investigate suspicious
> behavior by guests.

Use imperative mood ("Introduce" instead of "This patch introduces
..." and "Add ..." instead of "A new type has been introduced").

> Co-developed by: William Wang <xwill@bu.edu>
> Signed-off-by: William Wang <xwill@bu.edu>
> Signed-off-by: Chathura Rajapaksha <chath@bu.edu>
> ---
>  drivers/vfio/pci/Kconfig           | 12 ++++++++
>  drivers/vfio/pci/vfio_pci_config.c | 46 ++++++++++++++++++++++++++++--
>  include/uapi/linux/audit.h         |  1 +
>  3 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index c3bcb6911c53..7f9f16262b90 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -42,6 +42,18 @@ config VFIO_PCI_IGD
>  	  and LPC bridge config space.
>  
>  	  To enable Intel IGD assignment through vfio-pci, say Y.
> +
> +config VFIO_PCI_UNASSIGNED_ACCESS_AUDIT
> +	bool "Audit accesses to unassigned PCI configuration regions"
> +	depends on AUDIT && VFIO_PCI_CORE
> +	help
> +	  Some PCIe devices are known to cause bus errors when accessing
> +	  unassigned PCI configuration space, potentially leading to host
> +	  system hangs on certain platforms. When enabled, this option
> +	  audits accesses to unassigned PCI configuration regions.
> +
> +	  If you don't know what to do here, say N.
> +
>  endif
>  
>  config VFIO_PCI_ZDEV_KVM
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index cb4d11aa5598..ddd10904d60f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -25,6 +25,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/slab.h>
> +#include <linux/audit.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -1980,6 +1981,37 @@ static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_core_device *vdev,
>  	return i;
>  }
>  
> +enum vfio_audit {
> +	VFIO_AUDIT_READ,
> +	VFIO_AUDIT_WRITE,
> +	VFIO_AUDIT_MAX,
> +};
> +
> +static const char * const vfio_audit_str[VFIO_AUDIT_MAX] = {
> +	[VFIO_AUDIT_READ]  = "READ",
> +	[VFIO_AUDIT_WRITE] = "WRITE",
> +};
> +
> +static void vfio_audit_access(const struct pci_dev *pdev,
> +			      size_t count, loff_t *ppos, bool blocked, unsigned int op)
> +{
> +	struct audit_buffer *ab;
> +
> +	if (WARN_ON_ONCE(op >= VFIO_AUDIT_MAX))
> +		return;
> +	if (audit_enabled == AUDIT_OFF)
> +		return;
> +	ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_VFIO);
> +	if (unlikely(!ab))
> +		return;
> +	audit_log_format(ab,
> +			 "device=%04x:%02x:%02x.%d access=%s offset=0x%llx size=%ld blocked=%u\n",
> +			 pci_domain_nr(pdev->bus), pdev->bus->number,
> +			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> +			 vfio_audit_str[op], *ppos, count, blocked);
> +	audit_log_end(ab);
> +}
> +
>  static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  				 size_t count, loff_t *ppos, bool iswrite)
>  {
> @@ -1989,6 +2021,7 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  	int cap_start = 0, offset;
>  	u8 cap_id;
>  	ssize_t ret;
> +	bool blocked;
>  
>  	if (*ppos < 0 || *ppos >= pdev->cfg_size ||
>  	    *ppos + count > pdev->cfg_size)
> @@ -2011,13 +2044,22 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  	cap_id = vdev->pci_config_map[*ppos];
>  
>  	if (cap_id == PCI_CAP_ID_INVALID) {
> -		if (((iswrite && block_pci_unassigned_write) ||
> +		blocked = (((iswrite && block_pci_unassigned_write) ||
>  		     (!iswrite && block_pci_unassigned_read)) &&
> -		    !pci_uaccess_lookup(pdev))
> +		    !pci_uaccess_lookup(pdev));
> +		if (blocked)
>  			perm = &block_unassigned_perms;
>  		else
>  			perm = &unassigned_perms;
>  		cap_start = *ppos;
> +		if (IS_ENABLED(CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT)) {
> +			if (iswrite)
> +				vfio_audit_access(pdev, count, ppos, blocked,
> +						  VFIO_AUDIT_WRITE);
> +			else
> +				vfio_audit_access(pdev, count, ppos, blocked,
> +						  VFIO_AUDIT_READ);
> +		}

Simplify this patch by adding "blocked" in the first patch.  Then you
won't have to touch the permission checking that is unrelated to the
audit logging.  Consider adding a helper to do the checking and return
"blocked" so it doesn't clutter vfio_config_do_rw().

>  	} else if (cap_id == PCI_CAP_ID_INVALID_VIRT) {
>  		perm = &virt_perms;
>  		cap_start = *ppos;
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 9a4ecc9f6dc5..c0aace7384f3 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -122,6 +122,7 @@
>  #define AUDIT_OPENAT2		1337	/* Record showing openat2 how args */
>  #define AUDIT_DM_CTRL		1338	/* Device Mapper target control */
>  #define AUDIT_DM_EVENT		1339	/* Device Mapper events */
> +#define AUDIT_VFIO		1340	/* VFIO events */
>  
>  #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
>  #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-04-28 13:24 ` [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned " Jason Gunthorpe
@ 2025-04-28 20:25   ` Alex Williamson
  2025-04-29 13:44     ` Jason Gunthorpe
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2025-04-28 20:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chathura Rajapaksha, kvm, Chathura Rajapaksha, Paul Moore,
	Eric Paris, Giovanni Cabiddu, Xin Zeng, Yahui Cao, Bjorn Helgaas,
	Kevin Tian, Niklas Schnelle, Yunxiang Li, Dongdong Zhang,
	Avihai Horon, linux-kernel, audit

On Mon, 28 Apr 2025 10:24:55 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Sat, Apr 26, 2025 at 09:22:47PM +0000, Chathura Rajapaksha wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.  
> 
> Do you have an example of this? What do you mean by bus error?
> 
> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0XFFFFFFF like all other read error conversions.
> 
> Is it a device problem or host bridge problem you are facing?

Or system problem.  Is it the access itself that generates a problem or
is it what the device does as a result of the access?  If the latter,
does this only remove a config space fuzzing attack vector against that
behavior or do we expect the device cannot generate the same behavior
via MMIO or IO register accesses?

We've previously leaned in the direction that we depend on hardware to
contain errors.  We cannot trap every access to the device or else we'd
severely limit the devices available to use and the performance of
those devices to the point that device assignment isn't worthwhile.

PCI config space is a slow path, it's already trapped, and it's
theoretically architected that we could restrict and audit much of it,
though some devices do rely on access to unarchitected config space.
But even within the architected space there are device specific
capabilities with undocumented protocols, exposing unknown features of
devices.  Does this incrementally make things better in general, or is
this largely masking a poorly behaved device/system?

> > 1. Support for blocking guest accesses to unassigned
> >    PCI configuration space, and the ability to bypass this access control
> >    for specific devices. The patch introduces three module parameters:
> > 
> >    block_pci_unassigned_write:
> >    Blocks write accesses to unassigned config space regions.
> > 
> >    block_pci_unassigned_read:
> >    Blocks read accesses to unassigned config space regions.
> > 
> >    uaccess_allow_ids:
> >    Specifies the devices for which the above access control is bypassed.
> >    The value is a comma-separated list of device IDs in
> >    <vendor_id>:<device_id> format.
> > 
> >    Example usage:
> >    To block guest write accesses to unassigned config regions for all
> >    passed through devices except for the device with vendor ID 0x1234 and
> >    device ID 0x5678:
> > 
> >    block_pci_unassigned_write=1 uaccess_allow_ids=1234:5678  
> 
> No module parameters please.
> 
> At worst the kernel should maintain a quirks list to control this,
> maybe with a sysfs to update it.

No module parameters might be difficult if we end up managing this as a
default policy selection, but certainly agree that if we get into
device specific behaviors we probably want those quirks automatically
deployed by the kernel.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-04-28 20:25   ` Alex Williamson
@ 2025-04-29 13:44     ` Jason Gunthorpe
  2025-05-16 18:17       ` Chathura Rajapaksha
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2025-04-29 13:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Chathura Rajapaksha, kvm, Chathura Rajapaksha, Paul Moore,
	Eric Paris, Giovanni Cabiddu, Xin Zeng, Yahui Cao, Bjorn Helgaas,
	Kevin Tian, Niklas Schnelle, Yunxiang Li, Dongdong Zhang,
	Avihai Horon, linux-kernel, audit

On Mon, Apr 28, 2025 at 02:25:58PM -0600, Alex Williamson wrote:

> PCI config space is a slow path, it's already trapped, and it's
> theoretically architected that we could restrict and audit much of it,
> though some devices do rely on access to unarchitected config space.
> But even within the architected space there are device specific
> capabilities with undocumented protocols, exposing unknown features of
> devices.  Does this incrementally make things better in general, or is
> this largely masking a poorly behaved device/system?

I think there would be merit in having a qemu option to secure the
config space.

We talked about this before about presenting a perscribed virtualized
config space.

But we still have the issue that userpace with access to VFIO could
crash the machine, on these uncontained platforms, which is not great.

It would be nice if the kernel could discover this, but it doesn't
seem possible. There is so much in the SOC design and FW
implementation that has to be done correctly for errors to be properly
containable.

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-04-29 13:44     ` Jason Gunthorpe
@ 2025-05-16 18:17       ` Chathura Rajapaksha
  2025-05-16 18:35         ` Jason Gunthorpe
  0 siblings, 1 reply; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-05-16 18:17 UTC (permalink / raw)
  To: jgg
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	chathura.abeyrathne.lk, eparis, giovanni.cabiddu, kevin.tian, kvm,
	linux-kernel, paul, schnelle, xin.zeng, yahui.cao, zhangdongdong

Hi Jason and Alex,

Thank you for the comments, and apologies for the delayed response.

On Mon, Apr 28, 2025 at 9:24 AM
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > Some PCIe devices trigger PCI bus errors when accesses are made to
> > unassigned regions within their PCI configuration space. On certain
> > platforms, this can lead to host system hangs or reboots.
>
> Do you have an example of this? What do you mean by bus error?

By PCI bus error, I was referring to AER-reported uncorrectable errors.
For example:
pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
pcieport 0000:c0:01.1:    [14] CmpltTO                (First)

In another case, with a different device on a separate system, we
observed an uncorrectable machine check exception:
mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b

> I would expect the device to return some constant like 0, or to return
> an error TLP. The host bridge should convert the error TLP to
> 0XFFFFFFF like all other read error conversions.
>
> Is it a device problem or host bridge problem you are facing?

We haven’t been able to confirm definitively whether it’s a device or
host bridge issue due to limited visibility into platform internals.
However, we suspect it’s device-specific, as the same device triggered
similar failures across two different systems when writing to a
specific unassigned region in its config space. That said, we haven’t
exhaustively tested across devices and platforms.

If you have suggestions for identifying whether this stems from the
device or host bridge, we’d appreciate the guidance.

On Mon, Apr 28, 2025 at 4:26 PM
Alex Williamson <alex.williamson@redhat.com> wrote:
> Or system problem.  Is it the access itself that generates a problem or
> is it what the device does as a result of the access?  If the latter,
> does this only remove a config space fuzzing attack vector against that
> behavior or do we expect the device cannot generate the same behavior
> via MMIO or IO register accesses?

Unfortunately, we can't say for certain whether the fault lies in the
access itself or in the device's response. The cloud environments we
tested on don’t provide sufficient low-level hardware insight to
determine that. Please let me know if you have any pointers on how to
determine this at the kernel level.

This patch specifically aims to eliminate the config space fuzzing
vector. We have not investigated whether similar behavior can be
triggered through MMIO or IO register accesses.

> > No module parameters please.
> >
> > At worst the kernel should maintain a quirks list to control this,
> > maybe with a sysfs to update it.
>
> No module parameters might be difficult if we end up managing this as a
> default policy selection, but certainly agree that if we get into
> device specific behaviors we probably want those quirks automatically
> deployed by the kernel.  Thanks,

We used module parameters to give the flexibility to block unassigned
config space accesses on specific devices, especially in cases where new
problematic devices might emerge.

Is it feasible to support such use cases using a quirk-based mechanism?
For example, could we implement a quirk table that’s updateable via
sysfs, as you suggested?

Thank you for your time, and again, apologies for the delayed response.

Thanks,
Chathura

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-05-16 18:17       ` Chathura Rajapaksha
@ 2025-05-16 18:35         ` Jason Gunthorpe
  2025-05-17 17:14           ` Chathura Rajapaksha
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2025-05-16 18:35 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	eparis, giovanni.cabiddu, kevin.tian, kvm, linux-kernel, paul,
	schnelle, xin.zeng, yahui.cao, zhangdongdong

On Fri, May 16, 2025 at 06:17:54PM +0000, Chathura Rajapaksha wrote:
> Hi Jason and Alex,
> 
> Thank you for the comments, and apologies for the delayed response.
> 
> On Mon, Apr 28, 2025 at 9:24 AM
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > Some PCIe devices trigger PCI bus errors when accesses are made to
> > > unassigned regions within their PCI configuration space. On certain
> > > platforms, this can lead to host system hangs or reboots.
> >
> > Do you have an example of this? What do you mean by bus error?
> 
> By PCI bus error, I was referring to AER-reported uncorrectable errors.
> For example:
> pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> pcieport 0000:c0:01.1:    [14] CmpltTO                (First)

That's sure looks like a device bug. You should not ever get time out
for a config space read.

> In another case, with a different device on a separate system, we
> observed an uncorrectable machine check exception:
> mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 6: fb80000000000e0b

FW turning AER into a MCE is not suitable to use as a virtualization
host, IMHO. It is not possible to contain PCIe errors when they are
turned into MCE.

> Is it feasible to support such use cases using a quirk-based mechanism?
> For example, could we implement a quirk table that’s updateable via
> sysfs, as you suggested?

Dynamically updateable might be overkill, I think you have one
defective device. Have you talked to the supplier to see if it can be
corrected?

I think Alex is right to worry, if the device got this wrong, what
other mistakes have been made? Supporting virtualization is more than
just making a PCI device and using VFIO. You need to robustly design
HW to have full containment as well, including managing errors.

Alternatively you could handle this in qemu by sanitizing the config
space..

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 2/2] audit accesses to unassigned PCI config regions
  2025-04-26 21:22 ` [RFC PATCH 2/2] audit " Chathura Rajapaksha
  2025-04-28 15:05   ` Bjorn Helgaas
@ 2025-05-16 20:41   ` Paul Moore
  2025-05-20 16:33     ` Chathura Rajapaksha
  1 sibling, 1 reply; 15+ messages in thread
From: Paul Moore @ 2025-05-16 20:41 UTC (permalink / raw)
  To: Chathura Rajapaksha, kvm
  Cc: Chathura Rajapaksha, William Wang, Alex Williamson, Eric Paris,
	Niklas Schnelle, Xin Zeng, Kevin Tian, Bjorn Helgaas, Yahui Cao,
	Yunxiang Li, Dongdong Zhang, Avihai Horon, linux-kernel, audit

On Apr 26, 2025 Chathura Rajapaksha <chathura.abeyrathne.lk@gmail.com> wrote:
> 
> Some PCIe devices trigger PCI bus errors when accesses are made to
> unassigned regions within their PCI configuration space. On certain
> platforms, this can lead to host system hangs or reboots.
> 
> The current vfio-pci driver allows guests to access unassigned regions
> in the PCI configuration space. Therefore, when such a device is passed
> through to a guest, the guest can induce a host system hang or reboot
> through crafted configuration space accesses, posing a threat to
> system availability.
> 
> This patch introduces auditing support for config space accesses to
> unassigned regions. When enabled, this logs such accesses for all
> passthrough devices. 
> This feature is controlled via a new Kconfig option:
> 
>   CONFIG_VFIO_PCI_UNASSIGNED_ACCESS_AUDIT
> 
> A new audit event type, AUDIT_VFIO, has been introduced to support
> this, allowing administrators to monitor and investigate suspicious
> behavior by guests.

I try to encourage people to put a sample audit record in the commit
description as it helps others, even those not overly familiar with the
Linux kernel, know what to expect in the audit log and provide feedback.

> Co-developed by: William Wang <xwill@bu.edu>
> 
> Signed-off-by: William Wang <xwill@bu.edu>
> Signed-off-by: Chathura Rajapaksha <chath@bu.edu>
> ---
>  drivers/vfio/pci/Kconfig           | 12 ++++++++
>  drivers/vfio/pci/vfio_pci_config.c | 46 ++++++++++++++++++++++++++++--
>  include/uapi/linux/audit.h         |  1 +
>  3 files changed, 57 insertions(+), 2 deletions(-)

...

> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index cb4d11aa5598..ddd10904d60f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -25,6 +25,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/slab.h>
> +#include <linux/audit.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -1980,6 +1981,37 @@ static size_t vfio_pci_cap_remaining_dword(struct vfio_pci_core_device *vdev,
>  	return i;
>  }
>  
> +enum vfio_audit {
> +	VFIO_AUDIT_READ,
> +	VFIO_AUDIT_WRITE,
> +	VFIO_AUDIT_MAX,
> +};
> +
> +static const char * const vfio_audit_str[VFIO_AUDIT_MAX] = {
> +	[VFIO_AUDIT_READ]  = "READ",
> +	[VFIO_AUDIT_WRITE] = "WRITE",
> +};

We generally don't capitalize things like this in audit, "read" and
"write" would be preferred.

> +static void vfio_audit_access(const struct pci_dev *pdev,
> +			      size_t count, loff_t *ppos, bool blocked, unsigned int op)
> +{
> +	struct audit_buffer *ab;
> +
> +	if (WARN_ON_ONCE(op >= VFIO_AUDIT_MAX))
> +		return;
> +	if (audit_enabled == AUDIT_OFF)
> +		return;
> +	ab = audit_log_start(audit_context(), GFP_ATOMIC, AUDIT_VFIO);
> +	if (unlikely(!ab))
> +		return;
> +	audit_log_format(ab,
> +			 "device=%04x:%02x:%02x.%d access=%s offset=0x%llx size=%ld blocked=%u\n",
> +			 pci_domain_nr(pdev->bus), pdev->bus->number,
> +			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
> +			 vfio_audit_str[op], *ppos, count, blocked);
> +	audit_log_end(ab);
> +}

In the commit description you talk about a general PCIe device issue
in the first paragraph before going into the specifics of the VFIO
driver.  That's all well and good, but it makes me wonder if this
audit code above is better done as a generic PCI function that other
PCI drivers could use if they had similar concerns?  Please correct
me if I'm wrong, but other than symbol naming I don't see anyting
above which is specific to VFIO.  Thoughts?

Beyond that, I might also change the "access=" field to "op=" as we
already use the "op=" field name for similar things in audit, it would
be good to leverage that familiarity here.  Similarly using "res=",
specifically "res=0" for failure/blocked or "res=1" allowed, would
better fit with audit conventions.

> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 9a4ecc9f6dc5..c0aace7384f3 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -122,6 +122,7 @@
>  #define AUDIT_OPENAT2		1337	/* Record showing openat2 how args */
>  #define AUDIT_DM_CTRL		1338	/* Device Mapper target control */
>  #define AUDIT_DM_EVENT		1339	/* Device Mapper events */
> +#define AUDIT_VFIO		1340	/* VFIO events */

If the audit code above becomes more generalized as discussed, I would
expect this to change to AUDIT_VFIO to AUDIT_PCI, or something similar.

--
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-05-16 18:35         ` Jason Gunthorpe
@ 2025-05-17 17:14           ` Chathura Rajapaksha
  2025-05-26 19:44             ` Jason Gunthorpe
  0 siblings, 1 reply; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-05-17 17:14 UTC (permalink / raw)
  To: jgg
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	chathura.abeyrathne.lk, eparis, giovanni.cabiddu, kevin.tian, kvm,
	linux-kernel, paul, schnelle, xin.zeng, yahui.cao, zhangdongdong

On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > For example:
> > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> > pcieport 0000:c0:01.1:    [14] CmpltTO                (First)
>
> That's sure looks like a device bug. You should not ever get time out
> for a config space read.

Just to clarify, the above error was triggered by a write to the
configuration space. In fact, all the errors we have observed so far
were triggered by writes to unassigned PCI config space regions.

> Dynamically updateable might be overkill, I think you have one
> defective device. Have you talked to the supplier to see if it can be
> corrected?

So far, we have seen this issue on five PCIe devices across GPU and
storage classes from two different vendors. Therefore, we suspect the
problem is not limited to a single device, vendor, or class of devices.
We reported the issue to both vendors over two months ago. But we
have not gained any insights into the root cause of the issue from
either vendor so far.

> Alternatively you could handle this in qemu by sanitizing the config
> space..

While it's possible to address this issue for QEMU-KVM guests by
modifying QEMU, PCIe devices can also be assigned directly to
user-space applications such as DPDK via VFIO. We thought addressing
this at the VFIO driver level would help mitigate the issue in a
broader context beyond virtualized environments.

Thanks,
Chathura

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 2/2] audit accesses to unassigned PCI config regions
  2025-05-16 20:41   ` [PATCH RFC " Paul Moore
@ 2025-05-20 16:33     ` Chathura Rajapaksha
  2025-05-20 18:08       ` Paul Moore
  0 siblings, 1 reply; 15+ messages in thread
From: Chathura Rajapaksha @ 2025-05-20 16:33 UTC (permalink / raw)
  To: paul
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	chathura.abeyrathne.lk, eparis, kevin.tian, kvm, linux-kernel,
	schnelle, xin.zeng, xwill, yahui.cao, zhangdongdong

Hi Bjorn and Paul,

Thank you for your comments, and sorry for the late reply.

On Mon, Apr 28, 2025 at 11:05 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> Add blank line between paragraphs.

> Use imperative mood ("Introduce" instead of "This patch introduces
> ..." and "Add ..." instead of "A new type has been introduced").

> Simplify this patch by adding "blocked" in the first patch.  Then you
> won't have to touch the permission checking that is unrelated to the
> audit logging.  Consider adding a helper to do the checking and return
> "blocked" so it doesn't clutter vfio_config_do_rw().

I will address the above comments in the next revision.

On Fri, May 16, 2025 at 4:41 PM Paul Moore <paul@paul-moore.com> wrote:
> I try to encourage people to put a sample audit record in the commit
> description as it helps others, even those not overly familiar with the
> Linux kernel, know what to expect in the audit log and provide feedback.

> > +static const char * const vfio_audit_str[VFIO_AUDIT_MAX] = {
> > +     [VFIO_AUDIT_READ]  = "READ",
> > +     [VFIO_AUDIT_WRITE] = "WRITE",
> > +};
>
> We generally don't capitalize things like this in audit, "read" and
> "write" would be preferred.

I will address the above comments in the next revision.
The following is the expected audit message when a write is performed
to an unassigned PCI config region:

  device=0000:01:00.1 access=WRITE offset=0x298 size=1 blocked=0

> In the commit description you talk about a general PCIe device issue
> in the first paragraph before going into the specifics of the VFIO
> driver.  That's all well and good, but it makes me wonder if this
> audit code above is better done as a generic PCI function that other
> PCI drivers could use if they had similar concerns?  Please correct
> me if I'm wrong, but other than symbol naming I don't see anyting
> above which is specific to VFIO.  Thoughts?

While the issue is independent of VFIO, the security and availability
concerns arise when guests are able to write to unassigned PCI config
regions on devices passed through using VFIO. That's why we thought it
would be better to audit these accesses in the VFIO driver. Given this
context, do you think it would be more appropriate to audit these
accesses through a generic PCI function instead?

> Beyond that, I might also change the "access=" field to "op=" as we
> already use the "op=" field name for similar things in audit, it would
> be good to leverage that familiarity here.  Similarly using "res=",
> specifically "res=0" for failure/blocked or "res=1" allowed, would
> better fit with audit conventions.

Thanks for the suggestions, I will address these in the next revision.

Regards,
Chathura

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 2/2] audit accesses to unassigned PCI config regions
  2025-05-20 16:33     ` Chathura Rajapaksha
@ 2025-05-20 18:08       ` Paul Moore
  0 siblings, 0 replies; 15+ messages in thread
From: Paul Moore @ 2025-05-20 18:08 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	eparis, kevin.tian, kvm, linux-kernel, schnelle, xin.zeng, xwill,
	yahui.cao, zhangdongdong

On Tue, May 20, 2025 at 12:34 PM Chathura Rajapaksha
<chathura.abeyrathne.lk@gmail.com> wrote:
> On Fri, May 16, 2025 at 4:41 PM Paul Moore <paul@paul-moore.com> wrote:
>
> > In the commit description you talk about a general PCIe device issue
> > in the first paragraph before going into the specifics of the VFIO
> > driver.  That's all well and good, but it makes me wonder if this
> > audit code above is better done as a generic PCI function that other
> > PCI drivers could use if they had similar concerns?  Please correct
> > me if I'm wrong, but other than symbol naming I don't see anyting
> > above which is specific to VFIO.  Thoughts?
>
> While the issue is independent of VFIO, the security and availability
> concerns arise when guests are able to write to unassigned PCI config
> regions on devices passed through using VFIO. That's why we thought it
> would be better to audit these accesses in the VFIO driver. Given this
> context, do you think it would be more appropriate to audit these
> accesses through a generic PCI function instead?

I would suggest a generic PCI function, e.g. pci_audit_access(...),
that lives in the general PCI code and would be suitable for callers
other than VFIO, that you can call from within vfio_config_do_rw()
when Bad Things happen.

Does that make sense?

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions
  2025-05-17 17:14           ` Chathura Rajapaksha
@ 2025-05-26 19:44             ` Jason Gunthorpe
  0 siblings, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2025-05-26 19:44 UTC (permalink / raw)
  To: Chathura Rajapaksha
  Cc: Yunxiang.Li, alex.williamson, audit, avihaih, bhelgaas, chath,
	eparis, giovanni.cabiddu, kevin.tian, kvm, linux-kernel, paul,
	schnelle, xin.zeng, yahui.cao, zhangdongdong

On Sat, May 17, 2025 at 05:14:59PM +0000, Chathura Rajapaksha wrote:
> On Fri, May 16, 2025 at 2:35 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > By PCI bus error, I was referring to AER-reported uncorrectable errors.
> > > For example:
> > > pcieport 0000:c0:01.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
> > > pcieport 0000:c0:01.1:   device [1022:1483] error status/mask=00004000/07a10000
> > > pcieport 0000:c0:01.1:    [14] CmpltTO                (First)
> >
> > That's sure looks like a device bug. You should not ever get time out
> > for a config space read.
> 
> Just to clarify, the above error was triggered by a write to the
> configuration space. In fact, all the errors we have observed so far
> were triggered by writes to unassigned PCI config space regions.

Yuk, devices really shouldn't refuse to respond to writes or reads :(

> So far, we have seen this issue on five PCIe devices across GPU and
> storage classes from two different vendors. 

Ugh, that's awful.

> > Alternatively you could handle this in qemu by sanitizing the config
> > space..
> 
> While it's possible to address this issue for QEMU-KVM guests by
> modifying QEMU, PCIe devices can also be assigned directly to
> user-space applications such as DPDK via VFIO. We thought addressing
> this at the VFIO driver level would help mitigate the issue in a
> broader context beyond virtualized environments.

VFIO can probably already trigger command timeouts if it tries hard
enough, as long as it is a contained AER I don't see that the kernel
*needs* to prevent it.. 

For virtualization I really do expect that any serious user will be
tightly controlling the config space and maybe this finding just
supports that qemu needs to be enhanced to have more configurability
here.

It certainly is easier to add an option to qemu to make it block
any address not in a cap chain than to add a bunch of PCI ID tables
and detection to the kernel..

Jason

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-05-26 19:44 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-26 21:22 [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned config regions Chathura Rajapaksha
2025-04-26 21:22 ` [RFC PATCH 1/2] block accesses to unassigned PCI " Chathura Rajapaksha
2025-04-28 15:00   ` Bjorn Helgaas
2025-04-26 21:22 ` [RFC PATCH 2/2] audit " Chathura Rajapaksha
2025-04-28 15:05   ` Bjorn Helgaas
2025-05-16 20:41   ` [PATCH RFC " Paul Moore
2025-05-20 16:33     ` Chathura Rajapaksha
2025-05-20 18:08       ` Paul Moore
2025-04-28 13:24 ` [RFC PATCH 0/2] vfio/pci: Block and audit accesses to unassigned " Jason Gunthorpe
2025-04-28 20:25   ` Alex Williamson
2025-04-29 13:44     ` Jason Gunthorpe
2025-05-16 18:17       ` Chathura Rajapaksha
2025-05-16 18:35         ` Jason Gunthorpe
2025-05-17 17:14           ` Chathura Rajapaksha
2025-05-26 19:44             ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).