kvm.vger.kernel.org archive mirror
* [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue
@ 2025-06-26 18:04 Aaron Lewis
  2025-06-26 18:04 ` [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device Aaron Lewis
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Aaron Lewis @ 2025-06-26 18:04 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, dmatlack, vipinsh
  Cc: kvm, seanjc, jrhilke, Aaron Lewis

This series is being sent as an RFC to help brainstorm the best way to
fix a latency issue it uncovers.

The crux of the issue is that when initializing multiple VFs from the
same PF, the devices are reset serially rather than in parallel,
regardless of whether they are initialized from different threads.  That
happens because a shared lock is acquired when
vfio_df_ioctl_bind_iommufd() is called, and then an FLR (function level
reset) is done, which takes 100ms to complete.  Combined with trying to
initialize many devices at the same time, that results in a lot of
wasted time.

While the PCI spec does specify that an FLR requires 100ms to ensure it
has time to complete, I don't see anything indicating that other VFs
can't be reset at the same time.

A couple of ideas on how to approach a fix are:

  1. See if the lock preventing the second thread from making forward
  progress can be sharded to only include the VF it protects.
  
  2. Do the FLR for the VF in probe() and close(device_fd) rather than in
  vfio_df_ioctl_bind_iommufd().

To demonstrate the problem, the run script had to be extended to bind
multiple devices to the vfio-pci driver, not just one.  E.g.

  $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -d 0000:16:01.7 -s

Also included are a selftest and a BPF script.  With those, the problem
can be reproduced, with the output logging showing that one of the
devices takes >200ms to initialize despite the devices being
initialized from different threads.

  $ VFIO_BDF_1=0000:17:0c.1 VFIO_BDF_2=0000:17:0c.2 ./vfio_flr_test
  [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
  [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.

And the BPF script output indicates that the latency issue comes from
the mutex in vfio_df_ioctl_bind_iommufd().

  [pcie_flr] duration = 108ms
  [vfio_df_ioctl_bind_iommufd] duration = 108ms
  [pcie_flr] duration = 104ms
  [vfio_df_ioctl_bind_iommufd] duration = 212ms

  [__mutex_lock] duration = 103ms
  __mutex_lock+5
  vfio_df_ioctl_bind_iommufd+171
  __se_sys_ioctl+110
  do_syscall_64+109
  entry_SYSCALL_64_after_hwframe+120

This series can be applied on top of the VFIO selftests using the branch:
upstream/vfio/selftests/v1.

https://github.com/dmatlack/linux/tree/vfio/selftests/v1

Aaron Lewis (3):
  vfio: selftests: Allow run.sh to bind to more than one device
  vfio: selftests: Introduce the selftest vfio_flr_test
  vfio: selftests: Include a BPF script to pair with the selftest vfio_flr_test

 tools/testing/selftests/vfio/Makefile         |   1 +
 tools/testing/selftests/vfio/run.sh           |  73 +++++++----
 tools/testing/selftests/vfio/vfio_flr_test.c  | 120 ++++++++++++++++++
 .../testing/selftests/vfio/vfio_flr_trace.bt  |  83 ++++++++++++
 4 files changed, 251 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_flr_test.c
 create mode 100644 tools/testing/selftests/vfio/vfio_flr_trace.bt

-- 
2.50.0.727.gbf7dc18ff4-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device
  2025-06-26 18:04 [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Aaron Lewis
@ 2025-06-26 18:04 ` Aaron Lewis
  2025-06-27 22:19   ` David Matlack
  2025-06-26 18:04 ` [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test Aaron Lewis
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Aaron Lewis @ 2025-06-26 18:04 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, dmatlack, vipinsh
  Cc: kvm, seanjc, jrhilke, Aaron Lewis

Refactor the script "run.sh" to allow it to bind to more than one device
at a time.  Previously, the script allowed only one BDF to be passed in
as an argument.  Extend this behavior to allow more than one, e.g.

  $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -d 0000:16:01.7 -s

This results in unbinding the devices 0000:17:0c.1, 0000:17:0c.2 and
0000:16:01.7 from their current drivers, binding them to the
vfio-pci driver, then breaking out into a shell.

When testing is complete, simply exit the shell to unbind those devices
from the vfio-pci driver and rebind them to their previous drivers.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 tools/testing/selftests/vfio/run.sh | 73 +++++++++++++++++++----------
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/vfio/run.sh b/tools/testing/selftests/vfio/run.sh
index 0476b6d7adc3..334934dab5c5 100755
--- a/tools/testing/selftests/vfio/run.sh
+++ b/tools/testing/selftests/vfio/run.sh
@@ -2,11 +2,11 @@
 
 # Global variables initialized in main() and then used during cleanup() when
 # the script exits.
-declare DEVICE_BDF
-declare NEW_DRIVER
-declare OLD_DRIVER
-declare OLD_NUMVFS
-declare DRIVER_OVERRIDE
+declare -a DEVICE_BDFS
+declare -a NEW_DRIVERS
+declare -a OLD_DRIVERS
+declare -a OLD_NUMVFS
+declare -a DRIVER_OVERRIDES
 
 function write_to() {
 	# Unfortunately set -x does not show redirects so use echo to manually
@@ -36,10 +36,20 @@ function clear_driver_override() {
 }
 
 function cleanup() {
-	if [ "${NEW_DRIVER}"      ]; then unbind ${DEVICE_BDF} ${NEW_DRIVER} ; fi
-	if [ "${DRIVER_OVERRIDE}" ]; then clear_driver_override ${DEVICE_BDF} ; fi
-	if [ "${OLD_DRIVER}"      ]; then bind ${DEVICE_BDF} ${OLD_DRIVER} ; fi
-	if [ "${OLD_NUMVFS}"      ]; then set_sriov_numvfs ${DEVICE_BDF} ${OLD_NUMVFS} ; fi
+	for i in "${!DEVICE_BDFS[@]}"; do
+		if [ ${NEW_DRIVERS[$i]} != false      ]; then unbind ${DEVICE_BDFS[$i]} ${NEW_DRIVERS[$i]}; fi
+		if [ ${DRIVER_OVERRIDES[$i]} != false ]; then clear_driver_override ${DEVICE_BDFS[$i]}; fi
+		if [ ${OLD_DRIVERS[$i]} != false      ]; then bind ${DEVICE_BDFS[$i]} ${OLD_DRIVERS[$i]}; fi
+		if [ ${OLD_NUMVFS[$i]} != false       ]; then set_sriov_numvfs ${DEVICE_BDFS[$i]} ${OLD_NUMVFS[$i]}; fi
+	done
+}
+
+function get_bdfs_string() {
+	local bdfs_str;
+
+	printf -v bdfs_str '%s,' "${DEVICE_BDFS[@]}"
+	bdfs_str="${bdfs_str%,}"
+	echo "${bdfs_str}"
 }
 
 function usage() {
@@ -60,7 +70,7 @@ function main() {
 
 	while getopts "d:hs" opt; do
 		case $opt in
-			d) DEVICE_BDF="$OPTARG" ;;
+			d) DEVICE_BDFS+=("$OPTARG") ;;
 			s) shell=true ;;
 			*) usage ;;
 		esac
@@ -73,33 +83,44 @@ function main() {
 	[ ! "${shell}" ] && [ $# = 0 ] && usage
 
 	# Check that the user passed in a BDF.
-	[ "${DEVICE_BDF}" ] || usage
+	[ "${#DEVICE_BDFS[@]}" -gt 0 ] || usage
 
 	trap cleanup EXIT
 	set -e
 
-	test -d /sys/bus/pci/devices/${DEVICE_BDF}
+	for device_bdf in "${DEVICE_BDFS[@]}"; do
+		local old_numvf=false
+		local old_driver=false
+		local driver_override=false
 
-	if [ -f /sys/bus/pci/devices/${DEVICE_BDF}/sriov_numvfs ]; then
-		OLD_NUMVFS=$(cat /sys/bus/pci/devices/${DEVICE_BDF}/sriov_numvfs)
-		set_sriov_numvfs ${DEVICE_BDF} 0
-	fi
+		test -d /sys/bus/pci/devices/${device_bdf}
 
-	if [ -L /sys/bus/pci/devices/${DEVICE_BDF}/driver ]; then
-		OLD_DRIVER=$(basename $(readlink -m /sys/bus/pci/devices/${DEVICE_BDF}/driver))
-		unbind ${DEVICE_BDF} ${OLD_DRIVER}
-	fi
+		if [ -f /sys/bus/pci/devices/${device_bdf}/sriov_numvfs ]; then
+			old_numvf=$(cat /sys/bus/pci/devices/${device_bdf}/sriov_numvfs)
+			set_sriov_numvfs ${device_bdf} 0
+		fi
+
+		if [ -L /sys/bus/pci/devices/${device_bdf}/driver ]; then
+			old_driver=$(basename $(readlink -m /sys/bus/pci/devices/${device_bdf}/driver))
+			unbind ${device_bdf} ${old_driver}
+		fi
 
-	set_driver_override ${DEVICE_BDF} vfio-pci
-	DRIVER_OVERRIDE=true
+		set_driver_override ${device_bdf} vfio-pci
+		driver_override=true
 
-	bind ${DEVICE_BDF} vfio-pci
-	NEW_DRIVER=vfio-pci
+		bind ${device_bdf} vfio-pci
+
+		NEW_DRIVERS+=(vfio-pci)
+		OLD_DRIVERS+=(${old_driver})
+		OLD_NUMVFS+=(${old_numvf})
+		DRIVER_OVERRIDES+=(${driver_override})
+	done
 
 	echo
 	if [ "${shell}" ]; then
-		echo "Dropping into ${SHELL} with VFIO_SELFTESTS_BDF=${DEVICE_BDF}"
-		VFIO_SELFTESTS_BDF=${DEVICE_BDF} ${SHELL}
+		local bdfs_str=$(get_bdfs_string);
+		echo "Dropping into ${SHELL} with VFIO_SELFTESTS_BDFS=${bdfs_str}"
+		VFIO_SELFTESTS_BDFS=${bdfs_str} ${SHELL}
 	else
 		"$@" ${DEVICE_BDF}
 	fi
-- 
2.50.0.727.gbf7dc18ff4-goog



* [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test
  2025-06-26 18:04 [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Aaron Lewis
  2025-06-26 18:04 ` [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device Aaron Lewis
@ 2025-06-26 18:04 ` Aaron Lewis
  2025-06-27 22:37   ` David Matlack
  2025-06-26 18:04 ` [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with " Aaron Lewis
  2025-06-26 19:26 ` [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Alex Williamson
  3 siblings, 1 reply; 11+ messages in thread
From: Aaron Lewis @ 2025-06-26 18:04 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, dmatlack, vipinsh
  Cc: kvm, seanjc, jrhilke, Aaron Lewis

Introduce the VFIO selftest "vfio_flr_test" as a way of demonstrating a
latency issue that occurs when multiple devices are reset at the same
time.  To reproduce the issue, simply run the selftest, e.g.

  $ VFIO_BDF_1=0000:17:0c.1 VFIO_BDF_2=0000:17:0c.2 ./vfio_flr_test
  [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
  [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.

When the devices are initialized in the test, one of them will take
>200ms despite the initialization happening on separate threads.
Initializing a device requires an FLR (function level reset), so a
device taking ~100ms to initialize is expected.  However, when a second
device is initialized in parallel, the desired behavior would be for
the FLRs to happen concurrently.  Unfortunately, due to a lock taken in
vfio_df_group_open(), that does not happen.

As for how the selftest works, it requires two BDFs to be passed to it
as environment variables, as shown in the example above:
 - VFIO_BDF_1
 - VFIO_BDF_2

It then initializes each device on a separate thread and reports how
long each took.

The thread ID is included in the debug prints to show which thread each
message is coming from, and to show that this is a parallel operation.

Some of the prints are commented out to keep the focus on the issue at
hand; however, they can be useful for debugging, so they were left in.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 tools/testing/selftests/vfio/Makefile        |   1 +
 tools/testing/selftests/vfio/vfio_flr_test.c | 120 +++++++++++++++++++
 2 files changed, 121 insertions(+)
 create mode 100644 tools/testing/selftests/vfio/vfio_flr_test.c

diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 324ba0175a33..57635883f2c2 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -1,5 +1,6 @@
 CFLAGS = $(KHDR_INCLUDES)
 TEST_GEN_PROGS += vfio_dma_mapping_test
+TEST_GEN_PROGS += vfio_flr_test
 TEST_GEN_PROGS += vfio_iommufd_setup_test
 TEST_GEN_PROGS += vfio_pci_device_test
 TEST_GEN_PROGS += vfio_pci_driver_test
diff --git a/tools/testing/selftests/vfio/vfio_flr_test.c b/tools/testing/selftests/vfio/vfio_flr_test.c
new file mode 100644
index 000000000000..5522f6d256dc
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_flr_test.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdlib.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/limits.h>
+#include <linux/pci_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <vfio_util.h>
+
+#define NR_THREADS 2
+
+double freq_ms = .0;
+
+static bool is_bdf(const char *str)
+{
+	unsigned int s, b, d, f;
+	int length, count;
+
+	count = sscanf(str, "%4x:%2x:%2x.%2x%n", &s, &b, &d, &f, &length);
+	return count == 4 && length == strlen(str);
+}
+
+static const char *vfio_get_bdf(const char *env_var)
+{
+	char *bdf;
+
+	bdf = getenv(env_var);
+	if (bdf) {
+		VFIO_ASSERT_TRUE(is_bdf(bdf), "Invalid BDF: %s\n", bdf);
+		return bdf;
+	}
+
+	return NULL;
+}
+
+static inline uint64_t rdtsc(void)
+{
+	uint32_t eax, edx;
+	uint64_t tsc_val;
+	/*
+	 * The lfence is to wait (on Intel CPUs) until all previous
+	 * instructions have been executed. If software requires RDTSC to be
+	 * executed prior to execution of any subsequent instruction, it can
+	 * execute LFENCE immediately after RDTSC
+	 */
+	__asm__ __volatile__("lfence; rdtsc; lfence" : "=a"(eax), "=d"(edx));
+	tsc_val = ((uint64_t)edx) << 32 | eax;
+	return tsc_val;
+}
+
+static void init_freq_ms(void)
+{
+	uint64_t tsc_start, tsc_end;
+
+	tsc_start = rdtsc();
+	sleep(1);
+	tsc_end = rdtsc();
+	freq_ms = (double)(tsc_end - tsc_start) / 1000.0;
+	// printf("[0x%lx] freq_ms = %.03lf\n", pthread_self(), freq_ms);
+}
+
+static double now_ms(void)
+{
+	return (double)rdtsc() / freq_ms;
+}
+
+static double time_since_last_ms(double *last_ms)
+{
+	double now = now_ms();
+	double duration = now - *last_ms;
+
+	*last_ms = now;
+	return duration;
+}
+
+static void *flr_test_thread(void *arg)
+{
+	const char *device_bdf = (const char *)arg;
+	struct vfio_pci_device *device;
+	double last = now_ms();
+
+	device = vfio_pci_device_init(device_bdf, default_iommu_mode);
+	printf("[0x%lx] '%s' initialized in %.1lfms.\n",
+	       pthread_self(), device_bdf, time_since_last_ms(&last));
+
+	vfio_pci_device_cleanup(device);
+	// printf("[0x%lx] '%s' cleaned up in %.1lfms\n",
+	//        pthread_self(), device_bdf, time_since_last_ms(&last));
+
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	const char *device_bdfs[NR_THREADS];
+	pthread_t threads[NR_THREADS];
+	int i;
+
+	init_freq_ms();
+
+	device_bdfs[0] = vfio_get_bdf("VFIO_BDF_1");
+	device_bdfs[1] = vfio_get_bdf("VFIO_BDF_2");
+
+	// printf("[0x%lx] nr_threads = '%d'\n", pthread_self(), NR_THREADS);
+
+	for (i = 0; i < NR_THREADS; i++)
+		pthread_create(&threads[i], NULL, flr_test_thread,
+			       (void *)device_bdfs[i]);
+
+	for (i = 0; i < NR_THREADS; i++)
+		pthread_join(threads[i], NULL);
+
+	return 0;
+}
-- 
2.50.0.727.gbf7dc18ff4-goog



* [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with the selftest vfio_flr_test
  2025-06-26 18:04 [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Aaron Lewis
  2025-06-26 18:04 ` [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device Aaron Lewis
  2025-06-26 18:04 ` [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test Aaron Lewis
@ 2025-06-26 18:04 ` Aaron Lewis
  2025-06-27 22:51   ` David Matlack
  2025-06-26 19:26 ` [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Alex Williamson
  3 siblings, 1 reply; 11+ messages in thread
From: Aaron Lewis @ 2025-06-26 18:04 UTC (permalink / raw)
  To: alex.williamson, bhelgaas, dmatlack, vipinsh
  Cc: kvm, seanjc, jrhilke, Aaron Lewis

This BPF script is included as a way of verifying where the latency
issue is coming from in the kernel.

The test will print how long it took to initialize each device, e.g.:
  [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
  [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.

That then pairs with the results from this script to show where the latency
issues are coming from.

  [pcie_flr] duration = 108ms
  [vfio_df_ioctl_bind_iommufd] duration = 108ms
  [pcie_flr] duration = 104ms
  [vfio_df_ioctl_bind_iommufd] duration = 212ms

Of note, the second call to vfio_df_ioctl_bind_iommufd() takes >200ms,
yet both calls to pcie_flr() take only ~100ms.  That indicates the
latency is introduced between these two calls.

Looking further, one of the attempts to acquire the mutex in
vfio_df_ioctl_bind_iommufd() takes ~100ms.

  [__mutex_lock] duration = 103ms

And it has the following call stack.

  __mutex_lock+5
  vfio_df_ioctl_bind_iommufd+171
  __se_sys_ioctl+110
  do_syscall_64+109
  entry_SYSCALL_64_after_hwframe+120

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 .../testing/selftests/vfio/vfio_flr_trace.bt  | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 tools/testing/selftests/vfio/vfio_flr_trace.bt

diff --git a/tools/testing/selftests/vfio/vfio_flr_trace.bt b/tools/testing/selftests/vfio/vfio_flr_trace.bt
new file mode 100644
index 000000000000..b246c71c8562
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_flr_trace.bt
@@ -0,0 +1,83 @@
+#define msecs (nsecs / 1000 / 1000)
+
+/*
+ * Time 'vfio_df_ioctl_bind_iommufd()'.
+ */
+kfunc:vfio_df_ioctl_bind_iommufd
+{
+	@vfio_df_ioctl_bind_iommufd_start[tid] = msecs;
+}
+
+kretfunc:vfio_df_ioctl_bind_iommufd
+{
+	printf("[vfio_df_ioctl_bind_iommufd] duration = %ldms\n", msecs - @vfio_df_ioctl_bind_iommufd_start[tid]);
+}
+
+/*
+ * Time 'vfio_df_unbind_iommufd()'.
+ */
+kfunc:vfio_df_unbind_iommufd
+{
+	@vfio_df_unbind_iommufd_start[tid] = msecs;
+}
+
+kretfunc:vfio_df_unbind_iommufd
+{
+	// printf("[vfio_df_unbind_iommufd] duration = %ldms\n", msecs - @vfio_df_unbind_iommufd_start[tid]);
+}
+
+/*
+ * Time 'vfio_df_open()'.
+ */
+kfunc:vfio_df_open
+{
+	@vfio_df_open_start[tid] = msecs;
+}
+
+kretfunc:vfio_df_open
+{
+	// printf("[vfio_df_open] duration = %ldms\n", msecs - @vfio_df_open_start[tid]);
+}
+
+/*
+ * Time 'pcie_flr()'.
+ */
+kfunc:pcie_flr
+{
+	// printf("cpu = %d\ncomm = %s (PID = %d)\nkstack:\n%s\nustack:\n%s\n", cpu, comm, pid, kstack, ustack());
+
+	@pcie_flr_start[tid] = msecs;
+}
+
+kretfunc:pcie_flr
+{
+	printf("[pcie_flr] duration = %ldms\n", msecs - @pcie_flr_start[tid]);
+}
+
+/*
+ * Time '__mutex_lock()'.
+ */
+kfunc:__mutex_lock
+{
+	// printf("cpu = %d\ncomm = %s (PID = %d)\nkstack:\n%s\nustack:\n%s\n", cpu, comm, pid, kstack, ustack());
+
+	@mutex_lock_start[tid] = msecs;
+}
+
+kretfunc:__mutex_lock
+{
+	// printf("[__mutex_lock] cpu = %d, tid = %d, duration = %ldms\n", cpu, tid, msecs - @mutex_lock_start[tid]);
+}
+
+/*
+ * Dump the results to stdout.
+ */
+END
+{
+	// Clean up maps.
+	clear(@vfio_df_unbind_iommufd_start);
+	clear(@vfio_df_ioctl_bind_iommufd_start);
+	clear(@vfio_df_open_start);
+	clear(@pcie_flr_start);
+	clear(@mutex_lock_start);
+}
-- 
2.50.0.727.gbf7dc18ff4-goog



* Re: [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue
  2025-06-26 18:04 [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Aaron Lewis
                   ` (2 preceding siblings ...)
  2025-06-26 18:04 ` [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with " Aaron Lewis
@ 2025-06-26 19:26 ` Alex Williamson
  2025-06-26 20:56   ` Jason Gunthorpe
  3 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2025-06-26 19:26 UTC (permalink / raw)
  To: Aaron Lewis
  Cc: bhelgaas, dmatlack, vipinsh, kvm, seanjc, jrhilke,
	Jason Gunthorpe

On Thu, 26 Jun 2025 18:04:21 +0000
Aaron Lewis <aaronlewis@google.com> wrote:

> This series is being sent as an RFC to help brainstorm the best way to
> fix a latency issue it uncovers.
> 
> The crux of the issue is that when initializing multiple VFs from the
> same PF the devices are reset serially rather than in parallel
> regardless if they are initialized from different threads.  That happens
> because a shared lock is acquired when vfio_df_ioctl_bind_iommufd() is
> called, then a FLR (function level reset) is done which takes 100ms to
> complete.  That in combination with trying to initialize many devices at
> the same time results in a lot of wasted time.
> 
> While the PCI spec does specify that a FLR requires 100ms to ensure it
> has time to complete, I don't see anything indicating that other VFs
> can't be reset at the same time.
> 
> A couple of ideas on how to approach a fix are:
> 
>   1. See if the lock preventing the second thread from making forward
>   progress can be sharded to only include the VF it protects.

I think we're talking about the dev_set mutex here, right?  I think this
is just an oversight.  The original lock that dev_set replaced was
devised to manage the set of devices affected by the same bus or slot
reset.  I believe we've held the same semantics though and VFs just
happen to fall through to the default of a bus-based dev_set.
Obviously we cannot do a bus or slot reset of a VF, we only have FLR,
and it especially doesn't make sense that VFs on the same "bus" from
different PFs share this mutex.

Seems like we just need this:

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcd..261a6dc5a5fc 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2149,7 +2149,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		return -EBUSY;
 	}
 
-	if (pci_is_root_bus(pdev->bus)) {
+	if (pci_is_root_bus(pdev->bus) || pdev->is_virtfn) {
 		ret = vfio_assign_device_set(&vdev->vdev, vdev);
 	} else if (!pci_probe_reset_slot(pdev->slot)) {
 		ret = vfio_assign_device_set(&vdev->vdev, pdev->slot);

Does that allow fully parallelized resets?

Meanwhile this is a good use case of selftests and further impetus for
me to get it working.  Thanks,

Alex

>   
>   2. Do the FLR for the VF in probe() and close(device_fd) rather than in
>   vfio_df_ioctl_bind_iommufd().
> 
> To demonstrate the problem the run script had to be extended to bind
> multiple devices to the vfio-driver, not just one.  E.g.
> 
>   $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -d 0000:16:01.7 -s
> 
> Also included is a selftest and BPF script.  With those, the problem can
> be reproduced with the output logging showing that one of the devices
> takes >200ms to initialize despite running from different threads.
> 
>   $ VFIO_BDF_1=0000:17:0c.1 VFIO_BDF_2=0000:17:0c.2 ./vfio_flr_test
>   [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
>   [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.
> 
> And the BPF script indicating that the latency issues are coming from the
> mutex in vfio_df_ioctl_bind_iommufd().
> 
>   [pcie_flr] duration = 108ms
>   [vfio_df_ioctl_bind_iommufd] duration = 108ms
>   [pcie_flr] duration = 104ms
>   [vfio_df_ioctl_bind_iommufd] duration = 212ms
> 
>   [__mutex_lock] duration = 103ms
>   __mutex_lock+5
>   vfio_df_ioctl_bind_iommufd+171
>   __se_sys_ioctl+110
>   do_syscall_64+109
>   entry_SYSCALL_64_after_hwframe+120
> 
> This series can be applied on top of the VFIO selftests using the branch:
> upstream/vfio/selftests/v1.
> 
> https://github.com/dmatlack/linux/tree/vfio/selftests/v1
> 
> Aaron Lewis (3):
>   vfio: selftests: Allow run.sh to bind to more than one device
>   vfio: selftests: Introduce the selftest vfio_flr_test
>   vfio: selftests: Include a BPF script to pair with the selftest vfio_flr_test
> 
>  tools/testing/selftests/vfio/Makefile         |   1 +
>  tools/testing/selftests/vfio/run.sh           |  73 +++++++----
>  tools/testing/selftests/vfio/vfio_flr_test.c  | 120 ++++++++++++++++++
>  .../testing/selftests/vfio/vfio_flr_trace.bt  |  83 ++++++++++++
>  4 files changed, 251 insertions(+), 26 deletions(-)
>  create mode 100644 tools/testing/selftests/vfio/vfio_flr_test.c
>  create mode 100644 tools/testing/selftests/vfio/vfio_flr_trace.bt
> 



* Re: [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue
  2025-06-26 19:26 ` [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Alex Williamson
@ 2025-06-26 20:56   ` Jason Gunthorpe
  2025-06-26 21:58     ` Aaron Lewis
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2025-06-26 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Aaron Lewis, bhelgaas, dmatlack, vipinsh, kvm, seanjc, jrhilke

On Thu, Jun 26, 2025 at 01:26:59PM -0600, Alex Williamson wrote:
> On Thu, 26 Jun 2025 18:04:21 +0000
> Aaron Lewis <aaronlewis@google.com> wrote:
> 
> > This series is being sent as an RFC to help brainstorm the best way to
> > fix a latency issue it uncovers.
> > 
> > The crux of the issue is that when initializing multiple VFs from the
> > same PF the devices are reset serially rather than in parallel
> > regardless if they are initialized from different threads.  That happens
> > because a shared lock is acquired when vfio_df_ioctl_bind_iommufd() is
> > called, then a FLR (function level reset) is done which takes 100ms to
> > complete.  That in combination with trying to initialize many devices at
> > the same time results in a lot of wasted time.
> > 
> > While the PCI spec does specify that a FLR requires 100ms to ensure it
> > has time to complete, I don't see anything indicating that other VFs
> > can't be reset at the same time.
> > 
> > A couple of ideas on how to approach a fix are:
> > 
> >   1. See if the lock preventing the second thread from making forward
> >   progress can be sharded to only include the VF it protects.
> 
> I think we're talking about the dev_set mutex here, right?  I think this
> is just an oversight.  The original lock that dev_set replaced was
> devised to manage the set of devices affected by the same bus or slot
> reset.  I believe we've held the same semantics though and VFs just
> happen to fall through to the default of a bus-based dev_set.
> Obviously we cannot do a bus or slot reset of a VF, we only have FLR,
> and it especially doesn't make sense that VFs on the same "bus" from
> different PFs share this mutex.

It certainly could be.. But I am feeling a bit wary and would want to
check this carefully. We ended up using the devset for more things -
need to check where everything ended up.

Off hand I don't recall any reason why the VF should be part of the
dev set..

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 6328c3a05bcd..261a6dc5a5fc 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2149,7 +2149,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  		return -EBUSY;
>  	}
>  
> -	if (pci_is_root_bus(pdev->bus)) {
> +	if (pci_is_root_bus(pdev->bus) || pdev->is_virtfn) {
>  		ret = vfio_assign_device_set(&vdev->vdev, vdev);
>  	} else if (!pci_probe_reset_slot(pdev->slot)) {
>  		ret = vfio_assign_device_set(&vdev->vdev, pdev->slot);
> 
> Does that allow fully parallelized resets?

I forget all the details but if we are sure the reset of a VF is only
the VF then that does seem like the right direction

Jason


* Re: [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue
  2025-06-26 20:56   ` Jason Gunthorpe
@ 2025-06-26 21:58     ` Aaron Lewis
  2025-06-26 22:10       ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Aaron Lewis @ 2025-06-26 21:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, bhelgaas, dmatlack, vipinsh, kvm, seanjc,
	jrhilke

On Thu, Jun 26, 2025 at 1:56 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Jun 26, 2025 at 01:26:59PM -0600, Alex Williamson wrote:
> > On Thu, 26 Jun 2025 18:04:21 +0000
> > Aaron Lewis <aaronlewis@google.com> wrote:
> >
> > > This series is being sent as an RFC to help brainstorm the best way to
> > > fix a latency issue it uncovers.
> > >
> > > The crux of the issue is that when initializing multiple VFs from the
> > > same PF the devices are reset serially rather than in parallel
> > > regardless if they are initialized from different threads.  That happens
> > > because a shared lock is acquired when vfio_df_ioctl_bind_iommufd() is
> > > called, then a FLR (function level reset) is done which takes 100ms to
> > > complete.  That in combination with trying to initialize many devices at
> > > the same time results in a lot of wasted time.
> > >
> > > While the PCI spec does specify that a FLR requires 100ms to ensure it
> > > has time to complete, I don't see anything indicating that other VFs
> > > can't be reset at the same time.
> > >
> > > A couple of ideas on how to approach a fix are:
> > >
> > >   1. See if the lock preventing the second thread from making forward
> > >   progress can be sharded to only include the VF it protects.
> >
> > I think we're talking about the dev_set mutex here, right?  I think this
> > is just an oversight.  The original lock that dev_set replaced was
> > devised to manage the set of devices affected by the same bus or slot
> > reset.  I believe we've held the same semantics though and VFs just
> > happen to fall through to the default of a bus-based dev_set.
> > Obviously we cannot do a bus or slot reset of a VF, we only have FLR,
> > and it especially doesn't make sense that VFs on the same "bus" from
> > different PFs share this mutex.
>
> It certainly could be.. But I am feeling a bit wary and would want to
> check this carefully. We ended up using the devset for more things -
> need to check where everything ended up.
>
> Off hand I don't recall any reason why the VF should be part of the
> dev set..
>
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 6328c3a05bcd..261a6dc5a5fc 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -2149,7 +2149,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> >               return -EBUSY;
> >       }
> >
> > -     if (pci_is_root_bus(pdev->bus)) {
> > +     if (pci_is_root_bus(pdev->bus) || pdev->is_virtfn) {

That fixes the issue.  I applied this patch and reran the test and
both threads finish in ~100ms now!  We are now getting fully
parallelized resets!

[0x7fd12afc0700] '0000:17:0c.2' initialized in 102.3ms.
[0x7fd12b7c1700] '0000:17:0c.1' initialized in 102.4ms.

> >               ret = vfio_assign_device_set(&vdev->vdev, vdev);
> >       } else if (!pci_probe_reset_slot(pdev->slot)) {
> >               ret = vfio_assign_device_set(&vdev->vdev, pdev->slot);
> >
> > Does that allow fully parallelized resets?
>
> I forget all the details but if we are sure the reset of a VF is only
> the VF then that does seem like the right direction
>
> Jason

Alex, would you like me to include your patch in this series so we
have the fix and test together, or would you prefer to propose the fix
yourself?  Either way works for me.


* Re: [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue
  2025-06-26 21:58     ` Aaron Lewis
@ 2025-06-26 22:10       ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2025-06-26 22:10 UTC (permalink / raw)
  To: Aaron Lewis
  Cc: Jason Gunthorpe, bhelgaas, dmatlack, vipinsh, kvm, seanjc,
	jrhilke

On Thu, 26 Jun 2025 14:58:54 -0700
Aaron Lewis <aaronlewis@google.com> wrote:

> On Thu, Jun 26, 2025 at 1:56 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Jun 26, 2025 at 01:26:59PM -0600, Alex Williamson wrote:  
> > > On Thu, 26 Jun 2025 18:04:21 +0000
> > > Aaron Lewis <aaronlewis@google.com> wrote:
> > >  
> > > > This series is being sent as an RFC to help brainstorm the best way to
> > > > fix a latency issue it uncovers.
> > > >
> > > > The crux of the issue is that when initializing multiple VFs from the
> > > > same PF the devices are reset serially rather than in parallel
> > > > regardless if they are initialized from different threads.  That happens
> > > > because a shared lock is acquired when vfio_df_ioctl_bind_iommufd() is
> > > > called, then a FLR (function level reset) is done which takes 100ms to
> > > > complete.  That in combination with trying to initialize many devices at
> > > > the same time results in a lot of wasted time.
> > > >
> > > > While the PCI spec does specify that a FLR requires 100ms to ensure it
> > > > has time to complete, I don't see anything indicating that other VFs
> > > > can't be reset at the same time.
> > > >
> > > > A couple of ideas on how to approach a fix are:
> > > >
> > > >   1. See if the lock preventing the second thread from making forward
> > > >   progress can be sharded to only include the VF it protects.  
> > >
> > > I think we're talking about the dev_set mutex here, right?  I think this
> > > is just an oversight.  The original lock that dev_set replaced was
> > > devised to manage the set of devices affected by the same bus or slot
> > > reset.  I believe we've held the same semantics though and VFs just
> > > happen to fall through to the default of a bus-based dev_set.
> > > Obviously we cannot do a bus or slot reset of a VF, we only have FLR,
> > > and it especially doesn't make sense that VFs on the same "bus" from
> > > different PFs share this mutex.  
> >
> > It certainly could be.. But I am feeling a bit wary and would want to
> > check this carefully. We ended up using the devset for more things -
> > need to check where everything ended up.
> >
> > Off hand I don't recall any reason why the VF should be part of the
> > dev set..
> >  
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index 6328c3a05bcd..261a6dc5a5fc 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -2149,7 +2149,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> > >               return -EBUSY;
> > >       }
> > >
> > > -     if (pci_is_root_bus(pdev->bus)) {
> > > +     if (pci_is_root_bus(pdev->bus) || pdev->is_virtfn) {  
> 
> That fixes the issue.  I applied this patch and reran the test and
> both threads finish in ~100ms now!  We are now getting fully
> parallelized resets!
> 
> [0x7fd12afc0700] '0000:17:0c.2' initialized in 102.3ms.
> [0x7fd12b7c1700] '0000:17:0c.1' initialized in 102.4ms.
> 
> > >               ret = vfio_assign_device_set(&vdev->vdev, vdev);
> > >       } else if (!pci_probe_reset_slot(pdev->slot)) {
> > >               ret = vfio_assign_device_set(&vdev->vdev, pdev->slot);
> > >
> > > Does that allow fully parallelized resets?  
> >
> > I forget all the details but if we are sure the reset of a VF is only
> > the VF then that does seem like the right direction
> >
> > Jason  
> 
> Alex, would you like me to include your patch in this series so we
> have the fix and test together, or would you prefer to propose the fix
> yourself?  Either way works for me.

I think it's best not to queue the fix behind the selftests when
there's no code dependency.  I'll send out the patch and hopefully
Jason can chew on whether I'm overlooking something.  This mutex has
basically become our de facto serialization mutex, but I think that hot
reset remains the only multi-device serialization problem.  Thanks,

Alex



* Re: [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device
  2025-06-26 18:04 ` [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device Aaron Lewis
@ 2025-06-27 22:19   ` David Matlack
  0 siblings, 0 replies; 11+ messages in thread
From: David Matlack @ 2025-06-27 22:19 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, bhelgaas, vipinsh, kvm, seanjc, jrhilke

On 2025-06-26 06:04 PM, Aaron Lewis wrote:
> Refactor the script "run.sh" to allow it to bind to more than one device
> at a time.  Previously, the script allowed only one BDF to be passed in
> as an argument.  Extend this behavior to allow more than one, e.g.
> 
>   $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -d 0000:16:01.7 -s
> 
> This results in unbinding the devices 0000:17:0c.1, 0000:17:0c.2 and
> 0000:16:01.7 from their current drivers, binding them to the
> vfio-pci driver, then breaking out into a shell.
> 
> When testing is complete simply exit the shell to have those devices
> unbind from the vfio-pci driver and rebind to their previous ones.

Thanks for adding this. I am also planning to upstream a change I just
made to our Google-internal copy of the VFIO selftests that splits this
script into separate setup and cleanup scripts, which will make it
easier to use multiple devices.

The setup script would let you run it multiple times to set up multiple
devices:

  tools/testing/selftests/vfio/setup.sh -d BDF_1
  tools/testing/selftests/vfio/setup.sh -d BDF_2
  tools/testing/selftests/vfio/setup.sh -d BDF_3
  tools/testing/selftests/vfio/setup.sh -d BDF_4

State about in-use devices is stored on the filesystem so that run.sh
can find it.

Then all devices can later be cleaned up all at once (or individually):

  tools/testing/selftests/vfio/cleanup.sh -d BDF_1
  tools/testing/selftests/vfio/cleanup.sh

Would you be interested in picking up that change as part of this
series?
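A rough sketch of the state tracking described above, in the same style
as run.sh.  The state directory path, file layout, and helper names are
all illustrative, not the actual setup.sh/cleanup.sh implementation:

```shell
# Each setup.sh invocation records the device's previous driver under a
# per-BDF state directory; cleanup.sh later reads it back to unwind one
# device (or all of them).  STATE_DIR is a made-up location.
STATE_DIR="${STATE_DIR:-/tmp/vfio-selftests-state}"

function save_device_state() {
	local bdf="$1" old_driver="$2"

	mkdir -p "${STATE_DIR}/${bdf}"
	echo "${old_driver}" > "${STATE_DIR}/${bdf}/old_driver"
}

function cleanup_device() {
	local bdf="$1"
	local old_driver

	old_driver=$(cat "${STATE_DIR}/${bdf}/old_driver")
	# In the real script this would unbind vfio-pci and rebind
	# ${old_driver}; here we just report and drop the state.
	echo "would rebind ${bdf} to ${old_driver}"
	rm -rf "${STATE_DIR}/${bdf}"
}

function cleanup_all() {
	local dir

	for dir in "${STATE_DIR}"/*; do
		[ -d "${dir}" ] && cleanup_device "$(basename "${dir}")"
	done
}
```

With something like this, `cleanup.sh` with no arguments maps to
cleanup_all() and `cleanup.sh -d BDF` maps to cleanup_device().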

> 
> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> ---
>  tools/testing/selftests/vfio/run.sh | 73 +++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 26 deletions(-)
> 
> diff --git a/tools/testing/selftests/vfio/run.sh b/tools/testing/selftests/vfio/run.sh
> index 0476b6d7adc3..334934dab5c5 100755
> --- a/tools/testing/selftests/vfio/run.sh
> +++ b/tools/testing/selftests/vfio/run.sh
> @@ -2,11 +2,11 @@
>  
>  # Global variables initialized in main() and then used during cleanup() when
>  # the script exits.
> -declare DEVICE_BDF
> -declare NEW_DRIVER
> -declare OLD_DRIVER
> -declare OLD_NUMVFS
> -declare DRIVER_OVERRIDE
> +declare -a DEVICE_BDFS
> +declare -a NEW_DRIVERS
> +declare -a OLD_DRIVERS
> +declare -a OLD_NUMVFS
> +declare -a DRIVER_OVERRIDES
>  
>  function write_to() {
>  	# Unfortunately set -x does not show redirects so use echo to manually
> @@ -36,10 +36,20 @@ function clear_driver_override() {
>  }
>  
>  function cleanup() {
> -	if [ "${NEW_DRIVER}"      ]; then unbind ${DEVICE_BDF} ${NEW_DRIVER} ; fi
> -	if [ "${DRIVER_OVERRIDE}" ]; then clear_driver_override ${DEVICE_BDF} ; fi
> -	if [ "${OLD_DRIVER}"      ]; then bind ${DEVICE_BDF} ${OLD_DRIVER} ; fi
> -	if [ "${OLD_NUMVFS}"      ]; then set_sriov_numvfs ${DEVICE_BDF} ${OLD_NUMVFS} ; fi
> +	for i in "${!DEVICE_BDFS[@]}"; do
> +		if [ ${NEW_DRIVERS[$i]} != false      ]; then unbind ${DEVICE_BDFS[$i]} ${NEW_DRIVERS[$i]}; fi
> +		if [ ${DRIVER_OVERRIDES[$i]} != false ]; then clear_driver_override ${DEVICE_BDFS[$i]}; fi
> +		if [ ${OLD_DRIVERS[$i]} != false      ]; then bind ${DEVICE_BDFS[$i]} ${OLD_DRIVERS[$i]}; fi
> +		if [ ${OLD_NUMVFS[$i]} != false       ]; then set_sriov_numvfs ${DEVICE_BDFS[$i]} ${OLD_NUMVFS[$i]}; fi
> +	done

If you want false to work here then you need to preinitialize all of the
arrays with false. Otherwise if an error occurs part-way through the
loop in main then these arrays will be partially initialized.

I think it would be simpler to just continue using the empty string
to indicate "false" since that will also work for uninitialized arrays.
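A minimal sketch of the empty-string convention (array names mirror the
patch, but the echo stands in for the real unwind actions):

```shell
# An element of a bash array that was never assigned expands to "", so
# `[ -n ... ]` naturally skips devices the main loop never reached
# before an error -- no pre-initialization to "false" required.
declare -a DEVICE_BDFS
declare -a OLD_DRIVERS

function cleanup_drivers() {
	local i

	for i in "${!DEVICE_BDFS[@]}"; do
		# Empty string means "nothing to undo" for this device.
		if [ -n "${OLD_DRIVERS[$i]}" ]; then
			echo "rebinding ${DEVICE_BDFS[$i]} to ${OLD_DRIVERS[$i]}"
		fi
	done
}
```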

> +}
> +
> +function get_bdfs_string() {
> +	local bdfs_str;
> +
> +	printf -v bdfs_str '%s,' "${DEVICE_BDFS[@]}"
> +	bdfs_str="${bdfs_str%,}"
> +	echo "${bdfs_str}"
>  }
>  
>  function usage() {
> @@ -60,7 +70,7 @@ function main() {
>  
>  	while getopts "d:hs" opt; do
>  		case $opt in
> -			d) DEVICE_BDF="$OPTARG" ;;
> +			d) DEVICE_BDFS+=("$OPTARG") ;;
>  			s) shell=true ;;
>  			*) usage ;;
>  		esac
> @@ -73,33 +83,44 @@ function main() {
>  	[ ! "${shell}" ] && [ $# = 0 ] && usage
>  
>  	# Check that the user passed in a BDF.
> -	[ "${DEVICE_BDF}" ] || usage
> +	[[ -n "${DEVICE_BDFS[@]}" ]] || usage
>  
>  	trap cleanup EXIT
>  	set -e
>  
> -	test -d /sys/bus/pci/devices/${DEVICE_BDF}
> +	for device_bdf in "${DEVICE_BDFS[@]}"; do
> +		local old_numvf=false
> +		local old_driver=false
> +		local driver_override=false

If an error happens between here and where the arrays are updated below
then cleanup() won't unwind the change.

Can you initialize the arrays here instead of local variables? That will
fix it.

>  
> -	if [ -f /sys/bus/pci/devices/${DEVICE_BDF}/sriov_numvfs ]; then
> -		OLD_NUMVFS=$(cat /sys/bus/pci/devices/${DEVICE_BDF}/sriov_numvfs)
> -		set_sriov_numvfs ${DEVICE_BDF} 0
> -	fi
> +		test -d /sys/bus/pci/devices/${device_bdf}
>  
> -	if [ -L /sys/bus/pci/devices/${DEVICE_BDF}/driver ]; then
> -		OLD_DRIVER=$(basename $(readlink -m /sys/bus/pci/devices/${DEVICE_BDF}/driver))
> -		unbind ${DEVICE_BDF} ${OLD_DRIVER}
> -	fi
> +		if [ -f /sys/bus/pci/devices/${device_bdf}/sriov_numvfs ]; then
> +			old_numvf=$(cat /sys/bus/pci/devices/${device_bdf}/sriov_numvfs)
> +			set_sriov_numvfs ${device_bdf} 0
> +		fi
> +
> +		if [ -L /sys/bus/pci/devices/${device_bdf}/driver ]; then
> +			old_driver=$(basename $(readlink -m /sys/bus/pci/devices/${device_bdf}/driver))
> +			unbind ${device_bdf} ${old_driver}
> +		fi
>  
> -	set_driver_override ${DEVICE_BDF} vfio-pci
> -	DRIVER_OVERRIDE=true
> +		set_driver_override ${device_bdf} vfio-pci
> +		driver_override=true
>  
> -	bind ${DEVICE_BDF} vfio-pci
> -	NEW_DRIVER=vfio-pci
> +		bind ${device_bdf} vfio-pci
> +
> +		NEW_DRIVERS+=(vfio-pci)
> +		OLD_DRIVERS+=(${old_driver})
> +		OLD_NUMVFS+=(${old_numvf})
> +		DRIVER_OVERRIDES+=(${driver_override})
> +	done
>  
>  	echo
>  	if [ "${shell}" ]; then
> -		echo "Dropping into ${SHELL} with VFIO_SELFTESTS_BDF=${DEVICE_BDF}"
> -		VFIO_SELFTESTS_BDF=${DEVICE_BDF} ${SHELL}
> +		local bdfs_str=$(get_bdfs_string);
> +		echo "Dropping into ${SHELL} with VFIO_SELFTESTS_BDFS=${bdfs_str}"
> +		VFIO_SELFTESTS_BDFS=${bdfs_str} ${SHELL}

What's the reason to use get_bdfs_string() instead of just:

  VFIO_SELFTESTS_BDFS="${DEVICE_BDFS[@]}" ${SHELL}

?
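For illustration (with made-up BDFs), assigning "${ARR[@]}" to a scalar
joins the elements with single spaces, so the comma-joining helper is
only needed if a comma separator is specifically wanted:

```shell
# bash joins an array expansion with spaces when it is assigned to a
# scalar variable, which is enough for the environment variable.
DEVICE_BDFS=(0000:17:0c.1 0000:17:0c.2)
joined="${DEVICE_BDFS[@]}"
echo "VFIO_SELFTESTS_BDFS=${joined}"
```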

>  	else
>  		"$@" ${DEVICE_BDF}

This should be:

		"$@" ${DEVICE_BDFS[@]}

>  	fi
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 


* Re: [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test
  2025-06-26 18:04 ` [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test Aaron Lewis
@ 2025-06-27 22:37   ` David Matlack
  0 siblings, 0 replies; 11+ messages in thread
From: David Matlack @ 2025-06-27 22:37 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, bhelgaas, vipinsh, kvm, seanjc, jrhilke

On 2025-06-26 06:04 PM, Aaron Lewis wrote:
> Introduce the VFIO selftest "vfio_flr_test" as a way of demonstrating a
> latency issue that occurs when multiple devices are reset at the
> same time.  To reproduce this issue simply run the selftest, e.g.
> 
>   $ VFIO_BDF_1=0000:17:0c.1 VFIO_BDF_2=0000:17:0c.2 ./vfio_flr_test
>   [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
>   [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.
> 
> When the devices are initialized in the test, despite that happening on
> separate threads, one of the devices will take >200ms.  Initializing a
> device requires a FLR (function level reset), so a device taking
> ~100ms to initialize is expected.  However, when a second device is
> initialized in parallel, the desired behavior would be for the FLRs to
> happen concurrently.  Unfortunately, due to a lock in
> vfio_df_group_open(), that does not happen.
> 
> As for how the selftest works, it requires two BDFs to be passed to it as
> environment variables as shown in the example above:
>  - VFIO_BDF_1
>  - VFIO_BDF_2
> 
> It then initializes each of them on separate threads and reports how long
> they took.
> 
> The Thread ID is included in the debug prints to show which thread each
> message is coming from, and to show that this is a parallel operation.
> 
> Some of the prints are commented out to allow the focus to be on the
> issue at hand, however, they can be useful for debugging so they were
> left in.
> 
> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> ---
>  tools/testing/selftests/vfio/Makefile        |   1 +
>  tools/testing/selftests/vfio/vfio_flr_test.c | 120 +++++++++++++++++++
>  2 files changed, 121 insertions(+)
>  create mode 100644 tools/testing/selftests/vfio/vfio_flr_test.c
> 
> diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
> index 324ba0175a33..57635883f2c2 100644
> --- a/tools/testing/selftests/vfio/Makefile
> +++ b/tools/testing/selftests/vfio/Makefile
> @@ -1,5 +1,6 @@
>  CFLAGS = $(KHDR_INCLUDES)
>  TEST_GEN_PROGS += vfio_dma_mapping_test
> +TEST_GEN_PROGS += vfio_flr_test
>  TEST_GEN_PROGS += vfio_iommufd_setup_test
>  TEST_GEN_PROGS += vfio_pci_device_test
>  TEST_GEN_PROGS += vfio_pci_driver_test
> diff --git a/tools/testing/selftests/vfio/vfio_flr_test.c b/tools/testing/selftests/vfio/vfio_flr_test.c
> new file mode 100644
> index 000000000000..5522f6d256dc
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/vfio_flr_test.c

I don't think vfio_flr_test.c is the right name for this test. FLR is
something that happens in the kernel during one specific ioctl() deep
within vfio_pci_device_init(). And it also might not happen at all (if
the device being used does not support FLR).

What this test actually does is measure the time it takes to do
vfio_pci_device_init() in parallel across multiple threads.

How about vfio_pci_device_init_perf_test.c?

> @@ -0,0 +1,120 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <stdlib.h>
> +
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +
> +#include <linux/limits.h>
> +#include <linux/pci_regs.h>
> +#include <linux/sizes.h>
> +#include <linux/vfio.h>
> +
> +#include <vfio_util.h>
> +
> +#define NR_THREADS 2
> +
> +double freq_ms = .0;
> +
> +static bool is_bdf(const char *str)
> +{
> +	unsigned int s, b, d, f;
> +	int length, count;
> +
> +	count = sscanf(str, "%4x:%2x:%2x.%2x%n", &s, &b, &d, &f, &length);
> +	return count == 4 && length == strlen(str);
> +}
> +
> +static const char *vfio_get_bdf(const char *env_var)
> +{
> +	char *bdf;
> +
> +	bdf = getenv(env_var);
> +	if (bdf) {
> +		VFIO_ASSERT_TRUE(is_bdf(bdf), "Invalid BDF: %s\n", bdf);
> +		return bdf;
> +	}
> +
> +	return NULL;
> +}

Please update the library to support multiple BDFs rather than creating
a custom solution for this test.

In the library we have support for getting a single BDF from either the
command line or the environment variable $VFIO_SELFTESTS_BDF via:

  const char *device_bdf = vfio_selftests_get_bdf(&argc, argv);

Add a variant of this that supports looking for and returning multiple
BDFs and use it here, e.g.

  const char *device_bdfs[2];

  vfio_selftests_get_bdfs(&argc, argv, device_bdfs, 2);

Then we can reimplement vfio_selftests_get_bdf() on top like this:

const char *vfio_selftests_get_bdf(int *argc, char *argv[])
{
	const char *bdf;

	vfio_selftests_get_bdfs(argc, argv, &bdf, 1);
	return bdf;
}

The new function should support getting the BDFs from either the command
line or $VFIO_SELFTESTS_BDF, just like the current function.

For bonus points we could even make vfio_selftests_get_bdfs() support
any number of BDFs by dynamically allocating the array and returning the
number it found on the command line or in the env var. That way your
test can support running with any number of devices (not just 2).

Once you do this then you can run your test just like any other VFIO
selftests, e.g.:

   $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -- ./vfio_flr_test

Or like this:

   $ ./run.sh -d 0000:17:0c.1 -d 0000:17:0c.2 -s
   $ ./vfio_flr_test
   $ exit


> +
> +static inline uint64_t rdtsc(void)
> +{
> +	uint32_t eax, edx;
> +	uint64_t tsc_val;
> +	/*
> +	 * The lfence is to wait (on Intel CPUs) until all previous
> +	 * instructions have been executed. If software requires RDTSC to be
> +	 * executed prior to execution of any subsequent instruction, it can
> +	 * execute LFENCE immediately after RDTSC
> +	 */
> +	__asm__ __volatile__("lfence; rdtsc; lfence" : "=a"(eax), "=d"(edx));
> +	tsc_val = ((uint64_t)edx) << 32 | eax;
> +	return tsc_val;
> +}
> +
> +static void init_freq_ms(void)
> +{
> +	uint64_t tsc_start, tsc_end;
> +
> +	tsc_start = rdtsc();
> +	sleep(1);
> +	tsc_end = rdtsc();
> +	freq_ms = (double)(tsc_end - tsc_start) / 1000.0;
> +	// printf("[0x%lx] freq_ms = %.03lf\n", pthread_self(), freq_ms);
> +}
> +
> +static double now_ms(void)
> +{
> +	return (double)rdtsc() / freq_ms;
> +}
> +
> +static double time_since_last_ms(double *last_ms)
> +{
> +	double now = now_ms();
> +	double duration = now - *last_ms;
> +
> +	*last_ms = now;
> +	return duration;
> +}

Please use clock_gettime(CLOCK_MONOTONIC) to measure elapsed time in a
platform-independent way.

> +
> +static void *flr_test_thread(void *arg)
> +{
> +	const char *device_bdf = (const char *)arg;
> +	struct vfio_pci_device *device;
> +	double last = now_ms();

Should we put a simple spin-barrier here to ensure the threads do
vfio_pci_device_init() at the same time (at least as best as we can
guarantee from userspace)?

> +
> +	device = vfio_pci_device_init(device_bdf, default_iommu_mode);
> +	printf("[0x%lx] '%s' initialized in %.1lfms.\n",
> +	       pthread_self(), device_bdf, time_since_last_ms(&last));

I think we need a barrier here to make sure nothing in
vfio_pci_device_cleanup() interferes with the performance of a
still-running vfio_pci_device_init().

> +
> +	vfio_pci_device_cleanup(device);
> +	// printf("[0x%lx] '%s' cleaned up in %.1lfms\n",
> +	//        pthread_self(), device_bdf, time_since_last_ms(&last));
> +
> +	return NULL;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	const char *device_bdfs[NR_THREADS];
> +	pthread_t threads[NR_THREADS];
> +	int i;
> +
> +	init_freq_ms();
> +
> +	device_bdfs[0] = vfio_get_bdf("VFIO_BDF_1");
> +	device_bdfs[1] = vfio_get_bdf("VFIO_BDF_2");
> +
> +	// printf("[0x%lx] nr_threads = '%d'\n", pthread_self(), NR_THREADS);
> +
> +	for (i = 0; i < NR_THREADS; i++)
> +		pthread_create(&threads[i], NULL, flr_test_thread,
> +			       (void *)device_bdfs[i]);
> +
> +	for (i = 0; i < NR_THREADS; i++)
> +		pthread_join(threads[i], NULL);
> +
> +	return 0;
> +}
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 


* Re: [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with the selftest vfio_flr_test
  2025-06-26 18:04 ` [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with " Aaron Lewis
@ 2025-06-27 22:51   ` David Matlack
  0 siblings, 0 replies; 11+ messages in thread
From: David Matlack @ 2025-06-27 22:51 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, bhelgaas, vipinsh, kvm, seanjc, jrhilke

On 2025-06-26 06:04 PM, Aaron Lewis wrote:
> This BPF script is included as a way of verifying where the latency issues are
> coming from in the kernel.
> 
> The test will print how long it took to initialize each device, e.g.:
>   [0x7f61bb888700] '0000:17:0c.2' initialized in 108.6ms.
>   [0x7f61bc089700] '0000:17:0c.1' initialized in 212.3ms.
> 
> That then pairs with the results from this script to show where the latency
> issues are coming from.
> 
>   [pcie_flr] duration = 108ms
>   [vfio_df_ioctl_bind_iommufd] duration = 108ms
>   [pcie_flr] duration = 104ms
>   [vfio_df_ioctl_bind_iommufd] duration = 212ms
> 
> Of note, the second call to vfio_df_ioctl_bind_iommufd() takes >200ms,
> yet both calls to pcie_flr() only take ~100ms.  That indicates the latency
> issue occurs between these two calls.
> 
> Looking further, one of the attempts to lock the mutex in
> vfio_df_ioctl_bind_iommufd() takes ~100ms.
> 
>   [__mutex_lock] duration = 103ms
> 
> And it has the following call stack:
> 
>   __mutex_lock+5
>   vfio_df_ioctl_bind_iommufd+171
>   __se_sys_ioctl+110
>   do_syscall_64+109
>   entry_SYSCALL_64_after_hwframe+120

Very slick. Can you share the sequence of commands you ran to load this
script, run the test, and get the latency data out of the kernel?

> 
> Signed-off-by: Aaron Lewis <aaronlewis@google.com>
> ---
>  .../testing/selftests/vfio/vfio_flr_trace.bt  | 83 +++++++++++++++++++
>  1 file changed, 83 insertions(+)
>  create mode 100644 tools/testing/selftests/vfio/vfio_flr_trace.bt

I'm not sure where we should put this to be honest. Did you want to get
this merged or is this just in the series for visibility?

It would be nice to have a corpus of BPF traces checked in that people
can easily use to debug issues and time things while running VFIO
selftests.

Perhaps just put it in its own directory, e.g.
tools/testing/selftests/vfio/bpf?


end of thread, other threads:[~2025-06-27 22:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-26 18:04 [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demonstrate a latency issue Aaron Lewis
2025-06-26 18:04 ` [RFC PATCH 1/3] vfio: selftests: Allow run.sh to bind to more than one device Aaron Lewis
2025-06-27 22:19   ` David Matlack
2025-06-26 18:04 ` [RFC PATCH 2/3] vfio: selftests: Introduce the selftest vfio_flr_test Aaron Lewis
2025-06-27 22:37   ` David Matlack
2025-06-26 18:04 ` [RFC PATCH 3/3] vfio: selftests: Include a BPF script to pair with " Aaron Lewis
2025-06-27 22:51   ` David Matlack
2025-06-26 19:26 ` [RFC PATCH 0/3] vfio: selftests: Add VFIO selftest to demontrate a latency issue Alex Williamson
2025-06-26 20:56   ` Jason Gunthorpe
2025-06-26 21:58     ` Aaron Lewis
2025-06-26 22:10       ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).