* [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver
From: Yishai Hadas @ 2022-08-15 15:10 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

This series adds device DMA logging uAPIs and their implementation as
part of the mlx5 driver.

DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

The uAPIs are based on the FEATURE ioctl, as introduced earlier by the
RFC below [1], and follow the notes that were discussed on the mailing
list.

It includes:
- A PROBE option to detect if the device supports DMA logging.
- A SET option to start device DMA logging in given IOVA ranges.
- A GET option to read back and clear the device DMA log.
- A SET option to stop device DMA logging that was previously started.

Extra details exist as part of relevant patches in the series.
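
For orientation, below is a minimal userspace sketch of the PROBE + START
flow built on the uAPIs that patch #3 introduces. It is illustrative only:
device_fd, the IOVA range and the 4K page size hint are placeholders, and
error handling is omitted.

	struct vfio_device_feature_dma_logging_range range = {
		.iova = iova,
		.length = length,
	};
	struct {
		struct vfio_device_feature feature;
		struct vfio_device_feature_dma_logging_control control;
	} buf = {};

	buf.feature.argsz = sizeof(buf);
	buf.control.page_size = 4096;	/* hint; the driver may pick another */
	buf.control.num_ranges = 1;
	buf.control.ranges = (__u64)(uintptr_t)&range;

	/* PROBE: does the device support DMA logging at all? */
	buf.feature.flags = VFIO_DEVICE_FEATURE_PROBE | VFIO_DEVICE_FEATURE_SET |
			    VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &buf))
		return -1;	/* not supported */

	/* SET: start tracking; on return page_size holds the driver's choice */
	buf.feature.flags = VFIO_DEVICE_FEATURE_SET |
			    VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &buf))
		return -1;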

In addition, the series adds some infrastructure support for managing an
IOVA bitmap, contributed by Joao Martins.

It abstracts how an IOVA range is represented in a bitmap that is
granulated by a given page_size, and does the heavy lifting of
translating user bitmap pointers into their corresponding kernel
addresses. This new functionality hides the complexity of user/kernel
bitmap pointer usage and exposes a simple API for setting bits.

This functionality will be used as part of the IOMMUFD series for
system IOMMU dirty tracking.

Finally, the series adds the mlx5 implementation of the DMA logging
APIs, based on its device specification.

The matching qemu changes can be previewed here [2].
They come on top of the v2 migration protocol patches that were already
sent to the mailing list.

Note:
- As this series touches mlx5_core, we may need to send the net/mlx5
  patches as a pull request to VFIO before acceptance, to avoid
  conflicts.

[1] https://lore.kernel.org/all/20220501123301.127279-1-yishaih@nvidia.com/T/
[2] https://github.com/avihai1122/qemu/commits/device_dirty_tracking

Changes from V3: https://lore.kernel.org/all/202207011231.1oPQhSzo-lkp@intel.com/t/
Rebase on top of v6.0 rc1.
Patch #3:
- Drop the documentation note regarding the 'atomic OR' usage of the bitmap
  as part of VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT.
  This deletion was missed in V3 and is needed to match the kernel code.
  To clarify: while testing V3 we saw a big performance penalty (2x in
  some cases) when the iova bitmap patch used atomic bit operations. As
  QEMU doesn't really share bitmaps between multiple trackers, we saw no
  reason to use atomics and pay that cost.
  If a real use case justifies it in the future, we can add a new
  VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT_ATOMIC option with the matching
  kernel code.
Patch #4:
- The patch renaming vfio.c to vfio_main.c was already accepted and is
  no longer part of this series.

Changes from V2: https://lore.kernel.org/netdev/20220726151232.GF4438@nvidia.com/t/
Patch #1:
- Add some reserved fields that were missed.
Patch #3:
- Improve the UAPI documentation in a few places, as asked by Alex and
  Kevin, based on the discussion in the mailing list.
Patch #5:
- Improvements from Joao to make his IOVA bitmap patch cleaner/simpler,
  as asked by Alex. It includes the below:
   * Make iova_to_index and index_to_iova fully symmetrical.
   * Use 'sizeof(*iter->data) * BITS_PER_BYTE' in both index_to_iova
     and iova_to_index.
   * Remove iova_bitmap_init() and just stay with iova_bitmap_iter_init().
   * s/left/remaining/
   * Don't use the @remaining variable for both index and iova/length.
   * Remove stale comment on max dirty bitmap bits.
   * Remove DIV_ROUNDUP from iova_to_index() helper and replace with a
     division.
   * Use iova rather than length where appropriate, while noting with
     commentary the usage of length as next relative IOVA.
   * Rework pinning to be internal and remove that from the iova iter
     API caller.
   * get() and put() now teardown iova_bitmap::dirty npages.
   * Move unnecessary includes into the C file.
   * Add theory of operation and theory of usage in the header file.
   * Add more comments on private helpers with less obvious logic.
   * Add documentation on all public APIs.
   * Change the commit message to reflect the new usage of the APIs.
Patch #6:
- Drop the hard-coded 1024 for LOG_MAX_RANGES and instead derive the
  limit from PAGE_SIZE, as was suggested by Jason.
- Return -E2BIG as Alex suggested.
- Adapt the logging report loop to the new IOVA bitmap API.

Changes from V1: https://lore.kernel.org/netdev/202207052209.x00Iykkp-lkp@intel.com/T/

- Patch #6: Fix a note from the kernel test robot: select INTERVAL_TREE for VFIO.

Changes from V0: https://lore.kernel.org/netdev/202207011231.1oPQhSzo-lkp@intel.com/T/

- Drop the first 2 patches that Alex merged already.
- Fix a note from the kernel test robot, based on Jason's suggestion.
- Some improvements from Joao to make his IOVA bitmap patch
  cleaner/simpler. It includes the below:
    * Rename iova_bitmap_array_length to iova_bitmap_iova_to_index.
    * Rename iova_bitmap_index_to_length to iova_bitmap_index_to_iova.
    * Change iova_bitmap_iova_to_index to take an iova_bitmap_iter
      as an argument to pair with iova_bitmap_index_to_length.
    * Make iova_bitmap_iter_done() use >= instead of
      subtraction+comparison. This fixes the return value of
      iova_bitmap_iter_done(), which previously returned true when !done.
    * Remove iova_bitmap_iter_length().
    * Simplify iova_bitmap_length()'s overcomplicated trailing end check.
    * Convert all sizeof(u64) into sizeof(*iter->data).
    * Use u64 __user for ::data instead of void in both struct and
      initialization of iova_bitmap.

Yishai

Joao Martins (1):
  vfio: Add an IOVA bitmap support

Yishai Hadas (9):
  net/mlx5: Introduce ifc bits for page tracker
  net/mlx5: Query ADV_VIRTUALIZATION capabilities
  vfio: Introduce DMA logging uAPIs
  vfio: Introduce the DMA logging feature support
  vfio/mlx5: Init QP based resources for dirty tracking
  vfio/mlx5: Create and destroy page tracker object
  vfio/mlx5: Report dirty pages from tracker
  vfio/mlx5: Manage error scenarios on tracker
  vfio/mlx5: Set the driver DMA logging callbacks

 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   1 +
 drivers/vfio/Kconfig                          |   1 +
 drivers/vfio/Makefile                         |   6 +-
 drivers/vfio/iova_bitmap.c                    | 224 ++++
 drivers/vfio/pci/mlx5/cmd.c                   | 995 +++++++++++++++++-
 drivers/vfio/pci/mlx5/cmd.h                   |  63 +-
 drivers/vfio/pci/mlx5/main.c                  |   9 +-
 drivers/vfio/pci/vfio_pci_core.c              |   5 +
 drivers/vfio/vfio_main.c                      | 159 +++
 include/linux/iova_bitmap.h                   | 189 ++++
 include/linux/mlx5/device.h                   |   9 +
 include/linux/mlx5/mlx5_ifc.h                 |  83 +-
 include/linux/vfio.h                          |  21 +-
 include/uapi/linux/vfio.h                     |  86 ++
 15 files changed, 1837 insertions(+), 20 deletions(-)
 create mode 100644 drivers/vfio/iova_bitmap.c
 create mode 100644 include/linux/iova_bitmap.h

-- 
2.18.1



* [PATCH V4 vfio 01/10] net/mlx5: Introduce ifc bits for page tracker
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Introduce the ifc bits needed to use the page tracker.

A page tracker is a dirty page tracking object used by the device to
report the tracking log.
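
As a rough illustration of how these bits are meant to be programmed (the
actual command is built by the later "Create and destroy page tracker
object" patch; the in/out buffers, vhca_id, rq_qpn, page_size and the
single range below are placeholders):

	void *obj, *cmd_hdr;

	cmd_hdr = MLX5_ADDR_OF(create_page_track_obj_in, in,
			       general_obj_in_cmd_hdr);
	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode,
		 MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type,
		 MLX5_OBJ_TYPE_PAGE_TRACK);

	obj = MLX5_ADDR_OF(create_page_track_obj_in, in, obj_context);
	MLX5_SET(page_track, obj, vhca_id, vhca_id);
	MLX5_SET(page_track, obj, reporting_qpn, rq_qpn);
	MLX5_SET(page_track, obj, log_page_size, ilog2(page_size));
	MLX5_SET(page_track, obj, num_ranges, 1);
	MLX5_SET64(page_track, obj, range_start_address, range_iova);
	MLX5_SET64(page_track, obj, length, range_length);

	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));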

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 83 ++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 4acd5610e96b..06eab92b9fb3 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -89,6 +89,7 @@ enum {
 	MLX5_OBJ_TYPE_VIRTIO_NET_Q = 0x000d,
 	MLX5_OBJ_TYPE_VIRTIO_Q_COUNTERS = 0x001c,
 	MLX5_OBJ_TYPE_MATCH_DEFINER = 0x0018,
+	MLX5_OBJ_TYPE_PAGE_TRACK = 0x46,
 	MLX5_OBJ_TYPE_MKEY = 0xff01,
 	MLX5_OBJ_TYPE_QP = 0xff02,
 	MLX5_OBJ_TYPE_PSV = 0xff03,
@@ -1733,7 +1734,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         max_geneve_tlv_options[0x8];
 	u8         reserved_at_568[0x3];
 	u8         max_geneve_tlv_option_data_len[0x5];
-	u8         reserved_at_570[0x10];
+	u8         reserved_at_570[0x9];
+	u8         adv_virtualization[0x1];
+	u8         reserved_at_57a[0x6];
 
 	u8	   reserved_at_580[0xb];
 	u8	   log_max_dci_stream_channels[0x5];
@@ -11818,4 +11821,82 @@ struct mlx5_ifc_load_vhca_state_out_bits {
 	u8         reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_adv_virtualization_cap_bits {
+	u8         reserved_at_0[0x3];
+	u8         pg_track_log_max_num[0x5];
+	u8         pg_track_max_num_range[0x8];
+	u8         pg_track_log_min_addr_space[0x8];
+	u8         pg_track_log_max_addr_space[0x8];
+
+	u8         reserved_at_20[0x3];
+	u8         pg_track_log_min_msg_size[0x5];
+	u8         reserved_at_28[0x3];
+	u8         pg_track_log_max_msg_size[0x5];
+	u8         reserved_at_30[0x3];
+	u8         pg_track_log_min_page_size[0x5];
+	u8         reserved_at_38[0x3];
+	u8         pg_track_log_max_page_size[0x5];
+
+	u8         reserved_at_40[0x7c0];
+};
+
+struct mlx5_ifc_page_track_report_entry_bits {
+	u8         dirty_address_high[0x20];
+
+	u8         dirty_address_low[0x20];
+};
+
+enum {
+	MLX5_PAGE_TRACK_STATE_TRACKING,
+	MLX5_PAGE_TRACK_STATE_REPORTING,
+	MLX5_PAGE_TRACK_STATE_ERROR,
+};
+
+struct mlx5_ifc_page_track_range_bits {
+	u8         start_address[0x40];
+
+	u8         length[0x40];
+};
+
+struct mlx5_ifc_page_track_bits {
+	u8         modify_field_select[0x40];
+
+	u8         reserved_at_40[0x10];
+	u8         vhca_id[0x10];
+
+	u8         reserved_at_60[0x20];
+
+	u8         state[0x4];
+	u8         track_type[0x4];
+	u8         log_addr_space_size[0x8];
+	u8         reserved_at_90[0x3];
+	u8         log_page_size[0x5];
+	u8         reserved_at_98[0x3];
+	u8         log_msg_size[0x5];
+
+	u8         reserved_at_a0[0x8];
+	u8         reporting_qpn[0x18];
+
+	u8         reserved_at_c0[0x18];
+	u8         num_ranges[0x8];
+
+	u8         reserved_at_e0[0x20];
+
+	u8         range_start_address[0x40];
+
+	u8         length[0x40];
+
+	struct     mlx5_ifc_page_track_range_bits track_range[0];
+};
+
+struct mlx5_ifc_create_page_track_obj_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_page_track_bits obj_context;
+};
+
+struct mlx5_ifc_modify_page_track_obj_in_bits {
+	struct mlx5_ifc_general_obj_in_cmd_hdr_bits general_obj_in_cmd_hdr;
+	struct mlx5_ifc_page_track_bits obj_context;
+};
+
 #endif /* MLX5_IFC_H */
-- 
2.18.1



* [PATCH V4 vfio 02/10] net/mlx5: Query ADV_VIRTUALIZATION capabilities
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Query ADV_VIRTUALIZATION capabilities which provide information for
advanced virtualization related features.

Current capabilities refer to the page tracker object which is used for
tracking the pages that are dirtied by the device.
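
For illustration, once the new capability type is populated a driver can
read the page tracker limits through the macro added below, e.g. (a
hedged sketch; the particular fields and error code are examples only):

	if (!MLX5_CAP_GEN(mdev, adv_virtualization))
		return -EOPNOTSUPP;

	max_ranges = MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_max_num_range);
	log_max_addr_space = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
						pg_track_log_max_addr_space);
	log_min_page_size = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
						pg_track_log_min_page_size);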

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fw.c   | 6 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 +
 include/linux/mlx5/device.h                    | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index 079fa44ada71..483a51870505 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -273,6 +273,12 @@ int mlx5_query_hca_caps(struct mlx5_core_dev *dev)
 			return err;
 	}
 
+	if (MLX5_CAP_GEN(dev, adv_virtualization)) {
+		err = mlx5_core_get_caps(dev, MLX5_CAP_ADV_VIRTUALIZATION);
+		if (err)
+			return err;
+	}
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index bec8d6d0b5f6..806213d5b049 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1488,6 +1488,7 @@ static const int types[] = {
 	MLX5_CAP_IPSEC,
 	MLX5_CAP_PORT_SELECTION,
 	MLX5_CAP_DEV_SHAMPO,
+	MLX5_CAP_ADV_VIRTUALIZATION,
 };
 
 static void mlx5_hca_caps_free(struct mlx5_core_dev *dev)
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index b5f58fd37a0f..5b41b9fb3d48 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -1200,6 +1200,7 @@ enum mlx5_cap_type {
 	MLX5_CAP_DEV_SHAMPO = 0x1d,
 	MLX5_CAP_GENERAL_2 = 0x20,
 	MLX5_CAP_PORT_SELECTION = 0x25,
+	MLX5_CAP_ADV_VIRTUALIZATION = 0x26,
 	/* NUM OF CAP Types */
 	MLX5_CAP_NUM
 };
@@ -1365,6 +1366,14 @@ enum mlx5_qcam_feature_groups {
 	MLX5_GET(port_selection_cap, \
 		 mdev->caps.hca[MLX5_CAP_PORT_SELECTION]->max, cap)
 
+#define MLX5_CAP_ADV_VIRTUALIZATION(mdev, cap) \
+	MLX5_GET(adv_virtualization_cap, \
+		 mdev->caps.hca[MLX5_CAP_ADV_VIRTUALIZATION]->cur, cap)
+
+#define MLX5_CAP_ADV_VIRTUALIZATION_MAX(mdev, cap) \
+	MLX5_GET(adv_virtualization_cap, \
+		 mdev->caps.hca[MLX5_CAP_ADV_VIRTUALIZATION]->max, cap)
+
 #define MLX5_CAP_FLOWTABLE_PORT_SELECTION(mdev, cap) \
 	MLX5_CAP_PORT_SELECTION(mdev, flow_table_properties_port_selection.cap)
 
-- 
2.18.1



* [PATCH V4 vfio 03/10] vfio: Introduce DMA logging uAPIs
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

DMA logging allows a device to internally record what DMAs the device is
initiating and report them back to userspace. It is part of the VFIO
migration infrastructure that allows implementing dirty page tracking
during the pre copy phase of live migration. Only DMA WRITEs are logged,
and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.

This patch introduces the DMA logging involved uAPIs.

It uses the FEATURE ioctl with its GET/SET/PROBE options as described below.

It exposes a PROBE option to detect if the device supports DMA logging.
It exposes a SET option to start device DMA logging in given IOVA
ranges.
It exposes a SET option to stop device DMA logging that was previously
started.
It exposes a GET option to read back and clear the device DMA log.

Extra details are documented in vfio.h for each specific option.
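
To illustrate the GET option, a hedged userspace sketch of reading back
the dirty log (assuming tracking was already started over the same
iova/length with a 4K report page size; device_fd, iova, length and i
are placeholders and error handling is omitted):

	struct {
		struct vfio_device_feature feature;
		struct vfio_device_feature_dma_logging_report report;
	} buf = {};
	size_t nbits = length / 4096;	/* one bit per 4K of IOVA */
	__u64 *bitmap = calloc((nbits + 63) / 64, sizeof(*bitmap));

	buf.feature.argsz = sizeof(buf);
	buf.feature.flags = VFIO_DEVICE_FEATURE_GET |
			    VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT;
	buf.report.iova = iova;
	buf.report.length = length;
	buf.report.page_size = 4096;
	buf.report.bitmap = (__u64)(uintptr_t)bitmap;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &buf))
		/* treat all memory as dirty and restart tracking */;

	/* page i, i.e. [iova + i*4K, iova + (i+1)*4K), is dirty when: */
	if (bitmap[i / 64] & (1ULL << (i % 64)))
		/* ... */;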

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/uapi/linux/vfio.h | 86 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..aa63effc38e8 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,92 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET start/stop device DMA logging.
+ * VFIO_DEVICE_FEATURE_PROBE can be used to detect if the device supports
+ * DMA logging.
+ *
+ * DMA logging allows a device to internally record what DMAs the device is
+ * initiating and report them back to userspace. It is part of the VFIO
+ * migration infrastructure that allows implementing dirty page tracking
+ * during the pre copy phase of live migration. Only DMA WRITEs are logged,
+ * and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
+ *
+ * When DMA logging is started a range of IOVAs to monitor is provided and the
+ * device can optimize its logging to cover only the IOVA range given. Each
+ * DMA that the device initiates inside the range will be logged by the device
+ * for later retrieval.
+ *
+ * page_size is an input that hints what tracking granularity the device
+ * should try to achieve. If the device cannot do the hinted page size then
+ * it's the driver choice which page size to pick based on its support.
+ * On output the device will return the page size it selected.
+ *
+ * ranges is a pointer to an array of
+ * struct vfio_device_feature_dma_logging_range.
+ *
+ * The core kernel code guarantees to support by minimum num_ranges that fit
+ * into a single kernel page. User space can try higher values but should give
+ * up if the above can't be achieved as of some driver limitations.
+ *
+ * A single call to start device DMA logging can be issued and a matching stop
+ * should follow at the end. Another start is not allowed in the meantime.
+ */
+struct vfio_device_feature_dma_logging_control {
+	__aligned_u64 page_size;
+	__u32 num_ranges;
+	__u32 __reserved;
+	__aligned_u64 ranges;
+};
+
+struct vfio_device_feature_dma_logging_range {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_START 3
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET stop device DMA logging that was started
+ * by VFIO_DEVICE_FEATURE_DMA_LOGGING_START
+ */
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_GET read back and clear the device DMA log
+ *
+ * Query the device's DMA log for written pages within the given IOVA range.
+ * During querying the log is cleared for the IOVA range.
+ *
+ * bitmap is a pointer to an array of u64s that will hold the output bitmap
+ * with 1 bit reporting a page_size unit of IOVA. The mapping of IOVA to bits
+ * is given by:
+ *  bitmap[(addr - iova)/page_size] & (1ULL << (addr % 64))
+ *
+ * The input page_size can be any power of two value and does not have to
+ * match the value given to VFIO_DEVICE_FEATURE_DMA_LOGGING_START. The driver
+ * will format its internal logging to match the reporting page size, possibly
+ * by replicating bits if the internal page size is lower than requested.
+ *
+ * The LOGGING_REPORT will only set bits in the bitmap and never clear or
+ * perform any initialization of the user provided bitmap.
+ *
+ * If any error is returned userspace should assume that the dirty log is
+ * corrupted. Error recovery is to consider all memory dirty and try to
+ * restart the dirty tracking, or to abort/restart the whole migration.
+ *
+ * If DMA logging is not enabled, an error will be returned.
+ *
+ */
+struct vfio_device_feature_dma_logging_report {
+	__aligned_u64 iova;
+	__aligned_u64 length;
+	__aligned_u64 page_size;
+	__aligned_u64 bitmap;
+};
+
+#define VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT 5
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.18.1



* [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

From: Joao Martins <joao.m.martins@oracle.com>

The new facility adds a set of wrappers that abstract how an IOVA
range is represented in a bitmap that is granulated by a given
page_size. It does the heavy lifting of translating the user bitmap
pointers into the kernel addresses backing said user memory, and
finally performs the (non-atomic) bitmap ops to change the various
bits.

The formula for the bitmap is:

   data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))

where 64 is the number of bits in a u64 bitmap word.

It introduces an IOVA iterator that uses a windowing scheme to minimize
the pinning overhead, as opposed to pinning on demand 4K at a
time. So on a 512G range and with the base page size it would iterate in
ranges of 64G at a time, while pinning 512 pages at a time, leading to
fewer atomics (especially if the underlying user memory is hugepages).
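
For reference, the 64G window mentioned above follows from (assuming a 4K
base page size and 8-byte struct page pointers):

	4096 / 8                 = 512 bitmap pages pinned at a time
	512 pages * 4096 bytes   = 2M of bitmap data mapped
	2M * 8 bits per byte     = 16M bits
	16M bits * 4K per bit    = 64G of IOVA covered per window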

An example usage of these helpers for a given @base_iova, @page_size, @length
and __user @data:

	ret = iova_bitmap_iter_init(&iter, base_iova, page_size, length, data);
	if (ret)
		return -ENOMEM;

	for (; !iova_bitmap_iter_done(&iter) && !ret;
	     ret = iova_bitmap_iter_advance(&iter)) {
		dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
				   iova_bitmap_length(&iter));
	}

	iova_bitmap_iter_free(&iter);

An implementation of the lower end -- referred to above as dirty_reporter_ops
to exemplify -- that is tracking dirty bits would mark an IOVA as dirty
as follows:

	iova_bitmap_set(&iter.dirty, iova, page_size);

or a contiguous range (example two pages):

	iova_bitmap_set(&iter.dirty, iova, 2 * page_size);

The facility is intended to be used for user bitmaps representing IOVAs
dirtied by the IOMMU (via IOMMUFD) and by PCI devices (via vfio-pci).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/Makefile       |   6 +-
 drivers/vfio/iova_bitmap.c  | 224 ++++++++++++++++++++++++++++++++++++
 include/linux/iova_bitmap.h | 189 ++++++++++++++++++++++++++++++
 3 files changed, 417 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/iova_bitmap.c
 create mode 100644 include/linux/iova_bitmap.h

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 1a32357592e3..1d6cad32d366 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,9 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 vfio_virqfd-y := virqfd.o
 
-vfio-y += vfio_main.o
-
 obj-$(CONFIG_VFIO) += vfio.o
+
+vfio-y := vfio_main.o \
+          iova_bitmap.o \
+
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
new file mode 100644
index 000000000000..6b6008ef146c
--- /dev/null
+++ b/drivers/vfio/iova_bitmap.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, Oracle and/or its affiliates.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+#include <linux/iova_bitmap.h>
+#include <linux/highmem.h>
+
+#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
+
+static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
+
+/*
+ * Converts a relative IOVA to a bitmap index.
+ * The bitmap is viewed an array of u64, and each u64 represents
+ * a range of IOVA, and the whole pinned pages to the range window.
+ * Relative IOVA means relative to the iter::dirty base IOVA (stored
+ * in dirty::iova). All computations in this file are done using
+ * relative IOVAs and thus avoid an extra subtraction against
+ * dirty::iova. The user API iova_bitmap_set() always uses a regular
+ * absolute IOVAs.
+ */
+static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
+					       unsigned long iova)
+{
+	unsigned long pgsize = 1 << iter->dirty.pgshift;
+
+	return iova / (BITS_PER_TYPE(*iter->data) * pgsize);
+}
+
+/*
+ * Converts a bitmap index to a *relative* IOVA.
+ */
+static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
+					       unsigned long index)
+{
+	unsigned long pgshift = iter->dirty.pgshift;
+
+	return (index * BITS_PER_TYPE(*iter->data)) << pgshift;
+}
+
+/*
+ * Pins the bitmap user pages for the current range window.
+ * This is internal to IOVA bitmap and called when advancing the
+ * iterator.
+ */
+static int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+	unsigned long npages;
+	u64 __user *addr;
+	long ret;
+
+	/*
+	 * @offset is the cursor of the currently mapped u64 words
+	 * that we have access. And it indexes u64 bitmap word that is
+	 * mapped. Anything before @offset is not mapped. The range
+	 * @offset .. @count is mapped but capped at a maximum number
+	 * of pages.
+	 */
+	npages = DIV_ROUND_UP((iter->count - iter->offset) *
+			      sizeof(*iter->data), PAGE_SIZE);
+
+	/*
+	 * We always cap at max number of 'struct page' a base page can fit.
+	 * This is, for example, on x86 means 2M of bitmap data max.
+	 */
+	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
+	addr = iter->data + iter->offset;
+	ret = pin_user_pages_fast((unsigned long)addr, npages,
+				  FOLL_WRITE, dirty->pages);
+	if (ret <= 0)
+		return -EFAULT;
+
+	dirty->npages = (unsigned long)ret;
+	/* Base IOVA where @pages point to i.e. bit 0 of the first page */
+	dirty->iova = iova_bitmap_iova(iter);
+
+	/*
+	 * offset of the page where pinned pages bit 0 is located.
+	 * This handles the case where the bitmap is not PAGE_SIZE
+	 * aligned.
+	 */
+	dirty->start_offset = offset_in_page(addr);
+	return 0;
+}
+
+/*
+ * Unpins the bitmap user pages and clears @npages
+ * (un)pinning is abstracted from API user and it's done
+ * when advancing or freeing the iterator.
+ */
+static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	if (dirty->npages) {
+		unpin_user_pages(dirty->pages, dirty->npages);
+		dirty->npages = 0;
+	}
+}
+
+int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
+			  unsigned long iova, unsigned long length,
+			  unsigned long page_size, u64 __user *data)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	memset(iter, 0, sizeof(*iter));
+	dirty->pgshift = __ffs(page_size);
+	iter->data = data;
+	iter->count = iova_bitmap_iova_to_index(iter, length - 1) + 1;
+	iter->iova = iova;
+	iter->length = length;
+
+	dirty->iova = iova;
+	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!dirty->pages)
+		return -ENOMEM;
+
+	return iova_bitmap_iter_get(iter);
+}
+
+void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
+{
+	struct iova_bitmap *dirty = &iter->dirty;
+
+	iova_bitmap_iter_put(iter);
+
+	if (dirty->pages) {
+		free_page((unsigned long)dirty->pages);
+		dirty->pages = NULL;
+	}
+
+	memset(iter, 0, sizeof(*iter));
+}
+
+unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
+{
+	unsigned long skip = iter->offset;
+
+	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
+}
+
+/*
+ * Returns the remaining bitmap indexes count to process for the currently pinned
+ * bitmap pages.
+ */
+static unsigned long iova_bitmap_iter_remaining(struct iova_bitmap_iter *iter)
+{
+	unsigned long remaining = iter->count - iter->offset;
+
+	remaining = min_t(unsigned long, remaining,
+		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
+
+	return remaining;
+}
+
+unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
+{
+	unsigned long max_iova = iter->iova + iter->length - 1;
+	unsigned long iova = iova_bitmap_iova(iter);
+	unsigned long remaining;
+
+	/*
+	 * iova_bitmap_iter_remaining() returns a number of indexes which
+	 * when converted to IOVA gives us a max length that the bitmap
+	 * pinned data can cover. Afterwards, that is capped to
+	 * only cover the IOVA range in @iter::iova .. iter::length.
+	 */
+	remaining = iova_bitmap_index_to_iova(iter,
+			iova_bitmap_iter_remaining(iter));
+
+	if (iova + remaining - 1 > max_iova)
+		remaining -= ((iova + remaining - 1) - max_iova);
+
+	return remaining;
+}
+
+bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
+{
+	return iter->offset >= iter->count;
+}
+
+int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
+{
+	unsigned long iova = iova_bitmap_length(iter) - 1;
+	unsigned long count = iova_bitmap_iova_to_index(iter, iova) + 1;
+
+	iter->offset += count;
+
+	iova_bitmap_iter_put(iter);
+	if (iova_bitmap_iter_done(iter))
+		return 0;
+
+	/* When we advance the iterator we pin the next set of bitmap pages */
+	return iova_bitmap_iter_get(iter);
+}
+
+unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
+			      unsigned long iova, unsigned long length)
+{
+	unsigned long nbits = max(1UL, length >> dirty->pgshift), set = nbits;
+	unsigned long offset = (iova - dirty->iova) >> dirty->pgshift;
+	unsigned long page_idx = offset / BITS_PER_PAGE;
+	unsigned long page_offset = dirty->start_offset;
+	void *kaddr;
+
+	offset = offset % BITS_PER_PAGE;
+
+	do {
+		unsigned long size = min(BITS_PER_PAGE - offset, nbits);
+
+		kaddr = kmap_local_page(dirty->pages[page_idx]);
+		bitmap_set(kaddr + page_offset, offset, size);
+		kunmap_local(kaddr);
+		page_offset = offset = 0;
+		nbits -= size;
+		page_idx++;
+	} while (nbits > 0);
+
+	return set;
+}
+EXPORT_SYMBOL_GPL(iova_bitmap_set);
diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
new file mode 100644
index 000000000000..e258d03386d3
--- /dev/null
+++ b/include/linux/iova_bitmap.h
@@ -0,0 +1,189 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022, Oracle and/or its affiliates.
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+#ifndef _IOVA_BITMAP_H_
+#define _IOVA_BITMAP_H_
+
+#include <linux/mm.h>
+
+/**
+ * struct iova_bitmap - A bitmap representing a portion IOVA space
+ *
+ * Main data structure for tracking dirty IOVAs.
+ *
+ * For example something recording dirty IOVAs, will be provided of a
+ * struct iova_bitmap structure. This structure only represents a
+ * subset of the total IOVA space pinned by its parent counterpart
+ * iterator object.
+ *
+ * The user does not need to exact location of the bits in the bitmap.
+ * From user perspective the bitmap the only API available to the dirty
+ * tracker is iova_bitmap_set() which records the dirty IOVA *range*
+ * in the bitmap data.
+ *
+ * The bitmap is an array of u64 whereas each bit represents an IOVA
+ * of range of (1 << pgshift). Thus formula for the bitmap data to be
+ * set is:
+ *
+ *   data[(iova / page_size) / 64] & (1ULL << (iova % 64))
+ */
+struct iova_bitmap {
+	/* base IOVA representing bit 0 of the first page */
+	unsigned long iova;
+
+	/* page size order that each bit granules to */
+	unsigned long pgshift;
+
+	/* offset of the first user page pinned */
+	unsigned long start_offset;
+
+	/* number of pages pinned */
+	unsigned long npages;
+
+	/* pinned pages representing the bitmap data */
+	struct page **pages;
+};
+
+/**
+ * struct iova_bitmap_iter - Iterator object of the IOVA bitmap
+ *
+ * Main data structure for walking the bitmap data.
+ *
+ * Abstracts the pinning work to iterate an IOVA ranges.
+ * It uses a windowing scheme and pins the bitmap in relatively
+ * big ranges e.g.
+ *
+ * The iterator uses one base page to store all the pinned pages
+ * pointers related to the bitmap. For sizeof(struct page) == 64 it
+ * stores 512 struct pages which, if base page size is 4096 it means 2M
+ * of bitmap data is pinned at a time. If the iova_bitmap page size is
+ * also base page size then the range window to iterate is 64G.
+ *
+ * For example iterating on a total IOVA range of 4G..128G, it will
+ * walk through this set of ranges:
+ *
+ *  - 4G  -  68G-1 (64G)
+ *  - 68G - 128G-1 (64G)
+ *
+ * An example of the APIs on how to iterate the IOVA bitmap:
+ *
+ *   ret = iova_bitmap_iter_init(&iter, iova, PAGE_SIZE, length, data);
+ *   if (ret)
+ *       return -ENOMEM;
+ *
+ *   for (; !iova_bitmap_iter_done(&iter) && !ret;
+ *        ret = iova_bitmap_iter_advance(&iter)) {
+ *
+ *        dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
+ *                           iova_bitmap_length(&iter));
+ *   }
+ *
+ * An implementation of the lower end (referred to above as
+ * dirty_reporter_ops) that is tracking dirty bits would:
+ *
+ *        if (iova_dirty)
+ *            iova_bitmap_set(&iter.dirty, iova, PAGE_SIZE);
+ *
+ * The internals of the object use a cursor @offset that indexes
+ * which part u64 word of the bitmap is mapped, up to @count.
+ * Those keep being incremented until @count reaches while mapping
+ * up to PAGE_SIZE / sizeof(struct page*) maximum of pages.
+ *
+ * The iterator is usually located on what tracks DMA mapped ranges
+ * or some form of IOVA range tracking that co-relates to the user
+ * passed bitmap.
+ */
+struct iova_bitmap_iter {
+	/* IOVA range representing the currently pinned bitmap data */
+	struct iova_bitmap dirty;
+
+	/* userspace address of the bitmap */
+	u64 __user *data;
+
+	/* u64 index that @dirty points to */
+	size_t offset;
+
+	/* how many u64 can we walk in total */
+	size_t count;
+
+	/* base IOVA of the whole bitmap */
+	unsigned long iova;
+
+	/* length of the IOVA range for the whole bitmap */
+	unsigned long length;
+};
+
+/**
+ * iova_bitmap_iter_init() - Initializes an IOVA bitmap iterator object.
+ * @iter: IOVA bitmap iterator to initialize
+ * @iova: Start address of the IOVA range
+ * @length: Length of the IOVA range
+ * @page_size: Page size of the IOVA bitmap. It defines what each bit
+ *             granularity represents
+ * @data: Userspace address of the bitmap
+ *
+ * Initializes all the fields in the IOVA iterator including the first
+ * user pages of @data. Returns 0 on success or otherwise errno on error.
+ */
+int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
+			  unsigned long length, unsigned long page_size,
+			  u64 __user *data);
+
+/**
+ * iova_bitmap_iter_free() - Frees an IOVA bitmap iterator object
+ * @iter: IOVA bitmap iterator to free
+ *
+ * It unpins and releases pages array memory and clears any leftover
+ * state.
+ */
+void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
+
+/**
+ * iova_bitmap_iter_done: Checks if the IOVA bitmap has data to iterate
+ * @iter: IOVA bitmap iterator to free
+ *
+ * Returns true if there's more data to iterate.
+ */
+bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
+
+/**
+ * iova_bitmap_iter_advance: Advances the IOVA bitmap iterator
+ * @iter: IOVA bitmap iterator to advance
+ *
+ * Advances to the next range, releases the current pinned
+ * pages and pins the next set of bitmap pages.
+ * Returns 0 on success or otherwise errno.
+ */
+int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
+
+/**
+ * iova_bitmap_iova: Base IOVA of the current range
+ * @iter: IOVA bitmap iterator
+ *
+ * Returns the base IOVA of the current range.
+ */
+unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
+
+/**
+ * iova_bitmap_length: IOVA length of the current range
+ * @iter: IOVA bitmap iterator
+ *
+ * Returns the length of the current IOVA range.
+ */
+unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
+
+/**
+ * iova_bitmap_set: Marks an IOVA range as dirty
+ * @dirty: IOVA bitmap
+ * @iova: IOVA to mark as dirty
+ * @length: IOVA range length
+ *
+ * Marks the range [iova .. iova+length-1] as dirty in the bitmap.
+ * Returns the number of bits set.
+ */
+unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
+			      unsigned long iova, unsigned long length);
+
+#endif
-- 
2.18.1



* [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Introduce the DMA logging feature support in the vfio core layer.

It includes the processing of the device start/stop/report DMA logging
UAPIs and calling the relevant driver 'op' to do the work.

Specifically:
Upon start, the core translates the given input ranges into an interval
tree, checks for unexpected overlapping or non-aligned ranges, and then
passes the translated input to the driver to start tracking the given
ranges.

Upon report, the core translates the given input user space bitmap and
page size into an IOVA kernel bitmap iterator. It then iterates over it
and calls the driver to set the corresponding bits for the dirtied pages
in a specific IOVA range.

Upon stop, the driver is called to stop the previously started tracking.

The next patches in the series will introduce the mlx5 driver
implementation of the logging ops.
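
For context, a hedged sketch of how a driver is expected to wire up these
ops (the callback names are only illustrative of what the mlx5 patches
later in this series add; vdev is the driver's vfio_device):

	static const struct vfio_log_ops mlx5vf_pci_log_ops = {
		.log_start = mlx5vf_start_page_tracker,
		.log_stop = mlx5vf_stop_page_tracker,
		.log_read_and_clear = mlx5vf_tracker_read_and_clear,
	};

	/* like mig_ops, must be set before registering the vfio_device */
	vdev->log_ops = &mlx5vf_pci_log_ops;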

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/pci/vfio_pci_core.c |   5 +
 drivers/vfio/vfio_main.c         | 159 +++++++++++++++++++++++++++++++
 include/linux/vfio.h             |  21 +++-
 4 files changed, 184 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 6130d00252ed..86c381ceb9a1 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,6 +3,7 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	select IOMMU_API
 	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
+	select INTERVAL_TREE
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/driver-api/vfio.rst for more details.
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c8d3b0450fb3..2b31184dddde 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1875,6 +1875,11 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 			return -EINVAL;
 	}
 
+	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
+	    vdev->vdev.log_ops->log_stop &&
+	    vdev->vdev.log_ops->log_read_and_clear))
+		return -EINVAL;
+
 	/*
 	 * Prevent binding to PFs with VFs enabled, the VFs might be in use
 	 * by the host or other users.  We cannot capture the VFs if they
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 7cb56c382c97..e961e9ff449f 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -32,6 +32,8 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/interval_tree.h>
+#include <linux/iova_bitmap.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1628,6 +1630,151 @@ static int vfio_ioctl_device_feature_migration(struct vfio_device *device,
 	return 0;
 }
 
+/* Ranges should fit into a single kernel page */
+#define LOG_MAX_RANGES \
+	(PAGE_SIZE / sizeof(struct vfio_device_feature_dma_logging_range))
+
+static int
+vfio_ioctl_device_feature_logging_start(struct vfio_device *device,
+					u32 flags, void __user *arg,
+					size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_dma_logging_control,
+			    ranges);
+	struct vfio_device_feature_dma_logging_range __user *ranges;
+	struct vfio_device_feature_dma_logging_control control;
+	struct vfio_device_feature_dma_logging_range range;
+	struct rb_root_cached root = RB_ROOT_CACHED;
+	struct interval_tree_node *nodes;
+	u32 nnodes;
+	int i, ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET,
+				 sizeof(control));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&control, arg, minsz))
+		return -EFAULT;
+
+	nnodes = control.num_ranges;
+	if (!nnodes)
+		return -EINVAL;
+
+	if (nnodes > LOG_MAX_RANGES)
+		return -E2BIG;
+
+	ranges = u64_to_user_ptr(control.ranges);
+	nodes = kmalloc_array(nnodes, sizeof(struct interval_tree_node),
+			      GFP_KERNEL);
+	if (!nodes)
+		return -ENOMEM;
+
+	for (i = 0; i < nnodes; i++) {
+		if (copy_from_user(&range, &ranges[i], sizeof(range))) {
+			ret = -EFAULT;
+			goto end;
+		}
+		if (!IS_ALIGNED(range.iova, control.page_size) ||
+		    !IS_ALIGNED(range.length, control.page_size)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		nodes[i].start = range.iova;
+		nodes[i].last = range.iova + range.length - 1;
+		if (interval_tree_iter_first(&root, nodes[i].start,
+					     nodes[i].last)) {
+			/* Range overlapping */
+			ret = -EINVAL;
+			goto end;
+		}
+		interval_tree_insert(nodes + i, &root);
+	}
+
+	ret = device->log_ops->log_start(device, &root, nnodes,
+					 &control.page_size);
+	if (ret)
+		goto end;
+
+	if (copy_to_user(arg, &control, sizeof(control))) {
+		ret = -EFAULT;
+		device->log_ops->log_stop(device);
+	}
+
+end:
+	kfree(nodes);
+	return ret;
+}
+
+static int
+vfio_ioctl_device_feature_logging_stop(struct vfio_device *device,
+				       u32 flags, void __user *arg,
+				       size_t argsz)
+{
+	int ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	return device->log_ops->log_stop(device);
+}
+
+static int
+vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
+					 u32 flags, void __user *arg,
+					 size_t argsz)
+{
+	size_t minsz =
+		offsetofend(struct vfio_device_feature_dma_logging_report,
+			    bitmap);
+	struct vfio_device_feature_dma_logging_report report;
+	struct iova_bitmap_iter iter;
+	int ret;
+
+	if (!device->log_ops)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(report));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&report, arg, minsz))
+		return -EFAULT;
+
+	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))
+		return -EINVAL;
+
+	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
+				    report.page_size,
+				    u64_to_user_ptr(report.bitmap));
+	if (ret)
+		return ret;
+
+	for (; !iova_bitmap_iter_done(&iter) && !ret;
+	     ret = iova_bitmap_iter_advance(&iter)) {
+		ret = device->log_ops->log_read_and_clear(device,
+			iova_bitmap_iova(&iter),
+			iova_bitmap_length(&iter), &iter.dirty);
+		if (ret)
+			break;
+	}
+
+	iova_bitmap_iter_free(&iter);
+	return ret;
+}
+
 static int vfio_ioctl_device_feature(struct vfio_device *device,
 				     struct vfio_device_feature __user *arg)
 {
@@ -1661,6 +1808,18 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 		return vfio_ioctl_device_feature_mig_device_state(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_START:
+		return vfio_ioctl_device_feature_logging_start(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_STOP:
+		return vfio_ioctl_device_feature_logging_stop(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
+	case VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT:
+		return vfio_ioctl_device_feature_logging_report(
+			device, feature.flags, arg->data,
+			feature.argsz - minsz);
 	default:
 		if (unlikely(!device->ops->device_feature))
 			return -EINVAL;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e05ddc6fe6a5..b17f2f454389 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -14,6 +14,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/iova_bitmap.h>
 
 struct kvm;
 
@@ -33,10 +34,11 @@ struct vfio_device {
 	struct device *dev;
 	const struct vfio_device_ops *ops;
 	/*
-	 * mig_ops is a static property of the vfio_device which must be set
-	 * prior to registering the vfio_device.
+	 * mig_ops/log_ops is a static property of the vfio_device which must
+	 * be set prior to registering the vfio_device.
 	 */
 	const struct vfio_migration_ops *mig_ops;
+	const struct vfio_log_ops *log_ops;
 	struct vfio_group *group;
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
@@ -108,6 +110,21 @@ struct vfio_migration_ops {
 				   enum vfio_device_mig_state *curr_state);
 };
 
+/**
+ * @log_start: Optional callback to ask the device start DMA logging.
+ * @log_stop: Optional callback to ask the device stop DMA logging.
+ * @log_read_and_clear: Optional callback to ask the device read
+ *         and clear the dirty DMAs in some given range.
+ */
+struct vfio_log_ops {
+	int (*log_start)(struct vfio_device *device,
+		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
+	int (*log_stop)(struct vfio_device *device);
+	int (*log_read_and_clear)(struct vfio_device *device,
+		unsigned long iova, unsigned long length,
+		struct iova_bitmap *dirty);
+};
+
 /**
  * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl
  * @flags: Arg from the device_feature op
-- 
2.18.1



* [PATCH V4 vfio 06/10] vfio/mlx5: Init QP based resources for dirty tracking
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Init QP-based resources for dirty tracking, to be used once logging is
started.

It includes:
Creating the host and firmware RC QPs and moving each of them to its
expected state based on the device specification, etc.

Creating the relevant resources which are needed by both QPs, such as a
UAR, PD, etc.

Creating the host receive side resources, such as an MKEY, CQ, receive
WQEs, etc.

The above resources are cleaned up when logging is stopped.

The tracker object that will be introduced by the next patches will use
those resources.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 595 +++++++++++++++++++++++++++++++++++-
 drivers/vfio/pci/mlx5/cmd.h |  53 ++++
 2 files changed, 636 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index dd5d7bfe0a49..0a362796d567 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -7,6 +7,8 @@
 
 static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 				  u16 *vhca_id);
+static void
+_mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev);
 
 int mlx5vf_cmd_suspend_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod)
 {
@@ -72,19 +74,22 @@ static int mlx5fv_vf_event(struct notifier_block *nb,
 	struct mlx5vf_pci_core_device *mvdev =
 		container_of(nb, struct mlx5vf_pci_core_device, nb);
 
-	mutex_lock(&mvdev->state_mutex);
 	switch (event) {
 	case MLX5_PF_NOTIFY_ENABLE_VF:
+		mutex_lock(&mvdev->state_mutex);
 		mvdev->mdev_detach = false;
+		mlx5vf_state_mutex_unlock(mvdev);
 		break;
 	case MLX5_PF_NOTIFY_DISABLE_VF:
-		mlx5vf_disable_fds(mvdev);
+		mlx5vf_cmd_close_migratable(mvdev);
+		mutex_lock(&mvdev->state_mutex);
 		mvdev->mdev_detach = true;
+		mlx5vf_state_mutex_unlock(mvdev);
 		break;
 	default:
 		break;
 	}
-	mlx5vf_state_mutex_unlock(mvdev);
+
 	return 0;
 }
 
@@ -95,6 +100,7 @@ void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev)
 
 	mutex_lock(&mvdev->state_mutex);
 	mlx5vf_disable_fds(mvdev);
+	_mlx5vf_free_page_tracker_resources(mvdev);
 	mlx5vf_state_mutex_unlock(mvdev);
 }
 
@@ -188,11 +194,13 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 	return ret;
 }
 
-static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
-			      struct mlx5_vf_migration_file *migf, u32 *mkey)
+static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
+			struct mlx5_vf_migration_file *migf,
+			struct mlx5_vhca_recv_buf *recv_buf,
+			u32 *mkey)
 {
-	size_t npages = DIV_ROUND_UP(migf->total_length, PAGE_SIZE);
-	struct sg_dma_page_iter dma_iter;
+	size_t npages = migf ? DIV_ROUND_UP(migf->total_length, PAGE_SIZE) :
+				recv_buf->npages;
 	int err = 0, inlen;
 	__be64 *mtt;
 	void *mkc;
@@ -209,8 +217,17 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 		 DIV_ROUND_UP(npages, 2));
 	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
 
-	for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0)
-		*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	if (migf) {
+		struct sg_dma_page_iter dma_iter;
+
+		for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0)
+			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	} else {
+		int i;
+
+		for (i = 0; i < npages; i++)
+			*mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
+	}
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -223,7 +240,8 @@ static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	MLX5_SET(mkc, mkc, qpn, 0xffffff);
 	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
 	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
-	MLX5_SET64(mkc, mkc, len, migf->total_length);
+	MLX5_SET64(mkc, mkc, len,
+		   migf ? migf->total_length : (npages * PAGE_SIZE));
 	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
 	kvfree(in);
 	return err;
@@ -297,7 +315,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (err)
 		goto err_dma_map;
 
-	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	err = _create_mkey(mdev, pdn, migf, NULL, &mkey);
 	if (err)
 		goto err_create_mkey;
 
@@ -369,7 +387,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (err)
 		goto err_reg;
 
-	err = _create_state_mkey(mdev, pdn, migf, &mkey);
+	err = _create_mkey(mdev, pdn, migf, NULL, &mkey);
 	if (err)
 		goto err_mkey;
 
@@ -391,3 +409,556 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	mutex_unlock(&migf->lock);
 	return err;
 }
+
+static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
+			     struct mlx5_vhca_cq_buf *buf, int nent,
+			     int cqe_size)
+{
+	struct mlx5_frag_buf *frag_buf = &buf->frag_buf;
+	u8 log_wq_stride = 6 + (cqe_size == 128 ? 1 : 0);
+	u8 log_wq_sz = ilog2(cqe_size);
+	int err;
+
+	err = mlx5_frag_buf_alloc_node(mdev, nent * cqe_size, frag_buf,
+				       mdev->priv.numa_node);
+	if (err)
+		return err;
+
+	mlx5_init_fbc(frag_buf->frags, log_wq_stride, log_wq_sz, &buf->fbc);
+	buf->cqe_size = cqe_size;
+	buf->nent = nent;
+	return 0;
+}
+
+static void init_cq_frag_buf(struct mlx5_vhca_cq_buf *buf)
+{
+	struct mlx5_cqe64 *cqe64;
+	void *cqe;
+	int i;
+
+	for (i = 0; i < buf->nent; i++) {
+		cqe = mlx5_frag_buf_get_wqe(&buf->fbc, i);
+		cqe64 = buf->cqe_size == 64 ? cqe : cqe + 64;
+		cqe64->op_own = MLX5_CQE_INVALID << 4;
+	}
+}
+
+static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_cq *cq)
+{
+	mlx5_core_destroy_cq(mdev, &cq->mcq);
+	mlx5_frag_buf_free(mdev, &cq->buf.frag_buf);
+	mlx5_db_free(mdev, &cq->db);
+}
+
+static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
+			    struct mlx5_vhca_page_tracker *tracker,
+			    size_t ncqe)
+{
+	int cqe_size = cache_line_size() == 128 ? 128 : 64;
+	u32 out[MLX5_ST_SZ_DW(create_cq_out)];
+	struct mlx5_vhca_cq *cq;
+	int inlen, err, eqn;
+	void *cqc, *in;
+	__be64 *pas;
+	int vector;
+
+	cq = &tracker->cq;
+	ncqe = roundup_pow_of_two(ncqe);
+	err = mlx5_db_alloc_node(mdev, &cq->db, mdev->priv.numa_node);
+	if (err)
+		return err;
+
+	cq->ncqe = ncqe;
+	cq->mcq.set_ci_db = cq->db.db;
+	cq->mcq.arm_db = cq->db.db + 1;
+	cq->mcq.cqe_sz = cqe_size;
+	err = alloc_cq_frag_buf(mdev, &cq->buf, ncqe, cqe_size);
+	if (err)
+		goto err_db_free;
+
+	init_cq_frag_buf(&cq->buf);
+	inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
+		MLX5_FLD_SZ_BYTES(create_cq_in, pas[0]) *
+		cq->buf.frag_buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		err = -ENOMEM;
+		goto err_buff;
+	}
+
+	vector = raw_smp_processor_id() % mlx5_comp_vectors_count(mdev);
+	err = mlx5_vector2eqn(mdev, vector, &eqn);
+	if (err)
+		goto err_vec;
+
+	cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
+	MLX5_SET(cqc, cqc, log_cq_size, ilog2(ncqe));
+	MLX5_SET(cqc, cqc, c_eqn_or_apu_element, eqn);
+	MLX5_SET(cqc, cqc, uar_page, tracker->uar->index);
+	MLX5_SET(cqc, cqc, log_page_size, cq->buf.frag_buf.page_shift -
+		 MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET64(cqc, cqc, dbr_addr, cq->db.dma);
+	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
+	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
+	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
+	if (err)
+		goto err_vec;
+
+	kvfree(in);
+	return 0;
+
+err_vec:
+	kvfree(in);
+err_buff:
+	mlx5_frag_buf_free(mdev, &cq->buf.frag_buf);
+err_db_free:
+	mlx5_db_free(mdev, &cq->db);
+	return err;
+}
+
+static struct mlx5_vhca_qp *
+mlx5vf_create_rc_qp(struct mlx5_core_dev *mdev,
+		    struct mlx5_vhca_page_tracker *tracker, u32 max_recv_wr)
+{
+	u32 out[MLX5_ST_SZ_DW(create_qp_out)] = {};
+	struct mlx5_vhca_qp *qp;
+	u8 log_rq_stride;
+	u8 log_rq_sz;
+	void *qpc;
+	int inlen;
+	void *in;
+	int err;
+
+	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+	if (!qp)
+		return ERR_PTR(-ENOMEM);
+
+	qp->rq.wqe_cnt = roundup_pow_of_two(max_recv_wr);
+	log_rq_stride = ilog2(MLX5_SEND_WQE_DS);
+	log_rq_sz = ilog2(qp->rq.wqe_cnt);
+	err = mlx5_db_alloc_node(mdev, &qp->db, mdev->priv.numa_node);
+	if (err)
+		goto err_free;
+
+	if (max_recv_wr) {
+		err = mlx5_frag_buf_alloc_node(mdev,
+			wq_get_byte_sz(log_rq_sz, log_rq_stride),
+			&qp->buf, mdev->priv.numa_node);
+		if (err)
+			goto err_db_free;
+		mlx5_init_fbc(qp->buf.frags, log_rq_stride, log_rq_sz, &qp->rq.fbc);
+	}
+
+	qp->rq.db = &qp->db.db[MLX5_RCV_DBR];
+	inlen = MLX5_ST_SZ_BYTES(create_qp_in) +
+		MLX5_FLD_SZ_BYTES(create_qp_in, pas[0]) *
+		qp->buf.npages;
+	in = kvzalloc(inlen, GFP_KERNEL);
+	if (!in) {
+		err = -ENOMEM;
+		goto err_in;
+	}
+
+	qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+	MLX5_SET(qpc, qpc, st, MLX5_QP_ST_RC);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QP_PM_MIGRATED);
+	MLX5_SET(qpc, qpc, pd, tracker->pdn);
+	MLX5_SET(qpc, qpc, uar_page, tracker->uar->index);
+	MLX5_SET(qpc, qpc, log_page_size,
+		 qp->buf.page_shift - MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET(qpc, qpc, ts_format, mlx5_get_qp_default_ts(mdev));
+	if (MLX5_CAP_GEN(mdev, cqe_version) == 1)
+		MLX5_SET(qpc, qpc, user_index, 0xFFFFFF);
+	MLX5_SET(qpc, qpc, no_sq, 1);
+	if (max_recv_wr) {
+		MLX5_SET(qpc, qpc, cqn_rcv, tracker->cq.mcq.cqn);
+		MLX5_SET(qpc, qpc, log_rq_stride, log_rq_stride - 4);
+		MLX5_SET(qpc, qpc, log_rq_size, log_rq_sz);
+		MLX5_SET(qpc, qpc, rq_type, MLX5_NON_ZERO_RQ);
+		MLX5_SET64(qpc, qpc, dbr_addr, qp->db.dma);
+		mlx5_fill_page_frag_array(&qp->buf,
+					  (__be64 *)MLX5_ADDR_OF(create_qp_in,
+								 in, pas));
+	} else {
+		MLX5_SET(qpc, qpc, rq_type, MLX5_ZERO_LEN_RQ);
+	}
+
+	MLX5_SET(create_qp_in, in, opcode, MLX5_CMD_OP_CREATE_QP);
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	kvfree(in);
+	if (err)
+		goto err_in;
+
+	qp->qpn = MLX5_GET(create_qp_out, out, qpn);
+	return qp;
+
+err_in:
+	if (max_recv_wr)
+		mlx5_frag_buf_free(mdev, &qp->buf);
+err_db_free:
+	mlx5_db_free(mdev, &qp->db);
+err_free:
+	kfree(qp);
+	return ERR_PTR(err);
+}
+
+static void mlx5vf_post_recv(struct mlx5_vhca_qp *qp)
+{
+	struct mlx5_wqe_data_seg *data;
+	unsigned int ix;
+
+	WARN_ON(qp->rq.pc - qp->rq.cc >= qp->rq.wqe_cnt);
+	ix = qp->rq.pc & (qp->rq.wqe_cnt - 1);
+	data = mlx5_frag_buf_get_wqe(&qp->rq.fbc, ix);
+	data->byte_count = cpu_to_be32(qp->max_msg_size);
+	data->lkey = cpu_to_be32(qp->recv_buf.mkey);
+	data->addr = cpu_to_be64(qp->recv_buf.next_rq_offset);
+	qp->rq.pc++;
+	/* Make sure that descriptors are written before doorbell record. */
+	dma_wmb();
+	*qp->rq.db = cpu_to_be32(qp->rq.pc & 0xffff);
+}
+
+static int mlx5vf_activate_qp(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_qp *qp, u32 remote_qpn,
+			      bool host_qp)
+{
+	u32 init_in[MLX5_ST_SZ_DW(rst2init_qp_in)] = {};
+	u32 rtr_in[MLX5_ST_SZ_DW(init2rtr_qp_in)] = {};
+	u32 rts_in[MLX5_ST_SZ_DW(rtr2rts_qp_in)] = {};
+	void *qpc;
+	int ret;
+
+	/* Init */
+	qpc = MLX5_ADDR_OF(rst2init_qp_in, init_in, qpc);
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+	MLX5_SET(qpc, qpc, pm_state, MLX5_QPC_PM_STATE_MIGRATED);
+	MLX5_SET(qpc, qpc, rre, 1);
+	MLX5_SET(qpc, qpc, rwe, 1);
+	MLX5_SET(rst2init_qp_in, init_in, opcode, MLX5_CMD_OP_RST2INIT_QP);
+	MLX5_SET(rst2init_qp_in, init_in, qpn, qp->qpn);
+	ret = mlx5_cmd_exec_in(mdev, rst2init_qp, init_in);
+	if (ret)
+		return ret;
+
+	if (host_qp) {
+		struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+		int i;
+
+		for (i = 0; i < qp->rq.wqe_cnt; i++) {
+			mlx5vf_post_recv(qp);
+			recv_buf->next_rq_offset += qp->max_msg_size;
+		}
+	}
+
+	/* RTR */
+	qpc = MLX5_ADDR_OF(init2rtr_qp_in, rtr_in, qpc);
+	MLX5_SET(init2rtr_qp_in, rtr_in, qpn, qp->qpn);
+	MLX5_SET(qpc, qpc, mtu, IB_MTU_4096);
+	MLX5_SET(qpc, qpc, log_msg_max, MLX5_CAP_GEN(mdev, log_max_msg));
+	MLX5_SET(qpc, qpc, remote_qpn, remote_qpn);
+	MLX5_SET(qpc, qpc, primary_address_path.vhca_port_num, 1);
+	MLX5_SET(qpc, qpc, primary_address_path.fl, 1);
+	MLX5_SET(qpc, qpc, min_rnr_nak, 1);
+	MLX5_SET(init2rtr_qp_in, rtr_in, opcode, MLX5_CMD_OP_INIT2RTR_QP);
+	MLX5_SET(init2rtr_qp_in, rtr_in, qpn, qp->qpn);
+	ret = mlx5_cmd_exec_in(mdev, init2rtr_qp, rtr_in);
+	if (ret || host_qp)
+		return ret;
+
+	/* RTS */
+	qpc = MLX5_ADDR_OF(rtr2rts_qp_in, rts_in, qpc);
+	MLX5_SET(rtr2rts_qp_in, rts_in, qpn, qp->qpn);
+	MLX5_SET(qpc, qpc, retry_count, 7);
+	MLX5_SET(qpc, qpc, rnr_retry, 7); /* Infinite retry if RNR NACK */
+	MLX5_SET(qpc, qpc, primary_address_path.ack_timeout, 0x8); /* ~1ms */
+	MLX5_SET(rtr2rts_qp_in, rts_in, opcode, MLX5_CMD_OP_RTR2RTS_QP);
+	MLX5_SET(rtr2rts_qp_in, rts_in, qpn, qp->qpn);
+
+	return mlx5_cmd_exec_in(mdev, rtr2rts_qp, rts_in);
+}
+
+static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
+			      struct mlx5_vhca_qp *qp)
+{
+	u32 in[MLX5_ST_SZ_DW(destroy_qp_in)] = {};
+
+	MLX5_SET(destroy_qp_in, in, opcode, MLX5_CMD_OP_DESTROY_QP);
+	MLX5_SET(destroy_qp_in, in, qpn, qp->qpn);
+	mlx5_cmd_exec_in(mdev, destroy_qp, in);
+
+	mlx5_frag_buf_free(mdev, &qp->buf);
+	mlx5_db_free(mdev, &qp->db);
+	kfree(qp);
+}
+
+static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i;
+
+	/* Undo alloc_pages_bulk_array() */
+	for (i = 0; i < recv_buf->npages; i++)
+		__free_page(recv_buf->page_list[i]);
+
+	kvfree(recv_buf->page_list);
+}
+
+static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
+			    unsigned int npages)
+{
+	unsigned int filled = 0, done = 0;
+	int i;
+
+	recv_buf->page_list = kvcalloc(npages, sizeof(*recv_buf->page_list),
+				       GFP_KERNEL);
+	if (!recv_buf->page_list)
+		return -ENOMEM;
+
+	for (;;) {
+		filled = alloc_pages_bulk_array(GFP_KERNEL, npages - done,
+						recv_buf->page_list + done);
+		if (!filled)
+			goto err;
+
+		done += filled;
+		if (done == npages)
+			break;
+	}
+
+	recv_buf->npages = npages;
+	return 0;
+
+err:
+	for (i = 0; i < npages; i++) {
+		if (recv_buf->page_list[i])
+			__free_page(recv_buf->page_list[i]);
+	}
+
+	kvfree(recv_buf->page_list);
+	return -ENOMEM;
+}
+
+static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
+				   struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i, j;
+
+	recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
+				       sizeof(*recv_buf->dma_addrs),
+				       GFP_KERNEL);
+	if (!recv_buf->dma_addrs)
+		return -ENOMEM;
+
+	for (i = 0; i < recv_buf->npages; i++) {
+		recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
+						      recv_buf->page_list[i],
+						      0, PAGE_SIZE,
+						      DMA_FROM_DEVICE);
+		if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
+			goto error;
+	}
+	return 0;
+
+error:
+	for (j = 0; j < i; j++)
+		dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
+				 PAGE_SIZE, DMA_FROM_DEVICE);
+
+	kvfree(recv_buf->dma_addrs);
+	return -ENOMEM;
+}
+
+static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
+				      struct mlx5_vhca_recv_buf *recv_buf)
+{
+	int i;
+
+	for (i = 0; i < recv_buf->npages; i++)
+		dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
+				 PAGE_SIZE, DMA_FROM_DEVICE);
+
+	kvfree(recv_buf->dma_addrs);
+}
+
+static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
+					  struct mlx5_vhca_qp *qp)
+{
+	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+
+	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
+	unregister_dma_recv_pages(mdev, recv_buf);
+	free_recv_pages(&qp->recv_buf);
+}
+
+static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
+					  struct mlx5_vhca_qp *qp, u32 pdn,
+					  u64 rq_size)
+{
+	unsigned int npages = DIV_ROUND_UP_ULL(rq_size, PAGE_SIZE);
+	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
+	int err;
+
+	err = alloc_recv_pages(recv_buf, npages);
+	if (err < 0)
+		return err;
+
+	err = register_dma_recv_pages(mdev, recv_buf);
+	if (err)
+		goto end;
+
+	err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+	if (err)
+		goto err_create_mkey;
+
+	return 0;
+
+err_create_mkey:
+	unregister_dma_recv_pages(mdev, recv_buf);
+end:
+	free_recv_pages(recv_buf);
+	return err;
+}
+
+static void
+_mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
+{
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	struct mlx5_core_dev *mdev = mvdev->mdev;
+
+	lockdep_assert_held(&mvdev->state_mutex);
+
+	if (!mvdev->log_active)
+		return;
+
+	WARN_ON(mvdev->mdev_detach);
+
+	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
+	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
+	mlx5vf_destroy_qp(mdev, tracker->host_qp);
+	mlx5vf_destroy_cq(mdev, &tracker->cq);
+	mlx5_core_dealloc_pd(mdev, tracker->pdn);
+	mlx5_put_uars_page(mdev, tracker->uar);
+	mvdev->log_active = false;
+}
+
+int mlx5vf_stop_page_tracker(struct vfio_device *vdev)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+
+	mutex_lock(&mvdev->state_mutex);
+	if (!mvdev->log_active)
+		goto end;
+
+	_mlx5vf_free_page_tracker_resources(mvdev);
+	mvdev->log_active = false;
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return 0;
+}
+
+int mlx5vf_start_page_tracker(struct vfio_device *vdev,
+			      struct rb_root_cached *ranges, u32 nnodes,
+			      u64 *page_size)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	u8 log_tracked_page = ilog2(*page_size);
+	struct mlx5_vhca_qp *host_qp;
+	struct mlx5_vhca_qp *fw_qp;
+	struct mlx5_core_dev *mdev;
+	u32 max_msg_size = PAGE_SIZE;
+	u64 rq_size = SZ_2M;
+	u32 max_recv_wr;
+	int err;
+
+	mutex_lock(&mvdev->state_mutex);
+	if (mvdev->mdev_detach) {
+		err = -ENOTCONN;
+		goto end;
+	}
+
+	if (mvdev->log_active) {
+		err = -EINVAL;
+		goto end;
+	}
+
+	mdev = mvdev->mdev;
+	memset(tracker, 0, sizeof(*tracker));
+	tracker->uar = mlx5_get_uars_page(mdev);
+	if (IS_ERR(tracker->uar)) {
+		err = PTR_ERR(tracker->uar);
+		goto end;
+	}
+
+	err = mlx5_core_alloc_pd(mdev, &tracker->pdn);
+	if (err)
+		goto err_uar;
+
+	max_recv_wr = DIV_ROUND_UP_ULL(rq_size, max_msg_size);
+	err = mlx5vf_create_cq(mdev, tracker, max_recv_wr);
+	if (err)
+		goto err_dealloc_pd;
+
+	host_qp = mlx5vf_create_rc_qp(mdev, tracker, max_recv_wr);
+	if (IS_ERR(host_qp)) {
+		err = PTR_ERR(host_qp);
+		goto err_cq;
+	}
+
+	host_qp->max_msg_size = max_msg_size;
+	if (log_tracked_page < MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_min_page_size)) {
+		log_tracked_page = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_min_page_size);
+	} else if (log_tracked_page > MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_max_page_size)) {
+		log_tracked_page = MLX5_CAP_ADV_VIRTUALIZATION(mdev,
+				pg_track_log_max_page_size);
+	}
+
+	host_qp->tracked_page_size = (1ULL << log_tracked_page);
+	err = mlx5vf_alloc_qp_recv_resources(mdev, host_qp, tracker->pdn,
+					     rq_size);
+	if (err)
+		goto err_host_qp;
+
+	fw_qp = mlx5vf_create_rc_qp(mdev, tracker, 0);
+	if (IS_ERR(fw_qp)) {
+		err = PTR_ERR(fw_qp);
+		goto err_recv_resources;
+	}
+
+	err = mlx5vf_activate_qp(mdev, host_qp, fw_qp->qpn, true);
+	if (err)
+		goto err_activate;
+
+	err = mlx5vf_activate_qp(mdev, fw_qp, host_qp->qpn, false);
+	if (err)
+		goto err_activate;
+
+	tracker->host_qp = host_qp;
+	tracker->fw_qp = fw_qp;
+	*page_size = host_qp->tracked_page_size;
+	mvdev->log_active = true;
+	mlx5vf_state_mutex_unlock(mvdev);
+	return 0;
+
+err_activate:
+	mlx5vf_destroy_qp(mdev, fw_qp);
+err_recv_resources:
+	mlx5vf_free_qp_recv_resources(mdev, host_qp);
+err_host_qp:
+	mlx5vf_destroy_qp(mdev, host_qp);
+err_cq:
+	mlx5vf_destroy_cq(mdev, &tracker->cq);
+err_dealloc_pd:
+	mlx5_core_dealloc_pd(mdev, tracker->pdn);
+err_uar:
+	mlx5_put_uars_page(mdev, tracker->uar);
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 8208f4701a90..e71ec017bf04 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -9,6 +9,8 @@
 #include <linux/kernel.h>
 #include <linux/vfio_pci_core.h>
 #include <linux/mlx5/driver.h>
+#include <linux/mlx5/cq.h>
+#include <linux/mlx5/qp.h>
 
 struct mlx5vf_async_data {
 	struct mlx5_async_work cb_work;
@@ -39,6 +41,52 @@ struct mlx5_vf_migration_file {
 	struct mlx5vf_async_data async_data;
 };
 
+struct mlx5_vhca_cq_buf {
+	struct mlx5_frag_buf_ctrl fbc;
+	struct mlx5_frag_buf frag_buf;
+	int cqe_size;
+	int nent;
+};
+
+struct mlx5_vhca_cq {
+	struct mlx5_vhca_cq_buf buf;
+	struct mlx5_db db;
+	struct mlx5_core_cq mcq;
+	size_t ncqe;
+};
+
+struct mlx5_vhca_recv_buf {
+	u32 npages;
+	struct page **page_list;
+	dma_addr_t *dma_addrs;
+	u32 next_rq_offset;
+	u32 mkey;
+};
+
+struct mlx5_vhca_qp {
+	struct mlx5_frag_buf buf;
+	struct mlx5_db db;
+	struct mlx5_vhca_recv_buf recv_buf;
+	u32 tracked_page_size;
+	u32 max_msg_size;
+	u32 qpn;
+	struct {
+		unsigned int pc;
+		unsigned int cc;
+		unsigned int wqe_cnt;
+		__be32 *db;
+		struct mlx5_frag_buf_ctrl fbc;
+	} rq;
+};
+
+struct mlx5_vhca_page_tracker {
+	u32 pdn;
+	struct mlx5_uars_page *uar;
+	struct mlx5_vhca_cq cq;
+	struct mlx5_vhca_qp *host_qp;
+	struct mlx5_vhca_qp *fw_qp;
+};
+
 struct mlx5vf_pci_core_device {
 	struct vfio_pci_core_device core_device;
 	int vf_id;
@@ -46,6 +94,7 @@ struct mlx5vf_pci_core_device {
 	u8 migrate_cap:1;
 	u8 deferred_reset:1;
 	u8 mdev_detach:1;
+	u8 log_active:1;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
@@ -53,6 +102,7 @@ struct mlx5vf_pci_core_device {
 	spinlock_t reset_lock;
 	struct mlx5_vf_migration_file *resuming_migf;
 	struct mlx5_vf_migration_file *saving_migf;
+	struct mlx5_vhca_page_tracker tracker;
 	struct workqueue_struct *cb_wq;
 	struct notifier_block nb;
 	struct mlx5_core_dev *mdev;
@@ -73,4 +123,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work);
+int mlx5vf_start_page_tracker(struct vfio_device *vdev,
+		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
+int mlx5vf_stop_page_tracker(struct vfio_device *vdev);
 #endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 vfio 07/10] vfio/mlx5: Create and destroy page tracker object
  2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (5 preceding siblings ...)
  2022-08-15 15:11 ` [PATCH V4 vfio 06/10] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
@ 2022-08-15 15:11 ` Yishai Hadas
  2022-08-15 15:11 ` [PATCH V4 vfio 08/10] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Add support for creating and destroying the page tracker object.

This object is used to control/report the device dirty pages.

As part of creating the tracker, the driver needs to consider the device
capabilities for max ranges and adapt/combine the requested ranges
accordingly.
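
A hypothetical example of the combining step: with a device limit of
two ranges and three requested ranges, the pair separated by the
smallest gap is merged into one (the addresses below are illustrative
only):

	requested: [0x0000, 0x0fff] [0x3000, 0x3fff] [0x100000, 0x100fff]
	combined:  [0x0000, 0x3fff] [0x100000, 0x100fff]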

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 147 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |   1 +
 2 files changed, 148 insertions(+)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 0a362796d567..f1cad96af6ab 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -410,6 +410,148 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	return err;
 }
 
+static void combine_ranges(struct rb_root_cached *root, u32 cur_nodes,
+			   u32 req_nodes)
+{
+	struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
+	unsigned long min_gap;
+	unsigned long curr_gap;
+
+	/* Special shortcut when a single range is required */
+	if (req_nodes == 1) {
+		unsigned long last;
+
+		curr = comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
+		while (curr) {
+			last = curr->last;
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+			if (prev != comb_start)
+				interval_tree_remove(prev, root);
+		}
+		comb_start->last = last;
+		return;
+	}
+
+	/* Combine ranges which have the smallest gap */
+	while (cur_nodes > req_nodes) {
+		prev = NULL;
+		min_gap = ULONG_MAX;
+		curr = interval_tree_iter_first(root, 0, ULONG_MAX);
+		while (curr) {
+			if (prev) {
+				curr_gap = curr->start - prev->last;
+				if (curr_gap < min_gap) {
+					min_gap = curr_gap;
+					comb_start = prev;
+					comb_end = curr;
+				}
+			}
+			prev = curr;
+			curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
+		}
+		comb_start->last = comb_end->last;
+		interval_tree_remove(comb_end, root);
+		cur_nodes--;
+	}
+}
+
+static int mlx5vf_create_tracker(struct mlx5_core_dev *mdev,
+				 struct mlx5vf_pci_core_device *mvdev,
+				 struct rb_root_cached *ranges, u32 nnodes)
+{
+	int max_num_range =
+		MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_max_num_range);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	int record_size = MLX5_ST_SZ_BYTES(page_track_range);
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+	struct interval_tree_node *node = NULL;
+	u64 total_ranges_len = 0;
+	u32 num_ranges = nnodes;
+	u8 log_addr_space_size;
+	void *range_list_ptr;
+	void *obj_context;
+	void *cmd_hdr;
+	int inlen;
+	void *in;
+	int err;
+	int i;
+
+	if (num_ranges > max_num_range) {
+		combine_ranges(ranges, nnodes, max_num_range);
+		num_ranges = max_num_range;
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_page_track_obj_in) +
+				 record_size * num_ranges;
+	in = kzalloc(inlen, GFP_KERNEL);
+	if (!in)
+		return -ENOMEM;
+
+	cmd_hdr = MLX5_ADDR_OF(create_page_track_obj_in, in,
+			       general_obj_in_cmd_hdr);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode,
+		 MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type,
+		 MLX5_OBJ_TYPE_PAGE_TRACK);
+	obj_context = MLX5_ADDR_OF(create_page_track_obj_in, in, obj_context);
+	MLX5_SET(page_track, obj_context, vhca_id, mvdev->vhca_id);
+	MLX5_SET(page_track, obj_context, track_type, 1);
+	MLX5_SET(page_track, obj_context, log_page_size,
+		 ilog2(tracker->host_qp->tracked_page_size));
+	MLX5_SET(page_track, obj_context, log_msg_size,
+		 ilog2(tracker->host_qp->max_msg_size));
+	MLX5_SET(page_track, obj_context, reporting_qpn, tracker->fw_qp->qpn);
+	MLX5_SET(page_track, obj_context, num_ranges, num_ranges);
+
+	range_list_ptr = MLX5_ADDR_OF(page_track, obj_context, track_range);
+	node = interval_tree_iter_first(ranges, 0, ULONG_MAX);
+	for (i = 0; i < num_ranges; i++) {
+		void *addr_range_i_base = range_list_ptr + record_size * i;
+		unsigned long length = node->last - node->start;
+
+		MLX5_SET64(page_track_range, addr_range_i_base, start_address,
+			   node->start);
+		MLX5_SET64(page_track_range, addr_range_i_base, length, length);
+		total_ranges_len += length;
+		node = interval_tree_iter_next(node, 0, ULONG_MAX);
+	}
+
+	WARN_ON(node);
+	log_addr_space_size = ilog2(total_ranges_len);
+	if (log_addr_space_size <
+	    (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_min_addr_space)) ||
+	    log_addr_space_size >
+	    (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_max_addr_space))) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	MLX5_SET(page_track, obj_context, log_addr_space_size,
+		 log_addr_space_size);
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	tracker->id = MLX5_GET(general_obj_out_cmd_hdr, out, obj_id);
+out:
+	kfree(in);
+	return err;
+}
+
+static int mlx5vf_cmd_destroy_tracker(struct mlx5_core_dev *mdev,
+				      u32 tracker_id)
+{
+	u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+
+	MLX5_SET(general_obj_in_cmd_hdr, in, opcode, MLX5_CMD_OP_DESTROY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, MLX5_OBJ_TYPE_PAGE_TRACK);
+	MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, tracker_id);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
 static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
 			     struct mlx5_vhca_cq_buf *buf, int nent,
 			     int cqe_size)
@@ -833,6 +975,7 @@ _mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
 
 	WARN_ON(mvdev->mdev_detach);
 
+	mlx5vf_cmd_destroy_tracker(mdev, tracker->id);
 	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
 	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
 	mlx5vf_destroy_qp(mdev, tracker->host_qp);
@@ -941,6 +1084,10 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 
 	tracker->host_qp = host_qp;
 	tracker->fw_qp = fw_qp;
+	err = mlx5vf_create_tracker(mdev, mvdev, ranges, nnodes);
+	if (err)
+		goto err_activate;
+
 	*page_size = host_qp->tracked_page_size;
 	mvdev->log_active = true;
 	mlx5vf_state_mutex_unlock(mvdev);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index e71ec017bf04..658925ba5459 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -80,6 +80,7 @@ struct mlx5_vhca_qp {
 };
 
 struct mlx5_vhca_page_tracker {
+	u32 id;
 	u32 pdn;
 	struct mlx5_uars_page *uar;
 	struct mlx5_vhca_cq cq;
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 vfio 08/10] vfio/mlx5: Report dirty pages from tracker
  2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (6 preceding siblings ...)
  2022-08-15 15:11 ` [PATCH V4 vfio 07/10] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
@ 2022-08-15 15:11 ` Yishai Hadas
  2022-08-15 15:11 ` [PATCH V4 vfio 09/10] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Report dirty pages from tracker.

It includes:
Querying for dirty pages in a given IOVA range. This is done by
modifying the tracker into the reporting state and supplying the
required range.

Using the CQ event completion mechanism to be notified once data is
ready on the CQ/QP to be processed.

Once data is available, turn on the corresponding bits in the bitmap.

This functionality will be used as part of the 'log_read_and_clear'
driver callback in the next patches.
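
For context, a later patch in this series (#10) wires this helper up
as the VFIO 'log_read_and_clear' callback:

	static const struct vfio_log_ops mlx5vf_pci_log_ops = {
		.log_start = mlx5vf_start_page_tracker,
		.log_stop = mlx5vf_stop_page_tracker,
		.log_read_and_clear = mlx5vf_tracker_read_and_clear,
	};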

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 191 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/mlx5/cmd.h |   4 +
 2 files changed, 195 insertions(+)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index f1cad96af6ab..fa9ddd926500 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -5,6 +5,8 @@
 
 #include "cmd.h"
 
+enum { CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 };
+
 static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 				  u16 *vhca_id);
 static void
@@ -157,6 +159,7 @@ void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 		VFIO_MIGRATION_STOP_COPY |
 		VFIO_MIGRATION_P2P;
 	mvdev->core_device.vdev.mig_ops = mig_ops;
+	init_completion(&mvdev->tracker_comp);
 
 end:
 	mlx5_vf_put_core_dev(mvdev->mdev);
@@ -552,6 +555,29 @@ static int mlx5vf_cmd_destroy_tracker(struct mlx5_core_dev *mdev,
 	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
 }
 
+static int mlx5vf_cmd_modify_tracker(struct mlx5_core_dev *mdev,
+				     u32 tracker_id, unsigned long iova,
+				     unsigned long length, u32 tracker_state)
+{
+	u32 in[MLX5_ST_SZ_DW(modify_page_track_obj_in)] = {};
+	u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
+	void *obj_context;
+	void *cmd_hdr;
+
+	cmd_hdr = MLX5_ADDR_OF(modify_page_track_obj_in, in, general_obj_in_cmd_hdr);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode, MLX5_CMD_OP_MODIFY_GENERAL_OBJECT);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type, MLX5_OBJ_TYPE_PAGE_TRACK);
+	MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_id, tracker_id);
+
+	obj_context = MLX5_ADDR_OF(modify_page_track_obj_in, in, obj_context);
+	MLX5_SET64(page_track, obj_context, modify_field_select, 0x3);
+	MLX5_SET64(page_track, obj_context, range_start_address, iova);
+	MLX5_SET64(page_track, obj_context, length, length);
+	MLX5_SET(page_track, obj_context, state, tracker_state);
+
+	return mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out));
+}
+
 static int alloc_cq_frag_buf(struct mlx5_core_dev *mdev,
 			     struct mlx5_vhca_cq_buf *buf, int nent,
 			     int cqe_size)
@@ -593,6 +619,16 @@ static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
 	mlx5_db_free(mdev, &cq->db);
 }
 
+static void mlx5vf_cq_complete(struct mlx5_core_cq *mcq,
+			       struct mlx5_eqe *eqe)
+{
+	struct mlx5vf_pci_core_device *mvdev =
+		container_of(mcq, struct mlx5vf_pci_core_device,
+			     tracker.cq.mcq);
+
+	complete(&mvdev->tracker_comp);
+}
+
 static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 			    struct mlx5_vhca_page_tracker *tracker,
 			    size_t ncqe)
@@ -643,10 +679,13 @@ static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 	MLX5_SET64(cqc, cqc, dbr_addr, cq->db.dma);
 	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
 	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
+	cq->mcq.comp = mlx5vf_cq_complete;
 	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
 	if (err)
 		goto err_vec;
 
+	mlx5_cq_arm(&cq->mcq, MLX5_CQ_DB_REQ_NOT, tracker->uar->map,
+		    cq->mcq.cons_index);
 	kvfree(in);
 	return 0;
 
@@ -1109,3 +1148,155 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 	mlx5vf_state_mutex_unlock(mvdev);
 	return err;
 }
+
+static void
+set_report_output(u32 size, int index, struct mlx5_vhca_qp *qp,
+		  struct iova_bitmap *dirty)
+{
+	u32 entry_size = MLX5_ST_SZ_BYTES(page_track_report_entry);
+	u32 nent = size / entry_size;
+	struct page *page;
+	u64 addr;
+	u64 *buf;
+	int i;
+
+	if (WARN_ON(index >= qp->recv_buf.npages ||
+		    (nent > qp->max_msg_size / entry_size)))
+		return;
+
+	page = qp->recv_buf.page_list[index];
+	buf = kmap_local_page(page);
+	for (i = 0; i < nent; i++) {
+		addr = MLX5_GET(page_track_report_entry, buf + i,
+				dirty_address_low);
+		addr |= (u64)MLX5_GET(page_track_report_entry, buf + i,
+				      dirty_address_high) << 32;
+		iova_bitmap_set(dirty, addr, qp->tracked_page_size);
+	}
+	kunmap_local(buf);
+}
+
+static void
+mlx5vf_rq_cqe(struct mlx5_vhca_qp *qp, struct mlx5_cqe64 *cqe,
+	      struct iova_bitmap *dirty, int *tracker_status)
+{
+	u32 size;
+	int ix;
+
+	qp->rq.cc++;
+	*tracker_status = be32_to_cpu(cqe->immediate) >> 28;
+	size = be32_to_cpu(cqe->byte_cnt);
+	ix = be16_to_cpu(cqe->wqe_counter) & (qp->rq.wqe_cnt - 1);
+
+	/* zero length CQE, no data */
+	WARN_ON(!size && *tracker_status == MLX5_PAGE_TRACK_STATE_REPORTING);
+	if (size)
+		set_report_output(size, ix, qp, dirty);
+
+	qp->recv_buf.next_rq_offset = ix * qp->max_msg_size;
+	mlx5vf_post_recv(qp);
+}
+
+static void *get_cqe(struct mlx5_vhca_cq *cq, int n)
+{
+	return mlx5_frag_buf_get_wqe(&cq->buf.fbc, n);
+}
+
+static struct mlx5_cqe64 *get_sw_cqe(struct mlx5_vhca_cq *cq, int n)
+{
+	void *cqe = get_cqe(cq, n & (cq->ncqe - 1));
+	struct mlx5_cqe64 *cqe64;
+
+	cqe64 = (cq->mcq.cqe_sz == 64) ? cqe : cqe + 64;
+
+	if (likely(get_cqe_opcode(cqe64) != MLX5_CQE_INVALID) &&
+	    !((cqe64->op_own & MLX5_CQE_OWNER_MASK) ^ !!(n & (cq->ncqe)))) {
+		return cqe64;
+	} else {
+		return NULL;
+	}
+}
+
+static int
+mlx5vf_cq_poll_one(struct mlx5_vhca_cq *cq, struct mlx5_vhca_qp *qp,
+		   struct iova_bitmap *dirty, int *tracker_status)
+{
+	struct mlx5_cqe64 *cqe;
+	u8 opcode;
+
+	cqe = get_sw_cqe(cq, cq->mcq.cons_index);
+	if (!cqe)
+		return CQ_EMPTY;
+
+	++cq->mcq.cons_index;
+	/*
+	 * Make sure we read CQ entry contents after we've checked the
+	 * ownership bit.
+	 */
+	rmb();
+	opcode = get_cqe_opcode(cqe);
+	switch (opcode) {
+	case MLX5_CQE_RESP_SEND_IMM:
+		mlx5vf_rq_cqe(qp, cqe, dirty, tracker_status);
+		return CQ_OK;
+	default:
+		return CQ_POLL_ERR;
+	}
+}
+
+int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
+				  unsigned long length,
+				  struct iova_bitmap *dirty)
+{
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		vdev, struct mlx5vf_pci_core_device, core_device.vdev);
+	struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
+	struct mlx5_vhca_cq *cq = &tracker->cq;
+	struct mlx5_core_dev *mdev;
+	int poll_err, err;
+
+	mutex_lock(&mvdev->state_mutex);
+	if (!mvdev->log_active) {
+		err = -EINVAL;
+		goto end;
+	}
+
+	if (mvdev->mdev_detach) {
+		err = -ENOTCONN;
+		goto end;
+	}
+
+	mdev = mvdev->mdev;
+	err = mlx5vf_cmd_modify_tracker(mdev, tracker->id, iova, length,
+					MLX5_PAGE_TRACK_STATE_REPORTING);
+	if (err)
+		goto end;
+
+	tracker->status = MLX5_PAGE_TRACK_STATE_REPORTING;
+	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING) {
+		poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp, dirty,
+					      &tracker->status);
+		if (poll_err == CQ_EMPTY) {
+			mlx5_cq_arm(&cq->mcq, MLX5_CQ_DB_REQ_NOT, tracker->uar->map,
+				    cq->mcq.cons_index);
+			poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp,
+						      dirty, &tracker->status);
+			if (poll_err == CQ_EMPTY) {
+				wait_for_completion(&mvdev->tracker_comp);
+				continue;
+			}
+		}
+		if (poll_err == CQ_POLL_ERR) {
+			err = -EIO;
+			goto end;
+		}
+		mlx5_cq_set_ci(&cq->mcq);
+	}
+
+	if (tracker->status == MLX5_PAGE_TRACK_STATE_ERROR)
+		err = -EIO;
+
+end:
+	mlx5vf_state_mutex_unlock(mvdev);
+	return err;
+}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 658925ba5459..fa1f9ab4d3d0 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -86,6 +86,7 @@ struct mlx5_vhca_page_tracker {
 	struct mlx5_vhca_cq cq;
 	struct mlx5_vhca_qp *host_qp;
 	struct mlx5_vhca_qp *fw_qp;
+	int status;
 };
 
 struct mlx5vf_pci_core_device {
@@ -96,6 +97,7 @@ struct mlx5vf_pci_core_device {
 	u8 deferred_reset:1;
 	u8 mdev_detach:1;
 	u8 log_active:1;
+	struct completion tracker_comp;
 	/* protect migration state */
 	struct mutex state_mutex;
 	enum vfio_device_mig_state mig_state;
@@ -127,4 +129,6 @@ void mlx5vf_mig_file_cleanup_cb(struct work_struct *_work);
 int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 		struct rb_root_cached *ranges, u32 nnodes, u64 *page_size);
 int mlx5vf_stop_page_tracker(struct vfio_device *vdev);
+int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
+			unsigned long length, struct iova_bitmap *dirty);
 #endif /* MLX5_VFIO_CMD_H */
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 vfio 09/10] vfio/mlx5: Manage error scenarios on tracker
  2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (7 preceding siblings ...)
  2022-08-15 15:11 ` [PATCH V4 vfio 08/10] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
@ 2022-08-15 15:11 ` Yishai Hadas
  2022-08-15 15:11 ` [PATCH V4 vfio 10/10] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
  2022-08-25 11:13 ` [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
  10 siblings, 0 replies; 26+ messages in thread
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Handle async error events and the health/recovery flow to safely stop
the tracker in error scenarios.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 61 +++++++++++++++++++++++++++++++++++--
 drivers/vfio/pci/mlx5/cmd.h |  2 ++
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index fa9ddd926500..3e92b4d92be2 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -70,6 +70,13 @@ int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 	return 0;
 }
 
+static void set_tracker_error(struct mlx5vf_pci_core_device *mvdev)
+{
+	/* Mark the tracker under an error and wake it up if it's running */
+	mvdev->tracker.is_err = true;
+	complete(&mvdev->tracker_comp);
+}
+
 static int mlx5fv_vf_event(struct notifier_block *nb,
 			   unsigned long event, void *data)
 {
@@ -100,6 +107,8 @@ void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev)
 	if (!mvdev->migrate_cap)
 		return;
 
+	/* Must be done outside the lock to let it progress */
+	set_tracker_error(mvdev);
 	mutex_lock(&mvdev->state_mutex);
 	mlx5vf_disable_fds(mvdev);
 	_mlx5vf_free_page_tracker_resources(mvdev);
@@ -619,6 +628,47 @@ static void mlx5vf_destroy_cq(struct mlx5_core_dev *mdev,
 	mlx5_db_free(mdev, &cq->db);
 }
 
+static void mlx5vf_cq_event(struct mlx5_core_cq *mcq, enum mlx5_event type)
+{
+	if (type != MLX5_EVENT_TYPE_CQ_ERROR)
+		return;
+
+	set_tracker_error(container_of(mcq, struct mlx5vf_pci_core_device,
+				       tracker.cq.mcq));
+}
+
+static int mlx5vf_event_notifier(struct notifier_block *nb, unsigned long type,
+				 void *data)
+{
+	struct mlx5_vhca_page_tracker *tracker =
+		mlx5_nb_cof(nb, struct mlx5_vhca_page_tracker, nb);
+	struct mlx5vf_pci_core_device *mvdev = container_of(
+		tracker, struct mlx5vf_pci_core_device, tracker);
+	struct mlx5_eqe *eqe = data;
+	u8 event_type = (u8)type;
+	u8 queue_type;
+	int qp_num;
+
+	switch (event_type) {
+	case MLX5_EVENT_TYPE_WQ_CATAS_ERROR:
+	case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
+	case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
+		queue_type = eqe->data.qp_srq.type;
+		if (queue_type != MLX5_EVENT_QUEUE_TYPE_QP)
+			break;
+		qp_num = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xffffff;
+		if (qp_num != tracker->host_qp->qpn &&
+		    qp_num != tracker->fw_qp->qpn)
+			break;
+		set_tracker_error(mvdev);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
 static void mlx5vf_cq_complete(struct mlx5_core_cq *mcq,
 			       struct mlx5_eqe *eqe)
 {
@@ -680,6 +730,7 @@ static int mlx5vf_create_cq(struct mlx5_core_dev *mdev,
 	pas = (__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas);
 	mlx5_fill_page_frag_array(&cq->buf.frag_buf, pas);
 	cq->mcq.comp = mlx5vf_cq_complete;
+	cq->mcq.event = mlx5vf_cq_event;
 	err = mlx5_core_create_cq(mdev, &cq->mcq, in, inlen, out, sizeof(out));
 	if (err)
 		goto err_vec;
@@ -1014,6 +1065,7 @@ _mlx5vf_free_page_tracker_resources(struct mlx5vf_pci_core_device *mvdev)
 
 	WARN_ON(mvdev->mdev_detach);
 
+	mlx5_eq_notifier_unregister(mdev, &tracker->nb);
 	mlx5vf_cmd_destroy_tracker(mdev, tracker->id);
 	mlx5vf_destroy_qp(mdev, tracker->fw_qp);
 	mlx5vf_free_qp_recv_resources(mdev, tracker->host_qp);
@@ -1127,6 +1179,8 @@ int mlx5vf_start_page_tracker(struct vfio_device *vdev,
 	if (err)
 		goto err_activate;
 
+	MLX5_NB_INIT(&tracker->nb, mlx5vf_event_notifier, NOTIFY_ANY);
+	mlx5_eq_notifier_register(mdev, &tracker->nb);
 	*page_size = host_qp->tracked_page_size;
 	mvdev->log_active = true;
 	mlx5vf_state_mutex_unlock(mvdev);
@@ -1273,7 +1327,8 @@ int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
 		goto end;
 
 	tracker->status = MLX5_PAGE_TRACK_STATE_REPORTING;
-	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING) {
+	while (tracker->status == MLX5_PAGE_TRACK_STATE_REPORTING &&
+	       !tracker->is_err) {
 		poll_err = mlx5vf_cq_poll_one(cq, tracker->host_qp, dirty,
 					      &tracker->status);
 		if (poll_err == CQ_EMPTY) {
@@ -1294,8 +1349,10 @@ int mlx5vf_tracker_read_and_clear(struct vfio_device *vdev, unsigned long iova,
 	}
 
 	if (tracker->status == MLX5_PAGE_TRACK_STATE_ERROR)
-		err = -EIO;
+		tracker->is_err = true;
 
+	if (tracker->is_err)
+		err = -EIO;
 end:
 	mlx5vf_state_mutex_unlock(mvdev);
 	return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index fa1f9ab4d3d0..8b0ae40c620c 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -82,10 +82,12 @@ struct mlx5_vhca_qp {
 struct mlx5_vhca_page_tracker {
 	u32 id;
 	u32 pdn;
+	u8 is_err:1;
 	struct mlx5_uars_page *uar;
 	struct mlx5_vhca_cq cq;
 	struct mlx5_vhca_qp *host_qp;
 	struct mlx5_vhca_qp *fw_qp;
+	struct mlx5_nb nb;
 	int status;
 };
 
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH V4 vfio 10/10] vfio/mlx5: Set the driver DMA logging callbacks
  2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (8 preceding siblings ...)
  2022-08-15 15:11 ` [PATCH V4 vfio 09/10] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
@ 2022-08-15 15:11 ` Yishai Hadas
  2022-08-25 11:13 ` [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
  10 siblings, 0 replies; 26+ messages in thread
From: Yishai Hadas @ 2022-08-15 15:11 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	yishaih, maorg, cohuck

Now that everything is ready, set the driver DMA logging callbacks if
they are supported by the device.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 5 ++++-
 drivers/vfio/pci/mlx5/cmd.h  | 3 ++-
 drivers/vfio/pci/mlx5/main.c | 9 ++++++++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 3e92b4d92be2..c604b70437a5 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -126,7 +126,8 @@ void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev)
 }
 
 void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
-			       const struct vfio_migration_ops *mig_ops)
+			       const struct vfio_migration_ops *mig_ops,
+			       const struct vfio_log_ops *log_ops)
 {
 	struct pci_dev *pdev = mvdev->core_device.pdev;
 	int ret;
@@ -169,6 +170,8 @@ void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
 		VFIO_MIGRATION_P2P;
 	mvdev->core_device.vdev.mig_ops = mig_ops;
 	init_completion(&mvdev->tracker_comp);
+	if (MLX5_CAP_GEN(mvdev->mdev, adv_virtualization))
+		mvdev->core_device.vdev.log_ops = log_ops;
 
 end:
 	mlx5_vf_put_core_dev(mvdev->mdev);
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 8b0ae40c620c..921d5720a1e5 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -118,7 +118,8 @@ int mlx5vf_cmd_resume_vhca(struct mlx5vf_pci_core_device *mvdev, u16 op_mod);
 int mlx5vf_cmd_query_vhca_migration_state(struct mlx5vf_pci_core_device *mvdev,
 					  size_t *state_size);
 void mlx5vf_cmd_set_migratable(struct mlx5vf_pci_core_device *mvdev,
-			       const struct vfio_migration_ops *mig_ops);
+			       const struct vfio_migration_ops *mig_ops,
+			       const struct vfio_log_ops *log_ops);
 void mlx5vf_cmd_remove_migratable(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_cmd_close_migratable(struct mlx5vf_pci_core_device *mvdev);
 int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index a9b63d15c5d3..759a5f5f7b3f 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -579,6 +579,12 @@ static const struct vfio_migration_ops mlx5vf_pci_mig_ops = {
 	.migration_get_state = mlx5vf_pci_get_device_state,
 };
 
+static const struct vfio_log_ops mlx5vf_pci_log_ops = {
+	.log_start = mlx5vf_start_page_tracker,
+	.log_stop = mlx5vf_stop_page_tracker,
+	.log_read_and_clear = mlx5vf_tracker_read_and_clear,
+};
+
 static const struct vfio_device_ops mlx5vf_pci_ops = {
 	.name = "mlx5-vfio-pci",
 	.open_device = mlx5vf_pci_open_device,
@@ -602,7 +608,8 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 	if (!mvdev)
 		return -ENOMEM;
 	vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops);
-	mlx5vf_cmd_set_migratable(mvdev, &mlx5vf_pci_mig_ops);
+	mlx5vf_cmd_set_migratable(mvdev, &mlx5vf_pci_mig_ops,
+				  &mlx5vf_pci_log_ops);
 	dev_set_drvdata(&pdev->dev, &mvdev->core_device);
 	ret = vfio_pci_core_register_device(&mvdev->core_device);
 	if (ret)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver
  2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
                   ` (9 preceding siblings ...)
  2022-08-15 15:11 ` [PATCH V4 vfio 10/10] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
@ 2022-08-25 11:13 ` Yishai Hadas
  2022-08-25 19:37   ` Alex Williamson
  10 siblings, 1 reply; 26+ messages in thread
From: Yishai Hadas @ 2022-08-25 11:13 UTC (permalink / raw)
  To: alex.williamson, jgg
  Cc: saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins, leonro,
	maorg, cohuck

On 15/08/2022 18:10, Yishai Hadas wrote:
> This series adds device DMA logging uAPIs and their implementation as
> part of mlx5 driver.
>
> DMA logging allows a device to internally record what DMAs the device is
> initiating and report them back to userspace. It is part of the VFIO
> migration infrastructure that allows implementing dirty page tracking
> during the pre copy phase of live migration. Only DMA WRITEs are logged,
> and this API is not connected to VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
>
> The uAPIs are based on the FEATURE ioctl as were introduced earlier by
> the below RFC [1] and follows the notes that were discussed in the
> mailing list.
>
> It includes:
> - A PROBE option to detect if the device supports DMA logging.
> - A SET option to start device DMA logging in given IOVAs ranges.
> - A GET option to read back and clear the device DMA log.
> - A SET option to stop device DMA logging that was previously started.
>
> Extra details exist as part of relevant patches in the series.
>
> In addition, the series adds some infrastructure support for managing an
> IOVA bitmap done by Joao Martins.
>
> It abstracts how an IOVA range is represented in a bitmap that is
> granulated by a given page_size. So it translates all the lifting of
> dealing with user pointers into its corresponding kernel addresses.
> This new functionality abstracts the complexity of user/kernel bitmap
> pointer usage and finally enables an API to set some bits.
>
> This functionality will be used as part of IOMMUFD series for the system
> IOMMU tracking.
>
> Finally, we come with mlx5 implementation based on its device
> specification for the DMA logging APIs.
>
> The matching qemu changes can be previewed here [2].
> They come on top of the v2 migration protocol patches that were sent
> already to the mailing list.
>
> Note:
> - As this series touched mlx5_core parts we may need to send the
>    net/mlx5 patches as a pull request format to VFIO to avoid conflicts
>    before acceptance.
>
> [1] https://lore.kernel.org/all/20220501123301.127279-1-yishaih@nvidia.com/T/
> [2] https://github.com/avihai1122/qemu/commits/device_dirty_tracking
>
> Changes from V3: https://lore.kernel.org/all/202207011231.1oPQhSzo-lkp@intel.com/t/
> Rebase on top of v6.0 rc1.
> Patch #3:
> - Drop the documentation note regarding the 'atomic OR' usage of the bitmap
>    as part of VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT.
>    This deletion was missed as part of V3 to match kernel code.
>    To better clarify, as part of testing V3, we could see a big penalty in
>    performance (*2 in some cases) when the iova bitmap patch used atomic
>    bit operations. As QEMU doesn't really share bitmaps between multiple
>    trackers we saw no reason to use atomics and get a bad performance.
>    If in the future, will be a real use case that will justify it, we can
>    come with VFIO_DEVICE_FEATURE_DMA_LOGGING_REPORT_ATOMIC new option with
>    the matching kernel code.
> Patch #4:
> - The rename patch from vfio.c to vfio_main.c was accepted already, not
>    part of this series anymore.
>
> Changes from V2: https://lore.kernel.org/netdev/20220726151232.GF4438@nvidia.com/t/
> Patch #1
> - Add some reserved fields that were missed.
> Patch #3:
> - Improve the UAPI documentation in few places as was asked by Alex and
>    Kevin, based on the discussion in the mailing list.
> Patch #5:
> - Improvements from Joao for his IOVA bitmap patch to be
>    cleaner/simpler as was asked by Alex. It includes the below:
>     * Make iova_to_index and index_to_iova fully symmetrical.
>     * Use 'sizeof(*iter->data) * BITS_PER_BYTE' in both index_to_iova
>       and iova_to_index.
>     * Remove iova_bitmap_init() and just stay with iova_bitmap_iter_init().
>     * s/left/remaining/
>     * To not use @remaining variable for both index and iova/length.
>     * Remove stale comment on max dirty bitmap bits.
>     * Remove DIV_ROUNDUP from iova_to_index() helper and replace with a
>       division.
>     * Use iova rather than length where appropriate, while noting with
>       commentary the usage of length as next relative IOVA.
>     * Rework pinning to be internal and remove that from the iova iter
>       API caller.
>     * get() and put() now teardown iova_bitmap::dirty npages.
>     * Move unnecessary includes into the C file.
>     * Add theory of operation and theory of usage in the header file.
>     * Add more comments on private helpers on less obvious logic
>     * Add documentation on all public APIs.
>    * Change commit to reflect new usage of APIs.
> Patch #6:
> - Drop the hard-coded 1024 for LOG_MAX_RANGES and replace to consider
>    PAGE_SIZE as was suggested by Jason.
> - Return -E2BIG as Alex suggested.
> - Adapt the loop upon logging report to new IOVA bit map stuff.
>
> Changes from V1: https://lore.kernel.org/netdev/202207052209.x00Iykkp-lkp@intel.com/T/
>
> - Patch #6: Fix a note given by krobot, select INTERVAL_TREE for VFIO.
>
> Changes from V0: https://lore.kernel.org/netdev/202207011231.1oPQhSzo-lkp@intel.com/T/
>
> - Drop the first 2 patches that Alex merged already.
> - Fix a note given by krobot, based on Jason's suggestion.
> - Some improvements from Joao for his IOVA bitmap patch to be
>    cleaner/simpler. It includes the below:
>      * Rename iova_bitmap_array_length to iova_bitmap_iova_to_index.
>      * Rename iova_bitmap_index_to_length to iova_bitmap_index_to_iova.
>      * Change iova_bitmap_iova_to_index to take an iova_bitmap_iter
>        as an argument to pair with iova_bitmap_index_to_length.
>      * Make iova_bitmap_iter_done() use >= instead of
>        subtraction+comparison. This fixes iova_bitmap_iter_done()
>        return as it was previously returning when !done.
>      * Remove iova_bitmap_iter_length().
>      * Simplify iova_bitmap_length() overcomplicated trailing end check
>      * Convert all sizeof(u64) into sizeof(*iter->data).
>      * Use u64 __user for ::data instead of void in both struct and
>        initialization of iova_bitmap.
>
> Yishai
>
> Joao Martins (1):
>    vfio: Add an IOVA bitmap support
>
> Yishai Hadas (9):
>    net/mlx5: Introduce ifc bits for page tracker
>    net/mlx5: Query ADV_VIRTUALIZATION capabilities
>    vfio: Introduce DMA logging uAPIs
>    vfio: Introduce the DMA logging feature support
>    vfio/mlx5: Init QP based resources for dirty tracking
>    vfio/mlx5: Create and destroy page tracker object
>    vfio/mlx5: Report dirty pages from tracker
>    vfio/mlx5: Manage error scenarios on tracker
>    vfio/mlx5: Set the driver DMA logging callbacks
>
>   drivers/net/ethernet/mellanox/mlx5/core/fw.c  |   6 +
>   .../net/ethernet/mellanox/mlx5/core/main.c    |   1 +
>   drivers/vfio/Kconfig                          |   1 +
>   drivers/vfio/Makefile                         |   6 +-
>   drivers/vfio/iova_bitmap.c                    | 224 ++++
>   drivers/vfio/pci/mlx5/cmd.c                   | 995 +++++++++++++++++-
>   drivers/vfio/pci/mlx5/cmd.h                   |  63 +-
>   drivers/vfio/pci/mlx5/main.c                  |   9 +-
>   drivers/vfio/pci/vfio_pci_core.c              |   5 +
>   drivers/vfio/vfio_main.c                      | 159 +++
>   include/linux/iova_bitmap.h                   | 189 ++++
>   include/linux/mlx5/device.h                   |   9 +
>   include/linux/mlx5/mlx5_ifc.h                 |  83 +-
>   include/linux/vfio.h                          |  21 +-
>   include/uapi/linux/vfio.h                     |  86 ++
>   15 files changed, 1837 insertions(+), 20 deletions(-)
>   create mode 100644 drivers/vfio/iova_bitmap.c
>   create mode 100644 include/linux/iova_bitmap.h
>
Alex,

Can we please proceed with sending the PR (i.e. for the first two
net/mlx5 patches) so that the series can be accepted?

The comments that were given in the previous kernel cycle were addressed
and there have been no open comments here for a few weeks already.

Yishai


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-15 15:11 ` [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support Yishai Hadas
@ 2022-08-25 19:27   ` Alex Williamson
  2022-08-25 22:24     ` Joao Martins
  2022-08-26 12:58     ` Jason Gunthorpe
  0 siblings, 2 replies; 26+ messages in thread
From: Alex Williamson @ 2022-08-25 19:27 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Mon, 15 Aug 2022 18:11:03 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> From: Joao Martins <joao.m.martins@oracle.com>
> 
> The new facility adds a bunch of wrappers that abstract how an IOVA
> range is represented in a bitmap that is granulated by a given
> page_size. So it translates all the lifting of dealing with user
> pointers into its corresponding kernel addresses backing said user
> memory into doing finally the (non-atomic) bitmap ops to change
> various bits.
> 
> The formula for the bitmap is:
> 
>    data[(iova / page_size) / 64] & (1ULL << ((iova / page_size) % 64))
> 
> Where 64 is the number of bits in a unsigned long (depending on arch)
> 
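(A concrete reading of the formula, assuming page_size = 4K: a relative
iova of 0x200000 maps to page number 0x200000 / 4K = 512, so the bit
lives in data[512 / 64] = data[8] at bit position 512 % 64 = 0.)
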
> It introduces an IOVA iterator that uses a windowing scheme to minimize
> the pinning overhead, as opposed to be pinning it on demand 4K at a

s/ be / /

> time. So on a 512G and with base page size it would iterate in ranges of
> 64G at a time, while pinning 512 pages at a time leading to fewer

"on a 512G" what?  The overall size of the IOVA range is somewhat
irrelevant here and it's unclear where 64G comes from without reading
deeper into the series.  Maybe this should be something like:

"Assuming a 4K kernel page and 4K requested page size, we can use a
single kernel page to hold 512 page pointers, mapping 2M of bitmap,
representing 64G of IOVA space."
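
FWIW, the arithmetic behind those numbers (assuming 4K pages and 8-byte
struct page pointers) is:

	4096 / 8         = 512 page pointers per kernel page
	512 * 4096       = 2M of bitmap data pinned at once
	2M * 8 bits      = 16M bits, one bit per 4K of IOVA
	16M bits * 4096  = 64G of IOVA space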

> atomics (specially if the underlying user memory are hugepages).
> 
> An example usage of these helpers for a given @base_iova, @page_size, @length

Several long lines here that could be wrapped.

> and __user @data:
> 
> 	ret = iova_bitmap_iter_init(&iter, base_iova, page_size, length, data);
> 	if (ret)
> 		return -ENOMEM;
> 
> 	for (; !iova_bitmap_iter_done(&iter) && !ret;
> 	     ret = iova_bitmap_iter_advance(&iter)) {
> 		dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
> 				   iova_bitmap_length(&iter));
> 	}
> 
> 	iova_bitmap_iter_free(&iter);
> 
> An implementation of the lower end -- referred to above as dirty_reporter_ops
> to exemplify -- that is tracking dirty bits would mark an IOVA as dirty
> as following:
> 
> 	iova_bitmap_set(&iter.dirty, iova, page_size);
> 
> or a contiguous range (example two pages):
> 
> 	iova_bitmap_set(&iter.dirty, iova, 2 * page_size);
> 
> The facility is intended to be used for user bitmaps representing
> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> ---
>  drivers/vfio/Makefile       |   6 +-
>  drivers/vfio/iova_bitmap.c  | 224 ++++++++++++++++++++++++++++++++++++
>  include/linux/iova_bitmap.h | 189 ++++++++++++++++++++++++++++++
>  3 files changed, 417 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/iova_bitmap.c
>  create mode 100644 include/linux/iova_bitmap.h
> 
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 1a32357592e3..1d6cad32d366 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,9 +1,11 @@
>  # SPDX-License-Identifier: GPL-2.0
>  vfio_virqfd-y := virqfd.o
>  
> -vfio-y += vfio_main.o
> -
>  obj-$(CONFIG_VFIO) += vfio.o
> +
> +vfio-y := vfio_main.o \
> +          iova_bitmap.o \
> +
>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> new file mode 100644
> index 000000000000..6b6008ef146c
> --- /dev/null
> +++ b/drivers/vfio/iova_bitmap.c
> @@ -0,0 +1,224 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +#include <linux/iova_bitmap.h>
> +#include <linux/highmem.h>
> +
> +#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
> +
> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
> +
> +/*
> + * Converts a relative IOVA to a bitmap index.
> + * The bitmap is viewed an array of u64, and each u64 represents

"The bitmap is viewed as an u64 array and each u64 represents"

> + * a range of IOVA, and the whole pinned pages to the range window.

I think this phrase after the comma is trying to say something about the
windowed mapping, but I don't know what.

This function provides the index into that u64 array for a given IOVA
offset.

> + * Relative IOVA means relative to the iter::dirty base IOVA (stored
> + * in dirty::iova). All computations in this file are done using
> + * relative IOVAs and thus avoid an extra subtraction against
> + * dirty::iova. The user API iova_bitmap_set() always uses a regular
> + * absolute IOVAs.

So why don't we use variables that make it clear when an IOVA is an
IOVA and when it's an offset?

> + */
> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
> +					       unsigned long iova)

iova_bitmap_offset_to_index(... unsigned long offset)?

> +{
> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
> +
> +	return iova / (BITS_PER_TYPE(*iter->data) * pgsize);

Why do we name the bitmap "data" rather than "bitmap"?

Why do we call the mapped section "dirty" rather than "mapped"?  It's
not necessarily dirty, it's just the window that's currently mapped.

> +}
> +
> +/*
> + * Converts a bitmap index to a *relative* IOVA.
> + */
> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
> +					       unsigned long index)

iova_bitmap_index_to_offset()?

> +{
> +	unsigned long pgshift = iter->dirty.pgshift;
> +
> +	return (index * BITS_PER_TYPE(*iter->data)) << pgshift;
> +}
> +
> +/*
> + * Pins the bitmap user pages for the current range window.
> + * This is internal to IOVA bitmap and called when advancing the
> + * iterator.
> + */
> +static int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +	unsigned long npages;
> +	u64 __user *addr;
> +	long ret;
> +
> +	/*
> +	 * @offset is the cursor of the currently mapped u64 words

So it's an index?  I don't know what a cursor is.  If we start using
"offset" to describe a relative iova, maybe this becomes something more
descriptive, mapped_base_index?

> +	 * that we have access. And it indexes u64 bitmap word that is
> +	 * mapped. Anything before @offset is not mapped. The range
> +	 * @offset .. @count is mapped but capped at a maximum number
> +	 * of pages.

@total_indexes rather than @count maybe?

> +	 */
> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
> +			      sizeof(*iter->data), PAGE_SIZE);
> +
> +	/*
> +	 * We always cap at max number of 'struct page' a base page can fit.
> +	 * This is, for example, on x86 means 2M of bitmap data max.
> +	 */
> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
> +	addr = iter->data + iter->offset;

Subtle pointer arithmetic.

> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
> +				  FOLL_WRITE, dirty->pages);
> +	if (ret <= 0)
> +		return -EFAULT;
> +
> +	dirty->npages = (unsigned long)ret;
> +	/* Base IOVA where @pages point to i.e. bit 0 of the first page */
> +	dirty->iova = iova_bitmap_iova(iter);

If we're operating on an iterator, wouldn't convention suggest this is
an iova_bitmap_itr_FOO function?  mapped_iova perhaps.

> +
> +	/*
> +	 * offset of the page where pinned pages bit 0 is located.
> +	 * This handles the case where the bitmap is not PAGE_SIZE
> +	 * aligned.
> +	 */
> +	dirty->start_offset = offset_in_page(addr);

Maybe pgoff to avoid confusion with relative iova offsets.

It seems suspect that the length calculations don't take this into
account.

> +	return 0;
> +}
> +
> +/*
> + * Unpins the bitmap user pages and clears @npages
> + * (un)pinning is abstracted from API user and it's done
> + * when advancing or freeing the iterator.
> + */
> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	if (dirty->npages) {
> +		unpin_user_pages(dirty->pages, dirty->npages);
> +		dirty->npages = 0;
> +	}
> +}
> +
> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
> +			  unsigned long iova, unsigned long length,
> +			  unsigned long page_size, u64 __user *data)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	memset(iter, 0, sizeof(*iter));
> +	dirty->pgshift = __ffs(page_size);
> +	iter->data = data;
> +	iter->count = iova_bitmap_iova_to_index(iter, length - 1) + 1;
> +	iter->iova = iova;
> +	iter->length = length;
> +
> +	dirty->iova = iova;
> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
> +	if (!dirty->pages)
> +		return -ENOMEM;
> +
> +	return iova_bitmap_iter_get(iter);
> +}
> +
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
> +{
> +	struct iova_bitmap *dirty = &iter->dirty;
> +
> +	iova_bitmap_iter_put(iter);
> +
> +	if (dirty->pages) {
> +		free_page((unsigned long)dirty->pages);
> +		dirty->pages = NULL;
> +	}
> +
> +	memset(iter, 0, sizeof(*iter));
> +}
> +
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long skip = iter->offset;
> +
> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
> +}
> +
> +/*
> + * Returns the remaining bitmap indexes count to process for the currently pinned
> + * bitmap pages.
> + */
> +static unsigned long iova_bitmap_iter_remaining(struct iova_bitmap_iter *iter)

iova_bitmap_iter_mapped_remaining()?

> +{
> +	unsigned long remaining = iter->count - iter->offset;
> +
> +	remaining = min_t(unsigned long, remaining,
> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
> +
> +	return remaining;
> +}
> +
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)

iova_bitmap_iter_mapped_length()?

Maybe it doesn't really make sense to differentiate the iterator from
the bitmap in the API.  In fact, couldn't we reduce the API to simply:

int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
		     size_t length, size_t page_size, u64 __user *data);

int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
			 int (*fn)(void *data, dma_addr_t iova,
			 	   size_t length,
				   struct iova_bitmap *bitmap));

void iova_bitmap_free(struct iova_bitmap *bitmap);

unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
			      dma_addr_t iova, size_t length);

Removes the need for the API to have done, advance, iova, and length
functions.
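
To illustrate the shape of that, a rough usage sketch against the
prototypes above (the iova_bitmap_* calls are the proposed ones; the
wrapper callback and its body are made up for the example, wired to
the log_ops from patch 5):

	static int my_log_read_and_clear(void *opaque, dma_addr_t iova,
					 size_t length,
					 struct iova_bitmap *bitmap)
	{
		struct vfio_device *device = opaque;

		/* device-specific walk, calling iova_bitmap_set() for
		 * any dirty sub-ranges of [iova, iova + length - 1] */
		return device->log_ops->log_read_and_clear(device, iova,
							   length, bitmap);
	}

	struct iova_bitmap bitmap;
	int ret;

	ret = iova_bitmap_init(&bitmap, iova, length, page_size, data);
	if (ret)
		return ret;

	ret = iova_bitmap_for_each(&bitmap, device, my_log_read_and_clear);

	iova_bitmap_free(&bitmap);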

> +{
> +	unsigned long max_iova = iter->iova + iter->length - 1;
> +	unsigned long iova = iova_bitmap_iova(iter);
> +	unsigned long remaining;
> +
> +	/*
> +	 * iova_bitmap_iter_remaining() returns a number of indexes which
> +	 * when converted to IOVA gives us a max length that the bitmap
> +	 * pinned data can cover. Afterwards, that is capped to
> +	 * only cover the IOVA range in @iter::iova .. iter::length.
> +	 */
> +	remaining = iova_bitmap_index_to_iova(iter,
> +			iova_bitmap_iter_remaining(iter));
> +
> +	if (iova + remaining - 1 > max_iova)
> +		remaining -= ((iova + remaining - 1) - max_iova);
> +
> +	return remaining;
> +}
> +
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
> +{
> +	return iter->offset >= iter->count;
> +}
> +
> +int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
> +{
> +	unsigned long iova = iova_bitmap_length(iter) - 1;
> +	unsigned long count = iova_bitmap_iova_to_index(iter, iova) + 1;
> +
> +	iter->offset += count;
> +
> +	iova_bitmap_iter_put(iter);
> +	if (iova_bitmap_iter_done(iter))
> +		return 0;
> +
> +	/* When we advance the iterator we pin the next set of bitmap pages */
> +	return iova_bitmap_iter_get(iter);
> +}
> +
> +unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
> +			      unsigned long iova, unsigned long length)
> +{
> +	unsigned long nbits = max(1UL, length >> dirty->pgshift), set = nbits;
> +	unsigned long offset = (iova - dirty->iova) >> dirty->pgshift;
> +	unsigned long page_idx = offset / BITS_PER_PAGE;
> +	unsigned long page_offset = dirty->start_offset;
> +	void *kaddr;
> +
> +	offset = offset % BITS_PER_PAGE;
> +
> +	do {
> +		unsigned long size = min(BITS_PER_PAGE - offset, nbits);
> +
> +		kaddr = kmap_local_page(dirty->pages[page_idx]);
> +		bitmap_set(kaddr + page_offset, offset, size);
> +		kunmap_local(kaddr);
> +		page_offset = offset = 0;
> +		nbits -= size;
> +		page_idx++;
> +	} while (nbits > 0);
> +
> +	return set;
> +}
> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
> new file mode 100644
> index 000000000000..e258d03386d3
> --- /dev/null
> +++ b/include/linux/iova_bitmap.h
> @@ -0,0 +1,189 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022, Oracle and/or its affiliates.
> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +#ifndef _IOVA_BITMAP_H_
> +#define _IOVA_BITMAP_H_
> +
> +#include <linux/mm.h>
> +
> +/**
> + * struct iova_bitmap - A bitmap representing a portion IOVA space
> + *
> + * Main data structure for tracking dirty IOVAs.
> + *
> + * For example something recording dirty IOVAs, will be provided of a
> + * struct iova_bitmap structure. This structure only represents a
> + * subset of the total IOVA space pinned by its parent counterpart
> + * iterator object.
> + *
> + * The user does not need to exact location of the bits in the bitmap.
> + * From user perspective the bitmap the only API available to the dirty
> + * tracker is iova_bitmap_set() which records the dirty IOVA *range*
> + * in the bitmap data.
> + *
> + * The bitmap is an array of u64 whereas each bit represents an IOVA
> + * of range of (1 << pgshift). Thus formula for the bitmap data to be
> + * set is:
> + *
> + *   data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> + */
> +struct iova_bitmap {
> +	/* base IOVA representing bit 0 of the first page */
> +	unsigned long iova;
> +
> +	/* page size order that each bit granules to */
> +	unsigned long pgshift;
> +
> +	/* offset of the first user page pinned */
> +	unsigned long start_offset;
> +
> +	/* number of pages pinned */
> +	unsigned long npages;
> +
> +	/* pinned pages representing the bitmap data */
> +	struct page **pages;
> +};
> +
> +/**
> + * struct iova_bitmap_iter - Iterator object of the IOVA bitmap
> + *
> + * Main data structure for walking the bitmap data.
> + *
> + * Abstracts the pinning work to iterate an IOVA ranges.
> + * It uses a windowing scheme and pins the bitmap in relatively
> + * big ranges e.g.
> + *
> + * The iterator uses one base page to store all the pinned pages
> + * pointers related to the bitmap. For sizeof(struct page) == 64 it
> + * stores 512 struct pages which, if base page size is 4096 it means 2M
> + * of bitmap data is pinned at a time. If the iova_bitmap page size is
> + * also base page size then the range window to iterate is 64G.
> + *
> + * For example iterating on a total IOVA range of 4G..128G, it will
> + * walk through this set of ranges:
> + *
> + *  - 4G  -  68G-1 (64G)
> + *  - 68G - 128G-1 (64G)
> + *
> + * An example of the APIs on how to iterate the IOVA bitmap:
> + *
> + *   ret = iova_bitmap_iter_init(&iter, iova, PAGE_SIZE, length, data);
> + *   if (ret)
> + *       return -ENOMEM;
> + *
> + *   for (; !iova_bitmap_iter_done(&iter) && !ret;
> + *        ret = iova_bitmap_iter_advance(&iter)) {
> + *
> + *        dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
> + *                           iova_bitmap_length(&iter));
> + *   }
> + *
> + * An implementation of the lower end (referred to above as
> + * dirty_reporter_ops) that is tracking dirty bits would:
> + *
> + *        if (iova_dirty)
> + *            iova_bitmap_set(&iter.dirty, iova, PAGE_SIZE);
> + *
> + * The internals of the object use a cursor @offset that indexes
> + * which part u64 word of the bitmap is mapped, up to @count.
> + * Those keep being incremented until @count reaches while mapping
> + * up to PAGE_SIZE / sizeof(struct page*) maximum of pages.
> + *
> + * The iterator is usually located on what tracks DMA mapped ranges
> + * or some form of IOVA range tracking that co-relates to the user
> + * passed bitmap.
> + */
> +struct iova_bitmap_iter {
> +	/* IOVA range representing the currently pinned bitmap data */
> +	struct iova_bitmap dirty;
> +
> +	/* userspace address of the bitmap */
> +	u64 __user *data;
> +
> +	/* u64 index that @dirty points to */
> +	size_t offset;
> +
> +	/* how many u64 can we walk in total */
> +	size_t count;

size_t?  These are both indexes afaict.

> +
> +	/* base IOVA of the whole bitmap */
> +	unsigned long iova;
> +
> +	/* length of the IOVA range for the whole bitmap */
> +	unsigned long length;

OTOH this could arguably be size_t and iova dma_addr_t.  Thanks,

Alex

> +};
> +
> +/**
> + * iova_bitmap_iter_init() - Initializes an IOVA bitmap iterator object.
> + * @iter: IOVA bitmap iterator to initialize
> + * @iova: Start address of the IOVA range
> + * @length: Length of the IOVA range
> + * @page_size: Page size of the IOVA bitmap. It defines what each bit
> + *             granularity represents
> + * @data: Userspace address of the bitmap
> + *
> + * Initializes all the fields in the IOVA iterator including the first
> + * user pages of @data. Returns 0 on success or otherwise errno on error.
> + */
> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
> +			  unsigned long length, unsigned long page_size,
> +			  u64 __user *data);
> +
> +/**
> + * iova_bitmap_iter_free() - Frees an IOVA bitmap iterator object
> + * @iter: IOVA bitmap iterator to free
> + *
> + * It unpins and releases pages array memory and clears any leftover
> + * state.
> + */
> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
> +
> +/**
> + * iova_bitmap_iter_done: Checks if the IOVA bitmap has data to iterate
> + * @iter: IOVA bitmap iterator to free
> + *
> + * Returns true if there's more data to iterate.
> + */
> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
> +
> +/**
> + * iova_bitmap_iter_advance: Advances the IOVA bitmap iterator
> + * @iter: IOVA bitmap iterator to advance
> + *
> + * Advances to the next range, releases the current pinned
> + * pages and pins the next set of bitmap pages.
> + * Returns 0 on success or otherwise errno.
> + */
> +int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
> +
> +/**
> + * iova_bitmap_iova: Base IOVA of the current range
> + * @iter: IOVA bitmap iterator
> + *
> + * Returns the base IOVA of the current range.
> + */
> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
> +
> +/**
> + * iova_bitmap_length: IOVA length of the current range
> + * @iter: IOVA bitmap iterator
> + *
> + * Returns the length of the current IOVA range.
> + */
> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
> +
> +/**
> + * iova_bitmap_set: Marks an IOVA range as dirty
> + * @dirty: IOVA bitmap
> + * @iova: IOVA to mark as dirty
> + * @length: IOVA range length
> + *
> + * Marks the range [iova .. iova+length-1] as dirty in the bitmap.
> + * Returns the number of bits set.
> + */
> +unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
> +			      unsigned long iova, unsigned long length);
> +
> +#endif


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver
  2022-08-25 11:13 ` [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
@ 2022-08-25 19:37   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-08-25 19:37 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Thu, 25 Aug 2022 14:13:01 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:
> Alex,
> 
> Can we please proceed with sending PR for the series to be accepted ?  
> (i.e. as of the first two net/mlx5 patches).
> 
> The comments that were given in the previous kernel cycle were addressed 
> and there is no open comment here for few weeks already.

Hmm, it's only been posted since last week.  I still find the iova
bitmap code to be quite a mess; I just sent comments.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
  2022-08-15 15:11 ` [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support Yishai Hadas
@ 2022-08-25 20:49   ` Alex Williamson
  2022-08-25 22:26     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-25 20:49 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, joao.m.martins,
	leonro, maorg, cohuck

On Mon, 15 Aug 2022 18:11:04 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:
> +static int
> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
> +					 u32 flags, void __user *arg,
> +					 size_t argsz)
> +{
> +	size_t minsz =
> +		offsetofend(struct vfio_device_feature_dma_logging_report,
> +			    bitmap);
> +	struct vfio_device_feature_dma_logging_report report;
> +	struct iova_bitmap_iter iter;
> +	int ret;
> +
> +	if (!device->log_ops)
> +		return -ENOTTY;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(report));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&report, arg, minsz))
> +		return -EFAULT;
> +
> +	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))

Why is PAGE_SIZE a factor here?  I'm under the impression that
iova_bitmap is intended to handle arbitrary page sizes.  Thanks,

Alex

> +		return -EINVAL;
> +
> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
> +				    report.page_size,
> +				    u64_to_user_ptr(report.bitmap));
> +	if (ret)
> +		return ret;
> +
> +	for (; !iova_bitmap_iter_done(&iter) && !ret;
> +	     ret = iova_bitmap_iter_advance(&iter)) {
> +		ret = device->log_ops->log_read_and_clear(device,
> +			iova_bitmap_iova(&iter),
> +			iova_bitmap_length(&iter), &iter.dirty);
> +		if (ret)
> +			break;
> +	}
> +
> +	iova_bitmap_iter_free(&iter);
> +	return ret;
> +}


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-25 19:27   ` Alex Williamson
@ 2022-08-25 22:24     ` Joao Martins
  2022-08-25 23:15       ` Alex Williamson
  2022-08-26 12:58     ` Jason Gunthorpe
  1 sibling, 1 reply; 26+ messages in thread
From: Joao Martins @ 2022-08-25 22:24 UTC (permalink / raw)
  To: Alex Williamson, Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro, maorg, cohuck

On 8/25/22 20:27, Alex Williamson wrote:
> On Mon, 15 Aug 2022 18:11:03 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> From: Joao Martins <joao.m.martins@oracle.com>
>>
>> The new facility adds a bunch of wrappers that abstract how an IOVA
>> range is represented in a bitmap that is granulated by a given
>> page_size. So it translates all the lifting of dealing with user
>> pointers into its corresponding kernel addresses backing said user
>> memory into doing finally the (non-atomic) bitmap ops to change
>> various bits.
>>
>> The formula for the bitmap is:
>>
>>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>>
>> Where 64 is the number of bits in a unsigned long (depending on arch)
>>
>> It introduces an IOVA iterator that uses a windowing scheme to minimize
>> the pinning overhead, as opposed to be pinning it on demand 4K at a
> 
> s/ be / /
> 
Will fix.

>> time. So on a 512G and with base page size it would iterate in ranges of
>> 64G at a time, while pinning 512 pages at a time leading to fewer
> 
> "on a 512G" what?  The overall size of the IOVA range is somewhat
> irrelevant here and it's unclear where 64G comes from without reading
> deeper into the series.  Maybe this should be something like:
> 
> "Assuming a 4K kernel page and 4K requested page size, we can use a
> single kernel page to hold 512 page pointers, mapping 2M of bitmap,
> representing 64G of IOVA space."
> 
Much more readable indeed. Will use that.
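
For the record, the arithmetic behind those numbers (assuming a 4K
kernel page size and 8-byte pointers):

	4096 / sizeof(struct page *)  = 512 page pointers per kernel page
	512 pages * 4096 bytes        = 2M of bitmap data pinned at once
	2M * 8 bits per byte          = 16M bits
	16M bits * 4K per bit         = 64G of IOVA space per window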

>> atomics (specially if the underlying user memory are hugepages).
>>
>> An example usage of these helpers for a given @base_iova, @page_size, @length
> 
> Several long lines here that could be wrapped.
> 
It's already wrapped (by my editor) and also at 75 columns. I can make it a
bit shorter if that's hurting readability.

>> and __user @data:
>>
>> 	ret = iova_bitmap_iter_init(&iter, base_iova, page_size, length, data);
>> 	if (ret)
>> 		return -ENOMEM;
>>
>> 	for (; !iova_bitmap_iter_done(&iter) && !ret;
>> 	     ret = iova_bitmap_iter_advance(&iter)) {
>> 		dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
>> 				   iova_bitmap_length(&iter));
>> 	}
>>
>> 	iova_bitmap_iter_free(&iter);
>>
>> An implementation of the lower end -- referred to above as dirty_reporter_ops
>> to exemplify -- that is tracking dirty bits would mark an IOVA as dirty
>> as following:
>>
>> 	iova_bitmap_set(&iter.dirty, iova, page_size);
>>
>> or a contiguous range (example two pages):
>>
>> 	iova_bitmap_set(&iter.dirty, iova, 2 * page_size);
>>
>> The facility is intended to be used for user bitmaps representing
>> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>  drivers/vfio/Makefile       |   6 +-
>>  drivers/vfio/iova_bitmap.c  | 224 ++++++++++++++++++++++++++++++++++++
>>  include/linux/iova_bitmap.h | 189 ++++++++++++++++++++++++++++++
>>  3 files changed, 417 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/vfio/iova_bitmap.c
>>  create mode 100644 include/linux/iova_bitmap.h
>>
>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>> index 1a32357592e3..1d6cad32d366 100644
>> --- a/drivers/vfio/Makefile
>> +++ b/drivers/vfio/Makefile
>> @@ -1,9 +1,11 @@
>>  # SPDX-License-Identifier: GPL-2.0
>>  vfio_virqfd-y := virqfd.o
>>  
>> -vfio-y += vfio_main.o
>> -
>>  obj-$(CONFIG_VFIO) += vfio.o
>> +
>> +vfio-y := vfio_main.o \
>> +          iova_bitmap.o \
>> +
>>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
>> new file mode 100644
>> index 000000000000..6b6008ef146c
>> --- /dev/null
>> +++ b/drivers/vfio/iova_bitmap.c
>> @@ -0,0 +1,224 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +#include <linux/iova_bitmap.h>
>> +#include <linux/highmem.h>
>> +
>> +#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
>> +
>> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
>> +
>> +/*
>> + * Converts a relative IOVA to a bitmap index.
>> + * The bitmap is viewed an array of u64, and each u64 represents
> 
> "The bitmap is viewed as an u64 array and each u64 represents"
> 
Will use that.

>> + * a range of IOVA, and the whole pinned pages to the range window.
> 
> I think this phrase after the comma is trying to say something about the
> windowed mapping, but I don't know what.
> 
Yes, it doesn't add much in the context of the function.

> This function provides the index into that u64 array for a given IOVA
> offset.
> 
I'll use this instead.

>> + * Relative IOVA means relative to the iter::dirty base IOVA (stored
>> + * in dirty::iova). All computations in this file are done using
>> + * relative IOVAs and thus avoid an extra subtraction against
>> + * dirty::iova. The user API iova_bitmap_set() always uses a regular
>> + * absolute IOVAs.
> 
> So why don't we use variables that make it clear when an IOVA is an
> IOVA and when it's an offset?
> 
I was just sticking with the name @offset for how we iterate towards the u64s,
to avoid confusion. Should I switch to offset here, I should probably change
@offset of the struct into something else. But I see you suggested something
like that too further below.

>> + */
>> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
>> +					       unsigned long iova)
> 
> iova_bitmap_offset_to_index(... unsigned long offset)?
> 
>> +{

OK.

>> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
>> +
>> +	return iova / (BITS_PER_TYPE(*iter->data) * pgsize);
> 
> Why do we name the bitmap "data" rather than "bitmap"?
> 
I was avoiding overusing the word bitmap given the structure is already called
@bitmap. At the end of the day it's a user data pointer. But I can call it @bitmap.

> Why do we call the mapped section "dirty" rather than "mapped"?  It's
> not necessarily dirty, it's just the window that's current mapped.
> 
Good point. Dirty is just what we tracked, but the structure ::dirty is closer
to representing what's actually mapped, yes. I'll switch to mapped.

>> +}
>> +
>> +/*
>> + * Converts a bitmap index to a *relative* IOVA.
>> + */
>> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
>> +					       unsigned long index)
> 
> iova_bitmap_index_to_offset()?
> 
ack

>> +{
>> +	unsigned long pgshift = iter->dirty.pgshift;
>> +
>> +	return (index * BITS_PER_TYPE(*iter->data)) << pgshift;
>> +}
>> +
>> +/*
>> + * Pins the bitmap user pages for the current range window.
>> + * This is internal to IOVA bitmap and called when advancing the
>> + * iterator.
>> + */
>> +static int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +	unsigned long npages;
>> +	u64 __user *addr;
>> +	long ret;
>> +
>> +	/*
>> +	 * @offset is the cursor of the currently mapped u64 words
> 
> So it's an index?  I don't know what a cursor is.  

In my "english", 'cursor' is a synonym for index, yes.

> If we start using
> "offset" to describe a relative iova, maybe this becomes something more
> descriptive, mapped_base_index?
> 
I am not very fond of long names, @mapped_index maybe hmm

>> +	 * that we have access. And it indexes u64 bitmap word that is
>> +	 * mapped. Anything before @offset is not mapped. The range
>> +	 * @offset .. @count is mapped but capped at a maximum number
>> +	 * of pages.
> 
> @total_indexes rather than @count maybe?
> 
It's still a count of indexes; I thought @count was explicit enough already
without being too wordy. I can suffix it with indexes if going with
mapped_index. Or maybe @mapped_count.

>> +	 */
>> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
>> +			      sizeof(*iter->data), PAGE_SIZE);
>> +
>> +	/*
>> +	 * We always cap at max number of 'struct page' a base page can fit.
>> +	 * This is, for example, on x86 means 2M of bitmap data max.
>> +	 */
>> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
>> +	addr = iter->data + iter->offset;
> 
> Subtle pointer arithmetic.
> 
>> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
>> +				  FOLL_WRITE, dirty->pages);
>> +	if (ret <= 0)
>> +		return -EFAULT;
>> +
>> +	dirty->npages = (unsigned long)ret;
>> +	/* Base IOVA where @pages point to i.e. bit 0 of the first page */
>> +	dirty->iova = iova_bitmap_iova(iter);
> 
> If we're operating on an iterator, wouldn't convention suggest this is
> an iova_bitmap_itr_FOO function?  mapped_iova perhaps.
> 

Yes.

Given your earlier comment, mapped iova is a bit more obvious.

>> +
>> +	/*
>> +	 * offset of the page where pinned pages bit 0 is located.
>> +	 * This handles the case where the bitmap is not PAGE_SIZE
>> +	 * aligned.
>> +	 */
>> +	dirty->start_offset = offset_in_page(addr);
> 
> Maybe pgoff to avoid confusion with relative iova offsets.
> 
Will fix. And it's also convention in mm code, so I should stick with that.

> It seems suspect that the length calculations don't take this into
> account.
> 
The iova/length/index functions only work over bit/IOVA quantities and their
indexing, without needing to know where the first bit of the mapped range
starts. So the pgoff only matters when we actually set bits in the bitmap,
i.e. in iova_bitmap_set().
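
To make that concrete (restating what the patch already does, with the
current field names):

	/* index/length helpers: pure bit and IOVA quantities */
	index = iova / (BITS_PER_TYPE(*iter->data) * pgsize);

	/* iova_bitmap_set() is the only consumer of start_offset, when
	 * turning a pinned page into the base address for bitmap_set() */
	kaddr = kmap_local_page(dirty->pages[page_idx]);
	bitmap_set(kaddr + page_offset /* starts as dirty->start_offset */,
		   offset, size);
	kunmap_local(kaddr);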

>> +	return 0;
>> +}
>> +
>> +/*
>> + * Unpins the bitmap user pages and clears @npages
>> + * (un)pinning is abstracted from API user and it's done
>> + * when advancing or freeing the iterator.
>> + */
>> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	if (dirty->npages) {
>> +		unpin_user_pages(dirty->pages, dirty->npages);
>> +		dirty->npages = 0;
>> +	}
>> +}
>> +
>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
>> +			  unsigned long iova, unsigned long length,
>> +			  unsigned long page_size, u64 __user *data)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	memset(iter, 0, sizeof(*iter));
>> +	dirty->pgshift = __ffs(page_size);
>> +	iter->data = data;
>> +	iter->count = iova_bitmap_iova_to_index(iter, length - 1) + 1;
>> +	iter->iova = iova;
>> +	iter->length = length;
>> +
>> +	dirty->iova = iova;
>> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
>> +	if (!dirty->pages)
>> +		return -ENOMEM;
>> +
>> +	return iova_bitmap_iter_get(iter);
>> +}
>> +
>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
>> +{
>> +	struct iova_bitmap *dirty = &iter->dirty;
>> +
>> +	iova_bitmap_iter_put(iter);
>> +
>> +	if (dirty->pages) {
>> +		free_page((unsigned long)dirty->pages);
>> +		dirty->pages = NULL;
>> +	}
>> +
>> +	memset(iter, 0, sizeof(*iter));
>> +}
>> +
>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
>> +{
>> +	unsigned long skip = iter->offset;
>> +
>> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
>> +}
>> +
>> +/*
>> + * Returns the remaining bitmap indexes count to process for the currently pinned
>> + * bitmap pages.
>> + */
>> +static unsigned long iova_bitmap_iter_remaining(struct iova_bitmap_iter *iter)
> 
> iova_bitmap_iter_mapped_remaining()?
> 
Yes.

>> +{
>> +	unsigned long remaining = iter->count - iter->offset;
>> +
>> +	remaining = min_t(unsigned long, remaining,
>> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
>> +
>> +	return remaining;
>> +}
>> +
>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)
> 
> iova_bitmap_iter_mapped_length()?
> 
Yes.

I don't particularly like long names, but there don't seem to be better alternatives.

Part of the reason the names look 'shortened' was because the object we pass
is already an iterator, so it's implicit that we only fetch the under-iteration/mapped
iova. Or that was at least the intention.

> Maybe it doesn't really make sense to differentiate the iterator from
> the bitmap in the API.  In fact, couldn't we reduce the API to simply:
> 
> int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
> 		     size_t length, size_t page_size, u64 __user *data);
> 
> int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
> 			 int (*fn)(void *data, dma_addr_t iova,
> 			 	   size_t length,
> 				   struct iova_bitmap *bitmap));
> 
> void iova_bitmap_free(struct iova_bitmap *bitmap);
> 
> unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
> 			      dma_addr_t iova, size_t length);
> 
> Removes the need for the API to have done, advance, iova, and length
> functions.
> 
True, it would be simpler.

Could also allow us to hide the iterator details entirely and switch to
container_of() from the iova_bitmap pointer. Though, from the caller, it would
be weird to do:

struct iova_bitmap_iter iter;

iova_bitmap_init(&iter.dirty, ....);

Hmm, maybe not that strange.
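
For completeness, the container_of() variant would be something along
these lines (a sketch, assuming the mapped window stays embedded in the
iterator as it is in this version):

	static struct iova_bitmap_iter *
	iova_bitmap_to_iter(struct iova_bitmap *dirty)
	{
		return container_of(dirty, struct iova_bitmap_iter, dirty);
	}

so the dirty tracker would keep seeing only struct iova_bitmap and
iova_bitmap_set(), while the generic code could still get back to the
iterator state when it needs to.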

Unless you are trying to suggest merging struct iova_bitmap and
iova_bitmap_iter together? I was trying to keep them separate more for
the dirty tracker (IOMMUFD/VFIO) to just be limited to iova_bitmap_set(),
with the generic infra being the one managing that iterator state in a
separate structure.

>> +{
>> +	unsigned long max_iova = iter->iova + iter->length - 1;
>> +	unsigned long iova = iova_bitmap_iova(iter);
>> +	unsigned long remaining;
>> +
>> +	/*
>> +	 * iova_bitmap_iter_remaining() returns a number of indexes which
>> +	 * when converted to IOVA gives us a max length that the bitmap
>> +	 * pinned data can cover. Afterwards, that is capped to
>> +	 * only cover the IOVA range in @iter::iova .. iter::length.
>> +	 */
>> +	remaining = iova_bitmap_index_to_iova(iter,
>> +			iova_bitmap_iter_remaining(iter));
>> +
>> +	if (iova + remaining - 1 > max_iova)
>> +		remaining -= ((iova + remaining - 1) - max_iova);
>> +
>> +	return remaining;
>> +}
>> +
>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter)
>> +{
>> +	return iter->offset >= iter->count;
>> +}
>> +
>> +int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter)
>> +{
>> +	unsigned long iova = iova_bitmap_length(iter) - 1;
>> +	unsigned long count = iova_bitmap_iova_to_index(iter, iova) + 1;
>> +
>> +	iter->offset += count;
>> +
>> +	iova_bitmap_iter_put(iter);
>> +	if (iova_bitmap_iter_done(iter))
>> +		return 0;
>> +
>> +	/* When we advance the iterator we pin the next set of bitmap pages */
>> +	return iova_bitmap_iter_get(iter);
>> +}
>> +
>> +unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
>> +			      unsigned long iova, unsigned long length)
>> +{
>> +	unsigned long nbits = max(1UL, length >> dirty->pgshift), set = nbits;
>> +	unsigned long offset = (iova - dirty->iova) >> dirty->pgshift;
>> +	unsigned long page_idx = offset / BITS_PER_PAGE;
>> +	unsigned long page_offset = dirty->start_offset;
>> +	void *kaddr;
>> +
>> +	offset = offset % BITS_PER_PAGE;
>> +
>> +	do {
>> +		unsigned long size = min(BITS_PER_PAGE - offset, nbits);
>> +
>> +		kaddr = kmap_local_page(dirty->pages[page_idx]);
>> +		bitmap_set(kaddr + page_offset, offset, size);
>> +		kunmap_local(kaddr);
>> +		page_offset = offset = 0;
>> +		nbits -= size;
>> +		page_idx++;
>> +	} while (nbits > 0);
>> +
>> +	return set;
>> +}
>> +EXPORT_SYMBOL_GPL(iova_bitmap_set);
>> diff --git a/include/linux/iova_bitmap.h b/include/linux/iova_bitmap.h
>> new file mode 100644
>> index 000000000000..e258d03386d3
>> --- /dev/null
>> +++ b/include/linux/iova_bitmap.h
>> @@ -0,0 +1,189 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +#ifndef _IOVA_BITMAP_H_
>> +#define _IOVA_BITMAP_H_
>> +
>> +#include <linux/mm.h>
>> +
>> +/**
>> + * struct iova_bitmap - A bitmap representing a portion IOVA space
>> + *
>> + * Main data structure for tracking dirty IOVAs.
>> + *
>> + * For example something recording dirty IOVAs, will be provided of a
>> + * struct iova_bitmap structure. This structure only represents a
>> + * subset of the total IOVA space pinned by its parent counterpart
>> + * iterator object.
>> + *
>> + * The user does not need to exact location of the bits in the bitmap.
>> + * From user perspective the bitmap the only API available to the dirty
>> + * tracker is iova_bitmap_set() which records the dirty IOVA *range*
>> + * in the bitmap data.
>> + *
>> + * The bitmap is an array of u64 whereas each bit represents an IOVA
>> + * of range of (1 << pgshift). Thus formula for the bitmap data to be
>> + * set is:
>> + *
>> + *   data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>> + */
>> +struct iova_bitmap {
>> +	/* base IOVA representing bit 0 of the first page */
>> +	unsigned long iova;
>> +
>> +	/* page size order that each bit granules to */
>> +	unsigned long pgshift;
>> +
>> +	/* offset of the first user page pinned */
>> +	unsigned long start_offset;
>> +
>> +	/* number of pages pinned */
>> +	unsigned long npages;
>> +
>> +	/* pinned pages representing the bitmap data */
>> +	struct page **pages;
>> +};
>> +
>> +/**
>> + * struct iova_bitmap_iter - Iterator object of the IOVA bitmap
>> + *
>> + * Main data structure for walking the bitmap data.
>> + *
>> + * Abstracts the pinning work to iterate an IOVA ranges.
>> + * It uses a windowing scheme and pins the bitmap in relatively
>> + * big ranges e.g.
>> + *
>> + * The iterator uses one base page to store all the pinned pages
>> + * pointers related to the bitmap. For sizeof(struct page) == 64 it
>> + * stores 512 struct pages which, if base page size is 4096 it means 2M
>> + * of bitmap data is pinned at a time. If the iova_bitmap page size is
>> + * also base page size then the range window to iterate is 64G.
>> + *
>> + * For example iterating on a total IOVA range of 4G..128G, it will
>> + * walk through this set of ranges:
>> + *
>> + *  - 4G  -  68G-1 (64G)
>> + *  - 68G - 128G-1 (64G)
>> + *
>> + * An example of the APIs on how to iterate the IOVA bitmap:
>> + *
>> + *   ret = iova_bitmap_iter_init(&iter, iova, PAGE_SIZE, length, data);
>> + *   if (ret)
>> + *       return -ENOMEM;
>> + *
>> + *   for (; !iova_bitmap_iter_done(&iter) && !ret;
>> + *        ret = iova_bitmap_iter_advance(&iter)) {
>> + *
>> + *        dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
>> + *                           iova_bitmap_length(&iter));
>> + *   }
>> + *
>> + * An implementation of the lower end (referred to above as
>> + * dirty_reporter_ops) that is tracking dirty bits would:
>> + *
>> + *        if (iova_dirty)
>> + *            iova_bitmap_set(&iter.dirty, iova, PAGE_SIZE);
>> + *
>> + * The internals of the object use a cursor @offset that indexes
>> + * which part u64 word of the bitmap is mapped, up to @count.
>> + * Those keep being incremented until @count reaches while mapping
>> + * up to PAGE_SIZE / sizeof(struct page*) maximum of pages.
>> + *
>> + * The iterator is usually located on what tracks DMA mapped ranges
>> + * or some form of IOVA range tracking that co-relates to the user
>> + * passed bitmap.
>> + */
>> +struct iova_bitmap_iter {
>> +	/* IOVA range representing the currently pinned bitmap data */
>> +	struct iova_bitmap dirty;
>> +
>> +	/* userspace address of the bitmap */
>> +	u64 __user *data;
>> +
>> +	/* u64 index that @dirty points to */
>> +	size_t offset;
>> +
>> +	/* how many u64 can we walk in total */
>> +	size_t count;
> 
> size_t?  These are both indexes afaict.
> 
Yes, these are both indexes; I'll move away from size_t in these two.

>> +
>> +	/* base IOVA of the whole bitmap */
>> +	unsigned long iova;
>> +
>> +	/* length of the IOVA range for the whole bitmap */
>> +	unsigned long length;
> 
> OTOH this could arguably be size_t and iova dma_addr_t.  Thanks,
> 
OK.

Thanks a lot for the review and suggestions!

> Alex
> 
>> +};
>> +
>> +/**
>> + * iova_bitmap_iter_init() - Initializes an IOVA bitmap iterator object.
>> + * @iter: IOVA bitmap iterator to initialize
>> + * @iova: Start address of the IOVA range
>> + * @length: Length of the IOVA range
>> + * @page_size: Page size of the IOVA bitmap. It defines what each bit
>> + *             granularity represents
>> + * @data: Userspace address of the bitmap
>> + *
>> + * Initializes all the fields in the IOVA iterator including the first
>> + * user pages of @data. Returns 0 on success or otherwise errno on error.
>> + */
>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter, unsigned long iova,
>> +			  unsigned long length, unsigned long page_size,
>> +			  u64 __user *data);
>> +
>> +/**
>> + * iova_bitmap_iter_free() - Frees an IOVA bitmap iterator object
>> + * @iter: IOVA bitmap iterator to free
>> + *
>> + * It unpins and releases pages array memory and clears any leftover
>> + * state.
>> + */
>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter);
>> +
>> +/**
>> + * iova_bitmap_iter_done: Checks if the IOVA bitmap has data to iterate
>> + * @iter: IOVA bitmap iterator to free
>> + *
>> + * Returns true if there's more data to iterate.
>> + */
>> +bool iova_bitmap_iter_done(struct iova_bitmap_iter *iter);
>> +
>> +/**
>> + * iova_bitmap_iter_advance: Advances the IOVA bitmap iterator
>> + * @iter: IOVA bitmap iterator to advance
>> + *
>> + * Advances to the next range, releases the current pinned
>> + * pages and pins the next set of bitmap pages.
>> + * Returns 0 on success or otherwise errno.
>> + */
>> +int iova_bitmap_iter_advance(struct iova_bitmap_iter *iter);
>> +
>> +/**
>> + * iova_bitmap_iova: Base IOVA of the current range
>> + * @iter: IOVA bitmap iterator
>> + *
>> + * Returns the base IOVA of the current range.
>> + */
>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter);
>> +
>> +/**
>> + * iova_bitmap_length: IOVA length of the current range
>> + * @iter: IOVA bitmap iterator
>> + *
>> + * Returns the length of the current IOVA range.
>> + */
>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter);
>> +
>> +/**
>> + * iova_bitmap_set: Marks an IOVA range as dirty
>> + * @dirty: IOVA bitmap
>> + * @iova: IOVA to mark as dirty
>> + * @length: IOVA range length
>> + *
>> + * Marks the range [iova .. iova+length-1] as dirty in the bitmap.
>> + * Returns the number of bits set.
>> + */
>> +unsigned long iova_bitmap_set(struct iova_bitmap *dirty,
>> +			      unsigned long iova, unsigned long length);
>> +
>> +#endif
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
  2022-08-25 20:49   ` Alex Williamson
@ 2022-08-25 22:26     ` Joao Martins
  2022-08-25 22:46       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2022-08-25 22:26 UTC (permalink / raw)
  To: Alex Williamson, Yishai Hadas
  Cc: jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro, maorg, cohuck

On 8/25/22 21:49, Alex Williamson wrote:
> On Mon, 15 Aug 2022 18:11:04 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>> +static int
>> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
>> +					 u32 flags, void __user *arg,
>> +					 size_t argsz)
>> +{
>> +	size_t minsz =
>> +		offsetofend(struct vfio_device_feature_dma_logging_report,
>> +			    bitmap);
>> +	struct vfio_device_feature_dma_logging_report report;
>> +	struct iova_bitmap_iter iter;
>> +	int ret;
>> +
>> +	if (!device->log_ops)
>> +		return -ENOTTY;
>> +
>> +	ret = vfio_check_feature(flags, argsz,
>> +				 VFIO_DEVICE_FEATURE_GET,
>> +				 sizeof(report));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (copy_from_user(&report, arg, minsz))
>> +		return -EFAULT;
>> +
>> +	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))
> 
> Why is PAGE_SIZE a factor here?  I'm under the impression that
> iova_bitmap is intended to handle arbitrary page sizes.  Thanks,

Arbitrary page sizes ... which are powers of 2. We use page shift in iova bitmap.
While it's not hard to lose this restriction (trading a shift for a slower mul)
... I am not sure it is worth supporting said use considering that there aren't
non-power-of-2 page sizes right now?

The PAGE_SIZE restriction might be that it's supposed to be the lowest possible page_size.
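
i.e. the code leans on the power-of-2 property roughly like this (a
minimal sketch mirroring how the patch uses pgshift; variable names are
illustrative):

	if (!is_power_of_2(page_size))
		return -EINVAL;

	pgshift = __ffs(page_size);		/* log2(page_size) */
	bit  = (iova - base_iova) >> pgshift;	/* instead of / page_size */
	iova = base_iova + (bit << pgshift);	/* instead of * page_size */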

> 
> Alex
> 
>> +		return -EINVAL;
>> +
>> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
>> +				    report.page_size,
>> +				    u64_to_user_ptr(report.bitmap));
>> +	if (ret)
>> +		return ret;
>> +
>> +	for (; !iova_bitmap_iter_done(&iter) && !ret;
>> +	     ret = iova_bitmap_iter_advance(&iter)) {
>> +		ret = device->log_ops->log_read_and_clear(device,
>> +			iova_bitmap_iova(&iter),
>> +			iova_bitmap_length(&iter), &iter.dirty);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	iova_bitmap_iter_free(&iter);
>> +	return ret;
>> +}
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
  2022-08-25 22:26     ` Joao Martins
@ 2022-08-25 22:46       ` Alex Williamson
  2022-08-26 12:52         ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-25 22:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On Thu, 25 Aug 2022 23:26:04 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 8/25/22 21:49, Alex Williamson wrote:
> > On Mon, 15 Aug 2022 18:11:04 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:  
> >> +static int
> >> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
> >> +					 u32 flags, void __user *arg,
> >> +					 size_t argsz)
> >> +{
> >> +	size_t minsz =
> >> +		offsetofend(struct vfio_device_feature_dma_logging_report,
> >> +			    bitmap);
> >> +	struct vfio_device_feature_dma_logging_report report;
> >> +	struct iova_bitmap_iter iter;
> >> +	int ret;
> >> +
> >> +	if (!device->log_ops)
> >> +		return -ENOTTY;
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_GET,
> >> +				 sizeof(report));
> >> +	if (ret != 1)
> >> +		return ret;
> >> +
> >> +	if (copy_from_user(&report, arg, minsz))
> >> +		return -EFAULT;
> >> +
> >> +	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))  
> > 
> > Why is PAGE_SIZE a factor here?  I'm under the impression that
> > iova_bitmap is intended to handle arbitrary page sizes.  Thanks,  
> 
> Arbitrary page sizes ... which are powers of 2. We use page shift in iova bitmap.
> While it's not hard to lose this restriction (trading a shift over a slower mul)
> ... I am not sure it is worth supporting said use considering that there aren't
> non-powers of 2 page sizes right now?
> 
> The PAGE_SIZE restriction might be that it's supposed to be the lowest possible page_size.

Sorry, I was unclear.  Size relative to PAGE_SIZE was my only question,
not that we shouldn't require power of 2 sizes.  We're adding device
level dirty tracking, where the device page size granularity might be
4K on a host with a CPU 64K page size.  Maybe there's a use case for
that.  Given the flexibility claimed by the iova_bitmap support,
requiring the reported page size to be no less than the system PAGE_SIZE
seems unjustified.  Thanks,

Alex
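
To put numbers on that scenario (hypothetical configuration, not taken
from the patch): on a 64K-page host the current check rejects a device
that tracks at 4K granularity,

	/* PAGE_SIZE == SZ_64K, device granularity is SZ_4K */
	report.page_size = SZ_4K;
	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))
		return -EINVAL;	/* taken: 4K < 64K */

even though iova_bitmap itself only needs page_size to be a power of two.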

> >> +		return -EINVAL;
> >> +
> >> +	ret = iova_bitmap_iter_init(&iter, report.iova, report.length,
> >> +				    report.page_size,
> >> +				    u64_to_user_ptr(report.bitmap));
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	for (; !iova_bitmap_iter_done(&iter) && !ret;
> >> +	     ret = iova_bitmap_iter_advance(&iter)) {
> >> +		ret = device->log_ops->log_read_and_clear(device,
> >> +			iova_bitmap_iova(&iter),
> >> +			iova_bitmap_length(&iter), &iter.dirty);
> >> +		if (ret)
> >> +			break;
> >> +	}
> >> +
> >> +	iova_bitmap_iter_free(&iter);
> >> +	return ret;
> >> +}  
> >   
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-25 22:24     ` Joao Martins
@ 2022-08-25 23:15       ` Alex Williamson
  2022-08-26  9:37         ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-25 23:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On Thu, 25 Aug 2022 23:24:39 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 8/25/22 20:27, Alex Williamson wrote:
> > On Mon, 15 Aug 2022 18:11:03 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >   
> >> From: Joao Martins <joao.m.martins@oracle.com>
> >>
> >> The new facility adds a bunch of wrappers that abstract how an IOVA
> >> range is represented in a bitmap that is granulated by a given
> >> page_size. So it translates all the lifting of dealing with user
> >> pointers into its corresponding kernel addresses backing said user
> >> memory into doing finally the (non-atomic) bitmap ops to change
> >> various bits.
> >>
> >> The formula for the bitmap is:
> >>
> >>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
> >>
> >> Where 64 is the number of bits in a unsigned long (depending on arch)
> >>
> >> It introduces an IOVA iterator that uses a windowing scheme to minimize
> >> the pinning overhead, as opposed to be pinning it on demand 4K at a  
> > 
> > s/ be / /
> >   
> Will fix.
> 
> >> time. So on a 512G and with base page size it would iterate in ranges of
> >> 64G at a time, while pinning 512 pages at a time leading to fewer  
> > 
> > "on a 512G" what?  The overall size of the IOVA range is somewhat
> > irrelevant here and it's unclear where 64G comes from without reading
> > deeper into the series.  Maybe this should be something like:
> > 
> > "Assuming a 4K kernel page and 4K requested page size, we can use a
> > single kernel page to hold 512 page pointers, mapping 2M of bitmap,
> > representing 64G of IOVA space."
> >   
> Much more readable indeed. Will use that.
> 
> >> atomics (specially if the underlying user memory are hugepages).
> >>
> >> An example usage of these helpers for a given @base_iova, @page_size, @length  
> > 
> > Several long lines here that could be wrapped.
> >   
> It's already wrapped (by my editor) and also at 75 columns. I can do a
> bit shorter if that's hurting readability.

78 chars above, but git log indents by another 4 spaces, so they do
wrap.  Something around 70/72 seems better for commit logs.

> >> and __user @data:
> >>
> >> 	ret = iova_bitmap_iter_init(&iter, base_iova, page_size, length, data);
> >> 	if (ret)
> >> 		return -ENOMEM;
> >>
> >> 	for (; !iova_bitmap_iter_done(&iter) && !ret;
> >> 	     ret = iova_bitmap_iter_advance(&iter)) {
> >> 		dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
> >> 				   iova_bitmap_length(&iter));
> >> 	}
> >>
> >> 	iova_bitmap_iter_free(&iter);
> >>
> >> An implementation of the lower end -- referred to above as dirty_reporter_ops
> >> to exemplify -- that is tracking dirty bits would mark an IOVA as dirty
> >> as following:
> >>
> >> 	iova_bitmap_set(&iter.dirty, iova, page_size);
> >>
> >> or a contiguous range (example two pages):
> >>
> >> 	iova_bitmap_set(&iter.dirty, iova, 2 * page_size);
> >>
> >> The facility is intended to be used for user bitmaps representing
> >> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
> >> ---
> >>  drivers/vfio/Makefile       |   6 +-
> >>  drivers/vfio/iova_bitmap.c  | 224 ++++++++++++++++++++++++++++++++++++
> >>  include/linux/iova_bitmap.h | 189 ++++++++++++++++++++++++++++++
> >>  3 files changed, 417 insertions(+), 2 deletions(-)
> >>  create mode 100644 drivers/vfio/iova_bitmap.c
> >>  create mode 100644 include/linux/iova_bitmap.h
> >>
> >> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> >> index 1a32357592e3..1d6cad32d366 100644
> >> --- a/drivers/vfio/Makefile
> >> +++ b/drivers/vfio/Makefile
> >> @@ -1,9 +1,11 @@
> >>  # SPDX-License-Identifier: GPL-2.0
> >>  vfio_virqfd-y := virqfd.o
> >>  
> >> -vfio-y += vfio_main.o
> >> -
> >>  obj-$(CONFIG_VFIO) += vfio.o
> >> +
> >> +vfio-y := vfio_main.o \
> >> +          iova_bitmap.o \
> >> +
> >>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> >>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
> >> new file mode 100644
> >> index 000000000000..6b6008ef146c
> >> --- /dev/null
> >> +++ b/drivers/vfio/iova_bitmap.c
> >> @@ -0,0 +1,224 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (c) 2022, Oracle and/or its affiliates.
> >> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +#include <linux/iova_bitmap.h>
> >> +#include <linux/highmem.h>
> >> +
> >> +#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
> >> +
> >> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
> >> +
> >> +/*
> >> + * Converts a relative IOVA to a bitmap index.
> >> + * The bitmap is viewed an array of u64, and each u64 represents  
> > 
> > "The bitmap is viewed as an u64 array and each u64 represents"
> >   
> Will use that.
> 
> >> + * a range of IOVA, and the whole pinned pages to the range window.  
> > 
> > I think this phrase after the comma is trying to say something about the
> > windowed mapping, but I don't know what.
> >   
> Yes. doesn't add much in the context of the function.
> 
> > This function provides the index into that u64 array for a given IOVA
> > offset.
> >   
> I'll use this instead.
> 
> >> + * Relative IOVA means relative to the iter::dirty base IOVA (stored
> >> + * in dirty::iova). All computations in this file are done using
> >> + * relative IOVAs and thus avoid an extra subtraction against
> >> + * dirty::iova. The user API iova_bitmap_set() always uses a regular
> >> + * absolute IOVAs.  
> > 
> > So why don't we use variables that make it clear when an IOVA is an
> > IOVA and when it's an offset?
> >   
> I was just sticking the name @offset to how we iterate towards the u64s
> to avoid confusion. Should I switch to offset here I should probably change
> @offset of the struct into something else. But I see you suggested something
> like that too further below.
> 
> >> + */
> >> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
> >> +					       unsigned long iova)  
> > 
> > iova_bitmap_offset_to_index(... unsigned long offset)?
> >   
> >> +{
> 
> OK.
> 
> >> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
> >> +
> >> +	return iova / (BITS_PER_TYPE(*iter->data) * pgsize);  
> > 
> > Why do we name the bitmap "data" rather than "bitmap"?
> >   
> I was avoid overusing the word bitmap given structure is already called @bitmap.
> At the end of the day it's a user data pointer. But I can call it @bitmap.

@data is not very descriptive, and finding a pointer to a bitmap inside
a struct iova_bitmap feels like a fairly natural thing to me ;)

> > Why do we call the mapped section "dirty" rather than "mapped"?  It's
> > not necessarily dirty, it's just the window that's current mapped.
> >   
> Good point. Dirty is just what we tracked, but the structure ::dirty is closer
> to representing what's actually mapped yes. I'll switch to mapped.
> 
> >> +}
> >> +
> >> +/*
> >> + * Converts a bitmap index to a *relative* IOVA.
> >> + */
> >> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
> >> +					       unsigned long index)  
> > 
> > iova_bitmap_index_to_offset()?
> >   
> ack
> 
> >> +{
> >> +	unsigned long pgshift = iter->dirty.pgshift;
> >> +
> >> +	return (index * BITS_PER_TYPE(*iter->data)) << pgshift;
> >> +}
> >> +
> >> +/*
> >> + * Pins the bitmap user pages for the current range window.
> >> + * This is internal to IOVA bitmap and called when advancing the
> >> + * iterator.
> >> + */
> >> +static int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +	unsigned long npages;
> >> +	u64 __user *addr;
> >> +	long ret;
> >> +
> >> +	/*
> >> +	 * @offset is the cursor of the currently mapped u64 words  
> > 
> > So it's an index?  I don't know what a cursor is.    
> 
> In my "english" 'cursor' as a synonym for index yes.
> 
> > If we start using
> > "offset" to describe a relative iova, maybe this becomes something more
> > descriptive, mapped_base_index?
> >   
> I am not very fond of long names, @mapped_index maybe hmm
> 
> >> +	 * that we have access. And it indexes u64 bitmap word that is
> >> +	 * mapped. Anything before @offset is not mapped. The range
> >> +	 * @offset .. @count is mapped but capped at a maximum number
> >> +	 * of pages.  
> > 
> > @total_indexes rather than @count maybe?
> >   
> It's still a count of indexes, I thought @count was explicit already without
> being too wordy. I can suffix with indexes if going with mapped_index. Or maybe
> @mapped_count maybe

I was trying to get "index" in there somehow to make it stupid obvious
that it's a count of indexes.

> >> +	 */
> >> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
> >> +			      sizeof(*iter->data), PAGE_SIZE);
> >> +
> >> +	/*
> >> +	 * We always cap at max number of 'struct page' a base page can fit.
> >> +	 * This is, for example, on x86 means 2M of bitmap data max.
> >> +	 */
> >> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
> >> +	addr = iter->data + iter->offset;  
> > 
> > Subtle pointer arithmetic.
> >   
> >> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
> >> +				  FOLL_WRITE, dirty->pages);
> >> +	if (ret <= 0)
> >> +		return -EFAULT;
> >> +
> >> +	dirty->npages = (unsigned long)ret;
> >> +	/* Base IOVA where @pages point to i.e. bit 0 of the first page */
> >> +	dirty->iova = iova_bitmap_iova(iter);  
> > 
> > If we're operating on an iterator, wouldn't convention suggest this is
> > an iova_bitmap_itr_FOO function?  mapped_iova perhaps.
> >   
> 
> Yes.
> 
> Given your earlier comment, mapped iova is a bit more obvious.
> 
> >> +
> >> +	/*
> >> +	 * offset of the page where pinned pages bit 0 is located.
> >> +	 * This handles the case where the bitmap is not PAGE_SIZE
> >> +	 * aligned.
> >> +	 */
> >> +	dirty->start_offset = offset_in_page(addr);  
> > 
> > Maybe pgoff to avoid confusion with relative iova offsets.
> >   
> Will fix. And it's also convention in mm code, so I should stick with that.
> 
> > It seems suspect that the length calculations don't take this into
> > account.
> >   
> The iova/length/indexes functions only work over bit/iova "quantity" and indexing of it
> without needing to know where the first bit of the mapped range starts. So the pgoff
> is only important when we actually set bits on the bitmap i.e. iova_bitmap_set().

Ok

> >> +	return 0;
> >> +}
> >> +
> >> +/*
> >> + * Unpins the bitmap user pages and clears @npages
> >> + * (un)pinning is abstracted from API user and it's done
> >> + * when advancing or freeing the iterator.
> >> + */
> >> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	if (dirty->npages) {
> >> +		unpin_user_pages(dirty->pages, dirty->npages);
> >> +		dirty->npages = 0;
> >> +	}
> >> +}
> >> +
> >> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
> >> +			  unsigned long iova, unsigned long length,
> >> +			  unsigned long page_size, u64 __user *data)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	memset(iter, 0, sizeof(*iter));
> >> +	dirty->pgshift = __ffs(page_size);
> >> +	iter->data = data;
> >> +	iter->count = iova_bitmap_iova_to_index(iter, length - 1) + 1;
> >> +	iter->iova = iova;
> >> +	iter->length = length;
> >> +
> >> +	dirty->iova = iova;
> >> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
> >> +	if (!dirty->pages)
> >> +		return -ENOMEM;
> >> +
> >> +	return iova_bitmap_iter_get(iter);
> >> +}
> >> +
> >> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
> >> +{
> >> +	struct iova_bitmap *dirty = &iter->dirty;
> >> +
> >> +	iova_bitmap_iter_put(iter);
> >> +
> >> +	if (dirty->pages) {
> >> +		free_page((unsigned long)dirty->pages);
> >> +		dirty->pages = NULL;
> >> +	}
> >> +
> >> +	memset(iter, 0, sizeof(*iter));
> >> +}
> >> +
> >> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
> >> +{
> >> +	unsigned long skip = iter->offset;
> >> +
> >> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
> >> +}
> >> +
> >> +/*
> >> + * Returns the remaining bitmap indexes count to process for the currently pinned
> >> + * bitmap pages.
> >> + */
> >> +static unsigned long iova_bitmap_iter_remaining(struct iova_bitmap_iter *iter)  
> > 
> > iova_bitmap_iter_mapped_remaining()?
> >   
> Yes.
> 
> >> +{
> >> +	unsigned long remaining = iter->count - iter->offset;
> >> +
> >> +	remaining = min_t(unsigned long, remaining,
> >> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
> >> +
> >> +	return remaining;
> >> +}
> >> +
> >> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)  
> > 
> > iova_bitmap_iter_mapped_length()?
> >   
> Yes.
> 
> I don't particularly like long names, but doesn't seem to have better alternatives.
> 
> Part of the reason the names look 'shortened' was because the object we pass
> is already an iterator, so it's implicit that we only fetch the under-iteration/mapped
> iova. Or that was at least the intention.

Yes, but that means you need to look at the function declaration to
know that it takes an iova_bitmap_iter rather than an iova_bitmap,
which already means it's not intuitive enough.

> > Maybe it doesn't really make sense to differentiate the iterator from
> > the bitmap in the API.  In fact, couldn't we reduce the API to simply:
> > 
> > int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
> > 		     size_t length, size_t page_size, u64 __user *data);
> > 
> > int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
> > 			 int (*fn)(void *data, dma_addr_t iova,
> > 			 	   size_t length,
> > 				   struct iova_bitmap *bitmap));
> > 
> > void iova_bitmap_free(struct iova_bitmap *bitmap);
> > 
> > unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
> > 			      dma_addr_t iova, size_t length);
> > 
> > Removes the need for the API to have done, advance, iova, and length
> > functions.
> >   
> True, it would be simpler.
> 
> Could also allow us to hide the iterator details entirely and switch to
> container_of() from iova_bitmap pointer. Though, from caller, it would be
> weird to do:
> 
> struct iova_bitmap_iter iter;
> 
> iova_bitmap_init(&iter.dirty, ....);
> 
> Hmm, maybe not that strange.
> 
> Unless you are trying to suggest to merge both struct iova_bitmap and
> iova_bitmap_iter together? I was trying to keep them separate more for
> the dirty tracker (IOMMUFD/VFIO, to just be limited to iova_bitmap_set()
> with the generic infra being the one managing that iterator state in a
> separate structure.

Not suggesting they be merged, but why does the embedded mapping
structure need to be exposed to the API?  That's an implementation
detail that's causing confusion and naming issues for which structure
is passed and how do we represent that in the function name.  Thanks,

Alex
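
P.S. As a rough, untested sketch (not taken from the patches) of how a
caller might drive that reduced API; the log_read_and_clear() device
callback used below is only an assumption for illustration:

static int dirty_report_fn(void *opaque, dma_addr_t iova, size_t length,
			   struct iova_bitmap *bitmap)
{
	struct vfio_device *device = opaque;

	/* the device driver only ever calls iova_bitmap_set() on @bitmap */
	return device->log_ops->log_read_and_clear(device, iova, length,
						   bitmap);
}

static int vfio_report_dirty(struct vfio_device *device, dma_addr_t iova,
			     size_t length, size_t page_size,
			     u64 __user *data)
{
	struct iova_bitmap bitmap;
	int ret;

	ret = iova_bitmap_init(&bitmap, iova, length, page_size, data);
	if (ret)
		return ret;

	/* pinning and advancing of the user bitmap stays inside the core */
	ret = iova_bitmap_for_each(&bitmap, device, dirty_report_fn);

	iova_bitmap_free(&bitmap);
	return ret;
}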


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-25 23:15       ` Alex Williamson
@ 2022-08-26  9:37         ` Joao Martins
  2022-08-26 12:02           ` Alex Williamson
  2022-08-26 13:01           ` Jason Gunthorpe
  0 siblings, 2 replies; 26+ messages in thread
From: Joao Martins @ 2022-08-26  9:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On 8/26/22 00:15, Alex Williamson wrote:
> On Thu, 25 Aug 2022 23:24:39 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 8/25/22 20:27, Alex Williamson wrote:
>>> On Mon, 15 Aug 2022 18:11:03 +0300
>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>
>>>> The new facility adds a bunch of wrappers that abstract how an IOVA
>>>> range is represented in a bitmap that is granulated by a given
>>>> page_size. So it translates all the lifting of dealing with user
>>>> pointers into its corresponding kernel addresses backing said user
>>>> memory into doing finally the (non-atomic) bitmap ops to change
>>>> various bits.
>>>>
>>>> The formula for the bitmap is:
>>>>
>>>>    data[(iova / page_size) / 64] & (1ULL << (iova % 64))
>>>>
>>>> Where 64 is the number of bits in a unsigned long (depending on arch)
>>>>
>>>> It introduces an IOVA iterator that uses a windowing scheme to minimize
>>>> the pinning overhead, as opposed to be pinning it on demand 4K at a  
>>>
>>> s/ be / /
>>>   
>> Will fix.
>>
>>>> time. So on a 512G and with base page size it would iterate in ranges of
>>>> 64G at a time, while pinning 512 pages at a time leading to fewer  
>>>
>>> "on a 512G" what?  The overall size of the IOVA range is somewhat
>>> irrelevant here and it's unclear where 64G comes from without reading
>>> deeper into the series.  Maybe this should be something like:
>>>
>>> "Assuming a 4K kernel page and 4K requested page size, we can use a
>>> single kernel page to hold 512 page pointers, mapping 2M of bitmap,
>>> representing 64G of IOVA space."
>>>   
>> Much more readable indeed. Will use that.
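
To spell out the numbers in that sentence (illustrative only, assuming a
4K kernel page and a 4K tracked page size):

	page pointers per window:  PAGE_SIZE / sizeof(struct page *)
	                           = 4096 / 8             =  512
	bitmap mapped per window:  512 * PAGE_SIZE         =    2 MiB
	bits per window:           2 MiB * BITS_PER_BYTE   =   16 Mi bits
	IOVA covered per window:   16 Mi bits * 4 KiB      =   64 GiB
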
>>
>>>> atomics (specially if the underlying user memory are hugepages).
>>>>
>>>> An example usage of these helpers for a given @base_iova, @page_size, @length  
>>>
>>> Several long lines here that could be wrapped.
>>>   
>> It's already wrapped (by my editor) and also at 75 columns. I can do a
>> bit shorter if that's hurting readability.
> 
> 78 chars above, but git log indents by another 4 spaces, so they do
> wrap.  Something around 70/72 seems better for commit logs.
> 
OK, I'll wrap at 70.

>>>> and __user @data:
>>>>
>>>> 	ret = iova_bitmap_iter_init(&iter, base_iova, page_size, length, data);
>>>> 	if (ret)
>>>> 		return -ENOMEM;
>>>>
>>>> 	for (; !iova_bitmap_iter_done(&iter) && !ret;
>>>> 	     ret = iova_bitmap_iter_advance(&iter)) {
>>>> 		dirty_reporter_ops(&iter.dirty, iova_bitmap_iova(&iter),
>>>> 				   iova_bitmap_length(&iter));
>>>> 	}
>>>>
>>>> 	iova_bitmap_iter_free(&iter);
>>>>
>>>> An implementation of the lower end -- referred to above as dirty_reporter_ops
>>>> to exemplify -- that is tracking dirty bits would mark an IOVA as dirty
>>>> as following:
>>>>
>>>> 	iova_bitmap_set(&iter.dirty, iova, page_size);
>>>>
>>>> or a contiguous range (example two pages):
>>>>
>>>> 	iova_bitmap_set(&iter.dirty, iova, 2 * page_size);
>>>>
>>>> The facility is intended to be used for user bitmaps representing
>>>> dirtied IOVAs by IOMMU (via IOMMUFD) and PCI Devices (via vfio-pci).
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>>>> ---
>>>>  drivers/vfio/Makefile       |   6 +-
>>>>  drivers/vfio/iova_bitmap.c  | 224 ++++++++++++++++++++++++++++++++++++
>>>>  include/linux/iova_bitmap.h | 189 ++++++++++++++++++++++++++++++
>>>>  3 files changed, 417 insertions(+), 2 deletions(-)
>>>>  create mode 100644 drivers/vfio/iova_bitmap.c
>>>>  create mode 100644 include/linux/iova_bitmap.h
>>>>
>>>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>>>> index 1a32357592e3..1d6cad32d366 100644
>>>> --- a/drivers/vfio/Makefile
>>>> +++ b/drivers/vfio/Makefile
>>>> @@ -1,9 +1,11 @@
>>>>  # SPDX-License-Identifier: GPL-2.0
>>>>  vfio_virqfd-y := virqfd.o
>>>>  
>>>> -vfio-y += vfio_main.o
>>>> -
>>>>  obj-$(CONFIG_VFIO) += vfio.o
>>>> +
>>>> +vfio-y := vfio_main.o \
>>>> +          iova_bitmap.o \
>>>> +
>>>>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
>>>>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>>>>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>>>> diff --git a/drivers/vfio/iova_bitmap.c b/drivers/vfio/iova_bitmap.c
>>>> new file mode 100644
>>>> index 000000000000..6b6008ef146c
>>>> --- /dev/null
>>>> +++ b/drivers/vfio/iova_bitmap.c
>>>> @@ -0,0 +1,224 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (c) 2022, Oracle and/or its affiliates.
>>>> + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>>>> + */
>>>> +#include <linux/iova_bitmap.h>
>>>> +#include <linux/highmem.h>
>>>> +
>>>> +#define BITS_PER_PAGE (PAGE_SIZE * BITS_PER_BYTE)
>>>> +
>>>> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter);
>>>> +
>>>> +/*
>>>> + * Converts a relative IOVA to a bitmap index.
>>>> + * The bitmap is viewed an array of u64, and each u64 represents  
>>>
>>> "The bitmap is viewed as an u64 array and each u64 represents"
>>>   
>> Will use that.
>>
>>>> + * a range of IOVA, and the whole pinned pages to the range window.  
>>>
>>> I think this phrase after the comma is trying to say something about the
>>> windowed mapping, but I don't know what.
>>>   
>> Yes. doesn't add much in the context of the function.
>>
>>> This function provides the index into that u64 array for a given IOVA
>>> offset.
>>>   
>> I'll use this instead.
>>
>>>> + * Relative IOVA means relative to the iter::dirty base IOVA (stored
>>>> + * in dirty::iova). All computations in this file are done using
>>>> + * relative IOVAs and thus avoid an extra subtraction against
>>>> + * dirty::iova. The user API iova_bitmap_set() always uses a regular
>>>> + * absolute IOVAs.  
>>>
>>> So why don't we use variables that make it clear when an IOVA is an
>>> IOVA and when it's an offset?
>>>   
>> I was just sticking the name @offset to how we iterate towards the u64s
>> to avoid confusion. Should I switch to offset here I should probably change
>> @offset of the struct into something else. But I see you suggested something
>> like that too further below.
>>
>>>> + */
>>>> +static unsigned long iova_bitmap_iova_to_index(struct iova_bitmap_iter *iter,
>>>> +					       unsigned long iova)  
>>>
>>> iova_bitmap_offset_to_index(... unsigned long offset)?
>>>   
>> OK.
>>
>>>> +{
>>>> +	unsigned long pgsize = 1 << iter->dirty.pgshift;
>>>> +
>>>> +	return iova / (BITS_PER_TYPE(*iter->data) * pgsize);  
>>>
>>> Why do we name the bitmap "data" rather than "bitmap"?
>>>   
>> I was avoid overusing the word bitmap given structure is already called @bitmap.
>> At the end of the day it's a user data pointer. But I can call it @bitmap.
> 
> @data is not very descriptive, and finding a pointer to a bitmap inside
> a struct iova_bitmap feels like a fairly natural thing to me ;)
> 
OK

>>> Why do we call the mapped section "dirty" rather than "mapped"?  It's
>>> not necessarily dirty, it's just the window that's current mapped.
>>>   
>> Good point. Dirty is just what we tracked, but the structure ::dirty is closer
>> to representing what's actually mapped yes. I'll switch to mapped.
>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * Converts a bitmap index to a *relative* IOVA.
>>>> + */
>>>> +static unsigned long iova_bitmap_index_to_iova(struct iova_bitmap_iter *iter,
>>>> +					       unsigned long index)  
>>>
>>> iova_bitmap_index_to_offset()?
>>>   
>> ack
>>
>>>> +{
>>>> +	unsigned long pgshift = iter->dirty.pgshift;
>>>> +
>>>> +	return (index * BITS_PER_TYPE(*iter->data)) << pgshift;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Pins the bitmap user pages for the current range window.
>>>> + * This is internal to IOVA bitmap and called when advancing the
>>>> + * iterator.
>>>> + */
>>>> +static int iova_bitmap_iter_get(struct iova_bitmap_iter *iter)
>>>> +{
>>>> +	struct iova_bitmap *dirty = &iter->dirty;
>>>> +	unsigned long npages;
>>>> +	u64 __user *addr;
>>>> +	long ret;
>>>> +
>>>> +	/*
>>>> +	 * @offset is the cursor of the currently mapped u64 words  
>>>
>>> So it's an index?  I don't know what a cursor is.    
>>
>> In my "english" 'cursor' as a synonym for index yes.
>>
>>> If we start using
>>> "offset" to describe a relative iova, maybe this becomes something more
>>> descriptive, mapped_base_index?
>>>   
>> I am not very fond of long names, @mapped_index maybe hmm
>>
>>>> +	 * that we have access. And it indexes u64 bitmap word that is
>>>> +	 * mapped. Anything before @offset is not mapped. The range
>>>> +	 * @offset .. @count is mapped but capped at a maximum number
>>>> +	 * of pages.  
>>>
>>> @total_indexes rather than @count maybe?
>>>   
>> It's still a count of indexes, I thought @count was explicit already without
>> being too wordy. I can suffix with indexes if going with mapped_index. Or maybe
>> @mapped_count maybe
> 
> I was trying to get "index" in there somehow to make it stupid obvious
> that it's a count of indexes.
> 
OK, @total_index{es} then (probably drop the plural to keep the suffix naming
convention consistent, since they are related)

>>>> +	 */
>>>> +	npages = DIV_ROUND_UP((iter->count - iter->offset) *
>>>> +			      sizeof(*iter->data), PAGE_SIZE);
>>>> +
>>>> +	/*
>>>> +	 * We always cap at max number of 'struct page' a base page can fit.
>>>> +	 * This is, for example, on x86 means 2M of bitmap data max.
>>>> +	 */
>>>> +	npages = min(npages,  PAGE_SIZE / sizeof(struct page *));
>>>> +	addr = iter->data + iter->offset;  
>>>
>>> Subtle pointer arithmetic.
>>>   
>>>> +	ret = pin_user_pages_fast((unsigned long)addr, npages,
>>>> +				  FOLL_WRITE, dirty->pages);
>>>> +	if (ret <= 0)
>>>> +		return -EFAULT;
>>>> +
>>>> +	dirty->npages = (unsigned long)ret;
>>>> +	/* Base IOVA where @pages point to i.e. bit 0 of the first page */
>>>> +	dirty->iova = iova_bitmap_iova(iter);  
>>>
>>> If we're operating on an iterator, wouldn't convention suggest this is
>>> an iova_bitmap_itr_FOO function?  mapped_iova perhaps.
>>>   
>>
>> Yes.
>>
>> Given your earlier comment, mapped iova is a bit more obvious.
>>
>>>> +
>>>> +	/*
>>>> +	 * offset of the page where pinned pages bit 0 is located.
>>>> +	 * This handles the case where the bitmap is not PAGE_SIZE
>>>> +	 * aligned.
>>>> +	 */
>>>> +	dirty->start_offset = offset_in_page(addr);  
>>>
>>> Maybe pgoff to avoid confusion with relative iova offsets.
>>>   
>> Will fix. And it's also convention in mm code, so I should stick with that.
>>
>>> It seems suspect that the length calculations don't take this into
>>> account.
>>>   
>> The iova/length/indexes functions only work over bit/iova "quantity" and indexing of it
>> without needing to know where the first bit of the mapped range starts. So the pgoff
>> is only important when we actually set bits on the bitmap i.e. iova_bitmap_set().
> 
> Ok
> 
>>>> +	return 0;
>>>> +}
>>>> +
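
Going back to the start_offset/pgoff point above, here is a rough sketch
(not the patch code) of how the set path could fold the sub-page offset
in, so it only matters when bits are actually written. Field names follow
the renames discussed in this thread and are illustrative:

static unsigned long iova_bitmap_set_sketch(struct iova_bitmap *mapped,
					    unsigned long iova,
					    unsigned long length)
{
	/*
	 * Bit index of @iova relative to the mapped window, with the
	 * byte offset of the first pinned page folded in as bits.
	 */
	unsigned long cur_bit = ((iova - mapped->iova) >> mapped->pgshift) +
				mapped->pgoff * BITS_PER_BYTE;
	unsigned long last_bit = (((iova + length - 1) - mapped->iova) >>
				 mapped->pgshift) + mapped->pgoff * BITS_PER_BYTE;

	do {
		unsigned int page_idx = cur_bit / BITS_PER_PAGE;
		unsigned int offset = cur_bit % BITS_PER_PAGE;
		unsigned long nbits = min(BITS_PER_PAGE - offset,
					  last_bit - cur_bit + 1);
		void *kaddr;

		kaddr = kmap_local_page(mapped->pages[page_idx]);
		bitmap_set(kaddr, offset, nbits);
		kunmap_local(kaddr);
		cur_bit += nbits;
	} while (cur_bit <= last_bit);

	return length;
}
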
>>>> +/*
>>>> + * Unpins the bitmap user pages and clears @npages
>>>> + * (un)pinning is abstracted from API user and it's done
>>>> + * when advancing or freeing the iterator.
>>>> + */
>>>> +static void iova_bitmap_iter_put(struct iova_bitmap_iter *iter)
>>>> +{
>>>> +	struct iova_bitmap *dirty = &iter->dirty;
>>>> +
>>>> +	if (dirty->npages) {
>>>> +		unpin_user_pages(dirty->pages, dirty->npages);
>>>> +		dirty->npages = 0;
>>>> +	}
>>>> +}
>>>> +
>>>> +int iova_bitmap_iter_init(struct iova_bitmap_iter *iter,
>>>> +			  unsigned long iova, unsigned long length,
>>>> +			  unsigned long page_size, u64 __user *data)
>>>> +{
>>>> +	struct iova_bitmap *dirty = &iter->dirty;
>>>> +
>>>> +	memset(iter, 0, sizeof(*iter));
>>>> +	dirty->pgshift = __ffs(page_size);
>>>> +	iter->data = data;
>>>> +	iter->count = iova_bitmap_iova_to_index(iter, length - 1) + 1;
>>>> +	iter->iova = iova;
>>>> +	iter->length = length;
>>>> +
>>>> +	dirty->iova = iova;
>>>> +	dirty->pages = (struct page **)__get_free_page(GFP_KERNEL);
>>>> +	if (!dirty->pages)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	return iova_bitmap_iter_get(iter);
>>>> +}
>>>> +
>>>> +void iova_bitmap_iter_free(struct iova_bitmap_iter *iter)
>>>> +{
>>>> +	struct iova_bitmap *dirty = &iter->dirty;
>>>> +
>>>> +	iova_bitmap_iter_put(iter);
>>>> +
>>>> +	if (dirty->pages) {
>>>> +		free_page((unsigned long)dirty->pages);
>>>> +		dirty->pages = NULL;
>>>> +	}
>>>> +
>>>> +	memset(iter, 0, sizeof(*iter));
>>>> +}
>>>> +
>>>> +unsigned long iova_bitmap_iova(struct iova_bitmap_iter *iter)
>>>> +{
>>>> +	unsigned long skip = iter->offset;
>>>> +
>>>> +	return iter->iova + iova_bitmap_index_to_iova(iter, skip);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Returns the remaining bitmap indexes count to process for the currently pinned
>>>> + * bitmap pages.
>>>> + */
>>>> +static unsigned long iova_bitmap_iter_remaining(struct iova_bitmap_iter *iter)  
>>>
>>> iova_bitmap_iter_mapped_remaining()?
>>>   
>> Yes.
>>
>>>> +{
>>>> +	unsigned long remaining = iter->count - iter->offset;
>>>> +
>>>> +	remaining = min_t(unsigned long, remaining,
>>>> +		     (iter->dirty.npages << PAGE_SHIFT) / sizeof(*iter->data));
>>>> +
>>>> +	return remaining;
>>>> +}
>>>> +
>>>> +unsigned long iova_bitmap_length(struct iova_bitmap_iter *iter)  
>>>
>>> iova_bitmap_iter_mapped_length()?
>>>   
>> Yes.
>>
>> I don't particularly like long names, but doesn't seem to have better alternatives.
>>
>> Part of the reason the names look 'shortened' was because the object we pass
>> is already an iterator, so it's implicit that we only fetch the under-iteration/mapped
>> iova. Or that was at least the intention.
> 
> Yes, but that means you need to look at the function declaration to
> know that it takes an iova_bitmap_iter rather than an iova_bitmap,
> which already means it's not intuitive enough.
> 
OK

>>> Maybe it doesn't really make sense to differentiate the iterator from
>>> the bitmap in the API.  In fact, couldn't we reduce the API to simply:
>>>
>>> int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
>>> 		     size_t length, size_t page_size, u64 __user *data);
>>>
>>> int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
>>> 			 int (*fn)(void *data, dma_addr_t iova,
>>> 			 	   size_t length,
>>> 				   struct iova_bitmap *bitmap));
>>>
>>> void iova_bitmap_free(struct iova_bitmap *bitmap);
>>>
>>> unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
>>> 			      dma_addr_t iova, size_t length);
>>>
>>> Removes the need for the API to have done, advance, iova, and length
>>> functions.
>>>   
>> True, it would be simpler.
>>
>> Could also allow us to hide the iterator details enterily and switch to
>> container_of() from iova_bitmap pointer. Though, from caller, it would be
>> weird to do:
>>
>> struct iova_bitmap_iter iter;
>>
>> iova_bitmap_init(&iter.dirty, ....);
>>
>> Hmm, maybe not that strange.
>>
>> Unless you are trying to suggest to merge both struct iova_bitmap and
>> iova_bitmap_iter together? I was trying to keep them separate more for
>> the dirty tracker (IOMMUFD/VFIO, to just be limited to iova_bitmap_set()
>> with the generic infra being the one managing that iterator state in a
>> separate structure.
> 
> Not suggesting they be merged, but why does the embedded mapping
> structure need to be exposed to the API?  That's an implementation
> detail that's causing confusion and naming issues for which structure
> is passed and how do we represent that in the function name.  Thanks,

I wanted the convention to be that the end 'device' tracker (IOMMU or VFIO
vendor driver) does not have "direct" access to the iterator state. So it
accesses or modifies only the mapped bitmap *data*. The hardware tracker is
always *provided* with an iova_bitmap to set bits, but it should not allocate,
iterate or pin anything, making things simpler for the tracker.

Thus the point was to have a clear division between how you iterate
(iova_bitmap_iter* API) and the actual bits manipulation (so far only
iova_bitmap_set()), including which data structures you access in the APIs,
embedding the least accessed there (struct iova_bitmap).

The alternative is to reverse it and just allocate the iter state in
iova_bitmap_init() and have it stored as a pointer, say as iova_bitmap::iter.
That encapsulates both and mixes the structures, which isn't as clean, but
maybe that's not as big of a deal as I thought it would be.
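
To make the split concrete, a sketch using the iterator API from the
commit message (the tracker callback below is made up for illustration,
and the argument order follows the iova_bitmap_iter_init() prototype):

/* device/IOMMU tracker side: only ever sees the mapped iova_bitmap */
static void my_tracker_report(struct iova_bitmap *dirty,
			      unsigned long iova, unsigned long length)
{
	unsigned long pgsize = 4096;	/* illustrative page size */

	/* mark the first and last pages of this mapped window as dirty */
	iova_bitmap_set(dirty, iova, pgsize);
	if (length >= 2 * pgsize)
		iova_bitmap_set(dirty, iova + length - pgsize, pgsize);
}

/* core side: owns the iterator, i.e. allocation, pinning, advancing */
static int report_all(unsigned long base_iova, unsigned long length,
		      unsigned long page_size, u64 __user *data)
{
	struct iova_bitmap_iter iter;
	int ret;

	ret = iova_bitmap_iter_init(&iter, base_iova, length, page_size, data);
	if (ret)
		return ret;

	for (; !iova_bitmap_iter_done(&iter) && !ret;
	     ret = iova_bitmap_iter_advance(&iter))
		my_tracker_report(&iter.dirty, iova_bitmap_iova(&iter),
				  iova_bitmap_length(&iter));

	iova_bitmap_iter_free(&iter);
	return ret;
}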

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-26  9:37         ` Joao Martins
@ 2022-08-26 12:02           ` Alex Williamson
  2022-08-26 12:10             ` Joao Martins
  2022-08-26 13:01           ` Jason Gunthorpe
  1 sibling, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-26 12:02 UTC (permalink / raw)
  To: Joao Martins
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On Fri, 26 Aug 2022 10:37:26 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:
> On 8/26/22 00:15, Alex Williamson wrote:
> > On Thu, 25 Aug 2022 23:24:39 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 8/25/22 20:27, Alex Williamson wrote:  
> >>> Maybe it doesn't really make sense to differentiate the iterator from
> >>> the bitmap in the API.  In fact, couldn't we reduce the API to simply:
> >>>
> >>> int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
> >>> 		     size_t length, size_t page_size, u64 __user *data);
> >>>
> >>> int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
> >>> 			 int (*fn)(void *data, dma_addr_t iova,
> >>> 			 	   size_t length,
> >>> 				   struct iova_bitmap *bitmap));
> >>>
> >>> void iova_bitmap_free(struct iova_bitmap *bitmap);
> >>>
> >>> unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
> >>> 			      dma_addr_t iova, size_t length);
> >>>
> >>> Removes the need for the API to have done, advance, iova, and length
> >>> functions.
> >>>     
> >> True, it would be simpler.
> >>
> >> Could also allow us to hide the iterator details enterily and switch to
> >> container_of() from iova_bitmap pointer. Though, from caller, it would be
> >> weird to do:
> >>
> >> struct iova_bitmap_iter iter;
> >>
> >> iova_bitmap_init(&iter.dirty, ....);
> >>
> >> Hmm, maybe not that strange.
> >>
> >> Unless you are trying to suggest to merge both struct iova_bitmap and
> >> iova_bitmap_iter together? I was trying to keep them separate more for
> >> the dirty tracker (IOMMUFD/VFIO, to just be limited to iova_bitmap_set()
> >> with the generic infra being the one managing that iterator state in a
> >> separate structure.  
> > 
> > Not suggesting they be merged, but why does the embedded mapping
> > structure need to be exposed to the API?  That's an implementation
> > detail that's causing confusion and naming issues for which structure
> > is passed and how do we represent that in the function name.  Thanks,  
> 
> I wanted the convention to be that the end 'device' tracker (IOMMU or VFIO
> vendor driver) does not have "direct" access to the iterator state. So it
> accesses or modifies only the mapped bitmap *data*. The hardware tracker is
> always *provided* with an iova_bitmap to set bits, but it should not allocate,
> iterate or pin anything, making things simpler for the tracker.
> 
> Thus the point was to have a clear division between how you iterate
> (iova_bitmap_iter* API) and the actual bits manipulation (so far only
> iova_bitmap_set()), including which data structures you access in the APIs,
> embedding the least accessed there (struct iova_bitmap).
> 
> The alternative is to reverse it and just allocate the iter state in
> iova_bitmap_init() and have it stored as a pointer, say as iova_bitmap::iter.
> That encapsulates both and mixes the structures, which isn't as clean, but
> maybe that's not as big of a deal as I thought it would be.

Is there really a need for struct iova_bitmap to be defined in a shared
header, or could we just have a forward declaration?  With the proposed
interface above, iova_bitmap could be opaque to the caller if it were
dynamically allocated, ex:

struct iova_bitmap* iova_bitmap_alloc(dma_addr_t iova, size_t length,
				      size_t page_size, u64 __user *bitmap);

Thanks,
Alex
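
P.S. Not the patch code, just a sketch of what that opaque variant could
look like in the header, with the struct definition kept private to
drivers/vfio/iova_bitmap.c:

/* include/linux/iova_bitmap.h */
struct iova_bitmap;	/* opaque, defined only in iova_bitmap.c */

struct iova_bitmap *iova_bitmap_alloc(dma_addr_t iova, size_t length,
				      size_t page_size, u64 __user *bitmap);
void iova_bitmap_free(struct iova_bitmap *bitmap);
int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
			 int (*fn)(void *data, dma_addr_t iova,
				   size_t length,
				   struct iova_bitmap *bitmap));
unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
			      dma_addr_t iova, size_t length);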


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-26 12:02           ` Alex Williamson
@ 2022-08-26 12:10             ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2022-08-26 12:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, jgg, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On 8/26/22 13:02, Alex Williamson wrote:
> On Fri, 26 Aug 2022 10:37:26 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 8/26/22 00:15, Alex Williamson wrote:
>>> On Thu, 25 Aug 2022 23:24:39 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 8/25/22 20:27, Alex Williamson wrote:  
>>>>> Maybe it doesn't really make sense to differentiate the iterator from
>>>>> the bitmap in the API.  In fact, couldn't we reduce the API to simply:
>>>>>
>>>>> int iova_bitmap_init(struct iova_bitmap *bitmap, dma_addr_t iova,
>>>>> 		     size_t length, size_t page_size, u64 __user *data);
>>>>>
>>>>> int iova_bitmap_for_each(struct iova_bitmap *bitmap, void *data,
>>>>> 			 int (*fn)(void *data, dma_addr_t iova,
>>>>> 			 	   size_t length,
>>>>> 				   struct iova_bitmap *bitmap));
>>>>>
>>>>> void iova_bitmap_free(struct iova_bitmap *bitmap);
>>>>>
>>>>> unsigned long iova_bitmap_set(struct iova_bitmap *bitmap,
>>>>> 			      dma_addr_t iova, size_t length);
>>>>>
>>>>> Removes the need for the API to have done, advance, iova, and length
>>>>> functions.
>>>>>     
>>>> True, it would be simpler.
>>>>
>>>> Could also allow us to hide the iterator details enterily and switch to
>>>> container_of() from iova_bitmap pointer. Though, from caller, it would be
>>>> weird to do:
>>>>
>>>> struct iova_bitmap_iter iter;
>>>>
>>>> iova_bitmap_init(&iter.dirty, ....);
>>>>
>>>> Hmm, maybe not that strange.
>>>>
>>>> Unless you are trying to suggest to merge both struct iova_bitmap and
>>>> iova_bitmap_iter together? I was trying to keep them separate more for
>>>> the dirty tracker (IOMMUFD/VFIO, to just be limited to iova_bitmap_set()
>>>> with the generic infra being the one managing that iterator state in a
>>>> separate structure.  
>>>
>>> Not suggesting they be merged, but why does the embedded mapping
>>> structure need to be exposed to the API?  That's an implementation
>>> detail that's causing confusion and naming issues for which structure
>>> is passed and how do we represent that in the function name.  Thanks,  
>>
>> I wanted the convention to be that the end 'device' tracker (IOMMU or VFIO
>> vendor driver) does not have "direct" access to the iterator state. So it
>> accesses or modifies only the mapped bitmap *data*. The hardware tracker is
>> always *provided* with an iova_bitmap to set bits, but it should not allocate,
>> iterate or pin anything, making things simpler for the tracker.
>>
>> Thus the point was to have a clear division between how you iterate
>> (iova_bitmap_iter* API) and the actual bits manipulation (so far only
>> iova_bitmap_set()), including which data structures you access in the APIs,
>> embedding the least accessed there (struct iova_bitmap).
>>
>> The alternative is to reverse it and just allocate the iter state in
>> iova_bitmap_init() and have it stored as a pointer, say as iova_bitmap::iter.
>> That encapsulates both and mixes the structures, which isn't as clean, but
>> maybe that's not as big of a deal as I thought it would be.
> 
> Is there really a need for struct iova_bitmap to be defined in a shared
> header, or could we just have a forward declaration?  With the proposed
> interface above, iova_bitmap could be opaque to the caller if it were
> dynamically allocated, ex:
> 

/facepalm Oh yes -- even better! Let me try that along with the other comments.

> struct iova_bitmap* iova_bitmap_alloc(dma_addr_t iova, size_t length,
> 				      size_t page_size, u64 __user *bitmap);
> 
> Thanks,
> Alex
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
  2022-08-25 22:46       ` Alex Williamson
@ 2022-08-26 12:52         ` Jason Gunthorpe
  2022-08-28 13:29           ` Yishai Hadas
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-26 12:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Joao Martins, Yishai Hadas, saeedm, kvm, netdev, kuba, kevin.tian,
	leonro, maorg, cohuck

On Thu, Aug 25, 2022 at 04:46:51PM -0600, Alex Williamson wrote:
> On Thu, 25 Aug 2022 23:26:04 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > On 8/25/22 21:49, Alex Williamson wrote:
> > > On Mon, 15 Aug 2022 18:11:04 +0300
> > > Yishai Hadas <yishaih@nvidia.com> wrote:  
> > >> +static int
> > >> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
> > >> +					 u32 flags, void __user *arg,
> > >> +					 size_t argsz)
> > >> +{
> > >> +	size_t minsz =
> > >> +		offsetofend(struct vfio_device_feature_dma_logging_report,
> > >> +			    bitmap);
> > >> +	struct vfio_device_feature_dma_logging_report report;
> > >> +	struct iova_bitmap_iter iter;
> > >> +	int ret;
> > >> +
> > >> +	if (!device->log_ops)
> > >> +		return -ENOTTY;
> > >> +
> > >> +	ret = vfio_check_feature(flags, argsz,
> > >> +				 VFIO_DEVICE_FEATURE_GET,
> > >> +				 sizeof(report));
> > >> +	if (ret != 1)
> > >> +		return ret;
> > >> +
> > >> +	if (copy_from_user(&report, arg, minsz))
> > >> +		return -EFAULT;
> > >> +
> > >> +	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))  
> > > 
> > > Why is PAGE_SIZE a factor here?  I'm under the impression that
> > > iova_bitmap is intended to handle arbitrary page sizes.  Thanks,  
> > 
> > Arbritary page sizes ... which are powers of 2. We use page shift in iova bitmap.
> > While it's not hard to lose this restriction (trading a shift over a slower mul)
> > ... I am not sure it is worth supporting said use considering that there aren't
> > non-powers of 2 page sizes right now?
> > 
> > The PAGE_SIZE restriction might be that it's supposed to be the lowest possible page_size.
> 
> Sorry, I was unclear.  Size relative to PAGE_SIZE was my only question,
> not that we shouldn't require power of 2 sizes.  We're adding device
> level dirty tracking, where the device page size granularity might be
> 4K on a host with a CPU 64K page size.  Maybe there's a use case for
> that.  Given the flexibility claimed by the iova_bitmap support,
> requiring reported page size less than system PAGE_SIZE seems
> unjustified.  Thanks,

+1

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-25 19:27   ` Alex Williamson
  2022-08-25 22:24     ` Joao Martins
@ 2022-08-26 12:58     ` Jason Gunthorpe
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-26 12:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yishai Hadas, saeedm, kvm, netdev, kuba, kevin.tian,
	joao.m.martins, leonro, maorg, cohuck

On Thu, Aug 25, 2022 at 01:27:01PM -0600, Alex Williamson wrote:

> > +	/* length of the IOVA range for the whole bitmap */
> > +	unsigned long length;
> 
> OTOH this could arguably be size_t and iova dma_addr_t.  Thanks,

iova, for the purposes of iommu, is always unsigned long:

	int (*map)(struct iommu_domain *domain, unsigned long iova,
		   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);

The use of dma_addr_t is something vfio does, which is sort of
problematic because vfio also assumes that dma_addr_t can safely be
implicitly cast to unsigned long, and on some 32 bit configurations
this is not true.

As this is intended to move to drivers/iommu some day, it should
remain aligned to the iommu scheme.

Also make sure there are proper checks when casting from u64 at the uAPI
boundary to unsigned long internally, so that the user-provided u64
doesn't overflow the unsigned long.
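
For instance (sketch only, and assuming the report struct carries __u64
iova/length fields as in the uAPI patch):

	if (report.iova > ULONG_MAX || report.length > ULONG_MAX ||
	    report.iova + report.length > ULONG_MAX ||
	    report.iova + report.length < report.iova)
		return -EINVAL;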

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support
  2022-08-26  9:37         ` Joao Martins
  2022-08-26 12:02           ` Alex Williamson
@ 2022-08-26 13:01           ` Jason Gunthorpe
  1 sibling, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-26 13:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: Alex Williamson, Yishai Hadas, saeedm, kvm, netdev, kuba,
	kevin.tian, leonro, maorg, cohuck

On Fri, Aug 26, 2022 at 10:37:26AM +0100, Joao Martins wrote:

> >> It's already wrapped (by my editor) and also at 75 columns. I can do a
> >> bit shorter if that's hurting readability.
> > 
> > 78 chars above, but git log indents by another 4 spaces, so they do
> > wrap.  Something around 70/72 seems better for commit logs.
> > 
> OK, I'll wrap at 70.

We have a documented standard for this:

Documentation/process/submitting-patches.rst:

The canonical patch format
--------------------------
  - The body of the explanation, line wrapped at 75 columns, which will
    be copied to the permanent changelog to describe this patch.

Please follow it - it always bugs me when people randomly choose to
wrap at something a lot less than 75 columns for some reason.

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support
  2022-08-26 12:52         ` Jason Gunthorpe
@ 2022-08-28 13:29           ` Yishai Hadas
  0 siblings, 0 replies; 26+ messages in thread
From: Yishai Hadas @ 2022-08-28 13:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Joao Martins, saeedm, kvm, netdev, kuba, kevin.tian, leonro,
	maorg, cohuck

On 26/08/2022 15:52, Jason Gunthorpe wrote:
> On Thu, Aug 25, 2022 at 04:46:51PM -0600, Alex Williamson wrote:
>> On Thu, 25 Aug 2022 23:26:04 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> On 8/25/22 21:49, Alex Williamson wrote:
>>>> On Mon, 15 Aug 2022 18:11:04 +0300
>>>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>>>> +static int
>>>>> +vfio_ioctl_device_feature_logging_report(struct vfio_device *device,
>>>>> +					 u32 flags, void __user *arg,
>>>>> +					 size_t argsz)
>>>>> +{
>>>>> +	size_t minsz =
>>>>> +		offsetofend(struct vfio_device_feature_dma_logging_report,
>>>>> +			    bitmap);
>>>>> +	struct vfio_device_feature_dma_logging_report report;
>>>>> +	struct iova_bitmap_iter iter;
>>>>> +	int ret;
>>>>> +
>>>>> +	if (!device->log_ops)
>>>>> +		return -ENOTTY;
>>>>> +
>>>>> +	ret = vfio_check_feature(flags, argsz,
>>>>> +				 VFIO_DEVICE_FEATURE_GET,
>>>>> +				 sizeof(report));
>>>>> +	if (ret != 1)
>>>>> +		return ret;
>>>>> +
>>>>> +	if (copy_from_user(&report, arg, minsz))
>>>>> +		return -EFAULT;
>>>>> +
>>>>> +	if (report.page_size < PAGE_SIZE || !is_power_of_2(report.page_size))
>>>> Why is PAGE_SIZE a factor here?  I'm under the impression that
>>>> iova_bitmap is intended to handle arbitrary page sizes.  Thanks,
>>> Arbritary page sizes ... which are powers of 2. We use page shift in iova bitmap.
>>> While it's not hard to lose this restriction (trading a shift over a slower mul)
>>> ... I am not sure it is worth supporting said use considering that there aren't
>>> non-powers of 2 page sizes right now?
>>>
>>> The PAGE_SIZE restriction might be that it's supposed to be the lowest possible page_size.
>> Sorry, I was unclear.  Size relative to PAGE_SIZE was my only question,
>> not that we shouldn't require power of 2 sizes.  We're adding device
>> level dirty tracking, where the device page size granularity might be
>> 4K on a host with a CPU 64K page size.  Maybe there's a use case for
>> that.  Given the flexibility claimed by the iova_bitmap support,
>> requiring reported page size less than system PAGE_SIZE seems
>> unjustified.  Thanks,
> +1
>
> Jason

OK

So in V5 we may go with the below, as we don't really expect a page size
smaller than 4K.

if (report.page_size < SZ_4K || !is_power_of_2(report.page_size))
         return -EINVAL;

Yishai


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-08-28 13:29 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-15 15:10 [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 01/10] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 02/10] net/mlx5: Query ADV_VIRTUALIZATION capabilities Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 03/10] vfio: Introduce DMA logging uAPIs Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 04/10] vfio: Add an IOVA bitmap support Yishai Hadas
2022-08-25 19:27   ` Alex Williamson
2022-08-25 22:24     ` Joao Martins
2022-08-25 23:15       ` Alex Williamson
2022-08-26  9:37         ` Joao Martins
2022-08-26 12:02           ` Alex Williamson
2022-08-26 12:10             ` Joao Martins
2022-08-26 13:01           ` Jason Gunthorpe
2022-08-26 12:58     ` Jason Gunthorpe
2022-08-15 15:11 ` [PATCH V4 vfio 05/10] vfio: Introduce the DMA logging feature support Yishai Hadas
2022-08-25 20:49   ` Alex Williamson
2022-08-25 22:26     ` Joao Martins
2022-08-25 22:46       ` Alex Williamson
2022-08-26 12:52         ` Jason Gunthorpe
2022-08-28 13:29           ` Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 06/10] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 07/10] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 08/10] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 09/10] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
2022-08-15 15:11 ` [PATCH V4 vfio 10/10] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
2022-08-25 11:13 ` [PATCH V4 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
2022-08-25 19:37   ` Alex Williamson
