Intel-Wired-Lan Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Jacob Keller <jacob.e.keller@intel.com>
To: Intel Wired LAN <intel-wired-lan@lists.osuosl.org>
Cc: Anthony Nguyen <anthony.l.nguyen@intel.com>
Subject: [Intel-wired-lan] [PATCH net-next 05/13] ice: Fix RDMA latency issue by allowing write-combining
Date: Fri, 13 Jan 2023 14:37:27 -0800	[thread overview]
Message-ID: <20230113223735.2514364-6-jacob.e.keller@intel.com> (raw)
In-Reply-To: <20230113223735.2514364-1-jacob.e.keller@intel.com>

The current method of mapping the entire BAR region as a single uncacheable
region does not allow RDMA to use write combining (WC). This results in
increased latency with RDMA.

To fix this, we initially planned to reduce the size of the map made by the
PF driver to include only up to the beginning of the RDMA space.
Unfortunately this will not work in the future as there are some hardware
features which use registers beyond the RDMA area. This includes Scalable
IOV, a virtualization feature being worked on currently.

Instead of simply reducing the size of the map, we need a solution which
will allow access to all areas of the address space while leaving the RDMA
area open to be mapped with write combining.

To allow for this, and fix the RMDA latency issue without blocking the
higher areas of the BAR, we need to create multiple separate memory maps.
Doing so will create a sparse mapping rather than a contiguous single area.

Replace the void *hw_addr with a special ice_hw_addr structure which
represents the multiple mappings as a flexible array.

Based on the available BAR size, map up to 3 regions:

 * The space before the RDMA section
 * The RDMA section which wants write combining behavior
 * The space after the RDMA section

Add an ice_get_hw_addr function which converts a register offset into the
appropriate kernel address based on which chunk it falls into. This does
cost us slightly more computation overhead for register access as we now
must check the table each access. However, we can pre-compute the addresses
where this would most be a problem.

With this change, the RDMA driver is now free to map the RDMA register
section as write-combined without impacting access to other device
registers used by the main PF driver.

Reported-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h         |   4 +-
 drivers/net/ethernet/intel/ice/ice_base.c    |   5 +-
 drivers/net/ethernet/intel/ice/ice_ethtool.c |   3 +-
 drivers/net/ethernet/intel/ice/ice_main.c    | 176 +++++++++++++++++--
 drivers/net/ethernet/intel/ice/ice_osdep.h   |  48 ++++-
 drivers/net/ethernet/intel/ice/ice_txrx.h    |   2 +-
 drivers/net/ethernet/intel/ice/ice_type.h    |   2 +-
 7 files changed, 217 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 51a1a89f7b5a..cd81974822cc 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -75,7 +75,9 @@
 #include "ice_vsi_vlan_ops.h"
 #include "ice_gnss.h"
 
-#define ICE_BAR0		0
+#define ICE_BAR0			0
+#define ICE_BAR_RDMA_WC_START		0x0800000
+#define ICE_BAR_RDMA_WC_END		0x1000000
 #define ICE_REQ_DESC_MULTIPLE	32
 #define ICE_MIN_NUM_DESC	64
 #define ICE_MAX_NUM_DESC	8160
diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 554095b25f44..332d5a1b326c 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -480,7 +480,7 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
 	ring->rx_offset = ice_rx_offset(ring);
 
 	/* init queue specific tail register */
-	ring->tail = hw->hw_addr + QRX_TAIL(pf_q);
+	ring->tail = ice_get_hw_addr(hw, QRX_TAIL(pf_q));
 	writel(0, ring->tail);
 
 	return 0;
@@ -790,8 +790,7 @@ ice_vsi_cfg_txq(struct ice_vsi *vsi, struct ice_tx_ring *ring,
 	/* init queue specific tail reg. It is referred as
 	 * transmit comm scheduler queue doorbell.
 	 */
-	ring->tail = hw->hw_addr + QTX_COMM_DBELL(pf_q);
-
+	ring->tail = ice_get_hw_addr(hw, QTX_COMM_DBELL(pf_q));
 	if (IS_ENABLED(CONFIG_DCB))
 		tc = ring->dcb_tc;
 	else
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 263d59929602..231fda1c4513 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -3085,7 +3085,8 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* this is to allow wr32 to have something to write to
 		 * during early allocation of Rx buffers
 		 */
-		rx_rings[i].tail = vsi->back->hw.hw_addr + PRTGEN_STATUS;
+		rx_rings[i].tail = ice_get_hw_addr(&vsi->back->hw,
+						   PRTGEN_STATUS);
 
 		err = ice_setup_rx_ring(&rx_rings[i]);
 		if (err)
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 29cd77dd3812..5ddd9fe7514f 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -596,6 +596,162 @@ ice_prepare_for_reset(struct ice_pf *pf, enum ice_reset_req reset_type)
 	set_bit(ICE_PREPARED_FOR_RESET, pf->state);
 }
 
+/**
+ * ice_get_hw_addr - Get memory address for a given device register
+ * @hw: pointer to the HW struct
+ * @reg: the register to get address of
+ *
+ * Convert a register offset into the appropriate memory mapped kernel
+ * address.
+ *
+ * Returns the pointer address or an ERR_PTR on failure.
+ */
+void __iomem *ice_get_hw_addr(struct ice_hw *hw, resource_size_t reg)
+{
+	struct ice_hw_addr *hw_addr = (struct ice_hw_addr *)hw->hw_addr;
+	struct ice_hw_addr_map *map;
+	unsigned int i;
+
+	if (WARN_ON(!hw_addr))
+		return (void __iomem *)ERR_PTR(-EIO);
+
+	for (i = 0, map = hw_addr->maps; i < hw_addr->nr; i++, map++)
+		if (reg >= map->start && reg < map->end)
+			return (u8 __iomem *)map->addr + (reg - map->start);
+
+	WARN_ONCE(1, "Unable to map register address 0x%0llx to kernel address",
+		  reg);
+
+	return (void __iomem *)ERR_PTR(-EFAULT);
+}
+
+/**
+ * ice_map_hw_addr - map a region of device registers to memory
+ * @pdev: the PCI device
+ * @map: the address map structure
+ *
+ * Map the specified section of the hardware registers into memory, storing
+ * the memory mapped address in the provided structure.
+ *
+ * Returns 0 on success or an error code on failure.
+ */
+static int ice_map_hw_addr(struct pci_dev *pdev, struct ice_hw_addr_map *map)
+{
+	struct device *dev = &pdev->dev;
+	resource_size_t size, base;
+	void __iomem *addr;
+
+	if (WARN_ON(map->end <= map->start))
+		return -EIO;
+
+	size = map->end - map->start;
+	base = pci_resource_start(pdev, map->bar) + map->start;
+	addr = ioremap(base, size);
+	if (!addr) {
+		dev_err(dev, "%s: remap at offset %llu failed\n",
+			__func__, map->start);
+		return -EIO;
+	}
+
+	map->addr = addr;
+
+	return 0;
+}
+
+/**
+ * ice_map_all_hw_addr - Request and map PCI BAR memory
+ * @pf: pointer to the PF structure
+ *
+ * Request and reserve all PCI BAR regions. Memory map chunks of the PCI BAR
+ * 0 into a sparse memory map to allow the RDMA region to be mapped with write
+ * combining.
+ *
+ * Returns 0 on success or an error code on failure.
+ */
+static int ice_map_all_hw_addr(struct ice_pf *pf)
+{
+	struct pci_dev *pdev = pf->pdev;
+	struct device *dev = &pdev->dev;
+	struct ice_hw_addr *hw_addr;
+	resource_size_t bar_len;
+	unsigned int nr_maps;
+	int err;
+
+	bar_len = pci_resource_len(pdev, 0);
+	if (bar_len > ICE_BAR_RDMA_WC_END)
+		nr_maps = 2;
+	else
+		nr_maps = 1;
+
+	hw_addr = kzalloc(struct_size(hw_addr, maps, nr_maps), GFP_KERNEL);
+	if (!hw_addr)
+		return -ENOMEM;
+
+	hw_addr->nr = nr_maps;
+
+	err = pci_request_mem_regions(pdev, dev_driver_string(dev));
+	if (err) {
+		dev_err(dev, "pci_request_mem_regions failed, err %pe\n",
+			ERR_PTR(err));
+		goto err_free_hw_addr;
+	}
+
+	/* Map the start of the BAR as uncachable */
+	hw_addr->maps[0].bar = 0;
+	hw_addr->maps[0].start = 0;
+	hw_addr->maps[0].end = min_t(resource_size_t, bar_len,
+				     ICE_BAR_RDMA_WC_START);
+	err = ice_map_hw_addr(pdev, &hw_addr->maps[0]);
+	if (err)
+		goto err_release_mem_regions;
+
+	/* Map everything past the RDMA section as uncachable */
+	if (nr_maps > 1) {
+		hw_addr->maps[1].bar = 0;
+		hw_addr->maps[1].start = ICE_BAR_RDMA_WC_END;
+		hw_addr->maps[1].end = bar_len;
+		err = ice_map_hw_addr(pdev, &hw_addr->maps[1]);
+		if (err)
+			goto err_unmap_bar_start;
+	}
+
+	pf->hw.hw_addr = (typeof(pf->hw.hw_addr))hw_addr;
+
+	return 0;
+
+err_unmap_bar_start:
+	iounmap(hw_addr->maps[0].addr);
+err_release_mem_regions:
+	pci_release_mem_regions(pdev);
+err_free_hw_addr:
+	kfree(hw_addr);
+
+	return err;
+}
+
+/**
+ * ice_unmap_all_hw_addr - Release device register memory maps
+ * @pf: pointer to the PF structure
+ *
+ * Release all PCI memory maps and regions.
+ */
+static void ice_unmap_all_hw_addr(struct ice_pf *pf)
+{
+	struct ice_hw_addr *hw_addr = (struct ice_hw_addr *)pf->hw.hw_addr;
+	struct pci_dev *pdev = pf->pdev;
+	unsigned int i;
+
+	if (WARN_ON(!hw_addr))
+		return;
+
+	pf->hw.hw_addr = NULL;
+	for (i = 0; i < hw_addr->nr; i++)
+		iounmap(hw_addr->maps[i].addr);
+	kfree(hw_addr);
+
+	pci_release_mem_regions(pdev);
+}
+
 /**
  * ice_do_reset - Initiate one of many types of resets
  * @pf: board private structure
@@ -5072,19 +5228,10 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 		return -EINVAL;
 	}
 
-	/* this driver uses devres, see
-	 * Documentation/driver-api/driver-model/devres.rst
-	 */
-	err = pcim_enable_device(pdev);
+	err = pci_enable_device(pdev);
 	if (err)
 		return err;
 
-	err = pcim_iomap_regions(pdev, BIT(ICE_BAR0), dev_driver_string(dev));
-	if (err) {
-		dev_err(dev, "BAR0 I/O map error %d\n", err);
-		return err;
-	}
-
 	pf = ice_allocate_pf(dev);
 	if (!pf)
 		return -ENOMEM;
@@ -5109,7 +5256,11 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 	set_bit(ICE_SERVICE_DIS, pf->state);
 
 	hw = &pf->hw;
-	hw->hw_addr = pcim_iomap_table(pdev)[ICE_BAR0];
+
+	err = ice_map_all_hw_addr(pf);
+	if (err)
+		goto err_init_iomap_fail;
+
 	pci_save_state(pdev);
 
 	hw->back = pf;
@@ -5157,6 +5308,8 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 err_init_eth:
 	ice_deinit(pf);
 err_init:
+	ice_unmap_all_hw_addr(pf);
+err_init_iomap_fail:
 	pci_disable_pcie_error_reporting(pdev);
 	pci_disable_device(pdev);
 	return err;
@@ -5266,6 +5419,7 @@ static void ice_remove(struct pci_dev *pdev)
 	 */
 	ice_reset(&pf->hw, ICE_RESET_PFR);
 	pci_wait_for_pending_transaction(pdev);
+	ice_unmap_all_hw_addr(pf);
 	pci_disable_pcie_error_reporting(pdev);
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_osdep.h b/drivers/net/ethernet/intel/ice/ice_osdep.h
index 82bc54fec7f3..4b16ff489c3a 100644
--- a/drivers/net/ethernet/intel/ice/ice_osdep.h
+++ b/drivers/net/ethernet/intel/ice/ice_osdep.h
@@ -18,10 +18,49 @@
 #endif
 #include <net/udp_tunnel.h>
 
-#define wr32(a, reg, value)	writel((value), ((a)->hw_addr + (reg)))
-#define rd32(a, reg)		readl((a)->hw_addr + (reg))
-#define wr64(a, reg, value)	writeq((value), ((a)->hw_addr + (reg)))
-#define rd64(a, reg)		readq((a)->hw_addr + (reg))
+struct ice_hw;
+
+/**
+ * struct ice_hw_addr_map - a single hardware address memory map
+ * @addr: iomem address of the start of this map
+ * @start: register offset at the start of this map, inclusive bound
+ * @end: register offset at the end of this map, exclusive bound
+ * @bar: the BAR this map is for
+ *
+ * Structure representing one map of a device BAR register space. Stored as
+ * part of the ice_hw_addr structure in an array ordered by the start offset.
+ *
+ * The addr value is an iomem address returned by ioremap. The start indicates
+ * the first register offset this map is valid for. The end indicates the end
+ * of the map, and is an exclusive bound.
+ */
+struct ice_hw_addr_map {
+	void __iomem *addr;
+	resource_size_t start;
+	resource_size_t end;
+	int bar;
+};
+
+/**
+ * struct ice_hw_addr - a list of hardware address memory maps
+ * @nr: the number of maps made
+ * @maps: flexible array of maps made during device initialization
+ *
+ * Structure representing a series of sparse maps of the device BAR 0 address
+ * space to kernel addresses. Users must convert a register offset to an iomem
+ * address using ice_get_hw_addr.
+ */
+struct ice_hw_addr {
+	unsigned int nr;
+	struct ice_hw_addr_map maps[];
+};
+
+void __iomem *ice_get_hw_addr(struct ice_hw *hw, resource_size_t reg);
+
+#define wr32(a, reg, value)	writel((value), ice_get_hw_addr((a), (reg)))
+#define rd32(a, reg)		readl(ice_get_hw_addr((a), (reg)))
+#define wr64(a, reg, value)	writeq((value), ice_get_hw_addr((a), (reg)))
+#define rd64(a, reg)		readq(ice_get_hw_addr((a), (reg)))
 
 #define ice_flush(a)		rd32((a), GLGEN_STAT)
 #define ICE_M(m, s)		((m) << (s))
@@ -32,7 +71,6 @@ struct ice_dma_mem {
 	size_t size;
 };
 
-struct ice_hw;
 struct device *ice_hw_to_dev(struct ice_hw *hw);
 
 #ifdef CONFIG_DYNAMIC_DEBUG
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 4fd0e5d0a313..3d2834673903 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -272,7 +272,7 @@ struct ice_rx_ring {
 	struct net_device *netdev;	/* netdev ring maps to */
 	struct ice_vsi *vsi;		/* Backreference to associated VSI */
 	struct ice_q_vector *q_vector;	/* Backreference to associated vector */
-	u8 __iomem *tail;
+	void __iomem *tail;
 	union {
 		struct ice_rx_buf *rx_buf;
 		struct xdp_buff **xdp_buf;
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index e3f622cad425..f34975efeed7 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -821,7 +821,7 @@ struct ice_mbx_data {
 
 /* Port hardware description */
 struct ice_hw {
-	u8 __iomem *hw_addr;
+	void *hw_addr;
 	void *back;
 	struct ice_aqc_layer_props *layer_info;
 	struct ice_port_info *port_info;
-- 
2.38.1.420.g319605f8f00e

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

  parent reply	other threads:[~2023-01-13 22:38 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-13 22:37 [Intel-wired-lan] [PATCH net-next 00/13] ice: various virtualization cleanups Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 01/13] ice: fix function comment referring to ice_vsi_alloc Jacob Keller
2023-01-16 10:31   ` Michal Swiatkowski
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 02/13] ice: drop unnecessary VF parameter from several VSI functions Jacob Keller
2023-01-16 10:34   ` Michal Swiatkowski
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 03/13] ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg Jacob Keller
2023-01-16 10:47   ` Michal Swiatkowski
2023-01-17 19:20     ` Jacob Keller
2023-01-18  5:37       ` Michal Swiatkowski
2023-01-18 20:50         ` Jacob Keller
2023-01-17 22:24     ` Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 04/13] ice: add helper function for checking VSI VF requirement Jacob Keller
2023-01-16 10:56   ` Michal Swiatkowski
2023-01-17 19:23     ` Jacob Keller
2023-01-18  5:24       ` Michal Swiatkowski
2023-01-18 20:49         ` Jacob Keller
2023-01-13 22:37 ` Jacob Keller [this message]
2023-01-14  0:37   ` [Intel-wired-lan] [PATCH net-next 05/13] ice: Fix RDMA latency issue by allowing write-combining kernel test robot
2023-01-14  1:58   ` kernel test robot
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 06/13] ice: move ice_vf_vsi_release into ice_vf_lib.c Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 07/13] ice: Pull common tasks into ice_vf_post_vsi_rebuild Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 08/13] ice: add a function to initialize vf entry Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 09/13] ice: introduce ice_vf_init_host_cfg function Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 10/13] ice: convert vf_ops .vsi_rebuild to .create_vsi Jacob Keller
2023-01-27  9:46   ` Szlosek, Marek
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 11/13] ice: introduce clear_reset_state operation Jacob Keller
2023-01-14  1:17   ` kernel test robot
2023-01-14  3:49   ` kernel test robot
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 12/13] ice: introduce .irq_close VF operation Jacob Keller
2023-01-13 22:37 ` [Intel-wired-lan] [PATCH net-next 13/13] ice: remove unnecessary virtchnl_ether_addr struct use Jacob Keller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230113223735.2514364-6-jacob.e.keller@intel.com \
    --to=jacob.e.keller@intel.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=intel-wired-lan@lists.osuosl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox