linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Dave Jiang <dave.jiang@intel.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	<dan.j.williams@intel.com>, <ira.weiny@intel.com>,
	<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
	<dave@stgolabs.net>
Subject: Re: [PATCH v5 2/2] cxl: Calculate region bandwidth of targets with shared upstream link
Date: Thu, 20 Jun 2024 11:45:58 +0100	[thread overview]
Message-ID: <20240620114558.0000154d@Huawei.com> (raw)
In-Reply-To: <20240618231730.2533819-3-dave.jiang@intel.com>

On Tue, 18 Jun 2024 16:16:41 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> The current bandwidth calculation aggregates all the targets. This simple
> method does not take into account where multiple targets sharing under
> a switch where the aggregated bandwidth can be less or greater than the
> upstream link of the switch.
> 
> To accurately account for the shared upstream uplink cases, a new update
> function is introduced by walking from the leaves to the root of the
> hierarchy and adjust the bandwidth in the process. This process is done
> when all the targets for a region are present but before the final values
> are send to the HMAT handling code cached access_coordinate targets.
> 
> The original perf calculation path was kept to calculate the latency
> performance data that does not require the shared link consideration.
> The shared upstream link calculation is done as a second pass when all
> the endpoints have arrived.
> 
> Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Link: https://lore.kernel.org/linux-cxl/20240501152503.00002e60@Huawei.com/
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

A few trivial things inline. I haven't tested this yet in anger
(as for 'reasons' my test machine is running an emulation of s390 emulating
 x86 this morning and so is a bit busy).

I'll try and get at least some tests done in the near future and the
additional control patches for doing so posted for the QEMU emulation.
Feel free to poke me if I've not given a tested by the end of next week.

In meantime, lgtm.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> 
> ---
> v5:
> - Fix cxl_memdev_get_dpa_perf() default return. (Jonathan)
> - Direct return dpa_perf_contains(). (Jonathan)
> - Fix incorrect bandwidth member reference. (Jonathan)
> - Direct return error for pcie_link_speed_mbps(). (Jonathan)
> - Remove stray edit. (Jonathan)
> - Adjust calculated bandwidth of RPs sharing same host bridge. (Jonathan)
> - Fix uninit var gp_is_root. (kernel test robot)
> ---
>  drivers/cxl/core/cdat.c   | 411 ++++++++++++++++++++++++++++++++++++--
>  drivers/cxl/core/core.h   |   4 +-
>  drivers/cxl/core/pci.c    |  23 +++
>  drivers/cxl/core/port.c   |  20 ++
>  drivers/cxl/core/region.c |   4 +
>  drivers/cxl/cxl.h         |   1 +
>  6 files changed, 446 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
> index fea214340d4b..72f86bc99750 100644
> --- a/drivers/cxl/core/cdat.c
> +++ b/drivers/cxl/core/cdat.c

...

> +static struct cxl_dpa_perf *cxl_memdev_get_dpa_perf(struct cxl_memdev_state *mds,
> +						    enum cxl_decoder_mode mode)
> +{
> +	switch (mode) {
>  	case CXL_DECODER_RAM:
> -		perf = &mds->ram_perf;
> -		break;
> +		return &mds->ram_perf;
>  	case CXL_DECODER_PMEM:
> -		perf = &mds->pmem_perf;
> -		break;
> +		return &mds->pmem_perf;
>  	default:
> +		return ERR_PTR(-EINVAL);
> +	}
> +

Can't get here so delete the below return.

> +	return ERR_PTR(-EINVAL);
> +}

> +static int cxl_endpoint_gather_coordinates(struct cxl_region *cxlr,
> +					   struct cxl_endpoint_decoder *cxled,
> +					   struct xarray *usp_xa)
> +{
> +	struct cxl_port *endpoint = to_cxl_port(cxled->cxld.dev.parent);
> +	struct access_coordinate pci_coord[ACCESS_COORDINATE_MAX];
> +	struct access_coordinate sw_coord[ACCESS_COORDINATE_MAX];
> +	struct access_coordinate ep_coord[ACCESS_COORDINATE_MAX];
> +	struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> +	struct cxl_port *parent_port, *gp_port;
> +	struct cxl_perf_ctx *perf_ctx;
> +	struct cxl_dpa_perf *perf;
> +	bool gp_is_root = false;
> +	unsigned long index;
> +	void *ptr;
> +	int rc;
> +
> +	if (cxlds->rcd)
> +		return -ENODEV;
> +
> +	parent_port = to_cxl_port(endpoint->dev.parent);
> +	gp_port = to_cxl_port(parent_port->dev.parent);
> +	if (is_cxl_root(gp_port))
> +		gp_is_root = true;

	gp_is_root = is_cxl_root(gp_port);
and drop the initialization at declaration.
Or just use is_cxl_root(gp_port) at the check below.

Could set all these at declaration though I guess a few lines
would be a bit long.

> +
> +	perf = cxl_memdev_get_dpa_perf(mds, cxlr->mode);
> +	if (IS_ERR(perf))
> +		return PTR_ERR(perf);
> +
> +	if (!dpa_perf_contains(perf, cxled->dpa_res))
> +		return -EINVAL;
> +
> +	/*
> +	 * The index for the xarray is the upstream port device of the upstream
> +	 * CXL switch.
> +	 */
> +	index = (unsigned long)parent_port->uport_dev;
> +	perf_ctx = xa_load(usp_xa, index);
> +	if (!perf_ctx) {
> +		struct cxl_perf_ctx *c __free(kfree) =
> +			kzalloc(sizeof(*perf_ctx), GFP_KERNEL);
> +
> +		if (!c)
> +			return -ENOMEM;
> +		ptr = xa_store(usp_xa, index, c, GFP_KERNEL);
> +		if (xa_is_err(ptr))
> +			return xa_err(ptr);
> +		perf_ctx = no_free_ptr(c);
> +	}
> +
> +	/* Direct upstream link from EP bandwidth */
> +	rc = cxl_pci_get_bandwidth(pdev, pci_coord);
> +	if (rc < 0)
> +		return rc;
> +
> +	/*
> +	 * Min of upstream link bandwidth and Endpoint CDAT bandwidth from
> +	 * DSLBIS.
> +	 */
> +	cxl_coordinates_combine(ep_coord, pci_coord, perf->cdat_coord);
> +
> +	/*
> +	 * If grandparent port is root, then there's no switch involved and
> +	 * the endpoint is connected to a root port.
> +	 */
> +	if (!gp_is_root) {
> +		/*
> +		 * Retrieve the switch SSLBIS for switch downstream port
> +		 * associated with the endpoint bandwidth.
> +		 */
> +		rc = cxl_port_get_switch_dport_bandwidth(endpoint, sw_coord);
> +		if (rc)
> +			return rc;
> +
> +		/*
> +		 * Min of the earlier coordinates with the switch SSLBIS
> +		 * bandwidth
> +		 */
> +		cxl_coordinates_combine(ep_coord, ep_coord, sw_coord);
> +	}
> +
> +	/*
> +	 * Aggregate the computed bandwidth with the current aggregated bandwidth
> +	 * of the endpoints with the same switch upstream port.
> +	 */
> +	cxl_bandwidth_add(perf_ctx->coord, perf_ctx->coord, ep_coord);
> +	perf_ctx->port = parent_port;
> +
> +	return 0;
> +}
> +
> +static struct xarray *cxl_switch_iterate_coordinates(struct xarray *input_xa,
> +						     bool *parent_is_root)
> +{
> +	struct xarray *res_xa __free(kfree) = kzalloc(sizeof(*res_xa), GFP_KERNEL);
> +	struct access_coordinate coords[ACCESS_COORDINATE_MAX];
> +	struct cxl_perf_ctx *ctx, *us_ctx;
> +	unsigned long index, us_index;
> +	void *ptr;
> +	int rc;
> +
> +	if (!res_xa)
> +		return ERR_PTR(-ENOMEM);
> +	xa_init(res_xa);
> +
> +	*parent_is_root = false;
> +	xa_for_each(input_xa, index, ctx) {
> +		struct cxl_port *parent_port, *port, *gp_port;
> +		struct device *dev = (struct device *)index;
> +		struct cxl_dport *dport;
> +		struct pci_dev *pdev;
> +		bool gp_is_root;
> +
> +		gp_is_root = false;
> +		port = ctx->port;
> +		parent_port = to_cxl_port(port->dev.parent);

Maybe set those at declaration?

> +		if (is_cxl_root(parent_port)) {
> +			*parent_is_root = true;
> +		} else {
> +			gp_port = to_cxl_port(parent_port->dev.parent);
> +			gp_is_root = is_cxl_root(gp_port);
> +		}

From a pure code organization point of view, this could be moved
below the upstream port section to where these are first used.
However I can see some advantage in gathering all the 'who am I'
stuff in one place, even if it costs a few lines of code.
So I'm fine with this if you prefer it.

> +
> +		dport = port->parent_dport;
> +
> +		/*
> +		 * Create an xarray entry with the key of the upstream
> +		 * port of the upstream switch.
> +		 */
> +		us_index = (unsigned long)parent_port->uport_dev;
> +		us_ctx = xa_load(res_xa, us_index);
> +		if (!us_ctx) {
> +			struct cxl_perf_ctx *n __free(kfree) =
> +				kzalloc(sizeof(*n), GFP_KERNEL);
> +
> +			if (!n)
> +				return ERR_PTR(-ENOMEM);
> +
> +			ptr = xa_store(res_xa, us_index, n, GFP_KERNEL);
> +			if (xa_is_err(ptr))
> +				return ERR_PTR(xa_err(ptr));
> +			us_ctx = no_free_ptr(n);
> +		}
> +
> +		us_ctx->port = parent_port;
> +
> +		/*
> +		 * Take the min of the downstream aggregated bandwdith and the

bandwidth

> +		 * GP provided bandwidth if parent is CXL root.
> +		 */
> +		if (*parent_is_root) {
> +			cxl_coordinates_combine(us_ctx->coord, ctx->coord, dport->coord);
> +			continue;
> +		}
> +
> +		/* Below is the calculation for another switch upstream */
> +		if (!gp_is_root) {
> +			/*
> +			 * If the device isn't an upstream PCIe port, there's something
> +			 * wrong with the topology.
> +			 */
> +			if (!dev_is_pci(dev))
> +				return ERR_PTR(-EINVAL);
> +
> +			/* Retrieve the upstream link bandwidth */
> +			pdev = to_pci_dev(dev);
> +			rc = cxl_pci_get_bandwidth(pdev, coords);

might as well combine these 2 lines and drop the pdev local variable
(which otherwise could have been scoped to this if() block

			rc = cxl_pci_get_bandwidth(to_pci_dev(dev), coords);

> +			if (rc)
> +				return ERR_PTR(-ENXIO);
> +
> +			/*
> +			 * Take the min of downstream bandwidth and the upstream link
> +			 * bandwidth.
> +			 */
> +			cxl_coordinates_combine(coords, coords, ctx->coord);
> +
> +			/*
> +			 * Take the min of the calculated bandwdith and the upstream
> +			 * switch SSLBIS bandwidth.
> +			 */
> +			cxl_coordinates_combine(coords, coords, dport->coord);
> +
> +			/*
> +			 * Aggregate the calculated bandwidth common to an upstream
> +			 * switch.
> +			 */
> +			cxl_bandwidth_add(us_ctx->coord, us_ctx->coord, coords);
> +		} else {
> +			/*
> +			 * Aggregate the bandwidth common to an upstream switch.
> +			 */
> +			cxl_bandwidth_add(us_ctx->coord, us_ctx->coord,
> +					  ctx->coord);
> +		}
> +	}
> +
> +	return_ptr(res_xa);
> +}
> +
> +static void cxl_region_update_access_coordinate(struct cxl_region *cxlr,
> +						struct xarray *input_xa)
> +{
> +	struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> +	struct cxl_perf_ctx *ctx;
> +	unsigned long index;
> +
> +	memset(coord, 0, sizeof(coord));
> +	xa_for_each(input_xa, index, ctx)
> +		cxl_bandwidth_add(coord, coord, ctx->coord);
> +
> +	for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
> +		cxlr->coord[i].read_bandwidth = coord[i].read_bandwidth;
> +		cxlr->coord[i].write_bandwidth = coord[i].write_bandwidth;
> +	}
> +}
> +
> +static void free_perf_xa(struct xarray *xa)
> +{
> +	struct cxl_perf_ctx *ctx;
> +	unsigned long index;
> +
> +	if (!xa)
> +		return;
> +
> +	xa_for_each(xa, index, ctx)
> +		kfree(ctx);
> +	xa_destroy(xa);
> +	kfree(xa);
> +}
> +
> +/*
> + * cxl_region_shared_upstream_perf_update - Recalculate the access coordinates
> + * @cxl_region: the cxl region to recalculate
> + *
> + * For certain region construction with endpoints behind CXL switches,
> + * there is the possibility of the total bandwdith for all the endpoints
> + * behind a switch being less or more than the switch upstream link. The
> + * algorithm assumes the configuration is a symmetric topology as that
> + * maximizes performance.
> + *
> + * There can be multiple switches under a RP. There can be multiple RPs under
> + * a HB.
> + *
> + * An example hierarchy:
> + *
> + *                 CFMWS 0
> + *                   |
> + *          _________|_________
> + *         |                   |
> + *     ACPI0017-0          ACPI0017-1
> + *  GP0/HB0/ACPI0016-0   GP1/HB1/ACPI0016-1
> + *     |          |        |           |
> + *    RP0        RP1      RP2         RP3
> + *     |          |        |           |
> + *   SW 0       SW 1     SW 2        SW 3
> + *   |   |      |   |    |   |       |   |
> + *  EP0 EP1    EP2 EP3  EP4  EP5    EP6 EP7
> + *
> + * Computation for the example hierarchy:
> + *
> + * Min (GP0 to CPU BW,
> + *      Min(SW 0 Upstream Link to RP0 BW,
> + *          Min(SW0SSLBIS for SW0DSP0 (EP0), EP0 DSLBIS, EP0 Upstream Link) +
> + *          Min(SW0SSLBIS for SW0DSP1 (EP1), EP1 DSLBIS, EP1 Upstream link)) +
> + *      Min(SW 1 Upstream Link to RP1 BW,
> + *          Min(SW1SSLBIS for SW1DSP0 (EP2), EP2 DSLBIS, EP2 Upstream Link) +
> + *          Min(SW1SSLBIS for SW1DSP1 (EP3), EP3 DSLBIS, EP3 Upstream link))) +
> + * Min (GP1 to CPU BW,
> + *      Min(SW 2 Upstream Link to RP2 BW,
> + *          Min(SW2SSLBIS for SW2DSP0 (EP4), EP4 DSLBIS, EP4 Upstream Link) +
> + *          Min(SW2SSLBIS for SW2DSP1 (EP5), EP5 DSLBIS, EP5 Upstream link)) +
> + *      Min(SW 3 Upstream Link to RP3 BW,
> + *          Min(SW3SSLBIS for SW3DSP0 (EP6), EP6 DSLBIS, EP6 Upstream Link) +
> + *          Min(SW3SSLBIS for SW3DSP1 (EP7), EP7 DSLBIS, EP7 Upstream link))))
> + */
> +void cxl_region_shared_upstream_perf_update(struct cxl_region *cxlr)
> +{
> +	struct xarray *usp_xa, *iter_xa, *working_xa;
> +	bool is_root;
> +	int rc;
> +
> +	lockdep_assert_held(&cxl_dpa_rwsem);
> +
> +	usp_xa = kzalloc(sizeof(*usp_xa), GFP_KERNEL);
> +	if (!usp_xa)
>  		return;
> +
> +	xa_init(usp_xa);
> +
> +	/*
> +	 * Collect aggregated endpoint bandwidth and store the bandwidth in
> +	 * an xarray indexed by the upstream port of the switch or RP. The
> +	 * bandwidth is aggregated per switch. Each endpoint consists of the
> +	 * minimum of bandwidth from DSLBIS from the endpoint CDAT, the endpoint
> +	 * upstream link bandwidth, and the bandwidth from the SSLBIS of the
> +	 * switch CDAT for the switch upstream port to the downstream port that's
> +	 * associated with the endpoint. If the device is directly connected to
> +	 * a RP, then no SSLBIS is involved.
> +	 */
> +	for (int i = 0; i < cxlr->params.nr_targets; i++) {
> +		struct cxl_endpoint_decoder *cxled = cxlr->params.targets[i];
> +
> +		rc = cxl_endpoint_gather_coordinates(cxlr, cxled, usp_xa);
> +		if (rc) {
> +			free_perf_xa(usp_xa);
> +			return;
> +		}
>  	}
>  
> +	iter_xa = usp_xa;
> +	usp_xa = NULL;
> +	/*
> +	 * Iterate through the components in the xarray and aggregate any
> +	 * component that share the same upstream link from the switch.
> +	 * The iteration takes consideration of multi-level switch
> +	 * hierarchy.
> +	 *
> +	 * When cxl_switch_iterate_coordinates() detect the grandparent
> +	 * upstream is a cxl root, it aggregates the bandwidth under
> +	 * the host bridge. A RP does not have SSLBIS nor does it have
> +	 * upstream PCIe link.
> +	 *
> +	 * When cxl_switch_iterate_coordinates() detect the parent upstream
> +	 * is a cxl root, it takes the min() of the aggregated RP perfs and
> +	 * the bandwidth from the Generic Port (GP). 'is_root' is set at this
> +	 * point as well.
> +	 */
> +	do {
> +		working_xa = cxl_switch_iterate_coordinates(iter_xa, &is_root);
> +		if (IS_ERR(working_xa))
> +			goto out;
> +		free_perf_xa(iter_xa);
> +		iter_xa = working_xa;
> +	} while (!is_root);
> +
> +	/*
> +	 * Aggregate the bandwidths in the xarray (for all the HBs) and update
> +	 * the region bandwidths with the newly calculated bandwidths.
> +	 */
> +	cxl_region_update_access_coordinate(cxlr, iter_xa);
> +
> +out:
> +	free_perf_xa(iter_xa);
> +}


  reply	other threads:[~2024-06-20 10:46 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-18 23:16 [PATCH v5 0/2] cxl: Region bandwidth calculation for targets with shared upstream link Dave Jiang
2024-06-18 23:16 ` [PATCH v5 1/2] cxl: Preserve the CDAT access_coordinate for an endpoint Dave Jiang
2024-06-20 10:13   ` Jonathan Cameron
2024-06-27 14:32   ` Ira Weiny
2024-06-27 17:53     ` Dave Jiang
2024-06-18 23:16 ` [PATCH v5 2/2] cxl: Calculate region bandwidth of targets with shared upstream link Dave Jiang
2024-06-20 10:45   ` Jonathan Cameron [this message]
2024-06-20 18:34     ` Dave Jiang
2024-06-27 15:50   ` Ira Weiny
2024-06-27 17:57     ` Dave Jiang
2024-06-25 18:29 ` [PATCH v5 0/2] cxl: Region bandwidth calculation for " Ira Weiny

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240620114558.0000154d@Huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=alison.schofield@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).