From: Alejandro Lucero Palau <alucerop@amd.com>
To: Dan Williams <dan.j.williams@intel.com>, linux-cxl@vger.kernel.org
Cc: Dave Jiang <dave.jiang@intel.com>, Ira Weiny <ira.weiny@intel.com>
Subject: Re: [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
Date: Fri, 17 Jan 2025 15:42:58 +0000
Message-ID: <53ef6364-4523-05bc-4fa1-8a2110f5fe54@amd.com>
In-Reply-To: <173709425022.753996.16667967718406367188.stgit@dwillia2-xfh.jf.intel.com>
On 1/17/25 06:10, Dan Williams wrote:
> cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> allocations being distinct from RAM allocations in specific ways when in
> practice the allocation rules are only relative to DPA partition index.
>
> The rules for cxl_dpa_alloc() are:
>
> - allocations can only come from 1 partition
>
> - if allocating at partition-index-N, all free space in partitions less
> than partition-index-N must be skipped over

In my view, this explanation mixes the current code with the new code.
It would be clearer to say that the current code assumes just two
partitions, ram and pmem, and that DCD changes the game.

> Use the new 'struct cxl_dpa_partition' array to support allocation with
> an arbitrary number of DPA partitions on the device.
>
> A follow-on patch can go further to cleanup 'enum cxl_decoder_mode'
> concept and supersede it with looking up the memory properties from
> partition metadata.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Alejandro Lucero <alucerop@amd.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> drivers/cxl/core/hdm.c | 167 +++++++++++++++++++++++++++++++++---------------
> drivers/cxl/cxlmem.h | 9 +++
> 2 files changed, 125 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index 7e1559b3ed88..4a2816102a1e 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -223,6 +223,30 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
>
> +static void release_skip(struct cxl_dev_state *cxlds,
> + const resource_size_t skip_base,
> + const resource_size_t skip_len)
> +{
> + resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> + for (int i = 0; i < cxlds->nr_partitions; i++) {
> + const struct resource *part_res = &cxlds->part[i].res;
> + resource_size_t skip_end, skip_size;
> +
> + if (skip_start < part_res->start || skip_start > part_res->end)
> + continue;
> +
> + skip_end = min(part_res->end, skip_start + skip_rem - 1);
> + skip_size = skip_end - skip_start + 1;
> + __release_region(&cxlds->dpa_res, skip_start, skip_size);
> + skip_start += skip_size;
> + skip_rem -= skip_size;
> +
> + if (!skip_rem)
> + break;
> + }
> +}
> +

This implies the skip cannot be based on the end of the last child,
which is what the code implements.
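
To make the per-partition walk in release_skip() concrete, here is a
minimal userspace sketch. It is only a model: struct part_range, struct
skip_piece and split_skip() are hypothetical names invented for this
illustration, not kernel API, and it assumes the partitions are sorted by
address and contiguous, as the quoted loop appears to. Instead of calling
__release_region() it records the pieces a skip range would be carved into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one partition's DPA range (inclusive bounds). */
struct part_range {
	uint64_t start;
	uint64_t end;
};

/* One per-partition piece of a skip range (illustration only). */
struct skip_piece {
	uint64_t start;
	uint64_t size;
};

/*
 * Carve [skip_base, skip_base + skip_len) into per-partition pieces using
 * the same walk as the quoted release_skip(): find the partition holding
 * the current start, clamp the piece to that partition's end, advance.
 * Returns the number of pieces written to @out (at most @max_out).
 */
static size_t split_skip(const struct part_range *part, size_t nr_parts,
			 uint64_t skip_base, uint64_t skip_len,
			 struct skip_piece *out, size_t max_out)
{
	uint64_t skip_start = skip_base, skip_rem = skip_len;
	size_t n = 0;

	assert(part != NULL && out != NULL);

	for (size_t i = 0; i < nr_parts && skip_rem && n < max_out; i++) {
		uint64_t skip_end, skip_size;

		/* Not the partition that contains the current start. */
		if (skip_start < part[i].start || skip_start > part[i].end)
			continue;

		/* Clamp the piece so it never crosses the partition end. */
		skip_end = skip_start + skip_rem - 1;
		if (skip_end > part[i].end)
			skip_end = part[i].end;
		skip_size = skip_end - skip_start + 1;

		out[n].start = skip_start;
		out[n].size = skip_size;
		n++;

		skip_start += skip_size;
		skip_rem -= skip_size;
	}

	return n;
}
```

With partitions [0x0, 0xfff], [0x1000, 0x1fff] and [0x2000, 0x2fff], a skip
of 0x1800 bytes based at 0x800 is carved into two pieces, one for each
partition it touches.
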
> /*
> * Must be called in a context that synchronizes against this decoder's
> * port ->remove() callback (like an endpoint decoder sysfs attribute)
> @@ -241,7 +265,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> skip_start = res->start - cxled->skip;
> __release_region(&cxlds->dpa_res, res->start, resource_size(res));
> if (cxled->skip)
> - __release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> + release_skip(cxlds, skip_start, cxled->skip);
> cxled->skip = 0;
> cxled->dpa_res = NULL;
> put_device(&cxled->cxld.dev);
> @@ -268,6 +292,47 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> __cxl_dpa_release(cxled);
> }
>
> +static int request_skip(struct cxl_dev_state *cxlds,
> + struct cxl_endpoint_decoder *cxled,
> + const resource_size_t skip_base,
> + const resource_size_t skip_len)
> +{
> + resource_size_t skip_start = skip_base, skip_rem = skip_len;
> +
> + for (int i = 0; i < cxlds->nr_partitions; i++) {
> + const struct resource *part_res = &cxlds->part[i].res;
> + struct cxl_port *port = cxled_to_port(cxled);
> + resource_size_t skip_end, skip_size;
> + struct resource *res;
> +
> + if (skip_start < part_res->start || skip_start > part_res->end)
> + continue;
> +
> + skip_end = min(part_res->end, skip_start + skip_rem - 1);
> + skip_size = skip_end - skip_start + 1;
> +
> + res = __request_region(&cxlds->dpa_res, skip_start, skip_size,
> + dev_name(&cxled->cxld.dev), 0);
> + if (!res) {
> + dev_dbg(cxlds->dev,
> + "decoder%d.%d: failed to reserve skipped space\n",
> + port->id, cxled->cxld.id);
> + break;
> + }
> + skip_start += skip_size;
> + skip_rem -= skip_size;
> + if (!skip_rem)
> + break;
> + }
> +
> + if (skip_rem == 0)
> + return 0;
> +
> + release_skip(cxlds, skip_base, skip_len - skip_rem);
> +
> + return -EBUSY;
> +}
> +
> static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> resource_size_t base, resource_size_t len,
> resource_size_t skipped)
> @@ -277,6 +342,7 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &port->dev;
> struct resource *res;
> + int rc;
>
> lockdep_assert_held_write(&cxl_dpa_rwsem);
>
> @@ -305,14 +371,9 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> }
>
> if (skipped) {
> - res = __request_region(&cxlds->dpa_res, base - skipped, skipped,
> - dev_name(&cxled->cxld.dev), 0);
> - if (!res) {
> - dev_dbg(dev,
> - "decoder%d.%d: failed to reserve skipped space\n",
> - port->id, cxled->cxld.id);
> - return -EBUSY;
> - }
> + rc = request_skip(cxlds, cxled, base - skipped, skipped);
> + if (rc)
> + return rc;
> }
> res = __request_region(&cxlds->dpa_res, base, len,
> dev_name(&cxled->cxld.dev), 0);
> @@ -320,16 +381,15 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> dev_dbg(dev, "decoder%d.%d: failed to reserve allocation\n",
> port->id, cxled->cxld.id);
> if (skipped)
> - __release_region(&cxlds->dpa_res, base - skipped,
> - skipped);
> + release_skip(cxlds, base - skipped, skipped);
> return -EBUSY;
> }
> cxled->dpa_res = res;
> cxled->skip = skipped;
>
> - if (resource_contains(to_pmem_res(cxlds), res))
> + if (cxl_partition_contains(cxlds, CXL_PARTITION_PMEM, res))
> cxled->mode = CXL_DECODER_PMEM;
> - else if (resource_contains(to_ram_res(cxlds), res))
> + else if (cxl_partition_contains(cxlds, CXL_PARTITION_RAM, res))
> cxled->mode = CXL_DECODER_RAM;
> else {
> dev_warn(dev, "decoder%d.%d: %pr does not map any partition\n",
> @@ -527,15 +587,13 @@ int cxl_dpa_set_mode(struct cxl_endpoint_decoder *cxled,
> int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> {
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> - resource_size_t free_ram_start, free_pmem_start;
> struct cxl_port *port = cxled_to_port(cxled);
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
> struct device *dev = &cxled->cxld.dev;
> - resource_size_t start, avail, skip;
> + struct resource *res, *prev = NULL;
> + resource_size_t start, avail, skip, skip_start;
> struct resource *p, *last;
> - const struct resource *ram_res = to_ram_res(cxlds);
> - const struct resource *pmem_res = to_pmem_res(cxlds);
> - int rc;
> + int part, rc;
>
> down_write(&cxl_dpa_rwsem);
> if (cxled->cxld.region) {
> @@ -551,47 +609,54 @@ int cxl_dpa_alloc(struct cxl_endpoint_decoder *cxled, unsigned long long size)
> goto out;
> }
>
> - for (p = ram_res->child, last = NULL; p; p = p->sibling)
> - last = p;
> - if (last)
> - free_ram_start = last->end + 1;
> + if (cxled->mode == CXL_DECODER_RAM)
> + part = CXL_PARTITION_RAM;
> + else if (cxled->mode == CXL_DECODER_PMEM)
> + part = CXL_PARTITION_PMEM;
> else
> - free_ram_start = ram_res->start;
> + part = cxlds->nr_partitions;
> +
> + if (part >= cxlds->nr_partitions) {
> + dev_dbg(dev, "partition %d not found\n", part);
> + rc = -EBUSY;
> + goto out;
> + }
> +
> + res = &cxlds->part[part].res;
>
> - for (p = pmem_res->child, last = NULL; p; p = p->sibling)
> + for (p = res->child, last = NULL; p; p = p->sibling)
> last = p;
> if (last)
> - free_pmem_start = last->end + 1;
> + start = last->end + 1;
> else
> - free_pmem_start = pmem_res->start;
> + start = res->start;
>

As noted above, this is not correct if releases have left holes.
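
A small userspace sketch of that concern, under loud assumptions: struct
toy_res and both helpers are hypothetical stand-ins that only mimic the
child/sibling walk in the hunk above, and the sketch does not model
whatever ordering rules the driver enforces on releases, so it cannot say
whether such a hole is reachable in practice. It only shows that "last
child end + 1" does not see free space that sits below the last child.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a resource tree node (inclusive bounds). */
struct toy_res {
	uint64_t start;
	uint64_t end;
	struct toy_res *child;   /* first allocation inside this range */
	struct toy_res *sibling; /* next allocation, sorted by address */
};

/* Allocation start as computed in the quoted hunk: one past the last child. */
static uint64_t start_after_last_child(const struct toy_res *res)
{
	const struct toy_res *p, *last = NULL;

	assert(res != NULL);

	for (p = res->child; p; p = p->sibling)
		last = p;

	return last ? last->end + 1 : res->start;
}

/* Bytes of free space located below the last child, i.e. in holes. */
static uint64_t free_below_last_child(const struct toy_res *res)
{
	uint64_t cursor, holes = 0;

	assert(res != NULL);

	cursor = res->start;
	for (const struct toy_res *p = res->child; p; p = p->sibling) {
		if (p->start > cursor)
			holes += p->start - cursor;
		cursor = p->end + 1;
	}

	return holes;
}
```

For a partition [0x0, 0x2fff] whose children are [0x0, 0xfff] and
[0x2000, 0x27ff], the first helper returns 0x2800 while the second reports
0x1000 bytes of free space that the first one never considers.
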
> - if (cxled->mode == CXL_DECODER_RAM) {
> - start = free_ram_start;
> - avail = ram_res->end - start + 1;
> - skip = 0;
> - } else if (cxled->mode == CXL_DECODER_PMEM) {
> - resource_size_t skip_start, skip_end;
> -
> - start = free_pmem_start;
> - avail = pmem_res->end - start + 1;
> - skip_start = free_ram_start;
> -
> - /*
> - * If some pmem is already allocated, then that allocation
> - * already handled the skip.
> - */
> - if (pmem_res->child &&
> - skip_start == pmem_res->child->start)
> - skip_end = skip_start - 1;
> - else
> - skip_end = start - 1;
> - skip = skip_end - skip_start + 1;
> - } else {
> - dev_dbg(dev, "mode not set\n");
> - rc = -EINVAL;
> - goto out;
> + /*
> + * To allocate at partition N, a skip needs to be calculated for all
> + * unallocated space at lower partitions indices.
> + *
> + * If a partition has any allocations, the search can end because a
> + * previous cxl_dpa_alloc() invocation is assumed to have accounted for
> + * all previous partitions.
> + */
This is right, but the code below is not because ...
> + skip_start = CXL_RESOURCE_NONE;
> + for (int i = part; i; i--) {
> + prev = &cxlds->part[i - 1].res;
> + for (p = prev->child, last = NULL; p; p = p->sibling)
> + last = p;
... holes ...

I think the problem here is that we assumed ram and pmem each had a
single child and likely some free space, but a device with multiple HDM
decoders potentially implies several children.

The code supported the case of multiple children, but I guess we still
had the simple case in mind. Otherwise I cannot understand all this ...