From: Alejandro Lucero Palau <alucerop@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>,
Dan Williams <dan.j.williams@intel.com>
Cc: linux-cxl@vger.kernel.org, Dave Jiang <dave.jiang@intel.com>,
Ira Weiny <ira.weiny@intel.com>
Subject: Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
Date: Fri, 17 Jan 2025 13:38:07 +0000 [thread overview]
Message-ID: <7f1fb3cd-15ff-66ff-f395-35de5f984d0d@amd.com> (raw)
In-Reply-To: <20250117105254.00001dd4@huawei.com>
On 1/17/25 10:52, Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:44 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
>
>> The pending efforts to add CXL Accelerator (type-2) device [1], and
>> Dynamic Capacity (DCD) support [2], tripped on the
>> no-longer-fit-for-purpose design in the CXL subsystem for tracking
>> device-physical-address (DPA) metadata. Trip hazards include:
>>
>> - CXL Memory Devices need to consider a PMEM partition, but Accelerator
>> devices with CXL.mem likely do not in the common case.
>>
>> - CXL Memory Devices enumerate DPA through Memory Device mailbox
>> commands like Partition Info, Accelerators devices do not.
>>
>> - CXL Memory Devices that support DCD support more than 2 partitions.
>> Some of the driver algorithms are awkward to expand to > 2 partition
>> cases.
>>
>> - DPA performance data is a general capability that can be shared with
>> accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
>> suitable.
>>
>> - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
>> memory property, it should be phased in favor of a partition id and
>> the memory property comes from the partition info.
>>
>> Towards cleaning up those issues and allowing a smoother landing for the
>> aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
>> array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
>> way for Memory Devices and Accelerators to initialize the DPA information
>> in 'struct cxl_dev_state'.
>>
>> For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
>> get the new data structure initialized, and cleanup some qos_class init.
>> Follow on patches will go further to use the new data structure to
>> cleanup algorithms that are better suited to loop over all possible
>> partitions.
>>
>> cxl_dpa_setup() follows the locking expectations of mutating the device
>> DPA map, and is suitable for Accelerator drivers to use. Accelerators
>> likely only have one hardcoded 'ram' partition to convey to the
>> cxl_core.
>>
>> Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
>> Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Alejandro Lucero <alucerop@amd.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Hi Dan,
>
> In basic form this seems fine, but I find the nr_paritions variable usage very
> counter intuitive. It's just how many we configured not how many there
> are, potentially with 0 size (so not a partition). I'd be happier if we
> can avoid that by just prefilling the lot with zero size and filling in
> the ones we want. So zero size means doesn't exist and use an iterator where
> appropriate to skip the zero size ones.
>
> Without that tidied up, to me this is more confusing than the previous code.
>
> Jonathan
>
>> ---
>> drivers/cxl/core/cdat.c | 15 ++-----
>> drivers/cxl/core/hdm.c | 69 ++++++++++++++++++++++++++++++++++
>> drivers/cxl/core/mbox.c | 86 ++++++++++++++++++------------------------
>> drivers/cxl/cxlmem.h | 79 +++++++++++++++++++++++++--------------
>> drivers/cxl/pci.c | 7 +++
>> tools/testing/cxl/test/cxl.c | 15 ++-----
>> tools/testing/cxl/test/mem.c | 7 +++
>> 7 files changed, 176 insertions(+), 102 deletions(-)
>>
>> diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
>> index b177a488e29b..5400a421ad30 100644
>> --- a/drivers/cxl/core/cdat.c
>> +++ b/drivers/cxl/core/cdat.c
>> @@ -261,25 +261,18 @@ static void cxl_memdev_set_qos_class(struct cxl_dev_state *cxlds,
>> struct device *dev = cxlds->dev;
>> struct dsmas_entry *dent;
>> unsigned long index;
>> - const struct resource *partition[] = {
>> - to_ram_res(cxlds),
>> - to_pmem_res(cxlds),
>> - };
>> - struct cxl_dpa_perf *perf[] = {
>> - to_ram_perf(cxlds),
>> - to_pmem_perf(cxlds),
>> - };
> Ok. This removes some of the concerns from previous patch.
>
>>
>> xa_for_each(dsmas_xa, index, dent) {
>> - for (int i = 0; i < ARRAY_SIZE(partition); i++) {
>> - const struct resource *res = partition[i];
>> + for (int i = 0; i < cxlds->nr_partitions; i++) {
>> + struct resource *res = &cxlds->part[i].res;
>> struct range range = {
>> .start = res->start,
>> .end = res->end,
>> };
>>
>> if (range_contains(&range, &dent->dpa_range))
>> - update_perf_entry(dev, dent, perf[i]);
>> + update_perf_entry(dev, dent,
>> + &cxlds->part[i].perf);
>> else
>> dev_dbg(dev,
>> "no partition for dsmas dpa: %pra\n",
>> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
>> index 7a85522294ad..7e1559b3ed88 100644
>> --- a/drivers/cxl/core/hdm.c
>> +++ b/drivers/cxl/core/hdm.c
>> @@ -342,6 +342,75 @@ static int __cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>> return 0;
>> }
>>
>> +static int add_dpa_res(struct device *dev, struct resource *parent,
>> + struct resource *res, resource_size_t start,
>> + resource_size_t size, const char *type)
>> +{
>> + int rc;
>> +
>> + *res = (struct resource) {
>> + .name = type,
>> + .start = start,
>> + .end = start + size - 1,
>> + .flags = IORESOURCE_MEM,
>> + };
>> + if (resource_size(res) == 0) {
>> + dev_dbg(dev, "DPA(%s): no capacity\n", res->name);
>> + return 0;
>> + }
>> + rc = request_resource(parent, res);
>> + if (rc) {
>> + dev_err(dev, "DPA(%s): failed to track %pr (%d)\n", res->name,
>> + res, rc);
>> + return rc;
>> + }
>> +
>> + dev_dbg(dev, "DPA(%s): %pr\n", res->name, res);
>> +
>> + return 0;
>> +}
>> +
>> +/* if this fails the caller must destroy @cxlds, there is no recovery */
>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
>> +{
>> + struct device *dev = cxlds->dev;
>> +
>> + guard(rwsem_write)(&cxl_dpa_rwsem);
>> +
>> + if (cxlds->nr_partitions)
>> + return -EBUSY;
>> +
>> + if (!info->size || !info->nr_partitions) {
>> + cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>> + cxlds->nr_partitions = 0;
>> + return 0;
>> + }
>> +
>> + cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
>> +
>> + for (int i = 0; i < info->nr_partitions; i++) {
>> + const char *desc;
>> + int rc;
>> +
>> + if (i == CXL_PARTITION_RAM)
>> + desc = "ram";
>> + else if (i == CXL_PARTITION_PMEM)
>> + desc = "pmem";
>> + else
>> + desc = "";
>> + cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
>> + rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
>> + info->range[i].start,
>> + range_len(&info->range[i]), desc);
>> + if (rc)
>> + return rc;
>> + cxlds->nr_partitions++;
> I'd just initialize the rest to 0 length similar to what is happening
> if we have pmem only anyway. Then this nr_patitions goes away and
> stops being a possible source of confusion.
>
>> + }
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
>> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
>> index 3502f1633ad2..7dca5c8c3494 100644
>> --- a/drivers/cxl/core/mbox.c
>> +++ b/drivers/cxl/core/mbox.c
>> @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
>> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
>> {
>> struct cxl_dev_state *cxlds = &mds->cxlds;
>> - struct resource *ram_res = to_ram_res(cxlds);
>> - struct resource *pmem_res = to_pmem_res(cxlds);
>> struct device *dev = cxlds->dev;
>> int rc;
>>
>> if (!cxlds->media_ready) {
>> - cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
>> - *ram_res = DEFINE_RES_MEM(0, 0);
>> - *pmem_res = DEFINE_RES_MEM(0, 0);
>> + info->size = 0;
>> return 0;
>> }
>>
>> - cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
>> + info->size = mds->total_bytes;
>>
>> if (mds->partition_align_bytes == 0) {
> Obviously nothing to do with your patch as such, but maybe tidy this up
> by making active values == fixed values when we don't have partition control.
> That seems logical anyway to me and means we only end up with one lot of
> range setup in here. I can't immediately see any side effects of doing this.
>
>
>
> if (mds->partition_align_bytes != 0) {
> rc = cxl_mem_get_partition_info(mds);
> if (rc)
> return rc;
> } else {
> mds->active_volatile_bytes = mds->volatile_only_bytes;
> mds->active_persistent_bytes = mds->persistent_only_bytes;
> }
> info->range[CXL_PARTITION_RAM] = (struct range) {
> .start = 0,
> .end = mds->active_volatile_bytes - 1,
> };
> info->nr_partitions++;
>
> if (!mds->active_persistent_bytes)
> return 0;
>
> info->range[CXL_PARTITION_PMEM] = (struct range) {
> .start = mds->active_volatile_bytes,
> .end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> };
> info->nr_partitions++;
>
> return 0;
> }
>
>> - rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>> - mds->volatile_only_bytes, "ram");
>> - if (rc)
>> - return rc;
>> - return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>> - mds->volatile_only_bytes,
>> - mds->persistent_only_bytes, "pmem");
>> + info->range[CXL_PARTITION_RAM] = (struct range) {
>> + .start = 0,
>> + .end = mds->volatile_only_bytes - 1,
>> + };
>> + info->nr_partitions++;
>> +
>> + if (!mds->persistent_only_bytes)
>> + return 0;
>> +
>> + info->range[CXL_PARTITION_PMEM] = (struct range) {
>> + .start = mds->volatile_only_bytes,
>> + .end = mds->volatile_only_bytes +
>> + mds->persistent_only_bytes - 1,
>> + };
>> + info->nr_partitions++;
> This nr partitions makes some sense though I'd be tempted to add a type
> array to info so that we can just not pass empty ones if we don't want to.
> Makes this code a little more complex, but not a lot and means
> nr->partitions becomes the ones that actually exist.
>
>> + return 0;
>> }
>>
>> rc = cxl_mem_get_partition_info(mds);
>> @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
>> return rc;
>> }
>>
>> - rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
>> - mds->active_volatile_bytes, "ram");
>> - if (rc)
>> - return rc;
>> - return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
>> - mds->active_volatile_bytes,
>> - mds->active_persistent_bytes, "pmem");
>> + info->range[CXL_PARTITION_RAM] = (struct range) {
>> + .start = 0,
>> + .end = mds->active_volatile_bytes - 1,
>> + };
>> + info->nr_partitions++;
>> +
>> + if (!mds->active_persistent_bytes)
>> + return 0;
>> +
>> + info->range[CXL_PARTITION_PMEM] = (struct range) {
>> + .start = mds->active_volatile_bytes,
>> + .end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
>> + };
>> + info->nr_partitions++;
>> +
>> + return 0;
>> }
>> -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
>> +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
>> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
>> index 78e92e24d7b5..2e728d4b7327 100644
>> --- a/drivers/cxl/cxlmem.h
>> +++ b/drivers/cxl/cxlmem.h
>> @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
>> resource_size_t base, resource_size_t len,
>> resource_size_t skipped);
>>
>> +/* Well known, spec defined partition indices */
>> +enum cxl_partition {
>> + CXL_PARTITION_RAM,
>> + CXL_PARTITION_PMEM,
>> + CXL_PARTITION_MAX,
>> +};
>> +
>> +struct cxl_dpa_info {
>> + u64 size;
>> + struct range range[CXL_PARTITION_MAX];
>> + int nr_partitions;
>> +};
> blank line seems appropriate here.
>
>> +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
>> +
>> static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
>> struct cxl_memdev *cxlmd)
>> {
>> @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
>> int qos_class;
>> };
>>
>> /**
>> * struct cxl_dev_state - The driver device state
>> *
>> @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
>> * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
>> * @media_ready: Indicate whether the device media is usable
>> * @dpa_res: Overall DPA resource tree for the device
>> - * @_pmem_res: Active Persistent memory capacity configuration
>> - * @_ram_res: Active Volatile memory capacity configuration
>> + * @part: DPA partition array
>> + * @nr_partitions: Number of DPA partitions
> This needs more. It is not the number of partitions present I think, it
> is the number that a particular driver is potentially interested in.
>
>> * @serial: PCIe Device Serial Number
>> * @type: Generic Memory Class device or Vendor Specific Memory device
>> * @cxl_mbox: CXL mailbox context
>> @@ -438,21 +462,39 @@ struct cxl_dev_state {
>> bool rcd;
>> bool media_ready;
>> struct resource dpa_res;
>> - struct resource _pmem_res;
>> - struct resource _ram_res;
>> + struct cxl_dpa_partition part[CXL_PARTITION_MAX];
>> + unsigned int nr_partitions;
>> u64 serial;
>> enum cxl_devtype type;
>> struct cxl_mailbox cxl_mbox;
>> };
>>
>> -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>> +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
>> {
>> - return &cxlds->_ram_res;
>> + if (cxlds->nr_partitions > 0)
>> + return &cxlds->part[CXL_PARTITION_RAM].res;
>> + return NULL;
>> }
>>
>> -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>> +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
>> {
>> - return &cxlds->_pmem_res;
>> + if (cxlds->nr_partitions > 1)
> This is very confusing as nr_partitions is being used not to indicate
> number of partitions but whether a driver has filled in the data for them
> (which may well be empty).
I would say this is more than confusing: it is broken. What if a device
only has pmem?
Number of partitions would be 1 ...
>
> I'd rather see that as a bitmap, or a 'not set' value initialized by
> the core that is then replaced when they are set.
Repeating my doubt I expressed in the previous patch but with other words:
Is it not enough to give the reference to the ram/pmem resource and then
work based on the size?
>> + return &cxlds->part[CXL_PARTITION_PMEM].res;
>> + return NULL;
>> +}
>> +
>> +static inline struct cxl_dpa_perf *to_ram_perf(struct cxl_dev_state *cxlds)
>> +{
>> + if (cxlds->nr_partitions > 0)
>> + return &cxlds->part[CXL_PARTITION_RAM].perf;
>> + return NULL;
>> +}
>> +
>> +static inline struct cxl_dpa_perf *to_pmem_perf(struct cxl_dev_state *cxlds)
>> +{
>> + if (cxlds->nr_partitions > 1)
>> + return &cxlds->part[CXL_PARTITION_PMEM].perf;
>> + return NULL;
>> }
>
>> @@ -860,7 +883,7 @@ int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
>> int cxl_dev_state_identify(struct cxl_memdev_state *mds);
>> int cxl_await_media_ready(struct cxl_dev_state *cxlds);
>> int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
>> -int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
>> +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
>> struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
>> void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
>> unsigned long *cmds);
>> diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
>> index 7f1c5061307b..ba3d48b37de3 100644
>> --- a/tools/testing/cxl/test/cxl.c
>> +++ b/tools/testing/cxl/test/cxl.c
>> @@ -1001,26 +1001,19 @@ static void mock_cxl_endpoint_parse_cdat(struct cxl_port *port)
>> struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> struct access_coordinate ep_c[ACCESS_COORDINATE_MAX];
>> - const struct resource *partition[] = {
>> - to_ram_res(cxlds),
>> - to_pmem_res(cxlds),
>> - };
>> - struct cxl_dpa_perf *perf[] = {
>> - to_ram_perf(cxlds),
>> - to_pmem_perf(cxlds),
>> - };
> Ok. This gets rid of some of the earlier concerns.
>
>>
>> if (!cxl_root)
>> return;
>>
>> - for (int i = 0; i < ARRAY_SIZE(partition); i++) {
>> - const struct resource *res = partition[i];
>> + for (int i = 0; i < cxlds->nr_partitions; i++) {
>> + struct resource *res = &cxlds->part[i].res;
>> + struct cxl_dpa_perf *perf = &cxlds->part[i].perf;
>> struct range range = {
>> .start = res->start,
>> .end = res->end,
>> };
>>
>> - dpa_perf_setup(port, &range, perf[i]);
>> + dpa_perf_setup(port, &range, perf);
>> }
>>
>> cxl_memdev_update_perf(cxlmd);
>
next prev parent reply other threads:[~2025-01-17 13:38 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-17 6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
2025-01-17 6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
2025-01-17 10:03 ` Jonathan Cameron
2025-01-17 17:47 ` Dan Williams
2025-01-17 10:24 ` Alejandro Lucero Palau
2025-01-17 17:54 ` Dan Williams
2025-01-17 18:45 ` Ira Weiny
2025-01-17 6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
2025-01-17 10:20 ` Jonathan Cameron
2025-01-17 10:23 ` Jonathan Cameron
2025-01-17 17:55 ` Dan Williams
2025-01-17 13:33 ` Alejandro Lucero Palau
2025-01-17 20:47 ` Dan Williams
2025-01-17 6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
2025-01-17 10:52 ` Jonathan Cameron
2025-01-17 13:38 ` Alejandro Lucero Palau [this message]
2025-01-17 18:23 ` Dan Williams
2025-01-17 20:32 ` Ira Weiny
2025-01-20 12:24 ` Alejandro Lucero Palau
2025-01-31 23:54 ` Dan Williams
2025-01-17 15:58 ` Alejandro Lucero Palau
2025-01-17 22:52 ` Dan Williams
2025-01-17 20:42 ` Ira Weiny
2025-01-17 22:08 ` Ira Weiny
2025-01-31 23:39 ` Dan Williams
2025-01-17 6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
2025-01-17 11:12 ` Jonathan Cameron
2025-01-17 18:37 ` Dan Williams
2025-01-17 15:42 ` Alejandro Lucero Palau
2025-01-17 20:57 ` Dan Williams
2025-01-20 12:39 ` Alejandro Lucero Palau
2025-02-01 0:08 ` Dan Williams
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7f1fb3cd-15ff-66ff-f395-35de5f984d0d@amd.com \
--to=alucerop@amd.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=ira.weiny@intel.com \
--cc=linux-cxl@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox