Linux CXL
 help / color / mirror / Atom feed
From: Ira Weiny <ira.weiny@intel.com>
To: Dan Williams <dan.j.williams@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <linux-cxl@vger.kernel.org>, Dave Jiang <dave.jiang@intel.com>,
	"Alejandro Lucero" <alucerop@amd.com>,
	Ira Weiny <ira.weiny@intel.com>
Subject: Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
Date: Fri, 17 Jan 2025 14:32:13 -0600	[thread overview]
Message-ID: <678abe4d4c449_1f2d2b2945d@iweiny-mobl.notmuch> (raw)
In-Reply-To: <678aa0381324e_20fa2942a@dwillia2-xfh.jf.intel.com.notmuch>

Dan Williams wrote:
> Jonathan Cameron wrote:
> > On Thu, 16 Jan 2025 22:10:44 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > The pending efforts to add CXL Accelerator (type-2) device [1], and
> > > Dynamic Capacity (DCD) support [2], tripped on the
> > > no-longer-fit-for-purpose design in the CXL subsystem for tracking
> > > device-physical-address (DPA) metadata. Trip hazards include:
> > > 
> > > - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> > >   devices with CXL.mem likely do not in the common case.
> > > 
> > > - CXL Memory Devices enumerate DPA through Memory Device mailbox
> > >   commands like Partition Info, Accelerators devices do not.
> > > 
> > > - CXL Memory Devices that support DCD support more than 2 partitions.
> > >   Some of the driver algorithms are awkward to expand to > 2 partition
> > >   cases.
> > > 
> > > - DPA performance data is a general capability that can be shared with
> > >   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> > >   suitable.
> > > 
> > > - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> > >   memory property, it should be phased in favor of a partition id and
> > >   the memory property comes from the partition info.
> > > 
> > > Towards cleaning up those issues and allowing a smoother landing for the
> > > aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> > > array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> > > way for Memory Devices and Accelerators to initialize the DPA information
> > > in 'struct cxl_dev_state'.
> > > 
> > > For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> > > get the new data structure initialized, and cleanup some qos_class init.
> > > Follow on patches will go further to use the new data structure to
> > > cleanup algorithms that are better suited to loop over all possible
> > > partitions.
> > > 
> > > cxl_dpa_setup() follows the locking expectations of mutating the device
> > > DPA map, and is suitable for Accelerator drivers to use. Accelerators
> > > likely only have one hardcoded 'ram' partition to convey to the
> > > cxl_core.
> > > 
> > > Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> > > Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> > > Cc: Dave Jiang <dave.jiang@intel.com>
> > > Cc: Alejandro Lucero <alucerop@amd.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > 
> > Hi Dan,
> > 
> > In basic form this seems fine, but I find the nr_paritions variable usage very
> > counter intuitive.  It's just how many we configured not how many there
> > are, potentially with 0 size (so not a partition).  I'd be happier if we
> > can avoid that by just prefilling the lot with zero size and filling in
> > the ones we want.  So zero size means doesn't exist and use an iterator where
> > appropriate to skip the zero size ones.
> 
> The PMEM-only device case did give me pause. Is that 2 partitions with a
> zero-sized first partition, or is that just 1 partition?
> 
> Ultimately I do think the code should further evolve to treat that as
> 1-PMEM-partition, but as far as I can see that depends on 'enum
> cxl_decoder_mode' being eliminated and teaching all code paths to search
> for the position of the PMEM partition.

I was of the same mind that the decoder names could be used to index the
array.  For ram/pmem this is baked into the user API but for DCD one could
imagine not just specifying partition dc0 but rather a 'ram' dcd
partition.

> 
> > Without that tidied up, to me this is more confusing than the previous code.
> 
> I was going to save PMEM at a partition other than 1 for the DCD series,
> but let me take another pass at adding that to this series.
> 
> [..]
> > > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > > +{
> > > +	struct device *dev = cxlds->dev;
> > > +
> > > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> > > +
> > > +	if (cxlds->nr_partitions)
> > > +		return -EBUSY;
> > > +
> > > +	if (!info->size || !info->nr_partitions) {
> > > +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > > +		cxlds->nr_partitions = 0;
> > > +		return 0;
> > > +	}
> > > +
> > > +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> > > +
> > > +	for (int i = 0; i < info->nr_partitions; i++) {
> > > +		const char *desc;
> > > +		int rc;
> > > +
> > > +		if (i == CXL_PARTITION_RAM)
> > > +			desc = "ram";
> > > +		else if (i == CXL_PARTITION_PMEM)
> > > +			desc = "pmem";
> > > +		else
> > > +			desc = "";
> > > +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> > > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> > > +				 info->range[i].start,
> > > +				 range_len(&info->range[i]), desc);
> > > +		if (rc)
> > > +			return rc;
> > > +		cxlds->nr_partitions++;
> > I'd just initialize the rest to 0 length similar to what is happening
> > if we have pmem only anyway.  Then this nr_patitions goes away and
> > stops being a possible source of confusion.
> 
> Modulo teaching other code that wants to ask "what is the size of the
> PMEM partition" to use a helper that hides the "find the device's PMEM
> partition".
> 
> 
> > 
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
> > 
> > > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > > index 3502f1633ad2..7dca5c8c3494 100644
> > > --- a/drivers/cxl/core/mbox.c
> > > +++ b/drivers/cxl/core/mbox.c
> > > @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> > 
> > > -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > > +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> > >  {
> > >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> > > -	struct resource *ram_res = to_ram_res(cxlds);
> > > -	struct resource *pmem_res = to_pmem_res(cxlds);
> > >  	struct device *dev = cxlds->dev;
> > >  	int rc;
> > >  
> > >  	if (!cxlds->media_ready) {
> > > -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > > -		*ram_res = DEFINE_RES_MEM(0, 0);
> > > -		*pmem_res = DEFINE_RES_MEM(0, 0);
> > > +		info->size = 0;
> > >  		return 0;
> > >  	}
> > >  
> > > -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> > > +	info->size = mds->total_bytes;
> > >  
> > >  	if (mds->partition_align_bytes == 0) {
> > Obviously nothing to do with your patch as such, but maybe tidy this up
> > by making active values == fixed values when we don't have partition control.
> > That seems logical anyway to me and means we only end up with one lot of
> > range setup in here.  I can't immediately see any side effects of doing this.
> 
> Yeah, I mentioned this in another thread. There is no reason
> for 'struct cxl_memdev_state' to carry these values at all. They are
> just temporary init-data.
> 
> So, cxl_dev_state_identify() becomes cxl_mem_identify(), since
> it is a memory-device command. Move it inside of cxl_mem_dpa_fetch()
> since it is just temporary init-data for 'struct cxl_dpa_info'.

I took a different direction and removed all these temporary variables
which also had the side effect of separating the mbox command processing
from the resource creation.

I do prefer cxl_dpa_info to the way I coded it but I did not anticipate my
structure living long I'm also not a fan of 'cxl_byte_layout':


diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 6d63c29eb0e1..9646465e2cbe 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -463,12 +463,6 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
  * @firmware_version: Firmware version for the memory device.
  * @enabled_cmds: Hardware commands found enabled in CEL.
  * @exclusive_cmds: Commands that are kernel-internal only
- * @total_bytes: sum of all possible capacities
- * @volatile_only_bytes: hard volatile capacity
- * @persistent_only_bytes: hard persistent capacity
- * @partition_align_bytes: alignment size for partition-able capacity
- * @active_volatile_bytes: sum of hard + soft volatile
- * @active_persistent_bytes: sum of hard + soft persistent
  * @ram_perf: performance data entry matched to RAM partition
  * @pmem_perf: performance data entry matched to PMEM partition
  * @event: event log driver state
@@ -485,12 +479,6 @@ struct cxl_memdev_state {
        char firmware_version[0x10];
        DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);
        DECLARE_BITMAP(exclusive_cmds, CXL_MEM_COMMAND_ID_MAX);
-       u64 total_bytes;
-       u64 volatile_only_bytes;
-       u64 persistent_only_bytes;
-       u64 partition_align_bytes;
-       u64 active_volatile_bytes;
-       u64 active_persistent_bytes;
 
        struct cxl_dpa_perf ram_perf;
        struct cxl_dpa_perf pmem_perf;
@@ -811,10 +799,19 @@ enum {
 
 int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
                          struct cxl_mbox_cmd *cmd);
-int cxl_dev_state_identify(struct cxl_memdev_state *mds);
+
+struct cxl_mem_byte_layout {
+       u64 total_bytes;
+       u64 volatile_bytes;
+       u64 persistent_bytes;
+};
+
+int cxl_dev_state_identify(struct cxl_memdev_state *mds,
+                          struct cxl_mem_byte_layout *byte_layout);
 int cxl_await_media_ready(struct cxl_dev_state *cxlds);
 int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
-int cxl_mem_create_range_info(struct cxl_memdev_state *mds);
+int cxl_create_range_info(struct cxl_dev_state *cxlds,
+                             struct cxl_mem_byte_layout *byte_layout);
 struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev);
 void set_exclusive_cxl_commands(struct cxl_memdev_state *mds,
                                unsigned long *cmds);


> 
> [..]
> > > -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > > -				 mds->volatile_only_bytes, "ram");
> > > -		if (rc)
> > > -			return rc;
> > > -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > > -				   mds->volatile_only_bytes,
> > > -				   mds->persistent_only_bytes, "pmem");
> > > +		info->range[CXL_PARTITION_RAM] = (struct range) {
> > > +			.start = 0,
> > > +			.end = mds->volatile_only_bytes - 1,
> > > +		};
> > > +		info->nr_partitions++;
> > > +
> > > +		if (!mds->persistent_only_bytes)
> > > +			return 0;
> > > +
> > > +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> > > +			.start = mds->volatile_only_bytes,
> > > +			.end = mds->volatile_only_bytes +
> > > +			       mds->persistent_only_bytes - 1,
> > > +		};
> > > +		info->nr_partitions++;
> > 
> > This nr partitions makes some sense though I'd be tempted to add a type
> > array to info so that we can just not pass empty ones if we don't want to.
> > Makes this code a little more complex, but not a lot and means
> > nr->partitions becomes the ones that actually exist.
> 
> Agree, that's the end goal.
> 
> > 
> > > +		return 0;
> > >  	}
> > >  
> > >  	rc = cxl_mem_get_partition_info(mds);
> > > @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > >  		return rc;
> > >  	}
> > >  
> > > -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > > -			 mds->active_volatile_bytes, "ram");
> > > -	if (rc)
> > > -		return rc;
> > > -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > > -			   mds->active_volatile_bytes,
> > > -			   mds->active_persistent_bytes, "pmem");
> > > +	info->range[CXL_PARTITION_RAM] = (struct range) {
> > > +		.start = 0,
> > > +		.end = mds->active_volatile_bytes - 1,
> > > +	};
> > > +	info->nr_partitions++;
> > > +
> > > +	if (!mds->active_persistent_bytes)
> > > +		return 0;
> > > +
> > > +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> > > +		.start = mds->active_volatile_bytes,
> > > +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> > > +	};
> > > +	info->nr_partitions++;
> > > +
> > > +	return 0;
> > >  }
> > > -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> > > +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
> > 
> > > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > > index 78e92e24d7b5..2e728d4b7327 100644
> > > --- a/drivers/cxl/cxlmem.h
> > > +++ b/drivers/cxl/cxlmem.h
> > > @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> > >  			 resource_size_t base, resource_size_t len,
> > >  			 resource_size_t skipped);
> > >  
> > > +/* Well known, spec defined partition indices */
> > > +enum cxl_partition {
> > > +	CXL_PARTITION_RAM,
> > > +	CXL_PARTITION_PMEM,
> > > +	CXL_PARTITION_MAX,
> > > +};
> > > +
> > > +struct cxl_dpa_info {
> > > +	u64 size;
> > > +	struct range range[CXL_PARTITION_MAX];
> > > +	int nr_partitions;
> > > +};
> > 
> > blank line seems appropriate here.
> 
> Added.
> 
> > 
> > > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> > > +
> > >  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
> > >  					 struct cxl_memdev *cxlmd)
> > >  {
> > > @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
> > >  	int qos_class;
> > >  };
> > >  
> > 
> > >  /**
> > >   * struct cxl_dev_state - The driver device state
> > >   *
> > > @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
> > >   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
> > >   * @media_ready: Indicate whether the device media is usable
> > >   * @dpa_res: Overall DPA resource tree for the device
> > > - * @_pmem_res: Active Persistent memory capacity configuration
> > > - * @_ram_res: Active Volatile memory capacity configuration
> > > + * @part: DPA partition array
> > > + * @nr_partitions: Number of DPA partitions
> > 
> > This needs more. It is not the number of partitions present I think, it
> > is the number that a particular driver is potentially interested in.
> > 
> > >   * @serial: PCIe Device Serial Number
> > >   * @type: Generic Memory Class device or Vendor Specific Memory device
> > >   * @cxl_mbox: CXL mailbox context
> > > @@ -438,21 +462,39 @@ struct cxl_dev_state {
> > >  	bool rcd;
> > >  	bool media_ready;
> > >  	struct resource dpa_res;
> > > -	struct resource _pmem_res;
> > > -	struct resource _ram_res;
> > > +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> > > +	unsigned int nr_partitions;
> > >  	u64 serial;
> > >  	enum cxl_devtype type;
> > >  	struct cxl_mailbox cxl_mbox;
> > >  };
> > >  
> > > -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > > +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > >  {
> > > -	return &cxlds->_ram_res;
> > > +	if (cxlds->nr_partitions > 0)
> > > +		return &cxlds->part[CXL_PARTITION_RAM].res;
> > > +	return NULL;
> > >  }
> > >  
> > > -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > > +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > >  {
> > > -	return &cxlds->_pmem_res;
> > > +	if (cxlds->nr_partitions > 1)
> > 
> > This is very confusing as nr_partitions is being used not to indicate
> > number of partitions but whether a driver has filled in the data for them
> > (which may well be empty).
> > 
> > I'd rather see that as a bitmap, or a 'not set' value initialized by
> > the core that is then replaced when they are set.
> 
> ...or even better, not require PMEM to be at partition1.

FWIW I think that is a step to far to be rushing into 6.14.

Ira

  reply	other threads:[~2025-01-17 20:33 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
2025-01-17 10:03   ` Jonathan Cameron
2025-01-17 17:47     ` Dan Williams
2025-01-17 10:24   ` Alejandro Lucero Palau
2025-01-17 17:54     ` Dan Williams
2025-01-17 18:45   ` Ira Weiny
2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
2025-01-17 10:20   ` Jonathan Cameron
2025-01-17 10:23     ` Jonathan Cameron
2025-01-17 17:55       ` Dan Williams
2025-01-17 13:33   ` Alejandro Lucero Palau
2025-01-17 20:47     ` Dan Williams
2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
2025-01-17 10:52   ` Jonathan Cameron
2025-01-17 13:38     ` Alejandro Lucero Palau
2025-01-17 18:23     ` Dan Williams
2025-01-17 20:32       ` Ira Weiny [this message]
2025-01-20 12:24       ` Alejandro Lucero Palau
2025-01-31 23:54         ` Dan Williams
2025-01-17 15:58   ` Alejandro Lucero Palau
2025-01-17 22:52     ` Dan Williams
2025-01-17 20:42   ` Ira Weiny
2025-01-17 22:08   ` Ira Weiny
2025-01-31 23:39     ` Dan Williams
2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
2025-01-17 11:12   ` Jonathan Cameron
2025-01-17 18:37     ` Dan Williams
2025-01-17 15:42   ` Alejandro Lucero Palau
2025-01-17 20:57     ` Dan Williams
2025-01-20 12:39       ` Alejandro Lucero Palau
2025-02-01  0:08         ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=678abe4d4c449_1f2d2b2945d@iweiny-mobl.notmuch \
    --to=ira.weiny@intel.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=alucerop@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox