Linux CXL
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: <linux-cxl@vger.kernel.org>, Dave Jiang <dave.jiang@intel.com>,
	"Alejandro Lucero" <alucerop@amd.com>,
	Ira Weiny <ira.weiny@intel.com>
Subject: Re: [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
Date: Fri, 17 Jan 2025 10:23:52 -0800	[thread overview]
Message-ID: <678aa0381324e_20fa2942a@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20250117105254.00001dd4@huawei.com>

Jonathan Cameron wrote:
> On Thu, 16 Jan 2025 22:10:44 -0800
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > The pending efforts to add CXL Accelerator (type-2) device [1], and
> > Dynamic Capacity (DCD) support [2], tripped on the
> > no-longer-fit-for-purpose design in the CXL subsystem for tracking
> > device-physical-address (DPA) metadata. Trip hazards include:
> > 
> > - CXL Memory Devices need to consider a PMEM partition, but Accelerator
> >   devices with CXL.mem likely do not in the common case.
> > 
> > - CXL Memory Devices enumerate DPA through Memory Device mailbox
> >   commands like Partition Info, Accelerators devices do not.
> > 
> > - CXL Memory Devices that support DCD support more than 2 partitions.
> >   Some of the driver algorithms are awkward to expand to > 2 partition
> >   cases.
> > 
> > - DPA performance data is a general capability that can be shared with
> >   accelerators, so tracking it in 'struct cxl_memdev_state' is no longer
> >   suitable.
> > 
> > - 'enum cxl_decoder_mode' is sometimes a partition id and sometimes a
> >   memory property, it should be phased in favor of a partition id and
> >   the memory property comes from the partition info.
> > 
> > Towards cleaning up those issues and allowing a smoother landing for the
> > aforementioned pending efforts, introduce a 'struct cxl_dpa_partition'
> > array to 'struct cxl_dev_state', and 'struct cxl_range_info' as a shared
> > way for Memory Devices and Accelerators to initialize the DPA information
> > in 'struct cxl_dev_state'.
> > 
> > For now, split a new cxl_dpa_setup() from cxl_mem_create_range_info() to
> > get the new data structure initialized, and cleanup some qos_class init.
> > Follow on patches will go further to use the new data structure to
> > cleanup algorithms that are better suited to loop over all possible
> > partitions.
> > 
> > cxl_dpa_setup() follows the locking expectations of mutating the device
> > DPA map, and is suitable for Accelerator drivers to use. Accelerators
> > likely only have one hardcoded 'ram' partition to convey to the
> > cxl_core.
> > 
> > Link: http://lore.kernel.org/20241230214445.27602-1-alejandro.lucero-palau@amd.com [1]
> > Link: http://lore.kernel.org/20241210-dcd-type2-upstream-v8-0-812852504400@intel.com [2]
> > Cc: Dave Jiang <dave.jiang@intel.com>
> > Cc: Alejandro Lucero <alucerop@amd.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> 
> Hi Dan,
> 
> In basic form this seems fine, but I find the nr_paritions variable usage very
> counter intuitive.  It's just how many we configured not how many there
> are, potentially with 0 size (so not a partition).  I'd be happier if we
> can avoid that by just prefilling the lot with zero size and filling in
> the ones we want.  So zero size means doesn't exist and use an iterator where
> appropriate to skip the zero size ones.

The PMEM-only device case did give me pause. Is that 2 partitions with a
zero-sized first partition, or is that just 1 partition?

Ultimately I do think the code should further evolve to treat that as
1-PMEM-partition, but as far as I can see that depends on 'enum
cxl_decoder_mode' being eliminated and teaching all code paths to search
for the position of the PMEM partition.

> Without that tidied up, to me this is more confusing than the previous code.

I was going to save PMEM at a partition other than 1 for the DCD series,
but let me take another pass at adding that to this series.

[..]
> > +/* if this fails the caller must destroy @cxlds, there is no recovery */
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info)
> > +{
> > +	struct device *dev = cxlds->dev;
> > +
> > +	guard(rwsem_write)(&cxl_dpa_rwsem);
> > +
> > +	if (cxlds->nr_partitions)
> > +		return -EBUSY;
> > +
> > +	if (!info->size || !info->nr_partitions) {
> > +		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > +		cxlds->nr_partitions = 0;
> > +		return 0;
> > +	}
> > +
> > +	cxlds->dpa_res = DEFINE_RES_MEM(0, info->size);
> > +
> > +	for (int i = 0; i < info->nr_partitions; i++) {
> > +		const char *desc;
> > +		int rc;
> > +
> > +		if (i == CXL_PARTITION_RAM)
> > +			desc = "ram";
> > +		else if (i == CXL_PARTITION_PMEM)
> > +			desc = "pmem";
> > +		else
> > +			desc = "";
> > +		cxlds->part[i].perf.qos_class = CXL_QOS_CLASS_INVALID;
> > +		rc = add_dpa_res(dev, &cxlds->dpa_res, &cxlds->part[i].res,
> > +				 info->range[i].start,
> > +				 range_len(&info->range[i]), desc);
> > +		if (rc)
> > +			return rc;
> > +		cxlds->nr_partitions++;
> I'd just initialize the rest to 0 length similar to what is happening
> if we have pmem only anyway.  Then this nr_patitions goes away and
> stops being a possible source of confusion.

Modulo teaching other code that wants to ask "what is the size of the
PMEM partition" to use a helper that hides the "find the device's PMEM
partition".


> 
> > +	}
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(cxl_dpa_setup);
> 
> > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> > index 3502f1633ad2..7dca5c8c3494 100644
> > --- a/drivers/cxl/core/mbox.c
> > +++ b/drivers/cxl/core/mbox.c
> > @@ -1241,57 +1241,36 @@ int cxl_mem_sanitize(struct cxl_memdev *cxlmd, u16 cmd)
> 
> > -int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> > +int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info)
> >  {
> >  	struct cxl_dev_state *cxlds = &mds->cxlds;
> > -	struct resource *ram_res = to_ram_res(cxlds);
> > -	struct resource *pmem_res = to_pmem_res(cxlds);
> >  	struct device *dev = cxlds->dev;
> >  	int rc;
> >  
> >  	if (!cxlds->media_ready) {
> > -		cxlds->dpa_res = DEFINE_RES_MEM(0, 0);
> > -		*ram_res = DEFINE_RES_MEM(0, 0);
> > -		*pmem_res = DEFINE_RES_MEM(0, 0);
> > +		info->size = 0;
> >  		return 0;
> >  	}
> >  
> > -	cxlds->dpa_res = DEFINE_RES_MEM(0, mds->total_bytes);
> > +	info->size = mds->total_bytes;
> >  
> >  	if (mds->partition_align_bytes == 0) {
> Obviously nothing to do with your patch as such, but maybe tidy this up
> by making active values == fixed values when we don't have partition control.
> That seems logical anyway to me and means we only end up with one lot of
> range setup in here.  I can't immediately see any side effects of doing this.

Yeah, I mentioned this in another thread. There is no reason
for 'struct cxl_memdev_state' to carry these values at all. They are
just temporary init-data.

So, cxl_dev_state_identify() becomes cxl_mem_identify(), since
it is a memory-device command. Move it inside of cxl_mem_dpa_fetch()
since it is just temporary init-data for 'struct cxl_dpa_info'.

[..]
> > -		rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > -				 mds->volatile_only_bytes, "ram");
> > -		if (rc)
> > -			return rc;
> > -		return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > -				   mds->volatile_only_bytes,
> > -				   mds->persistent_only_bytes, "pmem");
> > +		info->range[CXL_PARTITION_RAM] = (struct range) {
> > +			.start = 0,
> > +			.end = mds->volatile_only_bytes - 1,
> > +		};
> > +		info->nr_partitions++;
> > +
> > +		if (!mds->persistent_only_bytes)
> > +			return 0;
> > +
> > +		info->range[CXL_PARTITION_PMEM] = (struct range) {
> > +			.start = mds->volatile_only_bytes,
> > +			.end = mds->volatile_only_bytes +
> > +			       mds->persistent_only_bytes - 1,
> > +		};
> > +		info->nr_partitions++;
> 
> This nr partitions makes some sense though I'd be tempted to add a type
> array to info so that we can just not pass empty ones if we don't want to.
> Makes this code a little more complex, but not a lot and means
> nr->partitions becomes the ones that actually exist.

Agree, that's the end goal.

> 
> > +		return 0;
> >  	}
> >  
> >  	rc = cxl_mem_get_partition_info(mds);
> > @@ -1300,15 +1279,24 @@ int cxl_mem_create_range_info(struct cxl_memdev_state *mds)
> >  		return rc;
> >  	}
> >  
> > -	rc = add_dpa_res(dev, &cxlds->dpa_res, ram_res, 0,
> > -			 mds->active_volatile_bytes, "ram");
> > -	if (rc)
> > -		return rc;
> > -	return add_dpa_res(dev, &cxlds->dpa_res, pmem_res,
> > -			   mds->active_volatile_bytes,
> > -			   mds->active_persistent_bytes, "pmem");
> > +	info->range[CXL_PARTITION_RAM] = (struct range) {
> > +		.start = 0,
> > +		.end = mds->active_volatile_bytes - 1,
> > +	};
> > +	info->nr_partitions++;
> > +
> > +	if (!mds->active_persistent_bytes)
> > +		return 0;
> > +
> > +	info->range[CXL_PARTITION_PMEM] = (struct range) {
> > +		.start = mds->active_volatile_bytes,
> > +		.end = mds->active_volatile_bytes + mds->active_persistent_bytes - 1,
> > +	};
> > +	info->nr_partitions++;
> > +
> > +	return 0;
> >  }
> > -EXPORT_SYMBOL_NS_GPL(cxl_mem_create_range_info, "CXL");
> > +EXPORT_SYMBOL_NS_GPL(cxl_mem_dpa_fetch, "CXL");
> 
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 78e92e24d7b5..2e728d4b7327 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -97,6 +97,20 @@ int devm_cxl_dpa_reserve(struct cxl_endpoint_decoder *cxled,
> >  			 resource_size_t base, resource_size_t len,
> >  			 resource_size_t skipped);
> >  
> > +/* Well known, spec defined partition indices */
> > +enum cxl_partition {
> > +	CXL_PARTITION_RAM,
> > +	CXL_PARTITION_PMEM,
> > +	CXL_PARTITION_MAX,
> > +};
> > +
> > +struct cxl_dpa_info {
> > +	u64 size;
> > +	struct range range[CXL_PARTITION_MAX];
> > +	int nr_partitions;
> > +};
> 
> blank line seems appropriate here.

Added.

> 
> > +int cxl_dpa_setup(struct cxl_dev_state *cxlds, const struct cxl_dpa_info *info);
> > +
> >  static inline struct cxl_ep *cxl_ep_load(struct cxl_port *port,
> >  					 struct cxl_memdev *cxlmd)
> >  {
> > @@ -408,6 +422,16 @@ struct cxl_dpa_perf {
> >  	int qos_class;
> >  };
> >  
> 
> >  /**
> >   * struct cxl_dev_state - The driver device state
> >   *
> > @@ -423,8 +447,8 @@ struct cxl_dpa_perf {
> >   * @rcd: operating in RCD mode (CXL 3.0 9.11.8 CXL Devices Attached to an RCH)
> >   * @media_ready: Indicate whether the device media is usable
> >   * @dpa_res: Overall DPA resource tree for the device
> > - * @_pmem_res: Active Persistent memory capacity configuration
> > - * @_ram_res: Active Volatile memory capacity configuration
> > + * @part: DPA partition array
> > + * @nr_partitions: Number of DPA partitions
> 
> This needs more. It is not the number of partitions present I think, it
> is the number that a particular driver is potentially interested in.
> 
> >   * @serial: PCIe Device Serial Number
> >   * @type: Generic Memory Class device or Vendor Specific Memory device
> >   * @cxl_mbox: CXL mailbox context
> > @@ -438,21 +462,39 @@ struct cxl_dev_state {
> >  	bool rcd;
> >  	bool media_ready;
> >  	struct resource dpa_res;
> > -	struct resource _pmem_res;
> > -	struct resource _ram_res;
> > +	struct cxl_dpa_partition part[CXL_PARTITION_MAX];
> > +	unsigned int nr_partitions;
> >  	u64 serial;
> >  	enum cxl_devtype type;
> >  	struct cxl_mailbox cxl_mbox;
> >  };
> >  
> > -static inline struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> > +static inline const struct resource *to_ram_res(struct cxl_dev_state *cxlds)
> >  {
> > -	return &cxlds->_ram_res;
> > +	if (cxlds->nr_partitions > 0)
> > +		return &cxlds->part[CXL_PARTITION_RAM].res;
> > +	return NULL;
> >  }
> >  
> > -static inline struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> > +static inline const struct resource *to_pmem_res(struct cxl_dev_state *cxlds)
> >  {
> > -	return &cxlds->_pmem_res;
> > +	if (cxlds->nr_partitions > 1)
> 
> This is very confusing as nr_partitions is being used not to indicate
> number of partitions but whether a driver has filled in the data for them
> (which may well be empty).
> 
> I'd rather see that as a bitmap, or a 'not set' value initialized by
> the core that is then replaced when they are set.

...or even better, not require PMEM to be at partition1.

[..]

  parent reply	other threads:[~2025-01-17 18:24 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-17  6:10 [PATCH 0/4] cxl: DPA partition metadata is a mess Dan Williams
2025-01-17  6:10 ` [PATCH 1/4] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
2025-01-17 10:03   ` Jonathan Cameron
2025-01-17 17:47     ` Dan Williams
2025-01-17 10:24   ` Alejandro Lucero Palau
2025-01-17 17:54     ` Dan Williams
2025-01-17 18:45   ` Ira Weiny
2025-01-17  6:10 ` [PATCH 2/4] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
2025-01-17 10:20   ` Jonathan Cameron
2025-01-17 10:23     ` Jonathan Cameron
2025-01-17 17:55       ` Dan Williams
2025-01-17 13:33   ` Alejandro Lucero Palau
2025-01-17 20:47     ` Dan Williams
2025-01-17  6:10 ` [PATCH 3/4] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
2025-01-17 10:52   ` Jonathan Cameron
2025-01-17 13:38     ` Alejandro Lucero Palau
2025-01-17 18:23     ` Dan Williams [this message]
2025-01-17 20:32       ` Ira Weiny
2025-01-20 12:24       ` Alejandro Lucero Palau
2025-01-31 23:54         ` Dan Williams
2025-01-17 15:58   ` Alejandro Lucero Palau
2025-01-17 22:52     ` Dan Williams
2025-01-17 20:42   ` Ira Weiny
2025-01-17 22:08   ` Ira Weiny
2025-01-31 23:39     ` Dan Williams
2025-01-17  6:10 ` [PATCH 4/4] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
2025-01-17 11:12   ` Jonathan Cameron
2025-01-17 18:37     ` Dan Williams
2025-01-17 15:42   ` Alejandro Lucero Palau
2025-01-17 20:57     ` Dan Williams
2025-01-20 12:39       ` Alejandro Lucero Palau
2025-02-01  0:08         ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=678aa0381324e_20fa2942a@dwillia2-xfh.jf.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=alucerop@amd.com \
    --cc=dave.jiang@intel.com \
    --cc=ira.weiny@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox