Linux CXL
 help / color / mirror / Atom feed
From: Ira Weiny <ira.weiny@intel.com>
To: Dan Williams <dan.j.williams@intel.com>,
	Ira Weiny <ira.weiny@intel.com>, <linux-cxl@vger.kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>,
	Alejandro Lucero <alucerop@amd.com>,
	Ira Weiny <ira.weiny@intel.com>, <Jonathan.Cameron@huawei.com>
Subject: Re: [PATCH v2 4/5] cxl: Make cxl_dpa_alloc() DPA partition number agnostic
Date: Wed, 22 Jan 2025 21:14:57 -0600	[thread overview]
Message-ID: <6791b430ed4f6_5584029448@iweiny-mobl.notmuch> (raw)
In-Reply-To: <679172a7ac00f_20fa29442@dwillia2-xfh.jf.intel.com.notmuch>

Dan Williams wrote:
> Ira Weiny wrote:
> > Dan Williams wrote:
> > > cxl_dpa_alloc() is a hard coded nest of assumptions around PMEM
> > > allocations being distinct from RAM allocations in specific ways when in
> > > practice the allocation rules are only relative to DPA partition index.
> > > 
> > > The rules for cxl_dpa_alloc() are:
> > > 
> > > - allocations can only come from 1 partition
> > > 
> > > - if allocating at partition-index-N, all free space in partitions less
> > >   than partition-index-N must be skipped over
> > 
> > I think this is a bit deeper.  The partition index must also correspond to
> > the DPA order.  The DCD code verifies the partition index's are in DPA
> > order when reading them from the device.  Therefore, that code will add
> > them to cxl_dpa_info in order.  But general device driver writers may miss
> > this point.
> 
> We could save them from themselves with some paranoia in
> cxl_dpa_setup(), but as Alejandro said accelerators are typically
> single-static-RAM-partition devices. The risk is low that someone builds
> a multi-partition accelerator *and* builds a driver that messes that up,
> but I would not say no to a comment that notes that expectation.
> 
> > [snip]
> > 
> > > diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> > > index 3f8a54ca4624..591aeb26c9e1 100644
> > > --- a/drivers/cxl/core/hdm.c
> > > +++ b/drivers/cxl/core/hdm.c
> > > @@ -223,6 +223,31 @@ void cxl_dpa_debug(struct seq_file *file, struct cxl_dev_state *cxlds)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(cxl_dpa_debug, "CXL");
> > >  
> > > +/* See request_skip() kernel-doc */
> > > +static void release_skip(struct cxl_dev_state *cxlds,
> > > +			 const resource_size_t skip_base,
> > > +			 const resource_size_t skip_len)
> > > +{
> > > +	resource_size_t skip_start = skip_base, skip_rem = skip_len;
> > > +
> > > +	for (int i = 0; i < cxlds->nr_partitions; i++) {
> > > +		const struct resource *part_res = &cxlds->part[i].res;
> > > +		resource_size_t skip_end, skip_size;
> > > +
> > > +		if (skip_start < part_res->start || skip_start > part_res->end)
> > > +			continue;
> > > +
> > > +		skip_end = min(part_res->end, skip_start + skip_rem - 1);
> > > +		skip_size = skip_end - skip_start + 1;
> > > +		__release_region(&cxlds->dpa_res, skip_start, skip_size);
> > > +		skip_start += skip_size;
> > > +		skip_rem -= skip_size;
> > > +
> > > +		if (!skip_rem)
> > > +			break;
> > > +	}
> > > +}
> > > +
> > >  /*
> > >   * Must be called in a context that synchronizes against this decoder's
> > >   * port ->remove() callback (like an endpoint decoder sysfs attribute)
> > > @@ -241,7 +266,7 @@ static void __cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >  	skip_start = res->start - cxled->skip;
> > >  	__release_region(&cxlds->dpa_res, res->start, resource_size(res));
> > >  	if (cxled->skip)
> > > -		__release_region(&cxlds->dpa_res, skip_start, cxled->skip);
> > > +		release_skip(cxlds, skip_start, cxled->skip);
> > >  	cxled->skip = 0;
> > >  	cxled->dpa_res = NULL;
> > >  	put_device(&cxled->cxld.dev);
> > > @@ -268,6 +293,79 @@ static void devm_cxl_dpa_release(struct cxl_endpoint_decoder *cxled)
> > >  	__cxl_dpa_release(cxled);
> > >  }
> > >  
> > > +/**
> > > + * request_skip() - Track DPA 'skip' in @cxlds->dpa_res resource tree
> > > + * @cxlds: CXL.mem device context that parents @cxled
> > > + * @cxled: Endpoint decoder establishing new allocation that skips lower DPA
> > > + * @skip_base: DPA < start of new DPA allocation (DPAnew)
> > > + * @skip_len: @skip_base + @skip_len == DPAnew
> > > + *
> > > + * DPA 'skip' arises from out-of-sequence DPA allocation events relative
> > > + * to free capacity across multiple partitions. It is a wasteful event
> > > + * as usable DPA gets thrown away, but if a deployment has, for example,
> > > + * a dual RAM+PMEM device, wants to use PMEM, and has unallocated RAM
> > > + * DPA, the free RAM DPA must be sacrificed to start allocating PMEM.
> > > + * See third "Implementation Note" in CXL 3.1 8.2.4.19.13 "Decoder
> > > + * Protection" for more details.
> > 
> > I think this is a great comment here.
> 
> Appreciate that, never know how these things are going to translate.
> 
> > 
> > > + *
> > > + * A 'skip' always covers the last allocated DPA in a previous partition
> > > + * to the start of the current partition to allocate.  Allocations never
> > > + * start in the middle of a partition, and allocations are always
> > > + * de-allocated in reverse order (see cxl_dpa_free(), or natural devm
> > > + * unwind order from forced in-order allocation).
> > > + *
> > > + * If @cxlds->nr_partitions was guaranteed to be <= 2 then the 'skip'
> > > + * would always be contained to a single partition. Given
> > > + * @cxlds->nr_partitions may be > 2 it results in cases where the 'skip'
> > > + * might span "tail capacity of partition[0], all of partition[1], ...,
> > > + * all of partition[N-1]" to support allocating from partition[N]. That
> > > + * in turn interacts with the partition 'struct resource' boundaries
> > > + * within @cxlds->dpa_res whereby 'skip' requests need to be divided by
> > > + * partition. I.e. this is a quirk of using a 'struct resource' tree to
> > > + * detect range conflicts while also tracking partition boundaries in
> > > + * @cxlds->dpa_res.
> > 
> > Another great comment but it does not actually cover the DCD case.  This
> > is because in DCD the partitions might also have skips between them.
> 
> I think that "just works". The allocation will be bound by the
> partition, and the skip is calculated from the "end of last allocation
> in a previous partition". So, the distance between "end of last" and
> "allocation start" will naturally include inter-partition holes, right?

Not without a change to the algorithm I came up with.  We could create
phantom partitions which represent the skips between partitions.
Otherwise the skip resources need a different parent.

From my commit message:

    Two complications arise with Dynamic Capacity regions which did not         
    exist with Ram and PMEM partitions.  First, gaps in the DPA space can       
    exist between and around the DC partitions.  Second, the Linux resource     
    tree does not allow a resource to be marked across existing nodes within    
    a tree.                                                                     
                                                                                 
    For clarity, below is an example of an 60GB device with 10GB of RAM,        
    10GB of PMEM and 10GB for each of 2 DC partitions.  The desired CXL         
    mapping is 5GB of RAM, 5GB of PMEM, and 5GB of DC1.                         
                                                                                
         DPA RANGE                                                               
         (dpa_res)                                                               
    0GB        10GB       20GB       30GB       40GB       50GB       60GB      
    |----------|----------|----------|----------|----------|----------|          
                                                                                
    RAM         PMEM                  DC0                   DC1                 
     (ram_res)  (pmem_res)            (dc_res[0])           (dc_res[1])         
    |----------|----------|   <gap>  |----------|   <gap>  |----------|         
                                                                                
     RAM        PMEM                                        DC1                 
    |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
    0GB   5GB  10GB  15GB 20GB       30GB       40GB       50GB       60GB      
                                                                                
    The previous skip resource between RAM and PMEM was always a child of       
    the RAM resource and fit nicely [see (S) below].  Because of this           
    simplicity this skip resource reference was not stored in any CXL state.    
    On release the skip range could be calculated based on the endpoint         
    decoders stored values.                                                      
                                                                                
    Now when DC1 is being mapped 4 skip resources must be created as            
    children.  One for the PMEM resource (A), two of the parent DPA resource    
    (B,D), and one more child of the DC0 resource (C).                          
                                                                                
    0GB        10GB       20GB       30GB       40GB       50GB       60GB      
    |----------|----------|----------|----------|----------|----------|         
                               |                     |                          
    |----------|----------|    |     |----------|    |     |----------|         
            |          |       |          |          |                          
           (S)        (A)     (B)        (C)        (D)                         
            v          v       v          v          v                          
    |XXXXX|----|XXXXX|----|----------|----------|----------|XXXXX-----|         
           skip       skip  skip        skip      skip                          

  reply	other threads:[~2025-01-23  3:15 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-22  8:59 [PATCH v2 0/5] cxl: DPA partition metadata is a mess Dan Williams
2025-01-22  8:59 ` [PATCH v2 1/5] cxl: Remove the CXL_DECODER_MIXED mistake Dan Williams
2025-01-22 14:11   ` Ira Weiny
2025-01-23 15:49   ` Jonathan Cameron
2025-01-23 15:58   ` Alejandro Lucero Palau
2025-01-23 16:03   ` Dave Jiang
2025-01-22  8:59 ` [PATCH v2 2/5] cxl: Introduce to_{ram,pmem}_{res,perf}() helpers Dan Williams
2025-01-22 14:18   ` Ira Weiny
2025-01-23 15:57   ` Jonathan Cameron
2025-01-23 20:01     ` Dan Williams
2025-01-23 16:13   ` Dave Jiang
2025-01-23 16:25   ` Alejandro Lucero Palau
2025-01-23 21:04     ` Dan Williams
2025-01-24 10:15       ` Alejandro Lucero Palau
2025-01-25  0:45         ` Dan Williams
2025-01-22  8:59 ` [PATCH v2 3/5] cxl: Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info' Dan Williams
2025-01-22 14:53   ` Ira Weiny
2025-01-22 22:24     ` Dan Williams
2025-01-23  3:10       ` Ira Weiny
2025-01-23 16:09   ` Jonathan Cameron
2025-01-23 20:24     ` Dan Williams
2025-01-23 16:57   ` Dave Jiang
2025-01-23 17:00   ` Alejandro Lucero Palau
2025-01-23 22:43     ` Dan Williams
2025-01-23 17:17   ` Alejandro Lucero Palau
2025-01-23 22:48     ` Dan Williams
2025-01-24 10:29       ` Alejandro Lucero Palau
2025-01-22  8:59 ` [PATCH v2 4/5] cxl: Make cxl_dpa_alloc() DPA partition number agnostic Dan Williams
2025-01-22 16:29   ` Ira Weiny
2025-01-22 22:35     ` Dan Williams
2025-01-23  3:14       ` Ira Weiny [this message]
2025-01-23  3:28         ` Dan Williams
2025-01-23 16:41   ` Jonathan Cameron
2025-01-23 21:34     ` Dan Williams
2025-01-23 17:21   ` Alejandro Lucero Palau
2025-01-23 20:52   ` Dave Jiang
2025-01-22  8:59 ` [PATCH v2 5/5] cxl: Kill enum cxl_decoder_mode Dan Williams
2025-01-22 17:42   ` Ira Weiny
2025-01-22 22:58     ` Dan Williams
2025-01-23  3:39       ` Ira Weiny
2025-01-23  4:11         ` Dan Williams
2025-01-23 21:30     ` Dave Jiang
2025-01-24 22:22       ` Ira Weiny
2025-01-23 16:51   ` Jonathan Cameron
2025-01-23 21:50     ` Dan Williams
2025-01-23 17:20   ` Alejandro Lucero Palau
2025-01-23 21:29   ` Dave Jiang
2025-01-23 17:23 ` [PATCH v2 0/5] cxl: DPA partition metadata is a mess Alejandro Lucero Palau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6791b430ed4f6_5584029448@iweiny-mobl.notmuch \
    --to=ira.weiny@intel.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=alucerop@amd.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=linux-cxl@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox