public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Ira Weiny <ira.weiny@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>, Fan Ni <fan.ni@samsung.com>,
	"Navneet Singh" <navneet.singh@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Alison Schofield <alison.schofield@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	<linux-btrfs@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD)
Date: Fri, 3 May 2024 10:20:51 +0100	[thread overview]
Message-ID: <20240503102051.00004a99@Huawei.com> (raw)
In-Reply-To: <6632d503f3ae5_e1f58294df@iweiny-mobl.notmuch>

On Wed, 1 May 2024 16:49:24 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> Jonathan Cameron wrote:
> >   
> > > 
> > > Fan Ni's latest v5 of Qemu DCD was used for testing.[2]  
> > Hi Ira, Navneet.  
> > > 
> > > Remaining work:
> > > 
> > > 	1) Integrate the QoS work from Dave Jiang
> > > 	2) Interleave support  
> > 
> > 
> > More flag.  This one I think is potentially important and don't
> > see any handling in here.  
> 
> Nope I admit I missed the spec requirement.
> 
> > 
> > Whilst an FM could in theory be careful to avoid sending a
> > sparse set of extents, if the device is managing the memory range
> > (which is possible all it supports) and the FM issues an Initiate Dynamic
> > Capacity Add with Free (again may be all device supports) then we
> > can't stop the device issuing a bunch of sparse extents.
> > 
> > Now it won't be broken as such without this, but every time we
> > accept the first extent that will implicitly reject the rest.
> > That will look very ugly to an FM which has to poke potentially many
> > times to successfully allocate memory to a host.  
> 
> This helps me to see see why the more bit is useful.
> 
> > 
> > I also don't think it will be that hard to support, but maybe I'm
> > missing something?   
> 
> Just a bunch of code and refactoring busy work.  ;-)  It's not rocket
> science but does fundamentally change the arch again.
> 
> > 
> > My first thought is it's just a loop in cxl_handle_dcd_add_extent()
> > over a list of extents passed in then slightly more complex response
> > generation.  
> 
> Not exactly 'just a loop'.  No matter how I work this out there is the
> possibility that some extents get surfaced and then the kernel tries to
> remove them because it should not have.

Lets consider why it might need to back out.
1) Device sends an invalid set of extents - so maybe one in a later message
   overlaps with an already allocated extent.   Device bug, handling can
   be extremely inelegant - up to crashing the kernel.  Worst that happens
   due to race is probably a poison storm / machine check fun?  Not our
   responsibility to deal with something that broken (in my view!) Best effort
   only.

2) Host can't handle the extent for some reason and didn't know that until
   later - can just reject the ones it can't handle. 

> 
> To be most safe the cxl core is going to have to make 2 round trips to the
> cxl region layer for each extent.  The first determines if the extent is
> valid and creates the extent as much as possible.  The second actually
> surfaces the extents.  However, if the surface fails then you might not
> get the extents back.  So now we are in an invalid state.  :-/  WARN and
> continue I guess?!??!

Yes. Orchestrator can decide how to handle - probably reboot server in as
gentle a fashion as possible.


> 
> I think the safest way to handle this is add a new kernel notify event
> called 'extent create' which stops short of surfacing the extent.  [I'm
> not 100% sure how this is going to affect interleave.]
> 
> I think the safest logic for add is something like:
> 
> 	cxl_handle_dcd_add_event()
> 		add_extent(squirl_list, extent);
> 
> 		if (more bit) /* wait for more */
> 			return;
> 
> 		/* Create extents to hedge the bets against failure */
> 		for_each(squirl_list)
> 			if (notify 'extent create' != ok)
> 				send_response(fail);
> 				return;
> 		
> 		for_each(squirl_list)
> 			if (notify 'surface' != ok)
> 				/*
> 				 * If the more bit was set, some extents
> 				 * have been surfaced and now need to be
> 				 * removed...
> 				 *
> 				 * Try to remove them and hope...
> 				 */

If we failed to surface them all another option is just tell the device
that.   Responds with the extents that successfully surfaced and reject
all others (or all after the one that failed?)  So for the lower layers
send the device a response that says "thanks but I only took these ones"
and for the upper layers pretend "I was only offered these ones"

> 				WARN_ON('surface extents failed');
> 				for_each(squirl_list)
> 					notify 'remove without response'
> 				send_response(fail);
> 				return;
> 
> 		send_response(squirl_list, accept);
> 
> The logic for remove is not changed AFAICS because the device must allow
> for memory to be released at any time so the host is free to release each
> of the extents individually despite the 'more' bit???

Yes, but only after it accepted them - which needs to be done in one go.
So you can't just send releases before that (the device will return an
error and keep them in the pending list I think...)

> 
> > 
> > I don't want this to block getting initial DCD support in but it
> > will be a bit ugly if we quickly support the more flag and then end
> > up with just one kernel that an FM has to be careful with...  
> 
> I'm not sure which is worse.  Given your use case above it seems like the
> more bit may be more important for 'dumb' devices which want to add
> extents in blocks before responding to the FM.  Thus complicating the FM.
> 
> It seems 'smarter' devices which could figure this out (not requiring the
> more bit) are the ones which will be developed later.  So it seems the use
> case time line is the opposite of what we need right now.

Once we hit shareable capacity (which the smarter devices will use) then
this become the dominant approach to non contiguous allocations because
you can't add extents with a given tag in multiple goes.

So I'd expect the more flag to be more common not less over time.
> 
> For that reason I'm inclined to try and get this in.
> 

Great - but I'd not worry too much about bad effects if you get invalid
lists from the device.  If the only option is shout and panic, then fine
though I'd imagine we can do slightly better than that, so maybe warn
extensively and don't let the region be used.

Jonathan

> Ira
> 


  reply	other threads:[~2024-05-03  9:20 UTC|newest]

Thread overview: 161+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-24 23:18 [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD) ira.weiny
2024-03-24 23:18 ` [PATCH 01/26] cxl/mbox: Flag " ira.weiny
2024-03-25 16:11   ` Jonathan Cameron
2024-03-25 22:16   ` fan
2024-03-25 22:56   ` Davidlohr Bueso
2024-04-02 22:26     ` Ira Weiny
2024-03-26 16:34   ` Dave Jiang
2024-04-02 22:30     ` Ira Weiny
2024-04-10 18:15   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 02/26] cxl/core: Separate region mode from decoder mode ira.weiny
2024-03-25 16:20   ` Jonathan Cameron
2024-04-02 23:24     ` Ira Weiny
2024-03-25 23:18   ` Davidlohr Bueso
2024-03-28  5:22     ` Ira Weiny
2024-03-28 20:09       ` Dave Jiang
2024-04-02 23:27         ` Ira Weiny
2024-04-24 17:58         ` Ira Weiny
2024-04-02 23:25     ` Ira Weiny
2024-04-10 18:49   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 03/26] cxl/mem: Read dynamic capacity configuration from the device ira.weiny
2024-03-25 17:40   ` Jonathan Cameron
2024-04-03 22:22     ` Ira Weiny
2024-03-25 23:36   ` fan
2024-04-03 22:41     ` Ira Weiny
2024-04-02 11:41   ` Jørgen Hansen
2024-04-05 18:09     ` Ira Weiny
2024-04-09  8:42       ` Jørgen Hansen
2024-04-09  2:00   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 04/26] cxl/region: Add dynamic capacity decoder and region modes ira.weiny
2024-03-25 17:42   ` Jonathan Cameron
2024-03-26 16:17   ` fan
2024-03-27 15:43   ` Dave Jiang
2024-04-05 18:19     ` Ira Weiny
2024-04-06  0:01       ` Dave Jiang
2024-05-14  2:40   ` Zhijian Li (Fujitsu)
2024-03-24 23:18 ` [PATCH 05/26] cxl/core: Simplify cxl_dpa_set_mode() Ira Weiny
2024-03-25 17:46   ` Jonathan Cameron
2024-03-25 21:38   ` Davidlohr Bueso
2024-03-26 16:25   ` fan
2024-03-26 17:46   ` Dave Jiang
2024-04-05 19:21     ` Ira Weiny
2024-04-06  0:02       ` Dave Jiang
2024-04-09  0:43   ` Alison Schofield
2024-05-03 19:09     ` Ira Weiny
2024-05-03 20:33       ` Alison Schofield
2024-05-04  1:19       ` Dan Williams
2024-05-06  4:06         ` Ira Weiny
2024-05-04  4:13       ` Dan Williams
2024-05-06  3:46         ` Ira Weiny
2024-03-24 23:18 ` [PATCH 06/26] cxl/port: Add Dynamic Capacity mode support to endpoint decoders ira.weiny
2024-03-26 16:35   ` fan
2024-04-05 19:50     ` Ira Weiny
2024-03-26 17:58   ` Dave Jiang
2024-04-05 20:34     ` Ira Weiny
2024-04-04  8:32   ` Jonathan Cameron
2024-04-05 20:56     ` Ira Weiny
2024-05-06 16:22       ` Dan Williams
2024-05-10  5:31         ` Ira Weiny
2024-04-10 20:33   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 07/26] cxl/port: Add dynamic capacity size " ira.weiny
2024-04-05 13:54   ` Jonathan Cameron
2024-05-03 17:09     ` Ira Weiny
2024-05-03 17:21       ` Dan Williams
2024-05-06  4:07         ` Ira Weiny
2024-04-10 22:50   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 08/26] cxl/mem: Expose device dynamic capacity capabilities ira.weiny
2024-03-25 23:40   ` Davidlohr Bueso
2024-03-26 18:30     ` fan
2024-04-04  8:44       ` Jonathan Cameron
2024-04-04  8:51   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 09/26] cxl/region: Add Dynamic Capacity CXL region support ira.weiny
2024-03-26 22:31   ` fan
2024-04-10  4:25     ` Ira Weiny
2024-03-27 17:27   ` Dave Jiang
2024-04-10  4:35     ` Ira Weiny
2024-04-04 10:26   ` Jonathan Cameron
2024-04-10  4:40     ` Ira Weiny
2024-03-24 23:18 ` [PATCH 10/26] cxl/events: Factor out event msgnum configuration Ira Weiny
2024-03-27 17:38   ` Dave Jiang
2024-04-04 15:07   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 11/26] cxl/pci: Delay event buffer allocation Ira Weiny
2024-03-25 22:26   ` Davidlohr Bueso
2024-03-27 17:38   ` Dave Jiang
2024-04-04 15:08   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 12/26] cxl/pci: Factor out interrupt policy check Ira Weiny
2024-03-27 17:41   ` Dave Jiang
2024-04-04 15:10   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 13/26] cxl/mem: Configure dynamic capacity interrupts ira.weiny
2024-03-26 23:12   ` fan
2024-04-10  4:48     ` Ira Weiny
2024-03-27 17:54   ` Dave Jiang
2024-04-10  5:26     ` Ira Weiny
2024-04-04 15:22   ` Jonathan Cameron
2024-04-10  5:34     ` Ira Weiny
2024-04-10 23:23   ` Alison Schofield
2024-05-06 16:56   ` Dan Williams
2024-03-24 23:18 ` [PATCH 14/26] cxl/region: Read existing extents on region creation ira.weiny
2024-03-26 23:27   ` fan
2024-04-10  5:46     ` Ira Weiny
2024-03-27 17:45   ` fan
2024-04-10  6:19     ` Ira Weiny
2024-03-27 18:31   ` Dave Jiang
2024-04-10  6:09     ` Ira Weiny
2024-04-02 13:57   ` Jørgen Hansen
2024-04-10  6:29     ` Ira Weiny
2024-04-04 16:04   ` Jonathan Cameron
2024-04-04 16:13   ` Jonathan Cameron
2024-04-10 17:44   ` Alison Schofield
2024-05-06 18:34   ` Dan Williams
2024-06-29  3:47     ` Ira Weiny
2024-03-24 23:18 ` [PATCH 15/26] range: Add range_overlaps() Ira Weiny
2024-03-25 18:33   ` David Sterba
2024-03-25 21:24   ` Davidlohr Bueso
2024-03-26 12:51   ` Johannes Thumshirn
2024-03-27 17:36   ` fan
2024-03-28 20:09   ` Dave Jiang
2024-04-04 16:06   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 16/26] cxl/extent: Realize extent devices ira.weiny
2024-03-27 22:34   ` fan
2024-03-28 21:11   ` Dave Jiang
2024-04-24 19:57     ` Ira Weiny
2024-04-04 16:32   ` Jonathan Cameron
2024-04-30  3:23     ` Ira Weiny
2024-05-02 21:12       ` Dan Williams
2024-05-06  4:35         ` Ira Weiny
2024-04-11  0:09   ` Alison Schofield
2024-05-07  1:30   ` Dan Williams
2024-03-24 23:18 ` [PATCH 17/26] dax/region: Create extent resources on DAX region driver load ira.weiny
2024-04-04 16:36   ` Jonathan Cameron
2024-04-09 16:22   ` fan
2024-05-07  2:31   ` Dan Williams
2024-03-24 23:18 ` [PATCH 18/26] cxl/mem: Handle DCD add & release capacity events ira.weiny
2024-04-04 17:03   ` Jonathan Cameron
2024-05-07  5:04   ` Dan Williams
2024-03-24 23:18 ` [PATCH 19/26] dax/bus: Factor out dev dax resize logic Ira Weiny
2024-04-04 17:15   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 20/26] dax: Document dax dev range tuple Ira Weiny
2024-04-01 17:06   ` Dave Jiang
2024-04-04 17:19   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 21/26] dax/region: Prevent range mapping allocation on sparse regions Ira Weiny
2024-04-01 17:07   ` Dave Jiang
2024-04-10 23:02   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 22/26] dax/region: Support DAX device creation on sparse DAX regions Ira Weiny
2024-04-04 17:36   ` Jonathan Cameron
2024-03-24 23:18 ` [PATCH 23/26] cxl/mem: Trace Dynamic capacity Event Record ira.weiny
2024-04-01 17:56   ` Dave Jiang
2024-04-04 17:38   ` Jonathan Cameron
2024-04-10 17:03   ` Alison Schofield
2024-03-24 23:18 ` [PATCH 24/26] tools/testing/cxl: Make event logs dynamic Ira Weiny
2024-03-24 23:18 ` [PATCH 25/26] tools/testing/cxl: Add DC Regions to mock mem data Ira Weiny
2024-03-24 23:18 ` [PATCH 26/26] tools/testing/cxl: Add Dynamic Capacity events Ira Weiny
2024-03-25 19:24 ` [PATCH 00/26] DCD: Add support for Dynamic Capacity Devices (DCD) fan
2024-03-28  5:20   ` Ira Weiny
2024-04-03 20:39     ` Jonathan Cameron
2024-04-04 10:20 ` Jonathan Cameron
2024-04-04 17:49 ` Jonathan Cameron
2024-05-01 23:49   ` Ira Weiny
2024-05-03  9:20     ` Jonathan Cameron [this message]
2024-05-06  4:24       ` Ira Weiny
2024-05-08 14:43         ` Jonathan Cameron
2024-04-10 18:01 ` Alison Schofield

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240503102051.00004a99@Huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=alison.schofield@intel.com \
    --cc=clm@fb.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=dsterba@suse.com \
    --cc=fan.ni@samsung.com \
    --cc=ira.weiny@intel.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=navneet.singh@intel.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox