From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: James Morse <james.morse@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
Dongsheng Yang <dongsheng.yang@easystack.cn>,
Gregory Price <gregory.price@memverge.com>,
John Groves <John@groves.net>, <axboe@kernel.dk>,
<linux-block@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-cxl@vger.kernel.org>, <nvdimm@lists.linux.dev>,
Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Date: Tue, 4 Jun 2024 15:26:48 +0100
Message-ID: <20240604152648.000071f8@Huawei.com>
In-Reply-To: <3c7c9b07-78b2-4b8d-968e-0c395c8f22b3@arm.com>
On Mon, 3 Jun 2024 18:28:51 +0100
James Morse <james.morse@arm.com> wrote:
> Hi guys,
>
> On 03/06/2024 13:48, Jonathan Cameron wrote:
> > On Fri, 31 May 2024 20:22:42 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >> Jonathan Cameron wrote:
> >>> On Thu, 30 May 2024 14:59:38 +0800
> >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >>>> 在 2024/5/29 星期三 下午 11:25, Gregory Price 写道:
> >>>>> It's not just a CXL spec issue, though that is part of it. I think the
> >>>>> CXL spec would have to expose some form of puncturing flush, and this
> >>>>> makes the assumption that such a flush doesn't cause some kind of
> >>>>> race/deadlock issue. Certainly this needs to be discussed.
> >>>>>
> >>>>> However, consider that the upstream processor actually has to generate
> >>>>> this flush. This means adding the flush to existing coherence protocols,
> >>>>> or at the very least a new instruction to generate the flush explicitly.
> >>>>> The latter seems more likely than the former.
> >>>>>
> >>>>> This flush would need to ensure the data is forced out of the local WPQ
> >>>>> AND all WPQs south of the PCIE complex - because what you really want to
> >>>>> know is that the data has actually made it back to a place where remote
> >>>>> viewers are capable of perceiving the change.
> >>>>>
> >>>>> So this means:
> >>>>> 1) Spec revision with puncturing flush
> >>>>> 2) Buy-in from CPU vendors to generate such a flush
> >>>>> 3) A new instruction added to the architecture.
> >>>>>
> >>>>> Call me in a decade or so.
> >>>>>
> >>>>>
> >>>>> But really, I think it likely we see hardware-coherence well before this.
> >>>>> For this reason, I have become skeptical of all but a few memory sharing
> >>>>> use cases that depend on software-controlled cache-coherency.
> >>>>
> >>>> Hi Gregory,
> >>>>
> >>>> From my understanding, we actually have the same idea here. What I am
> >>>> saying is that we need the spec to consider this issue, meaning we need
> >>>> to describe how the entire software-coherency mechanism operates, which
> >>>> includes the necessary hardware support. Additionally, I agree that if
> >>>> software-coherency also requires hardware support, it seems that
> >>>> hardware-coherency is the better path.
> >>>>>
> >>>>> There are some (FAMFS, for example). The coherence state of these
> >>>>> systems tends to be less volatile (e.g. mappings are read-only), or
> >>>>> they have inherent design limitations (cacheline-sized message passing
> >>>>> via write-ahead logging only).
> >>>>
> >>>> Can you explain more about this? I understand that if the reader in the
> >>>> writer-reader model is using a read-only mapping, the interaction will be
> >>>> much simpler. However, after the writer writes data, if we don't have a
> >>>> mechanism to flush and invalidate, puncturing all caches, how can the
> >>>> read-only reader access the new data?
> >>>
> >>> There is a mechanism for doing coarse grained flushing that is known to
> >>> work on some architectures. Look at cpu_cache_invalidate_memregion().
> >>> On intel/x86 it's wbinvd_on_all_cpus()
> >>
> >> There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
> >> that a remote shared memory consumer can be assured to see the writes
> >> from that event.
> >
> > I was wondering about that after I wrote this... I guess it guarantees
> > we won't get a late-landing write, or is that not even true?
> >
> > So if we remove memory, then add fresh memory again quickly enough,
> > can we get a left-over write showing up? I guess that doesn't matter, as
> > the kernel will chase it with a memset(0) anyway and that will be ordered
> > with respect to the same address.
> >
> > However, we won't be able to elide that zeroing even if we know the device
> > did it, which makes some operations the device might support rather
> > pointless :(
>
> >>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> >>> public alpha specification for PSCI 1.3 with that defined but we
> >>> don't yet have kernel code.)
>
> I have an RFC for that - but I haven't had time to update and re-test it.
If it's useful, I might be able to find time to take that forward
(or get someone else to do it).
Let me know if that would be helpful; I'd love to add this to the list
of things I can forget about because it just works for the kernel
(and hence is a problem for the firmware and uarch folk).
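
To make that concrete, a minimal sketch of what a caller looks like with
the interface we already have (the helper name below is invented; the
real call sites in drivers/cxl/ carry rather more context):

#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/memregion.h>

/* Hypothetical helper: purge CPU caches before exposing a remapped region. */
static int flush_caches_for_region(void)
{
	/* No architectural big hammer on this platform. */
	if (!cpu_cache_has_invalidate_memregion())
		return -ENXIO;

	/*
	 * On x86 this ends up as wbinvd_on_all_cpus(); on arm64 it would
	 * presumably be backed by the (alpha) PSCI CLEAN_INV_MEMREGION
	 * call once that exists.
	 */
	return cpu_cache_invalidate_memregion(IORES_DESC_CXL);
}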
>
> If you need this, and have a platform where it can be implemented, please get in touch
> with the people that look after the specs to move it along from alpha.
>
>
> >> That punches visibility through CXL shared memory devices?
>
> > It's a draft spec and Mark + James in +CC can hopefully confirm.
> > It does say
> > "Cleans and invalidates all caches, including system caches".
> > which I'd read as meaning it should but good to confirm.
>
> It's intended to remove any cached entries - including lines in what the arm-arm calls
> "invisible" system caches, which typically only platform firmware can touch. The next
> access should have to go all the way to the media. (I don't know enough about CXL to say
> what a remote shared memory consumer observes)
If it's out of the host bridge buffers (and known to have succeeded in write-back), which I
think the host should know, then I believe what happens next is a device implementer's problem.
Hopefully anyone designing a device that does memory sharing has built that part right.
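
For what it's worth, the writer side of the cacheline message-passing
scheme Gregory describes looks roughly like this with today's by-VA
flushes (x86-flavoured; every name below is invented). Note it says
nothing about when the line actually leaves the host bridge buffers,
which is exactly the bit the host can't observe:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/compiler.h>
#include <asm/cacheflush.h>

/* One 64-byte slot in shared memory; assumes cacheline alignment. */
struct msg_slot {
	u64 payload[7];
	u64 valid;
};

static void publish_msg(struct msg_slot *slot, const u64 *data)
{
	memcpy(slot->payload, data, sizeof(slot->payload));
	/* clflush_cache_range() brackets the flush with memory barriers. */
	clflush_cache_range(slot->payload, sizeof(slot->payload));

	/* Only once the payload is flushed, publish and flush the flag. */
	WRITE_ONCE(slot->valid, 1);
	clflush_cache_range(&slot->valid, sizeof(slot->valid));
}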
>
> Without it, all we have are the by-VA operations which are painfully slow for large
> regions, and insufficient for system caches.
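
To put a number on "painfully slow": by-VA maintenance is one instruction
per cacheline. A rough, hypothetical sketch of that loop (assuming 64-byte
lines rather than reading CTR_EL0):

#include <linux/types.h>

/* Clean+invalidate a range to the Point of Coherency, line by line. */
static void clean_inval_by_va(void *addr, size_t size)
{
	const unsigned long line = 64;	/* assumption; really CTR_EL0 */
	unsigned long p = (unsigned long)addr & ~(line - 1);
	unsigned long end = (unsigned long)addr + size;

	for (; p < end; p += line)
		asm volatile("dc civac, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" ::: "memory");
}

A terabyte-scale CXL window is billions of iterations of that, and as you
say it still can't reach the "invisible" system caches.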
>
> As with all those firmware interfaces - it's for the platform implementer to wire up
> whatever is necessary to remove cached content for the specified range. Just because there
> is an (alpha!) spec doesn't mean it can be supported efficiently by a particular platform.
>
>
> >>> These are very big hammers and so unsuited for anything fine grained.
>
> You forgot really ugly too!
I was being polite :)
>
>
> >>> In the extreme end of possible implementations they briefly stop all
> >>> CPUs and clean and invalidate all caches of all types. So not suited
> >>> to anything fine grained, but may be acceptable for a rare setup event,
> >>> particularly if the main job of the writing host is to fill that memory
> >>> for lots of other hosts to use.
> >>>
> >>> At least the ARM one takes a range so allows for a less painful
> >>> implementation.
>
> That is to allow some ranges to fail. (e.g. you can do this to the CXL windows, but not
> the regular DRAM).
>
> On the less painful implementation, arm's interconnect has a gadget that does "Address
> based flush" which could be used here. I'd hope platforms with that don't need to
> interrupt all CPUs - but it depends on what else needs to be done.
>
>
> >>> I'm assuming we'll see new architecture over time
> >>> but this is a different (and potentially easier) problem space
> >>> to what you need.
> >>
> >> cpu_cache_invalidate_memregion() is only about making sure the local CPU
> >> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
> >> get away from that responsibility long term when / if future memory
> >> expanders just issue back-invalidate automatically when the HDM decoder
> >> configuration changes.
> >
> > I would love that to be the way things go, but I fear the overheads of
> > doing that on the protocol mean people will want the option of the painful
> > approach.
>
>
>
> Thanks,
>
> James
Thanks for the info,
Jonathan
>