From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: "Ben Widawsky" <ben@bwidawsk.net>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Philippe Mathieu-Daudé" <f4bug@amsat.org>,
	"Qemu Developers" <qemu-devel@nongnu.org>,
	"Igor Mammedov" <imammedo@redhat.com>
Subject: Re: [RFC] Set addresses for memory devices [CXL]
Date: Thu, 28 Jan 2021 10:51:14 +0000
Message-ID: <20210128105114.0000715a@Huawei.com>
In-Reply-To: <CAPcyv4gbfrHM0L8WFU2jKLJw5DFxj5mpEOi62wyxAoKsQLMdhQ@mail.gmail.com>

On Wed, 27 Jan 2021 21:20:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky <ben@bwidawsk.net> wrote:
> >
> > Hi list, Igor.
> >
> > I wanted to get some ideas on how to better handle this. Per the recent
> > discussion [1], it's become clear that there needs to be more thought put into
> > how to manage the address space for CXL memory devices. If you see the
> > discussion on interleave [2] there's a decent diagram for the problem statement.
> >
> > A CXL topology looks just like a PCIe topology. A CXL memory device is a memory
> > expander. It's a byte addressable address range with a combination of persistent
> > and volatile memory. In a CXL capable system, you can effectively think of these
> > things as more configurable NVDIMMs. The memory devices have an interface that
> > allows the OS to program the base physical address range it claims, called an HDM
> > (Host-managed Device Memory) decoder. A larger address range is claimed by a host
> > bridge (or a combination of host bridges in the interleaved case) which is
> > platform specific.
> >
> > Originally, my plan was to create a single memory backend for a "window" and
> > map the devices as subregions inside it. So for example, if you had two devices under a
> > host bridge, each of 256M size, the window would be a memory backend of 512M+ size
> > at some fixed GPA, and those memory devices would each be a subregion of the
> > host bridge's window. I thought this was working in my patch series, but as it
> > turns out, this doesn't actually work as I intended. `info mtree` looks good,
> > but `info memory-devices` doesn't.
> >  
> 
> A couple clarifying questions...
> 
> > So let me list the requirements and hopefully get some feedback on the best way
> > to handle it.
> > 1. A PCIe-like device has a persistent memory region (I don't care about
> > volatile at the moment).  
> 
> What do you mean by "PCIe" like? If it is PCI enumerable by the guest
> it has no business being treated as proper memory because the OS
> rightly assumes that PCIe address space is not I/O coherent to other
> initiators.
> 
> > 2. The physical address base for the memory region is programmable.
> > 3. Memory accesses will support interleaving across multiple host bridges.  
> 
> So, per 1. it would look like a PCIe address space inside QEMU but
> advertised as an I/O coherent platform resource in the guest?

Personally I find it easier to think of these devices as containing:

1) A PCI based configuration interface (in config + bar space).

2) Memory accessed via an entirely separate memory bus -
   the PA translations for which (system address map etc.) happen to
   be controllable via the PCI path.

The memory traffic goes over the PCI wires, but doesn't otherwise obey
any of the rules of PCI: it has its own decode path etc., allowing for interleaving.
From an emulation point of view it might as well be an entirely different bus
(with a similar tree).
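
To make that split concrete, here is a purely illustrative emulation-side
view (the names are mine, not from Ben's series) keeping the two faces of
the device entirely separate:

  #include <stdint.h>

  /* Illustrative only: the two independent faces of a CXL memory device. */
  typedef struct {
      /* 1) PCI-enumerable configuration interface (config space + BARs),
       *    through which the OS programs the decoder described below.    */
      uint64_t reg_bar_base;

      /* 2) The memory itself, reached via the separate decode path; the
       *    base PA is whatever the OS writes into the HDM decoder.       */
      uint64_t hdm_base;
      uint64_t hdm_size;
  } CXLMemDevSketch;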

The host allocates certain windows of PA space for which it routes
PA reads / writes to particular physical ports - beyond that, all the
PA routing to particular memory devices can be programmed at runtime.

Interleave makes this more 'interesting' :)

The host can set certain PA regions to interleave across multiple CXL root ports.
So if base PA = 128G, interleave granularity of 512 bytes, 2-way:

Read to 128G + 0 bytes    -> port 0
Read to 128G + 512 bytes  -> port 1
Read to 128G + 1024 bytes -> port 0, etc.
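
As a rough sketch of that decode arithmetic (purely illustrative C, not code
from any of the series; assumes a power-of-2 granularity and a PA that falls
inside the window):

  #include <stdint.h>

  /* Which root port services a given PA under n-way interleave. */
  static unsigned target_port(uint64_t pa, uint64_t base,
                              uint64_t granularity, unsigned ways)
  {
      uint64_t offset = pa - base;          /* offset within the PA window  */
      return (offset / granularity) % ways; /* round-robin across the ports */
  }

  /* For the example above (base 128G, 512-byte granularity, 2-way):
   *   target_port(base +    0, base, 512, 2) == 0   -> port 0
   *   target_port(base +  512, base, 512, 2) == 1   -> port 1
   *   target_port(base + 1024, base, 512, 2) == 0   -> port 0
   */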

The OS can then put two devices into such a PA region and let them know about
the interleave (via that PCI-based config interface).
If there are switches below those ports, further interleave can occur
as well. It's very flexible.

Of course, others may prefer a different mental model!

Jonathan




