qemu-devel.nongnu.org archive mirror
* [RFC] Set addresses for memory devices [CXL]
From: Ben Widawsky @ 2021-01-28  3:51 UTC
  To: qemu-devel, Igor Mammedov
  Cc: Dan Williams, Philippe Mathieu-Daudé, Markus Armbruster,
	Jonathan Cameron

Hi list, Igor.

I wanted to get some ideas on how to better handle this. Per the recent
discussion [1], it's become clear that there needs to be more thought put into
how to manage the address space for CXL memory devices. If you see the
discussion on interleave [2] there's a decent diagram for the problem statement.

A CXL topology looks just like a PCIe topology. A CXL memory device is a memory
expander: a byte-addressable address range backed by a combination of persistent
and volatile memory. In a CXL-capable system, you can effectively think of these
devices as more configurable NVDIMMs. The memory devices have an interface,
called an HDM (Host-managed Device Memory) decoder, that allows the OS to
program the base physical address range each device claims. A larger address
range is claimed by a host bridge (or a combination of host bridges in the
interleaved case), and that larger range is platform-specific.
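
To make the programming model concrete, here is roughly what an HDM decoder
boils down to from the OS's point of view. The struct layout and field names
below are purely illustrative (this is not the spec register layout); the point
is just that the device exposes a base/size pair plus a commit control that the
OS writes:

#include <stdint.h>

/* Illustrative only -- not the actual CXL register layout. */
struct hdm_decoder {
    uint64_t base;   /* host physical address the device decodes from */
    uint64_t size;   /* length of the claimed range */
    uint32_t ctrl;   /* hypothetical commit/interleave control bits */
};

/* The OS claims a physical address range for the device by programming a
 * base/size pair and then committing the decoder. */
static void hdm_decoder_commit(struct hdm_decoder *dec,
                               uint64_t base, uint64_t size)
{
    dec->base = base;
    dec->size = size;
    dec->ctrl |= 0x1;  /* hypothetical "commit" bit */
}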

Originally, my plan was to create a single memory backend for a "window" and
map the devices as subregions within it. So, for example, if you had two devices
under a host bridge, each 256M in size, the window would be a memory backend of
512M or more at some fixed GPA, and those memory devices would be subregions of
the host bridge's window. I thought this was working in my patch series, but as
it turns out, it doesn't actually work as I intended: `info mtree` looks good,
but `info memory-devices` doesn't.
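
For reference, the shape I had in mind was roughly the following (a sketch
against QEMU's MemoryRegion API; the region names and the helper are made up,
and this is the naive version that renders correctly in `info mtree` but is
invisible to `info memory-devices`):

#include "qemu/osdep.h"
#include "qemu/units.h"
#include "qapi/error.h"
#include "exec/memory.h"

/* Sketch: one fixed-GPA container ("window") per host bridge, with each
 * memory device mapped as a subregion inside it. */
static void cxl_window_map_devices(Object *owner, MemoryRegion *sysmem,
                                   hwaddr window_base)
{
    MemoryRegion *window = g_new0(MemoryRegion, 1);
    MemoryRegion *dev0 = g_new0(MemoryRegion, 1);
    MemoryRegion *dev1 = g_new0(MemoryRegion, 1);

    /* 512M container mapped at a fixed guest physical address. */
    memory_region_init(window, owner, "cxl-hb-window", 512 * MiB);
    memory_region_add_subregion(sysmem, window_base, window);

    /* Two 256M devices placed back to back inside the window. */
    memory_region_init_ram(dev0, owner, "cxl-dev0", 256 * MiB, &error_fatal);
    memory_region_init_ram(dev1, owner, "cxl-dev1", 256 * MiB, &error_fatal);
    memory_region_add_subregion(window, 0, dev0);
    memory_region_add_subregion(window, 256 * MiB, dev1);
}

(My understanding is that `info memory-devices` only enumerates devices
implementing the memory-device interface, which plain subregions like the above
never show up in, hence the mismatch.)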

So let me list the requirements and hopefully get some feedback on the best way
to handle it.
1. A PCIe-like device has a persistent memory region (I don't care about
volatile at the moment).
2. The physical address base for the memory region is programmable.
3. Memory accesses will support interleaving across multiple host bridges.

As far as I can tell, there isn't anything that works quite like this today,
and my attempts so far haven't been correct.

Thanks.
Ben

References:
[1] https://lore.kernel.org/qemu-devel/20210126213013.6v24im4sler3q3am@mail.bwidawsk.net/
[2] https://lore.kernel.org/qemu-devel/c51b000e-80db-40e9-d878-f260c49e4a2e@amsat.org/

Other:
https://lore.kernel.org/qemu-devel/20210105165323.783725-23-ben.widawsky@intel.com/
https://lore.kernel.org/qemu-devel/20210105165323.783725-26-ben.widawsky@intel.com/



* Re: [RFC] Set addresses for memory devices [CXL]
From: Dan Williams @ 2021-01-28  5:20 UTC
  To: Ben Widawsky
  Cc: Markus Armbruster, Philippe Mathieu-Daudé, Qemu Developers,
	Jonathan Cameron, Igor Mammedov

On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky <ben@bwidawsk.net> wrote:
>
> Hi list, Igor.
>
> I wanted to get some ideas on how to better handle this. Per the recent
> discussion [1], it's become clear that there needs to be more thought put into
> how to manage the address space for CXL memory devices. If you see the
> discussion on interleave [2] there's a decent diagram for the problem statement.
>
> A CXL topology looks just like a PCIe topology. A CXL memory device is a memory
> expander: a byte-addressable address range backed by a combination of persistent
> and volatile memory. In a CXL-capable system, you can effectively think of these
> devices as more configurable NVDIMMs. The memory devices have an interface,
> called an HDM (Host-managed Device Memory) decoder, that allows the OS to
> program the base physical address range each device claims. A larger address
> range is claimed by a host bridge (or a combination of host bridges in the
> interleaved case), and that larger range is platform-specific.
>
> Originally, my plan was to create a single memory backend for a "window" and
> map the devices as subregions within it. So, for example, if you had two devices
> under a host bridge, each 256M in size, the window would be a memory backend of
> 512M or more at some fixed GPA, and those memory devices would be subregions of
> the host bridge's window. I thought this was working in my patch series, but as
> it turns out, it doesn't actually work as I intended: `info mtree` looks good,
> but `info memory-devices` doesn't.
>

A couple clarifying questions...

> So let me list the requirements and hopefully get some feedback on the best way
> to handle it.
> 1. A PCIe-like device has a persistent memory region (I don't care about
> volatile at the moment).

What do you mean by "PCIe-like"? If it is PCI-enumerable by the guest,
it has no business being treated as proper memory, because the OS
rightly assumes that PCIe address space is not I/O coherent to other
initiators.

> 2. The physical address base for the memory region is programmable.
> 3. Memory accesses will support interleaving across multiple host bridges.

So, per 1., it would look like a PCIe address space inside QEMU but be
advertised as an I/O-coherent platform resource in the guest?



* Re: [RFC] Set addresses for memory devices [CXL]
From: Jonathan Cameron @ 2021-01-28 10:51 UTC
  To: Dan Williams
  Cc: Ben Widawsky, Markus Armbruster, Philippe Mathieu-Daudé,
	Qemu Developers, Igor Mammedov

On Wed, 27 Jan 2021 21:20:21 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Jan 27, 2021 at 7:52 PM Ben Widawsky <ben@bwidawsk.net> wrote:
> >
> > Hi list, Igor.
> >
> > I wanted to get some ideas on how to better handle this. Per the recent
> > discussion [1], it's become clear that there needs to be more thought put into
> > how to manage the address space for CXL memory devices. If you see the
> > discussion on interleave [2] there's a decent diagram for the problem statement.
> >
> > A CXL topology looks just like a PCIe topology. A CXL memory device is a memory
> > expander: a byte-addressable address range backed by a combination of persistent
> > and volatile memory. In a CXL-capable system, you can effectively think of these
> > devices as more configurable NVDIMMs. The memory devices have an interface,
> > called an HDM (Host-managed Device Memory) decoder, that allows the OS to
> > program the base physical address range each device claims. A larger address
> > range is claimed by a host bridge (or a combination of host bridges in the
> > interleaved case), and that larger range is platform-specific.
> >
> > Originally, my plan was to create a single memory backend for a "window" and
> > map the devices as subregions within it. So, for example, if you had two devices
> > under a host bridge, each 256M in size, the window would be a memory backend of
> > 512M or more at some fixed GPA, and those memory devices would be subregions of
> > the host bridge's window. I thought this was working in my patch series, but as
> > it turns out, it doesn't actually work as I intended: `info mtree` looks good,
> > but `info memory-devices` doesn't.
> >  
> 
> A couple clarifying questions...
> 
> > So let me list the requirements and hopefully get some feedback on the best way
> > to handle it.
> > 1. A PCIe-like device has a persistent memory region (I don't care about
> > volatile at the moment).  
> 
> What do you mean by "PCIe-like"? If it is PCI-enumerable by the guest,
> it has no business being treated as proper memory, because the OS
> rightly assumes that PCIe address space is not I/O coherent to other
> initiators.
> 
> > 2. The physical address base for the memory region is programmable.
> > 3. Memory accesses will support interleaving across multiple host bridges.  
> 
> So, per 1., it would look like a PCIe address space inside QEMU but be
> advertised as an I/O-coherent platform resource in the guest?

Personally I find it easier to think of these devices as containing:

1) A PCI-based configuration interface (in config + BAR space).

2) Memory accessed via an entirely separate memory bus -
   the PA translations for which (system address map etc.) happen to
   be controllable via the PCI path.

The memory traffic goes over the PCI wires, but it doesn't otherwise obey
the rules of PCI: address decode is entirely separate, which is what allows
interleaving. From an emulation point of view it might as well be an entirely
different bus (with a similar tree).

The host allocates certain windows of PA space for which it routes
PA reads/writes to particular physical ports; beyond that, all the
PA routing to particular memory devices can be programmed at runtime.

Interleave makes this more 'interesting' :)

The host can set certain PA regions to interleave across multiple CXL root ports.
So, for example, with a base PA of 128G, a 512-byte interleave granularity, and 2 ways:

Read to 128G + 0 bytes    -> port 0
Read to 128G + 512 bytes  -> port 1
Read to 128G + 1024 bytes -> port 0, and so on.
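
If it helps, the address decode that example implies is just the following
(a tiny sketch; the helper name and constants are only for illustration and
match the numbers above):

#include <stdint.h>

/* Which root port services a host PA, for the 2-way, 512-byte
 * interleave starting at 128G used in the example above. */
static unsigned target_port(uint64_t hpa)
{
    const uint64_t base = 128ULL << 30;   /* 128G window base */
    const unsigned gran_shift = 9;        /* 512-byte granule  */
    const unsigned ways = 2;

    return ((hpa - base) >> gran_shift) % ways;
}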

The OS can then put two devices into such a PA region and let them know about
the interleave (via that PCI-based config interface). If there are switches
below those ports, further interleaving can occur as well. It's very flexible.

Of course, others may prefer a different mental model!

Jonathan