* Re: An issue with x86 tcg and MMIO
       [not found] <78bc53e3-bad3-a5c3-9e53-7a89054aa37a@wdc.com>
@ 2023-02-01 21:50 ` Richard Henderson
  2023-02-02  9:39   ` Jonathan Cameron via
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Henderson @ 2023-02-01 21:50 UTC (permalink / raw)
  To: Jørgen Hansen; +Cc: Ajay Joshi, qemu-devel, Sid Manning

On 2/1/23 05:24, Jørgen Hansen wrote:
> Hello Richard,
> 
> We are using x86 qemu to test some CXL stuff, and in that process we are
> running into an issue with tcg. In qemu, CXL memory is mapped as an MMIO
> region, so when using CXL memory as part of the system memory,
> application code and data can be spread out across a combination of DRAM
> and CXL memory (we are using the Linux tiered-memory NUMA balancing,
> which migrates individual pages between DRAM and CXL memory based on
> access patterns). When we are running memory-intensive applications, we
> hit the following assert in translator_access:
> 
>                /* We cannot handle MMIO as second page. */
>                assert(tb->page_addr[1] != -1);
> 
> introduced in your commit 50627f1b. This happens with existing
> applications and a standard Linux kernel. We discussed this with
> Alistair Francis, and he mentioned that it looks like a load across a
> page boundary is happening, where the first page is DRAM and the
> second page MMIO. We tried - as a workaround - returning NULL instead
> of asserting, to trigger the slow-path processing, and that allows the
> system to make forward progress, but we aren't familiar with tcg, and
> as such aren't sure if that is a correct fix.
> 
> So we'd like to get your input on this - and understand whether the
> above usage isn't supported at all or if there are other possible
> workarounds.

Well, this may answer my question in

https://lore.kernel.org/qemu-devel/1d6b1894-9c45-2d70-abde-9c10c1b3b93f@linaro.org/

as to how this could occur.

Until relatively recently, TCG would refuse to execute out of MMIO *at all*.  This was 
relaxed to support Arm m-profile, which needs to execute a few instructions out of MMIO 
during the boot process, before jumping into flash.

This works by reading one instruction, translating it, executing it, and immediately 
discarding the translation.  It would be possible to adjust the translator to allow the
second half of an instruction to be in MMIO, such that we execute and discard, however...
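
For reference, the workaround described above amounts to something like
this sketch against translator_access() (illustrative only; the exact
context in accel/tcg/translator.c may differ):

                 /* We cannot handle MMIO as second page. */
    -            assert(tb->page_addr[1] != -1);
    +            if (tb->page_addr[1] == -1) {
    +                /* Fall back to slow-path, per-access code loads. */
    +                return NULL;
    +            }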

What is it about CXL that requires modeling with MMIO?  If it is intended to be used 
interchangeably with RAM by the guest, then you really won't like the performance you will 
see with TCG executing out of these regions.

Could memory across the CXL link be modeled as a ROM device, similar to flash?  This does 
not have the same restrictions as MMIO.


r~



* Re: An issue with x86 tcg and MMIO
  2023-02-01 21:50 ` An issue with x86 tcg and MMIO Richard Henderson
@ 2023-02-02  9:39   ` Jonathan Cameron via
  2023-02-02 10:56     ` Richard Henderson
  0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Cameron via @ 2023-02-02  9:39 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Jørgen Hansen, Ajay Joshi, qemu-devel, Sid Manning

On Wed, 1 Feb 2023 11:50:50 -1000
Richard Henderson <richard.henderson@linaro.org> wrote:

> On 2/1/23 05:24, Jørgen Hansen wrote:
> > Hello Richard,
> > 
> > We are using x86 qemu to test some CXL stuff, and in that process we are
> > running into an issue with tcg. In qemu, CXL memory is mapped as an MMIO
> > region, so when using CXL memory as part of the system memory,
> > application code and data can be spread out across a combination of DRAM
> > and CXL memory (we are using the Linux tiered-memory NUMA balancing,
> > which migrates individual pages between DRAM and CXL memory based on
> > access patterns). When we are running memory-intensive applications, we
> > hit the following assert in translator_access:
> > 
> >                /* We cannot handle MMIO as second page. */
> >                assert(tb->page_addr[1] != -1);
> > 
> > introduced in your commit 50627f1b. This happens with existing
> > applications and a standard Linux kernel. We discussed this with
> > Alistair Francis, and he mentioned that it looks like a load across a
> > page boundary is happening, where the first page is DRAM and the
> > second page MMIO. We tried - as a workaround - returning NULL instead
> > of asserting, to trigger the slow-path processing, and that allows the
> > system to make forward progress, but we aren't familiar with tcg, and
> > as such aren't sure if that is a correct fix.
> > 
> > So we'd like to get your input on this - and understand whether the
> > above usage isn't supported at all or if there are other possible
> > workarounds.  
> 
> Well, this may answer my question in
> 
> https://lore.kernel.org/qemu-devel/1d6b1894-9c45-2d70-abde-9c10c1b3b93f@linaro.org/
> 
> as to how this could occur.
> 
> Until relatively recently, TCG would refuse to execute out of MMIO *at all*.  This was 
> relaxed to support Arm m-profile, which needs to execute a few instructions out of MMIO 
> during the boot process, before jumping into flash.
> 
> This works by reading one instruction, translating it, executing it, and immediately 
> discarding the translation.  It would be possible to adjust the translator to allow the
> second half of an instruction to be in MMIO, such that we execute and discard, however...
> 
> What is it about CXL that requires modeling with MMIO?  If it is intended to be used 
> interchangeably with RAM by the guest, then you really won't like the performance you will 
> see with TCG executing out of these regions.

To be honest I wasn't aware of this restriction.

I 'thought' it was necessary to support interleaving as we need some callbacks
in the read / write paths to map through to the right address space.
So performance will suck even if we can model it as memory (I'm not sure
how to do that whilst maintaining the interleaving code).

To have anything approaching accurate modeling we need to apply memory
decoders in the host, host bridge and switches.  We could have faked all
this and mapped directly
through to a single memory backend, but that would then have been useless for actually
testing the interleaving.

Specifically, see what happens in cxl_cfmws_find_device():
https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129
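
The decode boils down to selecting an interleave target from the address
bits.  A minimal sketch of the idea (hypothetical helper, not the actual
function):

    #include "qemu/osdep.h"
    #include "exec/hwaddr.h"

    /*
     * Pick the interleave target for an address inside a fixed memory
     * window.  The real code also walks host bridge and switch HDM
     * decoders; gran is the interleave granularity in bytes (256 at
     * the smallest) and ways the number of interleave ways.
     */
    static unsigned cfmws_target(hwaddr addr, unsigned gran, unsigned ways)
    {
        uint64_t chunk = addr / gran;  /* which granule is this access in? */
        return chunk % ways;           /* round-robin across the targets */
    }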

If we can do that for other types of region then I'm fine with changing it.

> 
> Could memory across the CXL link be modeled as a ROM device, similar to flash?  This does 
> not have the same restrictions as MMIO.

Not sure - if we can do the handling above then sure we could make that change.
I can see there is a path to register the callbacks but I'd kind of assumed
ROM meant read only...

Jonathan

> 
> 
> r~
> 




* Re: An issue with x86 tcg and MMIO
  2023-02-02  9:39   ` Jonathan Cameron via
@ 2023-02-02 10:56     ` Richard Henderson
  2023-02-02 11:39       ` Peter Maydell
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Henderson @ 2023-02-02 10:56 UTC (permalink / raw)
  To: Jonathan Cameron; +Cc: Jørgen Hansen, Ajay Joshi, qemu-devel, Sid Manning

On 2/1/23 23:39, Jonathan Cameron wrote:
> Not sure - if we can do the handling above then sure we could make that change.
> I can see there is a path to register the callbacks but I'd kind of assumed
> ROM meant read only...

I think "romd" means "read mostly".

In the case of flash, I believe that a write changes modes (block erase something 
something) and the region changes state into MMIO.  But normal state is read mode where 
read+execute go through unchallenged.

It has been a long time since I studied how all that works, so I may well have forgotten 
something.


r~



* Re: An issue with x86 tcg and MMIO
  2023-02-02 10:56     ` Richard Henderson
@ 2023-02-02 11:39       ` Peter Maydell
  2023-02-02 12:31         ` Jonathan Cameron via
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Maydell @ 2023-02-02 11:39 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Jonathan Cameron, Jørgen Hansen, Ajay Joshi, qemu-devel,
	Sid Manning

On Thu, 2 Feb 2023 at 10:56, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 2/1/23 23:39, Jonathan Cameron wrote:
> > Not sure - if we can do the handling above then sure we could make that change.
> > I can see there is a path to register the callbacks but I'd kind of assumed
> > ROM meant read only...
>
> I think "romd" means "read mostly".
>
> In the case of flash, I believe that a write changes modes (block erase something
> something) and the region changes state into MMIO.  But normal state is read mode where
> read+execute go through unchallenged.

In QEMU a ROMD MemoryRegion (created by memory_region_init_rom_device())
is a memory region backed by RAM for reads and by callbacks for writes.
(I think ROMD stands for "ROM device".)

You can then use memory_region_rom_device_set_romd() to put the ROMD into
either ROMD mode (the default, reads backed by RAM) or MMIO mode
(reads backed by MMIO callbacks). Writes are always callbacks regardless.
This is mainly meant for flash devices, which are usually reads-as-data
but have a programming mode where you write a command to it and then
read back command results. It's possible to use it for other tricks too.
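
In code, setting one of these up looks roughly like this (a sketch; the
device state, ops and size are made-up placeholders, but both memory API
calls are real):

    memory_region_init_rom_device(&s->mem, OBJECT(s), &flash_ops, s,
                                  "flash", s->size, &error_fatal);
    /* Normal operation: reads and execution come straight from RAM. */

    memory_region_rom_device_set_romd(&s->mem, false);
    /* MMIO mode: reads now hit the flash_ops callbacks, e.g. while a
     * programming command is in flight. */

    memory_region_rom_device_set_romd(&s->mem, true);
    /* Back to ROMD mode once programming completes. */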

When a ROMD is in ROMD mode then execution from it is as fast as execution
from any RAM; when it is in MMIO mode then execution from it is as slow
as execution from any other MMIO-backed MemoryRegion.

Note that AFAIK you can't execute from MMIO at all with KVM (either
ROMD-in-MMIO mode or a plain old MMIO device).

You might want to look at whether QEMU's iommu functionality is helpful
to you -- I'm assuming CXL doesn't do weird stuff on a less-than-page
granularity, and the iommu APIs will let you do "programmatically decide
where this address should actually go". The other option involves
mapping and unmapping MemoryRegions inside a container MR.
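
A minimal sketch of that iommu hook, assuming the decode really can be
done at page granularity (the decode_* helpers are hypothetical):

    static IOMMUTLBEntry cxl_decode_translate(IOMMUMemoryRegion *iommu,
                                              hwaddr addr,
                                              IOMMUAccessFlags flag,
                                              int iommu_idx)
    {
        hwaddr page_mask = TARGET_PAGE_SIZE - 1;

        /* "Programmatically decide where this address should go". */
        return (IOMMUTLBEntry) {
            .target_as = decode_target_as(addr),         /* hypothetical */
            .iova = addr & ~page_mask,
            .translated_addr = decode_addr(addr) & ~page_mask,
            .addr_mask = page_mask,                      /* page-sized */
            .perm = IOMMU_RW,
        };
    }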

thanks
-- PMM



* Re: An issue with x86 tcg and MMIO
  2023-02-02 11:39       ` Peter Maydell
@ 2023-02-02 12:31         ` Jonathan Cameron via
  2023-02-02 13:31           ` Peter Maydell
  0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Cameron via @ 2023-02-02 12:31 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, Jørgen Hansen, Ajay Joshi, qemu-devel,
	Sid Manning

On Thu, 2 Feb 2023 11:39:28 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Thu, 2 Feb 2023 at 10:56, Richard Henderson
> <richard.henderson@linaro.org> wrote:
> >
> > On 2/1/23 23:39, Jonathan Cameron wrote:  
> > > Not sure - if we can do the handling above then sure we could make that change.
> > > I can see there is a path to register the callbacks but I'd kind of assumed
> > > ROM meant read only...  
> >
> > I think "romd" means "read mostly".
> >
> > In the case of flash, I believe that a write changes modes (block erase something
> > something) and the region changes state into MMIO.  But normal state is read mode where
> > read+execute go through unchallenged.  
> 
> In QEMU a ROMD MemoryRegion (created by memory_region_init_rom_device())
> is a memory region backed by RAM for reads and by callbacks for writes.
> (I think ROMD stands for "ROM device".)
> 
> You can then use memory_region_rom_device_set_romd() to put the ROMD into
> either ROMD mode (the default, reads backed by RAM) or MMIO mode
> (reads backed by MMIO callbacks). Writes are always callbacks regardless.
> This is mainly meant for flash devices, which are usually reads-as-data
> but have a programming mode where you write a command to it and then
> read back command results. It's possible to use it for other tricks too.
> 
> When a ROMD is in ROMD mode then execution from it is as fast as execution
> from any RAM; when it is in MMIO mode then execution from it is as slow
> as execution from any other MMIO-backed MemoryRegion.

Thanks for the info - I don't think ROMD helps us much here as we'd need
to be constantly in the MMIO mode as we need the callbacks for both read
and write.

> 
> Note that AFAIK you can't execute from MMIO at all with KVM (either
> ROMD-in-MMIO mode or a plain old MMIO device).

That may not be a significant problem for CXL emulation - though we should
definitely make that restriction clear and it might slow down some testing.
As far as I know there are no use cases beyond testing of software stacks,
and TCG is fine for that.

> 
> You might want to look at whether QEMU's iommu functionality is helpful
> to you -- I'm assuming CXL doesn't do weird stuff on a less-than-page
> granularity, and the iommu APIs will let you do "programmatically decide
> where this address should actually go". The other option involves
> mapping and unmapping MemoryRegions inside a container MR.

Unfortunately it does weird stuff well below page granularity:
interleaving goes down to 256 bytes.

We discussed the memory region approach when this originally came up.
The issue is that we get an insane number of memory regions to support
even basic interleave setups - many millions (e.g. a single 1 GiB window
interleaved at 256-byte granularity already implies ~4 million regions) -
hence doing the address decoding via read and write callbacks at runtime
instead.
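
Roughly: each fixed window is one MMIO region whose callbacks do the
decode per access (a sketch; cxl_cfmws_find_device() is real, the rest
of the plumbing is illustrative):

    static uint64_t cfmws_read(void *opaque, hwaddr addr, unsigned size)
    {
        CXLFixedWindow *fw = opaque;
        PCIDevice *d = cxl_cfmws_find_device(fw, addr);

        if (!d) {
            return 0;      /* no target decoded: placeholder policy */
        }
        /* ... forward to the selected device's backing store at the
         * de-interleaved device address ... */
        return 0;          /* placeholder */
    }

    static const MemoryRegionOps cfmws_ops = {
        .read = cfmws_read,
        .write = cfmws_write,              /* analogous to cfmws_read */
        .endianness = DEVICE_LITTLE_ENDIAN,
    };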

Jonathan

> 
> thanks
> -- PMM




* Re: An issue with x86 tcg and MMIO
  2023-02-02 12:31         ` Jonathan Cameron via
@ 2023-02-02 13:31           ` Peter Maydell
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Maydell @ 2023-02-02 13:31 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Richard Henderson, Jørgen Hansen, Ajay Joshi, qemu-devel,
	Sid Manning

On Thu, 2 Feb 2023 at 12:31, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Thu, 2 Feb 2023 11:39:28 +0000
> Peter Maydell <peter.maydell@linaro.org> wrote:
> > You might want to look at whether QEMU's iommu functionality is helpful
> > to you -- I'm assuming CXL doesn't do weird stuff on a less-than-page
> > granularity, and the iommu APIs will let you do "programmatically decide
> > where this address should actually go". The other option involves
> > mapping and unmapping MemoryRegions inside a container MR.
>
> Unfortunately it does weird stuff well below a page granularity.
> Interleaving is down to 256 bytes.

That's unfortunate...

At any rate, conceptually we ought to be able to execute
an instruction that straddles actual RAM and an MMIO
MemoryRegion; it would be nice if we could fall back to the
execute-and-discard approach for that case, in the same way as
when the entire insn is in MMIO.

-- PMM



end of thread

Thread overview: 6+ messages
     [not found] <78bc53e3-bad3-a5c3-9e53-7a89054aa37a@wdc.com>
2023-02-01 21:50 ` An issue with x86 tcg and MMIO Richard Henderson
2023-02-02  9:39   ` Jonathan Cameron via
2023-02-02 10:56     ` Richard Henderson
2023-02-02 11:39       ` Peter Maydell
2023-02-02 12:31         ` Jonathan Cameron via
2023-02-02 13:31           ` Peter Maydell
