From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: lokesh jaliminche <lokesh.jaliminche@gmail.com>
Cc: <qemu-devel-request@nongnu.org>, <linux-cxl@vger.kernel.org>, <qemu-devel@nongnu.org>
Subject: Re: Performance Issue with CXL-emulation
Date: Mon, 16 Oct 2023 10:55:38 +0100
Message-ID: <20231016105538.00000de5@Huawei.com>
In-Reply-To: <CAKJOkCoxLG01Dt7xMjOPWRqhyLPuaNGRUaDn-sgAFfhERtAYJA@mail.gmail.com>

On Sun, 15 Oct 2023 10:39:46 -0700
lokesh jaliminche <lokesh.jaliminche@gmail.com> wrote:

> Hi Everyone,
>
> I am facing performance issues while copying data to the CXL device
> (emulated with QEMU). I get approximately 500 KB/sec. Any suggestion
> on how to improve this?

Hi Lokesh,

The focus of the QEMU emulation of CXL devices so far has been
functionality. I'm in favour of work to improve performance, but it
isn't likely to be my focus - I can offer some pointers on where to
look though!

The fundamental problem (probably) is that CXL address decoding for
interleaving happens at sub-page granularity. That means we can't use
page tables to perform the address lookups in hardware. Note this also
has the side effect that KVM won't work if there is any chance you
will run instructions out of the CXL memory - it's fine if you are
interested in data only (DAX etc). (I've had a note in my todo list to
add a warning message about the KVM limitations for a while.)

There have been a few discussions (mostly when we were debugging some
TCG issues and considering KVM support) about how we 'might' be able
to improve this. Those focused on a general 'fix', but there may be
some lower-hanging fruit. The options I think might work are:

1) Special-case configurations where there is no interleave going on.
   I'm not entirely sure how this would fit together, and it won't
   deal with the more interesting cases - if it does work I'd want it
   to be minimally invasive, because those complex cases are the main
   focus of testing etc. There is an extension of this where we handle
   interleave, but only if it is 4k or above (on an appropriately
   configured host).

2) Add a caching layer to the CXL fixed memory windows. That would
   hold copies of a number of recently accessed pages in a software
   cache and set up the mappings for the hardware page table walkers
   to find them. If a page isn't cached we'd trigger a page fault and
   have to bring it into the cache. If the configuration of the
   interleave is touched, all caches would need to be written back
   etc. This would need to be optional, because I don't want to have
   to add cache coherency protocols etc when we add shared memory
   support (fun though it would be ;)

3) Look at the critical paths for lookups in your configuration. Maybe
   we can optimize the address decoders (basically a software TLB for
   HPA to DPA). I've not looked at the performance of those paths.
   For your example the lookup is:

   * CFMWS - nothing to do.
   * Host bridge - nothing to do beyond a sanity check on range, I think.
   * Root port - nothing to do.
   * Type 3 device - basic range match.

So I'm not sure it is worthwhile - but you could do a really simple
test by detecting that no interleave is going on and caching the
offset needed to go from HPA to DPA, plus a device reference, the
first time cxl_cfmws_find_device() is called:

https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129

Then just match on hwaddr on subsequent calls of
cxl_cfmws_find_device() and return the device directly. Maybe also
shortcut lookups in cxl_type3_hpa_to_as_and_dpa(), which does the
endpoint decoding part.
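Very roughly, something like the sketch below. This is illustration
only - the stand-in types, the cache fields, and the
cxl_cfmws_find_device_walk() helper are invented here; the real code
in hw/cxl/cxl-host.c is structured differently.

/*
 * Sketch: single-entry "no interleave" cache in front of the
 * existing decoder walk.  Types are minimal stand-ins, not the
 * real QEMU ones.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;
typedef struct PCIDevice PCIDevice;   /* opaque stand-in */

typedef struct CXLFixedWindow {
    hwaddr base;                      /* HPA base of the window */
    uint64_t size;                    /* size of the window */
    uint32_t num_targets;             /* interleave ways */
    /* Hypothetical cache fields, one set per window: */
    bool cache_valid;                 /* clear whenever decoders change */
    hwaddr cache_base, cache_limit;   /* HPA range the device decodes */
    int64_t cache_hpa_to_dpa;         /* constant offset HPA -> DPA */
    PCIDevice *cache_dev;             /* device found by the last walk */
} CXLFixedWindow;

/* The existing walk: CFMWS -> host bridge -> root port -> type 3,
 * hypothetically extended to also report the HPA->DPA offset. */
PCIDevice *cxl_cfmws_find_device_walk(CXLFixedWindow *fw, hwaddr addr,
                                      int64_t *hpa_to_dpa);

PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
{
    int64_t off;
    PCIDevice *dev;

    /* Fast path: an earlier lookup already resolved this range. */
    if (fw->cache_valid &&
        addr >= fw->cache_base && addr < fw->cache_limit) {
        return fw->cache_dev;
    }

    dev = cxl_cfmws_find_device_walk(fw, addr, &off);

    /*
     * Only safe to cache when there is no interleave, i.e. the
     * whole window decodes to one device with one constant offset.
     */
    if (dev && fw->num_targets == 1) {
        fw->cache_base = fw->base;
        fw->cache_limit = fw->base + fw->size;
        fw->cache_hpa_to_dpa = off;
        fw->cache_dev = dev;
        fw->cache_valid = true;
    }
    return dev;
}

The cache would need invalidating wherever the HDM decoders get
committed or uncommitted, and the same one-entry trick could sit in
cxl_type3_hpa_to_as_and_dpa() for the endpoint side.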
A quick hack like that would let you know if it was worth looking at
something more general. Gut feeling is this last approach might get
you some perf uptick, but it isn't going to solve the fundamental
problem that, in general, we can't do the translation in hardware
(unlike most other memory accesses in QEMU).

Note: I believe all writes to file-backed memory will go all the way
to the file. So you might want to try backing it with RAM instead
(a hypothetical example follows the quoted reproduction steps below),
but as with the above, that's not going to address the fundamental
problem.

Jonathan

> Steps to reproduce:
> ===============
> 1. QEMU command:
>
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
>   -hda ./images/ubuntu-22.04-server-cloudimg-amd64.img \
>   -hdb ./images/user-data.img \
>   -M q35,cxl=on,accel=kvm,nvdimm=on \
>   -smp 16 \
>   -m 16G,maxmem=32G,slots=8 \
>   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/mnt/qemu_files/cxltest.raw,size=256M \
>   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/mnt/qemu_files/lsa.raw,size=256M \
>   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
>   -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
>   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
>   -nographic
>
> 2. Configure the device in fsdax mode:
>
> ubuntu@ubuntu:~$ cxl list
> [
>   {
>     "memdevs":[
>       {
>         "memdev":"mem0",
>         "pmem_size":268435456,
>         "serial":0,
>         "host":"0000:0d:00.0"
>       }
>     ]
>   },
>   {
>     "regions":[
>       {
>         "region":"region0",
>         "resource":45365592064,
>         "size":268435456,
>         "type":"pmem",
>         "interleave_ways":1,
>         "interleave_granularity":1024,
>         "decode_state":"commit"
>       }
>     ]
>   }
> ]
>
> 3. Format the device with an ext4 file system in DAX mode.
>
> 4. Write data to the mounted device with dd:
>
> ubuntu@ubuntu:~$ time sudo dd if=/dev/urandom of=/home/ubuntu/mnt/pmem0/test bs=1M count=128
> 128+0 records in
> 128+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 244.802 s, 548 kB/s
>
> real    4m4.850s
> user    0m0.014s
> sys     0m0.013s
>
> Thanks & Regards,
> Lokesh
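For the RAM-backed experiment mentioned above, one hypothetical way to
try it would be to swap the first -object line in your command for a
memory-backend-ram (untested here - worth checking that your QEMU
build accepts a ram backend for persistent-memdev):

  -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \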