From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: lokesh jaliminche <lokesh.jaliminche@gmail.com>
Cc: <qemu-devel-request@nongnu.org>, <linux-cxl@vger.kernel.org>, <qemu-devel@nongnu.org>
Subject: Re: Performance Issue with CXL-emulation
Date: Mon, 16 Oct 2023 10:55:38 +0100
Message-ID: <20231016105538.00000de5@Huawei.com>
In-Reply-To: <CAKJOkCoxLG01Dt7xMjOPWRqhyLPuaNGRUaDn-sgAFfhERtAYJA@mail.gmail.com>

On Sun, 15 Oct 2023 10:39:46 -0700
lokesh jaliminche <lokesh.jaliminche@gmail.com> wrote:

> Hi Everyone,
>
> I am facing performance issues while copying data to the CXL device
> (emulated with QEMU). I get approximately 500 KB/sec. Any suggestion
> on how to improve this?

Hi Lokesh,

The focus of the QEMU emulation of CXL devices so far has been
functionality. I'm in favour of work to improve performance, but it
isn't likely to be my focus - I can offer some pointers on where to
look though!

The fundamental problem (probably) is that CXL address decoding for
interleaving happens at sub-page granularity. That means we can't use
page tables to perform the address lookups in hardware. Note this also
has the side effect that KVM won't work if there is any chance you
will run instructions out of the CXL memory - it's fine if you are
interested in data only (DAX etc). (I've had a note in my todo list to
add a warning message about the KVM limitations for a while.)

There have been a few discussions (mostly when we were debugging some
TCG issues and considering KVM support) about how we 'might' be able
to improve this. Those focused on a general 'fix', but there may be
some lower-hanging fruit. The options I think might work are:

1) Special-case configurations where there is no interleave going on.
   I'm not entirely sure how this would fit together, and it won't
   deal with the more interesting cases - if it does work I'd want it
   to be minimally invasive, because those complex cases are the main
   focus of testing etc. There is an extension of this where we handle
   interleave, but only if it is 4k or above (on an appropriately
   configured host).

2) Add a caching layer to the CXL fixed memory windows. That would
   hold copies of a number of recently accessed pages in a software
   cache and set up the mappings for the hardware page table walkers
   to find them. If a page isn't cached we'd trigger a page fault and
   have to bring it into the cache. If the configuration of the
   interleave is touched, all caches would need to be written back
   etc. This would need to be optional, because I don't want to have
   to add cache coherency protocols etc when we add shared memory
   support (fun though it would be ;)

3) Look at the critical paths for lookups in your configuration. Maybe
   we can optimize the address decoders (basically a software TLB for
   HPA to DPA). I've not looked at the performance of those paths.
   For your example the lookup is:

   * CFMWS - nothing to do.
   * Host bridge - nothing to do beyond a sanity check on range, I think.
   * Root port - nothing to do.
   * Type 3 device - basic range match.

So I'm not sure it is worthwhile - but you could do a really simple
test by detecting that no interleave is going on and caching the
offset needed to go from HPA to DPA, plus a device reference, the
first time cxl_cfmws_find_device() is called:

https://elixir.bootlin.com/qemu/latest/source/hw/cxl/cxl-host.c#L129

Then just match on hwaddr on subsequent calls of
cxl_cfmws_find_device() and return the device directly. Maybe also
shortcut lookups in cxl_type3_hpa_to_as_and_dpa(), which does the
endpoint decoding part.
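Very roughly, something like the sketch below. This is illustration
only - the stand-in types, the cache fields, and the
cxl_cfmws_find_device_walk() helper are invented here; the real code
in hw/cxl/cxl-host.c is structured differently.

/*
 * Sketch: single-entry "no interleave" cache in front of the
 * existing decoder walk.  Types are minimal stand-ins, not the
 * real QEMU ones.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;
typedef struct PCIDevice PCIDevice;   /* opaque stand-in */

typedef struct CXLFixedWindow {
    hwaddr base;                      /* HPA base of the window */
    uint64_t size;                    /* size of the window */
    uint32_t num_targets;             /* interleave ways */
    /* Hypothetical cache fields, one set per window: */
    bool cache_valid;                 /* clear whenever decoders change */
    hwaddr cache_base, cache_limit;   /* HPA range the device decodes */
    int64_t cache_hpa_to_dpa;         /* constant offset HPA -> DPA */
    PCIDevice *cache_dev;             /* device found by the last walk */
} CXLFixedWindow;

/* The existing walk: CFMWS -> host bridge -> root port -> type 3,
 * hypothetically extended to also report the HPA->DPA offset. */
PCIDevice *cxl_cfmws_find_device_walk(CXLFixedWindow *fw, hwaddr addr,
                                      int64_t *hpa_to_dpa);

PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
{
    int64_t off;
    PCIDevice *dev;

    /* Fast path: an earlier lookup already resolved this range. */
    if (fw->cache_valid &&
        addr >= fw->cache_base && addr < fw->cache_limit) {
        return fw->cache_dev;
    }

    dev = cxl_cfmws_find_device_walk(fw, addr, &off);

    /*
     * Only safe to cache when there is no interleave, i.e. the
     * whole window decodes to one device with one constant offset.
     */
    if (dev && fw->num_targets == 1) {
        fw->cache_base = fw->base;
        fw->cache_limit = fw->base + fw->size;
        fw->cache_hpa_to_dpa = off;
        fw->cache_dev = dev;
        fw->cache_valid = true;
    }
    return dev;
}

The cache would need invalidating wherever the HDM decoders get
committed or uncommitted, and the same one-entry trick could sit in
cxl_type3_hpa_to_as_and_dpa() for the endpoint side.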
A quick hack like that would let you know if it was worth looking at
something more general. Gut feeling is this last approach might get
you some perf uptick, but it isn't going to solve the fundamental
problem that, in general, we can't do the translation in hardware
(unlike most other memory accesses in QEMU).

Note: I believe all writes to file-backed memory will go all the way
to the file. So you might want to try backing it with RAM instead
(a hypothetical example follows the quoted reproduction steps below),
but as with the above, that's not going to address the fundamental
problem.

Jonathan

> Steps to reproduce:
> ===============
> 1. QEMU command:
>
> sudo /opt/qemu-cxl/bin/qemu-system-x86_64 \
>   -hda ./images/ubuntu-22.04-server-cloudimg-amd64.img \
>   -hdb ./images/user-data.img \
>   -M q35,cxl=on,accel=kvm,nvdimm=on \
>   -smp 16 \
>   -m 16G,maxmem=32G,slots=8 \
>   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/mnt/qemu_files/cxltest.raw,size=256M \
>   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/mnt/qemu_files/lsa.raw,size=256M \
>   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
>   -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
>   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G \
>   -nographic
>
> 2. Configure the device in fsdax mode:
>
> ubuntu@ubuntu:~$ cxl list
> [
>   {
>     "memdevs":[
>       {
>         "memdev":"mem0",
>         "pmem_size":268435456,
>         "serial":0,
>         "host":"0000:0d:00.0"
>       }
>     ]
>   },
>   {
>     "regions":[
>       {
>         "region":"region0",
>         "resource":45365592064,
>         "size":268435456,
>         "type":"pmem",
>         "interleave_ways":1,
>         "interleave_granularity":1024,
>         "decode_state":"commit"
>       }
>     ]
>   }
> ]
>
> 3. Format the device with an ext4 file system in DAX mode.
>
> 4. Write data to the mounted device with dd:
>
> ubuntu@ubuntu:~$ time sudo dd if=/dev/urandom of=/home/ubuntu/mnt/pmem0/test bs=1M count=128
> 128+0 records in
> 128+0 records out
> 134217728 bytes (134 MB, 128 MiB) copied, 244.802 s, 548 kB/s
>
> real    4m4.850s
> user    0m0.014s
> sys     0m0.013s
>
> Thanks & Regards,
> Lokesh
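For the RAM-backed experiment mentioned above, one hypothetical way to
try it would be to swap the first -object line in your command for a
memory-backend-ram (untested here - worth checking that your QEMU
build accepts a ram backend for persistent-memdev):

  -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \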