From: Pierrick Bouvier <pierrick.bouvier@linaro.org>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>,
fan.ni@samsung.com, linux-cxl@vger.kernel.org,
qemu-devel@nongnu.org
Cc: "Alex Bennée" <alex.bennee@linaro.org>,
"Alexandre Iooss" <erdnaxe@crans.org>,
"Mahmoud Mandour" <ma.mandourr@gmail.com>,
linuxarm@huawei.com, "Niyas Sait" <niyas.sait@huawei.com>
Subject: Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data.
Date: Fri, 24 Jan 2025 12:55:52 -0800 [thread overview]
Message-ID: <5e0876e8-4c2c-4ba5-86dc-d9ca241b743d@linaro.org> (raw)
In-Reply-To: <20250124172905.84099-1-Jonathan.Cameron@huawei.com>
Hi Jonathan,
thanks for posting this. It's a creative usage of plugins.
I think that your current approach, decoupling plugins, CHMU and the
device model, is a good thing.
I'm not familiar with CXL, but one question that comes to my mind is:
is it mandatory to do this analysis during execution (vs dumping
binary traces from the CHMU and plugin and running the analysis post-execution)?
Regards,
Pierrick
On 1/24/25 09:29, Jonathan Cameron wrote:
> Hi All,
>
> This is an RFC mainly to seek feedback on the approach used, particularly
> the aspect of how to get data from a TCG plugin into a device model.
> Two options that we have tried:
> 1. Socket over which the plugin sends data to an external server
> (as seen here)
> 2. Register and manage a plugin from within a device model
>
> The external server approach keeps things loosely coupled, but at the cost
> of separately maintaining that server, protocol definitions etc and
> some overhead.
> The closely coupled solution is neater, but I suspect might be controversial
> (hence I didn't start with that :)
>
> The code here is at best a PoC to illustrate what we have in mind.
> It's not nice code at all: feature gaps, bugs and all! So whilst
> review is always welcome, I'm not requesting it for now.
>
> Kernel support was posted a while back but was done against fake data
> (still supported here if you don't provide the port parameter to the type3 device)
> https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/
> I'll post a minor update of that driver shortly to take into account
> a few specification clarifications, but it should work with this even
> without those.
>
> Note there are some other patches on the tree I generated this from
> so this may not apply to upstream. Easiest is probably to test
> using gitlab.com/jic23/qemu cxl-2025-01-24
>
> Thanks to Niyas for his suggestions on how to make all this work!
>
> Background
> ----------
>
> What is the Compute eXpress Link Hotness Monitoring unit and what is it for?
> - In a tiered memory equipped server, with the slow tier attached via
>   CXL, the expectation is that a given workload will benefit from putting
>   data that is frequently fetched from memory in lower-latency, directly
>   attached DRAM. Less frequently used data can be served from the CXL
>   attached memory with no significant loss of performance. Any data that
>   is hot enough to almost always be in cache doesn't matter, as it is
>   only fetched from memory occasionally.
> - Working out which memory is best placed where is hard to do, and in
>   some workloads a dynamic problem. As such we need something we can
>   measure to provide some indication of what data is in the wrong place.
> There are existing techniques to do this (page faulting, various
> CPU tracing systems, access bit scanning etc) but they all have significant
> overheads.
> - Monitoring accesses on the CXL device provides a path to getting good
> data without those overheads. These units are known as CXL Hotness
> Monitoring Units or CHMUs. Loosely speaking they count accesses to
> granules of data (e.g. 4KiB pages). Exactly how they do that and
> where they sacrifice data accuracy is an implementation trade-off.
>
> Why do we need a model that gives real data?
> - In general there is a need to develop software on top of these units
> to move data to the right place. Hard to evaluate that if we are making
> up the info on what is 'hot'.
> - Need to allow for a bunch of 'impdef' solutions. Note that CHMU
> in this patch set is an oracle - it has enough counters to count
> every access. That's not realistic but it doesn't get me shouted
> at by our architecture teams for giving away any secrets.
> If we move forward with this, I'll probably implement a limited
> counter + full CAM solution (also unrealistic, but closer to real)
> I'd be very interested in contributions of other approaches (there
> are lots in the literature, under the term 'top-k').
> - Resources will be constrained, so whilst a CHMU might in theory
> allow monitoring everything at once, that will come with a big
> accuracy cost. We need to design the algorithms that give us
> good data given those constraints.
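The limited-counter trade-off sketched in the list above can be illustrated in a few lines. This is an editorial sketch only, not code from the patch set: a tiny tracker in the Space-Saving style (one of the "top-k" approaches from the literature mentioned above), with a made-up granule size and counter budget.

```python
GRANULE_SHIFT = 12   # 4KiB hotness granules (illustrative choice)


class SpaceSavingTracker:
    """Track the (approximately) hottest granules with a fixed
    number of counters, evicting the coldest entry on overflow."""

    def __init__(self, ncounters):
        self.ncounters = ncounters
        self.counts = {}  # granule -> estimated access count

    def record_access(self, paddr):
        g = paddr >> GRANULE_SHIFT
        if g in self.counts:
            self.counts[g] += 1
        elif len(self.counts) < self.ncounters:
            self.counts[g] = 1
        else:
            # Counters exhausted: evict the minimum and let the
            # newcomer inherit its count, per Space-Saving.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[g] = inherited + 1

    def hot_list(self, threshold):
        """Granules whose estimated count meets the hotness threshold."""
        return sorted(g for g, c in self.counts.items() if c >= threshold)


tracker = SpaceSavingTracker(ncounters=4)
for _ in range(100):
    tracker.record_access(0x0000)      # hot granule 0
for _ in range(10):
    tracker.record_access(0x1000)      # warm granule 1
for addr in (0x2000, 0x3000, 0x4000):  # cold traffic forces an eviction
    tracker.record_access(addr)

print(tracker.hot_list(threshold=10))  # → [0, 1]
```

The oracle in the patch set has one counter per granule, so it never takes the eviction path; a real unit would, which is where the accuracy loss the cover letter mentions comes from.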
>
> So we need a solution to explore the design space and develop the software
> to take advantage of this hardware (there are various LSF/MM proposals
> on how to use this and other ways of tracking hotness).
> https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/
> https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/
>
> QEMU plugins give us a way to do this. In particular the existing
> Cache plugin can be easily modified to tell us what memory addresses
> missed at the last level of emulated cache. We can then filter those
> for the memory address range that maps to CXL and feed them to our
> counter implementation. On the other side, each instance of CXL type 3
> device can connect to this server and request hotness monitoring
> services + provide parameters etc. Elements such as list threshold
> management and overflow detection etc are in the CXL HMU QEMU device model.
> As noted above, we have an alternative approach that can closely couple
> things, so the device model registers the plugin directly and there
> is no server.
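The filtering step described above can be sketched as follows. This is a hedged illustration, not the actual plugin code or wire protocol: the record format is an assumption, and the 1TiB window mirrors the missfilterbase/missfiltersize values passed to libcache.so in the command line later in this mail.

```python
import struct

# Values matching the libcache.so parameters used below
# (1099511627776 == 1 << 40, i.e. a 1TiB window at 1TiB).
MISS_FILTER_BASE = 1 << 40
MISS_FILTER_SIZE = 1 << 40


def misses_for_chmu(miss_addrs):
    """Keep only last-level-cache misses that fall inside the CXL
    fixed memory window and pack each as a little-endian u64,
    ready to send to the hotness server over the socket."""
    out = bytearray()
    for addr in miss_addrs:
        if MISS_FILTER_BASE <= addr < MISS_FILTER_BASE + MISS_FILTER_SIZE:
            out += struct.pack("<Q", addr)
    return bytes(out)


payload = misses_for_chmu([0x1000, MISS_FILTER_BASE + 0x2000])
# Only the second address is in the window: one 8-byte record.
```

On the receiving side, the server would shift each address down to a granule number and feed the counter implementation, as in the previous sketch.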
>
> How to use it!
> --------------
>
> It runs a little slow but it runs and generates somewhat plausible outputs.
> I'd definitely suggest running it with the pass through optimization
> patch on the CXL staging tree (and a single direct connected device).
> Your mileage will vary if you try to use other parameters, or
> hotness units beyond the first one (implementation far from complete!)
>
> To run, start the server in contrib/hmu/, providing a port number to
> listen on.
>
> ./chmu 4443
>
> Then launch QEMU with something like the following.
>
> qemu-system-aarch64 -icount shift=1 \
> -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \
> -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \
> -kernel Image \
> -drive if=none,file=full.qcow2,format=qcow2,id=hd \
> -device pcie-root-port,id=root_port1 \
> -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \
> -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \
> -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \
> -object memory-backend-ram,size=4G,id=mem0 \
> -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \
> -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \
> -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \
> -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \
> -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \
> -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0\
> -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
> -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \
> -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \
> -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
> -numa node,nodeid=1 \
> -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \
> -numa node,nodeid=2 \
> -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \
>
> In the guest, create and bind the region - this brings up the CXL memory
> device so accesses go to the memory.
>
> cd /sys/bus/cxl/devices/decoder0.0/
> cat create_ram_region
> echo region0 > create_ram_region
> echo ram > /sys/bus/cxl/devices/decoder2.0/mode
> echo ram > /sys/bus/cxl/devices/decoder3.0/mode
> echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size
> cd /sys/bus/cxl/devices/region0/
> echo 256 > interleave_granularity
> echo 1 > interleave_ways
> echo $((256 << 21)) > size
> echo decoder2.0 > target0
> echo 1 > commit
> echo region0 > /sys/bus/cxl/drivers/cxl_region/bind
>
> Finally start perf with something like:
>
> ./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\
> range_base=0,range_size=4096/ ./stress.sh
>
> where stress.sh is
>
> sleep 2
> numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s
> sleep 2
>
> See the results with
> ./perf report --dump-raw-trace | grep -A 200 HMU
>
> Enjoy and have a good weekend.
>
> Thanks,
>
> Jonathan
>
> Jonathan Cameron (3):
> hw/cxl: Initial CXL Hotness Monitoring Unit Emulation
> plugins: Add cache miss reporting over a socket.
> contrib: Add example hotness monitoring unit server
>
> include/hw/cxl/cxl.h | 1 +
> include/hw/cxl/cxl_chmu.h | 154 ++++++++++++
> include/hw/cxl/cxl_device.h | 13 +-
> include/hw/cxl/cxl_pci.h | 7 +-
> contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++
> contrib/plugins/cache.c | 75 +++++-
> hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++
> hw/mem/cxl_type3.c | 25 +-
> hw/cxl/meson.build | 1 +
> 9 files changed, 1035 insertions(+), 12 deletions(-)
> create mode 100644 include/hw/cxl/cxl_chmu.h
> create mode 100644 contrib/hmu/hmu.c
> create mode 100644 hw/cxl/cxl-chmu.c
>