From: Jonathan Cameron via <qemu-devel@nongnu.org>
To: Peter Xu <peterx@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
"Peter Maydell" <peter.maydell@linaro.org>,
"Ben Widawsky" <ben.widawsky@intel.com>,
qemu-devel@nongnu.org, "Samarth Saxena" <samarths@cadence.com>,
"Chris Browy" <cbrowy@avery-design.com>,
linuxarm@huawei.com, linux-cxl@vger.kernel.org,
"Markus Armbruster" <armbru@redhat.com>,
"Shreyas Shah" <shreyas.shah@elastics.cloud>,
"Saransh Gupta1" <saransh@ibm.com>,
"Shameerali Kolothum Thodi"
<shameerali.kolothum.thodi@huawei.com>,
"Marcel Apfelbaum" <marcel@redhat.com>,
"Igor Mammedov" <imammedo@redhat.com>,
"Dan Williams" <dan.j.williams@intel.com>,
"Alex Bennée" <alex.bennee@linaro.org>,
"Philippe Mathieu-Daudé" <f4bug@amsat.org>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"David Hildenbrand" <david@redhat.com>
Subject: Re: [PATCH v7 00/46] CXl 2.0 emulation Support
Date: Wed, 16 Mar 2022 16:50:34 +0000 [thread overview]
Message-ID: <20220316165034.000037e7@Huawei.com> (raw)
In-Reply-To: <Yimwjtd8SsVLOU5e@xz-m1.local>
On Thu, 10 Mar 2022 16:02:22 +0800
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Mar 09, 2022 at 11:28:27AM +0000, Jonathan Cameron wrote:
> > Hi Peter,
>
> Hi, Jonathan,
>
> >
> > >
> > > https://lore.kernel.org/qemu-devel/20220306174137.5707-35-Jonathan.Cameron@huawei.com/
> > >
> > > Having mr->ops set but with memory_access_is_direct() returning true sounds
> > > weird to me.
> > >
> > > Sorry to have no understanding of the whole picture, but.. could you share
> > > more on what's the interleaving requirement on the proxying, and why it
> > > can't be done with adding some IO memory regions as sub-regions upon the
> > > file one?
> >
> > The proxying requirement is simply a means to read/write to a computed address
> > within a memory region. There may well be a better way to do that.
> >
> > If I understand your suggestion correctly you would need a very high
> > number of IO memory regions to be created dynamically when particular sets of
> > registers across multiple devices in the topology are all programmed.
> >
> > The interleave can be 256 bytes across up to 16x, many terabyte, devices.
> > So assuming a simple set of 16 1TB devices I think you'd need about 4x10^9
> > IO regions. Even for a minimal useful test case of largest interleave
> > set of 16x 256MB devices (256MB is minimum size the specification allows per
> > decoded region at the device) and 16 way interleave we'd need 10^6 IO regions.
> > Any idea if that approach would scale sensibly to this number of regions?
> >
> > There are also complexities to getting all the information in one place to
> > work out which IO memory regions maps where in PA space. Current solution is
> > to do that mapping in the same way the hardware does which is hierarchical,
> > so we walk the path to the device, picking directions based on each interleave
> > decoder that we meet.
> > Obviously this is a bit slow but I only really care about correctness at the
> > moment. I can think of various approaches to speeding it up but I'm not sure
> > if we will ever care about performance.
> >
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/cxl/cxl-host.c#L131
> > has the logic for that and as you can see it's fairly simple because we are always
> > going down the topology following the decoders.
> >
> > Below I have mapped out an algorithm I think would work for doing it with
> > IO memory regions as subregions.
> >
> > We could fake the whole thing by limiting ourselves to small host
> > memory windows which are always directly backed, but then I wouldn't
> > achieve the main aim of this which is to provide a test base for the OS code.
> > To do that I need real interleave so I can seed the files with test patterns
> > and verify the accesses hit the correct locations. Emulating what the hardware
> > is actually doing on a device by device basis is the easiest way I have
> > come up with to do that.
> >
> > Let me try to provide some more background so you hopefully don't have
> > to have read the specs to follow what is going on!
> > There are an example for directly connected (no switches) topology in the
> > docs
> >
> > https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/docs/system/devices/cxl.rst
> >
> > The overall picture is we have a large number of CXL Type 3 memory devices,
> > which at runtime (by OS at boot/on hotplug) are configured into various
> > interleaving sets with hierarchical decoding at the host + host bridge
> > + switch levels. For test setups I probably need to go to around 32 devices
> > so I can hit various configurations simultaneously.
> > No individual device has visibility of the full interleave setup - hence
> > the walk in the existing code through the various decoders to find the
> > final Device Physical address.
> >
> > At the host level the host provides a set of Physical Address windows with
> > a fixed interleave decoding across the different host bridges in the system
> > (CXL Fixed Memory windows, CFMWs)
> > On a real system these have to be large enough to allow for any memory
> > devices that might be hotplugged and all possible configurations (so
> > with 2 host bridges you need at least 3 windows in the many TB range,
> > much worse as the number of host bridges goes up). It'll be worse than
> > this when we have QoS groups, but the current Qemu code just puts all
> > the windows in group 0. Hence my first thought of just putting memory
> > behind those doesn't scale (a similar approach to this was in the
> > earliest versions of this patch set - though the full access path
> > wasn't wired up).
> >
> > The granularity can be in powers of 2 from 256 bytes to 16 kbytes
> >
> > Next each host bridge has programmable address decoders which take the
> > incoming (often already interleaved) memory access and direct them to
> > appropriate root ports. The root ports can be connected to a switch
> > which has additional address decoders in the upstream port to decide
> > which downstream port to route to. Note we currently only support 1 level
> > of switches but it's easy to make this algorithm recursive to support
> > multiple switch levels (currently the kernel proposals only support 1 level)
> >
> > Finally the End Point with the actual memory receives the interleaved request and
> > takes the full address and (for power of 2 decoding - we don't yet support
> > 3,6 and 12 way which is more complex and there is no kernel support yet)
> > it drops a few address bits and adds an offset for the decoder used to
> > calculate it's own device physical address. Note device will support
> > multiple interleave sets for different parts of it's file once we add
> > multiple decoder support (on the todo list).
> >
> > So the current solution is straight forward (with the exception of that
> > proxying) because it follows the same decoding as used in real hardware
> > to route the memory accesses. As a result we get a read/write to a
> > device physical address and hence proxy that. If any of the decoders
> > along the path are not configured then we error out at that stage.
> >
> > To create the equivalent as IO subregions I think we'd have to do the
> > following from (this might be mediated by some central entity that
> > doesn't currently exist, or done on demand from which ever CXL device
> > happens to have it's decoder set up last)
> >
> > 1) Wait for a decoder commit (enable) on any component. Goto 2.
> > 2) Walk the topology (up to host decoder, down to memory device)
> > If a complete interleaving path has been configured -
> > i.e. we have committed decoders all the way to the memory
> > device goto step 3, otherwise return to step 1 to wait for
> > more decoders to be committed.
> > 3) For the memory region being supplied by the memory device,
> > add subregions to map the device physical address (address
> > in the file) for each interleave stride to the appropriate
> > host Physical Address.
> > 4) Return to step 1 to wait for more decoders to commit.
> >
> > So summary is we can do it with IO regions, but there are a lot of them
> > and the setup is somewhat complex as we don't have one single point in
> > time where we know all the necessary information is available to compute
> > the right addresses.
> >
> > Looking forward to your suggestions if I haven't caused more confusion!
Hi Peter,
>
> Thanks for the write up - I must confess they're a lot! :)
>
> I merely only learned what is CXL today, and I'm not very experienced on
> device modeling either, so please bare with me with stupid questions..
>
> IIUC so far CXL traps these memory accesses using CXLFixedWindow.mr.
> That's a normal IO region, which looks very reasonable.
>
> However I'm confused why patch "RFC: softmmu/memory: Add ops to
> memory_region_ram_init_from_file" helped.
>
> Per my knowledge, all the memory accesses upon this CFMW window trapped
> using this IO region already. There can be multiple memory file objects
> underneath, and when read/write happens the object will be decoded from
> cxl_cfmws_find_device() as you referenced.
Yes.
>
> However I see nowhere that these memory objects got mapped as sub-regions
> into parent (CXLFixedWindow.mr). Then I don't understand why they cannot
> be trapped.
AS you note they aren't mapped into the parent mr, hence we are trapping.
The parent mem_ops are responsible for decoding the 'which device' +
'what address in device memory space'. Once we've gotten that info
the question is how do I actually do the access?
Mapping as subregions seems unwise due to the huge number required.
>
> To ask in another way: what will happen if you simply revert this RFC
> patch? What will go wrong?
The call to memory_region_dispatch_read()
https://gitlab.com/jic23/qemu/-/blob/cxl-v7-draft-2-for-test/hw/mem/cxl_type3.c#L556
would call memory_region_access_valid() that calls
mr->ops->valid.accepts() which is set to
unassigned_mem_accepts() and hence...
you get back a MEMTX_DECODE_ERROR back and an exception in the
guest.
That wouldn't happen with a non proxied access to the ram as
those paths never uses the ops as memory_access_is_direct() is called
and simply memcpy used without any involvement of the ops.
Is a better way to proxy those writes to the backing files?
I was fishing a bit in the dark here and saw the existing ops defined
for a different purpose for VFIO
4a2e242bbb ("memory Don't use memcpy for ram_device regions")
and those allowed the use of memory_region_dispatch_write() to work.
Hence the RFC marking on that patch :)
Thanks,
Jonathan
>
> Thanks,
>
next prev parent reply other threads:[~2022-03-16 16:53 UTC|newest]
Thread overview: 62+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-06 17:40 [PATCH v7 00/46] CXl 2.0 emulation Support Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 01/46] hw/pci/cxl: Add a CXL component type (interface) Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 02/46] hw/cxl/component: Introduce CXL components (8.1.x, 8.2.5) Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 03/46] MAINTAINERS: Add entry for Compute Express Link Emulation Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 04/46] hw/cxl/device: Introduce a CXL device (8.2.8) Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 05/46] hw/cxl/device: Implement the CAP array (8.2.8.1-2) Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 06/46] hw/cxl/device: Implement basic mailbox (8.2.8.4) Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 07/46] hw/cxl/device: Add memory device utilities Jonathan Cameron via
2022-03-06 17:40 ` [PATCH v7 08/46] hw/cxl/device: Add cheap EVENTS implementation (8.2.9.1) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 09/46] hw/cxl/device: Timestamp implementation (8.2.9.3) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 10/46] hw/cxl/device: Add log commands (8.2.9.4) + CEL Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 11/46] hw/pxb: Use a type for realizing expanders Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 12/46] hw/pci/cxl: Create a CXL bus type Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 13/46] cxl: Machine level control on whether CXL support is enabled Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 14/46] hw/pxb: Allow creation of a CXL PXB (host bridge) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 15/46] qtest/cxl: Introduce initial test for pxb-cxl only Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 16/46] hw/cxl/rp: Add a root port Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 17/46] hw/cxl/device: Add a memory device (8.2.8.5) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 18/46] hw/cxl/device: Implement MMIO HDM decoding (8.2.5.12) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 19/46] hw/cxl/device: Add some trivial commands Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 20/46] hw/cxl/device: Plumb real Label Storage Area (LSA) sizing Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 21/46] hw/cxl/device: Implement get/set Label Storage Area (LSA) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 22/46] qtests/cxl: Add initial root port and CXL type3 tests Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 23/46] hw/cxl/component: Implement host bridge MMIO (8.2.5, table 142) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 24/46] acpi/cxl: Add _OSC implementation (9.14.2) Jonathan Cameron via
2022-03-06 21:31 ` Michael S. Tsirkin
2022-03-07 17:01 ` Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 25/46] acpi/cxl: Create the CEDT (9.14.1) Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 26/46] hw/cxl/component: Add utils for interleave parameter encoding/decoding Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 27/46] hw/cxl/host: Add support for CXL Fixed Memory Windows Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 28/46] acpi/cxl: Introduce CFMWS structures in CEDT Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 29/46] hw/pci-host/gpex-acpi: Add support for dsdt construction for pxb-cxl Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 30/46] pci/pcie_port: Add pci_find_port_by_pn() Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 31/46] CXL/cxl_component: Add cxl_get_hb_cstate() Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 32/46] mem/cxl_type3: Add read and write functions for associated hostmem Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 33/46] cxl/cxl-host: Add memops for CFMWS region Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 34/46] RFC: softmmu/memory: Add ops to memory_region_ram_init_from_file Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 35/46] hw/cxl/component Add a dumb HDM decoder handler Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 36/46] i386/pc: Enable CXL fixed memory windows Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 37/46] tests/acpi: q35: Allow addition of a CXL test Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 38/46] qtests/bios-tables-test: Add a test for CXL emulation Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 39/46] tests/acpi: Add tables " Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 40/46] qtest/cxl: Add more complex test cases with CFMWs Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 41/46] hw/arm/virt: Basic CXL enablement on pci_expander_bridge instances pxb-cxl Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 42/46] qtest/cxl: Add aarch64 virt test for CXL Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 43/46] docs/cxl: Add initial Compute eXpress Link (CXL) documentation Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 44/46] pci-bridge/cxl_upstream: Add a CXL switch upstream port Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 45/46] pci-bridge/cxl_downstream: Add a CXL switch downstream port Jonathan Cameron via
2022-03-06 17:41 ` [PATCH v7 46/46] cxl/cxl-host: Support interleave decoding with one level of switches Jonathan Cameron via
2022-03-06 21:33 ` [PATCH v7 00/46] CXl 2.0 emulation Support Michael S. Tsirkin
2022-03-07 9:39 ` Jonathan Cameron via
2022-03-09 8:15 ` Peter Xu
2022-03-09 11:28 ` Jonathan Cameron via
2022-03-10 8:02 ` Peter Xu
2022-03-16 16:50 ` Jonathan Cameron via [this message]
2022-03-16 17:16 ` Mark Cave-Ayland
2022-03-16 17:58 ` Jonathan Cameron via
2022-03-16 18:26 ` Jonathan Cameron via
2022-03-17 8:12 ` Mark Cave-Ayland
2022-03-17 16:47 ` Jonathan Cameron via
2022-03-18 8:14 ` Mark Cave-Ayland
2022-03-18 10:08 ` Jonathan Cameron via
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220316165034.000037e7@Huawei.com \
--to=qemu-devel@nongnu.org \
--cc=Jonathan.Cameron@Huawei.com \
--cc=alex.bennee@linaro.org \
--cc=armbru@redhat.com \
--cc=ben.widawsky@intel.com \
--cc=cbrowy@avery-design.com \
--cc=dan.j.williams@intel.com \
--cc=david@redhat.com \
--cc=f4bug@amsat.org \
--cc=imammedo@redhat.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=marcel@redhat.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=peterx@redhat.com \
--cc=samarths@cadence.com \
--cc=saransh@ibm.com \
--cc=shameerali.kolothum.thodi@huawei.com \
--cc=shreyas.shah@elastics.cloud \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).