From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Gregory Price <gourry@gourry.net>
Cc: <linux-cxl@vger.kernel.org>, <qemu-devel@nongnu.org>,
<svetly.todorov@memverge.com>, <nifan.cxl@gmail.com>
Subject: Re: [PATCH RFC v3 3/3] mhsld: implement MHSLD device
Date: Thu, 12 Dec 2024 17:40:16 +0000 [thread overview]
Message-ID: <20241212174016.0000002a@huawei.com> (raw)
In-Reply-To: <20241018161252.8896-4-gourry@gourry.net>
On Fri, 18 Oct 2024 12:12:52 -0400
Gregory Price <gourry@gourry.net> wrote:
> From: Svetly Todorov <svetly.todorov@memverge.com>
>
> Using a shared-memory bytemap, validates that DC adds, releases,
> and reclamations happen on extents belonging to the appropriate
> host.
>
> The MHSLD device inherits from the CXL_TYPE3 class and adds the following
> configuration options:
As in previous the -- syntax is somewhat confusing as that's
not how the parameters are applied.
> --mhd-head=<u32>
> --mhd-state_file=<str>
> --mhd-init=<bool>
>
> --mhd-head specifies the head ID of the host on the given device.
>
> --mhd-state_file is the name of the shared-memory-backed file used
> to store the MHD state.
>
> --mhd-init indicates whether this QEMU instance should initialize
> the state_file; if so, the instance will create the file if it does
> not exist, ftruncate it to the appropriate size, and initialize its
> header. It is assumed that the --mhd-init instance is run and allowed
> to completely finish configuration before any other guests access the
> shared state.
>
> The shared state file only needs to be intialized once. Even if a guest
> dies without clearing the ownership bits associated with its head-ID,
> future guests with that ID will clear those bits in cxl_mhsld_realize(),
> regardless of whether mhd_init is true or false.
That sounds like a race condition if not all hosts are brought
up before the first add.
>
> The following command line options create an MHSLD with 4GB of
> backing memory, whose state is tracked in /dev/shm/mhd_metadata.
> --mhd-init=true tells this instance to initialize the state as
> described above.
>
> ./qemu-system_x86-64 \
> [... other options ...] \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -object memory-backend-ram,id=mem0,size=4G \
> -device cxl-mhsld,bus=rp0,num-dc-regions=1,volatile-dc-memdev=mem0,id=cxl-mem0,sn=66667,mhd-head=0,mhd-state_file=mhd_metadata,mhd-init=true \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G \
> -qmp unix:/tmp/qmp-sock-1,server,nowait
>
> Once this guest completes setup, other guests looking to access the
> device can be booted with the same configuration options, but with
> --mhd-head != 0,
> --mhd-init=false,
> and a different QMP socket.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Svetly Todorov <svetly.todorov@memverge.com>
A few trivial things inline.
In general the scheme looks workable but I'm not sure the contraints at setup time
etc are suitable for an upstream solution. Certainly a useful tool to have
for kernel development though so I'll try and find time in next few days to apply
this on my gitlab tree.
Longer term I think we need a more complex external program or a main / proxy
type arrangement so that ordering requirements can be enforce and we can have
richer info. Having to chat to each qmp interface independently works fine is
also a bit more complex than I think we would eventually want.
Having a solution in place though will make it much easier to move towards
an eventual upstreamable solution. This is a great place to start from!
Jonathan
> +static bool cxl_mhsld_release_extent(PCIDevice *d, uint64_t dpa, uint64_t len)
> +{
> + cxl_mhsld_state_clear(CXL_MHSLD(d), dpa / MHSLD_BLOCK_SZ,
> + len / MHSLD_BLOCK_SZ);
Trivial but align after (
> + return true;
> +}
> diff --git a/hw/cxl/mhsld/mhsld.h b/hw/cxl/mhsld/mhsld.h
> new file mode 100644
> index 0000000000..e7ead1f0d2
> --- /dev/null
> +++ b/hw/cxl/mhsld/mhsld.h
> @@ -0,0 +1,75 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (c) 2024 MemVerge Inc.
> + *
> + */
> +
> +#ifndef CXL_MHSLD_H
> +#define CXL_MHSLD_H
> +#include <stdint.h>
> +#include "hw/cxl/cxl.h"
> +#include "hw/cxl/cxl_mailbox.h"
> +#include "hw/cxl/cxl_device.h"
> +#include "qemu/units.h"
> +
> +#define MHSLD_BLOCK_SZ (2 * MiB)
> +
> +/*
> + * We limit the number of heads to prevent the shared state
> + * region from becoming a major memory hog. We need 512MB of
> + * memory space to track 8-host ownership of 4GB of memory in
> + * blocks of 2MB. This can change if the block size is increased.
I'm lost what makes up that size?
> + */
> +#define MHSLD_HEADS (8)
> +
> +/*
> + * The shared state cannot have 2 variable sized regions
> + * so we have to max out the ldmap.
> + */
> +typedef struct MHSLDSharedState {
> + uint8_t nr_heads;
> + uint8_t nr_lds;
> + uint8_t ldmap[MHSLD_HEADS];
> + uint64_t nr_blocks;
> + uint8_t blocks[];
> +} MHSLDSharedState;
> +
> +struct CXLMHSLDState {
> + CXLType3Dev ct3d;
> + bool mhd_init;
> + char *mhd_state_file;
> + int mhd_state_fd;
> + size_t mhd_state_size;
> + uint32_t mhd_head;
> + MHSLDSharedState *mhd_state;
> +};
> +
> +struct CXLMHSLDClass {
> + CXLType3Class parent_class;
> +};
> +
> +enum {
> + MHSLD_MHD = 0x55,
> + #define GET_MHD_INFO 0x0
> +};
> +
> +/*
> + * MHD Get Info Command
> + * Returns information the LD's associated with this head
> + */
> +typedef struct MHDGetInfoInput {
> + uint8_t start_ld;
> + uint8_t ldmap_len;
> +} QEMU_PACKED MHDGetInfoInput;
> +
> +typedef struct MHDGetInfoOutput {
> + uint8_t nr_lds;
> + uint8_t nr_heads;
> + uint16_t resv1;
> + uint8_t start_ld;
> + uint8_t ldmap_len;
> + uint16_t resv2;
> + uint8_t ldmap[];
> +} QEMU_PACKED MHDGetInfoOutput;
> +#endif
WARNING: multiple messages have this Message-ID (diff)
From: Jonathan Cameron via <qemu-devel@nongnu.org>
To: Gregory Price <gourry@gourry.net>
Cc: <linux-cxl@vger.kernel.org>, <qemu-devel@nongnu.org>,
<svetly.todorov@memverge.com>, <nifan.cxl@gmail.com>
Subject: Re: [PATCH RFC v3 3/3] mhsld: implement MHSLD device
Date: Thu, 12 Dec 2024 17:40:16 +0000 [thread overview]
Message-ID: <20241212174016.0000002a@huawei.com> (raw)
In-Reply-To: <20241018161252.8896-4-gourry@gourry.net>
On Fri, 18 Oct 2024 12:12:52 -0400
Gregory Price <gourry@gourry.net> wrote:
> From: Svetly Todorov <svetly.todorov@memverge.com>
>
> Using a shared-memory bytemap, validates that DC adds, releases,
> and reclamations happen on extents belonging to the appropriate
> host.
>
> The MHSLD device inherits from the CXL_TYPE3 class and adds the following
> configuration options:
As in previous the -- syntax is somewhat confusing as that's
not how the parameters are applied.
> --mhd-head=<u32>
> --mhd-state_file=<str>
> --mhd-init=<bool>
>
> --mhd-head specifies the head ID of the host on the given device.
>
> --mhd-state_file is the name of the shared-memory-backed file used
> to store the MHD state.
>
> --mhd-init indicates whether this QEMU instance should initialize
> the state_file; if so, the instance will create the file if it does
> not exist, ftruncate it to the appropriate size, and initialize its
> header. It is assumed that the --mhd-init instance is run and allowed
> to completely finish configuration before any other guests access the
> shared state.
>
> The shared state file only needs to be intialized once. Even if a guest
> dies without clearing the ownership bits associated with its head-ID,
> future guests with that ID will clear those bits in cxl_mhsld_realize(),
> regardless of whether mhd_init is true or false.
That sounds like a race condition if not all hosts are brought
up before the first add.
>
> The following command line options create an MHSLD with 4GB of
> backing memory, whose state is tracked in /dev/shm/mhd_metadata.
> --mhd-init=true tells this instance to initialize the state as
> described above.
>
> ./qemu-system_x86-64 \
> [... other options ...] \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
> -object memory-backend-ram,id=mem0,size=4G \
> -device cxl-mhsld,bus=rp0,num-dc-regions=1,volatile-dc-memdev=mem0,id=cxl-mem0,sn=66667,mhd-head=0,mhd-state_file=mhd_metadata,mhd-init=true \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G \
> -qmp unix:/tmp/qmp-sock-1,server,nowait
>
> Once this guest completes setup, other guests looking to access the
> device can be booted with the same configuration options, but with
> --mhd-head != 0,
> --mhd-init=false,
> and a different QMP socket.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Svetly Todorov <svetly.todorov@memverge.com>
A few trivial things inline.
In general the scheme looks workable but I'm not sure the contraints at setup time
etc are suitable for an upstream solution. Certainly a useful tool to have
for kernel development though so I'll try and find time in next few days to apply
this on my gitlab tree.
Longer term I think we need a more complex external program or a main / proxy
type arrangement so that ordering requirements can be enforce and we can have
richer info. Having to chat to each qmp interface independently works fine is
also a bit more complex than I think we would eventually want.
Having a solution in place though will make it much easier to move towards
an eventual upstreamable solution. This is a great place to start from!
Jonathan
> +static bool cxl_mhsld_release_extent(PCIDevice *d, uint64_t dpa, uint64_t len)
> +{
> + cxl_mhsld_state_clear(CXL_MHSLD(d), dpa / MHSLD_BLOCK_SZ,
> + len / MHSLD_BLOCK_SZ);
Trivial but align after (
> + return true;
> +}
> diff --git a/hw/cxl/mhsld/mhsld.h b/hw/cxl/mhsld/mhsld.h
> new file mode 100644
> index 0000000000..e7ead1f0d2
> --- /dev/null
> +++ b/hw/cxl/mhsld/mhsld.h
> @@ -0,0 +1,75 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (c) 2024 MemVerge Inc.
> + *
> + */
> +
> +#ifndef CXL_MHSLD_H
> +#define CXL_MHSLD_H
> +#include <stdint.h>
> +#include "hw/cxl/cxl.h"
> +#include "hw/cxl/cxl_mailbox.h"
> +#include "hw/cxl/cxl_device.h"
> +#include "qemu/units.h"
> +
> +#define MHSLD_BLOCK_SZ (2 * MiB)
> +
> +/*
> + * We limit the number of heads to prevent the shared state
> + * region from becoming a major memory hog. We need 512MB of
> + * memory space to track 8-host ownership of 4GB of memory in
> + * blocks of 2MB. This can change if the block size is increased.
I'm lost what makes up that size?
> + */
> +#define MHSLD_HEADS (8)
> +
> +/*
> + * The shared state cannot have 2 variable sized regions
> + * so we have to max out the ldmap.
> + */
> +typedef struct MHSLDSharedState {
> + uint8_t nr_heads;
> + uint8_t nr_lds;
> + uint8_t ldmap[MHSLD_HEADS];
> + uint64_t nr_blocks;
> + uint8_t blocks[];
> +} MHSLDSharedState;
> +
> +struct CXLMHSLDState {
> + CXLType3Dev ct3d;
> + bool mhd_init;
> + char *mhd_state_file;
> + int mhd_state_fd;
> + size_t mhd_state_size;
> + uint32_t mhd_head;
> + MHSLDSharedState *mhd_state;
> +};
> +
> +struct CXLMHSLDClass {
> + CXLType3Class parent_class;
> +};
> +
> +enum {
> + MHSLD_MHD = 0x55,
> + #define GET_MHD_INFO 0x0
> +};
> +
> +/*
> + * MHD Get Info Command
> + * Returns information the LD's associated with this head
> + */
> +typedef struct MHDGetInfoInput {
> + uint8_t start_ld;
> + uint8_t ldmap_len;
> +} QEMU_PACKED MHDGetInfoInput;
> +
> +typedef struct MHDGetInfoOutput {
> + uint8_t nr_lds;
> + uint8_t nr_heads;
> + uint16_t resv1;
> + uint8_t start_ld;
> + uint8_t ldmap_len;
> + uint16_t resv2;
> + uint8_t ldmap[];
> +} QEMU_PACKED MHDGetInfoOutput;
> +#endif
next prev parent reply other threads:[~2024-12-12 17:40 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-18 16:12 [PATCH RFC v3 0/3] cxl: Multi-headed Single Logical Device (MHSLD) Gregory Price
2024-10-18 16:12 ` [PATCH RFC v3 1/3] cxl-mailbox-utils: move CXLUpdateDCExtentListInPl into header Gregory Price
2024-10-18 16:12 ` [PATCH RFC v3 2/3] cxl_type3: add MHD callbacks Gregory Price
2024-12-12 17:12 ` Jonathan Cameron
2024-12-12 17:12 ` Jonathan Cameron via
2024-10-18 16:12 ` [PATCH RFC v3 3/3] mhsld: implement MHSLD device Gregory Price
2024-12-12 17:40 ` Jonathan Cameron [this message]
2024-12-12 17:40 ` Jonathan Cameron via
2024-12-12 19:52 ` Gregory Price
2024-12-13 16:18 ` Jonathan Cameron
2024-12-13 16:18 ` Jonathan Cameron via
2024-12-13 17:03 ` Gregory Price
2025-01-24 14:12 ` Jonathan Cameron
2025-01-24 14:12 ` Jonathan Cameron via
2025-01-24 15:50 ` Gregory Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241212174016.0000002a@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=gourry@gourry.net \
--cc=linux-cxl@vger.kernel.org \
--cc=nifan.cxl@gmail.com \
--cc=qemu-devel@nongnu.org \
--cc=svetly.todorov@memverge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.