Re: [PATCH qemu v4 6/7] hw/cxl: Add Maintenance support

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Jonathan Cameron via <qemu-devel@nongnu.org>
To: <shiju.jose@huawei.com>
Cc: <qemu-devel@nongnu.org>, <linux-cxl@vger.kernel.org>,
	<fan.ni@samsung.com>, <dave@stgolabs.net>, <linuxarm@huawei.com>
Subject: Re: [PATCH qemu v4 6/7] hw/cxl: Add Maintenance support
Date: Fri, 25 Jul 2025 14:26:35 +0100	[thread overview]
Message-ID: <20250725142635.000014fa@huawei.com> (raw)
In-Reply-To: <20250721172228.2118-7-shiju.jose@huawei.com>

On Mon, 21 Jul 2025 18:22:27 +0100
<shiju.jose@huawei.com> wrote:

> From: Davidlohr Bueso <dave@stgolabs.net>

I tweaked the title to mention Post Package Repair.  If anyone is ever
looking for that particular maintenance command they might want to know
it is in here from the title.

> 
> This adds initial support for the Maintenance command, specifically
> the soft and hard PPR operations on a dpa. The implementation allows
> to be executed at runtime, therefore semantically, data is retained
> and CXL.mem requests are correctly processed.
> 
> Keep track of the requests upon a general media or DRAM event.
> 
> Post Package Repair (PPR) maintenance operations may be supported by CXL
> devices that implement CXL.mem protocol. A PPR maintenance operation
> requests the CXL device to perform a repair operation on its media.
> For example, a CXL device with DRAM components that support PPR features
> may implement PPR Maintenance operations. DRAM components may support two
> types of PPR, hard PPR (hPPR), for a permanent row repair, and Soft PPR
> (sPPR), for a temporary row repair. Soft PPR is much faster than hPPR,
> but the repair is lost with a power cycle.
> 
> CXL spec 3.2 section 8.2.10.7.1.2 describes the device's sPPR (soft PPR)
> maintenance operation and section 8.2.10.7.1.3 describes the device's
> hPPR (hard PPR) maintenance operation feature.
> 
> CXL spec 3.2 section 8.2.10.7.2.1 describes the sPPR feature discovery and
> configuration.
> 
> CXL spec 3.2 section 8.2.10.7.2.2 describes the hPPR feature discovery and
> configuration.
> 
> CXL spec 3.2 section 8.2.10.2.1.4 Table 8-60 describes the Memory Sparing
> Event Record.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> Co-developed-by: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
2x SoB for Shiju.

Main question I have on this is why we currently track things marked
for maintenance in injected error records, but don't act on that in any
way.  I'm also thinking we could easily pass on any provided geometry so
the sparing record reflects what was injected if that's where it came from.

If we do PPR on something not injected we might still want to make some
plausible geometry up.

> +static void cxl_mbox_create_mem_sparing_event_records(CXLType3Dev *ct3d,
> +                            uint8_t class, uint8_t sub_class)
> +{
> +    CXLEventSparing event_rec = {};
> +
> +    cxl_assign_event_header(&event_rec.hdr,
> +                            &sparing_uuid,
> +                            (1 << CXL_EVENT_TYPE_INFO),
> +                            sizeof(event_rec),
> +                            cxl_device_get_timestamp(&ct3d->cxl_dstate),
> +                            1, class, 1, sub_class, 0, 0, 0, 0);
> +
> +    event_rec.flags = 0;
> +    event_rec.result = 0;
> +    event_rec.validity_flags = CXL_MSER_VALID_CHANNEL |
> +                               CXL_MSER_VALID_RANK |
> +                               CXL_MSER_VALID_NIB_MASK |
> +                               CXL_MSER_VALID_BANK_GROUP |
> +                               CXL_MSER_VALID_BANK |
> +                               CXL_MSER_VALID_ROW |
> +                               CXL_MSER_VALID_COLUMN |
> +                               CXL_MSER_VALID_SUB_CHANNEL;
> +
> +    event_rec.res_avail = 1;
> +    event_rec.channel = 2;
> +    event_rec.rank = 5;
> +    st24_le_p(event_rec.nibble_mask, 0xA59C);
> +    event_rec.bank_group = 2;
> +    event_rec.bank = 4;
> +    st24_le_p(event_rec.row, 13);
> +    event_rec.column = 23;
> +    event_rec.sub_channel = 7;

At some point we should cycle back and make up some 'geometry' for the
memory so we can map different DPAs to different places.  This is fine
for now though.

> +
> +    if (cxl_event_insert(&ct3d->cxl_dstate,
> +                         CXL_EVENT_TYPE_INFO,
> +                         (CXLEventRecordRaw *)&event_rec)) {
> +        cxl_event_irq_assert(ct3d);
> +    }
> +}
> +
> +
> +static void cxl_perform_ppr(CXLType3Dev *ct3d, uint64_t dpa)
> +{
> +    CXLMaintenance *ent, *next;
> +
> +    QLIST_FOREACH_SAFE(ent, &ct3d->maint_list, node, next) {

If we did want to generate the right geometry to match the injected
event we'd want to retrieve it here (having stashed it in the ent)

> +        if (dpa == ent->dpa) {
> +            QLIST_REMOVE(ent, node);

What is this actually for at the moment?  We track them on a list but
don't enforce anything with it?  I don't think we should enforce this
as you can issue PPR on stuff that was never in error if you like.

> +            g_free(ent);
> +            break;
> +        }
> +    }
> +
> +    /* Produce a Memory Sparing Event Record */
> +    if (ct3d->soft_ppr_attrs.sppr_op_mode &
> +        CXL_MEMDEV_SPPR_OP_MODE_MEM_SPARING_EV_REC_EN) {
> +        cxl_mbox_create_mem_sparing_event_records(ct3d,
> +                                CXL_MEMDEV_MAINT_CLASS_SPARING,
> +                                CXL_MEMDEV_MAINT_SUBCLASS_CACHELINE_SPARING);
> +    }
> +}

>  /* Component ID is device specific.  Define this as a string. */
>  void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>                                          uint32_t flags, bool has_maint_op_class,
> @@ -1715,6 +1756,11 @@ void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>          error_setg(errp, "Unhandled error log type");
>          return;
>      }
> +    if (rc == CXL_EVENT_TYPE_INFO &&
> +        (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED)) {
> +        error_setg(errp, "Informational event cannot require maintenance");
> +        return;
> +    }
>      enc_log = rc;
>  
>      memset(&gem, 0, sizeof(gem));
> @@ -1773,6 +1819,10 @@ void qmp_cxl_inject_general_media_event(const char *path, CxlEventLog log,
>      if (cxl_event_insert(cxlds, enc_log, (CXLEventRecordRaw *)&gem)) {
>          cxl_event_irq_assert(ct3d);
>      }
> +
> +    if (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED) {
> +        cxl_maintenance_insert(ct3d, dpa);

Same as below.

> +    }

>  
>      memset(&dram, 0, sizeof(dram));
> @@ -1935,6 +1990,10 @@ void qmp_cxl_inject_dram_event(const char *path, CxlEventLog log,
>      if (cxl_event_insert(cxlds, enc_log, (CXLEventRecordRaw *)&dram)) {
>          cxl_event_irq_assert(ct3d);
>      }
> +
> +    if (flags & CXL_EVENT_REC_FLAGS_MAINT_NEEDED) {
> +        cxl_maintenance_insert(ct3d, dpa);
We make up the geometry details for the sparing record, but we 'could'
store them here if they were injected and hence spit out an appropriate
sparing record?

Do you think it's worth doing at this stage?

Jonathan

> +    }
>  }
>  
>  #define CXL_MMER_VALID_COMPONENT                        BIT(0)

next prev parent reply	other threads:[~2025-07-25 13:27 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-21 17:22 [PATCH qemu v4 0/7] hw/cxl: Update CXL events to rev3.2 and add maintenance support for memory repair features shiju.jose--- via
2025-07-21 17:22 ` [PATCH qemu v4 1/7] hw/cxl/events: Update for rev3.2 common event record format shiju.jose--- via
2025-07-25 12:45   ` Jonathan Cameron via
2025-08-06  7:45   ` Markus Armbruster
2025-07-21 17:22 ` [PATCH qemu v4 2/7] hw/cxl/events: Updates for rev3.2 general media event record shiju.jose--- via
2025-08-06  7:57   ` Markus Armbruster
2025-08-06  9:21     ` Shiju Jose via
2025-08-06 10:23       ` Markus Armbruster
2025-07-21 17:22 ` [PATCH qemu v4 3/7] hw/cxl/events: Updates for rev3.2 DRAM " shiju.jose--- via
2025-08-06  8:05   ` Markus Armbruster
2025-07-21 17:22 ` [PATCH qemu v4 4/7] hw/cxl/events: Updates for rev3.2 memory module " shiju.jose--- via
2025-08-06  8:08   ` Markus Armbruster
2025-07-21 17:22 ` [PATCH qemu v4 5/7] hw/cxl/cxl-mailbox-utils: Move declaration of scrub and ECS feature attributes in cmd_features_set_feature() shiju.jose--- via
2025-07-21 17:22 ` [PATCH qemu v4 6/7] hw/cxl: Add Maintenance support shiju.jose--- via
2025-07-25 13:26   ` Jonathan Cameron via [this message]
2025-07-21 17:22 ` [PATCH qemu v4 7/7] hw/cxl: Add emulation for memory sparing control feature shiju.jose--- via
2025-07-25 13:31   ` Jonathan Cameron via

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250725142635.000014fa@huawei.com \
    --to=qemu-devel@nongnu.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=dave@stgolabs.net \
    --cc=fan.ni@samsung.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=shiju.jose@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).