Re: [PATCH RESEND v2 3/3] target/arm/kvm: Support multiple memory CPERs injection

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, mst@redhat.com,
	anisinha@redhat.com, gengdongjiu1@gmail.com,
	peter.maydell@linaro.org, pbonzini@redhat.com,
	mchehab+huawei@kernel.org, Jonathan.Cameron@huawei.com,
	shan.gavin@gmail.com
Subject: Re: [PATCH RESEND v2 3/3] target/arm/kvm: Support multiple memory CPERs injection
Date: Tue, 4 Nov 2025 09:51:42 +1000	[thread overview]
Message-ID: <1cde8845-d72f-494f-b7b2-3d7329a7d1c0@redhat.com> (raw)
In-Reply-To: <20251103105216.1f4241d7@fedora>

On 11/3/25 7:52 PM, Igor Mammedov wrote:
> On Mon, 3 Nov 2025 09:02:54 +1000
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> On 10/31/25 11:55 PM, Igor Mammedov wrote:
>>> On Sun, 19 Oct 2025 10:36:16 +1000
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>> On 10/18/25 12:27 AM, Igor Mammedov wrote:
>>>>> On Tue,  7 Oct 2025 16:08:10 +1000
>>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>>       
>>>>>> In the combination of 64KB host and 4KB guest, a problematic host page
>>>>>> affects 16x guest pages. In this specific case, it's reasonable to
>>>>>> push 16 consecutive memory CPERs. Otherwise, QEMU can run into core
>>>>>> dump due to the current error can't be delivered as the previous error
>>>>>> isn't acknoledges. It's caused by the nature the host page can be
>>>>>> accessed in parallel due to the mismatched host and guest page sizes.
>>>>>
>>>>> can you explain a bit more what goes wrong?
>>>>>
>>>>> I'm especially interested in parallel access you've mentioned
>>>>> and why batch adding error records is needed
>>>>> as opposed to adding records every time invalid access happens?
>>>>>
>>>>> PS:
>>>>> Assume I don't remember details on how HEST works,
>>>>> Answering it in this format also should improve commit message
>>>>> making it more digestible for uninitiated.
>>>>>       
>>>>
>>>> Thanks for your review and I'm trying to answer your question below. Please let
>>>> me know if there are more questions.
>>>>
>>>> There are two signals (BUS_MCEERR_AR and BUS_MCEERR_AO) and BUS_MCEERR_AR is
>>>> concerned here. This signal BUS_MCEERR_AR is sent by host's stage2 page fault
>>>> handler when the resolved host page has been marked as marked as poisoned.
>>>> The stage2 page fault handler is invoked on every access to the host page.
>>>>
>>>> In the combination where host and guest has 64KB and 4KB separately, A 64KB
>>>> host page corresponds to 16x consecutive 4KB guest pages. It means we're
>>>> accessing the 64KB host page when any of those 16x consecutive 4KB guest pages
>>>> is accessed. In other words, a problematic 64KB host page affects the accesses
>>>> on 16x 4KB guest pages. Those 16x 4KB guest pages can be owned by different
>>>> threads on the guest and they run in parallel, potentially to access those
>>>> 16x 4KB guest pages in parallel. It potentially leading to 16x BUS_MCEERR_AR
>>>> signals at one point.
>>>>
>>>> In current implementation, the error record is built as the following calltrace
>>>> indicates. There are 16 error records in the extreme case (parallel accesses on
>>>> 16x 4KB guest pages, mapped to one 64KB host page). However, we can't handle
>>>> multiple error records at once due to the acknowledgement mechanism in
>>>> ghes_record_cper_errors(). For example, the first error record has been sent,
>>>> but not consumed by the guest yet. We fail to send the second error record.
>>>>
>>>> kvm_arch_on_sigbus_vcpu
>>>>      acpi_ghes_memory_errors
>>>>        ghes_gen_err_data_uncorrectable_recoverable      // Generic Error Data Entry
>>>>        acpi_ghes_build_append_mem_cper                  // Memory Error
>>>>        ghes_record_cper_errors
>>>>          
>>>> So this series improves this situation by simply sending 16x error records in
>>>> one shot for the combination of 64KB host + 4KB guest.
>>>
>>> 1) What I'm concerned about is that it target one specific case only.
>>> Imagine if 1st cpu get error on page1 and another on page2=(page1+host_page_size)
>>> and so on for other CPUs. Then we are back where we were before this series.
>>>
>>> Also in abstract future when ARM gets 1Gb pages, that won't scale well.
>>>
>>> Can we instead of making up CPERs to cover whole host page,
>>> create 1/vcpu GHES source?
>>> That way when vcpu trips over bad page, it would have its own
>>> error status block to put errors in.
>>> That would address [1] and deterministically scale
>>> (well assuming that multiple SEA error sources are possible in theory)
>>>    
>>
>> I think it's a good idea to have individual error source for each vCPU. In this
>> way, the read_ack_reg won't be a limitation. I hope Jonathan is ok to this scheme.
>>
>> Currently, we have two (fixed) error sources like below. I assume the index of
>> the error source per vCPU will starts from (ACPI_HEST_SRC_ID_QMP + 1) based on
>> CPUState::cpu_index.
> 
> I'd suggest ditch cpu index and use arch_id instead.
> 

arch_id, which is returned from CPUClass::get_arch_id(), is uint64_t. The error
source index is uint16_t, as defined in ACPI spec 6.5 (section 18.3.2.7 Generic
Hardware Error Source). So I think CPUState::cpu_index is the appropriate error
source index.

>>
>> enum AcpiGhesSourceID {
>>       ACPI_HEST_SRC_ID_SYNC,
>>       ACPI_HEST_SRC_ID_QMP,       /* Use it only for QMP injected errors */
>> };
>>
>>
>>> PS:
>>> I also wonder what real HW does when it gets in similar situation
>>> (i.e. error status block is not yet acknowledged but another async
>>> error arrived for the same error source)?
>>>    
>>
>> I believe real HW also have this specific issue. As to ARM64, I ever did some
>> google search and was told the error follows the firmware-first policy and
>> handled by a component of trustfirmware-a. However, I was unable to get the
>> source code. So it's hard for me to know how this specific issue is handled
>> there.
> 
> Perhaps Jonathan can help with finding how real hw works around it?
> 
> My idea using per cpu source is just a speculation based on spec
> on how workaround the problem,
> I don't really know if guest OS will be able to handle it (aka,
> need to be tested is it's viable). That also probably was a reason
> in previous review, why should've waited for multiple sources
> support be be merged first before this series.
> 

Well, the point is the guest won't be full functional until the the problematic
host page is isolated and replaced by another host page. To avoid crash the qemu
still gives customer a chance to collect important information from the guest
by luck.

Thanks,
Gavin


>>
>>>>>> Imporve push_ghes_memory_errors() to push 16x consecutive memory CPERs
>>>>>> for this specific case. The maximal error block size is bumped to 4KB,
>>>>>> providing enough storage space for those 16x memory CPERs.
>>>>>>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>> ---
>>>>>>     hw/acpi/ghes.c   |  2 +-
>>>>>>     target/arm/kvm.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>>>>>>     2 files changed, 46 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
>>>>>> index 045b77715f..5c87b3a027 100644
>>>>>> --- a/hw/acpi/ghes.c
>>>>>> +++ b/hw/acpi/ghes.c
>>>>>> @@ -33,7 +33,7 @@
>>>>>>     #define ACPI_HEST_ADDR_FW_CFG_FILE          "etc/acpi_table_hest_addr"
>>>>>>     
>>>>>>     /* The max size in bytes for one error block */
>>>>>> -#define ACPI_GHES_MAX_RAW_DATA_LENGTH   (1 * KiB)
>>>>>> +#define ACPI_GHES_MAX_RAW_DATA_LENGTH   (4 * KiB)
>>>>>>     
>>>>>>     /* Generic Hardware Error Source version 2 */
>>>>>>     #define ACPI_GHES_SOURCE_GENERIC_ERROR_V2   10
>>>>>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>>>>>> index c5d5b3b16e..3ecb85e4b7 100644
>>>>>> --- a/target/arm/kvm.c
>>>>>> +++ b/target/arm/kvm.c
>>>>>> @@ -11,6 +11,7 @@
>>>>>>      */
>>>>>>     
>>>>>>     #include "qemu/osdep.h"
>>>>>> +#include "qemu/units.h"
>>>>>>     #include <sys/ioctl.h>
>>>>>>     
>>>>>>     #include <linux/kvm.h>
>>>>>> @@ -2433,10 +2434,53 @@ static void push_ghes_memory_errors(CPUState *c, AcpiGhesState *ags,
>>>>>>                                         uint64_t paddr)
>>>>>>     {
>>>>>>         GArray *addresses = g_array_new(false, false, sizeof(paddr));
>>>>>> +    uint64_t val, start, end, guest_pgsz, host_pgsz;
>>>>>>         int ret;
>>>>>>     
>>>>>>         kvm_cpu_synchronize_state(c);
>>>>>> -    g_array_append_vals(addresses, &paddr, 1);
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Sort out the guest page size from TCR_EL1, which can be modified
>>>>>> +     * by the guest from time to time. So we have to sort it out dynamically.
>>>>>> +     */
>>>>>> +    ret = read_sys_reg64(c->kvm_fd, &val, ARM64_SYS_REG(3, 0, 2, 0, 2));
>>>>>> +    if (ret) {
>>>>>> +        goto error;
>>>>>> +    }
>>>>>> +
>>>>>> +    switch (extract64(val, 14, 2)) {
>>>>>> +    case 0:
>>>>>> +        guest_pgsz = 4 * KiB;
>>>>>> +        break;
>>>>>> +    case 1:
>>>>>> +        guest_pgsz = 64 * KiB;
>>>>>> +        break;
>>>>>> +    case 2:
>>>>>> +        guest_pgsz = 16 * KiB;
>>>>>> +        break;
>>>>>> +    default:
>>>>>> +        error_report("unknown page size from TCR_EL1 (0x%" PRIx64 ")", val);
>>>>>> +        goto error;
>>>>>> +    }
>>>>>> +
>>>>>> +    host_pgsz = qemu_real_host_page_size();
>>>>>> +    start = paddr & ~(host_pgsz - 1);
>>>>>> +    end = start + host_pgsz;
>>>>>> +    while (start < end) {
>>>>>> +        /*
>>>>>> +         * The precise physical address is provided for the affected
>>>>>> +         * guest page that contains @paddr. Otherwise, the starting
>>>>>> +         * address of the guest page is provided.
>>>>>> +         */
>>>>>> +        if (paddr >= start && paddr < (start + guest_pgsz)) {
>>>>>> +            g_array_append_vals(addresses, &paddr, 1);
>>>>>> +        } else {
>>>>>> +            g_array_append_vals(addresses, &start, 1);
>>>>>> +        }
>>>>>> +
>>>>>> +        start += guest_pgsz;
>>>>>> +    }
>>>>>> +
>>>>>>         ret = acpi_ghes_memory_errors(ags, ACPI_HEST_SRC_ID_SYNC, addresses);
>>>>>>         if (ret) {
>>>>>>             goto error;
>>>>>       
>>>>   
>>>    
>>
>

next prev parent reply	other threads:[~2025-11-03 23:54 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-07  6:08 [PATCH RESEND v2 0/3] target/arm/kvm: Improve memory error handling Gavin Shan
2025-10-07  6:08 ` [PATCH RESEND v2 1/3] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs Gavin Shan
2025-10-31  9:58   ` Jonathan Cameron via
2025-10-31 10:08     ` Jonathan Cameron via
2025-11-02 22:45       ` Gavin Shan
2025-10-31 13:17   ` Igor Mammedov
2025-11-02 22:51     ` Gavin Shan
2025-10-07  6:08 ` [PATCH RESEND v2 2/3] kvm/arm/kvm: Introduce helper push_ghes_memory_errors() Gavin Shan
2025-10-31 10:09   ` Jonathan Cameron via
2025-11-02 23:39     ` Gavin Shan
2025-11-03  9:45       ` Igor Mammedov
2025-10-31 13:25   ` Igor Mammedov
2025-11-02 23:35     ` Gavin Shan
2025-10-07  6:08 ` [PATCH RESEND v2 3/3] target/arm/kvm: Support multiple memory CPERs injection Gavin Shan
2025-10-07 10:57   ` Mauro Carvalho Chehab
2025-10-08  3:57     ` Gavin Shan
2025-10-17 14:27   ` Igor Mammedov
2025-10-19  0:36     ` Gavin Shan
2025-10-31 13:55       ` Igor Mammedov
2025-11-02 23:02         ` Gavin Shan
2025-11-03  9:52           ` Igor Mammedov
2025-11-03 23:51             ` Gavin Shan [this message]
2025-11-06  7:57               ` Igor Mammedov
2025-11-06 21:43                 ` Gavin Shan
2025-11-04 12:21             ` Jonathan Cameron via
2025-11-05  0:40               ` Gavin Shan
2025-11-05  9:02                 ` Jonathan Cameron via
2025-11-07  5:11                   ` Gavin Shan
2025-10-31 10:10   ` Jonathan Cameron via
2025-11-02 23:03     ` Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1cde8845-d72f-494f-b7b2-3d7329a7d1c0@redhat.com \
    --to=gshan@redhat.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=anisinha@redhat.com \
    --cc=gengdongjiu1@gmail.com \
    --cc=imammedo@redhat.com \
    --cc=mchehab+huawei@kernel.org \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=shan.gavin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).