Re: [PATCH 4/4] target/arm: Retry pushing CPER error if necessary

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, mst@redhat.com,
	anisinha@redhat.com, gengdongjiu1@gmail.com,
	peter.maydell@linaro.org, pbonzini@redhat.com,
	shan.gavin@gmail.com,
	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Subject: Re: [PATCH 4/4] target/arm: Retry pushing CPER error if necessary
Date: Fri, 21 Feb 2025 15:27:36 +1000	[thread overview]
Message-ID: <7caa54df-abe1-4833-bb59-cb83f8241962@redhat.com> (raw)
In-Reply-To: <20250219185518.767a48d9@imammedo.users.ipa.redhat.com>

On 2/20/25 3:55 AM, Igor Mammedov wrote:
> On Fri, 14 Feb 2025 14:16:35 +1000
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> The error -1 is returned if the previously reported CPER error
>> hasn't been claimed. The virtual machine is terminated due to
>> abort(). It's conflicting to the ideal behaviour that the affected
>> vCPU retries pushing the CPER error in this case since the vCPU
>> can't proceed its execution.
>>
>> Move the chunk of code to push CPER error to a separate helper
>> report_memory_errors() and retry the request when the return
>> value from acpi_ghes_memory_errors() is greater than zero.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   target/arm/kvm.c | 31 +++++++++++++++++++++++++------
>>   1 file changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>> index 5c0bf99aec..9f063f6053 100644
>> --- a/target/arm/kvm.c
>> +++ b/target/arm/kvm.c
>> @@ -2362,6 +2362,30 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
>>       return ret;
>>   }
>>   
>> +static void report_memory_error(CPUState *c, hwaddr paddr)
>> +{
>> +    int ret;
>> +
>> +    while (true) {
>> +        /* Retry if the previously report error hasn't been claimed */
>> +        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
>> +        if (ret <= 0) {
>> +            break;
>> +        }
>> +
>> +        bql_unlock();
>> +        g_usleep(1000);

Igor, thanks for the detailed comments. Sorry for a bit delay of the reply, I
was checking the code to understand it better :)

> even with bql released it's not safe to loop in here.
> consider,
>    a guest with 2 vcpus
>      * vcpu 1 gets SIGBUS due to error
>      * vcpu 2 trips over the same error and gets into this loop
>      * on guest side vcpu 1 continues to run to handle SEA but
>        might need to acquire a lock that vcpu 2 holds
> 

Agreed.

> GHESv2 error source we support, can report several errors,
> currently QEMU supports only 1 'error status block' which
> can hold several error records (CPER) (though storage size is limited)
> 
> 1:
> We can potentially add support for more GHESv2 error sources
> with their own Read ACK registers (let's say =max_cpus)
> (that is under assumption that no other error will be
> triggered while guest VCPUs handle their own SEA (upto clearing Read ACK))
> 
> 2:
> Another way could be for QEMU to allocate more error status _blocks_
> for the only one error source it has now and try to find
> empty status block to inject new error(s).
>   * it can be saturated with high rate of errors (so what do we do in case it happens?)
>   * subject to race between clearing/setting Read ACK
>      (maybe it can dealt with that on side by keeping internal read_ack counter)
> 
> 3:
> And alternatively, queue incoming errors until read ack is cleared
> and then inject pending errors in one go.
> (problem with that is that at the moment QEMU doesn't monitor
> read ack register memory so it won't notice guest clearing that)
> 
> 
> Given spec has provision for multiple error status blocks/error data entries
> it seems that #2 is an expected way to deal with the problem.
> 

I would say #1 is the ideal model because the read_ack_register is the bottleneck
and it should be scaled up to max_cpus. In that way, the bottleneck can be avoided
from the bottom. Another benefit with #1 is the error can be delivered immediately
to the vCPU where the error was raised. This matches with the syntax of SEA to me.

#2 still has the risk to saturate the multiple error status blocks if there are
high rate of errors as you said. Besides, the vCPU where read_ack_register is acknoledged
can be different from the vCPU where the error is raised, violating the syntax of
SEA.

#3's drawback is to violate the syntax of SEA, similar to #2.

However, #2/#3 wouldn't be that complicated to #1. I didn't expect big surgery to
GHES module, but it seems there isn't perfect solution without a big surgery.
I would vote for #1 to resolve the issue from the ground. What do you think, Igor?
I'm also hoping Jonathan and Mauro can provide their preference.

> PS:
> I'd prefer Mauro's series being merged 1st (once it's resplit),
> for it refactors a bunch of original code and hopefully makes
> code easier to follow/extend.
> 

Sure. I won't start the coding until the solution is confirmed. All the followup
work will base on Mauro's series.

>> +        bql_lock();
>> +    }
>> +
>> +    if (ret == 0) {
>> +        kvm_inject_arm_sea(c);
>> +    } else {
>> +        error_report("Error %d to report memory error", ret);
>> +        abort();
>> +    }
>> +}
>> +
>>   void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>   {
>>       ram_addr_t ram_addr;
>> @@ -2387,12 +2411,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>                */
>>               if (code == BUS_MCEERR_AR) {
>>                   kvm_cpu_synchronize_state(c);
>> -                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
>> -                    kvm_inject_arm_sea(c);
>> -                } else {
>> -                    error_report("failed to record the error");
>> -                    abort();
>> -                }
>> +                report_memory_error(c, paddr);
>>               }
>>               return;
>>           }
> 

Thanks,
Gavin

next prev parent reply	other threads:[~2025-02-21  5:28 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-14  4:16 [PATCH 0/4] target/arm: Improvement on memory error handling Gavin Shan
2025-02-14  4:16 ` [PATCH 1/4] acpi/ghes: Make ghes_record_cper_errors() static Gavin Shan
2025-02-21 10:44   ` Philippe Mathieu-Daudé
2025-02-14  4:16 ` [PATCH 2/4] acpi/ghes: Use error_report() in ghes_record_cper_errors() Gavin Shan
2025-02-14  4:16 ` [PATCH 3/4] acpi/ghes: Allow retry to write CPER errors Gavin Shan
2025-02-14  4:16 ` [PATCH 4/4] target/arm: Retry pushing CPER error if necessary Gavin Shan
2025-02-19 17:55   ` Igor Mammedov
2025-02-21  5:27     ` Gavin Shan [this message]
2025-02-21 11:04       ` Jonathan Cameron via
2025-02-25 11:19         ` Igor Mammedov
2025-02-26  4:58           ` Gavin Shan
2025-02-28  1:55             ` Jonathan Cameron via
2025-02-26  6:56         ` Gavin Shan
2025-02-14  9:53 ` [PATCH 0/4] target/arm: Improvement on memory error handling Jonathan Cameron via
2025-02-17  0:29   ` Gavin Shan
2025-02-14 10:12 ` Jonathan Cameron via
2025-02-17  3:49   ` Gavin Shan
2025-02-14 12:59 ` Mauro Carvalho Chehab
2025-02-17  3:58   ` Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7caa54df-abe1-4833-bb59-cb83f8241962@redhat.com \
    --to=gshan@redhat.com \
    --cc=anisinha@redhat.com \
    --cc=gengdongjiu1@gmail.com \
    --cc=imammedo@redhat.com \
    --cc=mchehab+huawei@kernel.org \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=shan.gavin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).