From: Jonathan Cameron via <qemu-arm@nongnu.org>
To: Gavin Shan <gshan@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>, <shan.gavin@gmail.com>,
	<qemu-arm@nongnu.org>, <qemu-devel@nongnu.org>,
	<mchehab+huawei@kernel.org>, <gengdongjiu1@gmail.com>,
	<mst@redhat.com>, <anisinha@redhat.com>,
	<peter.maydell@linaro.org>, <pbonzini@redhat.com>
Subject: Re: [PATCH v3 4/8] acpi/ghes: Extend acpi_ghes_memory_errors() to support multiple CPERs
Date: Tue, 11 Nov 2025 13:12:53 +0000
Message-ID: <20251111131253.00007197@huawei.com>
In-Reply-To: <e797a9f6-aee6-4c0f-9c17-f4200199e317@redhat.com>

On Tue, 11 Nov 2025 22:19:18 +1000
Gavin Shan <gshan@redhat.com> wrote:

> Hi Jonathan,
> 
> On 11/11/25 9:55 PM, Jonathan Cameron wrote:
> > On Tue, 11 Nov 2025 20:55:17 +1000
> > Gavin Shan <gshan@redhat.com> wrote:  
> >> On 11/11/25 8:07 PM, Jonathan Cameron wrote:  
> >>> On Tue, 11 Nov 2025 14:08:13 +1000
> >>> Gavin Shan <gshan@redhat.com> wrote:  
> >>>> On 11/11/25 12:49 AM, Igor Mammedov wrote:  
> >>>>> On Thu, 6 Nov 2025 13:15:52 +1000
> >>>>> Gavin Shan <gshan@redhat.com> wrote:  
> >>>>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote:  
> >>>>>>> On Wed,  5 Nov 2025 21:44:49 +1000
> >>>>>>> Gavin Shan <gshan@redhat.com> wrote:
> >>>>>>>            
> >>>>>>>> In the situation where host and guest have 64KiB and 4KiB page sizes,
> >>>>>>>> one problematic host page affects 16 guest pages. We need to send 16
> >>>>>>>> consecutive errors in this specific case.
> >>>>>>>>
> >>>>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the
> >>>>>>>> hunk of code to generate the GHES error status is pulled out from
> >>>>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of
> >>>>>>>> generic error status block is also updated accordingly if multiple
> >>>>>>>> error data entries are contained in the generic error status block.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>  
> >>>>>>> Hi Gavin,
> >>>>>>>
> >>>>>>> Mostly fine, but a few comments on the defines added and a
> >>>>>>> question on what the multiple things are meant to mean?
> >>>>>>>            
> >>>>>>
> >>>>>> Thanks for your review and comments, replies as below.
> >>>>>>        
> >>>>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c
> >>>>>>>> index a9c08e73c0..527b85c8d8 100644
> >>>>>>>> --- a/hw/acpi/ghes.c
> >>>>>>>> +++ b/hw/acpi/ghes.c
> >>>>>>>> @@ -57,8 +57,12 @@
> >>>>>>>>      /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */
> >>>>>>>>      #define ACPI_GHES_MEM_CPER_LENGTH           80
> >>>>>>>>      
> >>>>>>>> -/* Masks for block_status flags */
> >>>>>>>> -#define ACPI_GEBS_UNCORRECTABLE         1
> >>>>>>>> +/* Bits for block_status flags */
> >>>>>>>> +#define ACPI_GEBS_UNCORRECTABLE           0
> >>>>>>>> +#define ACPI_GEBS_CORRECTABLE             1
> >>>>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE  2
> >>>>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE    3  
> >>>>>>>
> >>>>>>> So this maps to the bits in block status.
> >>>>>>>
> >>>>>>> I'm not actually sure what these multiple variants are meant to tell us.
> >>>>>>> The multiple error blocks example referred to by the spec is a way to represent
> >>>>>>> the same error applying to multiple places.  So that's one error, many blocks.
> >>>>>>> I have no idea if we set these bits in that case.
> >>>>>>>
> >>>>>>> Based on a quick look I don't think Linux even takes any notice.  There
> >>>>>>> are defines in actbl1.h but I'm not seeing any use made of them.
> >>>>>>>            
> >>>>>>
> >>>>>> I hope Igor can confirm since it was suggested by him.
> >>>>>>
> >>>>>> It's hard to understand how exactly these multiple variants are used from the
> >>>>>> spec. In ACPI 6.5 Table 18.11, it's explained as below.
> >>>>>>
> >>>>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that more
> >>>>>> than one uncorrectable errors have been detected.
> >>>>>>
> >>>>>> I don't see those multiple variants have been used by Linux. So I think it's
> >>>>>> safe to drop them.  
> >>>>>
> >>>>> Even though the example describes the 'same' error at different components,
> >>>>> the bit field descriptions don't set any limits on what 'more than one' means.
> >>>>>
> >>>>> Also from the guest POV it's multiple different pages that we are reporting here
> >>>>> as multiple CPERs.
> >>>>> It seems to me that setting *_MULTIPLE_* here is the correct thing to do.
> >>>>>         
> >>>>
> >>>> I don't have strong opinions. Let's keep setting the _MULTIPLE_ flag if Jonathan
> >>>> is fine with that. Again, this field isn't used by the Linux guest.
> >>> I don't care strongly.  Maybe we should ask for a spec clarification as I doubt
> >>> implementations will be consistent on this given the vague description and that
> >>> Linux ignores it today.
> >>>      
> >>
> >> Google Gemini has the following answer. If it can be trusted, it should be
> >> set when @num_of_addresses is larger than 1.
> >>
> >> Quote from Google Gemini:
> >>
> >> The system firmware sets this bit to indicate to the Operating System Power Management (OSPM)
> >> that more than one correctable error condition has been detected and logged for the associated
> >> hardware component since the last time the status was cleared by the software. This is crucial
> >> because a high frequency of correctable errors often indicates a potential underlying hardware
> >> issue that could lead to uncorrectable (and potentially fatal) errors if not addressed (e.g.,
> >> in memory, where multiple correctable errors might trigger a spare memory operation).
> >>  
> >>>>     
> >>>>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES      4  
> >>>>>>>
> >>>>>>> This is bits 4-13 and the define isn't used. I'd drop it.
> >>>>>>>            
> >>>>>>
> >>>>>> The definition is used in acpi_ghes_memory_errors() of this patch. However,
> >>>>>> I don't see it being used by Linux. This field isn't used by Linux to determine
> >>>>>> the total number of error entries. So I think I can drop it as well if Igor is OK.
> >>>>>>        
> >>>>
> >>>> Let's keep this field as well in the next revision if Jonathan is fine with that.
> >>>
> >>> I'm fine with the field, but not the value.  As far as I can tell from the spec, it should
> >>> be a mask, not a single bit.
> >>>      
> >>
> >> Agreed, let's keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in the next revision.
> > 
> > I'm even more confused now.  The GEBS Error Data entry count should be a field in bits 13:4
> > and the value taken should be the number of entries in the record, so 1, 4, 16 depending
> > on the page size.
> > 
> > So that define of the value 4 is garbage. If it were DATA_ENTRIES_SHIFT then I'd be much happier.
> >   
> 
> My bad. I misunderstood your point. It will be fixed by using APIs from
> "hw/registerfields.h" as suggested by Philippe in another reply.
> 
>    ...
>    FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1)
>    FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10)
> 
>    then use FIELD_DP32() to only set the correct bits.
> 
Perfect. Thanks!

J
> Thanks,
> Gavin
> 
> 
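For readers following the thread, here is a minimal standalone sketch of the layout agreed above: the entry count goes into bits 13:4 of block_status as a value, not a single bit, and it equals the number of guest pages covered by one host page. The names below (GEBS_*_SHIFT, deposit_field(), guest_pages_per_host_page(), build_block_status()) are hypothetical stand-ins written for illustration; they are not the FIELD()/FIELD_DP32() macros from "hw/registerfields.h" and not the code of the patch. Setting the "multiple uncorrectable" bit when more than one CPER is reported follows the reading favoured in the discussion, which the spec does not clearly confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions mirroring the block_status layout discussed above. */
#define GEBS_UNCORRECTABLE_SHIFT           0
#define GEBS_MULTIPLE_UNCORRECTABLE_SHIFT  2
#define GEBS_ERROR_DATA_ENTRIES_SHIFT      4
#define GEBS_ERROR_DATA_ENTRIES_LENGTH     10   /* bits 13:4 */

/* Write 'value' into the 'length'-bit field at 'shift'; other bits are kept. */
static uint32_t deposit_field(uint32_t reg, unsigned shift, unsigned length,
                              uint32_t value)
{
    assert(length >= 1 && length <= 31 && shift + length <= 32);
    uint32_t mask = ((UINT32_C(1) << length) - 1) << shift;
    return (reg & ~mask) | ((value << shift) & mask);
}

/* Guest pages covered by one host page (assumes host size >= guest size). */
static uint32_t guest_pages_per_host_page(uint32_t host_page_size,
                                          uint32_t guest_page_size)
{
    assert(guest_page_size != 0 && host_page_size >= guest_page_size);
    return host_page_size / guest_page_size;
}

/* Build a block_status value for 'num_entries' uncorrectable memory CPERs. */
static uint32_t build_block_status(uint32_t num_entries)
{
    uint32_t status = 0;

    status = deposit_field(status, GEBS_UNCORRECTABLE_SHIFT, 1, 1);
    if (num_entries > 1) {
        /* Assumption: "multiple" is set whenever more than one CPER is sent. */
        status = deposit_field(status, GEBS_MULTIPLE_UNCORRECTABLE_SHIFT, 1, 1);
    }
    /* The count is stored as a value in the 10-bit field, not as one bit. */
    status = deposit_field(status, GEBS_ERROR_DATA_ENTRIES_SHIFT,
                           GEBS_ERROR_DATA_ENTRIES_LENGTH, num_entries);
    return status;
}
```

With a 64KiB host page and a 4KiB guest page, guest_pages_per_host_page() gives 16, and build_block_status(16) yields 0x105: bit 0, bit 2, and the value 16 in bits 13:4.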


