Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [PATCH v3 01/12] x86/resctrl: Support Privilege-Level Zero Association (PLZA)
From: Babu Moger @ 2026-06-17 16:28 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, corbet, tony.luck, Dave.Martin,
	james.morse, tglx, bp, dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <353185bd-2b3e-484e-bf4c-e774c70ea63c@intel.com>

Hi Reinette,


On 6/16/26 19:00, Reinette Chatre wrote:
> Hi Babu,
> 
> On 6/12/26 9:56 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 6/11/2026 6:23 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 4/30/26 4:24 PM, Babu Moger wrote:
>>>> Customers have identified an issue while using the QoS resource Control
>>>
>>> "Control" -> "control"?
>>>
>>
>> ack
>>
>>>> feature. If a memory bandwidth associated with a CLOSID is aggressively
>>>
>>> "a memory bandwidth" -> "memory bandwidth"?
>>
>> ack.
>>
>>>
>>>> throttled, and it moves into Kernel mode, the Kernel operations are also
>>>
>>> What does "it" refer to here? From text it seems to be the "CLOSID" but that
>>> does not sound right? Should "it" instead be something like "a task with that
>>> CLOSID"?
>>
>> sure.
>>
>>>
>>> "Kernel" -> "kernel"?
>>
>> ack.
>>>
>>>> aggressively throttled. This can stall forward progress and eventually
>>>> degrade overall system performance. AMD hardware supports a feature
>>>> Privilege-Level Zero Association (PLZA) to change the association of the
>>>> thread as soon as it begins executing.
>>>
>>> "change the association of the thread as soon as it begins executing." I am
>>> not able to parse this.
>>
>> How about ?
>>
>> Customers have identified an issue while using the QoS resource Control
>> feature. If memory bandwidth associated with a CLOSID is aggressively
>> throttled, and a task with that CLOSID moves into kernel mode, the kernel operations are also aggressively throttled. This can stall forward progress and eventually degrade overall system performance.
>> AMD hardware supports a feature Privilege-Level Zero Association (PLZA)
>> to change the CPU association at the user-to-kernel transition, so the kernel execution can use a different association than user mode.
> 
> "change the CPU association at the user-to-kernel transition" -> What is this
> trying to describe? CPU association of what?
> 
> "a different association"? What does this mean?
> 

Will change it to:

AMD hardware supports a feature Privilege-Level Zero Association (PLZA),
which allows the CPU’s CLOSID association to be changed during the 
transition from user mode to kernel mode. This enables the kernel to 
operate with a different CLOSID than the user mode.


>>
>> Privilege-Level Zero Association (PLZA) allows the user to specify a> CLOSID and/or RMID associated with execution in Privilege-Level
>> Zero. When enabled on a CPU, as the CPU enters Privilege-Level Zero,
>> allocation and monitoring for that CPU will be associated with the
>> PLZA CLOSID and/or RMID. Otherwise, the CPU will be associated with
>> the CLOSID and RMID given by PQR_ASSOC.
> 
> 
> Sounds like this is vague because MSR_IA32_PQR_PLZA_ASSOC has not been
> introduced yet. Could it help to introduce MSR_IA32_PQR_PLZA_ASSOC as
> part of this patch and then the changelog can be specific about PLZA
> feature introducing this new MSR and how it complements MSR_IA32_PQR_ASSOC?

Its probably better to remove the second paragraph. This text can go 
with the patch which introduces MSR_IA32_PQR_PLZA_ASSOC.

With splitting the patch, this will only have cpufeatures changes.

> 
> ...
> 
>>>>    Documentation/admin-guide/kernel-parameters.txt | 2 +-
>>>>    arch/x86/include/asm/cpufeatures.h              | 1 +
>>>>    arch/x86/kernel/cpu/resctrl/core.c              | 2 ++
>>>>    arch/x86/kernel/cpu/scattered.c                 | 1 +
>>>
>>> Please split changes to other subsystems and make these changes
>>> obvious with their own subject prefix to avoid sneaking changes into
>>> other subsystems via resctrl.
>>>
>>
>> Ok. Will be two patches.
>> 1. For Documentation/admin-guide/kernel-parameters.txt
>> 2.  arch/x86/include/asm/cpufeatures.h
>>      arch/x86/kernel/cpu/resctrl/core.c
>>      arch/x86/kernel/cpu/scattered.c
> 
> The resctrl changes found in (2) would be documented in (1)? That does not
> look right. Why not just split the resctrl changes from the cpufeatures changes?
> This would be similar to how you did ABMC enabling.
> 

Sounds good.

Thanks
Babu

^ permalink raw reply

* Re: [PATCH v3 00/12] [PATCH v3 00/12] x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-06-17 15:56 UTC (permalink / raw)
  To: Reinette Chatre, Moger, Babu, corbet, tony.luck, Dave.Martin,
	james.morse, tglx, bp, dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <2ba92dec-47ea-404e-8dc9-197a846bdb2d@intel.com>

Hi Reinette,

On 6/16/26 23:34, Reinette Chatre wrote:
> Hi Babu,
> 
> On 6/12/26 8:37 AM, Moger, Babu wrote:
>>
>>
>> On 6/11/2026 4:53 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 4/30/26 4:24 PM, Babu Moger wrote:
>>>> Design
>>>> ======
>>>>
>>>> A new sysfs file, info/kernel_mode, holds a single global policy that
>>>> selects what kernel work is steered and which rdtgroup it is steered
>>>
>>> How should "selects *what* kernel work is steered" be interpreted? Do these
>>> modes not all apply to *all* kernel work?
>>
>> How about?
>>
>> A new sysfs file, info/kernel_mode, holds a single global policy for
>> kernel contexts and the rdtgroup associated with the policy.
> 
> What does "kernel contexts" mean here?
> Also, since rdtgroup refers more to "struct rdtgroup" that is internal to resctrl
> I would suggest "resource group" be used instead.
> Consider, for example (this is just something to get started, please do not just
> copy&paste):
> 
>     A new resctrl file, info/kernel_mode, holds the global policy for
>     resource allocation and monitoring of kernel work and the resource group
>     (when applicable) associated with the policy.

Sounds good.

> 
> 
>>>> to.  Reads describe the supported modes and the currently-active
>>>> binding; writes change the policy or rebind to a different group.
>>>> Look at the thread below for design discussion.
>>>> https://lore.kernel.org/lkml/14a8ad0a-e842-4268-871a-0762f1169e03@intel.com/
>>>>
>>>
>>> ...
>>>
>>>> Examples
>>>> ========
>>>>
>>>> (See Documentation/filesystems/resctrl.rst, "kernel_mode" and
>>>> "kmode_cpus" sections, for the full UAPI.)
>>>>
>>>>     # Mount resctrl
>>>>     # mount -t resctrl resctrl /sys/fs/resctrl
>>>>     # cd /sys/fs/resctrl
>>>>
>>>>     # Read the supported modes.  The active mode is bracketed and reports
>>>>     # the bound "<ctrl>/<mon>/" group; other supported modes report
>>>>     # ":group=none" because nothing is bound to them.
>>>>     # cat info/kernel_mode
>>>>     [inherit_ctrl_and_mon:group=//]
>>>
>>> This is unexpected since associating a group to this mode implies that this
>>> group is used to manage allocations and monitoring of kernel work but this
>>> is not true, right? From what I understand there should be no group associated with
>>> this default "inherit_ctrl_and_mon" mode.
>>
>> The default mode is "inherit_ctrl_and_mon", where both user mode and kernel mode share the same CLOSID and RMID. This is current mode (without this series).
>>
>> I thought we are going to set the default mode with the default group when system boots up. No?
> 
> As I have it, the end of our discussion on this topic is at:
> https://lore.kernel.org/lkml/6709398b-269d-47b5-9b41-084f410bb1a6@amd.com/
> 
> Based on that discussion there is no resource group associated with the default
> inherit_ctrl_and_mon.
> I find the above output confusing to user space since adding the default group as the only
> group to this mode implies that kernel work inherits resource allocation and monitoring
> from the default group but that is not correct.

Ah, I misunderstood earlier.

The display will look like this when the system boots up.

# cat info/kernel_mode
   inherit_ctrl_and_mon:
   global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
   global_assign_ctrl_assign_mon_per_cpu:group=uninitialized

There will not be any group associated with "inherit_ctrl_and_mon".
It is only used to switch from other two modes.

> 
> Your answer seems to refer to other discussions about what group should be used for
> a mode when switching to a new mode and user space has not set the resource group. If not,
> could you please point me to which discussion you are referring to?
> 
>>
>>
>>>
>>>>     global_assign_ctrl_inherit_mon_per_cpu:group=none
>>>>     global_assign_ctrl_assign_mon_per_cpu:group=none
>>>
>>> nit: "none" does not reflect state as clearly as "unset"/"uninitialized"/"NA"
>>
>> Lets go with "uninitialized".
> 
> ok
> 
> ...
>>>>     # echo 0-3 > ctrl1/kmode_cpus_list
>>>>     # cat ctrl1/kmode_cpus
>>>>     f
>>>>     # cat ctrl1/kmode_cpus_list
>>>>     0-3
>>>>
>>>>     # Empty masks are rejected; use info/kernel_mode to reset to
>>>>     # "every online CPU".
>>>>     # echo "" > ctrl1/kmode_cpus_list
>>>>     bash: echo: write error: Invalid argument
>>>>     # cat info/last_cmd_status
>>>>     Empty mask not allowed; use info/kernel_mode to unbind
>>>
>>> Why are empty masks rejected/not allowed?
>>
>> No specific reason.
>>
>> When the mode is switched, we discussed earlier to globally apply the mode to all the online CPUs.
> 
> Right. I did not see this being done in this implementation though. As I mentioned in
> my response to patch #8 it appears that it uses old data from the resource
> group's kmode_cpu_mask. I do think that applying it to all the online CPUs
> matches the intention and would make the code much simpler.

Yes. Sure.

> 
>>
>> At this point reading "kmode_cpus_list" will still report empty.
> 
> I do not think resctrl should do this. This is not accurate and conflicts with the
> existing cpus resctrl files. It should be simple to just present the actual and
> accurate data to user space, especially after incorporating Qinyun Tan's contributions.

Sure. Will do

> 
>>
>> Users can change it to selectively apply the mode by writing to "kmode_cpus_list".
>>
>> I was not sure what was the action when empty masks are written.
>>
>> Should the empty mask apply the mode to all the online CPUs?
> 
> Users are used to being able to use an empty write to remove all CPUs from a
> resource group. It thus seems intuitive that an empty write to the kmode_cpus file

Ok, That seems reasonable. Lets keep the same behavior as we update 
other cpu_lists.

> behave similarly. Sounds like this could mean that if user space sets the
> kmode to global_assign_ctrl_inherit_mon_per_cpu or global_assign_ctrl_assign_mon_per_cpu
> and then writes an empty mask to kmode_cpus then it would essentially be setting
> inherit_ctrl_and_mon mode? This still seems ok since if disabling one of the "global"
> modes on a CPU results in that CPU inheriting from PQR_ASSOC then it seems
> reasonable to extend to when that mode is disabled for all CPUs.

Yes. That is correct.

> 
>>>>     # Disable kernel-mode steering (back to inherit, default group).
>>>
>>> This sounds like kernel work is steered to default group which I
>>> do not think is accurate for the "inherit_ctrl_and_mon" mode.
>>
>> How about ?
>>
>> Drop the kernel-mode binding and restore inherit_ctrl_and_mon on the default group.
> 
> No. There is no "inherit_ctrl_and_mon on the default group". There is nothing special
> about the default group when it comes to inherit_ctrl_and_mon mode ... or am I missing
> something?

You are right. There is nothing special about inherit_ctrl_and_mon on 
the default group.

> This could be something like: "Activate inherit_ctrl_and_mon mode to let
> kernel work inherit the resource allocation and monitoring from the user space task."
> (and drop the default group from the output associated with inherit_ctrl_and_mon)
> 

Sure.

Thanks
Babu


^ permalink raw reply

* Re: [PATCH v5 2/6] alloc_tag: add ioctl filters to /proc/allocinfo
From: Suren Baghdasaryan @ 2026-06-17 15:55 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Andrew Morton, Kent Overstreet, Hao Ge, Shuah Khan,
	Jonathan Corbet, linux-doc, linux-kernel, linux-mm, Sourav Panda
In-Reply-To: <2f9c58f6fd8a81325ec03e19327e03b0b7dca2aa.1781564384.git.abhishekbapat@google.com>

On Mon, Jun 15, 2026 at 4:04 PM Abhishek Bapat <abhishekbapat@google.com> wrote:
>
> Extend the capability of the IOCTL mechanism to filter allocations based
> on tag's module name, function name, file name and line number.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> Acked-by: Hao Ge <hao.ge@linux.dev>

Acked-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/uapi/linux/alloc_tag.h | 26 ++++++++++++-
>  lib/alloc_tag.c                | 68 ++++++++++++++++++++++++++++++++--
>  2 files changed, 89 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index 0928e1a48d49..3b11877955b9 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -40,8 +40,32 @@ struct allocinfo_tag_data {
>         struct allocinfo_counter counter;
>  };
>
> +enum {
> +       ALLOCINFO_FILTER_MODNAME,
> +       ALLOCINFO_FILTER_FUNCTION,
> +       ALLOCINFO_FILTER_FILENAME,
> +       ALLOCINFO_FILTER_LINENO,
> +       __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> +};
> +
> +#define ALLOCINFO_FILTER_MASK_MODNAME          (1 << ALLOCINFO_FILTER_MODNAME)
> +#define ALLOCINFO_FILTER_MASK_FUNCTION         (1 << ALLOCINFO_FILTER_FUNCTION)
> +#define ALLOCINFO_FILTER_MASK_FILENAME         (1 << ALLOCINFO_FILTER_FILENAME)
> +#define ALLOCINFO_FILTER_MASK_LINENO           (1 << ALLOCINFO_FILTER_LINENO)
> +
> +#define ALLOCINFO_FILTER_MASKS \
> +       ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> +
> +struct allocinfo_filter {
> +       __u64 mask; /* bitmask of the filter fields used */
> +       struct allocinfo_tag fields;
> +};
> +
>  struct allocinfo_get_at {
> -       __u64 pos;      /* input */
> +       /* inputs */
> +       __u64 pos;
> +       struct allocinfo_filter filter;
> +       /* output */
>         struct allocinfo_tag_data data;
>  };
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 82e3b5f32dff..5feb61d9fb92 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -49,6 +49,7 @@ struct allocinfo_private {
>         struct codetag_iterator iter;
>         struct codetag_iterator reported_iter;
>         bool print_header;
> +       struct allocinfo_filter filter;
>         /* ioctl uses a separate iterator not to interfere with reads */
>         struct codetag_iterator ioctl_iter;
>         bool positioned; /* seq_open_private() sets to 0 */
> @@ -188,6 +189,12 @@ static void allocinfo_copy_str(char *dest, const char *src)
>         strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
>  }
>
> +/* Compare two strings and only consider the trimmed suffix if s1 is too long */
> +static int allocinfo_cmp_str(const char *str, const char *template)
> +{
> +       return strncmp(allocinfo_str(str), template, ALLOCINFO_STR_SIZE);
> +}
> +
>  /*
>   * Populates the UAPI allocinfo_tag_data structure with active runtime
>   * profiling counters extracted from the given kernel codetag.
> @@ -227,6 +234,40 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
>         return 0;
>  }
>
> +/*
> + * Verifies whether a given codetag satisfies the active filtering criteria by
> + * matching its characteristics against the specified filter.
> + */
> +static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> +{
> +       if (!filter || !filter->mask)
> +               return true;
> +
> +       if (filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) {
> +               /* user wants to filter by modname but ct->modname is NULL */
> +               if (!ct->modname) {
> +                       /* validate if user was attempting to filter for built-in allocations */
> +                       if (filter->fields.modname[0] != '\0')
> +                               return false;
> +               } else if (allocinfo_cmp_str(ct->modname, filter->fields.modname))
> +                       return false;
> +       }
> +
> +       if ((filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) &&
> +           ct->function && allocinfo_cmp_str(ct->function, filter->fields.function))
> +               return false;
> +
> +       if ((filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) &&
> +           ct->filename && allocinfo_cmp_str(ct->filename, filter->fields.filename))
> +               return false;
> +
> +       if ((filter->mask & ALLOCINFO_FILTER_MASK_LINENO) &&
> +           ct->lineno != filter->fields.lineno)
> +               return false;
> +
> +       return true;
> +}
> +
>  /*
>   * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
>   * profiling data and returns it to userspace.
> @@ -235,29 +276,46 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
>  {
>         struct allocinfo_private *priv;
>         struct codetag *ct;
> -       __u64 pos;
>         struct allocinfo_get_at params = {0};
> +       __u64 skip_count;
>
>         if (copy_from_user(&params, arg, sizeof(params)))
>                 return -EFAULT;
>
> +       if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> +               return -EINVAL;
> +
>         priv = m->private;
> -       pos = params.pos;
>
>         mutex_lock(&priv->ioctl_lock);
>         codetag_lock_module_list(alloc_tag_cttype);
>
> -       if (pos >= codetag_get_count(alloc_tag_cttype)) {
> +       if (params.pos >= codetag_get_count(alloc_tag_cttype)) {
>                 codetag_unlock_module_list(alloc_tag_cttype);
>                 mutex_unlock(&priv->ioctl_lock);
>                 return -ENOENT;
>         }
>
> +       skip_count = params.pos;
> +
> +       if (params.filter.mask)
> +               priv->filter = params.filter;
> +       else
> +               priv->filter.mask = 0;
> +
>         /* Find the codetag */
>         priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
>         ct = codetag_next_ct(&priv->ioctl_iter);
> -       while (ct && pos--)
> +
> +       while (ct) {
> +               if (matches_filter(ct, &priv->filter)) {
> +                       if (skip_count == 0)
> +                               break;
> +                       skip_count--;
> +               }
>                 ct = codetag_next_ct(&priv->ioctl_iter);
> +       }
> +
>         if (ct) {
>                 allocinfo_to_params(ct, &params.data);
>                 priv->positioned = true;
> @@ -298,6 +356,8 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
>         }
>
>         ct = codetag_next_ct(&priv->ioctl_iter);
> +       while (ct && !matches_filter(ct, &priv->filter))
> +               ct = codetag_next_ct(&priv->ioctl_iter);
>         if (ct)
>                 allocinfo_to_params(ct, &params);
>
> --
> 2.54.0.1136.gdb2ca164c4-goog
>

^ permalink raw reply

* Re: [PATCH v5 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Suren Baghdasaryan @ 2026-06-17 15:50 UTC (permalink / raw)
  To: Hao Ge
  Cc: Abhishek Bapat, Andrew Morton, Kent Overstreet, Shuah Khan,
	Jonathan Corbet, linux-doc, linux-kernel, linux-mm, Sourav Panda
In-Reply-To: <f23682fb-6107-4dad-bc60-b8816fba23d6@linux.dev>

On Mon, Jun 15, 2026 at 6:40 PM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> On 2026/6/16 07:04, Abhishek Bapat wrote:
> > From: Suren Baghdasaryan <surenb@google.com>
> >
> > Add the following ioctl commands for /proc/allocinfo file:
> >
> > ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> > to check whether the file content has changed specifically due to module
> > load/unload. Every time a module is loaded / unloaded, the returned
> > value will be different. By comparing the identifier value at the
> > beginning and at the end of the content retrieval operation, users can
> > validate retrieved information for consistency.
> >
> > ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> > is the position of a record in /proc/allocinfo.
> >
> > ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> > one. If no records were previously retrieved, returns the first
> > record.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
>
>
> Thanks for updating the patch, LGTM.
>
> Acked-by: Hao Ge <hao.ge@linux.dev>
>
>
> > ---
> >   Documentation/mm/allocation-profiling.rst     |   5 +
> >   .../userspace-api/ioctl/ioctl-number.rst      |   2 +
> >   MAINTAINERS                                   |   1 +
> >   include/linux/codetag.h                       |   2 +
> >   include/uapi/linux/alloc_tag.h                |  60 +++++
> >   lib/alloc_tag.c                               | 235 +++++++++++++++++-
> >   lib/codetag.c                                 |  18 ++
> >   7 files changed, 321 insertions(+), 2 deletions(-)
> >   create mode 100644 include/uapi/linux/alloc_tag.h
> >
> > diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
> > index 5389d241176a..c3a28467955f 100644
> > --- a/Documentation/mm/allocation-profiling.rst
> > +++ b/Documentation/mm/allocation-profiling.rst
> > @@ -46,6 +46,11 @@ sysctl:
> >   Runtime info:
> >     /proc/allocinfo
> >
> > +  Profiling data can be retrieved either by reading `/proc/allocinfo` directly as
> > +  text or programmatically via `ioctl()` calls defined in `<uapi/linux/alloc_tag.h>`.
> > +  The ioctl interface supports structured binary data extraction as well as filtering
> > +  by module name, function, file, line number, accuracy, or allocation size limits.
> > +
> >   Example output::
> >
> >     root@moria-kvm:~# sort -g /proc/allocinfo|tail|numfmt --to=iec
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 331223761fff..84f6808a8578 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -349,6 +349,8 @@ Code  Seq#    Include File                                             Comments
> >                                                                          <mailto:luzmaximilian@gmail.com>
> >   0xA5  20-2F  linux/surface_aggregator/dtx.h                            Microsoft Surface DTX driver
> >                                                                          <mailto:luzmaximilian@gmail.com>
> > +0xA6  00-0F  uapi/linux/alloc_tag.h                                    Memory allocation profiling
> > +                                                                       <mailto:surenb@google.com>
> >   0xAA  00-3F  linux/uapi/linux/userfaultfd.h
> >   0xAB  00-1F  linux/nbd.h
> >   0xAC  00-1F  linux/raw.h
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 65bd4328fe05..019cc4c285a3 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16713,6 +16713,7 @@ S:    Maintained
> >   F:  Documentation/mm/allocation-profiling.rst
> >   F:  include/linux/alloc_tag.h
> >   F:  include/linux/pgalloc_tag.h
> > +F:   include/uapi/linux/alloc_tag.h
> >   F:  lib/alloc_tag.c
> >
> >   MEMORY CONTROLLER DRIVERS
> > diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> > index ddae7484ca45..a25a085c2df1 100644
> > --- a/include/linux/codetag.h
> > +++ b/include/linux/codetag.h
> > @@ -77,6 +77,8 @@ struct codetag_iterator {
> >   void codetag_lock_module_list(struct codetag_type *cttype);
> >   bool codetag_trylock_module_list(struct codetag_type *cttype);
> >   void codetag_unlock_module_list(struct codetag_type *cttype);
> > +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> > +unsigned int codetag_get_count(struct codetag_type *cttype);
> >   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> >   struct codetag *codetag_next_ct(struct codetag_iterator *iter);
> >
> > diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> > new file mode 100644
> > index 000000000000..0928e1a48d49
> > --- /dev/null
> > +++ b/include/uapi/linux/alloc_tag.h
> > @@ -0,0 +1,60 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * alloc_tag IOCTL API definition
> > + *
> > + * Copyright (C) 2026 Google, LLC.  All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef _UAPI_ALLOC_TAG_H
> > +#define _UAPI_ALLOC_TAG_H
> > +
> > +#include <linux/types.h>
> > +
> > +#define ALLOCINFO_STR_SIZE   64

The fact that we use the last 64 characters when comparing strings
should be better documented.
I suggest adding a comment before  ALLOCINFO_STR_SIZE definition and
before allocinfo_str(), highlighting this fact. The changelog should
also mention this and explain the reason:

Function, file and module names often have same prefixes, therefore
when filtering by these criterias and if the name is longer than 64
characters, we compare the last 64 characters to minimize the chances
of a name collision.
Otherwise LGTM.

> > +
> > +struct allocinfo_content_id {
> > +     __u64 id;
> > +};
> > +
> > +struct allocinfo_tag {
> > +     /* Longer names are trimmed */
> > +     char modname[ALLOCINFO_STR_SIZE];
> > +     char function[ALLOCINFO_STR_SIZE];
> > +     char filename[ALLOCINFO_STR_SIZE];
> > +     __u64 lineno;
> > +};
> > +
> > +/* The alignment ensures 32-bit compatible interfaces are not broken */
> > +struct allocinfo_counter {
> > +     __u64 bytes;
> > +     __u64 calls;
> > +     __u8 accurate;
> > +} __attribute__((aligned(8)));
> > +
> > +struct allocinfo_tag_data {
> > +     struct allocinfo_tag tag;
> > +     struct allocinfo_counter counter;
> > +};
> > +
> > +struct allocinfo_get_at {
> > +     __u64 pos;      /* input */
> > +     struct allocinfo_tag_data data;
> > +};
> > +
> > +#define _ALLOCINFO_IOC_CONTENT_ID    0
> > +#define _ALLOCINFO_IOC_GET_AT                1
> > +#define _ALLOCINFO_IOC_GET_NEXT              2
> > +
> > +#define ALLOCINFO_IOC_BASE           0xA6
> > +#define ALLOCINFO_IOC_CONTENT_ID     _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID,     \
> > +                                          struct allocinfo_content_id)
> > +#define ALLOCINFO_IOC_GET_AT         _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT,        \
> > +                                           struct allocinfo_get_at)
> > +#define ALLOCINFO_IOC_GET_NEXT               _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT,       \
> > +                                          struct allocinfo_tag_data)
> > +
> > +#endif /* _UAPI_ALLOC_TAG_H */
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index d9be1cf5187d..82e3b5f32dff 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -5,6 +5,7 @@
> >   #include <linux/gfp.h>
> >   #include <linux/kallsyms.h>
> >   #include <linux/module.h>
> > +#include <linux/mutex.h>
> >   #include <linux/page_ext.h>
> >   #include <linux/pgalloc_tag.h>
> >   #include <linux/proc_fs.h>
> > @@ -14,6 +15,7 @@
> >   #include <linux/string_choices.h>
> >   #include <linux/vmalloc.h>
> >   #include <linux/kmemleak.h>
> > +#include <uapi/linux/alloc_tag.h>
> >
> >   #define ALLOCINFO_FILE_NAME         "allocinfo"
> >   #define MODULE_ALLOC_TAG_VMAP_SIZE  (100000UL * sizeof(struct alloc_tag))
> > @@ -47,6 +49,10 @@ struct allocinfo_private {
> >       struct codetag_iterator iter;
> >       struct codetag_iterator reported_iter;
> >       bool print_header;
> > +     /* ioctl uses a separate iterator not to interfere with reads */
> > +     struct codetag_iterator ioctl_iter;
> > +     bool positioned; /* seq_open_private() sets to 0 */
> > +     struct mutex ioctl_lock;
> >   };
> >
> >   static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > @@ -130,6 +136,232 @@ static const struct seq_operations allocinfo_seq_op = {
> >       .show   = allocinfo_show,
> >   };
> >
> > +/*
> > + * Initializes seq_file operations and allocates private state when opening
> > + * the /proc/allocinfo procfs entry.
> > + */
> > +static int allocinfo_open(struct inode *inode, struct file *file)
> > +{
> > +     int ret;
> > +
> > +     ret = seq_open_private(file, &allocinfo_seq_op,
> > +                            sizeof(struct allocinfo_private));
> > +     if (!ret) {
> > +             struct seq_file *m = file->private_data;
> > +             struct allocinfo_private *priv = m->private;
> > +
> > +             mutex_init(&priv->ioctl_lock);
> > +     }
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Cleans up the seq_file state and frees up the private state allocated in
> > + * allocinfo_open() when closing the /proc/allocinfo file descriptor.
> > + */
> > +static int allocinfo_release(struct inode *inode, struct file *file)
> > +{
> > +     struct seq_file *m = file->private_data;
> > +     struct allocinfo_private *priv = m->private;
> > +
> > +     mutex_destroy(&priv->ioctl_lock);
> > +     return seq_release_private(inode, file);
> > +}
> > +
> > +/*
> > + * Returns a pointer to the suffix of a string so that its length fits within
> > + * ALLOCINFO_STR_SIZE, preserving the trailing characters.
> > + */
> > +static const char *allocinfo_str(const char *str)
> > +{
> > +     size_t len = strlen(str);
> > +
> > +     /* Keep an extra space for the trailing NULL. */
> > +     if (len >= ALLOCINFO_STR_SIZE)
> > +             str += (len - ALLOCINFO_STR_SIZE) + 1;
> > +     return str;
> > +}
> > +
> > +/* Copy a string and trim from the beginning if it's too long */
> > +static void allocinfo_copy_str(char *dest, const char *src)
> > +{
> > +     strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> > +}
> > +
> > +/*
> > + * Populates the UAPI allocinfo_tag_data structure with active runtime
> > + * profiling counters extracted from the given kernel codetag.
> > + */
> > +static void allocinfo_to_params(struct codetag *ct,
> > +                             struct allocinfo_tag_data *data)
> > +{
> > +     struct alloc_tag *tag = ct_to_alloc_tag(ct);
> > +     struct alloc_tag_counters counter = alloc_tag_read(tag);
> > +
> > +     if (ct->modname)
> > +             allocinfo_copy_str(data->tag.modname, ct->modname);
> > +     else
> > +             data->tag.modname[0] = '\0';
> > +     allocinfo_copy_str(data->tag.function, ct->function);
> > +     allocinfo_copy_str(data->tag.filename, ct->filename);
> > +     data->tag.lineno = ct->lineno;
> > +     data->counter.bytes = counter.bytes;
> > +     data->counter.calls = counter.calls;
> > +     data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> > +}
> > +
> > +/*
> > + * Retrieves the unique content ID representing the current allocation tag module
> > + * layout, allowing userspace to detect if modules were loaded / unloaded.
> > + */
> > +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_content_id params;
> > +
> > +     codetag_lock_module_list(alloc_tag_cttype);
> > +     params.id = codetag_get_content_id(alloc_tag_cttype);
> > +     codetag_unlock_module_list(alloc_tag_cttype);
> > +     if (copy_to_user(arg, &params, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
> > + * profiling data and returns it to userspace.
> > + */
> > +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_private *priv;
> > +     struct codetag *ct;
> > +     __u64 pos;
> > +     struct allocinfo_get_at params = {0};
> > +
> > +     if (copy_from_user(&params, arg, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     priv = m->private;
> > +     pos = params.pos;
> > +
> > +     mutex_lock(&priv->ioctl_lock);
> > +     codetag_lock_module_list(alloc_tag_cttype);
> > +
> > +     if (pos >= codetag_get_count(alloc_tag_cttype)) {
> > +             codetag_unlock_module_list(alloc_tag_cttype);
> > +             mutex_unlock(&priv->ioctl_lock);
> > +             return -ENOENT;
> > +     }
> > +
> > +     /* Find the codetag */
> > +     priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> > +     ct = codetag_next_ct(&priv->ioctl_iter);
> > +     while (ct && pos--)
> > +             ct = codetag_next_ct(&priv->ioctl_iter);
> > +     if (ct) {
> > +             allocinfo_to_params(ct, &params.data);
> > +             priv->positioned = true;
> > +     }
> > +
> > +     codetag_unlock_module_list(alloc_tag_cttype);
> > +     mutex_unlock(&priv->ioctl_lock);
> > +
> > +     if (!ct)
> > +             return -ENOENT;
> > +
> > +     if (copy_to_user(arg, &params, sizeof(params)))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Advances the ioctl iterator to the next allocation tag in the sequence and
> > + * returns its profiling data to userspace.
> > + */
> > +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> > +{
> > +     struct allocinfo_private *priv;
> > +     struct codetag *ct;
> > +     struct allocinfo_tag_data params;
> > +     int ret = 0;
> > +
> > +     memset(&params, 0, sizeof(params));
> > +     priv = m->private;
> > +
> > +     mutex_lock(&priv->ioctl_lock);
> > +     codetag_lock_module_list(alloc_tag_cttype);
> > +
> > +     if (!priv->positioned) {
> > +             priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> > +             priv->positioned = true;
> > +     }
> > +
> > +     ct = codetag_next_ct(&priv->ioctl_iter);
> > +     if (ct)
> > +             allocinfo_to_params(ct, &params);
> > +
> > +     if (!ct) {
> > +             priv->positioned = false;
> > +             ret = -ENOENT;
> > +     }
> > +     codetag_unlock_module_list(alloc_tag_cttype);
> > +     mutex_unlock(&priv->ioctl_lock);
> > +
> > +     if (ret == 0) {
> > +             if (copy_to_user(arg, &params, sizeof(params)))
> > +                     return -EFAULT;
> > +     }
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Entry point ioctl function for /proc/allocinfo routing requests to fetch the
> > + * layout content ID, seek to a specific tag, or read sequential tags.
> > + */
> > +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> > +                         unsigned long __arg)
> > +{
> > +     void __user *arg = (void __user *)__arg;
> > +     int ret;
> > +
> > +     switch (cmd) {
> > +     case ALLOCINFO_IOC_CONTENT_ID:
> > +             ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> > +             break;
> > +     case ALLOCINFO_IOC_GET_AT:
> > +             ret = allocinfo_ioctl_get_at(file->private_data, arg);
> > +             break;
> > +     case ALLOCINFO_IOC_GET_NEXT:
> > +             ret = allocinfo_ioctl_get_next(file->private_data, arg);
> > +             break;
> > +     default:
> > +             ret = -ENOIOCTLCMD;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> > +                                unsigned long arg)
> > +{
> > +     return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> > +}
> > +#endif
> > +
> > +static const struct proc_ops allocinfo_proc_ops = {
> > +     .proc_open              = allocinfo_open,
> > +     .proc_read_iter         = seq_read_iter,
> > +     .proc_lseek             = seq_lseek,
> > +     .proc_release           = allocinfo_release,
> > +     .proc_ioctl             = allocinfo_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +     .proc_compat_ioctl      = allocinfo_compat_ioctl,
> > +#endif
> > +};
> > +
> >   size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> >   {
> >       struct codetag_iterator iter;
> > @@ -993,8 +1225,7 @@ static int __init alloc_tag_init(void)
> >               return 0;
> >       }
> >
> > -     if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> > -                                  sizeof(struct allocinfo_private), NULL)) {
> > +     if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> >               pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> >               shutdown_mem_profiling(false);
> >               return -ENOMEM;
> > diff --git a/lib/codetag.c b/lib/codetag.c
> > index 4001a7ea6675..a9cda4c962a3 100644
> > --- a/lib/codetag.c
> > +++ b/lib/codetag.c
> > @@ -19,6 +19,8 @@ struct codetag_type {
> >       struct codetag_type_desc desc;
> >       /* generates unique sequence number for module load */
> >       unsigned long next_mod_seq;
> > +     /* bumped on every module load and unload */
> > +     unsigned long content_id;
> >   };
> >
> >   struct codetag_range {
> > @@ -50,6 +52,20 @@ void codetag_unlock_module_list(struct codetag_type *cttype)
> >       up_read(&cttype->mod_lock);
> >   }
> >
> > +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> > +{
> > +     lockdep_assert_held(&cttype->mod_lock);
> > +
> > +     return cttype->content_id;
> > +}
> > +
> > +unsigned int codetag_get_count(struct codetag_type *cttype)
> > +{
> > +     lockdep_assert_held(&cttype->mod_lock);
> > +
> > +     return cttype->count;
> > +}
> > +
> >   struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> >   {
> >       struct codetag_iterator iter = {
> > @@ -204,6 +220,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> >
> >       down_write(&cttype->mod_lock);
> >       cmod->mod_seq = ++cttype->next_mod_seq;
> > +     ++cttype->content_id;
> >       mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> >       if (mod_id >= 0) {
> >               if (cttype->desc.module_load) {
> > @@ -368,6 +385,7 @@ void codetag_unload_module(struct module *mod)
> >                       cttype->count -= range_size(cttype, &cmod->range);
> >                       idr_remove(&cttype->mod_idr, mod_id);
> >                       kfree(cmod);
> > +                     ++cttype->content_id;
> >               }
> >               up_write(&cttype->mod_lock);
> >               if (found && cttype->desc.free_section_mem)

^ permalink raw reply

* Re: [PATCH RFC 3/4] printk: nbcon: move printk_delay to console emiting code
From: Petr Mladek @ 2026-06-17 15:45 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Jonathan Corbet, Shuah Khan, Russell King, Florian Fainelli,
	Broadcom internal kernel review list, Ray Jui, Scott Branden,
	Steven Rostedt, John Ogness, Sergey Senozhatsky, Andrew Morton,
	Sebastian Andrzej Siewior, Clark Williams, Randy Dunlap,
	Linus Torvalds, linux-doc, linux-kernel, linux-arm-kernel,
	linux-rpi-kernel, linux-rt-devel
In-Reply-To: <CALqELGymunRD=ku6CpoZ6OMH9QOYvyPB-EMN24ez5WAeXFtCsg@mail.gmail.com>

On Sun 2026-06-14 13:55:31, Andrew Murray wrote:
> On Mon, 8 Jun 2026 at 16:25, Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Mon 2026-06-01 00:17:39, Andrew Murray wrote:
> > > The printk_delay and boot_delay features are helpful for debugging
> > > as kernel output can be slowed down during boot allowing messages to
> > > be seen before scrolling off the screen, or to correlate timing between
> > > some physical event and console output.
> > >
> > > However, since the introduction of nbcon and the legacy printer thread
> > > for PREEMPT_RT kernels, printk records are now emited to the console
> > > asynchronously to the caller of printk. Thus, any printk delay added by
> > > boot_delay/printk_delay continues to slow down the calling process but
> > > may not have any impact to the rate in which records are emited to the
> > > console.
> > >
> > > Let's address this by moving the printk delay from the calling code
> > > to the console emiting code instead. Whilst this ensures that delays
> > > are still observed (especially for slower consoles), it doesn't improve
> > > the use-case of using boot_delay/printk_delay to correlate timings
> > > between physical events and console output.
> > >
> > > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> > > index d7044a7a214bdd4537a5e20d876d99bc3ffe8b3a..a507a2fed5bf4366e24330f763b842a698ecf6f7 100644
> > > --- a/kernel/printk/nbcon.c
> > > +++ b/kernel/printk/nbcon.c
> > > @@ -1267,11 +1267,16 @@ static int nbcon_kthread_func(void *__console)
> > >
> > >               con_flags = console_srcu_read_flags(con);
> > >
> > > +             wctxt.len = 0;
> > > +
> > >               if (console_is_usable(con, con_flags, false))
> > >                       backlog = nbcon_emit_one(&wctxt, false);
> > >
> > >               console_srcu_read_unlock(cookie);
> > >
> > > +             if (backlog && wctxt.len > 0)
> >
> > Heh, this is tricky. It might probably work but it is not guarantted
> > by design.
> >
> > The "backlog" name is a bit misleading. The value is basically
> > wctxt.ctxt.backlog. The real meaning is that printk_get_next_message()
> > was able to read a message. It means that there _was_ a backlog.
> > But it is not clear whether there are still pending messages or not.
> 
> Yes I found that to be the case (see my notes in the cover letter) -
> backlog is only true if a record was successfully retrieved, though
> that record may be one that is suppressed.
> 
> 
> >
> > Also it is not clear that whether the message was pushed to the
> > console or not. It might have been supressed in which case
> > (wctxt.len == 0). But it might also be emitted only partially
> > when a higher priority context took over the console context
> > ownership.
> 
> You say it might probably work but isn't guaranteed by design, I'm
> struggling to see what I've missed...

Ah, I am sorry I was not clear enough, see below.

> As far as I could tell, nbcon_emit_next_record only returns true when
> a record has been printed and it still has context. The only exception
> to that is where pmsg.outbuf_len is zero (suppressed), in which case
> it may return true. Thus if (nbcon_emit_next_record() &&
> !pmsg.outbuf_len) then we can be sure a record was printed. In order
> to apply this test from the various callers...
> 
> for nbcon_emit_one - this returns ctxt->backlog if
> nbcon_emit_next_record returned true. But backlog is *always* true
> when nbcon_emit_next_record returns true. Thus the test of (backlog &&
> wctxt.len) is equivelant to (nbcon_emit_next_record() &&
> !pmsg.outbuf_len).
> 
> So I still think this implementation is valid.

You are right. Your change would work with the current code.

When I talked about the design I meant that we "did not expect"
that anyone would need to check whether nbcon_emit_next_record()
really emitted something.

Or better to say, the design has always been complicated.
And we had to review it when the callers needed anything.
For example, see see the commit ba00f7c4d0051e351e
("printk: console_flush_one_record() code cleanup").
We did it before commit 1bc9a28f076fa68736 ("printk:
Use console_flush_one_record for legacy printer kthread").

And the check if (backlog && wctxt.len > 0) is not a good design.
I would say that it depends on internal implementation details
which might change in the future. Also the variable name "backlog"
does not well describe the real behavior.

This is why I suggested to add the @emitted field into
struct nbcon_context instead. It might look like it adds some
complexity. But the semantic of this field will be clear and
simple. So, it should make the life easier in the long term.

> > I would prefer to explicitely set some flag when
> > nbcon_emit_next_record() really called con->write*().
> > See below.
> >
> > > +                     printk_delay(false);
> > > +
> > >               cond_resched();
> > >
> > >       } while (backlog);
> > > @@ -1595,7 +1604,9 @@ static int __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq)
> > >                       nbcon_context_release(ctxt);
> > >               }
> > >
> > > -             if (!ctxt->backlog) {
> > > +             if (ctxt->backlog && wctxt.len > 0) {
> > > +                     printk_delay(true);
> > > +             } else {
> >
> > This changes the semantic. The original code call this when
> > no message was read. The new code would call this path also
> > when the output was suppressed. It would probably work.
> > But still.
> 
> Ah, good spot! I missed that.
> 
> 
> >
> > >                       /* Are there reserved but not yet finalized records? */
> > >                       if (nbcon_seq_read(con) < stop_seq)
> > >                               err = -ENOENT;
> >
> >
> > As mentioned above, I would add a flag which would be set when
> > con->write*() was called.
> 
> I'm not sure why I tried to avoid adding members to nbcon_context, but
> I prefer your solution, it isn't so fragile, and makes it easier to
> understand. I'll update for my next revision.

Uff :-)

> >
> > It modifies the type of unsafe_takeover in struct nbcon_write_context.
> > But it actually makes it more compatible with struct nbcon_state.
> 
> What is the intent of this change (bool to unsigned char)?

I meant that my proposed change switched it from bool to 1 bit.
But I see that it will not be fully compatible anyway because it is

   unsigned int unsafe_takeover	: 1 in struct nbcon_state

vs
   unsigned char unsafe_takeover : 1 in struct struct nbcon_write_context.

Honestly, I am not sure what C standard says about bool vs. bit
and bit vs. bit operations. But I believe that compilers would
do the right thing in both cases. Anyway, bit vs. bit sounds cleaner.


> > My proposal (on top of this patch):
> >
> > diff --git a/include/linux/console.h b/include/linux/console.h
> > index 5520e4477ad7..5a86942e55ef 100644
> > --- a/include/linux/console.h
> > +++ b/include/linux/console.h
> > @@ -290,6 +290,7 @@ struct nbcon_context {
> >   * @outbuf:            Pointer to the text buffer for output
> >   * @len:               Length to write
> >   * @unsafe_takeover:   If a hostile takeover in an unsafe state has occurred
> > + * @emitted:           The write context tried to emit the message. Might be incomplete.
> >   * @cpu:               CPU on which the message was generated
> >   * @pid:               PID of the task that generated the message
> >   * @comm:              Name of the task that generated the message
> > @@ -298,7 +299,8 @@ struct nbcon_write_context {
> >         struct nbcon_context    __private ctxt;
> >         char                    *outbuf;
> >         unsigned int            len;
> > -       bool                    unsafe_takeover;
> > +       unsigned char           unsafe_takeover :  1;
> > +       unsigned char           emitted : 1
> >  #ifdef CONFIG_PRINTK_EXECUTION_CTX
> >         int                     cpu;
> >         pid_t                   pid;

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH v5 5/6] kselftest: alloc_tag: add kselftest for ioctl interface
From: Suren Baghdasaryan @ 2026-06-17 15:39 UTC (permalink / raw)
  To: Abhishek Bapat
  Cc: Hao Ge, Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel,
	linux-mm, Sourav Panda, Andrew Morton, Kent Overstreet
In-Reply-To: <CAL41Mv7s9iA6a3YKV40E-9vk9WZfCjMEgBbLq_FVOjRS-PdVvA@mail.gmail.com>

On Tue, Jun 16, 2026 at 9:56 AM Abhishek Bapat <abhishekbapat@google.com> wrote:
>
> On Mon, Jun 15, 2026 at 11:01 PM Hao Ge <hao.ge@linux.dev> wrote:
> >
> > Hi Abhishek
> >
> >
> > On 2026/6/16 07:04, Abhishek Bapat wrote:
> > > Introduce a kselftest to verify the new IOCTL-based interface for
> > > /proc/allocinfo. The test covers:
> > >
> > > 1. Validation of the filename filter.
> > > 2. Validation of the function filter.
> > >
> > > The first test validates the functionality of the filename filter. Using
> > > "mm/memory.c" as the candidate filename filter, it retrieves filtered
> > > entries from both procfs and ioctl and matches the first VEC_MAX_ENTRIES
> > > entries.
> > >
> > > The second test validates the functionality of the function filter.
> > > It uses "dup_mm" as the candidate function as we do not expect this
> > > function name to change frequently and hence won't be needing to modify
> > > this test often.
> > >
> > > Note that both the tests match line no, function name and file name
> > > fields. Bytes allocated and calls are not matched as those values may
> > > change in the time when the data is being read from procfs and ioctl and
> > > hence can lead to false negatives.
> > >
> > > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > > ---
> > >   MAINTAINERS                                   |   1 +
> > >   tools/testing/selftests/alloc_tag/Makefile    |   9 +
> > >   .../alloc_tag/allocinfo_ioctl_test.c          | 331 ++++++++++++++++++
> > >   3 files changed, 341 insertions(+)
> > >   create mode 100644 tools/testing/selftests/alloc_tag/Makefile
> > >   create mode 100644 tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 019cc4c285a3..6610dd42e484 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -16715,6 +16715,7 @@ F:    include/linux/alloc_tag.h
> > >   F:  include/linux/pgalloc_tag.h
> > >   F:  include/uapi/linux/alloc_tag.h
> > >   F:  lib/alloc_tag.c
> > > +F:   tools/testing/selftests/alloc_tag/
> > >
> > >   MEMORY CONTROLLER DRIVERS
> > >   M:  Krzysztof Kozlowski <krzk@kernel.org>
> > > diff --git a/tools/testing/selftests/alloc_tag/Makefile b/tools/testing/selftests/alloc_tag/Makefile
> > > new file mode 100644
> > > index 000000000000..f2b8fc022c3b
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/alloc_tag/Makefile
> > > @@ -0,0 +1,9 @@
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +
> > > +TEST_GEN_PROGS := allocinfo_ioctl_test
> > > +
> > > +CFLAGS += -Wall
> > > +CFLAGS += -I../../../../usr/include
> >
> >
> > I think we should replace -I../../../../usr/include with $(KHDR_INCLUDES).
> >
> >
> > > +
> > > +include ../lib.mk
> > > +
> >
> >
> > We would also introduce an extra field in the parent directory,
> >
> > allowing our alloc_tag target to be built when running make
> >
> > within /home/linux/tools/testing/selftests.
> >
> > like this:
> >
> > --- a/tools/testing/selftests/Makefile
> > +++ b/tools/testing/selftests/Makefile
> > @@ -1,5 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   TARGETS += acct
> > +TARGETS += alloc_tag
> >   TARGETS += alsa
> >   TARGETS += amd-pstate
> >   TARGETS += arm64
> >
> > Below is the relevant build log:
> >
> > [root@localhost selftests]# make -j8
> >    CC       acct_syscall
> >    CC       allocinfo_ioctl_test
> >
> > Below is the log from running make clean:
> >
> > [root@localhost selftests]# make clean
> > rm -f -r /home/linux/tools/testing/selftests/acct/acct_syscall
> > rm -f -r /home/linux/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test
> >
> >
> > (Sorry about this inconvenience. I had limited experience with the
> > kselftest framework,
> >
> > so I reached out to my teammates to sort out the relevant build logic.
> >
> > It's my fault for not catching this issue in prior reviews
> >
> > and forcing you to send a revised patch.)
> >
> >
> Before writing this patch, I had discussed this with Suren and he was
> of the opinion that this should be done later in a separate patch. So
> I'll let Suren answer this further.

Hmm. Don't recall exact details of that discussion but it does not
make sense to introduce an issue just to clean it up later. I agree
with Hao and if I said otherwise before, I was wrong.
Thanks,
Suren.

>
> > > diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > > new file mode 100644
> > > index 000000000000..62d5a488a04d
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > > @@ -0,0 +1,331 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +/* kselftest for allocinfo ioctl
> > > + * allocinfo ioctl retrives allocinfo data through ioctl
> > > + * Copyright (C) 2026 Google, Inc.
> > > + */
> > > +
> > > +#include <errno.h>
> >
> >
> > errno.h is unused and may be dropped.
> >
> >
> > > +#include <fcntl.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <string.h>
> > > +#include <stdbool.h>
> > > +#include <unistd.h>
> > > +#include <sys/ioctl.h>
> > > +#include <linux/types.h>
> > > +#include <linux/alloc_tag.h>
> > > +#include "../kselftest.h"
> > > +
> > > +#define MAX_LINE_LEN         512
> > > +#define ALLOCINFO_PROC               "/proc/allocinfo"
> > > +
> > > +enum ioctl_ret {
> > > +     IOCTL_SUCCESS = 0,
> > > +     IOCTL_FAILURE = 1,
> > > +     IOCTL_INVALID_DATA = 2,
> > > +};
> > > +
> > > +#define VEC_MAX_ENTRIES 32
> > > +
> > > +struct allocinfo_tag_data_vec {
> > > +     struct allocinfo_tag_data tag[VEC_MAX_ENTRIES];
> > > +     __u64 count;
> > > +};
> > > +
> > > +static inline int __allocinfo_get_content_id(int dev_fd, struct allocinfo_content_id *params)
> > > +{
> > > +     return ioctl(dev_fd, ALLOCINFO_IOC_CONTENT_ID, params);
> > > +}
> > > +
> > > +static inline int __allocinfo_get_at(int dev_fd, struct allocinfo_get_at *params)
> > > +{
> > > +     return ioctl(dev_fd, ALLOCINFO_IOC_GET_AT, params);
> > > +}
> > > +
> > > +static inline int __allocinfo_get_next(int dev_fd, struct allocinfo_tag_data *params)
> > > +{
> > > +     return ioctl(dev_fd, ALLOCINFO_IOC_GET_NEXT, params);
> > > +}
> > > +
> > > +static bool match_entry(const struct allocinfo_tag_data *procfs_entry,
> > > +                     const struct allocinfo_tag_data *tag_data,
> > > +                     bool match_bytes, bool match_calls, bool match_lineno,
> > > +                     bool match_function, bool match_filename)
> > > +{
> > > +     if (match_bytes && tag_data->counter.bytes != procfs_entry->counter.bytes) {
> > > +             ksft_print_msg("size retrieved through ioctl does not match procfs\n");
> > > +             return false;
> > > +     }
> > > +
> > > +     if (match_calls && tag_data->counter.calls != procfs_entry->counter.calls) {
> > > +             ksft_print_msg("call count retrieved through ioctl does not match procfs\n");
> > > +             return false;
> > > +     }
> > > +
> > > +     if (match_lineno && tag_data->tag.lineno != procfs_entry->tag.lineno) {
> > > +             ksft_print_msg("lineno retrieved through ioctl does not match procfs\n");
> > > +             return false;
> > > +     }
> > > +
> > > +     if (match_function &&
> > > +         strncmp(tag_data->tag.function, procfs_entry->tag.function, ALLOCINFO_STR_SIZE)) {
> > > +             ksft_print_msg("function retrieved through ioctl does not match procfs\n");
> > > +             return false;
> > > +     }
> > > +
> > > +     if (match_filename &&
> > > +         strncmp(tag_data->tag.filename, procfs_entry->tag.filename, ALLOCINFO_STR_SIZE)) {
> > > +             ksft_print_msg("filename retrieved through ioctl does not match procfs\n");
> > > +             return false;
> > > +     }
> > > +     return true;
> > > +}
> > > +
> > > +static bool match_entries(const struct allocinfo_tag_data_vec *procfs_entries,
> > > +                       const struct allocinfo_tag_data_vec *tags,
> > > +                       bool match_bytes, bool match_calls, bool match_lineno,
> > > +                       bool match_function, bool match_filename)
> > > +{
> > > +     __u64 i;
> > > +
> > > +     if (procfs_entries->count != tags->count) {
> > > +             ksft_print_msg("Entry count mismatch. ioctl entries: %llu, proc entries: %llu\n",
> > > +                            tags->count, procfs_entries->count);
> > > +             return false;
> > > +     }
> > > +     for (i = 0; i < procfs_entries->count; i++) {
> > > +             if (!match_entry(&procfs_entries->tag[i], &tags->tag[i],
> > > +                              match_bytes, match_calls, match_lineno,
> > > +                              match_function, match_filename)) {
> > > +                     ksft_print_msg("%lluth entry does not match.\n", i);
> > > +                     return false;
> > > +             }
> > > +     }
> > > +     return true;
> > > +}
> > > +
> > > +static const char *allocinfo_str(const char *str)
> > > +{
> > > +     size_t len = strlen(str);
> > > +
> > > +     if (len >= ALLOCINFO_STR_SIZE)
> > > +             str += (len - ALLOCINFO_STR_SIZE) + 1;
> > > +     return str;
> > > +}
> > > +
> > > +static void allocinfo_copy_str(char *dest, const char *src)
> > > +{
> > > +     strncpy(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE - 1);
> > > +     dest[ALLOCINFO_STR_SIZE - 1] = '\0';
> > > +}
> > > +
> > > +static int get_filtered_procfs_entries(struct allocinfo_tag_data_vec *procfs_entries,
> > > +                                    const struct allocinfo_filter *filter)
> > > +{
> > > +     FILE *fp = fopen(ALLOCINFO_PROC, "r");
> > > +     char line[MAX_LINE_LEN];
> > > +     int matches;
> > > +     struct allocinfo_tag_data procfs_entry;
> > > +
> > > +     if (!fp) {
> > > +             ksft_print_msg("Failed to open " ALLOCINFO_PROC " for reading\n");
> > > +             return 1;
> > > +     }
> > > +     memset(procfs_entries, 0, sizeof(*procfs_entries));
> > > +     while (fgets(line, sizeof(line), fp) && procfs_entries->count < VEC_MAX_ENTRIES) {
> > > +             char filename[MAX_LINE_LEN];
> > > +             char function[MAX_LINE_LEN];
> > > +
> > > +             memset(&procfs_entry, 0, sizeof(procfs_entry));
> > > +             matches = sscanf(line, "%llu %llu %[^:]:%llu func:%s",
> > > +                              &procfs_entry.counter.bytes,
> > > +                              &procfs_entry.counter.calls,
> > > +                              filename,
> > > +                              &procfs_entry.tag.lineno,
> > > +                              function);
> > > +
> > > +             if (matches != 5)
> > > +                     continue;
> > > +
> > > +             allocinfo_copy_str(procfs_entry.tag.filename, filename);
> > > +             allocinfo_copy_str(procfs_entry.tag.function, function);
> > > +
> > > +             if (filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) {
> > > +                     if (strncmp(procfs_entry.tag.filename,
> > > +                                 filter->fields.filename, ALLOCINFO_STR_SIZE))
> > > +                             continue;
> > > +             }
> > > +             if (filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) {
> > > +                     if (strncmp(procfs_entry.tag.function,
> > > +                                 filter->fields.function, ALLOCINFO_STR_SIZE))
> > > +                             continue;
> > > +             }
> > > +             if (filter->mask & ALLOCINFO_FILTER_MASK_LINENO) {
> > > +                     if (procfs_entry.tag.lineno != filter->fields.lineno)
> > > +                             continue;
> > > +             }
> > > +             if (filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) {
> > > +                     if (procfs_entry.counter.bytes < filter->min_size)
> > > +                             continue;
> > > +             }
> > > +             if (filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) {
> > > +                     if (procfs_entry.counter.bytes > filter->max_size)
> > > +                             continue;
> > > +             }
> > > +
> > > +             memcpy(&procfs_entries->tag[procfs_entries->count++], &procfs_entry,
> > > +                    sizeof(procfs_entry));
> > > +     }
> > > +     fclose(fp);
> > > +     return 0;
> > > +}
> > > +
> > > +static enum ioctl_ret get_filtered_ioctl_entries(struct allocinfo_tag_data_vec *tags,
> > > +                                              const struct allocinfo_filter *filter,
> > > +                                              __u64 start_pos)
> > > +{
> > > +     int fd = open(ALLOCINFO_PROC, O_RDONLY);
> > > +
> > > +     if (fd < 0) {
> > > +             ksft_print_msg("Failed to open " ALLOCINFO_PROC " for IOCTL\n");
> > > +             return IOCTL_FAILURE;
> > > +     }
> > > +     struct allocinfo_content_id start_cont_id, end_cont_id;
> > > +     struct allocinfo_get_at get_at_params;
> > > +     const int max_retries = 10;
> > > +     int retry_count = 0;
> > > +     int status;
> > > +
> > > +     /*
> > > +      * __allocinfo_get_content_id may return different values if a kernel module was loaded
> > > +      * between the two calls. If that happens, the data gathered cannot be considered consistent
> > > +      * and hence needs to be fetched again to avoid flakiness.
> > > +      */
> > > +     do {
> > > +             if (__allocinfo_get_content_id(fd, &start_cont_id)) {
> > > +                     ksft_print_msg("allocinfo_get_content_id failed\n");
> > > +                     return IOCTL_FAILURE;
> >
> > Sashiko pointed out the fd leak in get_filtered_ioctl_entries().
> >
> > Since we already need to update the patch to fix the missing
> >
> > TARGETS entry in the top-level Makefile, we might as well fix
> >
> > this at the same time.
> >
> >
> > Thanks
> >
> > Best Regards
> >
> > Hao
> >
>
> Ahh I missed this code path interestingly. Ack, I'll remove these
> return statements here.
>
> > > +             }
> > > +
> > > +             memset(tags, 0, sizeof(*tags));
> > > +             memset(&get_at_params, 0, sizeof(get_at_params));
> > > +             memcpy(&get_at_params.filter, filter, sizeof(*filter));
> > > +             get_at_params.pos = start_pos;
> > > +             if (__allocinfo_get_at(fd, &get_at_params)) {
> > > +                     ksft_print_msg("allocinfo_get_at failed\n");
> > > +                     return IOCTL_FAILURE;
> > > +             }
> > > +             memcpy(&tags->tag[tags->count++], &get_at_params.data, sizeof(get_at_params.data));
> > > +
> > > +             while (tags->count < VEC_MAX_ENTRIES &&
> > > +                    __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
> > > +                     tags->count++;
> > > +
> > > +             if (__allocinfo_get_content_id(fd, &end_cont_id)) {
> > > +                     ksft_print_msg("allocinfo_get_content_id failed\n");
> > > +                     return IOCTL_FAILURE;
> > > +             }
> > > +
> > > +             if (start_cont_id.id == end_cont_id.id) {
> > > +                     status = IOCTL_SUCCESS;
> > > +             } else {
> > > +                     ksft_print_msg("allocinfo_get_content_id mismatch, retrying...\n");
> > > +                     status = IOCTL_INVALID_DATA;
> > > +             }
> > > +     } while (status == IOCTL_INVALID_DATA && retry_count++ < max_retries);
> > > +
> > > +     close(fd);
> > > +     return status;
> > > +}
> > > +
> > > +static int run_filter_test(const struct allocinfo_filter *filter)
> > > +{
> > > +     struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> > > +     struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> > > +     int ioctl_status;
> > > +     int ret = KSFT_PASS;
> > > +
> > > +     if (!tags || !procfs_entries) {
> > > +             ksft_print_msg("Memory allocation failed.\n");
> > > +             ret = KSFT_FAIL;
> > > +             goto exit;
> > > +     }
> > > +
> > > +     if (get_filtered_procfs_entries(procfs_entries, filter)) {
> > > +             ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> > > +             ret = KSFT_SKIP;
> > > +             goto exit;
> > > +     }
> > > +
> > > +     if (procfs_entries->count == 0) {
> > > +             ksft_print_msg("No entries found in " ALLOCINFO_PROC ", skipping test\n");
> > > +             ret = KSFT_SKIP;
> > > +             goto exit;
> > > +     }
> > > +
> > > +     ioctl_status = get_filtered_ioctl_entries(tags, filter, 0);
> > > +     if (ioctl_status == IOCTL_INVALID_DATA) {
> > > +             ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
> > > +             ret = KSFT_SKIP;
> > > +             goto exit;
> > > +     }
> > > +     if (ioctl_status == IOCTL_FAILURE) {
> > > +             ksft_print_msg("Error retrieving IOCTL entries.\n");
> > > +             ret = KSFT_FAIL;
> > > +             goto exit;
> > > +     }
> > > +
> > > +     if (!match_entries(procfs_entries, tags, false, false, true, true, true))
> > > +             ret = KSFT_FAIL;
> > > +
> > > +exit:
> > > +     free(tags);
> > > +     free(procfs_entries);
> > > +     return ret;
> > > +}
> > > +
> > > +static int test_filename_filter(void)
> > > +{
> > > +     struct allocinfo_filter filter;
> > > +     const char *target_filename = "mm/memory.c";
> > > +
> > > +     memset(&filter, 0, sizeof(filter));
> > > +     filter.mask |= ALLOCINFO_FILTER_MASK_FILENAME;
> > > +     strncpy(filter.fields.filename, target_filename, ALLOCINFO_STR_SIZE);
> > > +
> > > +     return run_filter_test(&filter);
> > > +}
> > > +
> > > +static int test_function_filter(void)
> > > +{
> > > +     struct allocinfo_filter filter;
> > > +     const char *target_function = "dup_mm";
> > > +
> > > +     memset(&filter, 0, sizeof(filter));
> > > +     filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
> > > +     strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
> > > +
> > > +     return run_filter_test(&filter);
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +     int ret;
> > > +
> > > +     ksft_set_plan(2);
> > > +
> > > +     ret = test_filename_filter();
> > > +     if (ret == KSFT_SKIP)
> > > +             ksft_test_result_skip("Skipping test_filename_filter\n");
> > > +     else
> > > +             ksft_test_result(ret == KSFT_PASS, "test_filename_filter\n");
> > > +
> > > +     ret = test_function_filter();
> > > +     if (ret == KSFT_SKIP)
> > > +             ksft_test_result_skip("Skipping test_function_filter\n");
> > > +     else
> > > +             ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
> > > +
> > > +     ksft_finished();
> > > +}

^ permalink raw reply

* Re: [PATCH RFC 2/4] printk: deprecate boot_delay in favour of printk_delay
From: Petr Mladek @ 2026-06-17 15:00 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Jonathan Corbet, Shuah Khan, Russell King, Florian Fainelli,
	Broadcom internal kernel review list, Ray Jui, Scott Branden,
	Steven Rostedt, John Ogness, Sergey Senozhatsky, Andrew Morton,
	Sebastian Andrzej Siewior, Clark Williams, Randy Dunlap,
	Linus Torvalds, linux-doc, linux-kernel, linux-arm-kernel,
	linux-rpi-kernel, linux-rt-devel
In-Reply-To: <CALqELGzTH8cTLVgX9CXuf_LFLgC97_yfqYJVHzU9ghPuev7SNA@mail.gmail.com>

On Sun 2026-06-14 12:45:44, Andrew Murray wrote:
> On Mon, 8 Jun 2026 at 15:07, Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Mon 2026-06-01 00:17:38, Andrew Murray wrote:
> > > The boot_delay (BOOT_PRINTK_DELAY) kernel parameter and printk_delay sysctl
> > > are two distinct mechanisms for providing similar functionality which add a
> > > delay prior to each printed printk message.
> > >
> > > boot_delay provides a kernel parameter for delaying printk output from
> > > kernel start through to boot (SYSTEM_RUNNING), whereas printk_delay is
> > > configurable only via sysctl and thus is only used post boot.
> > >
> > > Let's deprecate the boot_delay feature in favour of printk_delay. In order
> > > to preserve functionality, we'll also extend printk_delay such that it can
> > > additionally configured via a kernel parameter.
> >
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -1339,11 +1327,34 @@ static void boot_delay_msec(int level)
> > >       }
> > >  }
> > >  #else
> > > -static inline void boot_delay_msec(int level)
> > > +static inline void __init printk_delay_calculate(void)
> > > +{
> > > +}
> > > +
> > > +static inline void early_boot_delay_msec(void)
> > >  {
> >
> > It would be nice to print a warning that the early boot delay
> > does not work, something like:
> >
> >         pr_warn_once("Early boot delay does not work without CONFIG_GENERIC_CALIBRATE_DELAY enabled.\n");
> >
> > >  }
> > >  #endif
> > >
> > > +static int __init printk_delay_setup(char *str)
> > > +{
> > > +     get_option(&str, &printk_delay_msec);
> > > +     if (printk_delay_msec > 10 * 1000)
> > > +             printk_delay_msec = 0;
> >
> > Sashiko AI warns that this code accepts negative values.
> > It might cause long delays, see
> > https://sashiko.dev/#/patchset/20260601-deprecate_boot_delay-v1-0-c34c187142a6%40thegoodpenguin.co.uk
> >
> > The problem has already been there even before. But it would be nice
> > to fix it.
> 
> Thanks for pointing out Sashiko, I hadn't seen its review on my
> patches. Are authors expected to get emails from it, as I didn't?

Sashiko is able to send mails but it is opt-in.

It might create too much noise because it has false positives, it
keeps reporting minor or nice-to-fix problems which will "never"
get fixed. Also it is not predictable so that you could not reliably
check the patchset before sending.

Anyway, I thought about enabling this for printk-related patches
because Sashiko also gives a lot of useful feedback. And it is easier
to discuss it as reply to a mail. But AFAIK, it can be done only
per-mailing list. And printk does not have any dedicated mailing list.

So, I "always" search for the feedback at https://sashiko.dev/

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH 0/3] mm/zram: route block swap I/O through swap_ops
From: Jianyue Wu @ 2026-06-17 14:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Chris Li, Baoquan He, Nhat Pham, Barry Song,
	Kairui Song, Kemeng Shi, Youngjun Park, Minchan Kim,
	Sergey Senozhatsky, Jens Axboe, Matthew Wilcox (Oracle), Jan Kara,
	linux-mm, linux-kernel, linux-block, linux-doc
In-Reply-To: <20260617061743.GA19844@lst.de>

On Wed, Jun 17, 2026 at 2:17 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Jun 17, 2026 at 11:38:02AM +0800, Jianyue Wu wrote:
> > Before I rework or drop the RFC, could you outline how you see that
> > core-side model working? In particular:
> >   - How should a compressed backend like zram or future block device
> >     plug into swap_iocb / swap_ops?
>
> I don't think that is the right layer.  The virtual swap layer that is
> currently in the process of being upstreamed is the right level, and
> the actual swap devices or swap files are just a dumb backend for what
> they higher level code does.
>
> >   - What role do you expect zram to keep while the legacy block interface
> >     remains: current block swap only, or something else?
>
> For now we'll need to keep it working as-is.  It is heavily used in
> android and potentially elsewhere.  Once we have zswap fully working
> in the virtual swap layer world it might make sense to say never
> compress again in zram when REQ_SWAP is set (or maybe a new
> REQ_COPRESSED) so that we can use the core compression code without
> breaking existing setups.
>
Hello Christoph,

Thanks for the clarification.

I understand the goal is to have more common code in the core layer,
with dumb backends. On the swap path, once core has already compressed
the data, zram would only store it and not compress again, while
non-swap use of zram stays as-is.

Thanks,
Jianyue

^ permalink raw reply

* Re: [PATCH v7 00/20] nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Chuck Lever @ 2026-06-17 13:59 UTC (permalink / raw)
  To: NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Trond Myklebust, Anna Schumaker, Jonathan Corbet, Shuah Khan,
	Jeff Layton
  Cc: Steven Rostedt, Alexander Aring, Amir Goldstein, Jan Kara,
	Alexander Viro, Christian Brauner, Calum Mackay, linux-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260616-dir-deleg-v7-0-6cbc7eac0ade@kernel.org>

On Tue, 16 Jun 2026 07:58:43 -0400, Jeff Layton wrote:
> This patchset builds on the directory delegation work we did a few
> months ago to add support for CB_NOTIFY callbacks for some events. In
> particular, creates, unlinks and renames. The server also sends updated
> directory attributes in the notifications. With this support, the client
> can register interest in a directory and get notifications about changes
> within it without losing its lease.
> 
> [...]

Applied to nfsd-testing, thanks!

[01/20] nfsd: check fl_lmops in nfsd_breaker_owns_lease()
        commit: 5829ecd7e99a7ed8da577f30a3818e219e0f939a
[02/20] nfsd: add protocol support for CB_NOTIFY
        commit: 02a63ce55455420340d9d90721ba6f0914301399
[03/20] nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
        commit: b08cfbdd37224bd6963735da9532744d11d52020
[04/20] nfsd: allow nfsd to get a dir lease with an ignore mask
        commit: 124ee5f1be54c060e0a23a0ac62ecd3802094a59
[05/20] nfsd: update the fsnotify mark when setting or removing a dir delegation
        commit: 08c7feaff63d6a14ff0b868b44a5d03e2ced99e4
[06/20] nfsd: make nfsd4_callback_ops->prepare operation bool return
        commit: da30d09e299cd9a8e3928ab555eca7f4dc332f61
[07/20] nfsd: add callback encoding and decoding linkages for CB_NOTIFY
        commit: 046b4a0e1817482e2017db5262734ca35c691fce
[08/20] nfsd: use RCU to protect fi_deleg_file
        commit: 7d47d06e04e03d05be4ae7febdf81a7365054995
[09/20] nfsd: add data structures for handling CB_NOTIFY
        commit: 0db43bdffdc39bbc076c09630c87710f6c2ce3eb
[10/20] nfsd: add notification handlers for dir events
        commit: b557e3bcc740d87b1fb0b5d0fd9743ae2d9e6ece
[11/20] nfsd: apply the notify mask to the delegation when requested
        commit: 718bee77fdb46cb4074b113aa40a3c82e49bf13b
[12/20] nfsd: add helper to marshal a fattr4 from completed args
        commit: 2c495d8ec198888447018689ad33c2d52e52cb48
[13/20] nfsd: allow nfsd4_encode_fattr4_change() to work with no export
        commit: 498a0c7ad8710daaf9e92040b325055cebc31747
[14/20] nfsd: send basic file attributes in CB_NOTIFY
        commit: b78e9993ec52676ccebb161fd65996e18372f6c4
[15/20] nfsd: allow encoding a filehandle into fattr4 without a svc_fh
        commit: c2959ab33014d9aaefd2782f414cf0b78b991efa
[16/20] nfsd: add the filehandle to returned attributes in CB_NOTIFY
        commit: b4d7483c6e50fb9382f2f4977ed7eae86e4264b9
[17/20] nfsd: fix reply size estimate for GET_DIR_DELEGATION
        commit: 7c8674f3963bc3f4d2472c70884d350164fff25b
[18/20] nfsd: properly track requested child attributes
        commit: 744500529a2d378084909bfd277af18e62ce34ac
[19/20] nfsd: track requested dir attributes
        commit: df3ecee7ae57f6e1042556fa5f7b5dcae3b7b26d
[20/20] nfsd: add support to CB_NOTIFY for dir attribute changes
        commit: 0c072553ca525d62c7e8e63a6ecb7e88f84d07e0

--
Chuck Lever


^ permalink raw reply

* [PATCH v6 10/10] RAS: add firmware-first CPER provider
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Add a firmware-first CPER provider that reuses the shared
GHES helpers, wire it into the RAS Kconfig/Makefile and
document it in the admin guide.

Update MAINTAINERS now that the driver exists.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 Documentation/admin-guide/RAS/main.rst |  15 ++
 MAINTAINERS                            |   1 +
 drivers/acpi/apei/apei-internal.h      |   3 +-
 drivers/ras/Kconfig                    |  11 ++
 drivers/ras/Makefile                   |   1 +
 drivers/ras/cper-esource.c             | 322 +++++++++++++++++++++++++++++++++
 6 files changed, 351 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst
index 5a45db32c49b..d4e3c8c1b92f 100644
--- a/Documentation/admin-guide/RAS/main.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -205,6 +205,21 @@ Architecture (MCA)\ [#f3]_.
 .. [#f3] For more details about the Machine Check Architecture (MCA),
   please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.
 
+Firmware-first CPER providers
+-----------------------------
+
+Some systems expose Common Platform Error Record (CPER) data through
+platform firmware, with the error source described in DeviceTree.
+Enable ``CONFIG_RAS_CPER_ESOURCE`` to support those providers. The
+current in-tree binding is
+``Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml``.
+
+The DeviceTree node describes the firmware-owned status buffer and ack
+buffer used to exchange CPER data with the OS. The driver reuses the
+shared GHES CPER handling helpers, so parsing, logging, notifier
+delivery, and memory failure handling follow the same paths as ACPI
+GHES whether the error source is described by ACPI or DeviceTree.
+
 EDAC - Error Detection And Correction
 *************************************
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 5aa495fdff72..00b9a1abab67 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22326,6 +22326,7 @@ RAS ERROR STATUS
 M:	Ahmed Tiba <ahmed.tiba@arm.com>
 S:	Maintained
 F:	Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
+F:	drivers/ras/cper-esource.c
 
 RAS INFRASTRUCTURE
 M:	Tony Luck <tony.luck@intel.com>
diff --git a/drivers/acpi/apei/apei-internal.h b/drivers/acpi/apei/apei-internal.h
index 77c10a7a7a9f..15d11f10d067 100644
--- a/drivers/acpi/apei/apei-internal.h
+++ b/drivers/acpi/apei/apei-internal.h
@@ -123,8 +123,7 @@ struct dentry *apei_get_debugfs_dir(void);
 static inline u32 cper_estatus_len(struct acpi_hest_generic_status *estatus)
 {
 	if (estatus->raw_data_length)
-		return estatus->raw_data_offset + \
-			estatus->raw_data_length;
+		return estatus->raw_data_offset + estatus->raw_data_length;
 	else
 		return sizeof(*estatus) + estatus->data_length;
 }
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index fc4f4bb94a4c..3c1c63b2fefc 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -34,6 +34,17 @@ if RAS
 source "arch/x86/ras/Kconfig"
 source "drivers/ras/amd/atl/Kconfig"
 
+config RAS_CPER_ESOURCE
+	bool "Firmware-first CPER error source block provider"
+	select GHES_CPER_HELPERS
+	help
+	  Enable support for firmware-first Common Platform Error Record
+	  (CPER) error source block providers. The current in-tree user is
+	  described by the arm,ras-cper DeviceTree binding. The driver
+	  reuses the existing GHES CPER helpers so the error processing
+	  matches the ACPI code paths, but it can be built even when ACPI is
+	  disabled.
+
 config RAS_FMPM
 	tristate "FRU Memory Poison Manager"
 	default m
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 11f95d59d397..0de069557f31 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_RAS)	+= ras.o
 obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_RAS_CEC)	+= cec.o
+obj-$(CONFIG_RAS_CPER_ESOURCE)	+= cper-esource.o
 
 obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
 obj-y			+= amd/atl/
diff --git a/drivers/ras/cper-esource.c b/drivers/ras/cper-esource.c
new file mode 100644
index 000000000000..cc9f5f522400
--- /dev/null
+++ b/drivers/ras/cper-esource.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Firmware-first CPER error source provider.
+ *
+ * This driver shares the GHES CPER helpers so we keep the reporting and
+ * notifier behaviour identical to ACPI GHES.
+ *
+ * Copyright (C) 2026 ARM Ltd.
+ * Author: Ahmed Tiba <ahmed.tiba@arm.com>
+ */
+
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/cper.h>
+#include <linux/idr.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/of_reserved_mem.h>
+#include <linux/panic.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <acpi/ghes.h>
+#include <acpi/ghes_cper.h>
+
+static DEFINE_IDA(cper_esource_source_ids);
+
+struct cper_esource_ack {
+	void *addr;
+	u64 preserve;
+	u64 set;
+	u8 width;
+	bool present;
+};
+
+struct cper_esource {
+	struct device *dev;
+	void *status;
+	size_t status_len;
+
+	struct cper_esource_ack ack;
+
+	struct acpi_hest_generic generic;
+	struct acpi_hest_generic_status *estatus;
+
+	int irq;
+
+	/* Serializes access while firmware and the OS share the status buffer. */
+	spinlock_t lock;
+};
+
+static void *cper_esource_map_region(struct device *dev, unsigned int index,
+				     size_t *size)
+{
+	struct resource res;
+	void *addr;
+
+	if (of_reserved_mem_region_to_resource(dev->of_node, index, &res))
+		return ERR_PTR(dev_err_probe(dev, -EINVAL,
+					     "unable to resolve memory-region %u\n",
+					     index));
+
+	*size = resource_size(&res);
+	if (!*size)
+		return ERR_PTR(dev_err_probe(dev, -EINVAL,
+					     "memory-region %u has zero length\n",
+					     index));
+
+	addr = devm_memremap(dev, res.start, *size, MEMREMAP_WB);
+	if (!addr)
+		return ERR_PTR(dev_err_probe(dev, -ENOMEM,
+					     "failed to map memory-region %u\n",
+					     index));
+
+	return addr;
+}
+
+static void cper_esource_release_source_id(void *data)
+{
+	struct cper_esource *ctx = data;
+
+	ida_free(&cper_esource_source_ids, ctx->generic.header.source_id);
+}
+
+static int cper_esource_init_pool(void)
+{
+	return ghes_estatus_pool_init(1);
+}
+
+static u32 cper_esource_estatus_len(struct acpi_hest_generic_status *estatus)
+{
+	if (estatus->raw_data_length)
+		return estatus->raw_data_offset + estatus->raw_data_length;
+	else
+		return sizeof(*estatus) + estatus->data_length;
+}
+
+static int cper_esource_validate_status(struct cper_esource *ctx)
+{
+	size_t estatus_len;
+
+	if (!ctx->estatus->block_status)
+		return -ENOENT;
+
+	if (cper_estatus_check_header(ctx->estatus))
+		return -EINVAL;
+
+	if (ctx->estatus->raw_data_length &&
+	    (ctx->estatus->raw_data_offset > ctx->status_len ||
+	     ctx->estatus->raw_data_length >
+	     ctx->status_len - ctx->estatus->raw_data_offset))
+		return -EINVAL;
+
+	estatus_len = cper_esource_estatus_len(ctx->estatus);
+	if (estatus_len < sizeof(*ctx->estatus) || estatus_len > ctx->status_len)
+		return -EINVAL;
+
+	if (cper_estatus_check(ctx->estatus))
+		return -EINVAL;
+
+	return 0;
+}
+
+static void cper_esource_ack(struct cper_esource *ctx)
+{
+	if (!ctx->ack.present)
+		return;
+
+	if (ctx->ack.width == 64) {
+		u64 *addr = ctx->ack.addr;
+		u64 val = READ_ONCE(*addr);
+
+		/* Publish status-buffer updates before raising the ack bit. */
+		wmb();
+		val &= ctx->ack.preserve;
+		val |= ctx->ack.set;
+		WRITE_ONCE(*addr, val);
+	} else {
+		u32 *addr = ctx->ack.addr;
+		u32 val = READ_ONCE(*addr);
+
+		/* Publish status-buffer updates before raising the ack bit. */
+		wmb();
+		val &= (u32)ctx->ack.preserve;
+		val |= (u32)ctx->ack.set;
+		WRITE_ONCE(*addr, val);
+	}
+}
+
+static void cper_esource_clear_status(struct cper_esource *ctx)
+{
+	ctx->estatus->block_status = 0;
+	WRITE_ONCE(((struct acpi_hest_generic_status *)ctx->status)->block_status, 0);
+}
+
+static void cper_esource_fatal(struct cper_esource *ctx)
+{
+	__ghes_print_estatus(KERN_EMERG, &ctx->generic, ctx->estatus);
+	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_STILL_OK);
+	panic("GHES: fatal firmware-first CPER record from %s\n",
+	      dev_name(ctx->dev));
+}
+
+static void cper_esource_process(struct cper_esource *ctx)
+{
+	int rc;
+	int sev;
+
+	guard(spinlock_irqsave)(&ctx->lock);
+
+	memcpy(ctx->estatus, ctx->status, ctx->status_len);
+
+	rc = cper_esource_validate_status(ctx);
+	if (rc == -ENOENT)
+		return;
+	if (rc) {
+		dev_warn_ratelimited(ctx->dev, FW_WARN GHES_PFX
+				     "Invalid error status block\n");
+		cper_esource_clear_status(ctx);
+		cper_esource_ack(ctx);
+		return;
+	}
+
+	sev = ghes_severity(ctx->estatus->error_severity);
+	if (sev >= GHES_SEV_PANIC)
+		cper_esource_fatal(ctx);
+
+	ghes_print_estatus(NULL, &ctx->generic, ctx->estatus);
+
+	ghes_cper_handle_status(ctx->dev, &ctx->generic, ctx->estatus, false);
+	cper_esource_clear_status(ctx);
+	cper_esource_ack(ctx);
+}
+
+static irqreturn_t cper_esource_irq(int irq, void *data)
+{
+	struct cper_esource *ctx = data;
+
+	cper_esource_process(ctx);
+
+	return IRQ_HANDLED;
+}
+
+static int cper_esource_init_ack(struct cper_esource *ctx)
+{
+	struct device *dev = ctx->dev;
+	size_t size;
+
+	ctx->ack.addr = cper_esource_map_region(dev, 1, &size);
+	if (IS_ERR(ctx->ack.addr))
+		return PTR_ERR(ctx->ack.addr);
+
+	switch (size) {
+	case 4:
+		ctx->ack.width = 32;
+		ctx->ack.preserve = ~0U;
+		break;
+	case 8:
+		ctx->ack.width = 64;
+		ctx->ack.preserve = ~0ULL;
+		break;
+	default:
+		return dev_err_probe(dev, -EINVAL,
+				     "unsupported ack resource size %zu\n", size);
+	}
+
+	ctx->ack.set = BIT_ULL(0);
+	ctx->ack.present = true;
+	return 0;
+}
+
+static int cper_esource_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct cper_esource *ctx;
+	size_t size;
+	int source_id;
+	int rc;
+
+	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	spin_lock_init(&ctx->lock);
+	ctx->dev = dev;
+
+	ctx->status = cper_esource_map_region(dev, 0, &size);
+	if (IS_ERR(ctx->status))
+		return PTR_ERR(ctx->status);
+
+	ctx->status_len = size;
+	if (ctx->status_len < sizeof(*ctx->estatus))
+		return dev_err_probe(dev, -EINVAL,
+				     "status region is smaller than a CPER header\n");
+
+	rc = cper_esource_init_ack(ctx);
+	if (rc)
+		return rc;
+
+	rc = cper_esource_init_pool();
+	if (rc)
+		return rc;
+
+	ctx->estatus = devm_kzalloc(dev, ctx->status_len, GFP_KERNEL);
+	if (!ctx->estatus)
+		return -ENOMEM;
+
+	/* Keep source_id 0 unused so a zeroed header is never treated as valid. */
+	source_id = ida_alloc_min(&cper_esource_source_ids, 1, GFP_KERNEL);
+	if (source_id < 0)
+		return source_id;
+	if (source_id > U16_MAX) {
+		ida_free(&cper_esource_source_ids, source_id);
+		return -ENOSPC;
+	}
+
+	ctx->generic.header.type = ACPI_HEST_TYPE_GENERIC_ERROR;
+	ctx->generic.header.source_id = source_id;
+
+	rc = devm_add_action_or_reset(dev, cper_esource_release_source_id,
+				      ctx);
+	if (rc)
+		return rc;
+
+	ctx->generic.notify.type = ACPI_HEST_NOTIFY_EXTERNAL;
+	ctx->generic.error_block_length = ctx->status_len;
+
+	ctx->irq = platform_get_irq(pdev, 0);
+	if (ctx->irq < 0)
+		return ctx->irq;
+
+	rc = devm_request_threaded_irq(dev, ctx->irq, NULL, cper_esource_irq,
+				       IRQF_ONESHOT,
+				       dev_name(dev), ctx);
+	if (rc)
+		return dev_err_probe(dev, rc, "failed to request interrupt\n");
+
+	return 0;
+}
+
+static const struct of_device_id cper_esource_of_match[] = {
+	{ .compatible = "arm,ras-cper" },
+	{ /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, cper_esource_of_match);
+
+static struct platform_driver cper_esource_driver = {
+	.driver = {
+		.name = "cper-esource",
+		.of_match_table = cper_esource_of_match,
+	},
+	.probe = cper_esource_probe,
+};
+
+module_platform_driver(cper_esource_driver);
+
+MODULE_AUTHOR("Ahmed Tiba <ahmed.tiba@arm.com>");
+MODULE_DESCRIPTION("Firmware-first CPER provider");
+MODULE_LICENSE("GPL");

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 09/10] dt-bindings: firmware: add arm,ras-cper
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Describe the DeviceTree node that exposes the Arm firmware-first CPER
provider and hook the file into MAINTAINERS so the binding has an
owner.

The initial user is the upstream zena-css platform, validated so far
on FVP.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 .../devicetree/bindings/firmware/arm,ras-cper.yaml | 52 ++++++++++++++++++++++
 MAINTAINERS                                        |  5 +++
 2 files changed, 57 insertions(+)

diff --git a/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
new file mode 100644
index 000000000000..23d54008230d
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/arm,ras-cper.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm RAS CPER provider
+
+maintainers:
+  - Ahmed Tiba <ahmed.tiba@arm.com>
+
+description:
+  Arm Reliability, Availability and Serviceability (RAS) firmware can expose
+  a firmware-first CPER error source directly via DeviceTree. Firmware
+  provides the CPER Generic Error Status block and notifies the OS through
+  an interrupt.
+
+properties:
+  compatible:
+    const: arm,ras-cper
+
+  memory-region:
+    items:
+      - description:
+          CPER Generic Error Status block exposed by firmware.
+      - description:
+          Firmware-owned ack buffer. Firmware watches bit 0 and expects the
+          OS to set it once the current status block has been consumed
+          before the CPER buffer is overwritten.
+
+  interrupts:
+    maxItems: 1
+    description:
+      Interrupt used to signal that a new status record is ready.
+
+required:
+  - compatible
+  - memory-region
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+    error-handler {
+      compatible = "arm,ras-cper";
+      memory-region = <&ras_cper_buffer>, <&ras_cper_ack>;
+      interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>;
+    };
+...
diff --git a/MAINTAINERS b/MAINTAINERS
index 3d6db8cb608f..5aa495fdff72 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22322,6 +22322,11 @@ M:	Alexandre Bounine <alex.bou9@gmail.com>
 S:	Maintained
 F:	drivers/rapidio/
 
+RAS ERROR STATUS
+M:	Ahmed Tiba <ahmed.tiba@arm.com>
+S:	Maintained
+F:	Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
+
 RAS INFRASTRUCTURE
 M:	Tony Luck <tony.luck@intel.com>
 M:	Borislav Petkov <bp@alien8.de>

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 08/10] ACPI: APEI: share GHES CPER helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Wire GHES up to the helper routines in ghes_cper.c and remove the local
copies from ghes.c. This keeps the control flow identical while letting
the helpers be shared with other firmware-first providers.

The weak arch_apei_report_mem_error() fallback is only there to keep the
shared helper buildable when GHES_CPER_HELPERS is enabled without
ACPI_APEI, while preserving the current GHES behaviour when ACPI_APEI
is enabled.

Serialize ghes_estatus_pool_init() and allow later users to extend the
shared pool instead of racing a second initialization or silently
reusing a pool sized only for the first user.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c      | 416 +-------------------------------------
 drivers/acpi/apei/ghes_cper.c | 452 ++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |  20 ++
 3 files changed, 474 insertions(+), 414 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 136993704d52..2ec3c9100544 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -67,8 +67,6 @@
 #define FIX_APEI_GHES_SDEI_CRITICAL	__end_of_fixed_addresses
 #endif
 
-static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
-
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -113,421 +111,11 @@ static DEFINE_MUTEX(ghes_devs_mutex);
  */
 static DEFINE_SPINLOCK(ghes_notify_lock_irq);
 
-struct gen_pool *ghes_estatus_pool;
-
-int ghes_estatus_pool_init(unsigned int num_ghes)
-{
-	unsigned long addr, len;
-	int rc;
-
-	ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
-	if (!ghes_estatus_pool)
-		return -ENOMEM;
-
-	len = GHES_ESTATUS_CACHE_AVG_SIZE * GHES_ESTATUS_CACHE_ALLOCED_MAX;
-	len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
-
-	addr = (unsigned long)vmalloc(PAGE_ALIGN(len));
-	if (!addr)
-		goto err_pool_alloc;
-
-	rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
-	if (rc)
-		goto err_pool_add;
-
-	return 0;
-
-err_pool_add:
-	vfree((void *)addr);
-
-err_pool_alloc:
-	gen_pool_destroy(ghes_estatus_pool);
-
-	return -ENOMEM;
-}
-
-/**
- * ghes_estatus_pool_region_free - free previously allocated memory
- *				   from the ghes_estatus_pool.
- * @addr: address of memory to free.
- * @size: size of memory to free.
- *
- * Returns none.
- */
-void ghes_estatus_pool_region_free(unsigned long addr, u32 size)
-{
-	gen_pool_free(ghes_estatus_pool, addr, size);
-}
-EXPORT_SYMBOL_GPL(ghes_estatus_pool_region_free);
-
-static inline int ghes_severity(int severity)
-{
-	switch (severity) {
-	case CPER_SEV_INFORMATIONAL:
-		return GHES_SEV_NO;
-	case CPER_SEV_CORRECTED:
-		return GHES_SEV_CORRECTED;
-	case CPER_SEV_RECOVERABLE:
-		return GHES_SEV_RECOVERABLE;
-	case CPER_SEV_FATAL:
-		return GHES_SEV_PANIC;
-	default:
-		/* Unknown, go panic */
-		return GHES_SEV_PANIC;
-	}
-}
-
-
-/**
- * struct ghes_task_work - for synchronous RAS event
- *
- * @twork:                callback_head for task work
- * @pfn:                  page frame number of corrupted page
- * @flags:                work control flags
- *
- * Structure to pass task work to be handled before
- * returning to user-space via task_work_add().
- */
-struct ghes_task_work {
-	struct callback_head twork;
-	u64 pfn;
-	int flags;
-};
-
-static void memory_failure_cb(struct callback_head *twork)
-{
-	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
-	int ret;
-
-	ret = memory_failure(twcb->pfn, twcb->flags);
-	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
-
-	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
-		return;
-
-	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			twcb->pfn, current->comm, task_pid_nr(current));
-	force_sig(SIGBUS);
-}
-
-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
-{
-	struct ghes_task_work *twcb;
-	unsigned long pfn;
-
-	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		return false;
-
-	pfn = PHYS_PFN(physical_addr);
-
-	if (flags == MF_ACTION_REQUIRED && current->mm) {
-		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
-		if (!twcb)
-			return false;
-
-		twcb->pfn = pfn;
-		twcb->flags = flags;
-		init_task_work(&twcb->twork, memory_failure_cb);
-		task_work_add(current, &twcb->twork, TWA_RESUME);
-		return true;
-	}
-
-	memory_failure_queue(pfn, flags);
-	return true;
-}
-
-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
-				       int sev, bool sync)
-{
-	int flags = -1;
-	int sec_sev = ghes_severity(gdata->error_severity);
-	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
-	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
-		return false;
-
-	/* iff following two events can be handled properly by now */
-	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
-		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = sync ? MF_ACTION_REQUIRED : 0;
-
-	if (flags != -1)
-		return ghes_do_memory_failure(mem_err->physical_addr, flags);
-
-	return false;
-}
-
-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
-				     int sev, bool sync)
-{
-	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-	int flags = sync ? MF_ACTION_REQUIRED : 0;
-	int length = gdata->error_data_length;
-	char error_type[120];
-	bool queued = false;
-	int sec_sev, i;
-	char *p;
-
-	sec_sev = ghes_severity(gdata->error_severity);
-	if (length >= sizeof(*err)) {
-		log_arm_hw_error(err, sec_sev);
-	} else {
-		pr_warn(FW_BUG "arm error length: %d\n", length);
-		pr_warn(FW_BUG "length is too small\n");
-		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
-		return false;
-	}
-
-	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
-		return false;
-
-	p = (char *)(err + 1);
-	length -= sizeof(err);
-
-	for (i = 0; i < err->err_info_num; i++) {
-		struct cper_arm_err_info *err_info;
-		bool is_cache, has_pa;
-
-		/* Ensure we have enough data for the error info header */
-		if (length < sizeof(*err_info))
-			break;
-
-		err_info = (struct cper_arm_err_info *)p;
-
-		/* Validate the claimed length before using it */
-		length -= err_info->length;
-		if (length < 0)
-			break;
-
-		is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
-		has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
-
-		/*
-		 * The field (err_info->error_info & BIT(26)) is fixed to set to
-		 * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
-		 * firmware won't mix corrected errors in an uncorrected section,
-		 * and don't filter out 'corrected' error here.
-		 */
-		if (is_cache && has_pa) {
-			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
-			p += err_info->length;
-			continue;
-		}
-
-		cper_bits_to_str(error_type, sizeof(error_type),
-				 FIELD_GET(CPER_ARM_ERR_TYPE_MASK, err_info->type),
-				 cper_proc_error_type_strs,
-				 ARRAY_SIZE(cper_proc_error_type_strs));
-
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-				    "Unhandled processor error type 0x%02x: %s%s\n",
-				    err_info->type, error_type,
-				    (err_info->type & ~CPER_ARM_ERR_TYPE_MASK) ? " with reserved bit(s)" : "");
-		p += err_info->length;
-	}
-
-	return queued;
-}
-
-/*
- * PCIe AER errors need to be sent to the AER driver for reporting and
- * recovery. The GHES severities map to the following AER severities and
- * require the following handling:
- *
- * GHES_SEV_CORRECTABLE -> AER_CORRECTABLE
- *     These need to be reported by the AER driver but no recovery is
- *     necessary.
- * GHES_SEV_RECOVERABLE -> AER_NONFATAL
- * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
- *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
- */
-static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
-{
-#ifdef CONFIG_ACPI_APEI_PCIEAER
-	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
-
-	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
-	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
-		unsigned int devfn;
-		int aer_severity;
-		u8 *aer_info;
-
-		devfn = PCI_DEVFN(pcie_err->device_id.device,
-				  pcie_err->device_id.function);
-		aer_severity = cper_severity_to_aer(gdata->error_severity);
-
-		/*
-		 * If firmware reset the component to contain
-		 * the error, we must reinitialize it before
-		 * use, so treat it as a fatal AER error.
-		 */
-		if (gdata->flags & CPER_SEC_RESET)
-			aer_severity = AER_FATAL;
-
-		aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
-						  sizeof(struct aer_capability_regs));
-		if (!aer_info)
-			return;
-		memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
-
-		aer_recover_queue(pcie_err->device_id.segment,
-				  pcie_err->device_id.bus,
-				  devfn, aer_severity,
-				  (struct aer_capability_regs *)
-				  aer_info);
-	}
-#endif
-}
-
-static void ghes_log_hwerr(int sev, guid_t *sec_type)
-{
-	if (sev != CPER_SEV_RECOVERABLE)
-		return;
-
-	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
-	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
-	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
-		hwerr_log_error_type(HWERR_RECOV_CPU);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
-		hwerr_log_error_type(HWERR_RECOV_CXL);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
-	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
-		hwerr_log_error_type(HWERR_RECOV_PCI);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
-		hwerr_log_error_type(HWERR_RECOV_MEMORY);
-		return;
-	}
-
-	hwerr_log_error_type(HWERR_RECOV_OTHERS);
-}
-
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
-	int sev, sec_sev;
-	struct acpi_hest_generic_data *gdata;
-	guid_t *sec_type;
-	const guid_t *fru_id = &guid_null;
-	char *fru_text = "";
-	bool queued = false;
-	bool sync = is_hest_sync_notify(ghes);
-
-	sev = ghes_severity(estatus->error_severity);
-	apei_estatus_for_each_section(estatus, gdata) {
-		sec_type = (guid_t *)gdata->section_type;
-		sec_sev = ghes_severity(gdata->error_severity);
-		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
-			fru_id = (guid_t *)gdata->fru_id;
-
-		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
-			fru_text = gdata->fru_text;
-
-		ghes_log_hwerr(sev, sec_type);
-		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
-			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
-			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
-
-			arch_apei_report_mem_error(sev, mem_err);
-			queued = ghes_handle_memory_failure(gdata, sev, sync);
-		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
-			ghes_handle_aer(gdata);
-		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
-			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec);
-		} else {
-			void *err = acpi_hest_get_payload(gdata);
-
-			ghes_defer_non_standard_event(gdata, sev);
-			log_non_standard_event(sec_type, fru_id, fru_text,
-					       sec_sev, err,
-					       gdata->error_data_length);
-		}
-	}
-
-	/*
-	 * If no memory failure work is queued for abnormal synchronous
-	 * errors, do a force kill.
-	 */
-	if (sync && !queued) {
-		dev_err(ghes->dev,
-			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
-			current->comm, task_pid_nr(current));
-		force_sig(SIGBUS);
-	}
-}
-
-static void __ghes_print_estatus(const char *pfx,
-				 const struct acpi_hest_generic *generic,
-				 const struct acpi_hest_generic_status *estatus)
-{
-	static atomic_t seqno;
-	unsigned int curr_seqno;
-	char pfx_seq[64];
-
-	if (pfx == NULL) {
-		if (ghes_severity(estatus->error_severity) <=
-		    GHES_SEV_CORRECTED)
-			pfx = KERN_WARNING;
-		else
-			pfx = KERN_ERR;
-	}
-	curr_seqno = atomic_inc_return(&seqno);
-	snprintf(pfx_seq, sizeof(pfx_seq), "%s{%u}" HW_ERR, pfx, curr_seqno);
-	printk("%s""Hardware error from APEI Generic Hardware Error Source: %d\n",
-	       pfx_seq, generic->header.source_id);
-	cper_estatus_print(pfx_seq, estatus);
-}
-
-static int ghes_print_estatus(const char *pfx,
-			      const struct acpi_hest_generic *generic,
-			      const struct acpi_hest_generic_status *estatus)
-{
-	/* Not more than 2 messages every 5 seconds */
-	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
-	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
-	struct ratelimit_state *ratelimit;
-
-	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
-		ratelimit = &ratelimit_corrected;
-	else
-		ratelimit = &ratelimit_uncorrected;
-	if (__ratelimit(ratelimit)) {
-		__ghes_print_estatus(pfx, generic, estatus);
-		return 1;
-	}
-	return 0;
+	ghes_cper_handle_status(ghes->dev, ghes->generic,
+				estatus, is_hest_sync_notify(ghes));
 }
 
 static void __ghes_panic(struct ghes *ghes,
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 66bf1af4db00..460fe12b6513 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -13,7 +13,9 @@
  */
 
 #include <linux/aer.h>
+#include <linux/bitfield.h>
 #include <linux/cleanup.h>
+#include <linux/device.h>
 #include <linux/err.h>
 #include <linux/genalloc.h>
 #include <linux/io.h>
@@ -21,11 +23,20 @@
 #include <linux/kernel.h>
 #include <linux/math64.h>
 #include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/uuid.h>
+#include <linux/sched/signal.h>
+#include <linux/task_work.h>
 #include <linux/notifier.h>
+#include <linux/ras.h>
+#include <ras/ras_event.h>
 #include <linux/ratelimit.h>
 #include <linux/rcupdate.h>
 #include <linux/sched/clock.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
+#include <linux/vmalloc.h>
 
 #include <acpi/apei.h>
 #include <acpi/acpi_io.h>
@@ -36,9 +47,449 @@
 
 #include "apei-internal.h"
 
+ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
+
+#ifndef CONFIG_ACPI_APEI
+void __weak arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) { }
+#endif
+
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+struct gen_pool *ghes_estatus_pool;
+static DEFINE_MUTEX(ghes_estatus_pool_lock);
+
+int ghes_estatus_pool_init(unsigned int num_ghes)
+{
+	unsigned long addr, len;
+	int rc;
+
+	len = GHES_ESTATUS_CACHE_AVG_SIZE * GHES_ESTATUS_CACHE_ALLOCED_MAX;
+	len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
+	len = PAGE_ALIGN(len);
+
+	mutex_lock(&ghes_estatus_pool_lock);
+
+	if (!ghes_estatus_pool) {
+		ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
+		if (!ghes_estatus_pool) {
+			rc = -ENOMEM;
+			goto out_unlock;
+		}
+	}
+
+	addr = (unsigned long)vmalloc(len);
+	if (!addr) {
+		rc = -ENOMEM;
+		goto err_pool_alloc;
+	}
+
+	rc = gen_pool_add(ghes_estatus_pool, addr, len, -1);
+	if (rc)
+		goto err_pool_add;
+
+	mutex_unlock(&ghes_estatus_pool_lock);
+	return 0;
+
+err_pool_add:
+	vfree((void *)addr);
+
+err_pool_alloc:
+	if (!gen_pool_avail(ghes_estatus_pool) && !gen_pool_size(ghes_estatus_pool)) {
+		gen_pool_destroy(ghes_estatus_pool);
+		ghes_estatus_pool = NULL;
+	}
+out_unlock:
+	mutex_unlock(&ghes_estatus_pool_lock);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(ghes_estatus_pool_init);
+
+/**
+ * ghes_estatus_pool_region_free - free previously allocated memory
+ *				   from the ghes_estatus_pool.
+ * @addr: address of memory to free.
+ * @size: size of memory to free.
+ *
+ * Returns none.
+ */
+void ghes_estatus_pool_region_free(unsigned long addr, u32 size)
+{
+	gen_pool_free(ghes_estatus_pool, addr, size);
+}
+EXPORT_SYMBOL_GPL(ghes_estatus_pool_region_free);
+
+int ghes_severity(int severity)
+{
+	switch (severity) {
+	case CPER_SEV_INFORMATIONAL:
+		return GHES_SEV_NO;
+	case CPER_SEV_CORRECTED:
+		return GHES_SEV_CORRECTED;
+	case CPER_SEV_RECOVERABLE:
+		return GHES_SEV_RECOVERABLE;
+	case CPER_SEV_FATAL:
+		return GHES_SEV_PANIC;
+	default:
+		/* Unknown, go panic */
+		return GHES_SEV_PANIC;
+	}
+}
+
+/**
+ * struct ghes_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                work control flags
+ *
+ * Structure to pass task work to be handled before
+ * returning to user-space via task_work_add().
+ */
+struct ghes_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
+{
+	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
+	int ret;
+
+	ret = memory_failure(twcb->pfn, twcb->flags);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
+
+	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+		return;
+
+	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+			twcb->pfn, current->comm, task_pid_nr(current));
+	force_sig(SIGBUS);
+}
+
+static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+{
+	struct ghes_task_work *twcb;
+	unsigned long pfn;
+
+	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
+		return false;
+
+	pfn = PHYS_PFN(physical_addr);
+
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return true;
+	}
+
+	memory_failure_queue(pfn, flags);
+	return true;
+}
+
+bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+				int sev, bool sync)
+{
+	int flags = -1;
+	int sec_sev = ghes_severity(gdata->error_severity);
+	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+
+	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
+		return false;
+
+	/* iff following two events can be handled properly by now */
+	if (sec_sev == GHES_SEV_CORRECTED &&
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
+		flags = MF_SOFT_OFFLINE;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
+		flags = sync ? MF_ACTION_REQUIRED : 0;
+
+	if (flags != -1)
+		return ghes_do_memory_failure(mem_err->physical_addr, flags);
+
+	return false;
+}
+
+bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+			      int sev, bool sync)
+{
+	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+	int flags = sync ? MF_ACTION_REQUIRED : 0;
+	int length = gdata->error_data_length;
+	char error_type[120];
+	bool queued = false;
+	int sec_sev, i;
+	char *p;
+
+	sec_sev = ghes_severity(gdata->error_severity);
+	if (length >= sizeof(*err)) {
+		log_arm_hw_error(err, sec_sev);
+	} else {
+		pr_warn(FW_BUG "arm error length: %d\n", length);
+		pr_warn(FW_BUG "length is too small\n");
+		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
+		return false;
+	}
+
+	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
+		return false;
+
+	p = (char *)(err + 1);
+	length -= sizeof(err);
+
+	for (i = 0; i < err->err_info_num; i++) {
+		struct cper_arm_err_info *err_info;
+		bool is_cache, has_pa;
+
+		/* Ensure we have enough data for the error info header */
+		if (length < sizeof(*err_info))
+			break;
+
+		err_info = (struct cper_arm_err_info *)p;
+
+		/* Validate the claimed length before using it */
+		length -= err_info->length;
+		if (length < 0)
+			break;
+
+		is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
+		has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
+
+		/*
+		 * The field (err_info->error_info & BIT(26)) is fixed to set to
+		 * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
+		 * firmware won't mix corrected errors in an uncorrected section,
+		 * and don't filter out 'corrected' error here.
+		 */
+		if (is_cache && has_pa) {
+			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
+			p += err_info->length;
+			continue;
+		}
+
+		cper_bits_to_str(error_type, sizeof(error_type),
+				 FIELD_GET(CPER_ARM_ERR_TYPE_MASK, err_info->type),
+				 cper_proc_error_type_strs,
+				 ARRAY_SIZE(cper_proc_error_type_strs));
+
+		pr_warn_ratelimited(FW_WARN GHES_PFX
+				    "Unhandled processor error type 0x%02x: %s%s\n",
+				    err_info->type, error_type,
+				    (err_info->type & ~CPER_ARM_ERR_TYPE_MASK) ? " with reserved bit(s)" : "");
+		p += err_info->length;
+	}
+
+	return queued;
+}
+
+/*
+ * PCIe AER errors need to be sent to the AER driver for reporting and
+ * recovery. The GHES severities map to the following AER severities and
+ * require the following handling:
+ *
+ * GHES_SEV_CORRECTABLE -> AER_CORRECTABLE
+ *     These need to be reported by the AER driver but no recovery is
+ *     necessary.
+ * GHES_SEV_RECOVERABLE -> AER_NONFATAL
+ * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
+ *     These both need to be reported and recovered from by the AER driver.
+ * GHES_SEV_PANIC does not make it to this handling since the kernel must
+ *     panic.
+ */
+void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+		unsigned int devfn;
+		int aer_severity;
+		u8 *aer_info;
+
+		devfn = PCI_DEVFN(pcie_err->device_id.device,
+				  pcie_err->device_id.function);
+		aer_severity = cper_severity_to_aer(gdata->error_severity);
+
+		/*
+		 * If firmware reset the component to contain
+		 * the error, we must reinitialize it before
+		 * use, so treat it as a fatal AER error.
+		 */
+		if (gdata->flags & CPER_SEC_RESET)
+			aer_severity = AER_FATAL;
+
+		aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
+						  sizeof(struct aer_capability_regs));
+		if (!aer_info)
+			return;
+		memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
+
+		aer_recover_queue(pcie_err->device_id.segment,
+				  pcie_err->device_id.bus,
+				  devfn, aer_severity,
+				  (struct aer_capability_regs *)
+				  aer_info);
+	}
+#endif
+}
+
+void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
+void ghes_cper_handle_status(struct device *dev,
+			     const struct acpi_hest_generic *generic,
+			     const struct acpi_hest_generic_status *estatus,
+			     bool sync)
+{
+	int sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	guid_t *sec_type;
+	const guid_t *fru_id = &guid_null;
+	char *fru_text = "";
+	bool queued = false;
+
+	sev = ghes_severity(estatus->error_severity);
+	apei_estatus_for_each_section(estatus, gdata) {
+		sec_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_severity(gdata->error_severity);
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+			fru_id = (guid_t *)gdata->fru_id;
+
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+			fru_text = gdata->fru_text;
+
+		ghes_log_hwerr(sev, sec_type);
+		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+
+			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
+
+			arch_apei_report_mem_error(sev, mem_err);
+			queued = ghes_handle_memory_failure(gdata, sev, sync);
+		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
+			ghes_handle_aer(gdata);
+		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
+			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
+			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec);
+		} else {
+			void *err = acpi_hest_get_payload(gdata);
+
+			ghes_defer_non_standard_event(gdata, sev);
+			log_non_standard_event(sec_type, fru_id, fru_text,
+					       sec_sev, err,
+					       gdata->error_data_length);
+		}
+	}
+
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+}
+
+void __ghes_print_estatus(const char *pfx,
+			  const struct acpi_hest_generic *generic,
+			  const struct acpi_hest_generic_status *estatus)
+{
+	static atomic_t seqno;
+	unsigned int curr_seqno;
+	char pfx_seq[64];
+
+	if (pfx == NULL) {
+		if (ghes_severity(estatus->error_severity) <=
+		    GHES_SEV_CORRECTED)
+			pfx = KERN_WARNING;
+		else
+			pfx = KERN_ERR;
+	}
+	curr_seqno = atomic_inc_return(&seqno);
+	snprintf(pfx_seq, sizeof(pfx_seq), "%s{%u}" HW_ERR, pfx, curr_seqno);
+	printk("%s""Hardware error from APEI Generic Hardware Error Source: %d\n",
+	       pfx_seq, generic->header.source_id);
+	cper_estatus_print(pfx_seq, estatus);
+}
+
+int ghes_print_estatus(const char *pfx,
+		       const struct acpi_hest_generic *generic,
+		       const struct acpi_hest_generic_status *estatus)
+{
+	/* Not more than 2 messages every 5 seconds */
+	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
+	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
+	struct ratelimit_state *ratelimit;
+
+	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+		ratelimit = &ratelimit_corrected;
+	else
+		ratelimit = &ratelimit_uncorrected;
+	if (__ratelimit(ratelimit)) {
+		__ghes_print_estatus(pfx, generic, estatus);
+		return 1;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_ACPI_APEI
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -270,6 +721,7 @@ void ghes_clear_estatus(struct ghes *ghes,
 	if (is_hest_type_generic_v2(ghes))
 		ghes_ack_error(ghes->generic_v2);
 }
+#endif /* CONFIG_ACPI_APEI */
 
 static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
 
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index a853a5996cdf..e09a33d343d7 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -16,6 +16,8 @@
 #ifndef ACPI_APEI_GHES_CPER_H
 #define ACPI_APEI_GHES_CPER_H
 
+#include <linux/device.h>
+#include <linux/notifier.h>
 #include <linux/workqueue.h>
 
 #include <acpi/ghes.h>
@@ -57,6 +59,7 @@
 	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
 
 extern struct gen_pool *ghes_estatus_pool;
+extern struct atomic_notifier_head ghes_report_chain;
 
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 {
@@ -107,6 +110,23 @@ void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
 void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 				   int sev);
+int ghes_severity(int severity);
+bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+				int sev, bool sync);
+bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+			      int sev, bool sync);
+void ghes_handle_aer(struct acpi_hest_generic_data *gdata);
+void ghes_log_hwerr(int sev, guid_t *sec_type);
+void __ghes_print_estatus(const char *pfx,
+			  const struct acpi_hest_generic *generic,
+			  const struct acpi_hest_generic_status *estatus);
+int ghes_print_estatus(const char *pfx,
+		       const struct acpi_hest_generic *generic,
+		       const struct acpi_hest_generic_status *estatus);
+void ghes_cper_handle_status(struct device *dev,
+			     const struct acpi_hest_generic *generic,
+			     const struct acpi_hest_generic_status *estatus,
+			     bool sync);
 void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 			    int severity);
 int cxl_cper_register_prot_err_work(struct work_struct *work);

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 07/10] ACPI: APEI: introduce GHES helper
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Add a dedicated GHES_CPER_HELPERS Kconfig entry so the shared helper code
can be built even when ACPI_APEI_GHES is disabled. Update the build glue
and headers to depend on the new symbol.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/Makefile           |  1 +
 drivers/acpi/Kconfig       |  4 ++++
 drivers/acpi/apei/Kconfig  |  1 +
 drivers/acpi/apei/Makefile |  2 +-
 include/acpi/ghes.h        | 10 ++++++----
 include/cxl/event.h        |  2 +-
 6 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 0841ea851847..27a664cb45ea 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -31,6 +31,7 @@ obj-y				+= idle/
 obj-y				+= char/ipmi/
 
 obj-$(CONFIG_ACPI)		+= acpi/
+obj-$(CONFIG_GHES_CPER_HELPERS)	+= acpi/apei/ghes_cper.o
 
 # PnP must come after ACPI since it will eventually need to check if acpi
 # was used and do nothing if so
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index f165d14cf61a..13ef0e99f840 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -6,6 +6,10 @@
 config ARCH_SUPPORTS_ACPI
 	bool
 
+config GHES_CPER_HELPERS
+	bool
+	select UEFI_CPER
+
 menuconfig ACPI
 	bool "ACPI (Advanced Configuration and Power Interface) Support"
 	depends on ARCH_SUPPORTS_ACPI
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 428458c623f0..ddb62638eb02 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -21,6 +21,7 @@ config ACPI_APEI_GHES
 	bool "APEI Generic Hardware Error Source"
 	depends on ACPI_APEI
 	select ACPI_HED
+	select GHES_CPER_HELPERS
 	select IRQ_WORK
 	select GENERIC_ALLOCATOR
 	select ARM_SDE_INTERFACE if ARM64
diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile
index f57f3b009d8e..66588d6be56f 100644
--- a/drivers/acpi/apei/Makefile
+++ b/drivers/acpi/apei/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o ghes_cper.o
+obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o
 # clang versions prior to 18 may blow out the stack with KASAN
 ifeq ($(CONFIG_COMPILE_TEST)_$(CONFIG_CC_IS_CLANG)_$(call clang-min-version, 180000),y_y_)
 KASAN_SANITIZE_ghes.o := n
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8d7e5caef3f1..2ffab36b6154 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -83,15 +83,17 @@ int devm_ghes_register_vendor_record_notifier(struct device *dev,
 					      struct notifier_block *nb);
 
 struct list_head *ghes_get_devices(void);
-
-void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
 #else
 static inline struct list_head *ghes_get_devices(void) { return NULL; }
-
-static inline void ghes_estatus_pool_region_free(unsigned long addr, u32 size) { return; }
 #endif
 
+#ifdef CONFIG_GHES_CPER_HELPERS
 int ghes_estatus_pool_init(unsigned int num_ghes);
+void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
+#else
+static inline int ghes_estatus_pool_init(unsigned int num_ghes) { return -ENODEV; }
+static inline void ghes_estatus_pool_region_free(unsigned long addr, u32 size) { }
+#endif
 
 static inline int acpi_hest_get_version(struct acpi_hest_generic_data *gdata)
 {
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ff97fea718d2..2ebd65b0d9d6 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -285,7 +285,7 @@ struct cxl_cper_prot_err_work_data {
 	int severity;
 };
 
-#ifdef CONFIG_ACPI_APEI_GHES
+#ifdef CONFIG_GHES_CPER_HELPERS
 int cxl_cper_register_work(struct work_struct *work);
 int cxl_cper_unregister_work(struct work_struct *work);
 int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 06/10] ACPI: APEI: GHES: move CXL CPER helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Move the CXL CPER handling paths out of ghes.c and into ghes_cper.c so the
helpers can be reused. The code is moved as-is, with the public
prototypes updated so GHES keeps calling into the new translation unit.

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c      | 132 -----------------------------------------
 drivers/acpi/apei/ghes_cper.c | 135 ++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |  11 ++++
 3 files changed, 146 insertions(+), 132 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9703c602a8c2..136993704d52 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -383,138 +383,6 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
-/* Room for 8 entries */
-#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
-static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
-		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
-
-/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
-static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
-struct work_struct *cxl_cper_prot_err_work;
-
-static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
-				   int severity)
-{
-#ifdef CONFIG_ACPI_APEI_PCIEAER
-	struct cxl_cper_prot_err_work_data wd;
-
-	if (cxl_cper_sec_prot_err_valid(prot_err))
-		return;
-
-	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
-
-	if (!cxl_cper_prot_err_work)
-		return;
-
-	if (cxl_cper_setup_prot_err_work_data(&wd, prot_err, severity))
-		return;
-
-	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
-		pr_err_ratelimited("CXL CPER kfifo overflow\n");
-		return;
-	}
-
-	schedule_work(cxl_cper_prot_err_work);
-#endif
-}
-
-int cxl_cper_register_prot_err_work(struct work_struct *work)
-{
-	if (cxl_cper_prot_err_work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_prot_err_work_lock);
-	cxl_cper_prot_err_work = work;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
-
-int cxl_cper_unregister_prot_err_work(struct work_struct *work)
-{
-	if (cxl_cper_prot_err_work != work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_prot_err_work_lock);
-	cxl_cper_prot_err_work = NULL;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
-
-int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
-{
-	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
-
-/* Room for 8 entries for each of the 4 event log queues */
-#define CXL_CPER_FIFO_DEPTH 32
-DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
-
-/* Synchronize schedule_work() with cxl_cper_work changes */
-static DEFINE_SPINLOCK(cxl_cper_work_lock);
-struct work_struct *cxl_cper_work;
-
-static void cxl_cper_post_event(enum cxl_event_type event_type,
-				struct cxl_cper_event_rec *rec)
-{
-	struct cxl_cper_work_data wd;
-
-	if (rec->hdr.length <= sizeof(rec->hdr) ||
-	    rec->hdr.length > sizeof(*rec)) {
-		pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n",
-		       rec->hdr.length);
-		return;
-	}
-
-	if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) {
-		pr_err(FW_WARN "CXL CPER invalid event\n");
-		return;
-	}
-
-	guard(spinlock_irqsave)(&cxl_cper_work_lock);
-
-	if (!cxl_cper_work)
-		return;
-
-	wd.event_type = event_type;
-	memcpy(&wd.rec, rec, sizeof(wd.rec));
-
-	if (!kfifo_put(&cxl_cper_fifo, wd)) {
-		pr_err_ratelimited("CXL CPER kfifo overflow\n");
-		return;
-	}
-
-	schedule_work(cxl_cper_work);
-}
-
-int cxl_cper_register_work(struct work_struct *work)
-{
-	if (cxl_cper_work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_work_lock);
-	cxl_cper_work = work;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_work, "CXL");
-
-int cxl_cper_unregister_work(struct work_struct *work)
-{
-	if (cxl_cper_work != work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_work_lock);
-	cxl_cper_work = NULL;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_work, "CXL");
-
-int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
-{
-	return kfifo_get(&cxl_cper_fifo, wd);
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
-
 static void ghes_log_hwerr(int sev, guid_t *sec_type)
 {
 	if (sev != CPER_SEV_RECOVERABLE)
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index b26943eafd79..66bf1af4db00 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -12,9 +12,12 @@
  *   Author: Huang Ying <ying.huang@intel.com>
  */
 
+#include <linux/aer.h>
+#include <linux/cleanup.h>
 #include <linux/err.h>
 #include <linux/genalloc.h>
 #include <linux/io.h>
+#include <linux/kfifo.h>
 #include <linux/kernel.h>
 #include <linux/math64.h>
 #include <linux/mm.h>
@@ -336,6 +339,138 @@ void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
+/* Room for 8 entries */
+#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
+static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
+		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
+
+/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
+static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
+struct work_struct *cxl_cper_prot_err_work;
+
+void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+			    int severity)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct cxl_cper_prot_err_work_data wd;
+
+	if (cxl_cper_sec_prot_err_valid(prot_err))
+		return;
+
+	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
+
+	if (!cxl_cper_prot_err_work)
+		return;
+
+	if (cxl_cper_setup_prot_err_work_data(&wd, prot_err, severity))
+		return;
+
+	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
+		pr_err_ratelimited("CXL CPER kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_cper_prot_err_work);
+#endif
+}
+
+int cxl_cper_register_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = work;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
+
+int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work != work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
+
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
+{
+	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
+
+/* Room for 8 entries for each of the 4 event log queues */
+#define CXL_CPER_FIFO_DEPTH 32
+static DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
+
+/* Synchronize schedule_work() with cxl_cper_work changes */
+static DEFINE_SPINLOCK(cxl_cper_work_lock);
+struct work_struct *cxl_cper_work;
+
+void cxl_cper_post_event(enum cxl_event_type event_type,
+			 struct cxl_cper_event_rec *rec)
+{
+	struct cxl_cper_work_data wd;
+
+	if (rec->hdr.length <= sizeof(rec->hdr) ||
+	    rec->hdr.length > sizeof(*rec)) {
+		pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n",
+		       rec->hdr.length);
+		return;
+	}
+
+	if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) {
+		pr_err(FW_WARN "CXL CPER invalid event\n");
+		return;
+	}
+
+	guard(spinlock_irqsave)(&cxl_cper_work_lock);
+
+	if (!cxl_cper_work)
+		return;
+
+	wd.event_type = event_type;
+	memcpy(&wd.rec, rec, sizeof(wd.rec));
+
+	if (!kfifo_put(&cxl_cper_fifo, wd)) {
+		pr_err_ratelimited("CXL CPER kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_cper_work);
+}
+
+int cxl_cper_register_work(struct work_struct *work)
+{
+	if (cxl_cper_work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_work_lock);
+	cxl_cper_work = work;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_register_work, "CXL");
+
+int cxl_cper_unregister_work(struct work_struct *work)
+{
+	if (cxl_cper_work != work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_work_lock);
+	cxl_cper_work = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_work, "CXL");
+
+int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
+{
+	return kfifo_get(&cxl_cper_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
+
 /*
  * GHES error status reporting throttle, to report more kinds of
  * errors, instead of just most frequently occurred errors.
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index d9f9253d8de9..a853a5996cdf 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -20,6 +20,7 @@
 
 #include <acpi/ghes.h>
 #include <asm/fixmap.h>
+#include <cxl/event.h>
 
 #define GHES_PFX	"GHES: "
 
@@ -106,5 +107,15 @@ void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
 void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 				   int sev);
+void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+			    int severity);
+int cxl_cper_register_prot_err_work(struct work_struct *work);
+int cxl_cper_unregister_prot_err_work(struct work_struct *work);
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
+void cxl_cper_post_event(enum cxl_event_type event_type,
+			 struct cxl_cper_event_rec *rec);
+int cxl_cper_register_work(struct work_struct *work);
+int cxl_cper_unregister_work(struct work_struct *work);
+int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 05/10] ACPI: APEI: GHES: move vendor record helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Shift the vendor record workqueue helpers into ghes_cper.c so both GHES
and future DT-based providers can use the same implementation. The change
is mechanical and keeps the notifier behavior identical.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c      | 68 ------------------------------------------
 drivers/acpi/apei/ghes_cper.c | 69 +++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |  2 ++
 3 files changed, 71 insertions(+), 68 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index adab7404310e..9703c602a8c2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -383,74 +383,6 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
-static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
-
-int ghes_register_vendor_record_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&vendor_record_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(ghes_register_vendor_record_notifier);
-
-void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
-{
-	blocking_notifier_chain_unregister(&vendor_record_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
-
-static void ghes_vendor_record_notifier_destroy(void *nb)
-{
-	ghes_unregister_vendor_record_notifier(nb);
-}
-
-int devm_ghes_register_vendor_record_notifier(struct device *dev,
-					      struct notifier_block *nb)
-{
-	int ret;
-
-	ret = ghes_register_vendor_record_notifier(nb);
-	if (ret)
-		return ret;
-
-	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
-}
-EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
-
-static void ghes_vendor_record_work_func(struct work_struct *work)
-{
-	struct ghes_vendor_record_entry *entry;
-	struct acpi_hest_generic_data *gdata;
-	u32 len;
-
-	entry = container_of(work, struct ghes_vendor_record_entry, work);
-	gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
-
-	blocking_notifier_call_chain(&vendor_record_notify_list,
-				     entry->error_severity, gdata);
-
-	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
-	gen_pool_free(ghes_estatus_pool, (unsigned long)entry, len);
-}
-
-static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
-					  int sev)
-{
-	struct acpi_hest_generic_data *copied_gdata;
-	struct ghes_vendor_record_entry *entry;
-	u32 len;
-
-	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
-	entry = (void *)gen_pool_alloc(ghes_estatus_pool, len);
-	if (!entry)
-		return;
-
-	copied_gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
-	memcpy(copied_gdata, gdata, acpi_hest_get_record_size(gdata));
-	entry->error_severity = sev;
-
-	INIT_WORK(&entry->work, ghes_vendor_record_work_func);
-	schedule_work(&entry->work);
-}
-
 /* Room for 8 entries */
 #define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
 static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 5218fc57e562..b26943eafd79 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -18,6 +18,7 @@
 #include <linux/kernel.h>
 #include <linux/math64.h>
 #include <linux/mm.h>
+#include <linux/notifier.h>
 #include <linux/ratelimit.h>
 #include <linux/rcupdate.h>
 #include <linux/sched/clock.h>
@@ -267,6 +268,74 @@ void ghes_clear_estatus(struct ghes *ghes,
 		ghes_ack_error(ghes->generic_v2);
 }
 
+static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
+
+int ghes_register_vendor_record_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&vendor_record_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_register_vendor_record_notifier);
+
+void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&vendor_record_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
+
+static void ghes_vendor_record_notifier_destroy(void *nb)
+{
+	ghes_unregister_vendor_record_notifier(nb);
+}
+
+int devm_ghes_register_vendor_record_notifier(struct device *dev,
+					      struct notifier_block *nb)
+{
+	int ret;
+
+	ret = ghes_register_vendor_record_notifier(nb);
+	if (ret)
+		return ret;
+
+	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
+}
+EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
+
+static void ghes_vendor_record_work_func(struct work_struct *work)
+{
+	struct ghes_vendor_record_entry *entry;
+	struct acpi_hest_generic_data *gdata;
+	u32 len;
+
+	entry = container_of(work, struct ghes_vendor_record_entry, work);
+	gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
+
+	blocking_notifier_call_chain(&vendor_record_notify_list,
+				     entry->error_severity, gdata);
+
+	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
+	gen_pool_free(ghes_estatus_pool, (unsigned long)entry, len);
+}
+
+void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
+				   int sev)
+{
+	struct acpi_hest_generic_data *copied_gdata;
+	struct ghes_vendor_record_entry *entry;
+	u32 len;
+
+	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
+	entry = (void *)gen_pool_alloc(ghes_estatus_pool, len);
+	if (!entry)
+		return;
+
+	copied_gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
+	memcpy(copied_gdata, gdata, acpi_hest_get_record_size(gdata));
+	entry->error_severity = sev;
+
+	INIT_WORK(&entry->work, ghes_vendor_record_work_func);
+	schedule_work(&entry->work);
+}
+
 /*
  * GHES error status reporting throttle, to report more kinds of
  * errors, instead of just most frequently occurred errors.
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 15305c8be9a7..d9f9253d8de9 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -104,5 +104,7 @@ int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 int ghes_estatus_cached(struct acpi_hest_generic_status *estatus);
 void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
+void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
+				   int sev);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 04/10] ACPI: APEI: GHES: move estatus cache helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Relocate the estatus cache allocation and lookup helpers from ghes.c into
ghes_cper.c. This code move keeps the logic intact while making the cache
implementation available to forthcoming users.

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c      | 138 +----------------------------------------
 drivers/acpi/apei/ghes_cper.c | 139 ++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |   5 ++
 3 files changed, 145 insertions(+), 137 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 91638ae7e05e..adab7404310e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -113,10 +113,7 @@ static DEFINE_MUTEX(ghes_devs_mutex);
  */
 static DEFINE_SPINLOCK(ghes_notify_lock_irq);
 
-static struct gen_pool *ghes_estatus_pool;
-
-static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
-static atomic_t ghes_estatus_cache_alloced;
+struct gen_pool *ghes_estatus_pool;
 
 int ghes_estatus_pool_init(unsigned int num_ghes)
 {
@@ -733,139 +730,6 @@ static int ghes_print_estatus(const char *pfx,
 	return 0;
 }
 
-/*
- * GHES error status reporting throttle, to report more kinds of
- * errors, instead of just most frequently occurred errors.
- */
-static int ghes_estatus_cached(struct acpi_hest_generic_status *estatus)
-{
-	u32 len;
-	int i, cached = 0;
-	unsigned long long now;
-	struct ghes_estatus_cache *cache;
-	struct acpi_hest_generic_status *cache_estatus;
-
-	len = cper_estatus_len(estatus);
-	rcu_read_lock();
-	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
-		cache = rcu_dereference(ghes_estatus_caches[i]);
-		if (cache == NULL)
-			continue;
-		if (len != cache->estatus_len)
-			continue;
-		cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
-		if (memcmp(estatus, cache_estatus, len))
-			continue;
-		atomic_inc(&cache->count);
-		now = sched_clock();
-		if (now - cache->time_in < GHES_ESTATUS_IN_CACHE_MAX_NSEC)
-			cached = 1;
-		break;
-	}
-	rcu_read_unlock();
-	return cached;
-}
-
-static struct ghes_estatus_cache *ghes_estatus_cache_alloc(
-	struct acpi_hest_generic *generic,
-	struct acpi_hest_generic_status *estatus)
-{
-	int alloced;
-	u32 len, cache_len;
-	struct ghes_estatus_cache *cache;
-	struct acpi_hest_generic_status *cache_estatus;
-
-	alloced = atomic_add_return(1, &ghes_estatus_cache_alloced);
-	if (alloced > GHES_ESTATUS_CACHE_ALLOCED_MAX) {
-		atomic_dec(&ghes_estatus_cache_alloced);
-		return NULL;
-	}
-	len = cper_estatus_len(estatus);
-	cache_len = GHES_ESTATUS_CACHE_LEN(len);
-	cache = (void *)gen_pool_alloc(ghes_estatus_pool, cache_len);
-	if (!cache) {
-		atomic_dec(&ghes_estatus_cache_alloced);
-		return NULL;
-	}
-	cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
-	memcpy(cache_estatus, estatus, len);
-	cache->estatus_len = len;
-	atomic_set(&cache->count, 0);
-	cache->generic = generic;
-	cache->time_in = sched_clock();
-	return cache;
-}
-
-static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
-{
-	struct ghes_estatus_cache *cache;
-	u32 len;
-
-	cache = container_of(head, struct ghes_estatus_cache, rcu);
-	len = cper_estatus_len(GHES_ESTATUS_FROM_CACHE(cache));
-	len = GHES_ESTATUS_CACHE_LEN(len);
-	gen_pool_free(ghes_estatus_pool, (unsigned long)cache, len);
-	atomic_dec(&ghes_estatus_cache_alloced);
-}
-
-static void
-ghes_estatus_cache_add(struct acpi_hest_generic *generic,
-		       struct acpi_hest_generic_status *estatus)
-{
-	unsigned long long now, duration, period, max_period = 0;
-	struct ghes_estatus_cache *cache, *new_cache;
-	struct ghes_estatus_cache __rcu *victim;
-	int i, slot = -1, count;
-
-	new_cache = ghes_estatus_cache_alloc(generic, estatus);
-	if (!new_cache)
-		return;
-
-	rcu_read_lock();
-	now = sched_clock();
-	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
-		cache = rcu_dereference(ghes_estatus_caches[i]);
-		if (cache == NULL) {
-			slot = i;
-			break;
-		}
-		duration = now - cache->time_in;
-		if (duration >= GHES_ESTATUS_IN_CACHE_MAX_NSEC) {
-			slot = i;
-			break;
-		}
-		count = atomic_read(&cache->count);
-		period = duration;
-		do_div(period, (count + 1));
-		if (period > max_period) {
-			max_period = period;
-			slot = i;
-		}
-	}
-	rcu_read_unlock();
-
-	if (slot != -1) {
-		/*
-		 * Use release semantics to ensure that ghes_estatus_cached()
-		 * running on another CPU will see the updated cache fields if
-		 * it can see the new value of the pointer.
-		 */
-		victim = xchg_release(&ghes_estatus_caches[slot],
-				      RCU_INITIALIZER(new_cache));
-
-		/*
-		 * At this point, victim may point to a cached item different
-		 * from the one based on which we selected the slot. Instead of
-		 * going to the loop again to pick another slot, let's just
-		 * drop the other item anyway: this may cause a false cache
-		 * miss later on, but that won't cause any problems.
-		 */
-		if (victim)
-			call_rcu(&unrcu_pointer(victim)->rcu,
-				 ghes_estatus_cache_rcu_free);
-	}
-}
-
 static void __ghes_panic(struct ghes *ghes,
 			 struct acpi_hest_generic_status *estatus,
 			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index c6e20fdc6964..5218fc57e562 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -13,10 +13,14 @@
  */
 
 #include <linux/err.h>
+#include <linux/genalloc.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/math64.h>
 #include <linux/mm.h>
 #include <linux/ratelimit.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/clock.h>
 #include <linux/slab.h>
 
 #include <acpi/apei.h>
@@ -28,6 +32,9 @@
 
 #include "apei-internal.h"
 
+static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
+static atomic_t ghes_estatus_cache_alloced;
+
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -259,3 +266,135 @@ void ghes_clear_estatus(struct ghes *ghes,
 	if (is_hest_type_generic_v2(ghes))
 		ghes_ack_error(ghes->generic_v2);
 }
+
+/*
+ * GHES error status reporting throttle, to report more kinds of
+ * errors, instead of just most frequently occurred errors.
+ */
+int ghes_estatus_cached(struct acpi_hest_generic_status *estatus)
+{
+	u32 len;
+	int i, cached = 0;
+	unsigned long long now;
+	struct ghes_estatus_cache *cache;
+	struct acpi_hest_generic_status *cache_estatus;
+
+	len = cper_estatus_len(estatus);
+	rcu_read_lock();
+	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
+		cache = rcu_dereference(ghes_estatus_caches[i]);
+		if (cache == NULL)
+			continue;
+		if (len != cache->estatus_len)
+			continue;
+		cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
+		if (memcmp(estatus, cache_estatus, len))
+			continue;
+		atomic_inc(&cache->count);
+		now = sched_clock();
+		if (now - cache->time_in < GHES_ESTATUS_IN_CACHE_MAX_NSEC)
+			cached = 1;
+		break;
+	}
+	rcu_read_unlock();
+	return cached;
+}
+
+static struct ghes_estatus_cache *ghes_estatus_cache_alloc(
+	struct acpi_hest_generic *generic,
+	struct acpi_hest_generic_status *estatus)
+{
+	int alloced;
+	u32 len, cache_len;
+	struct ghes_estatus_cache *cache;
+	struct acpi_hest_generic_status *cache_estatus;
+
+	alloced = atomic_add_return(1, &ghes_estatus_cache_alloced);
+	if (alloced > GHES_ESTATUS_CACHE_ALLOCED_MAX) {
+		atomic_dec(&ghes_estatus_cache_alloced);
+		return NULL;
+	}
+	len = cper_estatus_len(estatus);
+	cache_len = GHES_ESTATUS_CACHE_LEN(len);
+	cache = (void *)gen_pool_alloc(ghes_estatus_pool, cache_len);
+	if (cache == NULL) {
+		atomic_dec(&ghes_estatus_cache_alloced);
+		return NULL;
+	}
+	cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
+	memcpy(cache_estatus, estatus, len);
+	cache->estatus_len = len;
+	atomic_set(&cache->count, 0);
+	cache->generic = generic;
+	cache->time_in = sched_clock();
+	return cache;
+}
+
+static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
+{
+	struct ghes_estatus_cache *cache;
+	u32 len;
+
+	cache = container_of(head, struct ghes_estatus_cache, rcu);
+	len = cper_estatus_len(GHES_ESTATUS_FROM_CACHE(cache));
+	len = GHES_ESTATUS_CACHE_LEN(len);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)cache, len);
+	atomic_dec(&ghes_estatus_cache_alloced);
+}
+
+void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
+			    struct acpi_hest_generic_status *estatus)
+{
+	unsigned long long now, duration, period, max_period = 0;
+	struct ghes_estatus_cache *cache, *new_cache;
+	struct ghes_estatus_cache __rcu *victim;
+	int i, slot = -1, count;
+
+	new_cache = ghes_estatus_cache_alloc(generic, estatus);
+	if (!new_cache)
+		return;
+
+	rcu_read_lock();
+	now = sched_clock();
+	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
+		cache = rcu_dereference(ghes_estatus_caches[i]);
+		if (cache == NULL) {
+			slot = i;
+			break;
+		}
+		duration = now - cache->time_in;
+		if (duration >= GHES_ESTATUS_IN_CACHE_MAX_NSEC) {
+			slot = i;
+			break;
+		}
+		count = atomic_read(&cache->count);
+		period = duration;
+		do_div(period, (count + 1));
+		if (period > max_period) {
+			max_period = period;
+			slot = i;
+		}
+	}
+	rcu_read_unlock();
+
+	if (slot != -1) {
+		/*
+		 * Use release semantics to ensure that ghes_estatus_cached()
+		 * running on another CPU will see the updated cache fields if
+		 * it can see the new value of the pointer.
+		 */
+		victim = xchg_release(&ghes_estatus_caches[slot],
+				      RCU_INITIALIZER(new_cache));
+
+		/*
+		 * At this point, victim may point to a cached item different
+		 * from the one based on which we selected the slot. Instead of
+		 * going to the loop again to pick another slot, let's just
+		 * drop the other item anyway: this may cause a false cache
+		 * miss later on, but that won't cause any problems.
+		 */
+		if (victim)
+			call_rcu(&unrcu_pointer(victim)->rcu,
+				 ghes_estatus_cache_rcu_free);
+	}
+}
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 4649e3140888..15305c8be9a7 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -55,6 +55,8 @@
 	((struct acpi_hest_generic_data *)                              \
 	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
 
+extern struct gen_pool *ghes_estatus_pool;
+
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 {
 	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
@@ -99,5 +101,8 @@ int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 			u64 buf_paddr, enum fixed_addresses fixmap_idx,
 			size_t buf_len);
 #endif
+int ghes_estatus_cached(struct acpi_hest_generic_status *estatus);
+void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
+			    struct acpi_hest_generic_status *estatus);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 03/10] ACPI: APEI: GHES: move GHESv2 ack and alloc helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Move the GHESv2 acknowledgment and error-source allocation helpers from
ghes.c into ghes_cper.c. This is a mechanical refactor that keeps the
logic unchanged while making the helpers reusable.

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c      | 65 -------------------------------------------
 drivers/acpi/apei/ghes_cper.c | 65 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 65 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3f35580e8efd..91638ae7e05e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -163,71 +163,6 @@ void ghes_estatus_pool_region_free(unsigned long addr, u32 size)
 }
 EXPORT_SYMBOL_GPL(ghes_estatus_pool_region_free);
 
-static int map_gen_v2(struct ghes *ghes)
-{
-	return apei_map_generic_address(&ghes->generic_v2->read_ack_register);
-}
-
-static void unmap_gen_v2(struct ghes *ghes)
-{
-	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
-}
-
-struct ghes *ghes_new(struct acpi_hest_generic *generic)
-{
-	struct ghes *ghes;
-	unsigned int error_block_length;
-	int rc;
-
-	ghes = kzalloc_obj(*ghes);
-	if (!ghes)
-		return ERR_PTR(-ENOMEM);
-
-	ghes->generic = generic;
-	if (is_hest_type_generic_v2(ghes)) {
-		rc = map_gen_v2(ghes);
-		if (rc)
-			goto err_free;
-	}
-
-	rc = apei_map_generic_address(&generic->error_status_address);
-	if (rc)
-		goto err_unmap_read_ack_addr;
-	error_block_length = generic->error_block_length;
-	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
-		pr_warn(FW_WARN GHES_PFX
-			"Error status block length is too long: %u for "
-			"generic hardware error source: %d.\n",
-			error_block_length, generic->header.source_id);
-		error_block_length = GHES_ESTATUS_MAX_SIZE;
-	}
-	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
-	ghes->estatus_length = error_block_length;
-	if (!ghes->estatus) {
-		rc = -ENOMEM;
-		goto err_unmap_status_addr;
-	}
-
-	return ghes;
-
-err_unmap_status_addr:
-	apei_unmap_generic_address(&generic->error_status_address);
-err_unmap_read_ack_addr:
-	if (is_hest_type_generic_v2(ghes))
-		unmap_gen_v2(ghes);
-err_free:
-	kfree(ghes);
-	return ERR_PTR(rc);
-}
-
-void ghes_fini(struct ghes *ghes)
-{
-	kfree(ghes->estatus);
-	apei_unmap_generic_address(&ghes->generic->error_status_address);
-	if (is_hest_type_generic_v2(ghes))
-		unmap_gen_v2(ghes);
-}
-
 static inline int ghes_severity(int severity)
 {
 	switch (severity) {
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index b365c42efce4..c6e20fdc6964 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -63,6 +63,71 @@ static void ghes_ack_error(struct acpi_hest_generic_v2 *gv2)
 	apei_write(val, &gv2->read_ack_register);
 }
 
+static int map_gen_v2(struct ghes *ghes)
+{
+	return apei_map_generic_address(&ghes->generic_v2->read_ack_register);
+}
+
+static void unmap_gen_v2(struct ghes *ghes)
+{
+	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
+}
+
+struct ghes *ghes_new(struct acpi_hest_generic *generic)
+{
+	struct ghes *ghes;
+	unsigned int error_block_length;
+	int rc;
+
+	ghes = kzalloc_obj(*ghes);
+	if (!ghes)
+		return ERR_PTR(-ENOMEM);
+
+	ghes->generic = generic;
+	if (is_hest_type_generic_v2(ghes)) {
+		rc = map_gen_v2(ghes);
+		if (rc)
+			goto err_free;
+	}
+
+	rc = apei_map_generic_address(&generic->error_status_address);
+	if (rc)
+		goto err_unmap_read_ack_addr;
+	error_block_length = generic->error_block_length;
+	if (error_block_length > GHES_ESTATUS_MAX_SIZE) {
+		pr_warn(FW_WARN GHES_PFX
+			"Error status block length is too long: %u for "
+			"generic hardware error source: %d.\n",
+			error_block_length, generic->header.source_id);
+		error_block_length = GHES_ESTATUS_MAX_SIZE;
+	}
+	ghes->estatus = kmalloc(error_block_length, GFP_KERNEL);
+	ghes->estatus_length = error_block_length;
+	if (!ghes->estatus) {
+		rc = -ENOMEM;
+		goto err_unmap_status_addr;
+	}
+
+	return ghes;
+
+err_unmap_status_addr:
+	apei_unmap_generic_address(&generic->error_status_address);
+err_unmap_read_ack_addr:
+	if (is_hest_type_generic_v2(ghes))
+		unmap_gen_v2(ghes);
+err_free:
+	kfree(ghes);
+	return ERR_PTR(rc);
+}
+
+void ghes_fini(struct ghes *ghes)
+{
+	kfree(ghes->estatus);
+	apei_unmap_generic_address(&ghes->generic->error_status_address);
+	if (is_hest_type_generic_v2(ghes))
+		unmap_gen_v2(ghes);
+}
+
 static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
 				  int from_phys,
 				  enum fixed_addresses fixmap_idx)

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 02/10] ACPI: APEI: GHES: move CPER read helpers
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Relocate the CPER buffer mapping, peek, and clear helpers from ghes.c into
ghes_cper.c so they can be shared with other firmware-first providers.
This commit only shuffles code; behavior stays the same.

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/Makefile    |   2 +-
 drivers/acpi/apei/ghes.c      | 166 -----------------------------------
 drivers/acpi/apei/ghes_cper.c | 196 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 197 insertions(+), 167 deletions(-)

diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile
index 66588d6be56f..f57f3b009d8e 100644
--- a/drivers/acpi/apei/Makefile
+++ b/drivers/acpi/apei/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o
+obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o ghes_cper.o
 # clang versions prior to 18 may blow out the stack with KASAN
 ifeq ($(CONFIG_COMPILE_TEST)_$(CONFIG_CC_IS_CLANG)_$(call clang-min-version, 180000),y_y_)
 KASAN_SANITIZE_ghes.o := n
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 4f67f46410c4..3f35580e8efd 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -118,26 +118,6 @@ static struct gen_pool *ghes_estatus_pool;
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
-static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
-{
-	phys_addr_t paddr;
-	pgprot_t prot;
-
-	paddr = PFN_PHYS(pfn);
-	prot = arch_apei_get_mem_attribute(paddr);
-	__set_fixmap(fixmap_idx, paddr, prot);
-
-	return (void __iomem *) __fix_to_virt(fixmap_idx);
-}
-
-static void ghes_unmap(void __iomem *vaddr, enum fixed_addresses fixmap_idx)
-{
-	int _idx = virt_to_fix((unsigned long)vaddr);
-
-	WARN_ON_ONCE(fixmap_idx != _idx);
-	clear_fixmap(fixmap_idx);
-}
-
 int ghes_estatus_pool_init(unsigned int num_ghes)
 {
 	unsigned long addr, len;
@@ -193,21 +173,6 @@ static void unmap_gen_v2(struct ghes *ghes)
 	apei_unmap_generic_address(&ghes->generic_v2->read_ack_register);
 }
 
-static void ghes_ack_error(struct acpi_hest_generic_v2 *gv2)
-{
-	int rc;
-	u64 val = 0;
-
-	rc = apei_read(&val, &gv2->read_ack_register);
-	if (rc)
-		return;
-
-	val &= gv2->read_ack_preserve << gv2->read_ack_register.bit_offset;
-	val |= gv2->read_ack_write    << gv2->read_ack_register.bit_offset;
-
-	apei_write(val, &gv2->read_ack_register);
-}
-
 struct ghes *ghes_new(struct acpi_hest_generic *generic)
 {
 	struct ghes *ghes;
@@ -280,137 +245,6 @@ static inline int ghes_severity(int severity)
 	}
 }
 
-static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
-				  int from_phys,
-				  enum fixed_addresses fixmap_idx)
-{
-	void __iomem *vaddr;
-	u64 offset;
-	u32 trunk;
-
-	while (len > 0) {
-		offset = paddr - (paddr & PAGE_MASK);
-		vaddr = ghes_map(PHYS_PFN(paddr), fixmap_idx);
-		trunk = PAGE_SIZE - offset;
-		trunk = min(trunk, len);
-		if (from_phys)
-			memcpy_fromio(buffer, vaddr + offset, trunk);
-		else
-			memcpy_toio(vaddr + offset, buffer, trunk);
-		len -= trunk;
-		paddr += trunk;
-		buffer += trunk;
-		ghes_unmap(vaddr, fixmap_idx);
-	}
-}
-
-/* Check the top-level record header has an appropriate size. */
-int __ghes_check_estatus(struct ghes *ghes,
-			 struct acpi_hest_generic_status *estatus)
-{
-	u32 len = cper_estatus_len(estatus);
-	u32 max_len = min(ghes->generic->error_block_length,
-			  ghes->estatus_length);
-
-	if (len < sizeof(*estatus)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX "Truncated error status block!\n");
-		return -EIO;
-	}
-
-	if (!len || len > max_len) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX "Invalid error status block length!\n");
-		return -EIO;
-	}
-
-	if (cper_estatus_check_header(estatus)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX "Invalid CPER header!\n");
-		return -EIO;
-	}
-
-	return 0;
-}
-
-/* Read the CPER block, returning its address, and header in estatus. */
-int __ghes_peek_estatus(struct ghes *ghes,
-			struct acpi_hest_generic_status *estatus,
-			u64 *buf_paddr, enum fixed_addresses fixmap_idx)
-{
-	struct acpi_hest_generic *g = ghes->generic;
-	int rc;
-
-	rc = apei_read(buf_paddr, &g->error_status_address);
-	if (rc) {
-		*buf_paddr = 0;
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-				    "Failed to read error status block address for hardware error source: %d.\n",
-				   g->header.source_id);
-		return -EIO;
-	}
-	if (!*buf_paddr)
-		return -ENOENT;
-
-	ghes_copy_tofrom_phys(estatus, *buf_paddr, sizeof(*estatus), 1,
-			      fixmap_idx);
-	if (!estatus->block_status) {
-		*buf_paddr = 0;
-		return -ENOENT;
-	}
-
-	return 0;
-}
-
-int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
-			u64 buf_paddr, enum fixed_addresses fixmap_idx,
-			size_t buf_len)
-{
-	ghes_copy_tofrom_phys(estatus, buf_paddr, buf_len, 1, fixmap_idx);
-	if (cper_estatus_check(estatus)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-				    "Failed to read error status block!\n");
-		return -EIO;
-	}
-
-	return 0;
-}
-
-int ghes_read_estatus(struct ghes *ghes,
-		      struct acpi_hest_generic_status *estatus,
-		      u64 *buf_paddr, enum fixed_addresses fixmap_idx)
-{
-	int rc;
-
-	rc = __ghes_peek_estatus(ghes, estatus, buf_paddr, fixmap_idx);
-	if (rc)
-		return rc;
-
-	rc = __ghes_check_estatus(ghes, estatus);
-	if (rc)
-		return rc;
-
-	return __ghes_read_estatus(estatus, *buf_paddr, fixmap_idx,
-				   cper_estatus_len(estatus));
-}
-
-void ghes_clear_estatus(struct ghes *ghes,
-			struct acpi_hest_generic_status *estatus,
-			u64 buf_paddr, enum fixed_addresses fixmap_idx)
-{
-	estatus->block_status = 0;
-
-	if (!buf_paddr)
-		return;
-
-	ghes_copy_tofrom_phys(estatus, buf_paddr,
-			      sizeof(estatus->block_status), 0,
-			      fixmap_idx);
-
-	/*
-	 * GHESv2 type HEST entries introduce support for error acknowledgment,
-	 * so only acknowledge the error if this support is present.
-	 */
-	if (is_hest_type_generic_v2(ghes))
-		ghes_ack_error(ghes->generic_v2);
-}
 
 /**
  * struct ghes_task_work - for synchronous RAS event
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
new file mode 100644
index 000000000000..b365c42efce4
--- /dev/null
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared GHES helpers for firmware-first CPER error handling.
+ *
+ * This file holds the GHES helper code that is shared by the in-tree GHES
+ * driver and by other firmware-first error sources that reuse the same CPER
+ * handling flow.
+ *
+ * Derived from the ACPI APEI GHES driver.
+ *
+ * Copyright 2010,2011 Intel Corp.
+ *   Author: Huang Ying <ying.huang@intel.com>
+ */
+
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/ratelimit.h>
+#include <linux/slab.h>
+
+#include <acpi/apei.h>
+#include <acpi/acpi_io.h>
+#include <acpi/ghes_cper.h>
+
+#include <asm/fixmap.h>
+#include <asm/tlbflush.h>
+
+#include "apei-internal.h"
+
+static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
+{
+	phys_addr_t paddr;
+	pgprot_t prot;
+
+	paddr = PFN_PHYS(pfn);
+	prot = arch_apei_get_mem_attribute(paddr);
+	__set_fixmap(fixmap_idx, paddr, prot);
+
+	return (void __iomem *) __fix_to_virt(fixmap_idx);
+}
+
+static void ghes_unmap(void __iomem *vaddr, enum fixed_addresses fixmap_idx)
+{
+	int _idx = virt_to_fix((unsigned long)vaddr);
+
+	WARN_ON_ONCE(fixmap_idx != _idx);
+	clear_fixmap(fixmap_idx);
+}
+
+static void ghes_ack_error(struct acpi_hest_generic_v2 *gv2)
+{
+	int rc;
+	u64 val = 0;
+
+	rc = apei_read(&val, &gv2->read_ack_register);
+	if (rc)
+		return;
+
+	val &= gv2->read_ack_preserve << gv2->read_ack_register.bit_offset;
+	val |= gv2->read_ack_write    << gv2->read_ack_register.bit_offset;
+
+	apei_write(val, &gv2->read_ack_register);
+}
+
+static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
+				  int from_phys,
+				  enum fixed_addresses fixmap_idx)
+{
+	void __iomem *vaddr;
+	u64 offset;
+	u32 trunk;
+
+	while (len > 0) {
+		offset = paddr - (paddr & PAGE_MASK);
+		vaddr = ghes_map(PHYS_PFN(paddr), fixmap_idx);
+		trunk = PAGE_SIZE - offset;
+		trunk = min(trunk, len);
+		if (from_phys)
+			memcpy_fromio(buffer, vaddr + offset, trunk);
+		else
+			memcpy_toio(vaddr + offset, buffer, trunk);
+		len -= trunk;
+		paddr += trunk;
+		buffer += trunk;
+		ghes_unmap(vaddr, fixmap_idx);
+	}
+}
+
+/* Check the top-level record header has an appropriate size. */
+int __ghes_check_estatus(struct ghes *ghes,
+			 struct acpi_hest_generic_status *estatus)
+{
+	u32 len = cper_estatus_len(estatus);
+	u32 max_len = min(ghes->generic->error_block_length,
+			  ghes->estatus_length);
+
+	if (len < sizeof(*estatus)) {
+		pr_warn_ratelimited(FW_WARN GHES_PFX "Truncated error status block!\n");
+		return -EIO;
+	}
+
+	if (len > max_len) {
+		pr_warn_ratelimited(FW_WARN GHES_PFX "Invalid error status block length!\n");
+		return -EIO;
+	}
+
+	if (cper_estatus_check_header(estatus)) {
+		pr_warn_ratelimited(FW_WARN GHES_PFX "Invalid CPER header!\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+/* Read the CPER block, returning its address, and header in estatus. */
+int __ghes_peek_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 *buf_paddr, enum fixed_addresses fixmap_idx)
+{
+	struct acpi_hest_generic *g = ghes->generic;
+	int rc;
+
+	rc = apei_read(buf_paddr, &g->error_status_address);
+	if (rc) {
+		*buf_paddr = 0;
+		pr_warn_ratelimited(FW_WARN GHES_PFX
+				    "Failed to read error status block address for hardware error source: %d.\n",
+				    g->header.source_id);
+		return -EIO;
+	}
+	if (!*buf_paddr)
+		return -ENOENT;
+
+	ghes_copy_tofrom_phys(estatus, *buf_paddr, sizeof(*estatus), 1,
+			      fixmap_idx);
+	if (!estatus->block_status) {
+		*buf_paddr = 0;
+		return -ENOENT;
+	}
+
+	return 0;
+}
+
+int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx,
+			size_t buf_len)
+{
+	ghes_copy_tofrom_phys(estatus, buf_paddr, buf_len, 1, fixmap_idx);
+	if (cper_estatus_check(estatus)) {
+		pr_warn_ratelimited(FW_WARN GHES_PFX
+				    "Failed to read error status block!\n");
+		return -EIO;
+	}
+
+	return 0;
+}
+
+int ghes_read_estatus(struct ghes *ghes,
+		      struct acpi_hest_generic_status *estatus,
+		      u64 *buf_paddr, enum fixed_addresses fixmap_idx)
+{
+	int rc;
+
+	rc = __ghes_peek_estatus(ghes, estatus, buf_paddr, fixmap_idx);
+	if (rc)
+		return rc;
+
+	rc = __ghes_check_estatus(ghes, estatus);
+	if (rc)
+		return rc;
+
+	return __ghes_read_estatus(estatus, *buf_paddr, fixmap_idx,
+				   cper_estatus_len(estatus));
+}
+
+void ghes_clear_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx)
+{
+	estatus->block_status = 0;
+
+	if (!buf_paddr)
+		return;
+
+	ghes_copy_tofrom_phys(estatus, buf_paddr,
+			      sizeof(estatus->block_status), 0,
+			      fixmap_idx);
+
+	/*
+	 * GHESv2 type HEST entries introduce support for error acknowledgment,
+	 * so only acknowledge the error if this support is present.
+	 */
+	if (is_hest_type_generic_v2(ghes))
+		ghes_ack_error(ghes->generic_v2);
+}

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 01/10] ACPI: APEI: GHES: share macros via a private header
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov
In-Reply-To: <20260617-topics-ahmtib01-ras_ffh_arm_internal_review-v6-0-91f725174aa0@arm.com>

Carve the CPER helper macros out of ghes.c and place them in a private
header so they can be shared with upcoming helper files.

Move the vendor record entry declaration and the prototypes for the CPER
read and clear helpers along with ghes_new() and ghes_fini() into the
same header. This requires dropping their local static visibility in
ghes.c.

Reviewed-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 drivers/acpi/apei/ghes.c |  94 +++++++++---------------------------------
 include/acpi/ghes_cper.h | 103 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+), 76 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3236a3ce79d6..4f67f46410c4 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -49,6 +49,7 @@
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
+#include <acpi/ghes_cper.h>
 #include <acpi/apei.h>
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
@@ -57,40 +58,6 @@
 
 #include "apei-internal.h"
 
-#define GHES_PFX	"GHES: "
-
-#define GHES_ESTATUS_MAX_SIZE		65536
-#define GHES_ESOURCE_PREALLOC_MAX_SIZE	65536
-
-#define GHES_ESTATUS_POOL_MIN_ALLOC_ORDER 3
-
-/* This is just an estimation for memory pool allocation */
-#define GHES_ESTATUS_CACHE_AVG_SIZE	512
-
-#define GHES_ESTATUS_CACHES_SIZE	4
-
-#define GHES_ESTATUS_IN_CACHE_MAX_NSEC	10000000000ULL
-/* Prevent too many caches are allocated because of RCU */
-#define GHES_ESTATUS_CACHE_ALLOCED_MAX	(GHES_ESTATUS_CACHES_SIZE * 3 / 2)
-
-#define GHES_ESTATUS_CACHE_LEN(estatus_len)			\
-	(sizeof(struct ghes_estatus_cache) + (estatus_len))
-#define GHES_ESTATUS_FROM_CACHE(estatus_cache)			\
-	((struct acpi_hest_generic_status *)				\
-	 ((struct ghes_estatus_cache *)(estatus_cache) + 1))
-
-#define GHES_ESTATUS_NODE_LEN(estatus_len)			\
-	(sizeof(struct ghes_estatus_node) + (estatus_len))
-#define GHES_ESTATUS_FROM_NODE(estatus_node)			\
-	((struct acpi_hest_generic_status *)				\
-	 ((struct ghes_estatus_node *)(estatus_node) + 1))
-
-#define GHES_VENDOR_ENTRY_LEN(gdata_len)                               \
-	(sizeof(struct ghes_vendor_record_entry) + (gdata_len))
-#define GHES_GDATA_FROM_VENDOR_ENTRY(vendor_entry)                     \
-	((struct acpi_hest_generic_data *)                              \
-	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
-
 /*
  *  NMI-like notifications vary by architecture, before the compiler can prune
  *  unused static functions it needs a value for these enums.
@@ -102,25 +69,6 @@
 
 static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
 
-static inline bool is_hest_type_generic_v2(struct ghes *ghes)
-{
-	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
-}
-
-/*
- * A platform may describe one error source for the handling of synchronous
- * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
- * or External Interrupt). On x86, the HEST notifications are always
- * asynchronous, so only SEA on ARM is delivered as a synchronous
- * notification.
- */
-static inline bool is_hest_sync_notify(struct ghes *ghes)
-{
-	u8 notify_type = ghes->generic->notify.type;
-
-	return notify_type == ACPI_HEST_NOTIFY_SEA;
-}
-
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -165,12 +113,6 @@ static DEFINE_MUTEX(ghes_devs_mutex);
  */
 static DEFINE_SPINLOCK(ghes_notify_lock_irq);
 
-struct ghes_vendor_record_entry {
-	struct work_struct work;
-	int error_severity;
-	char vendor_record[];
-};
-
 static struct gen_pool *ghes_estatus_pool;
 
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
@@ -266,7 +208,7 @@ static void ghes_ack_error(struct acpi_hest_generic_v2 *gv2)
 	apei_write(val, &gv2->read_ack_register);
 }
 
-static struct ghes *ghes_new(struct acpi_hest_generic *generic)
+struct ghes *ghes_new(struct acpi_hest_generic *generic)
 {
 	struct ghes *ghes;
 	unsigned int error_block_length;
@@ -313,7 +255,7 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
 	return ERR_PTR(rc);
 }
 
-static void ghes_fini(struct ghes *ghes)
+void ghes_fini(struct ghes *ghes)
 {
 	kfree(ghes->estatus);
 	apei_unmap_generic_address(&ghes->generic->error_status_address);
@@ -363,8 +305,8 @@ static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
 }
 
 /* Check the top-level record header has an appropriate size. */
-static int __ghes_check_estatus(struct ghes *ghes,
-				struct acpi_hest_generic_status *estatus)
+int __ghes_check_estatus(struct ghes *ghes,
+			 struct acpi_hest_generic_status *estatus)
 {
 	u32 len = cper_estatus_len(estatus);
 	u32 max_len = min(ghes->generic->error_block_length,
@@ -389,9 +331,9 @@ static int __ghes_check_estatus(struct ghes *ghes,
 }
 
 /* Read the CPER block, returning its address, and header in estatus. */
-static int __ghes_peek_estatus(struct ghes *ghes,
-			       struct acpi_hest_generic_status *estatus,
-			       u64 *buf_paddr, enum fixed_addresses fixmap_idx)
+int __ghes_peek_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 *buf_paddr, enum fixed_addresses fixmap_idx)
 {
 	struct acpi_hest_generic *g = ghes->generic;
 	int rc;
@@ -400,7 +342,7 @@ static int __ghes_peek_estatus(struct ghes *ghes,
 	if (rc) {
 		*buf_paddr = 0;
 		pr_warn_ratelimited(FW_WARN GHES_PFX
-"Failed to read error status block address for hardware error source: %d.\n",
+				    "Failed to read error status block address for hardware error source: %d.\n",
 				   g->header.source_id);
 		return -EIO;
 	}
@@ -417,9 +359,9 @@ static int __ghes_peek_estatus(struct ghes *ghes,
 	return 0;
 }
 
-static int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
-			       u64 buf_paddr, enum fixed_addresses fixmap_idx,
-			       size_t buf_len)
+int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx,
+			size_t buf_len)
 {
 	ghes_copy_tofrom_phys(estatus, buf_paddr, buf_len, 1, fixmap_idx);
 	if (cper_estatus_check(estatus)) {
@@ -431,9 +373,9 @@ static int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 	return 0;
 }
 
-static int ghes_read_estatus(struct ghes *ghes,
-			     struct acpi_hest_generic_status *estatus,
-			     u64 *buf_paddr, enum fixed_addresses fixmap_idx)
+int ghes_read_estatus(struct ghes *ghes,
+		      struct acpi_hest_generic_status *estatus,
+		      u64 *buf_paddr, enum fixed_addresses fixmap_idx)
 {
 	int rc;
 
@@ -449,9 +391,9 @@ static int ghes_read_estatus(struct ghes *ghes,
 				   cper_estatus_len(estatus));
 }
 
-static void ghes_clear_estatus(struct ghes *ghes,
-			       struct acpi_hest_generic_status *estatus,
-			       u64 buf_paddr, enum fixed_addresses fixmap_idx)
+void ghes_clear_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx)
 {
 	estatus->block_status = 0;
 
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
new file mode 100644
index 000000000000..4649e3140888
--- /dev/null
+++ b/include/acpi/ghes_cper.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * GHES declarations used by both the ACPI APEI GHES driver
+ * and the firmware-first CPER provider.
+ *
+ * These declarations lets GHES and other firmware-first error sources use
+ * the same helper so the non-ACPI path follows the same
+ * behavior as GHES instead of carrying a separate copy.
+ *
+ * Derived from the ACPI APEI GHES driver.
+ *
+ * Copyright 2010,2011 Intel Corp.
+ *   Author: Huang Ying <ying.huang@intel.com>
+ */
+
+#ifndef ACPI_APEI_GHES_CPER_H
+#define ACPI_APEI_GHES_CPER_H
+
+#include <linux/workqueue.h>
+
+#include <acpi/ghes.h>
+#include <asm/fixmap.h>
+
+#define GHES_PFX	"GHES: "
+
+#define GHES_ESTATUS_MAX_SIZE		65536
+#define GHES_ESOURCE_PREALLOC_MAX_SIZE	65536
+
+#define GHES_ESTATUS_POOL_MIN_ALLOC_ORDER 3
+
+/* This is just an estimation for memory pool allocation */
+#define GHES_ESTATUS_CACHE_AVG_SIZE	512
+
+#define GHES_ESTATUS_CACHES_SIZE	4
+
+#define GHES_ESTATUS_IN_CACHE_MAX_NSEC	10000000000ULL
+/* Prevent too many caches are allocated because of RCU */
+#define GHES_ESTATUS_CACHE_ALLOCED_MAX	(GHES_ESTATUS_CACHES_SIZE * 3 / 2)
+
+#define GHES_ESTATUS_CACHE_LEN(estatus_len)			\
+	(sizeof(struct ghes_estatus_cache) + (estatus_len))
+#define GHES_ESTATUS_FROM_CACHE(estatus_cache)			\
+	((struct acpi_hest_generic_status *)				\
+	 ((struct ghes_estatus_cache *)(estatus_cache) + 1))
+
+#define GHES_ESTATUS_NODE_LEN(estatus_len)			\
+	(sizeof(struct ghes_estatus_node) + (estatus_len))
+#define GHES_ESTATUS_FROM_NODE(estatus_node)			\
+	((struct acpi_hest_generic_status *)				\
+	 ((struct ghes_estatus_node *)(estatus_node) + 1))
+
+#define GHES_VENDOR_ENTRY_LEN(gdata_len)                               \
+	(sizeof(struct ghes_vendor_record_entry) + (gdata_len))
+#define GHES_GDATA_FROM_VENDOR_ENTRY(vendor_entry)                     \
+	((struct acpi_hest_generic_data *)                              \
+	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
+
+static inline bool is_hest_type_generic_v2(struct ghes *ghes)
+{
+	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
+}
+
+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+	u8 notify_type = ghes->generic->notify.type;
+
+	return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
+
+struct ghes_vendor_record_entry {
+	struct work_struct work;
+	int error_severity;
+	char vendor_record[];
+};
+
+#ifdef CONFIG_ACPI_APEI
+struct ghes *ghes_new(struct acpi_hest_generic *generic);
+void ghes_fini(struct ghes *ghes);
+
+int ghes_read_estatus(struct ghes *ghes,
+		      struct acpi_hest_generic_status *estatus,
+		      u64 *buf_paddr, enum fixed_addresses fixmap_idx);
+void ghes_clear_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx);
+int __ghes_peek_estatus(struct ghes *ghes,
+			struct acpi_hest_generic_status *estatus,
+			u64 *buf_paddr, enum fixed_addresses fixmap_idx);
+int __ghes_check_estatus(struct ghes *ghes,
+			 struct acpi_hest_generic_status *estatus);
+int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
+			u64 buf_paddr, enum fixed_addresses fixmap_idx,
+			size_t buf_len);
+#endif
+
+#endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 00/10] ACPI: APEI: share GHES CPER helpers and add DT FFH provider
From: Ahmed Tiba @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Saket Dumbre,
	Davidlohr Bueso, Jonathan Cameron, Dave Jiang, Alison Schofield,
	Vishal Verma, Ira Weiny, Dan Williams, Ahmed Tiba, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet, Shuah Khan
  Cc: linux-kernel, linux-acpi, acpica-devel, linux-cxl, devicetree,
	linux-edac, linux-doc, Dmitry.Lamerov

This is v6 of the GHES refactor series. Compared to v5, it addresses
the latest review comments and tightens the DT CPER provider and
related helper wiring.

The DT firmware-first CPER provider is intended for the upstream
zena-css platform. Validation so far has been on FVP.

Changes in v6:
- Keep cper_estatus_len() out of the final DT provider patch by restoring it
  to the APEI internal header and keeping the DT-side length helper local.
- Remove the include guard dance around apei-internal.h in ghes_cper.c.
- Drop stray vendor notifier declarations from ghes_cper.h.
- Rework the DT CPER provider to:
  - validate block_status and CPER lengths before parsing
  - remove the unsupported synchronous/SEA path
  - embed struct acpi_hest_generic in the driver state
  - remove the unnecessary copy helper
  - avoid caching DT-provider estatus records
  - reject undersized status buffers
  - bound source_id to u16 and document why 0 is left unused
  - consume the status and ack buffers from the binding-defined
    memory-region entries
  - treat those regions as shared memory rather than MMIO
  - clear block_status after consuming a record before raising the ack bit
- Rewrite and rewrap the admin-guide text to describe the DT firmware-first
  CPER model directly.

Link to v5: https://lore.kernel.org/r/20260529-topics-ahmtib01-ras_ffh_arm_internal_review-v5-0-2e0500d42642@arm.com

ACPI: APEI: share GHES CPER helpers and add DT FFH provider

This is v5 of the GHES refactor series. Compared to v4, it only updates
the DT binding to address the latest review comments.

Changes in v5:
- Dropped the `oneOf` from `memory-region` and described it as a plain
  list with `minItems: 1`.
- Simplified the DT example to keep only the `arm,ras-cper` device node.
- Link to v4: https://lore.kernel.org/r/20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com

ACPI: APEI: share GHES CPER helpers and add DT FFH provider

This is v4 of the GHES refactor series. Compared to v3, it mainly
updates the shared header comment and the DT binding/description
for the firmware-owned CPER buffer.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>

Changes in v4:
- Reworded the ghes_cper.h header comment and kept the original copyrights.
- Fixed the ghes_cper.h W=1 warnings by limiting the ACPI
  fixmap-based declarations to the ACPI build path.
- Updated the DT binding to describe the CPER buffer
  as firmware-owned shared memory.
- Described the optional ack area as a second memory-region entry.
- Updated the DT example accordingly.
- Link to v3: https://lore.kernel.org/r/20260318-topics-ahmtib01-ras_ffh_arm_internal_review-v3-0-48e6a1c249ef@arm.com

Changes in v3:
- Fixed the new ghes_cper.h header comment and kept the original
  copyrights.
- Added <linux/bitfield.h> to fix the kernel test robot build failure.
- Renamed the binding/compatible and DT-side naming to ras-cper.
- Switched the DT provider to generic firmware property accessors.
- Replaced atomic source IDs with IDA.
- Updated IRQ/error/resource handling as suggested in review
  (platform_get_irq(), dev_err_probe(), devm platform ioremap
  helpers).
- Removed the ARM64 dependency and fixed Kconfig/build coverage.
- Clarified comments and kept the early move patches mechanical.
- Link to v2: https://lore.kernel.org/r/20260220-topics-ahmtib01-ras_ffh_arm_internal_review-v2-0-347fa2d7351b@arm.com

Changes in v2:
- Dropped the proposed "estatus core" and kept GHES naming/flow intact
  (per Borislav Petkov).
- Re-sliced the series into smaller mechanical steps (per Mauro Carvalho Chehab).
- Minor DT binding fixes based on Krzysztof Kozlowski's feedback.
- Removed fixmap slot usage from the DT FFH driver (per Will Deacon).

Series structure:
- Patches 1-8 are mechanical moves only and do not change behavior.
- Patch 9 wires the shared helpers back into GHES.
- The DT firmware-first CPER buffer provider is added in the final patches.
- "ACPI: APEI: introduce GHES helper" is internal build glue only
  and does not introduce a new user-visible configuration option.

- Link to v1: https://lore.kernel.org/r/20251217112845.1814119-1-ahmed.tiba@arm.com

---
Ahmed Tiba (10):
      ACPI: APEI: GHES: share macros via a private header
      ACPI: APEI: GHES: move CPER read helpers
      ACPI: APEI: GHES: move GHESv2 ack and alloc helpers
      ACPI: APEI: GHES: move estatus cache helpers
      ACPI: APEI: GHES: move vendor record helpers
      ACPI: APEI: GHES: move CXL CPER helpers
      ACPI: APEI: introduce GHES helper
      ACPI: APEI: share GHES CPER helpers
      dt-bindings: firmware: add arm,ras-cper
      RAS: add firmware-first CPER provider

 Documentation/admin-guide/RAS/main.rst             |   15 +
 .../devicetree/bindings/firmware/arm,ras-cper.yaml |   52 +
 MAINTAINERS                                        |    6 +
 drivers/Makefile                                   |    1 +
 drivers/acpi/Kconfig                               |    4 +
 drivers/acpi/apei/Kconfig                          |    1 +
 drivers/acpi/apei/apei-internal.h                  |    3 +-
 drivers/acpi/apei/ghes.c                           | 1043 +------------------
 drivers/acpi/apei/ghes_cper.c                      | 1056 ++++++++++++++++++++
 drivers/ras/Kconfig                                |   11 +
 drivers/ras/Makefile                               |    1 +
 drivers/ras/cper-esource.c                         |  322 ++++++
 include/acpi/ghes.h                                |   10 +-
 include/acpi/ghes_cper.h                           |  141 +++
 include/cxl/event.h                                |    2 +-
 15 files changed, 1621 insertions(+), 1047 deletions(-)
---
base-commit: b1cbabe84ca1381a004fb91ee1791a1a53bce44e
change-id: 20260220-topics-ahmtib01-ras_ffh_arm_internal_review-bfddc7fc7cab

Best regards,
-- 
Ahmed Tiba <ahmed.tiba@arm.com>


^ permalink raw reply

* Re: [PATCH 6/7] hwmon: adm1275: Support ROHM BD12790
From: Guenter Roeck @ 2026-06-17 13:54 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, Wensheng Wang, Ashish Yadav, Kim Seer Paller,
	Cedric Encarnacion, Chris Packham, Yuxi Wang, Charles Hsu,
	ChiShih Tsai, linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <7e430392-1b27-4c3d-bfc7-1311b9838156@gmail.com>

On 6/16/26 22:56, Matti Vaittinen wrote:
> On 16/06/2026 17:08, Guenter Roeck wrote:
>> On 6/15/26 23:44, Matti Vaittinen wrote:
>>> From: Matti Vaittinen <mazziesaccount@gmail.com>
>>>
>>> Add support for ROHM BD12790 hot-swap controller which is largely
>>> similar to Analog Devices adm1272.
>>>
>>> The BD12790 uses the same selectable 60V/100V voltage ranges and
>>> 15mV/30mV current-sense ranges as the ADM1272, and the same VRANGE
>>> (bit 5) and IRANGE (bit 0) layout in PMON_CONFIG. It therefore uses
>>> a dedicated coefficient table that mirrors adm1272_coefficients, with
>>> the following differences derived from BD12790 datasheet Table 1 (p.18):
>>> - power 60V/30mV: m=17560 (vs. 17561)
>>> - power 100V/30mV: m=10536 (vs. 10535)
>>> - temperature: b=31880 (vs. 31871, reflecting T[11:0] = 4.2*T + 3188)
>>>
>>> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
>>> Assisted-by: GitHub Copilot:claude-sonnet-4.6
>>>
>>> ---
>>> Originally this patch was AI-generated. I did pretty much re-write the
>>> probe changes by hand, and also fixed some of the coefficient math
>>> afterwards :/ But yeah, this one was AI "assisted". :)
>>>
>>>   drivers/hwmon/pmbus/Kconfig   |  4 +--
>>>   drivers/hwmon/pmbus/adm1275.c | 53 +++++++++++++++++++++++++++++------
>>>   2 files changed, 47 insertions(+), 10 deletions(-)
> 
> // snip
> 
>>> @@ -655,12 +681,23 @@ static int adm1275_probe(struct i2c_client *client)
>>>           break;
>>>       case adm1272:
>>>       case adm1273:
>>> +    case bd12790:
>>
>> Please don't overload the existing case statements.
>> Just add separate case statements for the new chips.
>>
> 
> Hmm. Ok, although, here, same as with the BD12780, I would like the code to clearly show that the BD12790 is very very similar to another IC (adm1272). When we have own case for it, this information gets kind of lost as these cases are a tad too long to easily spot the differences. If there are any ideas how to ease spotting this while having own cases - I am keen to hear.
> 

You could add a comment into the case statement.

Guenter


^ permalink raw reply

* Re: [PATCH 3/7] hwmon: adm1275: Support ROHM BD12780
From: Guenter Roeck @ 2026-06-17 13:50 UTC (permalink / raw)
  To: Matti Vaittinen, Matti Vaittinen, Matti Vaittinen
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, Wensheng Wang, Ashish Yadav, Kim Seer Paller,
	Cedric Encarnacion, Chris Packham, Yuxi Wang, Charles Hsu,
	ChiShih Tsai, linux-hwmon, devicetree, linux-kernel, linux-doc
In-Reply-To: <46b3f680-91a9-4a2a-a197-8f0ca5e38b90@gmail.com>

On 6/16/26 22:48, Matti Vaittinen wrote:
> Thanks for taking the time to review this! Feedback is appreciated :)
> 
> On 16/06/2026 17:13, Guenter Roeck wrote:
>> On 6/15/26 23:36, Matti Vaittinen wrote:
>>> From: Matti Vaittinen <mazziesaccount@gmail.com>
>>>
>>> ROHM BD12780 and BD12780A are hot-swap controllers. They are largely
>>> similar to Analog Devices ADM1278. Besides the ID registers and some
>>> added functionality, the BD12780 and BD12780A mark PMON_CONFIG bits
>>> [15:14] as reserved. Hence TSFILT setting must be omitted on these ICs.
>>>
>>> The BD12780 has 3 pins usable for configuring the I2C address. The
>>> BD12780A lists the ADDR3-pin as "not connect".
>>>
>>> Support ROHM BD12780 and BD12780A controllers.
>>>
>>> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
>>> ---
>>>   drivers/hwmon/pmbus/Kconfig   |  2 +-
>>>   drivers/hwmon/pmbus/adm1275.c | 46 +++++++++++++++++++++++++++++------
>>>   2 files changed, 39 insertions(+), 9 deletions(-)
>>>
> 
> // snip
> 
>>> @@ -487,6 +489,21 @@ static const struct i2c_device_id adm1275_id[] = {
>>>       { "adm1281", adm1281 },
>>>       { "adm1293", adm1293 },
>>>       { "adm1294", adm1294 },
>>> +    /*
>>> +     * The BD12780a is functionally identical to BD12780(*). Even the pmbus ID
>>> +     * register contents are same. When instantiated from the DT, it is required
>>> +     * to have the bd12780 as a fall-back. We still need the bd12780a ID here,
>>> +     * because the i2c_device_id is created from the first compatible, not from
>>> +     * the fall-back entry.
>>> +     * (*)Until proven to differ. I prefer having own compatible for these
>>> +     * variants for that day. Please note that even though the probe is called
>>> +     * based on the 'bd12780a' -entry, the ID is picked at probe based on the
>>> +     * pmbus register contents and not by DT entry. Thus, if the bd12780 and
>>> +     * bd12780a are found to require different handling, then this needs to be
>>> +     * changed, or bd12780a is handled as bd12780.
>>> +     */
>>> +    { "bd12780", bd12780 },
>>> +    { "bd12780a", /* driver data unused, see --^ */ },
>>
>> We don't usually do that. There are various A/B/C variants for many chips,
>> and we just use the base name unless a difference is warranted. Either this
>> is needed, and driver data is needed as well, or it isn't. If it is not needed,
>> it should be dropped.
> 
> At the moment the only difference I know is reduced amount of I2C slave addresses. This shouldn't be visible to this driver.
> 
> My problem is that I don't know for sure if we later notice something that requires differentiating. Thus I would like to have different DT compatibles (or other source of I2C IDs if those are used to instantiate the driver). If we don't do this, then we have problems if we later find out that these ICs require different handling as users because we can't differentiate these ICs if they are described with same compatible/I2C ID.
> 
> The "fun" thing is that both variants have exactly same MFR_MODEL and MFR_REVISION register contents. Thus, these ICs can't be differentiated by reading PMBus registers.
> 
> This is also why the driver data entry gets unused. The existing probe mechanism in this driver, scans the names in this ID list and compares it to the PMBus MFR_MODEL, and then picks-up the driver data to use. Now because BD12780 and BD12780A both have same MFR_MODEL, the scan in probe will always pick the same driver data entry, no matter what ID was matched by bus code. That's why I added the comment here.
> 
> If I drop the { "bd12780a", /* driver data unused, see --^ */ } -entry from the ID list and add the of_device_ids, then I think this problem is solved for the DT-platforms. As far as I understand, this would still cause any non DT platform to describe both variants as "bd12780", making it impossible to later differentiate ICs in the driver, right? I can do this, but for me it feels a bit like asking for problems...
> 
> My thinking was to have different IDs for these variants so hardware description could have different IDs for different ICs. Then, if we later need to differentiate these ICs, we still have an option to change the probe to trust the i2c_get_match_data() when PMBus indicates the bd12780.
> 
> This however is some extra complexity, and I would like to add it to the probe only if it really is required.
> 
> But yeah, having an ID entry in the list and driver data not used even when the matching IC is found, is unusual and would have caught me off guard. Hence I added the (long) comment.
> 

It is unusual and unnecessary. If for whatever reason it turns out to be necessary
later to have a separate ID, add it then. Again, there are lots of chips with A/B/C
variants. We only add separate entries for them if/when needed just for "in case".

>>
>>>       { "mc09c", sq24905c },
>>>       { }
>>>   };
> 
> // snip
> 
>>> @@ -712,7 +732,16 @@ static int adm1275_probe(struct i2c_client *client)
>>>           break;
>>>       case adm1278:
>>>       case adm1281:
>>> +    case bd12780:
>>>       case sq24905c:
>>> +    {
>>> +        u16 defconfig;
>>> +
>>> +        if (data->id == bd12780)
>>> +            defconfig = BD12780_PMON_DEFCONFIG;
>>> +        else
>>> +            defconfig = ADM1278_PMON_DEFCONFIG;
>>> +
>>
>> Please add a separate case statement for the new chip
>> and do not overload existing chip data.
> 
> I originally did just that. I, however, was not happy with this as it resulted quite long own case this IC, which is almost identical to this other (adm1278, adm1281, sq24905c) case. I wanted the code to shout that the DB12780 is indeed (almost) identical to the adm1278. But yeah, I agree, having "if (foo)" in "switch (foo)" case can get confusing.
> 
Better that than embedded if statements. At some point it would
probably make sense to rework the driver to use per-chip configuration
data, similar to other drivers such as lm90. But that would be a separate
effort. Until then, please keep per-chip configuration in separate case
statements.

Thanks,
Guenter


^ permalink raw reply

* Re: [PATCH] Docs/translations/it_IT: update current minimal requirements
From: Federico Vaga @ 2026-06-17 13:43 UTC (permalink / raw)
  To: Doehyun Baek; +Cc: linux-doc, Jonathan Corbet, Shuah Khan
In-Reply-To: <20260617085305.3205822-1-doehyunbaek@gmail.com>

Hi and thank you Doehyun.

It looks all good to me.


Federico Vaga


mercoledì 17 giugno 2026 10:53, Doehyun Baek <doehyunbaek@gmail.com> ha scritto:

> Update the Italian translation of the current minimal requirements table to
> match Documentation/process/changes.rst.  The translated table still listed
> older versions for Rust, bindgen, pahole, Sphinx, and Python.
> 
> Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
> Cc: Federico Vaga <federico.vaga@vaga.pv.it>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Shuah Khan <skhan@linuxfoundation.org>
> ---
>  Documentation/translations/it_IT/process/changes.rst | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/translations/it_IT/process/changes.rst b/Documentation/translations/it_IT/process/changes.rst
> index 7ee54c972418..1f89bae7b6c2 100644
> --- a/Documentation/translations/it_IT/process/changes.rst
> +++ b/Documentation/translations/it_IT/process/changes.rst
> @@ -34,14 +34,14 @@ PC Card, per esempio, probabilmente non dovreste preoccuparvi di pcmciautils.
>  ====================== =================  ========================================
>  GNU C                  8.1                gcc --version
>  Clang/LLVM (optional)  17.0.1             clang --version
> -Rust (opzionale)       1.78.0             rustc --version
> -bindgen (opzionale)    0.65.1             bindgen --version
> +Rust (opzionale)       1.85.0             rustc --version
> +bindgen (opzionale)    0.71.1             bindgen --version
>  GNU make               4.0                make --version
>  bash                   4.2                bash --version
>  binutils               2.30               ld -v
>  flex                   2.5.35             flex --version
>  bison                  2.0                bison --version
> -pahole                 1.16               pahole --version
> +pahole                 1.26               pahole --version
>  util-linux             2.10o              mount --version
>  kmod                   13                 depmod -V
>  e2fsprogs              1.41.4             e2fsck -V
> @@ -60,12 +60,12 @@ mcelog                 0.6                mcelog --version
>  iptables               1.4.2              iptables -V
>  openssl & libcrypto    1.0.0              openssl version
>  bc                     1.06.95            bc --version
> -Sphinx\ [#f1]_         2.4.4              sphinx-build --version
> +Sphinx\ [#f1]_         3.4.3              sphinx-build --version
>  cpio                   any                cpio --version
>  GNU tar                1.28               tar --version
>  gtags (opzionale)      6.6.5              gtags --version
>  mkimage (opzionale)    2017.01            mkimage --version
> -Python (opzionale)     3.5.x              python3 --version
> +Python (opzionale)     3.9.x              python3 --version
>  ====================== =================  ========================================
> 
>  .. [#f1] Sphinx è necessario solo per produrre la documentazione del Kernel
> --
> 2.43.0
> 
>

^ permalink raw reply

* Re: [PATCH v5 09/10] dt-bindings: firmware: add arm,ras-cper
From: Ahmed Tiba @ 2026-06-17 13:41 UTC (permalink / raw)
  To: Rob Herring
  Cc: Jonathan Cameron, will, xueshuai, saket.dumbre, mchehab, dave,
	djbw, bp, tony.luck, guohanjun, lenb, skhan, vishal.l.verma,
	rafael, corbet, ira.weiny, dave.jiang, krzk+dt, catalin.marinas,
	alison.schofield, conor+dt, linux-arm-kernel, Michael.Zhao2,
	linux-doc, linux-kernel, linux-cxl, Dmitry.Lamerov, devicetree,
	linux-acpi, linux-edac, acpica-devel
In-Reply-To: <20260612144910.GA989816-robh@kernel.org>

On 12/06/2026 15:49, Rob Herring wrote:
> On Thu, Jun 11, 2026 at 03:22:21PM +0100, Ahmed Tiba wrote:
>> On 29/05/2026 17:44, Jonathan Cameron wrote:
>>> On Fri, 29 May 2026 10:50:49 +0100
>>> Ahmed Tiba<ahmed.tiba@arm.com> wrote:
>>>>    .../devicetree/bindings/firmware/arm,ras-cper.yaml | 54 ++++++++++++++++++++++
>>>>    MAINTAINERS                                        |  5 ++
>>>>    2 files changed, 59 insertions(+)
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
>>>> new file mode 100644
>>>> index 000000000000..3d4de096093f
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
>>>> @@ -0,0 +1,54 @@
>>>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>>>> +%YAML 1.2
>>>> +---
>>>> +$id:http://devicetree.org/schemas/firmware/arm,ras-cper.yaml#
>>>> +$schema:http://devicetree.org/meta-schemas/core.yaml#
>>>> +
>>>> +title: Arm RAS CPER provider
>>>> +
>>>> +maintainers:
>>>> +  - Ahmed Tiba<ahmed.tiba@arm.com>
>>>> +
>>>> +description:
>>>> +  Arm Reliability, Availability and Serviceability (RAS) firmware can expose
>>>> +  a firmware-first CPER error source directly via DeviceTree. Firmware
>>>> +  provides the CPER Generic Error Status block and notifies the OS through
>>>> +  an interrupt.
>>> I'd like some spec references in here if possible.
>> I can add a reference to the UEFI CPER specification for the Generic
>> Error Status record format.
>>
>> For the firmware-first DT description itself I do not have a more specific
>> public reference to cite.
> 
> Is there a platform actually using this with DT (FVP doesn't really
> count)?
> 
> Rob

Yes. The initial intended user is the upstream zena-css platform,
with validation so far on FVP.

I will note that in the next revision commit message and cover letter.

Best regards,
Ahmed

^ permalink raw reply

* Re: [PATCH v4 00/31] Introduce SCMI Telemetry FS support
From: Christian Brauner @ 2026-06-17 12:58 UTC (permalink / raw)
  To: Cristian Marussi
  Cc: linux-kernel, linux-arm-kernel, arm-scmi, linux-fsdevel,
	linux-doc, sudeep.holla, james.quinlan, f.fainelli,
	vincent.guittot, etienne.carriere, peng.fan, michal.simek, d-gole,
	jic23, elif.topuz, lukasz.luba, philip.radford,
	souvik.chakravarty, leitao, kas, puranjay, usama.arif,
	kernel-team
In-Reply-To: <20260612223802.1337232-1-cristian.marussi@arm.com>

On Fri, Jun 12, 2026 at 11:37:30PM +0100, Cristian Marussi wrote:
> Hi all,
> 
> --------------------------------------------------------------------------------
> [TLDR Summary]
> This series introduces a new SCMI driver which uses a new Telemetry FS to expose
> and configure SCMI Telemetry Data Events retrieved from the platform SCMI FW
> at runtime. The patches carrying the new STLMFS Filesystem support are tagged
> with 'stlmfs'.
> --------------------------------------------------------------------------------
> 
> the upcoming SCMI v4.0 specification [0] introduces a new SCMI protocol
> dedicated to System Telemetry.
> 
> In a nutshell, the SCMI Telemetry protocol allows an agent to discover at
> runtime the set of Telemetry Data Events (DEs) available on a specific
> platform and provides the means to configure the set of DEs that a user is
> interested into, while reading them back using the collection method that
> is deeemed more suitable for the usecase at hand. (...amongst the various
> possible collection methods allowed by SCMI specification)
> 
> Without delving into the gory details of the whole SCMI Telemetry protocol
> let's just say that the SCMI platform/server firmware advertises a number
> of Telemetry Data Events, each one identified by a 32bit unique ID, and an
> SCMI agent/client, like Linux, can discover them and read back at will the
> associated data value in a number of ways.
> Data collection is mainly intended to happen on demand via shared memory
> areas exposed by the platform firmware, discovered dynamically via SCMI
> Telemetry and accessed by Linux on-demand, but some DE can also be reported
> via SCMI Notifications asynchronous messages or via direct dedicated
> FastChannels (another kind of SCMI memory based access): all of this
> underlying mechanism is anyway hidden to the user since it is mediated by
> the kernel driver which will return the proper data value when queried.
> 
> Anyway, the set of well-known architected DE IDs defined by the spec is
> limited to a dozen IDs, which means that the vast majority of DE IDs are
> customizable per-platform: as a consequence, though, the same ID, say
> '0x1234', could represent completely different things on different systems.
> 
> Precise definitions and semantic of such custom Data Event IDs are out of
> the scope of the SCMI Telemetry specification and of this implementation:
> they are supposed to be provided using some kind of JSON-like description
> file that will have to be consumed by a userspace tool which would be
> finally in charge of making sense of the set of available DEs.
> 
> IOW, in turn, this means that even though the DEs enumerated via SCMI come
> with some sort of topological and qualitative description provided by the
> protocol (like unit of measurements, name, topology info etc), kernel-wise
> we CANNOT be completely sure of "what is what" without being fed-back some
> sort of information about the DEs by the afore mentioned userspace tool.
> 
> For these reasons, currently this series does NOT attempt to register any
> of these DEs with any of the usual in-kernel subsystems (like HWMON, IIO,
> PERF etc), simply because we cannot be sure which DE is suitable, or even
> desirable, for a given subsystem. This also means there are NO in-kernel
> users of these Telemetry data events as of now.
> 
> So, while we do not exclude, for the future, to feed/register some of the
> discovered DEs to/with some of the above mentioned Kernel subsystems, as
> of now we have ONLY modeled a custom userspace API to make SCMI Telemetry
> available to userspace tools.
> 
> In deciding which kind of interface to expose SCMI Telemetry data to a
> user, this new SCMI Telemetry driver aims at satisfying 2 main reqs:
> 
>  - exposing an FS-based human-readable interface that can be used to
>    discover, configure and access our Telemetry data directly also from
>    the shell without special tools
> 
>  - exposing alternative machine-friendly, more-performant, binary
>    interfaces that can be used to avoid the overhead of multiple accesses
>    to the VFS and that can be more suitable to access with custom tools
> 
> In the initial RFC posted a few months ago [1], the above was achieved
> with a combination of a SysFS interface, for the human-readable side of
> the story, and a classic chardev/ioctl for the plain binary access.
> 
> Since V1, instead, we moved away from this combined approach, especially
> away from SysFS, for the following reason:
> 
>  1. "Abusing SysFS": SysFS is a handy way to expose device related
>       properties in a common way, using a few common helpers built on
>       kernfs; this means, though, that unfortunately in our scenario I had
>       to generate a dummy simple device for EACH SCMI Telemetry DataEvent
>       that I got to discover at runtime and attach to them, all of the
>       properties I need.
>       This by itself seemed to me abusing the SysFS framework, but, even
>       ignoring this, the impact on the system when we have to deal with
>       hundreds or tens of thousands of DEs is sensible.
>       In some test scenario I ended with 50k DE devices and half-a-millon
>       related property files ... O_o
> 
>  2. "SysFS constraints": SysFS usage itself has its well-known constraints
>       and best practices, like the one-file/one-value rule, and due to the
>       fact that any virtual file with a complex structure or handling logic
>       is frowned upon, you can forget about IOCTLs and mmap'ing to provide
>       a more performant interface within SysFs, which is the reason why,
>       in the previous RFC, there was an additional alternative chardev
>       interface.
>       These latter limitations around the implementation of files with a
>       more complex semantic (i.e. with a broader set of file_operations)
>       derive from the underlying KernFS support, so KernFS is equally not
>       suitable as a building block for our implementation.
> 
>  2. "Chardev limitations": Given the nature of the protocol, the hybrid
>       approach employing character devices was itself problematic: first
>       of all because there is an upper limit on the number of chardev we
>       can create, dictated by the range of available minor numbers, and
>       then because the fact itself to have to maintain 2 completely
>       different interfaces (FS + chardev) is painful.
> 
> As a final remark, please NOTE THAT all of this is supposed to be available
> in production systems across a number of heterogeneous platforms: for these
> reasons the easy choice, debugFS, is NOT an option here.
> 
> Due to the above reasoning, since V1 we opted for a new approach with the
> proposed interfaces now based on a full fledged, unified, virtual pseudo
> filesystem implemented from scratch, so that we can:
> 
>  - expose all the DEs property we like as before with SysFS, but without
>    any of the constraint imposed by the usage of SysFs or kernfs.
> 
>  - easily expose additional alternative views of the same set of DEs
>    using symlinking capabilities (e.g. alternative topological view)
> 
>  - additionally expose a few alternative and more performant interfaces
>    by embedding in that same FS, a few special virtual files:
> 
>    + 'control': to issue IOCTLs for quicker discovery and on-demand access
>    		to data
>    + 'pipe' [TBD]: to provide a stream of events using a virtual
>    		   infinite-style file
>    + 'raw_<N>' [TBD]: to provide direct memory mapped access to the raw
>    		      SCMI Telemetry data from userspace

A filsystem driver for telemetry like this is really misguided. I think
shell access is really not an argument for adding a filesystem into the
kernel like this. That's just not appropriate justification to push
thousand and thousands of lines of code into the kernel.

You're building completely new infrastructure. The format is whatever it
is. If you stream it somehow just add a binary that userspace can use to
consume or translate it. If you need a filesystem interface for
convenience build it via FUSE on top of whatever streams that data and
get it ouf of the kernels way.

You also buy into all kinds of really wonky properties. If you split it
over multiple files you can never get a snapshot of data that is
consistent if it's across multiple files.

Telemetry over a filesystem is just not a great idea. If you did it via
sysfs I really wouldn't care because all because the infrastructure
already exists and I couldn't be bothered if this grew yet another wart
but as a separate massive hand-rolled pseudofs, no I'm not seeing it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox