Linux Documentation
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Ian Kent @ 2026-05-18 13:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi,
	Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner,
	linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer
In-Reply-To: <yk2hem4zwinm4glenpc74to7sm5kyriksgwn6mxh7t4saotiba@7zik7jcnbs5m>

On 18/5/26 16:19, Jan Kara wrote:
> Hi Ian,
>
> On Mon 18-05-26 10:55:43, Ian Kent wrote:
>> On 18/5/26 07:55, NeilBrown wrote:
>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
>>> According to the email you linked, a problem arises when a directory has
>>> a great many negative children.  Code which walks the list of children
>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
>>> and result in long lock-hold times.  So maybe a limit on negative
>>> dentries for any parent is what we really want.  That would be clumsy to
>>> implement I imagine.
>> But the notion of dropping the dentry in ->d_delete() on last dput() is
>> simple enough but did see regressions (the only other place in the VFS
>> besides dentry_kill() that the inode is unlinked from the dentry on
>> dput()). I wonder if the regression was related to the test itself
>> deliberately recreating deleted files and if that really is normal
>> behaviour. By itself that should prevent almost all negative dentries
>> being retained. Although file systems could do this as well (think XFS
>> inode recycling) it should be reasonable to require it be left to the
>> VFS.
>>
>> But even that's not enough given that, in my case, there would still be
>> around 4 million dentries in the LRU cache and in fsnotify there are
>> directory child traversals holding the parent i_lock "spinlock" that are
>> going to cause problems.
> Do you mean there are very many positive children of a directory?

Didn't quantify that.

The symptom is the "Spinlock held for more than ... seconds" occurring
in the log. So there are certainly a lot of children in the list, but
it's an assumption the ratio of positive to negative entries is roughly
the same as the overall ratio in the dcache.

>
>> That's all that much more puzzling when I see things like commit
>> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks
>> like it implies the child flag depends entirely on the parent state (what
>> am I missing Amir?)
> PARENT_WATCHED dentry flags (as the name suggests) are only caching the
> information whether the parent has notification marks receiving events from
> the child. So yes, the flag fully depends on the parent state.

Ok, this is something I was after, I will keep looking at the fsnotify
code since there is something to find, thanks for that.

>
>> so why is this traversal even retained in fsnotify?
> Not sure which traversal you mean but if you set watch on a parent, you
> have to walk all children to set PARENT_WATCHED flag so that you don't miss
> events on children...

Yes, that traversal is what I'm questioning ... again thanks.

I think the function name is still fsnotify_set_children_dentry_flags() in
recent kernels, the subject of commit 172e422ffea2 I mentioned above.

When you say miss events, are you saying that accessing the parent dentry to
work out if the child needs to respond to an event is quite expensive in the
overall event processing context? That might make more sense to me ... or do
I still completely misunderstand the reasoning behind the need for the flag?

>
>>> But what if we move dentries to the end of the list when they become
>>> negative, and to the start of the list when they become positive?  Then
>>> code which walks the child list could simply abort on the first
>>> negative.
>>>
>>> I doubt that would be quite as easy as it sounds, but it would at least
>>> be more focused on the observed symptom rather than some whole-system
>>> number which only vaguely correlates with the observed symptom.
>>>
>>> Maybe a completely different approach: change children-walking code to
>>> drop and retake the lock (with appropriate validation) periodically.
>>> That too would address the specific symptom.
>> Another good question.
>>
>> I have assumed that dropping and re-taking the lock cannot be done but
>> this is a question I would like answered as well. Dropping and re-taking
>> lock would require, as Miklos pointed out to me off-list, recording the
>> list position with say a cursor, introducing unwanted complexity when it
>> would be better to accept the cost of a single extra access to the parent
>> flags (which I assume is one reason to set the flag in the child).
> The parent access is actually more expensive than you might think. Based on
> experience with past fsnotify related performance regression I expect some
> 20% performance hit for small tmpfs writes if you add unconditional parent
> access to the write path.

That sounds like a lot for what should be a memory access of an already in
memory structure since the parent must be accessed to traverse the list of
child entries. I clearly don't fully understand the implications of what
I'm saying but there has been mention of another context ...

Nevertheless more useful information, ;)


Thanks again,

Ian
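
NeilBrown's suggestion quoted above (keep positive children at the head of
the child list and negative ones at the tail, so that a walker can stop at
the first negative entry) can be sketched in userspace C. This is only an
illustration under stated assumptions: struct child, struct parent and every
helper below are hypothetical names, not the kernel's dentry or child-list
API, and all locking is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry linked on its parent's child list. */
struct child {
	struct child *prev, *next;
	bool negative;	/* true: no inode attached */
	bool watched;	/* stand-in for a PARENT_WATCHED-style flag */
};

/* Hypothetical parent: a circular list with a sentinel node. */
struct parent {
	struct child head;
};

static void parent_init(struct parent *p)
{
	p->head.prev = p->head.next = &p->head;
}

/* Link c according to its state: positive at the head, negative at the tail. */
static void child_place(struct parent *p, struct child *c)
{
	struct child *prev = c->negative ? p->head.prev : &p->head;
	struct child *next = prev->next;

	c->prev = prev;
	c->next = next;
	prev->next = c;
	next->prev = c;
}

static void child_add(struct parent *p, struct child *c, bool negative)
{
	c->negative = negative;
	c->watched = false;
	child_place(p, c);
}

/* Re-sort an already linked child when it turns positive or negative. */
static void child_set_negative(struct parent *p, struct child *c, bool negative)
{
	c->prev->next = c->next;
	c->next->prev = c->prev;
	c->negative = negative;
	child_place(p, c);
}

/* Walk from the head and stop at the first negative child. */
static size_t mark_positive_children(struct parent *p)
{
	size_t visited = 0;
	struct child *c;

	for (c = p->head.next; c != &p->head && !c->negative; c = c->next) {
		c->watched = true;
		visited++;
	}
	return visited;
}
```

In this model the walk costs time proportional to the number of positive
children only. It does not help when the positive children themselves number
in the millions, which is the case Jan asks about.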



* [PATCH v2] cpufreq: Documentation: fix sampling_down_factor range
From: Pengjie Zhang @ 2026-05-18 13:34 UTC (permalink / raw)
  To: rafael, viresh.kumar, corbet
  Cc: skhan, zhongqiu.han, linux-pm, linux-doc, zhanjie9, zhenglifeng1,
	lihuisong, yubowen8, linhongye, linuxarm, zhangpengjie2,
	wangzhi12

The ondemand governor implementation accepts sampling_down_factor values
from 1 to 100000 via MAX_SAMPLING_DOWN_FACTOR, but the documentation in
admin-guide/pm/cpufreq.rst still says the valid range is 1 to 100.

Update the documentation to match the actual code.

Fixes: 2a0e49279850 ("cpufreq: User/admin documentation update and consolidation")
Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Signed-off-by: Pengjie Zhang <zhangpengjie2@huawei.com>
---
Changes in v2:
- Modify the title.
- Add Reviewed-by tag.
Link to v1:https://lore.kernel.org/all/20260515094930.273599-1-zhangpengjie2@huawei.com/
---
 Documentation/admin-guide/pm/cpufreq.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index dbe6d23a5d67..fdca59c955dc 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -516,7 +516,7 @@ This governor exposes the following tunables:
 	of those tasks above 0 and set this attribute to 1.
 
 ``sampling_down_factor``
-	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
+	Temporary multiplier, between 1 (default) and 100000 inclusive, to apply to
 	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
 
 	This causes the next execution of the governor's worker routine (after
-- 
2.33.0
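
The range check that the commit message describes can be sketched in
userspace C. MAX_SAMPLING_DOWN_FACTOR and its value of 100000 come from the
message above; parse_sampling_down_factor() is a hypothetical helper, not the
governor's actual store routine, and the string-parsing details are
assumptions.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_SAMPLING_DOWN_FACTOR 100000

/*
 * Hypothetical helper: parse a decimal string and accept it only if it
 * lies in [1, MAX_SAMPLING_DOWN_FACTOR]. Returns 0 or -EINVAL.
 */
static int parse_sampling_down_factor(const char *buf, unsigned int *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')	/* tolerate one trailing newline, as from echo */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (v < 1 || v > MAX_SAMPLING_DOWN_FACTOR)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}
```

Under this check a value such as 101 is accepted, which the old documentation
text (1 to 100) would have led a reader to consider invalid.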



* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-18 13:33 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, linux-doc, linux-kselftest, bpf, live-patching,
	Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
	Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
	Anthony Iliopoulos, Michal Hocko, Jiri Olsa
In-Reply-To: <CAPhsuW4x8shWon8Moi5VgCq2n4E2EzaaauZ2HHpy42Rp1Y-J-g@mail.gmail.com>

On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>> * fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>>   most production kernels. Even where enabled, it only works on
>>   functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>>   no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>>   lockdown=integrity and the override is probabilistic.
>>
>> * BPF override (bpf_override_return) honors the same
>>   ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>>   production kernels. Even where on, the operator interface is
>>   "load a verified BPF program," not a one-line write.
>
>If it is OK for killswitch to attach to any kernel functions, do we still
>need ALLOW_ERROR_INJECTION() for fail_function and BPF
>override? Shall we instead also allow fail_function and BPF override
>to attach to any kernel functions?

I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
integrity/safety mechanism for both bpf and fault injection.

It protects against a "developer or CI script doing legitimate fault injection
accidentally panics the box" scenario, not an "attacker gets in" one.

-- 
Thanks,
Sasha


* [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury Murashka @ 2026-05-18 13:23 UTC (permalink / raw)
  To: bhelgaas, mahesh
  Cc: oohall, corbet, skhan, linux-pci, linux-doc, linux-kernel,
	linuxppc-dev, Yury Murashka

pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
If a new AER error is subsequently reported, the AER driver calls
find_source_device() to find the source of the error. It rescans the
whole bus and picks the first device reporting an AER error. Because the
previous error was never cleared, the error is attributed to the wrong
device and AER recovery is started for the wrong device.

Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
AER error status even when recovery fails, preventing stale errors from
causing incorrect device identification on subsequent AER events.

Signed-off-by: Yury Murashka <yurypm@arista.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 drivers/pci/pci.c                               |  2 ++
 drivers/pci/pci.h                               |  2 ++
 drivers/pci/pcie/err.c                          | 13 +++++++++++++
 4 files changed, 22 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb..5a9e266f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5301,6 +5301,11 @@ Kernel parameters
                nomio           [S390] Do not use MIO instructions.
                norid           [S390] ignore the RID field and force use of
                                one PCI domain per PCI function
+               aer_clear_on_recovery_failure
+                               [PCIE] If the PCIEAER kernel config parameter is
+                               enabled, this kernel boot option can be used to
+                               enable AER errors cleanup even if error recovery
+                               failed.
                notph           [PCIE] If the PCIE_TPH kernel config parameter
                                is enabled, this kernel boot option can be used
                                to disable PCIe TLP Processing Hints support
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651..701459c62 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6769,6 +6769,8 @@ static int __init pci_setup(char *str)
                                disable_acs_redir_param = str + 18;
                        } else if (!strncmp(str, "config_acs=", 11)) {
                                config_acs_param = str + 11;
+                       } else if (!strncmp(str, "aer_clear_on_recovery_failure", 29)) {
+                               pci_enable_aer_clear_on_recovery_failure();
                        } else {
                                pr_err("PCI: Unknown option `%s'\n", str);
                        }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e5..093a7c896 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1292,6 +1292,7 @@ int pci_aer_clear_status(struct pci_dev *dev);
 int pci_aer_raw_clear_status(struct pci_dev *dev);
 void pci_save_aer_state(struct pci_dev *dev);
 void pci_restore_aer_state(struct pci_dev *dev);
+void pci_enable_aer_clear_on_recovery_failure(void);
 #else
 static inline void pci_no_aer(void) { }
 static inline void pci_aer_init(struct pci_dev *d) { }
@@ -1301,6 +1302,7 @@ static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline void pci_save_aer_state(struct pci_dev *dev) { }
 static inline void pci_restore_aer_state(struct pci_dev *dev) { }
+static inline void pci_enable_aer_clear_on_recovery_failure(void) { }
 #endif

 #ifdef CONFIG_ACPI
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc11..29d655a34 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -21,6 +21,13 @@
 #include "portdrv.h"
 #include "../pci.h"

+static int enable_aer_clear_on_recovery_failure;
+
+void pci_enable_aer_clear_on_recovery_failure(void)
+{
+       enable_aer_clear_on_recovery_failure = 1;
+}
+
 static pci_ers_result_t merge_result(enum pci_ers_result orig,
                                  enum pci_ers_result new)
 {
@@ -289,6 +296,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
        return status;

 failed:
+       if (enable_aer_clear_on_recovery_failure &&
+           (host->native_aer || pcie_ports_native)) {
+               pcie_clear_device_status(dev);
+               pci_aer_clear_nonfatal_status(dev);
+       }
+
        pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

        pci_walk_bridge(bridge, report_perm_failure_detected, NULL);
--
2.51.0
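
The misattribution described in the commit message can be modelled with a
small userspace sketch. It assumes, as the message states, that the source
lookup picks the first device that still reports an error. struct fake_dev
and both helpers are hypothetical and do not touch real AER registers.

```c
#include <stddef.h>

/* Hypothetical device model: one flag standing in for the AER status. */
struct fake_dev {
	const char *name;
	unsigned int err_status;	/* nonzero: an error is still logged */
};

/* Return the index of the first device with a logged error, or -1. */
static int find_first_error_source(const struct fake_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].err_status)
			return (int)i;
	return -1;
}

/* Model the end of a failed recovery, with or without the proposed cleanup. */
static void recovery_failed(struct fake_dev *dev, int clear_on_failure)
{
	if (clear_on_failure)
		dev->err_status = 0;
}
```

If the first device fails recovery without cleanup and a second device then
reports an error, find_first_error_source() still returns the first device.
With clear_on_failure set, it returns the second one.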


* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: David Hildenbrand (Arm) @ 2026-05-18 13:16 UTC (permalink / raw)
  To: Wei Yang, Lance Yang
  Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
	gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260514031009.f66cgop3ctgiqxz3@master>

On 5/14/26 05:10, Wei Yang wrote:
> On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
>>
>> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>>> functions to support future mTHP collapse.
>>>
>>> The current mechanism for determining collapse with the
>>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>>> raises a key design issue: if we support user defined max_pte_none values
>>> (even those scaled by order), a collapse of a lower order can introduce
>>> a feedback loop, or "creep", when max_ptes_none is set to a value greater
>>> than HPAGE_PMD_NR / 2. [1]
>>>
>>> With this configuration, a successful collapse to order N will populate
>>> enough pages to satisfy the collapse condition on order N+1 on the next
>>> scan. This leads to unnecessary work and memory churn.
>>>
>>> To fix this issue introduce a helper function that will limit mTHP
>>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>>> This effectively supports two modes: [2]
>>>
>>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>>>  that maps the shared zeropage. Consequently, no memory bloat.
>>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>>  available mTHP order.
>>>
>>> This removes the possibility of "creep", while not modifying any uAPI
>>> expectations. A warning will be emitted if any non-supported
>>> max_ptes_none value is configured with mTHP enabled.
>>>
>>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>> shared or swapped entry.
>>>
>>> No functional changes in this patch; however it defines future behavior
>>> for mTHP collapse.
>>>
>>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>>>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> include/trace/events/huge_memory.h |   3 +-
>>> mm/khugepaged.c                    | 117 ++++++++++++++++++++---------
>>> 2 files changed, 85 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>> index bcdc57eea270..443e0bd13fdb 100644
>>> --- a/include/trace/events/huge_memory.h
>>> +++ b/include/trace/events/huge_memory.h
>>> @@ -39,7 +39,8 @@
>>> 	EM( SCAN_STORE_FAILED,		"store_failed")			\
>>> 	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
>>> 	EM( SCAN_PAGE_FILLED,		"page_filled")			\
>>> -	EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>>> +	EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")	\
>>> +	EMe(SCAN_INVALID_PTES_NONE,	"invalid_ptes_none")
>>>
>>> #undef EM
>>> #undef EMe
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index f68853b3caa7..27465161fa6d 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -61,6 +61,7 @@ enum scan_result {
>>> 	SCAN_COPY_MC,
>>> 	SCAN_PAGE_FILLED,
>>> 	SCAN_PAGE_DIRTY_OR_WRITEBACK,
>>> +	SCAN_INVALID_PTES_NONE,
>>> };
>>>
>>> #define CREATE_TRACE_POINTS
>>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>>>  * PTEs for the given collapse operation.
>>>  * @cc: The collapse control struct
>>>  * @vma: The vma to check for userfaultfd
>>> + * @order: The folio order being collapsed to
>>>  *
>>>  * Return: Maximum number of none-page or zero-page PTEs allowed for the
>>>  * collapse operation.
>>>  */
>>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>>> -		struct vm_area_struct *vma)
>>> +static int collapse_max_ptes_none(struct collapse_control *cc,
>>> +		struct vm_area_struct *vma, unsigned int order)
>>> {
>>> +	unsigned int max_ptes_none = khugepaged_max_ptes_none;
>>> 	// If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>>
>> One thing I still want to call out: kernel code usually uses C-style
>> comments :)
>>
>>> 	if (vma && userfaultfd_armed(vma))
>>> 		return 0;
>>> 	// for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>>> 	if (!cc->is_khugepaged)
>>> 		return HPAGE_PMD_NR;
>>> -	// For all other cases repect the user defined maximum.
>>> -	return khugepaged_max_ptes_none;
>>> +	// for PMD collapse, respect the user defined maximum.
>>> +	if (is_pmd_order(order))
>>> +		return max_ptes_none;
>>> +	/* Zero/non-present collapse disabled. */
>>> +	if (!max_ptes_none)
>>> +		return 0;
>>> +	// for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>>> +	// scale the maximum number of PTEs to the order of the collapse.
>>> +	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>>> +		return (1 << order) - 1;
>>> +
>>> +	// We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>>> +	// Emit a warning and return -EINVAL.
>>> +	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>>> +		      KHUGEPAGED_MAX_PTES_LIMIT);
>>
>> Maybe fallback to 0 instead, as David suggested earlier?
>>
> 
> It looks reasonable to fallback to 0.
> 
> But as the updated Document says in patch 14:
> 
>   For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
>   value will emit a warning and no mTHP collapse will be attempted.
> 
> This is why it behaves like this now.
> 
>     mthp_collapse()
>         max_ptes_none = collapse_max_ptes_none();
>         if (max_ptes_none < 0)
>             return collapsed;
> 
>> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
>> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
>> disable it :(
>>
> 
> So it depends on what we want to do here :-)
> 
> For me, I would vote for fallback to 0.

At this point I'd prefer not to return errors from collapse_max_ptes_none().
It's just rather awkward to return an error deep down in collapse code for a
configuration problem.

For mthp collapse, we only support max_ptes_none==0 and
max_ptes_none=="HPAGE_PMD_NR - 1" (default).

If another value is specified while collapsing mTHP, print a warning and treat
it as 0 (safe value, no creep, no memory waste).

In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
we always treat them as being 0 for mTHP collapse (and don't issue a
warning, because we would issue a warning with the default settings).

@Lorenzo, fine with you?

-- 
Cheers,

David
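
The behaviour proposed above (warn and fall back to 0 instead of returning an
error) can be sketched as a userspace function. This is not the patch's code:
max_ptes_none_sketch() is a hypothetical name, the booleans replace the vma
and collapse_control arguments, and the constants assume 4K pages with a
512-entry PMD, matching the "511 (on 4k pagesz)" remark in the quoted commit
message.

```c
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER			9	/* assumption: 4K pages */
#define HPAGE_PMD_NR			(1u << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MAX_PTES_LIMIT	(HPAGE_PMD_NR - 1)

/*
 * Hypothetical userspace model: returns the number of none/zero PTEs
 * tolerated for a collapse to the given order, never an error.
 */
static unsigned int max_ptes_none_sketch(bool uffd_armed, bool is_khugepaged,
					 unsigned int order,
					 unsigned int max_ptes_none)
{
	static bool warned;

	if (uffd_armed)			/* userfaultfd-armed vma */
		return 0;
	if (!is_khugepaged)		/* MADV_COLLAPSE */
		return HPAGE_PMD_NR;
	if (order == HPAGE_PMD_ORDER)	/* PMD collapse: honour the sysctl */
		return max_ptes_none;
	if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
		return (1u << order) - 1;
	if (max_ptes_none && !warned) {	/* unsupported value: warn once */
		warned = true;
		fprintf(stderr,
			"mTHP collapse only supports max_ptes_none values of 0 or %u\n",
			KHUGEPAGED_MAX_PTES_LIMIT);
	}
	return 0;
}
```

With an unsigned return type the caller no longer needs the
"max_ptes_none < 0" check that Wei Yang quotes from mthp_collapse().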


* Re: [PATCH] cpufreq: Documentation: fix sampling_down_factor documentation range
From: zhangpengjie (A) @ 2026-05-18 13:12 UTC (permalink / raw)
  To: Zhongqiu Han, rafael, viresh.kumar, corbet
  Cc: skhan, linux-pm, linux-doc, zhanjie9, zhenglifeng1, lihuisong,
	yubowen8, linhongye, linuxarm, wangzhi12
In-Reply-To: <05980ed2-e591-468c-a528-5b2b74c192d8@oss.qualcomm.com>


On 5/17/2026 1:04 PM, Zhongqiu Han wrote:
> On 5/15/2026 5:49 PM, Pengjie Zhang wrote:
>> The ondemand governor implementation accepts sampling_down_factor values
>> from 1 to 100000 via MAX_SAMPLING_DOWN_FACTOR, but the documentation in
>> admin-guide/pm/cpufreq.rst still says the valid range is 1 to 100.
>>
>> Update the documentation to match the actual code.
>>
>> Fixes: 2a0e49279850 ("cpufreq: User/admin documentation update and 
>> consolidation")
>
>
> Thanks Pengjie,
>
> Yes, commit 3f78a9f7fcee introduced MAX_SAMPLING_DOWN_FACTOR (100000),
> and commit 2a0e49279850 updated the documentation later, so the Fixes
> tag is correct.
>
> Small nit: "documentation range" feels a bit redundant; just "range"
> might be enough.
>
> Looks good to me.
>
> Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
>
Thanks for your review. I'll send out v2 shortly.
Best regards,
     pengjie
>
>> Signed-off-by: Pengjie Zhang <zhangpengjie2@huawei.com>
>> ---
>>   Documentation/admin-guide/pm/cpufreq.rst | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/pm/cpufreq.rst 
>> b/Documentation/admin-guide/pm/cpufreq.rst
>> index dbe6d23a5d67..fdca59c955dc 100644
>> --- a/Documentation/admin-guide/pm/cpufreq.rst
>> +++ b/Documentation/admin-guide/pm/cpufreq.rst
>> @@ -516,7 +516,7 @@ This governor exposes the following tunables:
>>       of those tasks above 0 and set this attribute to 1.
>>     ``sampling_down_factor``
>> -    Temporary multiplier, between 1 (default) and 100 inclusive, to 
>> apply to
>> +    Temporary multiplier, between 1 (default) and 100000 inclusive, 
>> to apply to
>>       the ``sampling_rate`` value if the CPU load goes above 
>> ``up_threshold``.
>>         This causes the next execution of the governor's worker 
>> routine (after
>
>


* Re: [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: David Laight @ 2026-05-18 12:59 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Heiko Carstens, Kees Cook, Manuel Ebner, Andy Shevchenko,
	Jonathan Corbet, Shuah Khan, Andy Whitcroft, Joe Perches,
	Dwaipayan Ray, Lukas Bulwahn, Randy Dunlap, Jani Nikula,
	open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
	open list
In-Reply-To: <CAMuHMdXEezxGi1d=BCiQ57cbnG4D2PPXvt_FAHcyT5mgR7md3g@mail.gmail.com>

On Mon, 18 May 2026 09:11:04 +0200
Geert Uytterhoeven <geert@linux-m68k.org> wrote:

> Hi David,
...
> > I don't really see why strlcat() should be deprecated.
> > Clearly there are many cases where there are better ways to do things.  
> 
> https://elixir.bootlin.com/linux/v7.0.8/source/include/linux/fortify-string.h#L346
> already says "Do not use this function. [...] Prefer building the
>  * string with formatting, via scnprintf(), seq_buf, or similar.".

Trouble is that all of that requires a lot more rework.

I might try changing the type of the 'buffer' to sysfs_emit()
from 'char *' to 'sysfs_buf *'.
Initially the types will have to be the same, but propagating it through
will show where it can be used.
But last I looked I failed to even find the associated kmalloc().
Eventually it could be changed to a different type.

> > The only problem with strlcat() is that it returns the 'required length'.
> > So there are some broken uses.
> > - fs/nfs/flexfilelayout/flexfilelayout.c
> > - lib/kunit/string-stream.c (although the preceding vsnprintf() looks like the actual bug).
> > There is also some very strange code in security/selinux/ima.c - but it may be ok.
> >
> > In reality the return value of strlcat() isn't really much worse than that
> > of snprintf().  
> 
> So we need strscat()? ;-)

Indeed...

-- David
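
The "required length" return value mentioned above can be shown with a small
stand-in. sketch_strlcat() is a hypothetical userspace reimplementation
following the usual BSD strlcat() semantics; it is not the kernel's fortified
version.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append src to dst, where size is the full size of dst. Returns the
 * length the result would have had with unlimited room, so a return
 * value >= size means the output was truncated.
 */
static size_t sketch_strlcat(char *dst, const char *src, size_t size)
{
	size_t dlen = 0;
	size_t slen = strlen(src);
	size_t room, n;

	/* Length of the existing string, bounded by the buffer size. */
	while (dlen < size && dst[dlen] != '\0')
		dlen++;
	if (dlen == size)	/* dst is not NUL-terminated within size */
		return size + slen;

	room = size - dlen - 1;
	n = slen < room ? slen : room;
	memcpy(dst + dlen, src, n);
	dst[dlen + n] = '\0';
	return dlen + slen;
}
```

Appending "defghij" to an 8-byte buffer holding "abc" leaves "abcdefg" in the
buffer but returns 10. A caller that uses the return value as an offset
without first comparing it against the buffer size ends up past the end of
the buffer, which is the kind of broken use listed above.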

> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 



* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Julian Orth @ 2026-05-18 12:58 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
	Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
	linux-media, linaro-mm-sig, linux-doc, wayland-devel,
	Michel Dänzer
In-Reply-To: <69dcbcc1-da58-4d34-bfb0-5c8d33b75d59@amd.com>

On Mon, May 18, 2026 at 2:41 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:02, Julian Orth wrote:
> > On Mon, May 18, 2026 at 1:58 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/16/26 13:06, Julian Orth wrote:
> >>> This series adds a new device /dev/syncobj that can be used to create
> >>> and manipulate DRM syncobjs. Previously, these operations required the
> >>> use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
> >>> and DRIVER_SYNCOBJ_TIMELINE features.
> >>>
> >>> There are several issues with the existing API:
> >>>
> >>> - Syncobjs are the only explicit sync mechanism available on wayland.
> >>>   Most compositors do not use GPU waits. Instead, they use the
> >>>   DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
> >>>   DRM devices means that compositors cannot consistently offer this
> >>>   feature even though no device-specific logic is involved.
> >>
> >> Well the drm_syncobj is a container for device specific dma fences.
> >
> > Not necessarily. The DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL ioctl attaches
> > some kind of dummy fence that is already signaled. I don't believe
> > this is device specific. That is also the path that llvmpipe would
> > use.
>
> Yeah I feared that.
>
> This is the wait before signal path and if I'm not completely mistaken that one is not supported by a lot of compositors.

I believe this is supported by all compositors.

>
> The last time I looked for GPU support the compositor needs to spawn a separate thread for each client to support this approach.
>
> It could be that we have eventfd integration for that as well now, but in that case you could give the compositor an eventfd instead of a drm_syncobj fd in the first place.

Yes, all compositors use the DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to wait
async for the timeline point to materialize and/or be signaled. The
wayland protocol was the motivation for that ioctl.

>
> So as far as I can see using drm_syncobj for software rendering really doesn't make sense, eventfd is a much better fit for that use case.

Using eventfd has some disadvantages:

- We've just added syncobj support to vulkan:
https://github.com/KhronosGroup/Vulkan-Docs/issues/2473#issuecomment-4446117280.
For eventfd we would not only have to add yet another extension, that
would realistically only be exposed by llvmpipe, but also every
compositor and every client would have to support both extensions.
- Similarly, a new wayland protocol would need to be designed to
support sync over eventfd.
- Eventfd does not support timeline semantics. Meaning that you would
have to send two eventfds over the wire for each commit, one for the
acquire point and one for the release point. Whereas with syncobj you
only need to send two integers per commit.

I don't see the advantage when drm_syncobj already does everything we need.

You seem to believe that compositors would not be ready for this and
from that perspective I can understand your apprehension. But I can
assure you that compositors are already fully set up to support all of
the usecases I've described: The wayland protocol requires the
compositor to support wait before signal.
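
The "two integers per commit" point can be illustrated with a toy model of
timeline semantics. This is a deliberate simplification and an assumption,
not how drm_syncobj is implemented: it ignores dma_fences entirely, and
struct toy_timeline, struct toy_commit and both helpers are hypothetical
names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy timeline: one shared object, identified by monotonic points. */
struct toy_timeline {
	uint64_t signaled_up_to;
};

/* What a commit carries: two points on a timeline shared in advance. */
struct toy_commit {
	uint64_t acquire_point;
	uint64_t release_point;
};

/* Signalling a point never moves the timeline backwards. */
static void toy_timeline_signal(struct toy_timeline *t, uint64_t point)
{
	if (point > t->signaled_up_to)
		t->signaled_up_to = point;
}

/* A point may be queried before anyone has signalled it. */
static bool toy_point_signaled(const struct toy_timeline *t, uint64_t point)
{
	return t->signaled_up_to >= point;
}
```

In this model a compositor can receive a commit whose acquire point has not
been signalled yet and simply sees toy_point_signaled() return false until
the client catches up, which is the wait-before-signal case discussed above.
An eventfd-based scheme would need a separate file descriptor for each of the
two points.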

>
> Regards,
> Christian.
>
> >
> >>
> >> What could be possible instead is to pass an eventfd into Wayland, but that is something userspace needs to decide.
> >>
> >>> - llvmpipe currently cannot offer syncobj interop because it does not
> >>>   have access to a DRM device. This means that applications using
> >>>   llvmpipe cannot present images before they have finished rendering,
> >>>   despite llvmpipe using threaded rendering.
> >>
> >> Yeah, but that is completely intentional. You *CAN'T* use a dma_fence as completion event for llvmpipe rendering. See the kernel documentation on that.
> >>
> >> What could be possible is to use the drm_syncobjs functionality to wait before signal, but that has different semantics.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
> >>>   for devices that support the syncobj ioctls in order to use the
> >>>   wayland syncobj protocol.
> >>> - Similarly, clients that want to use screen capture have no equivalent
> >>>   to the WSI and are therefore forced into that path.
> >>> - Having to keep a DRM device open has potentially negative interactions
> >>>   with GPU hotplug.
> >>> - Having to translate between syncobj FDs and handles is troublesome in
> >>>   the compositor usecase since syncobjs come and go frequently and need
> >>>   to be cleaned up when clients disconnect.
> >>>
> >>> /dev/syncobj solves these issues by providing all syncobj ioctls under a
> >>> consistent path that is not tied to any DRM device. It also operates
> >>> directly on file descriptors instead of syncobj handles.
> >>>
> >>> The series starts with a number of small refactorings in drm_syncobj.c
> >>> to make its functionality available outside of the file and without the
> >>> need for drm_file/handle pairs.
> >>>
> >>> The last commit adds the /dev/syncobj module. I've added it as a misc
> >>> device but maybe this should instead live somewhere under gpu/drm.
> >>>
> >>> An application using the new interface can be found at [1].
> >>>
> >>> [1]: https://github.com/mahkoh/jay/pull/947
> >>>
> >>> ---
> >>> Julian Orth (12):
> >>>       drm/syncobj: add drm_syncobj_from_fd
> >>>       drm/syncobj: add drm_syncobj_fence_lookup
> >>>       drm/syncobj: make drm_syncobj_array_wait_timeout public
> >>>       drm/syncobj: add drm_syncobj_register_eventfd
> >>>       drm/syncobj: have transfer functions accept drm_syncobj directly
> >>>       drm/syncobj: add drm_syncobj_transfer
> >>>       drm/syncobj: add drm_syncobj_timeline_signal
> >>>       drm/syncobj: add drm_syncobj_query
> >>>       drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
> >>>       drm/syncobj: add drm_syncobj_import_sync_file
> >>>       drm/syncobj: add drm_syncobj_export_sync_file
> >>>       misc/syncobj: add new device
> >>>
> >>>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
> >>>  drivers/gpu/drm/drm_syncobj.c                      | 374 ++++++++++++++-----
> >>>  drivers/misc/Kconfig                               |  10 +
> >>>  drivers/misc/Makefile                              |   1 +
> >>>  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
> >>>  include/drm/drm_syncobj.h                          |  21 ++
> >>>  include/uapi/linux/syncobj.h                       |  75 ++++
> >>>  7 files changed, 795 insertions(+), 91 deletions(-)
> >>> ---
> >>> base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
> >>> change-id: 20260516-jorth-syncobj-d4d374c8c61b
> >>>
> >>> Best regards,
> >>> --
> >>> Julian Orth <ju.orth@gmail.com>
> >>>
> >>
>


* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:50 UTC (permalink / raw)
  To: Christian König
  Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
	Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <208fb820-d8eb-4832-a343-ef8b360e8120@amd.com>

On Mon, May 18, 2026 at 9:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/15/26 19:06, T.J. Mercier wrote:
> > On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>
> >> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>> On embedded platforms a central process often allocates dma-buf
> >>> memory on behalf of client applications. Without a way to
> >>> attribute the charge to the requesting client's cgroup, the
> >>> cost lands on the allocator, making per-cgroup memory limits
> >>> ineffective for the actual consumers.
> >>>
> >>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>
> >> Please be aware that pidfds come in two flavors:
> >>
> >> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >> doesn't implicitly depend on this distinction not existing.
> >
> > Hi Christian,
> >
> > Memcg is not a controller that supports "thread mode" so all threads
> > in a group should belong to the same memcg.
>
> BTW: Exactly that is the requirement automotive has with their native context use case.
>
> The use case is that you have a daemon which has multiple threads where each one is acting on behalf of some other process.
>
> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>
> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.

Hi Christian,

Thanks for sharing this automotive use case. If I understand correctly,
the actual requirement is attributing dma-buf charges to the right
client, not putting each daemon thread in a different cgroup? If so,
the `charge_pid_fd` approach achieves this directly by passing the
client's `pid_fd`, without needing to add per-thread cgroup
infrastructure.
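
As a rough userspace illustration (a sketch only, not code from the
series), the allocator could obtain the pidfd it passes as
`charge_pid_fd` along these lines. The helper name open_charge_pidfd()
is hypothetical. PIDFD_THREAD needs a recent kernel, and the fallback
define assumes the uapi value (O_EXCL) for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for older uapi headers; the kernel defines PIDFD_THREAD as O_EXCL. */
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL
#endif

/*
 * Hypothetical helper: open a pidfd for @pid.
 * per_thread == 0: thread-group (process) pidfd.
 * per_thread != 0: thread-specific pidfd; @pid is then a TID.
 * Returns the pidfd, or -1 with errno set.
 */
int open_charge_pidfd(pid_t pid, int per_thread)
{
	unsigned int flags = per_thread ? PIDFD_THREAD : 0;

	return (int)syscall(SYS_pidfd_open, pid, flags);
}
```

The thread-group variant is the one the usage example in the commit
message relies on. Per the review comment quoted further down, the
lookup in this patch would reject a thread-specific pidfd.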

>
> Regards,
> Christian.
>
> >
> > Checking the flags from pidfd_get_pid would be the best way for an
> > explicit check of the pidfd type?
> >
> >>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> >>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> >>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> >>> the mem_accounting module parameter enabled, the buffer is charged
> >>> to the allocator's own cgroup.
> >>>
> >>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> >>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> >>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> >>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> >>> all accounting through a single MEMCG_DMABUF path.
> >>>
> >>> Usage examples:
> >>>
> >>>   1. Central allocator charging to a client at allocation time.
> >>>      The allocator knows the client's PID (e.g., from binder's
> >>>      sender_pid) and uses pidfd to attribute the charge:
> >>>
> >>>        pid_t client_pid = txn->sender_pid;
> >>>        int pidfd = pidfd_open(client_pid, 0);
> >>>
> >>>        struct dma_heap_allocation_data alloc = {
> >>>            .len             = buffer_size,
> >>>            .fd_flags        = O_RDWR | O_CLOEXEC,
> >>>            .charge_pid_fd   = pidfd,
> >>>        };
> >>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>>        close(pidfd);
> >>>        /* alloc.fd is now charged to client's cgroup */
> >>>
> >>>   2. Default allocation (no pidfd, mem_accounting=1).
> >>>      When charge_pid_fd is not set and the mem_accounting module
> >>>      parameter is enabled, the buffer is charged to the allocator's
> >>>      own cgroup:
> >>>
> >>>        struct dma_heap_allocation_data alloc = {
> >>>            .len      = buffer_size,
> >>>            .fd_flags = O_RDWR | O_CLOEXEC,
> >>>        };
> >>>        ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>>        /* charged to current process's cgroup */
> >>>
> >>> Current limitations:
> >>>
> >>>  - Single-owner model: a dma-buf carries one memcg charge regardless of
> >>>    how many processes share it. Means only the first owner (and exporter)
> >>>    of the shared buffer bears the charge.
> >>>  - Only memcg accounting supported. While this makes sense for system
> >>>    heap buffers, other heaps (e.g., CMA heaps) will require selectively
> >>>    charging also for the dmem controller.
> >>>
> >>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> >>> ---
> >>>  Documentation/admin-guide/cgroup-v2.rst |  5 ++--
> >>>  drivers/dma-buf/dma-buf.c               | 16 ++++---------
> >>>  drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
> >>>  drivers/dma-buf/heaps/system_heap.c     |  2 --
> >>>  include/uapi/linux/dma-heap.h           |  6 +++++
> >>>  5 files changed, 53 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>> index 8bdbc2e866430..824d269531eb1 100644
> >>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> >>>               structures.
> >>>
> >>>         dmabuf (npn)
> >>> -             Amount of memory used for exported DMA buffers allocated by the cgroup.
> >>> -             Stays with the allocating cgroup regardless of how the buffer is shared.
> >>> +             Amount of memory used for exported DMA buffers allocated by or on
> >>> +             behalf of the cgroup. Stays with the allocating cgroup regardless
> >>> +             of how the buffer is shared.
> >>>
> >>>         workingset_refault_anon
> >>>               Number of refaults of previously evicted anonymous pages.
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index ce02377f48908..23fb758b78297 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> >>>        */
> >>>       BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >>>
> >>> -     mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> -     mem_cgroup_put(dmabuf->memcg);
> >>> +     if (dmabuf->memcg) {
> >>> +             mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> >>> +                                       PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> +             mem_cgroup_put(dmabuf->memcg);
> >>> +     }
> >>>
> >>>       dmabuf->ops->release(dmabuf);
> >>>
> >>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>>               dmabuf->resv = resv;
> >>>       }
> >>>
> >>> -     dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> >>> -     if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> >>> -                                   GFP_KERNEL)) {
> >>> -             ret = -ENOMEM;
> >>> -             goto err_memcg;
> >>> -     }
> >>> -
> >>>       file->private_data = dmabuf;
> >>>       file->f_path.dentry->d_fsdata = dmabuf;
> >>>       dmabuf->file = file;
> >>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>>
> >>>       return dmabuf;
> >>>
> >>> -err_memcg:
> >>> -     mem_cgroup_put(dmabuf->memcg);
> >>>  err_file:
> >>>       fput(file);
> >>>  err_module:
> >>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> >>> index ac5f8685a6494..ff6e259afcdc0 100644
> >>> --- a/drivers/dma-buf/dma-heap.c
> >>> +++ b/drivers/dma-buf/dma-heap.c
> >>> @@ -7,13 +7,17 @@
> >>>   */
> >>>
> >>>  #include <linux/cdev.h>
> >>> +#include <linux/cgroup.h>
> >>>  #include <linux/device.h>
> >>>  #include <linux/dma-buf.h>
> >>>  #include <linux/dma-heap.h>
> >>> +#include <linux/memcontrol.h>
> >>> +#include <linux/sched/mm.h>
> >>>  #include <linux/err.h>
> >>>  #include <linux/export.h>
> >>>  #include <linux/list.h>
> >>>  #include <linux/nospec.h>
> >>> +#include <linux/pidfd.h>
> >>>  #include <linux/syscalls.h>
> >>>  #include <linux/uaccess.h>
> >>>  #include <linux/xarray.h>
> >>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> >>>                "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >>>
> >>>  static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>> -                              u32 fd_flags,
> >>> -                              u64 heap_flags)
> >>> +                              u32 fd_flags, u64 heap_flags,
> >>> +                              struct mem_cgroup *charge_to)
> >>>  {
> >>>       struct dma_buf *dmabuf;
> >>> +     unsigned int nr_pages;
> >>> +     struct mem_cgroup *memcg = charge_to;
> >>>       int fd;
> >>>
> >>>       /*
> >>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>>       if (IS_ERR(dmabuf))
> >>>               return PTR_ERR(dmabuf);
> >>>
> >>> +     nr_pages = len / PAGE_SIZE;
> >>> +
> >>> +     if (memcg)
> >>> +             css_get(&memcg->css);
> >>> +     else if (mem_accounting)
> >>> +             memcg = get_mem_cgroup_from_mm(current->mm);
> >>> +
> >>> +     if (memcg) {
> >>> +             if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> >>> +                     mem_cgroup_put(memcg);
> >>> +                     dma_buf_put(dmabuf);
> >>> +                     return -ENOMEM;
> >>> +             }
> >>> +             dmabuf->memcg = memcg;
> >>> +     }
> >>> +
> >>>       fd = dma_buf_fd(dmabuf, fd_flags);
> >>>       if (fd < 0) {
> >>>               dma_buf_put(dmabuf);
> >>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>>  {
> >>>       struct dma_heap_allocation_data *heap_allocation = data;
> >>>       struct dma_heap *heap = file->private_data;
> >>> +     struct mem_cgroup *memcg = NULL;
> >>> +     struct task_struct *task;
> >>> +     unsigned int pidfd_flags;
> >>>       int fd;
> >>>
> >>>       if (heap_allocation->fd)
> >>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>>       if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> >>>               return -EINVAL;
> >>>
> >>> +     if (heap_allocation->charge_pid_fd) {
> >>> +             task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> >>
> >> Will always get a thread-group leader pidfd and will fail if this is a
> >> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
> >> open a thread-specific pidfd.
> >>
> >>> +             if (IS_ERR(task))
> >>> +                     return PTR_ERR(task);
> >>> +
> >>> +             memcg = get_mem_cgroup_from_mm(task->mm);
> >>> +             put_task_struct(task);
> >>> +     }
> >>> +
> >>>       fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> >>>                                  heap_allocation->fd_flags,
> >>> -                                heap_allocation->heap_flags);
> >>> +                                heap_allocation->heap_flags,
> >>> +                                memcg);
> >>> +     mem_cgroup_put(memcg);
> >>>       if (fd < 0)
> >>>               return fd;
> >>>
> >>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> >>> index 03c2b87cb1112..95d7688167b93 100644
> >>> --- a/drivers/dma-buf/heaps/system_heap.c
> >>> +++ b/drivers/dma-buf/heaps/system_heap.c
> >>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> >>>               if (max_order < orders[i])
> >>>                       continue;
> >>>               flags = order_flags[i];
> >>> -             if (mem_accounting)
> >>> -                     flags |= __GFP_ACCOUNT;
> >>>               page = alloc_pages(flags, orders[i]);
> >>>               if (!page)
> >>>                       continue;
> >>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> >>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
> >>> --- a/include/uapi/linux/dma-heap.h
> >>> +++ b/include/uapi/linux/dma-heap.h
> >>> @@ -29,6 +29,10 @@
> >>>   *                   handle to the allocated dma-buf
> >>>   * @fd_flags:                file descriptor flags used when allocating
> >>>   * @heap_flags:              flags passed to heap
> >>> + * @charge_pid_fd:   optional pidfd of the process whose cgroup should be
> >>> + *                   charged for this allocation; 0 means charge the calling
> >>> + *                   process's cgroup
> >>> + * @__padding:               reserved, must be zero
> >>>   *
> >>>   * Provided by userspace as an argument to the ioctl
> >>>   */
> >>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> >>>       __u32 fd;
> >>>       __u32 fd_flags;
> >>>       __u64 heap_flags;
> >>> +     __u32 charge_pid_fd;
> >>> +     __u32 __padding;
> >>>  };
> >>>
> >>>  #define DMA_HEAP_IOC_MAGIC           'H'
> >>>
> >>> --
> >>> 2.53.0
> >>>
>



* Re: [PATCH] nios2: remove the architecture
From: Krzysztof Kozlowski @ 2026-05-18 12:50 UTC (permalink / raw)
  To: Ethan Nelson-Moore
  Cc: linux-doc, devicetree, workflows, linux-arch, dmaengine,
	linux-i2c, linux-iio, netdev, linux-pci, linux-pwm,
	linux-hardening, linux-kbuild, linux-csky, Jonathan Corbet,
	Shuah Khan, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Daniel Lezcano, Thomas Gleixner, Alex Shi, Yanteng Si,
	Dongliang Mu, Hu Haowen, Dinh Nguyen, Kees Cook, Oleg Nesterov,
	Will Deacon, Aneesh Kumar K.V, Andrew Morton, Nick Piggin,
	Peter Zijlstra, Vinod Koul, Frank Li, Dave Penkler, Andi Shyti,
	Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Lorenzo Pieralisi, Krzysztof Wilczyński
In-Reply-To: <20260518042833.272221-1-enelsonmoore@gmail.com>

On Sun, May 17, 2026 at 09:28:33PM -0700, Ethan Nelson-Moore wrote:
> The Nios II architecture is a soft-core architecture developed by
> Altera (since acquired by Intel) and intended to run on their FPGAs.
> 
> Licenses for the architecture have not been available for purchase
> since 2024 [1], and support for it has been removed from GCC 15 [2],
> Buildroot [3], and QEMU [4].
> 
> Given all of these factors, it is time to remove Nios II support from
> the kernel. The maintainer stated in 2024 that they were planning to do
> so soon [5], but this did not come to pass.
> 
> Remove Nios II support from the kernel and move the former maintainer
> to CREDITS. Thank you, Dinh Nguyen, for maintaining Nios II support!
> 
> References:
> [1] https://docs.altera.com/v/u/docs/781327/is-discontinuing-ip-ordering-codes-listed-in-pdn2312-for-nios-ii-ip
> [2] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e876acab6cdd84bb2b32c98fc69fb0ba29c81153
> [3] https://github.com/buildroot/buildroot/commit/6775ccc5a199d574ad70b5f79ec58cce97a07c6f
> [4] https://github.com/qemu/qemu/commit/6c3014858c4c0024dd0560f08a6eda0f92f658d6
> [5] https://sourceware.org/pipermail/newlib/2024/021083.html
> 
> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
> ---

Wearing DT hat:

Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>

Best regards,
Krzysztof



* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Wei Yang @ 2026-05-18 12:50 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
>The following series provides khugepaged with the capability to collapse
>anonymous memory regions to mTHPs.
>
>To achieve this we generalize the khugepaged functions to no longer depend
>on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>pages that are occupied (!none/zero). After the PMD scan is done, we use
>the bitmap to find the optimal mTHP sizes for the PMD range. The
>restriction on max_ptes_none is removed during the scan, to make sure we
>account for the whole PMD range in the bitmap. When no mTHP size is
>enabled, the legacy behavior of khugepaged is maintained.
>
>We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
>(ie 511). If any other value is specified, the kernel will emit a warning
>and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
>but contains swapped out, or shared pages, we don't perform the collapse.
>It is now also possible to collapse to mTHPs without requiring the PMD THP
>size to be enabled. These limitations are to prevent collapse "creep"
>behavior. This prevents constantly promoting mTHPs to the next available
>size, which would occur because a collapse introduces more non-zero pages
>that would satisfy the promotion condition on subsequent scans.
>
>Patch 1-2:   Generalize hugepage_vma_revalidate and alloc_charge_folio
>	     for arbitrary orders.
>Patch 3:     Rework max_ptes_* handling into helper functions
>Patch 4:     Generalize __collapse_huge_page_* for mTHP support
>Patch 5:     Require collapse_huge_page to enter/exit with the lock dropped
>Patch 6:     Generalize collapse_huge_page for mTHP collapse
>Patch 7:     Skip collapsing mTHP to smaller orders
>Patch 8-9:   Add per-order mTHP statistics and tracepoints
>Patch 10:    Introduce collapse_allowable_orders helper function
>Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
>Patch 14:    Documentation
>
>Testing:
>- Built for x86_64, aarch64, ppc64le, and s390x
>- ran all arches on test suites provided by the kernel-tests project
>- internal testing suites: functional testing and performance testing
>- selftests mm
>- I created a test script that I used to push khugepaged to its limits
>   while monitoring a number of stats and tracepoints. The code is
>   available here[1] (Run in legacy mode for these changes and set mthp
>   sizes to inherit)
>   The summary from my testings was that there was no significant
>   regression noticed through this test. In some cases my changes had
>   better collapse latencies, and was able to scan more pages in the same
>   amount of time/work, but for the most part the results were consistent.
>- redis testing. I did some testing with these changes along with my defer
>  changes (see followup [2] post for more details). We've decided to get
>  the mTHP changes merged first before attempting the defer series.
>- some basic testing on 64k page size.
>- lots of general use.
>

Two links are missing. I got them from the previous version.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/

The test in [1] is a performance test. I am wondering whether we also
want a functional test in selftests.

I gave it a quick try with the following change and some hacks.

@@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
 	ksft_test_result_report(exit_status, "%s\n", __func__);
 }
 
+static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
+{
+	struct thp_settings settings = *thp_current_settings();
+	void *p;
+	int i;
+
+	/* Disable mthp on fault */
+	for (i = 0; i < NR_ORDERS; i++) {
+		settings.hugepages[i].enabled = THP_NEVER;
+	}
+	thp_push_settings(&settings);
+
+	p = ops->setup_area(1);
+
+	ops->fault(p, 0, hpage_pmd_size);
+
+	/* Expect all order-0 folio after fault */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[0] = hpage_pmd_nr;
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+					   kpageflags_fd, expected_orders,
+					   (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected huge page at fault\n");
+
+	/* Enable mthp before collapse */
+	thp_pop_settings();
+	settings.hugepages[2].enabled = THP_ALWAYS;
+	thp_push_settings(&settings);
+
+	c->collapse("Collapse fully populated PTE table with order 2", p, 1,
+		    ops, true);
+
+	/* Expect all order-2 folio after collapse */
+	memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+	expected_orders[2] = 1 << (pmd_order - 2);
+	if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+					   kpageflags_fd, expected_orders,
+					   (pmd_order + 1)))
+		ksft_exit_fail_msg("Unexpected page order\n");
+
+	ops->cleanup_area(p, hpage_pmd_size);
+	thp_pop_settings();
+	ksft_test_result_report(exit_status, "%s\n", __func__);
+}
+
 static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
 {
 	void *p;

This leverages check_after_split_folio_orders() in split_huge_page_test.c to
check the folio orders in the PMD range.
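
The expected counts in the test follow from simple arithmetic, which can
be sketched in isolation as below. The helper names are hypothetical and
not part of the selftests. The numbers in the note assume a PMD order of
9 (for example x86_64 with 4K base pages).

```c
/*
 * Hypothetical helper mirroring the arithmetic in the test:
 * a PMD range holds 1 << pmd_order base pages, so it holds
 * 1 << (pmd_order - order) folios of the given order.
 */
unsigned long folios_per_pmd(unsigned int pmd_order, unsigned int order)
{
	return 1UL << (pmd_order - order);
}

/* Fill expected[0..pmd_order] so that only @order has a non-zero count. */
void fill_expected_orders(int *expected, unsigned int pmd_order,
			  unsigned int order)
{
	for (unsigned int i = 0; i <= pmd_order; i++)
		expected[i] = 0;
	expected[order] = (int)folios_per_pmd(pmd_order, order);
}
```

With a PMD order of 9, this gives 512 order-0 folios after the fault and
128 order-2 folios after the collapse.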

-- 
Wei Yang
Help you, Help me


* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Christian König @ 2026-05-18 12:41 UTC (permalink / raw)
  To: Julian Orth
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
	Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
	linux-media, linaro-mm-sig, linux-doc, wayland-devel,
	Michel Dänzer
In-Reply-To: <CAHijbEUzWZC4GAMU6YGV42gOYkrQaMZZPiwS4Erb4H1J-fh_8Q@mail.gmail.com>

On 5/18/26 14:02, Julian Orth wrote:
> On Mon, May 18, 2026 at 1:58 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/16/26 13:06, Julian Orth wrote:
>>> This series adds a new device /dev/syncobj that can be used to create
>>> and manipulate DRM syncobjs. Previously, these operations required the
>>> use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
>>> and DRIVER_SYNCOBJ_TIMELINE features.
>>>
>>> There are several issues with the existing API:
>>>
>>> - Syncobjs are the only explicit sync mechanism available on wayland.
>>>   Most compositors do not use GPU waits. Instead, they use the
>>>   DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
>>>   DRM devices means that compositors cannot consistently offer this
>>>   feature even though no device-specific logic is involved.
>>
>> Well the drm_syncobj is a container for device specific dma fences.
> 
> Not necessarily. The DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL ioctl attaches
> some kind of dummy fence that is already signaled. I don't believe
> this is device specific. That is also the path that llvmpipe would
> use.

Yeah I feared that.

This is the wait before signal path and if I'm not completely mistaken that one is not supported by a lot of compositors.

The last time I looked for GPU support the compositor needs to spawn a separate thread for each client to support this approach.

It could be that we have eventfd integration for that as well now, but in that case you could give the compositor an eventfd instead of a drm_syncobj fd in the first place.

So as far as I can see using drm_syncobj for software rendering really doesn't make sense, eventfd is a much better fit for that use case.
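
To make the eventfd alternative concrete, here is a minimal sketch of
completion signaling between a software renderer and a consumer using
nothing but an eventfd. The function names are hypothetical and this is
not an existing Wayland protocol; it only shows the kernel primitive.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the completion object; returns an eventfd or -1. */
int frame_done_create(void)
{
	return eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
}

/* Renderer side: mark the frame finished. Returns 0 on success. */
int frame_done_signal(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: wait up to timeout_ms for the frame.
 * Returns 1 if signaled, 0 on timeout, -1 on error.
 */
int frame_done_wait(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A compositor would normally add the fd to its event loop rather than
call a blocking wait; the poll() here just stands in for that.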

Regards,
Christian.

> 
>>
>> What could be possible instead is to pass an eventfd into Wayland, but that is something userspace needs to decide.
>>
>>> - llvmpipe currently cannot offer syncobj interop because it does not
>>>   have access to a DRM device. This means that applications using
>>>   llvmpipe cannot present images before they have finished rendering,
>>>   despite llvmpipe using threaded rendering.
>>
>> Yeah, but that is completely intentional. You *CAN'T* use a dma_fence as completion event for llvmpipe rendering. See the kernel documentation on that.
>>
>> What could be possible is to use the drm_syncobjs functionality to wait before signal, but that has different semantics.
>>
>> Regards,
>> Christian.
>>
>>> - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
>>>   for devices that support the syncobj ioctls in order to use the
>>>   wayland syncobj protocol.
>>> - Similarly, clients that want to use screen capture have no equivalent
>>>   to the WSI and are therefore forced into that path.
>>> - Having to keep a DRM device open has potentially negative interactions
>>>   with GPU hotplug.
>>> - Having to translate between syncobj FDs and handles is troublesome in
>>>   the compositor usecase since syncobjs come and go frequently and need
>>>   to be cleaned up when clients disconnect.
>>>
>>> /dev/syncobj solves these issues by providing all syncobj ioctls under a
>>> consistent path that is not tied to any DRM device. It also operates
>>> directly on file descriptors instead of syncobj handles.
>>>
>>> The series starts with a number of small refactorings in drm_syncobj.c
>>> to make its functionality available outside of the file and without the
>>> need for drm_file/handle pairs.
>>>
>>> The last commit adds the /dev/syncobj module. I've added it as a misc
>>> device but maybe this should instead live somewhere under gpu/drm.
>>>
>>> An application using the new interface can be found at [1].
>>>
>>> [1]: https://github.com/mahkoh/jay/pull/947
>>>
>>> ---
>>> Julian Orth (12):
>>>       drm/syncobj: add drm_syncobj_from_fd
>>>       drm/syncobj: add drm_syncobj_fence_lookup
>>>       drm/syncobj: make drm_syncobj_array_wait_timeout public
>>>       drm/syncobj: add drm_syncobj_register_eventfd
>>>       drm/syncobj: have transfer functions accept drm_syncobj directly
>>>       drm/syncobj: add drm_syncobj_transfer
>>>       drm/syncobj: add drm_syncobj_timeline_signal
>>>       drm/syncobj: add drm_syncobj_query
>>>       drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
>>>       drm/syncobj: add drm_syncobj_import_sync_file
>>>       drm/syncobj: add drm_syncobj_export_sync_file
>>>       misc/syncobj: add new device
>>>
>>>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>>>  drivers/gpu/drm/drm_syncobj.c                      | 374 ++++++++++++++-----
>>>  drivers/misc/Kconfig                               |  10 +
>>>  drivers/misc/Makefile                              |   1 +
>>>  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
>>>  include/drm/drm_syncobj.h                          |  21 ++
>>>  include/uapi/linux/syncobj.h                       |  75 ++++
>>>  7 files changed, 795 insertions(+), 91 deletions(-)
>>> ---
>>> base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
>>> change-id: 20260516-jorth-syncobj-d4d374c8c61b
>>>
>>> Best regards,
>>> --
>>> Julian Orth <ju.orth@gmail.com>
>>>
>>



* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:16 UTC (permalink / raw)
  To: Barry Song
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <CAGsJ_4xfznffbjOaNKwnN6oZk_H6pqOzYqd1zx4Q9XrocdzV8A@mail.gmail.com>

On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> [...]
>
> > -               if (mem_accounting)
> > -                       flags |= __GFP_ACCOUNT;
>
> Hi Albert,
>
> would it be better to move this and its description to patch 1? It
> looks like patch 1 already introduces the double accounting changes,
> and patch 2 is mainly just supporting remote charging.

Hi Barry,

Thanks for looking into this series! Yes, in my head I was trying to
keep patch 1, which was taken from a previous, different series, and
then diverge from it starting with patch 2. This would clarify the
difference between the two. But I can see it just added some confusion
(for example, patch 1 charges on dma_buf_export() and then it is moved
to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
for the next version, including your suggestion.

>
> Also, mem_accounting is only used by system_heap.c; has this patchset
> also eliminated its need?

No, mem_accounting is still handled in this patch for the general case
where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:

+       if (memcg)
+               css_get(&memcg->css);
+       else if (mem_accounting)
+               memcg = get_mem_cgroup_from_mm(current->mm);

>
> Thanks
> Barry
>



* Re: [PATCH 12/12] misc/syncobj: add new device
From: Julian Orth @ 2026-05-18 12:10 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
	Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
	linux-media, linaro-mm-sig, linux-doc, wayland-devel
In-Reply-To: <8602a990-e557-45e3-8b3a-f9e6aaa00e0d@amd.com>

On Mon, May 18, 2026 at 2:06 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 13:06, Julian Orth wrote:
> > This device makes the DRM_IOCTL_SYNCOBJ_* ioctls available via a
> > dedicated device. This allows applications to use syncobjs without
> > having to open device nodes in /dev/dri, on systems that don't have any
> > such nodes, or on systems whose devices don't support the
> > DRIVER_SYNCOBJ_TIMELINE feature.
> >
> > Wayland uses syncobjs as its buffer synchronization mechanism. Most
> > compositors use the DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a pure
> > CPU wait for syncobj point. DRM devices are not involved in this process
> > except insofar that a DRM device needs to be used to access the ioctl.
> >
> > Similarly, a software-rendered client might perform rendering on a
> > dedicated thread and use the wayland syncobj protocol to submit frames
> > before they finish rendering. Again, this does not involve DRM devices
> > except insofar ... as above.
>
> That use case is invalid.
>
> Usually drm_syncobj can only be filled with dma_fence objects and it is impossible to create one of those for software rendering.

That is simply not true. As I wrote above,
DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL can be used with software rendering.

>
> What could be used is the drm_syncobj wait before signal functionality, but that usually requires special handling on the Wayland/Compositor side which as far as I can see doesn't make sense here either.

Commit (to wayland) before submit (rendering work) is fully supported
by the wayland syncobj protocol. No work needs to be done on the
wayland side. In fact, everything that this series enables can already
be done today by opening random /dev/dri nodes until you find one that
supports the syncobj timeline ioctls. This series just makes it
easier.

>
> So the justification to use this for software rendering is very weak. Either I'm missing something or that is not going to fly at all.
>
> Regards,
> Christian.
>
> >
> > As an added benefit, this device removes the need to translate between
> > file descriptors and handles.
> >
> > Signed-off-by: Julian Orth <ju.orth@gmail.com>
> > ---
> >  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
> >  drivers/misc/Kconfig                               |  10 +
> >  drivers/misc/Makefile                              |   1 +
> >  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
> >  include/uapi/linux/syncobj.h                       |  75 ++++
> >  5 files changed, 491 insertions(+)
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 331223761fff..5e140ae5735e 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -395,6 +395,7 @@ Code  Seq#    Include File                                             Comments
> >                                                                         <mailto:michael.klein@puffin.lb.shuttle.de>
> >  0xCC  00-0F  drivers/misc/ibmvmc.h                                     pseries VMC driver
> >  0xCD  01     linux/reiserfs_fs.h                                       Dead since 6.13
> > +0xCD  00-0F  uapi/linux/syncobj.h
> >  0xCE  01-02  uapi/linux/cxl_mem.h                                      Compute Express Link Memory Devices
> >  0xCF  02     fs/smb/client/cifs_ioctl.h
> >  0xDD  00-3F                                                            ZFCP device driver see drivers/s390/scsi/
> > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > index 00683bf06258..c1e7749bd356 100644
> > --- a/drivers/misc/Kconfig
> > +++ b/drivers/misc/Kconfig
> > @@ -644,6 +644,16 @@ config MCHP_LAN966X_PCI
> >           - lan966x-miim (MDIO_MSCC_MIIM)
> >           - lan966x-switch (LAN966X_SWITCH)
> >
> > +config SYNCOBJ_DEV
> > +     tristate "DRM syncobj device (/dev/syncobj)"
> > +     depends on DRM
> > +     help
> > +       Creates a /dev/syncobj device node that provides DRM synchronization
> > +       objects (syncobjs) without requiring a DRM device.
> > +
> > +       To compile this driver as a module, choose M here: the module
> > +       will be called syncobj.
> > +
> >  source "drivers/misc/c2port/Kconfig"
> >  source "drivers/misc/eeprom/Kconfig"
> >  source "drivers/misc/cb710/Kconfig"
> > diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> > index b32a2597d246..9e5deb1d0d76 100644
> > --- a/drivers/misc/Makefile
> > +++ b/drivers/misc/Makefile
> > @@ -75,3 +75,4 @@ obj-$(CONFIG_MCHP_LAN966X_PCI)      += lan966x-pci.o
> >  obj-y                                += keba/
> >  obj-y                                += amd-sbi/
> >  obj-$(CONFIG_MISC_RP1)               += rp1/
> > +obj-$(CONFIG_SYNCOBJ_DEV)    += syncobj.o
> > diff --git a/drivers/misc/syncobj.c b/drivers/misc/syncobj.c
> > new file mode 100644
> > index 000000000000..11ef46ddfeef
> > --- /dev/null
> > +++ b/drivers/misc/syncobj.c
> > @@ -0,0 +1,404 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * syncobj.c - Standalone device for syncobj manipulation.
> > + *
> > + * Copyright (C) 2026 Julian Orth <ju.orth@gmail.com>
> > + */
> > +
> > +#include <linux/fdtable.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/module.h>
> > +#include <linux/uaccess.h>
> > +#include <drm/drm_syncobj.h>
> > +#include <drm/drm_utils.h>
> > +#include <uapi/drm/drm.h>
> > +#include <uapi/linux/syncobj.h>
> > +
> > +static int syncobj_array_find(void __user *user_fds, u32 count,
> > +                           struct drm_syncobj ***syncobjs_out)
> > +{
> > +     u32 i;
> > +     s32 *fds;
> > +     struct drm_syncobj **syncobjs;
> > +     int ret;
> > +
> > +     fds = kmalloc_array(count, sizeof(*fds), GFP_KERNEL);
> > +     if (!fds)
> > +             return -ENOMEM;
> > +
> > +     if (copy_from_user(fds, user_fds, sizeof(s32) * count)) {
> > +             ret = -EFAULT;
> > +             goto err_free_fds;
> > +     }
> > +
> > +     syncobjs = kmalloc_array(count, sizeof(*syncobjs), GFP_KERNEL);
> > +     if (!syncobjs) {
> > +             ret = -ENOMEM;
> > +             goto err_free_fds;
> > +     }
> > +
> > +     for (i = 0; i < count; i++) {
> > +             syncobjs[i] = drm_syncobj_from_fd(fds[i]);
> > +             if (!syncobjs[i]) {
> > +                     ret = -EBADF;
> > +                     goto err_put_syncobjs;
> > +             }
> > +     }
> > +
> > +     kfree(fds);
> > +     *syncobjs_out = syncobjs;
> > +     return 0;
> > +
> > +err_put_syncobjs:
> > +     while (i-- > 0)
> > +             drm_syncobj_put(syncobjs[i]);
> > +     kfree(syncobjs);
> > +err_free_fds:
> > +     kfree(fds);
> > +     return ret;
> > +}
> > +
> > +static void syncobj_array_free(struct drm_syncobj **syncobjs, u32 count)
> > +{
> > +     u32 i;
> > +
> > +     for (i = 0; i < count; i++)
> > +             drm_syncobj_put(syncobjs[i]);
> > +     kfree(syncobjs);
> > +}
> > +
> > +static int syncobj_ioctl_create(void __user *argp)
> > +{
> > +     struct syncobj_create_args args;
> > +     struct drm_syncobj *syncobj;
> > +     int fd, ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags & ~SYNCOBJ_CREATE_SIGNALED)
> > +             return -EINVAL;
> > +
> > +     static_assert(SYNCOBJ_CREATE_SIGNALED == DRM_SYNCOBJ_CREATE_SIGNALED);
> > +
> > +     ret = drm_syncobj_create(&syncobj, args.flags, NULL);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = drm_syncobj_get_fd(syncobj, &fd);
> > +     drm_syncobj_put(syncobj);
> > +     if (ret)
> > +             return ret;
> > +
> > +     args.fd = fd;
> > +     if (copy_to_user(argp, &args, sizeof(args))) {
> > +             close_fd(fd);
> > +             return -EFAULT;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int syncobj_ioctl_wait(void __user *argp)
> > +{
> > +     struct syncobj_wait_args args;
> > +     struct drm_syncobj **syncobjs;
> > +     signed long timeout;
> > +     u32 first = ~0;
> > +     ktime_t t, *tp = NULL;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags & ~(SYNCOBJ_WAIT_FLAGS_WAIT_ALL |
> > +                        SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
> > +                        SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE |
> > +                        SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE))
> > +             return -EINVAL;
> > +
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_ALL        == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL);
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT);
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE  == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE);
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE   == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE);
> > +
> > +     if (args.pad)
> > +             return -EINVAL;
> > +
> > +     if (args.count == 0)
> > +             return 0;
> > +
> > +     ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> > +                              args.count, &syncobjs);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (args.flags & SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE) {
> > +             t = ns_to_ktime(args.deadline_nsec);
> > +             tp = &t;
> > +     }
> > +
> > +     timeout = drm_timeout_abs_to_jiffies(args.timeout_nsec);
> > +     timeout = drm_syncobj_array_wait_timeout(syncobjs,
> > +                                              u64_to_user_ptr(args.points),
> > +                                              args.count,
> > +                                              args.flags,
> > +                                              timeout, &first, tp);
> > +
> > +     syncobj_array_free(syncobjs, args.count);
> > +
> > +     if (timeout < 0)
> > +             return timeout;
> > +
> > +     args.first_signaled = first;
> > +     if (copy_to_user(argp, &args, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     return 0;
> > +}
> > +
> > +static int syncobj_ioctl_reset(void __user *argp)
> > +{
> > +     struct syncobj_array_args args;
> > +     struct drm_syncobj **syncobjs;
> > +     u32 i;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags)
> > +             return -EINVAL;
> > +
> > +     if (args.points)
> > +             return -EINVAL;
> > +
> > +     if (args.count == 0)
> > +             return -EINVAL;
> > +
> > +     ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> > +                              args.count, &syncobjs);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     for (i = 0; i < args.count; i++)
> > +             drm_syncobj_replace_fence(syncobjs[i], NULL);
> > +
> > +     syncobj_array_free(syncobjs, args.count);
> > +     return 0;
> > +}
> > +
> > +static int syncobj_ioctl_signal(void __user *argp)
> > +{
> > +     struct syncobj_array_args args;
> > +     struct drm_syncobj **syncobjs;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags)
> > +             return -EINVAL;
> > +
> > +     if (args.count == 0)
> > +             return -EINVAL;
> > +
> > +     ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> > +                              args.count, &syncobjs);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     ret = drm_syncobj_timeline_signal(syncobjs, args.points, args.count);
> > +
> > +     syncobj_array_free(syncobjs, args.count);
> > +     return ret;
> > +}
> > +
> > +static int syncobj_ioctl_query(void __user *argp)
> > +{
> > +     struct syncobj_array_args args;
> > +     struct drm_syncobj **syncobjs;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags & ~SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED)
> > +             return -EINVAL;
> > +
> > +     static_assert(SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED == DRM_SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED);
> > +
> > +     if (args.count == 0)
> > +             return -EINVAL;
> > +
> > +     ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> > +                              args.count, &syncobjs);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     ret = drm_syncobj_query(syncobjs, args.points, args.count, args.flags);
> > +
> > +     syncobj_array_free(syncobjs, args.count);
> > +     return ret;
> > +}
> > +
> > +static int syncobj_ioctl_transfer(void __user *argp)
> > +{
> > +     struct syncobj_transfer_args args;
> > +     struct drm_syncobj *src, *dst;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.pad)
> > +             return -EINVAL;
> > +
> > +     if (args.flags & ~SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT)
> > +             return -EINVAL;
> > +
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT);
> > +
> > +     src = drm_syncobj_from_fd(args.src_fd);
> > +     if (!src)
> > +             return -EBADF;
> > +
> > +     dst = drm_syncobj_from_fd(args.dst_fd);
> > +     if (!dst) {
> > +             drm_syncobj_put(src);
> > +             return -EBADF;
> > +     }
> > +
> > +     ret = drm_syncobj_transfer(src, args.src_point,
> > +                                dst, args.dst_point, args.flags);
> > +
> > +     drm_syncobj_put(dst);
> > +     drm_syncobj_put(src);
> > +
> > +     return ret;
> > +}
> > +
> > +static int syncobj_ioctl_eventfd(void __user *argp)
> > +{
> > +     struct syncobj_eventfd_args args;
> > +     struct drm_syncobj *syncobj;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     if (args.flags & ~SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE)
> > +             return -EINVAL;
> > +
> > +     static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE);
> > +
> > +     if (args.pad)
> > +             return -EINVAL;
> > +
> > +     syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> > +     if (!syncobj)
> > +             return -EBADF;
> > +
> > +     ret = drm_syncobj_register_eventfd(syncobj, args.eventfd,
> > +                                        args.point, args.flags);
> > +
> > +     drm_syncobj_put(syncobj);
> > +
> > +     return ret;
> > +}
> > +
> > +static int syncobj_ioctl_export_sync_file(void __user *argp)
> > +{
> > +     struct syncobj_sync_file_args args;
> > +     struct drm_syncobj *syncobj;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> > +     if (!syncobj)
> > +             return -EBADF;
> > +
> > +     ret = drm_syncobj_export_sync_file(syncobj, args.point,
> > +                                        &args.sync_file_fd);
> > +     drm_syncobj_put(syncobj);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if (copy_to_user(argp, &args, sizeof(args))) {
> > +             close_fd(args.sync_file_fd);
> > +             return -EFAULT;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static int syncobj_ioctl_import_sync_file(void __user *argp)
> > +{
> > +     struct syncobj_sync_file_args args;
> > +     struct drm_syncobj *syncobj;
> > +     int ret;
> > +
> > +     if (copy_from_user(&args, argp, sizeof(args)))
> > +             return -EFAULT;
> > +
> > +     syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> > +     if (!syncobj)
> > +             return -EBADF;
> > +
> > +     ret = drm_syncobj_import_sync_file(syncobj, args.sync_file_fd,
> > +                                        args.point);
> > +
> > +     drm_syncobj_put(syncobj);
> > +
> > +     return ret;
> > +}
> > +
> > +static long syncobj_dev_ioctl(struct file *file, unsigned int cmd,
> > +                           unsigned long arg)
> > +{
> > +     void __user *argp = (void __user *)arg;
> > +
> > +     switch (cmd) {
> > +     case SYNCOBJ_IOC_CREATE:
> > +             return syncobj_ioctl_create(argp);
> > +     case SYNCOBJ_IOC_WAIT:
> > +             return syncobj_ioctl_wait(argp);
> > +     case SYNCOBJ_IOC_RESET:
> > +             return syncobj_ioctl_reset(argp);
> > +     case SYNCOBJ_IOC_SIGNAL:
> > +             return syncobj_ioctl_signal(argp);
> > +     case SYNCOBJ_IOC_QUERY:
> > +             return syncobj_ioctl_query(argp);
> > +     case SYNCOBJ_IOC_TRANSFER:
> > +             return syncobj_ioctl_transfer(argp);
> > +     case SYNCOBJ_IOC_EVENTFD:
> > +             return syncobj_ioctl_eventfd(argp);
> > +     case SYNCOBJ_IOC_EXPORT_SYNC_FILE:
> > +             return syncobj_ioctl_export_sync_file(argp);
> > +     case SYNCOBJ_IOC_IMPORT_SYNC_FILE:
> > +             return syncobj_ioctl_import_sync_file(argp);
> > +     default:
> > +             return -ENOIOCTLCMD;
> > +     }
> > +}
> > +
> > +static const struct file_operations syncobj_dev_fops = {
> > +     .owner          = THIS_MODULE,
> > +     .unlocked_ioctl = syncobj_dev_ioctl,
> > +     .compat_ioctl   = compat_ptr_ioctl,
> > +};
> > +
> > +static struct miscdevice syncobj_misc = {
> > +     .minor  = MISC_DYNAMIC_MINOR,
> > +     .name   = "syncobj",
> > +     .fops   = &syncobj_dev_fops,
> > +     .mode   = 0666,
> > +};
> > +
> > +module_misc_device(syncobj_misc);
> > +
> > +MODULE_AUTHOR("Julian Orth");
> > +MODULE_DESCRIPTION("DRM syncobj device");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/syncobj.h b/include/uapi/linux/syncobj.h
> > new file mode 100644
> > index 000000000000..c4068fbd5773
> > --- /dev/null
> > +++ b/include/uapi/linux/syncobj.h
> > @@ -0,0 +1,75 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_SYNCOBJ_H_
> > +#define _UAPI_LINUX_SYNCOBJ_H_
> > +
> > +#include <linux/ioctl.h>
> > +#include <linux/types.h>
> > +
> > +#define SYNCOBJ_CREATE_SIGNALED                      (1 << 0)
> > +
> > +#define SYNCOBJ_WAIT_FLAGS_WAIT_ALL          (1 << 0)
> > +#define SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT   (1 << 1)
> > +#define SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE    (1 << 2)
> > +#define SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE     (1 << 3)
> > +
> > +#define SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED   (1 << 0)
> > +
> > +struct syncobj_create_args {
> > +     __s32 fd;
> > +     __u32 flags;
> > +};
> > +
> > +struct syncobj_wait_args {
> > +     __u64 fds;
> > +     __u64 points;
> > +     __s64 timeout_nsec;
> > +     __u32 count;
> > +     __u32 flags;
> > +     __u32 first_signaled;
> > +     __u32 pad;
> > +     __u64 deadline_nsec;
> > +};
> > +
> > +struct syncobj_array_args {
> > +     __u64 fds;
> > +     __u64 points;
> > +     __u32 count;
> > +     __u32 flags;
> > +};
> > +
> > +struct syncobj_transfer_args {
> > +     __s32 src_fd;
> > +     __s32 dst_fd;
> > +     __u64 src_point;
> > +     __u64 dst_point;
> > +     __u32 flags;
> > +     __u32 pad;
> > +};
> > +
> > +struct syncobj_eventfd_args {
> > +     __s32 syncobj_fd;
> > +     __s32 eventfd;
> > +     __u64 point;
> > +     __u32 flags;
> > +     __u32 pad;
> > +};
> > +
> > +struct syncobj_sync_file_args {
> > +     __s32 syncobj_fd;
> > +     __s32 sync_file_fd;
> > +     __u64 point;
> > +};
> > +
> > +#define SYNCOBJ_IOC_BASE             0xCD
> > +
> > +#define SYNCOBJ_IOC_CREATE           _IOWR(SYNCOBJ_IOC_BASE, 0, struct syncobj_create_args)
> > +#define SYNCOBJ_IOC_WAIT             _IOWR(SYNCOBJ_IOC_BASE, 1, struct syncobj_wait_args)
> > +#define SYNCOBJ_IOC_RESET            _IOW(SYNCOBJ_IOC_BASE,  2, struct syncobj_array_args)
> > +#define SYNCOBJ_IOC_SIGNAL           _IOW(SYNCOBJ_IOC_BASE,  3, struct syncobj_array_args)
> > +#define SYNCOBJ_IOC_QUERY            _IOW(SYNCOBJ_IOC_BASE,  4, struct syncobj_array_args)
> > +#define SYNCOBJ_IOC_TRANSFER         _IOW(SYNCOBJ_IOC_BASE,  5, struct syncobj_transfer_args)
> > +#define SYNCOBJ_IOC_EVENTFD          _IOW(SYNCOBJ_IOC_BASE,  6, struct syncobj_eventfd_args)
> > +#define SYNCOBJ_IOC_EXPORT_SYNC_FILE _IOWR(SYNCOBJ_IOC_BASE, 7, struct syncobj_sync_file_args)
> > +#define SYNCOBJ_IOC_IMPORT_SYNC_FILE _IOW(SYNCOBJ_IOC_BASE,  8, struct syncobj_sync_file_args)
> > +
> > +#endif /* _UAPI_LINUX_SYNCOBJ_H_ */
> >
>
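
As a side note on the uAPI header quoted above, the request numbers can
be sanity-checked from userspace. The following is only a sketch: the
struct and the two macros are copied from the proposed header (with
fixed-width userspace types standing in for __s32/__u32), while
ioc_fields and ioc_decode() are hypothetical helpers that exist only in
this example.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

/* Copied from the proposed include/uapi/linux/syncobj.h. */
struct syncobj_create_args {
	int32_t fd;
	uint32_t flags;
};

#define SYNCOBJ_IOC_BASE   0xCD
#define SYNCOBJ_IOC_CREATE _IOWR(SYNCOBJ_IOC_BASE, 0, struct syncobj_create_args)

/* The size is part of the request number, so the layout must not change. */
static_assert(sizeof(struct syncobj_create_args) == 8,
	      "syncobj_create_args is expected to be 8 bytes");

/* Decoded view of an ioctl request number (illustrative helper type). */
struct ioc_fields {
	unsigned int dir;
	unsigned int type;
	unsigned int nr;
	unsigned int size;
};

/* Splits a request number into direction, type ("magic"), number and size. */
static struct ioc_fields ioc_decode(unsigned long cmd)
{
	struct ioc_fields f = {
		.dir  = _IOC_DIR(cmd),
		.type = _IOC_TYPE(cmd),
		.nr   = _IOC_NR(cmd),
		.size = _IOC_SIZE(cmd),
	};

	return f;
}
```

Decoding SYNCOBJ_IOC_CREATE this way should give type 0xCD, number 0, an
8-byte argument and both the read and write direction bits set, which
matches the 0xCD 00-0F range added to ioctl-number.rst.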

^ permalink raw reply

* Re: [Linaro-mm-sig] Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:06 UTC (permalink / raw)
  To: Christian König
  Cc: Barry Song, T.J. Mercier, Tejun Heo, Johannes Weiner,
	Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
	linux-kernel, linux-media, dri-, linaro-mm-sig, linux-mm,
	linux-security-module, selinux, linux-kselftest, mripard,
	echanude
In-Reply-To: <cb84c2ee-9de1-4565-b2e0-60984721228f@amd.com>

On Mon, May 18, 2026 at 9:34 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 11:19, Barry Song wrote:
> > On Thu, May 14, 2026 at 12:35 AM T.J. Mercier <tjmercier@google.com> wrote:
> > [...]
> >>>> I have a question about this part. Albert I guess you are interested
> >>>> only in accounting dmabuf-heap allocations, or do you expect to add
> >>>> __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other
> >>>> non-dmabuf-heap exporters?
> >>>
> >>> We're scoping this to dma-buf heaps for now. CMA heaps and the dmem
> >>> controller are on the radar for follow-up/parallel work (there will be
> >>> dragons and will surely need discussion). For DRM and V4L2 the
> >>> long-term intent is migration to heaps, which would make direct
> >>> accounting on those paths unnecessary.
> >>
> >> Ah I see. GEM buffers exported to dmabufs are what I had in mind. I
> >> guess this would only leave the odd non-DRM driver with the need to
> >> add their own accounting calls, which I don't expect would be a big
> >> problem.
> >>
> >
> > sounds like we still have a long way to go to correctly account for
> > various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in
> > dma_buf_export(), so I guess it covers all dma-buf types except
> > dma_heap, but the problem is that it has no remote charging support at
> > all?
>
> No, just the other way around
>
> DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
>
> dma_buf_export() on the other hand handles tons of different use cases, ranging from buffers accounted to dmem, over special resources which aren't even memory, all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
>
> >>> udmabufs are already
> >>> memcg-charged, so adding a separate MEMCG_DMABUF would double count.
> >>> Are there any other exporters you had in mind that would benefit from
> >>> this approach?
>
> Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
>
> But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
>
> That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.

I removed a draft adding an ioctl for charge transfer from the series
before sending because I wanted to focus on the charge_pid_fd approach
and keep things simple, deferring the recharge path to a follow-up
depending on feedback.

The main difference between my removed draft and what you're
describing, iiuc, is scope and layer: my draft was an explicit ioctl
on the dma-buf fd that the consumer calls to claim the charge (see
below), while you seem to be suggesting a more general kernel-internal
function that could work across buffer types and cgroup controllers,
so not necessarily userspace-initiated? A kernel-internal function
will need a way to identify the target process, which sounds similar
to the binder-backed approach from TJ [1]. For everything else, the
receiver still needs to declare itself, which the ioctl accomplishes.

```
/*
 * When an app imports a daemon-allocated buffer, it can transfer the
 * charge to itself:
 */
int buf_fd = receive_dmabuf_from_daemon();
ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE); /* charge now attributed to the app's cgroup */
```

[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com/

>
> The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.

The main reasons we moved away from TJ's transfer-based approach
toward `charge_pid_fd` are to avoid the transient charge window on the
daemon's cgroup, and to decouple from Binder, allowing any allocator
to use it.

Technically, both approaches could coexist, though. Of the three
scenarios TJ described:
- Scenario 2 is directly addressed by the charge_pid_fd approach
without any transient charge on the daemon, at the cost of one extra
field in the heap ioctl uAPI struct.
- Scenario 3 can be handled by the charge transfer function without
changes to SurfaceFlinger. The app or dequeueBuffer claims the charge
for itself or the app, respectively (depending on whether we include a
pid_fd field in the transfer ioctl). It also covers non-heap
exporters. The con in both variants is the transient charge window on
the daemon.

Both approaches shift the responsibility for correct charging
attribution to userspace: `charge_pid_fd` on the allocator's side,
and the charge transfer on the consumer's side.

Deciding on one, the other or both depends on how much we value
avoiding transient attribution, and how much we need a non-heap
generic solution. With the XFER_CHARGE we can cover both. Thus, the
`charge_pid_fd` approach in this RFC can be seen as a
performance/strictness optimisation, eliminating transient charges to
the daemon at the cost of a permanent uAPI addition to the heap ioctl
struct, but not strictly required for correctness. On the other hand,
if we agree on the end goal of migrating other exporters to use
dma-buf heaps, and scenario 3 is addressed by adding the app's pid_fd
to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient
approach despite the uAPI change.
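
To illustrate what "transient attribution" means in the comparison
above, here is a toy userspace model; it is not kernel code and not part
of the series. Every name in it (toy_cgroup, toy_charge(), and so on) is
invented for the example, and the counters only stand in for memcg
charges. The peak field records whether a cgroup was ever charged, even
briefly.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cgroup's dma-buf charge counter; not a kernel type. */
struct toy_cgroup {
	size_t charged;	/* bytes currently charged */
	size_t peak;	/* highest value 'charged' ever reached */
};

static void toy_charge(struct toy_cgroup *cg, size_t bytes)
{
	cg->charged += bytes;
	if (cg->charged > cg->peak)
		cg->peak = cg->charged;
}

static void toy_uncharge(struct toy_cgroup *cg, size_t bytes)
{
	assert(cg->charged >= bytes);
	cg->charged -= bytes;
}

/* charge_pid_fd style: the allocator names the app, which is charged directly. */
static void alloc_with_explicit_target(struct toy_cgroup *app_cg, size_t bytes)
{
	toy_charge(app_cg, bytes);
}

/* Transfer style: the allocator is charged first, the consumer claims later. */
static void alloc_then_transfer(struct toy_cgroup *alloc_cg,
				struct toy_cgroup *app_cg, size_t bytes)
{
	toy_charge(alloc_cg, bytes);	/* transient charge window opens */
	toy_uncharge(alloc_cg, bytes);	/* consumer claims the charge ... */
	toy_charge(app_cg, bytes);	/* ... and it lands on the app */
}
```

In this model both paths end with the app holding the full charge; the
only difference is that the transfer path leaves a non-zero peak on the
allocator's cgroup, which is the window `charge_pid_fd` avoids.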

>
> Regards,
> Christian.
>
> >
> > Thanks
> > Barry
>


^ permalink raw reply

* Re: [PATCH 12/12] misc/syncobj: add new device
From: Christian König @ 2026-05-18 12:06 UTC (permalink / raw)
  To: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman
  Cc: dri-devel, linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel
In-Reply-To: <20260516-jorth-syncobj-v1-12-88ede9d98a81@gmail.com>

On 5/16/26 13:06, Julian Orth wrote:
> This device makes the DRM_IOCTL_SYNCOBJ_* ioctls available via a
> dedicated device. This allows applications to use syncobjs without
> having to open device nodes in /dev/dri, on systems that don't have any
> such nodes, or on systems whose devices don't support the
> DRIVER_SYNCOBJ_TIMELINE feature.
> 
> Wayland uses syncobjs as its buffer synchronization mechanism. Most
> compositors use the DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a pure
> CPU wait for syncobj point. DRM devices are not involved in this process
> except insofar that a DRM device needs to be used to access the ioctl.
> 
> Similarly, a software-rendered client might perform rendering on a
> dedicated thread and use the wayland syncobj protocol to submit frames
> before they finish rendering. Again, this does not involve DRM devices
> except insofar ... as above.

That use case is invalid.

Usually drm_syncobj can only be filled with dma_fence objects and it is impossible to create one of those for software rendering.

What could be used is the drm_syncobj wait before signal functionality, but that usually requires special handling on the Wayland/Compositor side which as far as I can see doesn't make sense here either.

So the justification to use this for software rendering is very weak. Either I'm missing something or that is not going to fly at all.

Regards,
Christian.

> 
> As an added benefit, this device removes the need to translate between
> file descriptors and handles.
> 
> Signed-off-by: Julian Orth <ju.orth@gmail.com>
> ---
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  drivers/misc/Kconfig                               |  10 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
>  include/uapi/linux/syncobj.h                       |  75 ++++
>  5 files changed, 491 insertions(+)
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761fff..5e140ae5735e 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -395,6 +395,7 @@ Code  Seq#    Include File                                             Comments
>                                                                         <mailto:michael.klein@puffin.lb.shuttle.de>
>  0xCC  00-0F  drivers/misc/ibmvmc.h                                     pseries VMC driver
>  0xCD  01     linux/reiserfs_fs.h                                       Dead since 6.13
> +0xCD  00-0F  uapi/linux/syncobj.h
>  0xCE  01-02  uapi/linux/cxl_mem.h                                      Compute Express Link Memory Devices
>  0xCF  02     fs/smb/client/cifs_ioctl.h
>  0xDD  00-3F                                                            ZFCP device driver see drivers/s390/scsi/
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index 00683bf06258..c1e7749bd356 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -644,6 +644,16 @@ config MCHP_LAN966X_PCI
>  	    - lan966x-miim (MDIO_MSCC_MIIM)
>  	    - lan966x-switch (LAN966X_SWITCH)
>  
> +config SYNCOBJ_DEV
> +	tristate "DRM syncobj device (/dev/syncobj)"
> +	depends on DRM
> +	help
> +	  Creates a /dev/syncobj device node that provides DRM synchronization
> +	  objects (syncobjs) without requiring a DRM device.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called syncobj.
> +
>  source "drivers/misc/c2port/Kconfig"
>  source "drivers/misc/eeprom/Kconfig"
>  source "drivers/misc/cb710/Kconfig"
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index b32a2597d246..9e5deb1d0d76 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -75,3 +75,4 @@ obj-$(CONFIG_MCHP_LAN966X_PCI)	+= lan966x-pci.o
>  obj-y				+= keba/
>  obj-y				+= amd-sbi/
>  obj-$(CONFIG_MISC_RP1)		+= rp1/
> +obj-$(CONFIG_SYNCOBJ_DEV)	+= syncobj.o
> diff --git a/drivers/misc/syncobj.c b/drivers/misc/syncobj.c
> new file mode 100644
> index 000000000000..11ef46ddfeef
> --- /dev/null
> +++ b/drivers/misc/syncobj.c
> @@ -0,0 +1,404 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * syncobj.c - Standalone device for syncobj manipulation.
> + *
> + * Copyright (C) 2026 Julian Orth <ju.orth@gmail.com>
> + */
> +
> +#include <linux/fdtable.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/uaccess.h>
> +#include <drm/drm_syncobj.h>
> +#include <drm/drm_utils.h>
> +#include <uapi/drm/drm.h>
> +#include <uapi/linux/syncobj.h>
> +
> +static int syncobj_array_find(void __user *user_fds, u32 count,
> +			      struct drm_syncobj ***syncobjs_out)
> +{
> +	u32 i;
> +	s32 *fds;
> +	struct drm_syncobj **syncobjs;
> +	int ret;
> +
> +	fds = kmalloc_array(count, sizeof(*fds), GFP_KERNEL);
> +	if (!fds)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(fds, user_fds, sizeof(s32) * count)) {
> +		ret = -EFAULT;
> +		goto err_free_fds;
> +	}
> +
> +	syncobjs = kmalloc_array(count, sizeof(*syncobjs), GFP_KERNEL);
> +	if (!syncobjs) {
> +		ret = -ENOMEM;
> +		goto err_free_fds;
> +	}
> +
> +	for (i = 0; i < count; i++) {
> +		syncobjs[i] = drm_syncobj_from_fd(fds[i]);
> +		if (!syncobjs[i]) {
> +			ret = -EBADF;
> +			goto err_put_syncobjs;
> +		}
> +	}
> +
> +	kfree(fds);
> +	*syncobjs_out = syncobjs;
> +	return 0;
> +
> +err_put_syncobjs:
> +	while (i-- > 0)
> +		drm_syncobj_put(syncobjs[i]);
> +	kfree(syncobjs);
> +err_free_fds:
> +	kfree(fds);
> +	return ret;
> +}
> +
> +static void syncobj_array_free(struct drm_syncobj **syncobjs, u32 count)
> +{
> +	u32 i;
> +
> +	for (i = 0; i < count; i++)
> +		drm_syncobj_put(syncobjs[i]);
> +	kfree(syncobjs);
> +}
> +
> +static int syncobj_ioctl_create(void __user *argp)
> +{
> +	struct syncobj_create_args args;
> +	struct drm_syncobj *syncobj;
> +	int fd, ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags & ~SYNCOBJ_CREATE_SIGNALED)
> +		return -EINVAL;
> +
> +	static_assert(SYNCOBJ_CREATE_SIGNALED == DRM_SYNCOBJ_CREATE_SIGNALED);
> +
> +	ret = drm_syncobj_create(&syncobj, args.flags, NULL);
> +	if (ret)
> +		return ret;
> +
> +	ret = drm_syncobj_get_fd(syncobj, &fd);
> +	drm_syncobj_put(syncobj);
> +	if (ret)
> +		return ret;
> +
> +	args.fd = fd;
> +	if (copy_to_user(argp, &args, sizeof(args))) {
> +		close_fd(fd);
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +static int syncobj_ioctl_wait(void __user *argp)
> +{
> +	struct syncobj_wait_args args;
> +	struct drm_syncobj **syncobjs;
> +	signed long timeout;
> +	u32 first = ~0;
> +	ktime_t t, *tp = NULL;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags & ~(SYNCOBJ_WAIT_FLAGS_WAIT_ALL |
> +			   SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
> +			   SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE |
> +			   SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE))
> +		return -EINVAL;
> +
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_ALL        == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL);
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT);
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE  == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE);
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE   == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE);
> +
> +	if (args.pad)
> +		return -EINVAL;
> +
> +	if (args.count == 0)
> +		return 0;
> +
> +	ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> +				 args.count, &syncobjs);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (args.flags & SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE) {
> +		t = ns_to_ktime(args.deadline_nsec);
> +		tp = &t;
> +	}
> +
> +	timeout = drm_timeout_abs_to_jiffies(args.timeout_nsec);
> +	timeout = drm_syncobj_array_wait_timeout(syncobjs,
> +						 u64_to_user_ptr(args.points),
> +						 args.count,
> +						 args.flags,
> +						 timeout, &first, tp);
> +
> +	syncobj_array_free(syncobjs, args.count);
> +
> +	if (timeout < 0)
> +		return timeout;
> +
> +	args.first_signaled = first;
> +	if (copy_to_user(argp, &args, sizeof(args)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int syncobj_ioctl_reset(void __user *argp)
> +{
> +	struct syncobj_array_args args;
> +	struct drm_syncobj **syncobjs;
> +	u32 i;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags)
> +		return -EINVAL;
> +
> +	if (args.points)
> +		return -EINVAL;
> +
> +	if (args.count == 0)
> +		return -EINVAL;
> +
> +	ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> +				 args.count, &syncobjs);
> +	if (ret < 0)
> +		return ret;
> +
> +	for (i = 0; i < args.count; i++)
> +		drm_syncobj_replace_fence(syncobjs[i], NULL);
> +
> +	syncobj_array_free(syncobjs, args.count);
> +	return 0;
> +}
> +
> +static int syncobj_ioctl_signal(void __user *argp)
> +{
> +	struct syncobj_array_args args;
> +	struct drm_syncobj **syncobjs;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags)
> +		return -EINVAL;
> +
> +	if (args.count == 0)
> +		return -EINVAL;
> +
> +	ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> +				 args.count, &syncobjs);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = drm_syncobj_timeline_signal(syncobjs, args.points, args.count);
> +
> +	syncobj_array_free(syncobjs, args.count);
> +	return ret;
> +}
> +
> +static int syncobj_ioctl_query(void __user *argp)
> +{
> +	struct syncobj_array_args args;
> +	struct drm_syncobj **syncobjs;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags & ~SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED)
> +		return -EINVAL;
> +
> +	static_assert(SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED == DRM_SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED);
> +
> +	if (args.count == 0)
> +		return -EINVAL;
> +
> +	ret = syncobj_array_find(u64_to_user_ptr(args.fds),
> +				 args.count, &syncobjs);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = drm_syncobj_query(syncobjs, args.points, args.count, args.flags);
> +
> +	syncobj_array_free(syncobjs, args.count);
> +	return ret;
> +}
> +
> +static int syncobj_ioctl_transfer(void __user *argp)
> +{
> +	struct syncobj_transfer_args args;
> +	struct drm_syncobj *src, *dst;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.pad)
> +		return -EINVAL;
> +
> +	if (args.flags & ~SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT)
> +		return -EINVAL;
> +
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT);
> +
> +	src = drm_syncobj_from_fd(args.src_fd);
> +	if (!src)
> +		return -EBADF;
> +
> +	dst = drm_syncobj_from_fd(args.dst_fd);
> +	if (!dst) {
> +		drm_syncobj_put(src);
> +		return -EBADF;
> +	}
> +
> +	ret = drm_syncobj_transfer(src, args.src_point,
> +				   dst, args.dst_point, args.flags);
> +
> +	drm_syncobj_put(dst);
> +	drm_syncobj_put(src);
> +
> +	return ret;
> +}
> +
> +static int syncobj_ioctl_eventfd(void __user *argp)
> +{
> +	struct syncobj_eventfd_args args;
> +	struct drm_syncobj *syncobj;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	if (args.flags & ~SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE)
> +		return -EINVAL;
> +
> +	static_assert(SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE == DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE);
> +
> +	if (args.pad)
> +		return -EINVAL;
> +
> +	syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> +	if (!syncobj)
> +		return -EBADF;
> +
> +	ret = drm_syncobj_register_eventfd(syncobj, args.eventfd,
> +					   args.point, args.flags);
> +
> +	drm_syncobj_put(syncobj);
> +
> +	return ret;
> +}
> +
> +static int syncobj_ioctl_export_sync_file(void __user *argp)
> +{
> +	struct syncobj_sync_file_args args;
> +	struct drm_syncobj *syncobj;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> +	if (!syncobj)
> +		return -EBADF;
> +
> +	ret = drm_syncobj_export_sync_file(syncobj, args.point,
> +					   &args.sync_file_fd);
> +	drm_syncobj_put(syncobj);
> +	if (ret)
> +		return ret;
> +
> +	if (copy_to_user(argp, &args, sizeof(args))) {
> +		close_fd(args.sync_file_fd);
> +		return -EFAULT;
> +	}
> +
> +	return 0;
> +}
> +
> +static int syncobj_ioctl_import_sync_file(void __user *argp)
> +{
> +	struct syncobj_sync_file_args args;
> +	struct drm_syncobj *syncobj;
> +	int ret;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +
> +	syncobj = drm_syncobj_from_fd(args.syncobj_fd);
> +	if (!syncobj)
> +		return -EBADF;
> +
> +	ret = drm_syncobj_import_sync_file(syncobj, args.sync_file_fd,
> +					   args.point);
> +
> +	drm_syncobj_put(syncobj);
> +
> +	return ret;
> +}
> +
> +static long syncobj_dev_ioctl(struct file *file, unsigned int cmd,
> +			      unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +
> +	switch (cmd) {
> +	case SYNCOBJ_IOC_CREATE:
> +		return syncobj_ioctl_create(argp);
> +	case SYNCOBJ_IOC_WAIT:
> +		return syncobj_ioctl_wait(argp);
> +	case SYNCOBJ_IOC_RESET:
> +		return syncobj_ioctl_reset(argp);
> +	case SYNCOBJ_IOC_SIGNAL:
> +		return syncobj_ioctl_signal(argp);
> +	case SYNCOBJ_IOC_QUERY:
> +		return syncobj_ioctl_query(argp);
> +	case SYNCOBJ_IOC_TRANSFER:
> +		return syncobj_ioctl_transfer(argp);
> +	case SYNCOBJ_IOC_EVENTFD:
> +		return syncobj_ioctl_eventfd(argp);
> +	case SYNCOBJ_IOC_EXPORT_SYNC_FILE:
> +		return syncobj_ioctl_export_sync_file(argp);
> +	case SYNCOBJ_IOC_IMPORT_SYNC_FILE:
> +		return syncobj_ioctl_import_sync_file(argp);
> +	default:
> +		return -ENOIOCTLCMD;
> +	}
> +}
> +
> +static const struct file_operations syncobj_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= syncobj_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +};
> +
> +static struct miscdevice syncobj_misc = {
> +	.minor	= MISC_DYNAMIC_MINOR,
> +	.name	= "syncobj",
> +	.fops	= &syncobj_dev_fops,
> +	.mode	= 0666,
> +};
> +
> +module_misc_device(syncobj_misc);
> +
> +MODULE_AUTHOR("Julian Orth");
> +MODULE_DESCRIPTION("DRM syncobj device");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/syncobj.h b/include/uapi/linux/syncobj.h
> new file mode 100644
> index 000000000000..c4068fbd5773
> --- /dev/null
> +++ b/include/uapi/linux/syncobj.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SYNCOBJ_H_
> +#define _UAPI_LINUX_SYNCOBJ_H_
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +#define SYNCOBJ_CREATE_SIGNALED			(1 << 0)
> +
> +#define SYNCOBJ_WAIT_FLAGS_WAIT_ALL		(1 << 0)
> +#define SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT	(1 << 1)
> +#define SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE	(1 << 2)
> +#define SYNCOBJ_WAIT_FLAGS_WAIT_DEADLINE	(1 << 3)
> +
> +#define SYNCOBJ_QUERY_FLAGS_LAST_SUBMITTED	(1 << 0)
> +
> +struct syncobj_create_args {
> +	__s32 fd;
> +	__u32 flags;
> +};
> +
> +struct syncobj_wait_args {
> +	__u64 fds;
> +	__u64 points;
> +	__s64 timeout_nsec;
> +	__u32 count;
> +	__u32 flags;
> +	__u32 first_signaled;
> +	__u32 pad;
> +	__u64 deadline_nsec;
> +};
> +
> +struct syncobj_array_args {
> +	__u64 fds;
> +	__u64 points;
> +	__u32 count;
> +	__u32 flags;
> +};
> +
> +struct syncobj_transfer_args {
> +	__s32 src_fd;
> +	__s32 dst_fd;
> +	__u64 src_point;
> +	__u64 dst_point;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
> +struct syncobj_eventfd_args {
> +	__s32 syncobj_fd;
> +	__s32 eventfd;
> +	__u64 point;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
> +struct syncobj_sync_file_args {
> +	__s32 syncobj_fd;
> +	__s32 sync_file_fd;
> +	__u64 point;
> +};
> +
> +#define SYNCOBJ_IOC_BASE		0xCD
> +
> +#define SYNCOBJ_IOC_CREATE		_IOWR(SYNCOBJ_IOC_BASE, 0, struct syncobj_create_args)
> +#define SYNCOBJ_IOC_WAIT		_IOWR(SYNCOBJ_IOC_BASE, 1, struct syncobj_wait_args)
> +#define SYNCOBJ_IOC_RESET		_IOW(SYNCOBJ_IOC_BASE,  2, struct syncobj_array_args)
> +#define SYNCOBJ_IOC_SIGNAL		_IOW(SYNCOBJ_IOC_BASE,  3, struct syncobj_array_args)
> +#define SYNCOBJ_IOC_QUERY		_IOW(SYNCOBJ_IOC_BASE,  4, struct syncobj_array_args)
> +#define SYNCOBJ_IOC_TRANSFER		_IOW(SYNCOBJ_IOC_BASE,  5, struct syncobj_transfer_args)
> +#define SYNCOBJ_IOC_EVENTFD		_IOW(SYNCOBJ_IOC_BASE,  6, struct syncobj_eventfd_args)
> +#define SYNCOBJ_IOC_EXPORT_SYNC_FILE	_IOWR(SYNCOBJ_IOC_BASE, 7, struct syncobj_sync_file_args)
> +#define SYNCOBJ_IOC_IMPORT_SYNC_FILE	_IOW(SYNCOBJ_IOC_BASE,  8, struct syncobj_sync_file_args)
> +
> +#endif /* _UAPI_LINUX_SYNCOBJ_H_ */
> 

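The quoted file_operations use compat_ptr_ioctl, which only translates the
user pointer. That appears to rely on every argument struct having the same
layout for 32-bit and 64-bit callers. A minimal userspace sketch of that
check, assuming the field order from the quoted include/uapi/linux/syncobj.h.
The *_sketch struct names and the <stdint.h> types (standing in for
__u32/__u64) are illustrative and not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of struct syncobj_wait_args from the patch. */
struct syncobj_wait_args_sketch {
	uint64_t fds;
	uint64_t points;
	int64_t  timeout_nsec;
	uint32_t count;
	uint32_t flags;
	uint32_t first_signaled;
	uint32_t pad;		/* explicit padding keeps deadline_nsec 8-aligned */
	uint64_t deadline_nsec;
};

/* Hypothetical local mirror of struct syncobj_array_args from the patch. */
struct syncobj_array_args_sketch {
	uint64_t fds;
	uint64_t points;
	uint32_t count;
	uint32_t flags;
};

/*
 * Every 64-bit member sits at a multiple of 8 and the total size is a
 * multiple of 8, so an ABI that aligns 64-bit integers to 4 bytes and one
 * that aligns them to 8 bytes should produce the same layout.
 */
_Static_assert(offsetof(struct syncobj_wait_args_sketch, timeout_nsec) == 16,
	       "timeout_nsec offset");
_Static_assert(offsetof(struct syncobj_wait_args_sketch, deadline_nsec) == 40,
	       "deadline_nsec offset");
_Static_assert(sizeof(struct syncobj_wait_args_sketch) == 48,
	       "wait args size");
_Static_assert(sizeof(struct syncobj_array_args_sketch) == 24,
	       "array args size");
```

Compiling this for both a 32-bit and a 64-bit target is one way to confirm
that no implicit padding differs between the two ABIs.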

^ permalink raw reply

* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Julian Orth @ 2026-05-18 12:02 UTC (permalink / raw)
  To: Christian König
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
	Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
	linux-media, linaro-mm-sig, linux-doc, wayland-devel
In-Reply-To: <c6c91de9-a34b-4b50-a3c1-d42bf7631f8e@amd.com>

On Mon, May 18, 2026 at 1:58 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/16/26 13:06, Julian Orth wrote:
> > This series adds a new device /dev/syncobj that can be used to create
> > and manipulate DRM syncobjs. Previously, these operations required the
> > use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
> > and DRIVER_SYNCOBJ_TIMELINE features.
> >
> > There are several issues with the existing API:
> >
> > - Syncobjs are the only explicit sync mechanism available on Wayland.
> >   Most compositors do not use GPU waits. Instead, they use the
> >   DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
> >   DRM devices means that compositors cannot consistently offer this
> >   feature even though no device-specific logic is involved.
>
> Well, the drm_syncobj is a container for device-specific dma fences.

Not necessarily. The DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL ioctl attaches
some kind of dummy fence that is already signaled, and I don't believe
that fence is device-specific. llvmpipe would use the same path.

>
> What could be possible instead is to pass an eventfd into Wayland, but
> that is something userspace needs to decide.
>
> > - llvmpipe currently cannot offer syncobj interop because it does not
> >   have access to a DRM device. This means that applications using
> >   llvmpipe cannot present images before they have finished rendering,
> >   despite llvmpipe using threaded rendering.
>
> Yeah, but that is completely intentional. You *CAN'T* use a dma_fence
> as a completion event for llvmpipe rendering. See the kernel
> documentation on that.
>
> What could be possible is to use the drm_syncobj functionality to wait
> before signal, but that has different semantics.
>
> Regards,
> Christian.
>
> > - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
> >   for devices that support the syncobj ioctls in order to use the
> >   Wayland syncobj protocol.
> > - Similarly, clients that want to use screen capture have no equivalent
> >   to the WSI and are therefore forced into that path.
> > - Having to keep a DRM device open has potentially negative interactions
> >   with GPU hotplug.
> > - Having to translate between syncobj FDs and handles is troublesome in
> >   the compositor use case since syncobjs come and go frequently and need
> >   to be cleaned up when clients disconnect.
> >
> > /dev/syncobj solves these issues by providing all syncobj ioctls under a
> > consistent path that is not tied to any DRM device. It also operates
> > directly on file descriptors instead of syncobj handles.
> >
> > The series starts with a number of small refactorings in drm_syncobj.c
> > to make its functionality available outside of the file and without the
> > need for drm_file/handle pairs.
> >
> > The last commit adds the /dev/syncobj module. I've added it as a misc
> > device but maybe this should instead live somewhere under gpu/drm.
> >
> > An application using the new interface can be found at [1].
> >
> > [1]: https://github.com/mahkoh/jay/pull/947
> >
> > ---
> > Julian Orth (12):
> >       drm/syncobj: add drm_syncobj_from_fd
> >       drm/syncobj: add drm_syncobj_fence_lookup
> >       drm/syncobj: make drm_syncobj_array_wait_timeout public
> >       drm/syncobj: add drm_syncobj_register_eventfd
> >       drm/syncobj: have transfer functions accept drm_syncobj directly
> >       drm/syncobj: add drm_syncobj_transfer
> >       drm/syncobj: add drm_syncobj_timeline_signal
> >       drm/syncobj: add drm_syncobj_query
> >       drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
> >       drm/syncobj: add drm_syncobj_import_sync_file
> >       drm/syncobj: add drm_syncobj_export_sync_file
> >       misc/syncobj: add new device
> >
> >  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
> >  drivers/gpu/drm/drm_syncobj.c                      | 374 ++++++++++++++-----
> >  drivers/misc/Kconfig                               |  10 +
> >  drivers/misc/Makefile                              |   1 +
> >  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
> >  include/drm/drm_syncobj.h                          |  21 ++
> >  include/uapi/linux/syncobj.h                       |  75 ++++
> >  7 files changed, 795 insertions(+), 91 deletions(-)
> > ---
> > base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
> > change-id: 20260516-jorth-syncobj-d4d374c8c61b
> >
> > Best regards,
> > --
> > Julian Orth <ju.orth@gmail.com>
> >
>

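For readers following the thread, here is a rough userspace sketch of the
CPU-wait path the cover letter describes: create a syncobj through the
proposed device, then ask for an eventfd to be signaled at a timeline point.
It assumes the uapi exactly as posted in patch 12 (struct layouts, ioctl base
0xCD, command numbers 0 and 6). The local struct copies and the helper names
syncobj_create() and syncobj_watch_point() are hypothetical, and none of
this exists in a released kernel:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical local copies of the structs proposed in the series. */
struct syncobj_create_args {
	int32_t  fd;
	uint32_t flags;
};

struct syncobj_eventfd_args {
	int32_t  syncobj_fd;
	int32_t  eventfd;
	uint64_t point;
	uint32_t flags;
	uint32_t pad;
};

#define SYNCOBJ_IOC_BASE	0xCD
#define SYNCOBJ_IOC_CREATE \
	_IOWR(SYNCOBJ_IOC_BASE, 0, struct syncobj_create_args)
#define SYNCOBJ_IOC_EVENTFD \
	_IOW(SYNCOBJ_IOC_BASE, 6, struct syncobj_eventfd_args)

/* Returns a new syncobj file descriptor, or a negative errno value. */
int syncobj_create(int dev_fd, uint32_t flags)
{
	struct syncobj_create_args args = { .fd = -1, .flags = flags };

	if (ioctl(dev_fd, SYNCOBJ_IOC_CREATE, &args) < 0)
		return -errno;
	return args.fd;
}

/* Asks for efd to be signaled at the given point; 0 or negative errno. */
int syncobj_watch_point(int dev_fd, int syncobj_fd, int efd, uint64_t point)
{
	struct syncobj_eventfd_args args = {
		.syncobj_fd = syncobj_fd,
		.eventfd = efd,
		.point = point,
		.flags = 0,	/* the posted handler rejects unknown flag bits */
		.pad = 0,	/* must be zero, otherwise -EINVAL */
	};

	if (ioctl(dev_fd, SYNCOBJ_IOC_EVENTFD, &args) < 0)
		return -errno;
	return 0;
}
```

A caller would presumably open /dev/syncobj once, create the eventfd with
eventfd(2), and then poll that eventfd from its main loop. Unlike the DRM
ioctls, no handle translation is involved because everything is a file
descriptor.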
^ permalink raw reply

* [PATCH v4 10/10] RAS: add firmware-first CPER provider
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Add a firmware-first CPER provider that reuses the shared
GHES helpers, wire it into the RAS Kconfig/Makefile and
document it in the admin guide.

Update MAINTAINERS now that the driver exists.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 Documentation/admin-guide/RAS/main.rst |  18 +++
 MAINTAINERS                            |   1 +
 drivers/acpi/apei/apei-internal.h      |  10 +-
 drivers/acpi/apei/ghes_cper.c          |   2 +
 drivers/ras/Kconfig                    |  11 ++
 drivers/ras/Makefile                   |   1 +
 drivers/ras/cper-esource.c             | 257 +++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h               |  10 ++
 8 files changed, 301 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst
index 5a45db32c49b..84219d25a072 100644
--- a/Documentation/admin-guide/RAS/main.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -205,6 +205,24 @@ Architecture (MCA)\ [#f3]_.
 .. [#f3] For more details about the Machine Check Architecture (MCA),
   please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.
 
+Firmware-first CPER providers
+-----------------------------
+
+Some systems expose Common Platform Error Record (CPER) data
+through platform firmware instead of ACPI HEST tables.
+Enable ``CONFIG_RAS_CPER_ESOURCE`` to build the ``drivers/ras/cper-esource.c``
+driver. The current in-tree firmware description uses the
+``Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml`` binding.
+The driver reuses the GHES CPER helper object in
+``drivers/acpi/apei/ghes_cper.c`` so the logging, notifier chains, and
+memory failure handling match the ACPI GHES behaviour even when
+ACPI is disabled.
+
+Once a platform describes a firmware-first provider, both ACPI GHES and the
+firmware-described driver reuse the same code paths. This keeps the
+behaviour consistent regardless of whether the error source is described
+by ACPI tables or another firmware description.
+
 EDAC - Error Detection And Correction
 *************************************
 
diff --git a/MAINTAINERS b/MAINTAINERS
index 3bbc19589f1a..8a5151a49820 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22255,6 +22255,7 @@ RAS ERROR STATUS
 M:	Ahmed Tiba <ahmed.tiba@arm.com>
 S:	Maintained
 F:	Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
+F:	drivers/ras/cper-esource.c
 
 RAS INFRASTRUCTURE
 M:	Tony Luck <tony.luck@intel.com>
diff --git a/drivers/acpi/apei/apei-internal.h b/drivers/acpi/apei/apei-internal.h
index 77c10a7a7a9f..c16ac541f15b 100644
--- a/drivers/acpi/apei/apei-internal.h
+++ b/drivers/acpi/apei/apei-internal.h
@@ -8,6 +8,7 @@
 #define APEI_INTERNAL_H
 
 #include <linux/acpi.h>
+#include <acpi/ghes_cper.h>
 
 struct apei_exec_context;
 
@@ -120,15 +121,6 @@ int apei_exec_collect_resources(struct apei_exec_context *ctx,
 struct dentry;
 struct dentry *apei_get_debugfs_dir(void);
 
-static inline u32 cper_estatus_len(struct acpi_hest_generic_status *estatus)
-{
-	if (estatus->raw_data_length)
-		return estatus->raw_data_offset + \
-			estatus->raw_data_length;
-	else
-		return sizeof(*estatus) + estatus->data_length;
-}
-
 int apei_osc_setup(void);
 
 int einj_get_available_error_type(u32 *type, int einj_action);
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 0ff9d06eb78f..a7691aa5011c 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -46,7 +46,9 @@
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
 
+#ifdef CONFIG_ACPI_APEI
 #include "apei-internal.h"
+#endif
 
 ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
 
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index fc4f4bb94a4c..3c1c63b2fefc 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -34,6 +34,17 @@ if RAS
 source "arch/x86/ras/Kconfig"
 source "drivers/ras/amd/atl/Kconfig"
 
+config RAS_CPER_ESOURCE
+	bool "Firmware-first CPER error source block provider"
+	select GHES_CPER_HELPERS
+	help
+	  Enable support for firmware-first Common Platform Error Record
+	  (CPER) error source block providers. The current in-tree user is
+	  described by the arm,ras-cper DeviceTree binding. The driver
+	  reuses the existing GHES CPER helpers so the error processing
+	  matches the ACPI code paths, but it can be built even when ACPI is
+	  disabled.
+
 config RAS_FMPM
 	tristate "FRU Memory Poison Manager"
 	default m
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 11f95d59d397..0de069557f31 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_RAS)	+= ras.o
 obj-$(CONFIG_DEBUG_FS)	+= debugfs.o
 obj-$(CONFIG_RAS_CEC)	+= cec.o
+obj-$(CONFIG_RAS_CPER_ESOURCE)	+= cper-esource.o
 
 obj-$(CONFIG_RAS_FMPM)	+= amd/fmpm.o
 obj-y			+= amd/atl/
diff --git a/drivers/ras/cper-esource.c b/drivers/ras/cper-esource.c
new file mode 100644
index 000000000000..83f7a910e50a
--- /dev/null
+++ b/drivers/ras/cper-esource.c
@@ -0,0 +1,257 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Firmware-first CPER error source provider.
+ *
+ * This driver shares the GHES CPER helpers so we keep the reporting and
+ * notifier behaviour identical to ACPI GHES.
+ *
+ * Copyright (C) 2026 ARM Ltd.
+ * Author: Ahmed Tiba <ahmed.tiba@arm.com>
+ */
+
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/idr.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/mod_devicetable.h>
+#include <linux/module.h>
+#include <linux/panic.h>
+#include <linux/platform_device.h>
+#include <linux/property.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <acpi/ghes.h>
+#include <acpi/ghes_cper.h>
+
+static DEFINE_IDA(cper_esource_source_ids);
+
+struct cper_esource_ack {
+	void __iomem *addr;
+	u64 preserve;
+	u64 set;
+	u8 width;
+	bool present;
+};
+
+struct cper_esource {
+	struct device *dev;
+	void __iomem *status;
+	size_t status_len;
+
+	struct cper_esource_ack ack;
+
+	struct acpi_hest_generic *generic;
+	struct acpi_hest_generic_status *estatus;
+
+	bool sync;
+	int irq;
+
+	/* Serializes access while firmware and the OS share the status buffer. */
+	spinlock_t lock;
+};
+
+static void cper_esource_release_source_id(void *data)
+{
+	struct acpi_hest_generic *generic = data;
+
+	ida_free(&cper_esource_source_ids, generic->header.source_id);
+}
+
+static int cper_esource_init_pool(void)
+{
+	if (ghes_estatus_pool)
+		return 0;
+
+	return ghes_estatus_pool_init(1);
+}
+
+static int cper_esource_copy_status(struct cper_esource *ctx)
+{
+	memcpy_fromio(ctx->estatus, ctx->status, ctx->status_len);
+	return 0;
+}
+
+static void cper_esource_ack(struct cper_esource *ctx)
+{
+	u64 val;
+
+	if (!ctx->ack.present)
+		return;
+
+	if (ctx->ack.width == 64) {
+		val = readq(ctx->ack.addr);
+		val &= ctx->ack.preserve;
+		val |= ctx->ack.set;
+		writeq(val, ctx->ack.addr);
+	} else {
+		val = readl(ctx->ack.addr);
+		val &= (u32)ctx->ack.preserve;
+		val |= (u32)ctx->ack.set;
+		writel(val, ctx->ack.addr);
+	}
+}
+
+static void cper_esource_fatal(struct cper_esource *ctx)
+{
+	__ghes_print_estatus(KERN_EMERG, ctx->generic, ctx->estatus);
+	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_STILL_OK);
+	panic("GHES: fatal firmware-first CPER record from %s\n",
+	      dev_name(ctx->dev));
+}
+
+static void cper_esource_process(struct cper_esource *ctx)
+{
+	int sev;
+
+	guard(spinlock_irqsave)(&ctx->lock);
+
+	if (cper_esource_copy_status(ctx))
+		return;
+
+	sev = ghes_severity(ctx->estatus->error_severity);
+	if (sev >= GHES_SEV_PANIC)
+		cper_esource_fatal(ctx);
+
+	if (!ghes_estatus_cached(ctx->estatus) &&
+	    ghes_print_estatus(NULL, ctx->generic, ctx->estatus))
+		ghes_estatus_cache_add(ctx->generic, ctx->estatus);
+
+	ghes_cper_handle_status(ctx->dev, ctx->generic, ctx->estatus, ctx->sync);
+	cper_esource_ack(ctx);
+}
+
+static irqreturn_t cper_esource_irq(int irq, void *data)
+{
+	struct cper_esource *ctx = data;
+
+	cper_esource_process(ctx);
+
+	return IRQ_HANDLED;
+}
+
+static int cper_esource_init_ack(struct platform_device *pdev,
+				 struct cper_esource *ctx)
+{
+	struct device *dev = &pdev->dev;
+	struct resource *res;
+	size_t size;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
+	if (!res)
+		return 0;
+
+	ctx->ack.addr = devm_platform_get_and_ioremap_resource(pdev, 1, &res);
+	if (IS_ERR(ctx->ack.addr))
+		return PTR_ERR(ctx->ack.addr);
+
+	size = resource_size(res);
+	switch (size) {
+	case 4:
+		ctx->ack.width = 32;
+		ctx->ack.preserve = ~0U;
+		break;
+	case 8:
+		ctx->ack.width = 64;
+		ctx->ack.preserve = ~0ULL;
+		break;
+	default:
+		return dev_err_probe(dev, -EINVAL,
+				     "unsupported ack resource size %zu\n", size);
+	}
+
+	ctx->ack.set = BIT_ULL(0);
+	ctx->ack.present = true;
+	return 0;
+}
+
+static int cper_esource_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct cper_esource *ctx;
+	struct resource *res;
+	int source_id;
+	int rc;
+
+	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+
+	spin_lock_init(&ctx->lock);
+	ctx->dev = dev;
+	ctx->sync = device_property_read_bool(dev, "arm,sea-notify");
+
+	ctx->status = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
+	if (IS_ERR(ctx->status))
+		return dev_err_probe(dev, PTR_ERR(ctx->status),
+				     "failed to map status region\n");
+
+	ctx->status_len = resource_size(res);
+	if (!ctx->status_len)
+		return dev_err_probe(dev, -EINVAL, "status region has zero length\n");
+
+	rc = cper_esource_init_ack(pdev, ctx);
+	if (rc)
+		return rc;
+
+	rc = cper_esource_init_pool();
+	if (rc)
+		return rc;
+
+	ctx->estatus = devm_kzalloc(dev, ctx->status_len, GFP_KERNEL);
+	if (!ctx->estatus)
+		return -ENOMEM;
+
+	ctx->generic = devm_kzalloc(dev, sizeof(*ctx->generic), GFP_KERNEL);
+	if (!ctx->generic)
+		return -ENOMEM;
+
+	source_id = ida_alloc_min(&cper_esource_source_ids, 1, GFP_KERNEL);
+	if (source_id < 0)
+		return source_id;
+
+	ctx->generic->header.type = ACPI_HEST_TYPE_GENERIC_ERROR;
+	ctx->generic->header.source_id = source_id;
+
+	rc = devm_add_action_or_reset(dev, cper_esource_release_source_id,
+				      ctx->generic);
+	if (rc)
+		return rc;
+
+	ctx->generic->notify.type = ctx->sync ?
+		ACPI_HEST_NOTIFY_SEA : ACPI_HEST_NOTIFY_EXTERNAL;
+	ctx->generic->error_block_length = ctx->status_len;
+
+	ctx->irq = platform_get_irq(pdev, 0);
+	if (ctx->irq < 0)
+		return ctx->irq;
+
+	rc = devm_request_threaded_irq(dev, ctx->irq, NULL, cper_esource_irq,
+				       IRQF_ONESHOT,
+				       dev_name(dev), ctx);
+	if (rc)
+		return dev_err_probe(dev, rc, "failed to request interrupt\n");
+
+	return 0;
+}
+
+static const struct of_device_id cper_esource_of_match[] = {
+	{ .compatible = "arm,ras-cper" },
+	{ /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, cper_esource_of_match);
+
+static struct platform_driver cper_esource_driver = {
+	.driver = {
+		.name = "cper-esource",
+		.of_match_table = cper_esource_of_match,
+	},
+	.probe = cper_esource_probe,
+};
+
+module_platform_driver(cper_esource_driver);
+
+MODULE_AUTHOR("Ahmed Tiba <ahmed.tiba@arm.com>");
+MODULE_DESCRIPTION("Firmware-first CPER provider");
+MODULE_LICENSE("GPL");
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 511b95b50911..a78d4a773129 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -80,6 +80,14 @@ static inline bool is_hest_sync_notify(struct ghes *ghes)
 	return notify_type == ACPI_HEST_NOTIFY_SEA;
 }
 
+static inline u32 cper_estatus_len(struct acpi_hest_generic_status *estatus)
+{
+	if (estatus->raw_data_length)
+		return estatus->raw_data_offset + estatus->raw_data_length;
+	else
+		return sizeof(*estatus) + estatus->data_length;
+}
+
 struct ghes_vendor_record_entry {
 	struct work_struct work;
 	int error_severity;
@@ -108,6 +116,8 @@ int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 int ghes_estatus_cached(struct acpi_hest_generic_status *estatus);
 void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
+int ghes_register_vendor_record_notifier(struct notifier_block *nb);
+void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
 void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 				   int sev);
 int ghes_severity(int severity);

-- 
2.43.0

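Two pieces of arithmetic in this patch are easy to check in isolation: the
read-modify-write done by cper_esource_ack() and the length computed by
cper_estatus_len(). The sketch below restates both as pure userspace
functions. The names cper_ack_next(), estatus_hdr_sketch and
estatus_len_sketch() are hypothetical, and the five-field header layout is
an assumption based on the ACPI Generic Error Status block, not something
this patch defines:

```c
#include <stdint.h>

/*
 * Value the driver would write back to the ack register: keep the bits
 * selected by preserve, then OR in the set bits. For a 32-bit register
 * everything is truncated to 32 bits, as in the readl()/writel() branch.
 */
uint64_t cper_ack_next(uint64_t cur, uint64_t preserve, uint64_t set,
		       unsigned int width)
{
	if (width == 64)
		return (cur & preserve) | set;
	return ((uint32_t)cur & (uint32_t)preserve) | (uint32_t)set;
}

/* Assumed userspace mirror of the status block header (five u32 fields). */
struct estatus_hdr_sketch {
	uint32_t block_status;
	uint32_t raw_data_offset;
	uint32_t raw_data_length;
	uint32_t data_length;
	uint32_t error_severity;
};

/*
 * Same rule as cper_estatus_len(): when raw data is present the record
 * ends where the raw data ends, otherwise it is header plus data_length.
 */
uint32_t estatus_len_sketch(const struct estatus_hdr_sketch *e)
{
	if (e->raw_data_length)
		return e->raw_data_offset + e->raw_data_length;
	return (uint32_t)sizeof(*e) + e->data_length;
}
```

With the defaults chosen in cper_esource_init_ack() (preserve all bits, set
bit 0), the ack write leaves the register unchanged apart from bit 0. The
length helper may also be a natural value to compare against the mapped
status_len before trusting a record, though the posted driver does not do
that.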

^ permalink raw reply related

* [PATCH v4 09/10] dt-bindings: firmware: add arm,ras-cper
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Describe the DeviceTree node that exposes the Arm firmware-first
CPER provider and hook the file into MAINTAINERS so the
binding has an owner.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
 .../devicetree/bindings/firmware/arm,ras-cper.yaml | 71 ++++++++++++++++++++++
 MAINTAINERS                                        |  5 ++
 2 files changed, 76 insertions(+)

diff --git a/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
new file mode 100644
index 000000000000..81dc37390af5
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/arm,ras-cper.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm RAS CPER provider
+
+maintainers:
+  - Ahmed Tiba <ahmed.tiba@arm.com>
+
+description:
+  Arm Reliability, Availability and Serviceability (RAS) firmware can expose
+  a firmware-first CPER error source directly via DeviceTree. Firmware
+  provides the CPER Generic Error Status block and notifies the OS through
+  an interrupt.
+
+properties:
+  compatible:
+    const: arm,ras-cper
+
+  memory-region:
+    oneOf:
+      - items:
+          - description:
+              CPER Generic Error Status block exposed by firmware.
+      - items:
+          - description:
+              CPER Generic Error Status block exposed by firmware.
+          - description:
+              Optional firmware-owned ack buffer used on platforms
+              where firmware needs an explicit "ack" handshake before overwriting
+              the CPER buffer. Firmware watches bit 0 and expects the OS to set it
+              once the current status block has been consumed.
+
+  interrupts:
+    maxItems: 1
+    description:
+      Interrupt used to signal that a new status record is ready.
+
+required:
+  - compatible
+  - memory-region
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+    reserved-memory {
+      #address-cells = <2>;
+      #size-cells = <2>;
+      ras_cper_buffer: memory@fe800000 {
+        reg = <0x0 0xfe800000 0x0 0x1000>;
+        no-map;
+      };
+
+      ras_cper_ack: memory@fe801000 {
+        reg = <0x0 0xfe801000 0x0 0x1000>;
+        no-map;
+      };
+    };
+
+    error-handler {
+      compatible = "arm,ras-cper";
+      memory-region = <&ras_cper_buffer>, <&ras_cper_ack>;
+      interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>;
+    };
+...
diff --git a/MAINTAINERS b/MAINTAINERS
index 7492fefa447c..3bbc19589f1a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22251,6 +22251,11 @@ M:	Alexandre Bounine <alex.bou9@gmail.com>
 S:	Maintained
 F:	drivers/rapidio/
 
+RAS ERROR STATUS
+M:	Ahmed Tiba <ahmed.tiba@arm.com>
+S:	Maintained
+F:	Documentation/devicetree/bindings/firmware/arm,ras-cper.yaml
+
 RAS INFRASTRUCTURE
 M:	Tony Luck <tony.luck@intel.com>
 M:	Borislav Petkov <bp@alien8.de>

-- 
2.43.0



* [PATCH v4 08/10] ACPI: APEI: share GHES CPER helpers
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Wire GHES up to the helper routines in ghes_cper.c and remove the local
copies from ghes.c. This keeps the control flow identical while letting
the helpers be shared with other firmware-first providers.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
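Illustration only, not part of the commit: a small user-space sketch of why
callers compare the result of ghes_severity() rather than the raw CPER
value. The CPER encoding is not ordered by severity, while the GHES one is,
so a test such as "<= GHES_SEV_CORRECTED" only makes sense after the
mapping. The enum values below are mirrored from the kernel headers from
memory and should be treated as assumptions; the sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match include/linux/cper.h (UEFI encoding, not ordered). */
enum {
	CPER_SEV_RECOVERABLE,
	CPER_SEV_FATAL,
	CPER_SEV_CORRECTED,
	CPER_SEV_INFORMATIONAL,
};

/* Assumed to match include/acpi/ghes.h (ordered from mild to severe). */
enum {
	GHES_SEV_NO		= 0x0,
	GHES_SEV_CORRECTED	= 0x1,
	GHES_SEV_RECOVERABLE	= 0x2,
	GHES_SEV_PANIC		= 0x3,
};

/* Same switch as ghes_severity() in the patch: unknown values panic. */
static int sketch_ghes_severity(int severity)
{
	switch (severity) {
	case CPER_SEV_INFORMATIONAL:
		return GHES_SEV_NO;
	case CPER_SEV_CORRECTED:
		return GHES_SEV_CORRECTED;
	case CPER_SEV_RECOVERABLE:
		return GHES_SEV_RECOVERABLE;
	case CPER_SEV_FATAL:
	default:
		return GHES_SEV_PANIC;
	}
}

/* Mirrors the bucket choice in ghes_print_estatus(). */
static bool sketch_uses_corrected_ratelimit(int cper_sev)
{
	return sketch_ghes_severity(cper_sev) <= GHES_SEV_CORRECTED;
}
```

With these values, a raw comparison would misorder the severities:
CPER_SEV_CORRECTED (2) is numerically larger than CPER_SEV_FATAL (1), which
is why the helper is needed by every firmware-first provider that shares
the printing and rate-limiting code.
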
 drivers/acpi/apei/ghes.c      | 416 +--------------------------------------
 drivers/acpi/apei/ghes_cper.c | 438 +++++++++++++++++++++++++++++++++++++++++-
 include/acpi/ghes_cper.h      |  20 ++
 3 files changed, 459 insertions(+), 415 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 85be2ebf4d3e..f85b97c4db4c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -67,8 +67,6 @@
 #define FIX_APEI_GHES_SDEI_CRITICAL	__end_of_fixed_addresses
 #endif
 
-static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
-
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -113,276 +111,6 @@ static DEFINE_MUTEX(ghes_devs_mutex);
  */
 static DEFINE_SPINLOCK(ghes_notify_lock_irq);
 
-struct gen_pool *ghes_estatus_pool;
-
-int ghes_estatus_pool_init(unsigned int num_ghes)
-{
-	unsigned long addr, len;
-	int rc;
-
-	ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
-	if (!ghes_estatus_pool)
-		return -ENOMEM;
-
-	len = GHES_ESTATUS_CACHE_AVG_SIZE * GHES_ESTATUS_CACHE_ALLOCED_MAX;
-	len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
-
-	addr = (unsigned long)vmalloc(PAGE_ALIGN(len));
-	if (!addr)
-		goto err_pool_alloc;
-
-	rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
-	if (rc)
-		goto err_pool_add;
-
-	return 0;
-
-err_pool_add:
-	vfree((void *)addr);
-
-err_pool_alloc:
-	gen_pool_destroy(ghes_estatus_pool);
-
-	return -ENOMEM;
-}
-
-/**
- * ghes_estatus_pool_region_free - free previously allocated memory
- *				   from the ghes_estatus_pool.
- * @addr: address of memory to free.
- * @size: size of memory to free.
- *
- * Returns none.
- */
-void ghes_estatus_pool_region_free(unsigned long addr, u32 size)
-{
-	gen_pool_free(ghes_estatus_pool, addr, size);
-}
-EXPORT_SYMBOL_GPL(ghes_estatus_pool_region_free);
-
-static inline int ghes_severity(int severity)
-{
-	switch (severity) {
-	case CPER_SEV_INFORMATIONAL:
-		return GHES_SEV_NO;
-	case CPER_SEV_CORRECTED:
-		return GHES_SEV_CORRECTED;
-	case CPER_SEV_RECOVERABLE:
-		return GHES_SEV_RECOVERABLE;
-	case CPER_SEV_FATAL:
-		return GHES_SEV_PANIC;
-	default:
-		/* Unknown, go panic */
-		return GHES_SEV_PANIC;
-	}
-}
-
-
-/**
- * struct ghes_task_work - for synchronous RAS event
- *
- * @twork:                callback_head for task work
- * @pfn:                  page frame number of corrupted page
- * @flags:                work control flags
- *
- * Structure to pass task work to be handled before
- * returning to user-space via task_work_add().
- */
-struct ghes_task_work {
-	struct callback_head twork;
-	u64 pfn;
-	int flags;
-};
-
-static void memory_failure_cb(struct callback_head *twork)
-{
-	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
-	int ret;
-
-	ret = memory_failure(twcb->pfn, twcb->flags);
-	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
-
-	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
-		return;
-
-	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
-			twcb->pfn, current->comm, task_pid_nr(current));
-	force_sig(SIGBUS);
-}
-
-static bool ghes_do_memory_failure(u64 physical_addr, int flags)
-{
-	struct ghes_task_work *twcb;
-	unsigned long pfn;
-
-	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
-		return false;
-
-	pfn = PHYS_PFN(physical_addr);
-
-	if (flags == MF_ACTION_REQUIRED && current->mm) {
-		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
-		if (!twcb)
-			return false;
-
-		twcb->pfn = pfn;
-		twcb->flags = flags;
-		init_task_work(&twcb->twork, memory_failure_cb);
-		task_work_add(current, &twcb->twork, TWA_RESUME);
-		return true;
-	}
-
-	memory_failure_queue(pfn, flags);
-	return true;
-}
-
-static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
-				       int sev, bool sync)
-{
-	int flags = -1;
-	int sec_sev = ghes_severity(gdata->error_severity);
-	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
-	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
-		return false;
-
-	/* iff following two events can be handled properly by now */
-	if (sec_sev == GHES_SEV_CORRECTED &&
-	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
-		flags = MF_SOFT_OFFLINE;
-	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
-		flags = sync ? MF_ACTION_REQUIRED : 0;
-
-	if (flags != -1)
-		return ghes_do_memory_failure(mem_err->physical_addr, flags);
-
-	return false;
-}
-
-static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
-				     int sev, bool sync)
-{
-	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-	int flags = sync ? MF_ACTION_REQUIRED : 0;
-	int length = gdata->error_data_length;
-	char error_type[120];
-	bool queued = false;
-	int sec_sev, i;
-	char *p;
-
-	sec_sev = ghes_severity(gdata->error_severity);
-	if (length >= sizeof(*err)) {
-		log_arm_hw_error(err, sec_sev);
-	} else {
-		pr_warn(FW_BUG "arm error length: %d\n", length);
-		pr_warn(FW_BUG "length is too small\n");
-		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
-		return false;
-	}
-
-	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
-		return false;
-
-	p = (char *)(err + 1);
-	length -= sizeof(err);
-
-	for (i = 0; i < err->err_info_num; i++) {
-		struct cper_arm_err_info *err_info;
-		bool is_cache, has_pa;
-
-		/* Ensure we have enough data for the error info header */
-		if (length < sizeof(*err_info))
-			break;
-
-		err_info = (struct cper_arm_err_info *)p;
-
-		/* Validate the claimed length before using it */
-		length -= err_info->length;
-		if (length < 0)
-			break;
-
-		is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
-		has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
-
-		/*
-		 * The field (err_info->error_info & BIT(26)) is fixed to set to
-		 * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
-		 * firmware won't mix corrected errors in an uncorrected section,
-		 * and don't filter out 'corrected' error here.
-		 */
-		if (is_cache && has_pa) {
-			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
-			p += err_info->length;
-			continue;
-		}
-
-		cper_bits_to_str(error_type, sizeof(error_type),
-				 FIELD_GET(CPER_ARM_ERR_TYPE_MASK, err_info->type),
-				 cper_proc_error_type_strs,
-				 ARRAY_SIZE(cper_proc_error_type_strs));
-
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-				    "Unhandled processor error type 0x%02x: %s%s\n",
-				    err_info->type, error_type,
-				    (err_info->type & ~CPER_ARM_ERR_TYPE_MASK) ? " with reserved bit(s)" : "");
-		p += err_info->length;
-	}
-
-	return queued;
-}
-
-/*
- * PCIe AER errors need to be sent to the AER driver for reporting and
- * recovery. The GHES severities map to the following AER severities and
- * require the following handling:
- *
- * GHES_SEV_CORRECTABLE -> AER_CORRECTABLE
- *     These need to be reported by the AER driver but no recovery is
- *     necessary.
- * GHES_SEV_RECOVERABLE -> AER_NONFATAL
- * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
- *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
- */
-static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
-{
-#ifdef CONFIG_ACPI_APEI_PCIEAER
-	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
-
-	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
-	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
-		unsigned int devfn;
-		int aer_severity;
-		u8 *aer_info;
-
-		devfn = PCI_DEVFN(pcie_err->device_id.device,
-				  pcie_err->device_id.function);
-		aer_severity = cper_severity_to_aer(gdata->error_severity);
-
-		/*
-		 * If firmware reset the component to contain
-		 * the error, we must reinitialize it before
-		 * use, so treat it as a fatal AER error.
-		 */
-		if (gdata->flags & CPER_SEC_RESET)
-			aer_severity = AER_FATAL;
-
-		aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
-						  sizeof(struct aer_capability_regs));
-		if (!aer_info)
-			return;
-		memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
-
-		aer_recover_queue(pcie_err->device_id.segment,
-				  pcie_err->device_id.bus,
-				  devfn, aer_severity,
-				  (struct aer_capability_regs *)
-				  aer_info);
-	}
-#endif
-}
-
 static void ghes_vendor_record_notifier_destroy(void *nb)
 {
 	ghes_unregister_vendor_record_notifier(nb);
@@ -401,151 +129,11 @@ int devm_ghes_register_vendor_record_notifier(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
 
-static void ghes_log_hwerr(int sev, guid_t *sec_type)
-{
-	if (sev != CPER_SEV_RECOVERABLE)
-		return;
-
-	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
-	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
-	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
-		hwerr_log_error_type(HWERR_RECOV_CPU);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
-	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
-		hwerr_log_error_type(HWERR_RECOV_CXL);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
-	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
-		hwerr_log_error_type(HWERR_RECOV_PCI);
-		return;
-	}
-
-	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
-		hwerr_log_error_type(HWERR_RECOV_MEMORY);
-		return;
-	}
-
-	hwerr_log_error_type(HWERR_RECOV_OTHERS);
-}
-
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
-	int sev, sec_sev;
-	struct acpi_hest_generic_data *gdata;
-	guid_t *sec_type;
-	const guid_t *fru_id = &guid_null;
-	char *fru_text = "";
-	bool queued = false;
-	bool sync = is_hest_sync_notify(ghes);
-
-	sev = ghes_severity(estatus->error_severity);
-	apei_estatus_for_each_section(estatus, gdata) {
-		sec_type = (guid_t *)gdata->section_type;
-		sec_sev = ghes_severity(gdata->error_severity);
-		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
-			fru_id = (guid_t *)gdata->fru_id;
-
-		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
-			fru_text = gdata->fru_text;
-
-		ghes_log_hwerr(sev, sec_type);
-		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
-			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
-			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
-
-			arch_apei_report_mem_error(sev, mem_err);
-			queued = ghes_handle_memory_failure(gdata, sev, sync);
-		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
-			ghes_handle_aer(gdata);
-		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
-			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec);
-		} else if (guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
-			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
-
-			cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec);
-		} else {
-			void *err = acpi_hest_get_payload(gdata);
-
-			ghes_defer_non_standard_event(gdata, sev);
-			log_non_standard_event(sec_type, fru_id, fru_text,
-					       sec_sev, err,
-					       gdata->error_data_length);
-		}
-	}
-
-	/*
-	 * If no memory failure work is queued for abnormal synchronous
-	 * errors, do a force kill.
-	 */
-	if (sync && !queued) {
-		dev_err(ghes->dev,
-			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
-			current->comm, task_pid_nr(current));
-		force_sig(SIGBUS);
-	}
-}
-
-static void __ghes_print_estatus(const char *pfx,
-				 const struct acpi_hest_generic *generic,
-				 const struct acpi_hest_generic_status *estatus)
-{
-	static atomic_t seqno;
-	unsigned int curr_seqno;
-	char pfx_seq[64];
-
-	if (pfx == NULL) {
-		if (ghes_severity(estatus->error_severity) <=
-		    GHES_SEV_CORRECTED)
-			pfx = KERN_WARNING;
-		else
-			pfx = KERN_ERR;
-	}
-	curr_seqno = atomic_inc_return(&seqno);
-	snprintf(pfx_seq, sizeof(pfx_seq), "%s{%u}" HW_ERR, pfx, curr_seqno);
-	printk("%s""Hardware error from APEI Generic Hardware Error Source: %d\n",
-	       pfx_seq, generic->header.source_id);
-	cper_estatus_print(pfx_seq, estatus);
-}
-
-static int ghes_print_estatus(const char *pfx,
-			      const struct acpi_hest_generic *generic,
-			      const struct acpi_hest_generic_status *estatus)
-{
-	/* Not more than 2 messages every 5 seconds */
-	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5*HZ, 2);
-	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
-	struct ratelimit_state *ratelimit;
-
-	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
-		ratelimit = &ratelimit_corrected;
-	else
-		ratelimit = &ratelimit_uncorrected;
-	if (__ratelimit(ratelimit)) {
-		__ghes_print_estatus(pfx, generic, estatus);
-		return 1;
-	}
-	return 0;
+	ghes_cper_handle_status(ghes->dev, ghes->generic,
+				estatus, is_hest_sync_notify(ghes));
 }
 
 static void __ghes_panic(struct ghes *ghes,
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index d7a666a163c3..0ff9d06eb78f 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -13,22 +13,32 @@
  */
 
 #include <linux/aer.h>
+#include <linux/bitfield.h>
+#include <linux/device.h>
 #include <linux/err.h>
 #include <linux/genalloc.h>
-#include <linux/irq_work.h>
 #include <linux/io.h>
+#include <linux/irq_work.h>
 #include <linux/kfifo.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/math64.h>
 #include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/uuid.h>
+#include <linux/sched/signal.h>
+#include <linux/task_work.h>
 #include <linux/notifier.h>
 #include <linux/llist.h>
+#include <linux/ras.h>
+#include <ras/ras_event.h>
 #include <linux/ratelimit.h>
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/sched/clock.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
+#include <linux/vmalloc.h>
 
 #include <acpi/apei.h>
 #include <acpi/ghes_cper.h>
@@ -38,9 +48,363 @@
 
 #include "apei-internal.h"
 
+ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
+
+#ifndef CONFIG_ACPI_APEI
+void __weak arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) { }
+#endif
+
 static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
 static atomic_t ghes_estatus_cache_alloced;
 
+struct gen_pool *ghes_estatus_pool;
+
+int ghes_estatus_pool_init(unsigned int num_ghes)
+{
+	unsigned long addr, len;
+	int rc;
+
+	ghes_estatus_pool = gen_pool_create(GHES_ESTATUS_POOL_MIN_ALLOC_ORDER, -1);
+	if (!ghes_estatus_pool)
+		return -ENOMEM;
+
+	len = GHES_ESTATUS_CACHE_AVG_SIZE * GHES_ESTATUS_CACHE_ALLOCED_MAX;
+	len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
+
+	addr = (unsigned long)vmalloc(PAGE_ALIGN(len));
+	if (!addr)
+		goto err_pool_alloc;
+
+	rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1);
+	if (rc)
+		goto err_pool_add;
+
+	return 0;
+
+err_pool_add:
+	vfree((void *)addr);
+
+err_pool_alloc:
+	gen_pool_destroy(ghes_estatus_pool);
+
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(ghes_estatus_pool_init);
+
+/**
+ * ghes_estatus_pool_region_free - free previously allocated memory
+ *				   from the ghes_estatus_pool.
+ * @addr: address of memory to free.
+ * @size: size of memory to free.
+ *
+ * Returns none.
+ */
+void ghes_estatus_pool_region_free(unsigned long addr, u32 size)
+{
+	gen_pool_free(ghes_estatus_pool, addr, size);
+}
+EXPORT_SYMBOL_GPL(ghes_estatus_pool_region_free);
+
+int ghes_severity(int severity)
+{
+	switch (severity) {
+	case CPER_SEV_INFORMATIONAL:
+		return GHES_SEV_NO;
+	case CPER_SEV_CORRECTED:
+		return GHES_SEV_CORRECTED;
+	case CPER_SEV_RECOVERABLE:
+		return GHES_SEV_RECOVERABLE;
+	case CPER_SEV_FATAL:
+		return GHES_SEV_PANIC;
+	default:
+		/* Unknown, go panic */
+		return GHES_SEV_PANIC;
+	}
+}
+
+/**
+ * struct ghes_task_work - for synchronous RAS event
+ *
+ * @twork:                callback_head for task work
+ * @pfn:                  page frame number of corrupted page
+ * @flags:                work control flags
+ *
+ * Structure to pass task work to be handled before
+ * returning to user-space via task_work_add().
+ */
+struct ghes_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
+{
+	struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork);
+	int ret;
+
+	ret = memory_failure(twcb->pfn, twcb->flags);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb));
+
+	if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP)
+		return;
+
+	pr_err("%#llx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
+	       twcb->pfn, current->comm, task_pid_nr(current));
+	force_sig(SIGBUS);
+}
+
+static bool ghes_do_memory_failure(u64 physical_addr, int flags)
+{
+	struct ghes_task_work *twcb;
+	unsigned long pfn;
+
+	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
+		return false;
+
+	pfn = PHYS_PFN(physical_addr);
+
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return true;
+	}
+
+	memory_failure_queue(pfn, flags);
+	return true;
+}
+
+bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+				int sev, bool sync)
+{
+	int flags = -1;
+	int sec_sev = ghes_severity(gdata->error_severity);
+	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+
+	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
+		return false;
+
+	/* iff following two events can be handled properly by now */
+	if (sec_sev == GHES_SEV_CORRECTED &&
+	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
+		flags = MF_SOFT_OFFLINE;
+	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
+		flags = sync ? MF_ACTION_REQUIRED : 0;
+
+	if (flags != -1)
+		return ghes_do_memory_failure(mem_err->physical_addr, flags);
+
+	return false;
+}
+
+bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+			      int sev, bool sync)
+{
+	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+	int flags = sync ? MF_ACTION_REQUIRED : 0;
+	int length = gdata->error_data_length;
+	char error_type[120];
+	bool queued = false;
+	int sec_sev, i;
+	char *p;
+
+	sec_sev = ghes_severity(gdata->error_severity);
+	if (length >= sizeof(*err)) {
+		log_arm_hw_error(err, sec_sev);
+	} else {
+		pr_warn(FW_BUG "arm error length: %d\n", length);
+		pr_warn(FW_BUG "length is too small\n");
+		pr_warn(FW_BUG "firmware-generated error record is incorrect\n");
+		return false;
+	}
+
+	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
+		return false;
+
+	p = (char *)(err + 1);
+	length -= sizeof(err);
+
+	for (i = 0; i < err->err_info_num; i++) {
+		struct cper_arm_err_info *err_info;
+		bool is_cache, has_pa;
+
+		/* Ensure we have enough data for the error info header */
+		if (length < sizeof(*err_info))
+			break;
+
+		err_info = (struct cper_arm_err_info *)p;
+
+		/* Validate the claimed length before using it */
+		length -= err_info->length;
+		if (length < 0)
+			break;
+
+		is_cache = err_info->type & CPER_ARM_CACHE_ERROR;
+		has_pa = (err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR);
+
+		/*
+		 * The field (err_info->error_info & BIT(26)) is fixed to set to
+		 * 1 in some old firmware of HiSilicon Kunpeng920. We assume that
+		 * firmware won't mix corrected errors in an uncorrected section,
+		 * and don't filter out 'corrected' error here.
+		 */
+		if (is_cache && has_pa) {
+			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
+			p += err_info->length;
+			continue;
+		}
+
+		cper_bits_to_str(error_type, sizeof(error_type),
+				 FIELD_GET(CPER_ARM_ERR_TYPE_MASK, err_info->type),
+				 cper_proc_error_type_strs,
+				 ARRAY_SIZE(cper_proc_error_type_strs));
+
+		pr_warn_ratelimited(FW_WARN GHES_PFX
+				    "Unhandled processor error type 0x%02x: %s%s\n",
+				    err_info->type, error_type,
+				    err_info->type & ~CPER_ARM_ERR_TYPE_MASK ?
+				    " with reserved bit(s)" : "");
+		p += err_info->length;
+	}
+
+	return queued;
+}
+
+/*
+ * PCIe AER errors need to be sent to the AER driver for reporting and
+ * recovery. The GHES severities map to the following AER severities and
+ * require the following handling:
+ *
+ * GHES_SEV_CORRECTABLE -> AER_CORRECTABLE
+ *     These need to be reported by the AER driver but no recovery is
+ *     necessary.
+ * GHES_SEV_RECOVERABLE -> AER_NONFATAL
+ * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
+ *     These both need to be reported and recovered from by the AER driver.
+ * GHES_SEV_PANIC does not make it to this handling since the kernel must
+ *     panic.
+ */
+void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
+		unsigned int devfn;
+		int aer_severity;
+		u8 *aer_info;
+
+		devfn = PCI_DEVFN(pcie_err->device_id.device,
+				  pcie_err->device_id.function);
+		aer_severity = cper_severity_to_aer(gdata->error_severity);
+
+		/*
+		 * If firmware reset the component to contain
+		 * the error, we must reinitialize it before
+		 * use, so treat it as a fatal AER error.
+		 */
+		if (gdata->flags & CPER_SEC_RESET)
+			aer_severity = AER_FATAL;
+
+		aer_info = (void *)gen_pool_alloc(ghes_estatus_pool,
+						  sizeof(struct aer_capability_regs));
+		if (!aer_info)
+			return;
+		memcpy(aer_info, pcie_err->aer_info, sizeof(struct aer_capability_regs));
+
+		aer_recover_queue(pcie_err->device_id.segment,
+				  pcie_err->device_id.bus,
+				  devfn, aer_severity,
+				  (struct aer_capability_regs *)
+				  aer_info);
+	}
+#endif
+}
+
+void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
+void __ghes_print_estatus(const char *pfx,
+			  const struct acpi_hest_generic *generic,
+			  const struct acpi_hest_generic_status *estatus)
+{
+	static atomic_t seqno;
+	unsigned int curr_seqno;
+	char pfx_seq[64];
+
+	if (!pfx) {
+		if (ghes_severity(estatus->error_severity) <=
+		    GHES_SEV_CORRECTED)
+			pfx = KERN_WARNING;
+		else
+			pfx = KERN_ERR;
+	}
+	curr_seqno = atomic_inc_return(&seqno);
+	snprintf(pfx_seq, sizeof(pfx_seq), "%s{%u}" HW_ERR, pfx, curr_seqno);
+	printk("%sHardware error from APEI Generic Hardware Error Source: %d\n",
+	       pfx_seq, generic->header.source_id);
+	cper_estatus_print(pfx_seq, estatus);
+}
+
+int ghes_print_estatus(const char *pfx,
+		       const struct acpi_hest_generic *generic,
+		       const struct acpi_hest_generic_status *estatus)
+{
+	/* Not more than 2 messages every 5 seconds */
+	static DEFINE_RATELIMIT_STATE(ratelimit_corrected, 5 * HZ, 2);
+	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5 * HZ, 2);
+	struct ratelimit_state *ratelimit;
+
+	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+		ratelimit = &ratelimit_corrected;
+	else
+		ratelimit = &ratelimit_uncorrected;
+	if (__ratelimit(ratelimit)) {
+		__ghes_print_estatus(pfx, generic, estatus);
+		return 1;
+	}
+	return 0;
+}
+
+#ifdef CONFIG_ACPI_APEI
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -272,6 +636,7 @@ void ghes_clear_estatus(struct ghes *ghes,
 	if (is_hest_type_generic_v2(ghes))
 		ghes_ack_error(ghes->generic_v2);
 }
+#endif /* CONFIG_ACPI_APEI */
 
 static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
 
@@ -323,6 +688,77 @@ void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
+void ghes_cper_handle_status(struct device *dev,
+			     const struct acpi_hest_generic *generic,
+			     const struct acpi_hest_generic_status *estatus,
+			     bool sync)
+{
+	int sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	guid_t *sec_type;
+	const guid_t *fru_id = &guid_null;
+	char *fru_text = "";
+	bool queued = false;
+
+	sev = ghes_severity(estatus->error_severity);
+	apei_estatus_for_each_section(estatus, gdata) {
+		sec_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_severity(gdata->error_severity);
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+			fru_id = (guid_t *)gdata->fru_id;
+
+		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+			fru_text = gdata->fru_text;
+
+		ghes_log_hwerr(sev, sec_type);
+		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+
+			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
+
+			arch_apei_report_mem_error(sev, mem_err);
+			queued = ghes_handle_memory_failure(gdata, sev, sync);
+		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
+			ghes_handle_aer(gdata);
+		} else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
+			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR)) {
+			struct cxl_cper_sec_prot_err *prot_err = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_prot_err(prot_err, gdata->error_severity);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_GEN_MEDIA, rec);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_DRAM, rec);
+		} else if (guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+			struct cxl_cper_event_rec *rec = acpi_hest_get_payload(gdata);
+
+			cxl_cper_post_event(CXL_CPER_EVENT_MEM_MODULE, rec);
+		} else {
+			void *err = acpi_hest_get_payload(gdata);
+
+			ghes_defer_non_standard_event(gdata, sev);
+			log_non_standard_event(sec_type, fru_id, fru_text,
+					       sec_sev, err,
+					       gdata->error_data_length);
+		}
+	}
+
+	/*
+	 * If no memory failure work is queued for abnormal synchronous
+	 * errors, do a force kill.
+	 */
+	if (sync && !queued) {
+		dev_err(dev,
+			HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
+			current->comm, task_pid_nr(current));
+		force_sig(SIGBUS);
+	}
+}
 /* Room for 8 entries */
 #define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
 static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index dd49e9179b63..511b95b50911 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -17,6 +17,8 @@
 #define ACPI_APEI_GHES_CPER_H
 
 #include <linux/atomic.h>
+#include <linux/device.h>
+#include <linux/notifier.h>
 #include <linux/workqueue.h>
 
 #include <acpi/ghes.h>
@@ -57,6 +59,7 @@
 	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
 
 extern struct gen_pool *ghes_estatus_pool;
+extern struct atomic_notifier_head ghes_report_chain;
 
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 {
@@ -107,6 +110,23 @@ void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
 void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 				   int sev);
+int ghes_severity(int severity);
+bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+				int sev, bool sync);
+bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
+			      int sev, bool sync);
+void ghes_handle_aer(struct acpi_hest_generic_data *gdata);
+void ghes_log_hwerr(int sev, guid_t *sec_type);
+void __ghes_print_estatus(const char *pfx,
+			  const struct acpi_hest_generic *generic,
+			  const struct acpi_hest_generic_status *estatus);
+int ghes_print_estatus(const char *pfx,
+		       const struct acpi_hest_generic *generic,
+		       const struct acpi_hest_generic_status *estatus);
+void ghes_cper_handle_status(struct device *dev,
+			     const struct acpi_hest_generic *generic,
+			     const struct acpi_hest_generic_status *estatus,
+			     bool sync);
 void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
 			    int severity);
 int cxl_cper_register_prot_err_work(struct work_struct *work);

-- 
2.43.0



* [PATCH v4 07/10] ACPI: APEI: introduce GHES helper
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Add a dedicated GHES_CPER_HELPERS Kconfig entry so the shared helper code
can be built even when ACPI_APEI_GHES is disabled. Update the build glue
and headers to depend on the new symbol.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
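Not part of the patch: a minimal userspace sketch of the header pattern
this change relies on, where callers see real prototypes when the helper
code is built and static inline stubs otherwise. DEMO_HELPERS and all
demo_* names are hypothetical stand-ins (for CONFIG_GHES_CPER_HELPERS and
the ghes_estatus_pool_* functions); only the -ENODEV / no-op stub
behaviour is taken from the include/acpi/ghes.h hunk below.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * DEMO_HELPERS is a hypothetical stand-in for a Kconfig symbol.
 * It is deliberately left undefined here, so the stubs are used.
 */
#ifdef DEMO_HELPERS
int demo_pool_init(unsigned int num_sources);
void demo_pool_region_free(unsigned long addr, unsigned int size);
#else
/* Helpers compiled out: init reports -ENODEV, free is a no-op. */
static inline int demo_pool_init(unsigned int num_sources)
{
	(void)num_sources;
	return -ENODEV;
}

static inline void demo_pool_region_free(unsigned long addr,
					 unsigned int size)
{
	(void)addr;
	(void)size;
}
#endif

/* Caller code is identical in both configurations. */
static int demo_probe(unsigned int num_sources)
{
	int ret = demo_pool_init(num_sources);

	/* errno-style convention: 0 on success, negative on failure */
	assert(ret <= 0);
	if (ret)
		fprintf(stderr, "demo: pool unavailable (%d)\n", ret);
	return ret;
}
```

With DEMO_HELPERS undefined, demo_probe() simply propagates -ENODEV, so
callers need no #ifdef of their own.
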
 drivers/Makefile           |  1 +
 drivers/acpi/Kconfig       |  4 ++++
 drivers/acpi/apei/Kconfig  |  1 +
 drivers/acpi/apei/Makefile |  2 +-
 include/acpi/ghes.h        | 10 ++++++----
 include/cxl/event.h        |  2 +-
 6 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 0841ea851847..27a664cb45ea 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -31,6 +31,7 @@ obj-y				+= idle/
 obj-y				+= char/ipmi/
 
 obj-$(CONFIG_ACPI)		+= acpi/
+obj-$(CONFIG_GHES_CPER_HELPERS)	+= acpi/apei/ghes_cper.o
 
 # PnP must come after ACPI since it will eventually need to check if acpi
 # was used and do nothing if so
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index f165d14cf61a..13ef0e99f840 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -6,6 +6,10 @@
 config ARCH_SUPPORTS_ACPI
 	bool
 
+config GHES_CPER_HELPERS
+	bool
+	select UEFI_CPER
+
 menuconfig ACPI
 	bool "ACPI (Advanced Configuration and Power Interface) Support"
 	depends on ARCH_SUPPORTS_ACPI
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 428458c623f0..ddb62638eb02 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -21,6 +21,7 @@ config ACPI_APEI_GHES
 	bool "APEI Generic Hardware Error Source"
 	depends on ACPI_APEI
 	select ACPI_HED
+	select GHES_CPER_HELPERS
 	select IRQ_WORK
 	select GENERIC_ALLOCATOR
 	select ARM_SDE_INTERFACE if ARM64
diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile
index f57f3b009d8e..66588d6be56f 100644
--- a/drivers/acpi/apei/Makefile
+++ b/drivers/acpi/apei/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_ACPI_APEI)		+= apei.o
-obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o ghes_cper.o
+obj-$(CONFIG_ACPI_APEI_GHES)	+= ghes.o
 # clang versions prior to 18 may blow out the stack with KASAN
 ifeq ($(CONFIG_COMPILE_TEST)_$(CONFIG_CC_IS_CLANG)_$(call clang-min-version, 180000),y_y_)
 KASAN_SANITIZE_ghes.o := n
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8d7e5caef3f1..2ffab36b6154 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -83,15 +83,17 @@ int devm_ghes_register_vendor_record_notifier(struct device *dev,
 					      struct notifier_block *nb);
 
 struct list_head *ghes_get_devices(void);
-
-void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
 #else
 static inline struct list_head *ghes_get_devices(void) { return NULL; }
-
-static inline void ghes_estatus_pool_region_free(unsigned long addr, u32 size) { return; }
 #endif
 
+#ifdef CONFIG_GHES_CPER_HELPERS
 int ghes_estatus_pool_init(unsigned int num_ghes);
+void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
+#else
+static inline int ghes_estatus_pool_init(unsigned int num_ghes) { return -ENODEV; }
+static inline void ghes_estatus_pool_region_free(unsigned long addr, u32 size) { }
+#endif
 
 static inline int acpi_hest_get_version(struct acpi_hest_generic_data *gdata)
 {
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ff97fea718d2..2ebd65b0d9d6 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -285,7 +285,7 @@ struct cxl_cper_prot_err_work_data {
 	int severity;
 };
 
-#ifdef CONFIG_ACPI_APEI_GHES
+#ifdef CONFIG_GHES_CPER_HELPERS
 int cxl_cper_register_work(struct work_struct *work);
 int cxl_cper_unregister_work(struct work_struct *work);
 int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);

-- 
2.43.0



* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Christian König @ 2026-05-18 11:58 UTC (permalink / raw)
  To: Julian Orth, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Simona Vetter, Sumit Semwal, Jonathan Corbet,
	Shuah Khan, Arnd Bergmann, Greg Kroah-Hartman
  Cc: dri-devel, linux-kernel, linux-media, linaro-mm-sig, linux-doc,
	wayland-devel
In-Reply-To: <20260516-jorth-syncobj-v1-0-88ede9d98a81@gmail.com>

On 5/16/26 13:06, Julian Orth wrote:
> This series adds a new device /dev/syncobj that can be used to create
> and manipulate DRM syncobjs. Previously, these operations required the
> use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
> and DRIVER_SYNCOBJ_TIMELINE features.
> 
> There are several issues with the existing API:
> 
> - Syncobjs are the only explicit sync mechanism available on wayland.
>   Most compositors do not use GPU waits. Instead, they use the
>   DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
>   DRM devices means that compositors cannot consistently offer this
>   feature even though no device-specific logic is involved.

Well, the drm_syncobj is a container for device-specific dma fences.

What could be possible instead is to pass an eventfd into Wayland, but that is something userspace needs to decide.

> - llvmpipe currently cannot offer syncobj interop because it does not
>   have access to a DRM device. This means that applications using
>   llvmpipe cannot present images before they have finished rendering,
>   despite llvmpipe using threaded rendering.

Yeah, but that is completely intentional. You *CAN'T* use a dma_fence as a completion event for llvmpipe rendering. See the kernel documentation on that.

What could be possible is to use the drm_syncobj wait-before-signal functionality, but that has different semantics.

Regards,
Christian.

> - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
>   for devices that support the syncobj ioctls in order to use the
>   wayland syncobj protocol.
> - Similarly, clients that want to use screen capture have no equivalent
>   to the WSI and are therefore forced into that path.
> - Having to keep a DRM device open has potentially negative interactions
>   with GPU hotplug.
> - Having to translate between syncobj FDs and handles is troublesome in
>   the compositor usecase since syncobjs come and go frequently and need
>   to be cleaned up when clients disconnect.
> 
> /dev/syncobj solves these issues by providing all syncobj ioctls under a
> consistent path that is not tied to any DRM device. It also operates
> directly on file descriptors instead of syncobj handles.
> 
> The series starts with a number of small refactorings in drm_syncobj.c
> to make its functionality available outside of the file and without the
> need for drm_file/handle pairs.
> 
> The last commit adds the /dev/syncobj module. I've added it as a misc
> device but maybe this should instead live somewhere under gpu/drm.
> 
> An application using the new interface can be found at [1].
> 
> [1]: https://github.com/mahkoh/jay/pull/947
> 
> ---
> Julian Orth (12):
>       drm/syncobj: add drm_syncobj_from_fd
>       drm/syncobj: add drm_syncobj_fence_lookup
>       drm/syncobj: make drm_syncobj_array_wait_timeout public
>       drm/syncobj: add drm_syncobj_register_eventfd
>       drm/syncobj: have transfer functions accept drm_syncobj directly
>       drm/syncobj: add drm_syncobj_transfer
>       drm/syncobj: add drm_syncobj_timeline_signal
>       drm/syncobj: add drm_syncobj_query
>       drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
>       drm/syncobj: add drm_syncobj_import_sync_file
>       drm/syncobj: add drm_syncobj_export_sync_file
>       misc/syncobj: add new device
> 
>  Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
>  drivers/gpu/drm/drm_syncobj.c                      | 374 ++++++++++++++-----
>  drivers/misc/Kconfig                               |  10 +
>  drivers/misc/Makefile                              |   1 +
>  drivers/misc/syncobj.c                             | 404 +++++++++++++++++++++
>  include/drm/drm_syncobj.h                          |  21 ++
>  include/uapi/linux/syncobj.h                       |  75 ++++
>  7 files changed, 795 insertions(+), 91 deletions(-)
> ---
> base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
> change-id: 20260516-jorth-syncobj-d4d374c8c61b
> 
> Best regards,
> --  
> Julian Orth <ju.orth@gmail.com>
> 



* [PATCH v4 06/10] ACPI: APEI: GHES: move CXL CPER helpers
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Move the CXL CPER handling paths out of ghes.c and into ghes_cper.c so the
helpers can be reused. The code is moved as-is, with the public
prototypes updated so GHES keeps calling into the new translation unit.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
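Not part of the patch: a rough userspace sketch of the pattern the moved
CXL helpers follow, namely a single registered consumer plus a small
bounded FIFO that the producer fills before kicking the consumer. All
demo_* names are hypothetical. The spinlock is omitted and a direct
function call stands in for schedule_work(), so this only illustrates
the control flow, not the kernel's concurrency handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_FIFO_DEPTH 8	/* mirrors the 8-entry FIFO in the patch */

struct demo_event {
	int severity;
};

typedef void (*demo_work_fn)(void);

static struct demo_event demo_fifo[DEMO_FIFO_DEPTH];
static unsigned int demo_in, demo_out;	/* free-running indices */
static demo_work_fn demo_work;		/* the single registered consumer */
static int demo_kicks;			/* how often the consumer was kicked */

/* Example consumer callback: just counts invocations. */
static void demo_kick(void)
{
	demo_kicks++;
}

/* Only one consumer may be registered at a time. */
int demo_register_work(demo_work_fn fn)
{
	assert(fn != NULL);	/* sketch-only precondition */
	if (demo_work)
		return -EINVAL;
	demo_work = fn;
	return 0;
}

/* Only the currently registered consumer may unregister itself. */
int demo_unregister_work(demo_work_fn fn)
{
	if (demo_work != fn)
		return -EINVAL;
	demo_work = NULL;
	return 0;
}

/* Producer: drop the event if nobody listens or the FIFO is full. */
bool demo_post_event(int severity)
{
	if (!demo_work)
		return false;
	if (demo_in - demo_out == DEMO_FIFO_DEPTH)
		return false;
	demo_fifo[demo_in++ % DEMO_FIFO_DEPTH].severity = severity;
	demo_work();		/* stands in for schedule_work() */
	return true;
}

/* Consumer: pop the oldest queued event, if any. */
bool demo_fifo_get(struct demo_event *ev)
{
	if (demo_in == demo_out)
		return false;
	*ev = demo_fifo[demo_out++ % DEMO_FIFO_DEPTH];
	return true;
}
```

Events posted before registration are dropped rather than queued, which
matches the early return on a NULL work pointer in the moved code.
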
 drivers/acpi/apei/ghes.c      | 132 -----------------------------------------
 drivers/acpi/apei/ghes_cper.c | 134 ++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |  11 ++++
 3 files changed, 145 insertions(+), 132 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 81ac51632f21..85be2ebf4d3e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -383,69 +383,6 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
-/* Room for 8 entries */
-#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
-static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
-		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
-
-/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
-static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
-struct work_struct *cxl_cper_prot_err_work;
-
-static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
-				   int severity)
-{
-#ifdef CONFIG_ACPI_APEI_PCIEAER
-	struct cxl_cper_prot_err_work_data wd;
-
-	if (cxl_cper_sec_prot_err_valid(prot_err))
-		return;
-
-	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
-
-	if (!cxl_cper_prot_err_work)
-		return;
-
-	if (cxl_cper_setup_prot_err_work_data(&wd, prot_err, severity))
-		return;
-
-	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
-		pr_err_ratelimited("CXL CPER kfifo overflow\n");
-		return;
-	}
-
-	schedule_work(cxl_cper_prot_err_work);
-#endif
-}
-
-int cxl_cper_register_prot_err_work(struct work_struct *work)
-{
-	if (cxl_cper_prot_err_work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_prot_err_work_lock);
-	cxl_cper_prot_err_work = work;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
-
-int cxl_cper_unregister_prot_err_work(struct work_struct *work)
-{
-	if (cxl_cper_prot_err_work != work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_prot_err_work_lock);
-	cxl_cper_prot_err_work = NULL;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
-
-int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
-{
-	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
-
 static void ghes_vendor_record_notifier_destroy(void *nb)
 {
 	ghes_unregister_vendor_record_notifier(nb);
@@ -464,75 +401,6 @@ int devm_ghes_register_vendor_record_notifier(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
 
-/* Room for 8 entries for each of the 4 event log queues */
-#define CXL_CPER_FIFO_DEPTH 32
-DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
-
-/* Synchronize schedule_work() with cxl_cper_work changes */
-static DEFINE_SPINLOCK(cxl_cper_work_lock);
-struct work_struct *cxl_cper_work;
-
-static void cxl_cper_post_event(enum cxl_event_type event_type,
-				struct cxl_cper_event_rec *rec)
-{
-	struct cxl_cper_work_data wd;
-
-	if (rec->hdr.length <= sizeof(rec->hdr) ||
-	    rec->hdr.length > sizeof(*rec)) {
-		pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n",
-		       rec->hdr.length);
-		return;
-	}
-
-	if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) {
-		pr_err(FW_WARN "CXL CPER invalid event\n");
-		return;
-	}
-
-	guard(spinlock_irqsave)(&cxl_cper_work_lock);
-
-	if (!cxl_cper_work)
-		return;
-
-	wd.event_type = event_type;
-	memcpy(&wd.rec, rec, sizeof(wd.rec));
-
-	if (!kfifo_put(&cxl_cper_fifo, wd)) {
-		pr_err_ratelimited("CXL CPER kfifo overflow\n");
-		return;
-	}
-
-	schedule_work(cxl_cper_work);
-}
-
-int cxl_cper_register_work(struct work_struct *work)
-{
-	if (cxl_cper_work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_work_lock);
-	cxl_cper_work = work;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_work, "CXL");
-
-int cxl_cper_unregister_work(struct work_struct *work)
-{
-	if (cxl_cper_work != work)
-		return -EINVAL;
-
-	guard(spinlock)(&cxl_cper_work_lock);
-	cxl_cper_work = NULL;
-	return 0;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_work, "CXL");
-
-int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
-{
-	return kfifo_get(&cxl_cper_fifo, wd);
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
-
 static void ghes_log_hwerr(int sev, guid_t *sec_type)
 {
 	if (sev != CPER_SEV_RECOVERABLE)
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 131980d36064..d7a666a163c3 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -12,10 +12,12 @@
  *   Author: Huang Ying <ying.huang@intel.com>
  */
 
+#include <linux/aer.h>
 #include <linux/err.h>
 #include <linux/genalloc.h>
 #include <linux/irq_work.h>
 #include <linux/io.h>
+#include <linux/kfifo.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/math64.h>
@@ -321,6 +323,138 @@ void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 	schedule_work(&entry->work);
 }
 
+/* Room for 8 entries */
+#define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
+static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
+		    CXL_CPER_PROT_ERR_FIFO_DEPTH);
+
+/* Synchronize schedule_work() with cxl_cper_prot_err_work changes */
+static DEFINE_SPINLOCK(cxl_cper_prot_err_work_lock);
+struct work_struct *cxl_cper_prot_err_work;
+
+void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+			    int severity)
+{
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct cxl_cper_prot_err_work_data wd;
+
+	if (cxl_cper_sec_prot_err_valid(prot_err))
+		return;
+
+	guard(spinlock_irqsave)(&cxl_cper_prot_err_work_lock);
+
+	if (!cxl_cper_prot_err_work)
+		return;
+
+	if (cxl_cper_setup_prot_err_work_data(&wd, prot_err, severity))
+		return;
+
+	if (!kfifo_put(&cxl_cper_prot_err_fifo, wd)) {
+		pr_err_ratelimited("CXL CPER kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_cper_prot_err_work);
+#endif
+}
+
+int cxl_cper_register_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = work;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
+
+int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+{
+	if (cxl_cper_prot_err_work != work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_prot_err_work_lock);
+	cxl_cper_prot_err_work = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
+
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
+{
+	return kfifo_get(&cxl_cper_prot_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
+
+/* Room for 8 entries for each of the 4 event log queues */
+#define CXL_CPER_FIFO_DEPTH 32
+static DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
+
+/* Synchronize schedule_work() with cxl_cper_work changes */
+static DEFINE_SPINLOCK(cxl_cper_work_lock);
+struct work_struct *cxl_cper_work;
+
+void cxl_cper_post_event(enum cxl_event_type event_type,
+			 struct cxl_cper_event_rec *rec)
+{
+	struct cxl_cper_work_data wd;
+
+	if (rec->hdr.length <= sizeof(rec->hdr) ||
+	    rec->hdr.length > sizeof(*rec)) {
+		pr_err(FW_WARN "CXL CPER Invalid section length (%u)\n",
+		       rec->hdr.length);
+		return;
+	}
+
+	if (!(rec->hdr.validation_bits & CPER_CXL_COMP_EVENT_LOG_VALID)) {
+		pr_err(FW_WARN "CXL CPER invalid event\n");
+		return;
+	}
+
+	guard(spinlock_irqsave)(&cxl_cper_work_lock);
+
+	if (!cxl_cper_work)
+		return;
+
+	wd.event_type = event_type;
+	memcpy(&wd.rec, rec, sizeof(wd.rec));
+
+	if (!kfifo_put(&cxl_cper_fifo, wd)) {
+		pr_err_ratelimited("CXL CPER kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_cper_work);
+}
+
+int cxl_cper_register_work(struct work_struct *work)
+{
+	if (cxl_cper_work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_work_lock);
+	cxl_cper_work = work;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_register_work, "CXL");
+
+int cxl_cper_unregister_work(struct work_struct *work)
+{
+	if (cxl_cper_work != work)
+		return -EINVAL;
+
+	guard(spinlock)(&cxl_cper_work_lock);
+	cxl_cper_work = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_work, "CXL");
+
+int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
+{
+	return kfifo_get(&cxl_cper_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
+
 /*
  * GHES error status reporting throttle, to report more kinds of
  * errors, instead of just most frequently occurred errors.
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 51725f25c516..dd49e9179b63 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -20,6 +20,7 @@
 #include <linux/workqueue.h>
 
 #include <acpi/ghes.h>
+#include <cxl/event.h>
 
 #define GHES_PFX	"GHES: "
 
@@ -106,5 +107,15 @@ void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
 void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
 				   int sev);
+void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
+			    int severity);
+int cxl_cper_register_prot_err_work(struct work_struct *work);
+int cxl_cper_unregister_prot_err_work(struct work_struct *work);
+int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
+void cxl_cper_post_event(enum cxl_event_type event_type,
+			 struct cxl_cper_event_rec *rec);
+int cxl_cper_register_work(struct work_struct *work);
+int cxl_cper_unregister_work(struct work_struct *work);
+int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0



* [PATCH v4 05/10] ACPI: APEI: GHES: move vendor record helpers
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Shift the vendor record workqueue helpers into ghes_cper.c so both GHES
and future DT-based providers can use the same implementation. The change
is mechanical and keeps the notifier behavior identical.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
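Not part of the patch: a simplified userspace sketch of the notifier
idea the vendor record helpers build on. Interested parties register a
callback block, and each record is handed to every registered block
together with its severity. All demo_* names are hypothetical. The
kernel's blocking notifier chains also provide locking and priority
ordering, both of which are left out here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long sev, void *data);
	struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;	/* head of the chain */
static unsigned long demo_last_sev;		/* set by the example callback */

/* Example callback: remember the severity of the last record seen. */
static int demo_record_sev(struct demo_notifier *nb, unsigned long sev,
			   void *data)
{
	(void)nb;
	(void)data;
	demo_last_sev = sev;
	return 0;
}

/* Add a block at the head of the chain. */
void demo_notifier_register(struct demo_notifier *nb)
{
	assert(nb && nb->call);
	nb->next = demo_chain;
	demo_chain = nb;
}

/* Unlink a block; silently does nothing if it is not on the chain. */
void demo_notifier_unregister(struct demo_notifier *nb)
{
	struct demo_notifier **pp;

	for (pp = &demo_chain; *pp; pp = &(*pp)->next) {
		if (*pp == nb) {
			*pp = nb->next;
			nb->next = NULL;
			return;
		}
	}
}

/* Invoke every registered block; returns how many were called. */
int demo_notifier_call_chain(unsigned long sev, void *data)
{
	struct demo_notifier *nb;
	int called = 0;

	for (nb = demo_chain; nb; nb = nb->next) {
		nb->call(nb, sev, data);
		called++;
	}
	return called;
}
```

In the patch itself the chain is called from a work item operating on a
copy of the record, so the callbacks run outside the original error
handling context. The sketch calls the chain synchronously instead.
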
 drivers/acpi/apei/ghes.c      | 86 +++++++++----------------------------------
 drivers/acpi/apei/ghes_cper.c | 55 +++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |  2 +
 3 files changed, 75 insertions(+), 68 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index adab7404310e..81ac51632f21 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -383,74 +383,6 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
-static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
-
-int ghes_register_vendor_record_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&vendor_record_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(ghes_register_vendor_record_notifier);
-
-void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
-{
-	blocking_notifier_chain_unregister(&vendor_record_notify_list, nb);
-}
-EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
-
-static void ghes_vendor_record_notifier_destroy(void *nb)
-{
-	ghes_unregister_vendor_record_notifier(nb);
-}
-
-int devm_ghes_register_vendor_record_notifier(struct device *dev,
-					      struct notifier_block *nb)
-{
-	int ret;
-
-	ret = ghes_register_vendor_record_notifier(nb);
-	if (ret)
-		return ret;
-
-	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
-}
-EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
-
-static void ghes_vendor_record_work_func(struct work_struct *work)
-{
-	struct ghes_vendor_record_entry *entry;
-	struct acpi_hest_generic_data *gdata;
-	u32 len;
-
-	entry = container_of(work, struct ghes_vendor_record_entry, work);
-	gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
-
-	blocking_notifier_call_chain(&vendor_record_notify_list,
-				     entry->error_severity, gdata);
-
-	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
-	gen_pool_free(ghes_estatus_pool, (unsigned long)entry, len);
-}
-
-static void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
-					  int sev)
-{
-	struct acpi_hest_generic_data *copied_gdata;
-	struct ghes_vendor_record_entry *entry;
-	u32 len;
-
-	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
-	entry = (void *)gen_pool_alloc(ghes_estatus_pool, len);
-	if (!entry)
-		return;
-
-	copied_gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
-	memcpy(copied_gdata, gdata, acpi_hest_get_record_size(gdata));
-	entry->error_severity = sev;
-
-	INIT_WORK(&entry->work, ghes_vendor_record_work_func);
-	schedule_work(&entry->work);
-}
-
 /* Room for 8 entries */
 #define CXL_CPER_PROT_ERR_FIFO_DEPTH 8
 static DEFINE_KFIFO(cxl_cper_prot_err_fifo, struct cxl_cper_prot_err_work_data,
@@ -514,6 +446,24 @@ int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
 
+static void ghes_vendor_record_notifier_destroy(void *nb)
+{
+	ghes_unregister_vendor_record_notifier(nb);
+}
+
+int devm_ghes_register_vendor_record_notifier(struct device *dev,
+					      struct notifier_block *nb)
+{
+	int ret;
+
+	ret = ghes_register_vendor_record_notifier(nb);
+	if (ret)
+		return ret;
+
+	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
+}
+EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
+
 /* Room for 8 entries for each of the 4 event log queues */
 #define CXL_CPER_FIFO_DEPTH 32
 DEFINE_KFIFO(cxl_cper_fifo, struct cxl_cper_work_data, CXL_CPER_FIFO_DEPTH);
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 0a117f478afb..131980d36064 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -14,12 +14,17 @@
 
 #include <linux/err.h>
 #include <linux/genalloc.h>
+#include <linux/irq_work.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/list.h>
 #include <linux/math64.h>
 #include <linux/mm.h>
+#include <linux/notifier.h>
+#include <linux/llist.h>
 #include <linux/ratelimit.h>
 #include <linux/rcupdate.h>
+#include <linux/rculist.h>
 #include <linux/sched/clock.h>
 #include <linux/slab.h>
 
@@ -266,6 +271,56 @@ void ghes_clear_estatus(struct ghes *ghes,
 		ghes_ack_error(ghes->generic_v2);
 }
 
+static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);
+
+int ghes_register_vendor_record_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&vendor_record_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_register_vendor_record_notifier);
+
+void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&vendor_record_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
+
+static void ghes_vendor_record_work_func(struct work_struct *work)
+{
+	struct ghes_vendor_record_entry *entry;
+	struct acpi_hest_generic_data *gdata;
+	u32 len;
+
+	entry = container_of(work, struct ghes_vendor_record_entry, work);
+	gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
+
+	blocking_notifier_call_chain(&vendor_record_notify_list,
+				     entry->error_severity, gdata);
+
+	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
+	gen_pool_free(ghes_estatus_pool, (unsigned long)entry, len);
+}
+
+void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
+				   int sev)
+{
+	struct acpi_hest_generic_data *copied_gdata;
+	struct ghes_vendor_record_entry *entry;
+	u32 len;
+
+	len = GHES_VENDOR_ENTRY_LEN(acpi_hest_get_record_size(gdata));
+	entry = (void *)gen_pool_alloc(ghes_estatus_pool, len);
+	if (!entry)
+		return;
+
+	copied_gdata = GHES_GDATA_FROM_VENDOR_ENTRY(entry);
+	memcpy(copied_gdata, gdata, acpi_hest_get_record_size(gdata));
+	entry->error_severity = sev;
+
+	INIT_WORK(&entry->work, ghes_vendor_record_work_func);
+	schedule_work(&entry->work);
+}
+
 /*
  * GHES error status reporting throttle, to report more kinds of
  * errors, instead of just most frequently occurred errors.
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 1b5dbeca9bb6..51725f25c516 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -104,5 +104,7 @@ int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 int ghes_estatus_cached(struct acpi_hest_generic_status *estatus);
 void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
 			    struct acpi_hest_generic_status *estatus);
+void ghes_defer_non_standard_event(struct acpi_hest_generic_data *gdata,
+				   int sev);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0



* [PATCH v4 04/10] ACPI: APEI: GHES: move estatus cache helpers
From: Ahmed Tiba @ 2026-05-18 11:57 UTC (permalink / raw)
  To: rafael, bp, saket.dumbre, will, xueshuai, mchehab, krzk+dt, dave,
	conor+dt, vishal.l.verma, jic23, corbet, guohanjun, dave.jiang,
	catalin.marinas, lenb, tony.luck, skhan, djbw, alison.schofield,
	ira.weiny, robh
  Cc: Ahmed Tiba, devicetree, linux-acpi, linux-doc, Dmitry.Lamerov,
	linux-cxl, Michael.Zhao2, acpica-devel, linux-kernel,
	linux-arm-kernel, linux-edac
In-Reply-To: <20260518-topics-ahmtib01-ras_ffh_arm_internal_review-v4-0-42698675ba61@arm.com>

Relocate the estatus cache allocation and lookup helpers from ghes.c into
ghes_cper.c. This code move keeps the logic intact while making the cache
implementation available to forthcoming users.

Signed-off-by: Ahmed Tiba <ahmed.tiba@arm.com>
---
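Not part of the patch: a single-threaded userspace sketch of the
throttling idea behind the estatus cache. An error record identical to
one cached within a time window counts as already reported. A new record
goes into an empty or expired slot, or otherwise replaces the entry that
is hit least often per unit time. All demo_* names, the slot count and
the window length are assumptions for illustration. The RCU and atomic
handling of the real code is omitted, and the caller passes the
timestamp explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLOTS	4
#define DEMO_MAX_LEN	32
#define DEMO_WINDOW_NS	10000000000ULL	/* hypothetical 10 s window */

struct demo_entry {
	bool used;
	size_t len;
	uint64_t time_in;	/* when the record was cached */
	unsigned int count;	/* hits since it was cached */
	unsigned char data[DEMO_MAX_LEN];
};

static struct demo_entry demo_cache[DEMO_SLOTS];

/* True if an identical record was cached within the window. */
bool demo_cached(const void *rec, size_t len, uint64_t now)
{
	for (int i = 0; i < DEMO_SLOTS; i++) {
		struct demo_entry *e = &demo_cache[i];

		if (!e->used || e->len != len || memcmp(rec, e->data, len))
			continue;
		e->count++;
		return now - e->time_in < DEMO_WINDOW_NS;
	}
	return false;
}

/* Cache a record, evicting per the slot selection described above. */
void demo_cache_add(const void *rec, size_t len, uint64_t now)
{
	struct demo_entry *victim;
	uint64_t max_period = 0;
	int slot = -1;

	assert(len <= DEMO_MAX_LEN);
	for (int i = 0; i < DEMO_SLOTS; i++) {
		struct demo_entry *e = &demo_cache[i];
		uint64_t duration, period;

		if (!e->used) {
			slot = i;
			break;
		}
		duration = now - e->time_in;
		if (duration >= DEMO_WINDOW_NS) {
			slot = i;
			break;
		}
		/* Average time between hits: larger means less frequent. */
		period = duration / (e->count + 1);
		if (period > max_period) {
			max_period = period;
			slot = i;
		}
	}
	if (slot < 0)
		return;

	victim = &demo_cache[slot];
	memcpy(victim->data, rec, len);
	victim->len = len;
	victim->time_in = now;
	victim->count = 0;
	victim->used = true;
}
```

A caller would typically check demo_cached() first and report and add
the record only on a miss, so rarely seen errors are not drowned out by
a frequently repeating one.
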
 drivers/acpi/apei/ghes.c      | 138 +----------------------------------------
 drivers/acpi/apei/ghes_cper.c | 140 ++++++++++++++++++++++++++++++++++++++++++
 include/acpi/ghes_cper.h      |   6 ++
 3 files changed, 147 insertions(+), 137 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 91638ae7e05e..adab7404310e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -113,10 +113,7 @@ static DEFINE_MUTEX(ghes_devs_mutex);
  */
 static DEFINE_SPINLOCK(ghes_notify_lock_irq);
 
-static struct gen_pool *ghes_estatus_pool;
-
-static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
-static atomic_t ghes_estatus_cache_alloced;
+struct gen_pool *ghes_estatus_pool;
 
 int ghes_estatus_pool_init(unsigned int num_ghes)
 {
@@ -733,139 +730,6 @@ static int ghes_print_estatus(const char *pfx,
 	return 0;
 }
 
-/*
- * GHES error status reporting throttle, to report more kinds of
- * errors, instead of just most frequently occurred errors.
- */
-static int ghes_estatus_cached(struct acpi_hest_generic_status *estatus)
-{
-	u32 len;
-	int i, cached = 0;
-	unsigned long long now;
-	struct ghes_estatus_cache *cache;
-	struct acpi_hest_generic_status *cache_estatus;
-
-	len = cper_estatus_len(estatus);
-	rcu_read_lock();
-	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
-		cache = rcu_dereference(ghes_estatus_caches[i]);
-		if (cache == NULL)
-			continue;
-		if (len != cache->estatus_len)
-			continue;
-		cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
-		if (memcmp(estatus, cache_estatus, len))
-			continue;
-		atomic_inc(&cache->count);
-		now = sched_clock();
-		if (now - cache->time_in < GHES_ESTATUS_IN_CACHE_MAX_NSEC)
-			cached = 1;
-		break;
-	}
-	rcu_read_unlock();
-	return cached;
-}
-
-static struct ghes_estatus_cache *ghes_estatus_cache_alloc(
-	struct acpi_hest_generic *generic,
-	struct acpi_hest_generic_status *estatus)
-{
-	int alloced;
-	u32 len, cache_len;
-	struct ghes_estatus_cache *cache;
-	struct acpi_hest_generic_status *cache_estatus;
-
-	alloced = atomic_add_return(1, &ghes_estatus_cache_alloced);
-	if (alloced > GHES_ESTATUS_CACHE_ALLOCED_MAX) {
-		atomic_dec(&ghes_estatus_cache_alloced);
-		return NULL;
-	}
-	len = cper_estatus_len(estatus);
-	cache_len = GHES_ESTATUS_CACHE_LEN(len);
-	cache = (void *)gen_pool_alloc(ghes_estatus_pool, cache_len);
-	if (!cache) {
-		atomic_dec(&ghes_estatus_cache_alloced);
-		return NULL;
-	}
-	cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
-	memcpy(cache_estatus, estatus, len);
-	cache->estatus_len = len;
-	atomic_set(&cache->count, 0);
-	cache->generic = generic;
-	cache->time_in = sched_clock();
-	return cache;
-}
-
-static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
-{
-	struct ghes_estatus_cache *cache;
-	u32 len;
-
-	cache = container_of(head, struct ghes_estatus_cache, rcu);
-	len = cper_estatus_len(GHES_ESTATUS_FROM_CACHE(cache));
-	len = GHES_ESTATUS_CACHE_LEN(len);
-	gen_pool_free(ghes_estatus_pool, (unsigned long)cache, len);
-	atomic_dec(&ghes_estatus_cache_alloced);
-}
-
-static void
-ghes_estatus_cache_add(struct acpi_hest_generic *generic,
-		       struct acpi_hest_generic_status *estatus)
-{
-	unsigned long long now, duration, period, max_period = 0;
-	struct ghes_estatus_cache *cache, *new_cache;
-	struct ghes_estatus_cache __rcu *victim;
-	int i, slot = -1, count;
-
-	new_cache = ghes_estatus_cache_alloc(generic, estatus);
-	if (!new_cache)
-		return;
-
-	rcu_read_lock();
-	now = sched_clock();
-	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
-		cache = rcu_dereference(ghes_estatus_caches[i]);
-		if (cache == NULL) {
-			slot = i;
-			break;
-		}
-		duration = now - cache->time_in;
-		if (duration >= GHES_ESTATUS_IN_CACHE_MAX_NSEC) {
-			slot = i;
-			break;
-		}
-		count = atomic_read(&cache->count);
-		period = duration;
-		do_div(period, (count + 1));
-		if (period > max_period) {
-			max_period = period;
-			slot = i;
-		}
-	}
-	rcu_read_unlock();
-
-	if (slot != -1) {
-		/*
-		 * Use release semantics to ensure that ghes_estatus_cached()
-		 * running on another CPU will see the updated cache fields if
-		 * it can see the new value of the pointer.
-		 */
-		victim = xchg_release(&ghes_estatus_caches[slot],
-				      RCU_INITIALIZER(new_cache));
-
-		/*
-		 * At this point, victim may point to a cached item different
-		 * from the one based on which we selected the slot. Instead of
-		 * going to the loop again to pick another slot, let's just
-		 * drop the other item anyway: this may cause a false cache
-		 * miss later on, but that won't cause any problems.
-		 */
-		if (victim)
-			call_rcu(&unrcu_pointer(victim)->rcu,
-				 ghes_estatus_cache_rcu_free);
-	}
-}
-
 static void __ghes_panic(struct ghes *ghes,
 			 struct acpi_hest_generic_status *estatus,
 			 u64 buf_paddr, enum fixed_addresses fixmap_idx)
diff --git a/drivers/acpi/apei/ghes_cper.c b/drivers/acpi/apei/ghes_cper.c
index 8080e0f76dac..0a117f478afb 100644
--- a/drivers/acpi/apei/ghes_cper.c
+++ b/drivers/acpi/apei/ghes_cper.c
@@ -13,10 +13,14 @@
  */
 
 #include <linux/err.h>
+#include <linux/genalloc.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/math64.h>
 #include <linux/mm.h>
 #include <linux/ratelimit.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/clock.h>
 #include <linux/slab.h>
 
 #include <acpi/apei.h>
@@ -27,6 +31,9 @@
 
 #include "apei-internal.h"
 
+static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
+static atomic_t ghes_estatus_cache_alloced;
+
 static void __iomem *ghes_map(u64 pfn, enum fixed_addresses fixmap_idx)
 {
 	phys_addr_t paddr;
@@ -258,3 +265,136 @@ void ghes_clear_estatus(struct ghes *ghes,
 	if (is_hest_type_generic_v2(ghes))
 		ghes_ack_error(ghes->generic_v2);
 }
+
+/*
+ * GHES error status reporting throttle, to report more kinds of
+ * errors, instead of just most frequently occurred errors.
+ */
+int ghes_estatus_cached(struct acpi_hest_generic_status *estatus)
+{
+	u32 len;
+	int i, cached = 0;
+	unsigned long long now;
+	struct ghes_estatus_cache *cache;
+	struct acpi_hest_generic_status *cache_estatus;
+
+	len = cper_estatus_len(estatus);
+	rcu_read_lock();
+	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
+		cache = rcu_dereference(ghes_estatus_caches[i]);
+		if (cache == NULL)
+			continue;
+		if (len != cache->estatus_len)
+			continue;
+		cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
+		if (memcmp(estatus, cache_estatus, len))
+			continue;
+		atomic_inc(&cache->count);
+		now = sched_clock();
+		if (now - cache->time_in < GHES_ESTATUS_IN_CACHE_MAX_NSEC)
+			cached = 1;
+		break;
+	}
+	rcu_read_unlock();
+	return cached;
+}
+
+static struct ghes_estatus_cache *ghes_estatus_cache_alloc(
+	struct acpi_hest_generic *generic,
+	struct acpi_hest_generic_status *estatus)
+{
+	int alloced;
+	u32 len, cache_len;
+	struct ghes_estatus_cache *cache;
+	struct acpi_hest_generic_status *cache_estatus;
+
+	alloced = atomic_add_return(1, &ghes_estatus_cache_alloced);
+	if (alloced > GHES_ESTATUS_CACHE_ALLOCED_MAX) {
+		atomic_dec(&ghes_estatus_cache_alloced);
+		return NULL;
+	}
+	len = cper_estatus_len(estatus);
+	cache_len = GHES_ESTATUS_CACHE_LEN(len);
+	cache = (void *)gen_pool_alloc(ghes_estatus_pool, cache_len);
+	if (cache == NULL) {
+		atomic_dec(&ghes_estatus_cache_alloced);
+		return NULL;
+	}
+	cache_estatus = GHES_ESTATUS_FROM_CACHE(cache);
+	memcpy(cache_estatus, estatus, len);
+	cache->estatus_len = len;
+	atomic_set(&cache->count, 0);
+	cache->generic = generic;
+	cache->time_in = sched_clock();
+	return cache;
+}
+
+static void ghes_estatus_cache_rcu_free(struct rcu_head *head)
+{
+	struct ghes_estatus_cache *cache;
+	u32 len;
+
+	cache = container_of(head, struct ghes_estatus_cache, rcu);
+	len = cper_estatus_len(GHES_ESTATUS_FROM_CACHE(cache));
+	len = GHES_ESTATUS_CACHE_LEN(len);
+	gen_pool_free(ghes_estatus_pool, (unsigned long)cache, len);
+	atomic_dec(&ghes_estatus_cache_alloced);
+}
+
+void
+ghes_estatus_cache_add(struct acpi_hest_generic *generic,
+		       struct acpi_hest_generic_status *estatus)
+{
+	unsigned long long now, duration, period, max_period = 0;
+	struct ghes_estatus_cache *cache, *new_cache;
+	struct ghes_estatus_cache __rcu *victim;
+	int i, slot = -1, count;
+
+	new_cache = ghes_estatus_cache_alloc(generic, estatus);
+	if (!new_cache)
+		return;
+
+	rcu_read_lock();
+	now = sched_clock();
+	for (i = 0; i < GHES_ESTATUS_CACHES_SIZE; i++) {
+		cache = rcu_dereference(ghes_estatus_caches[i]);
+		if (cache == NULL) {
+			slot = i;
+			break;
+		}
+		duration = now - cache->time_in;
+		if (duration >= GHES_ESTATUS_IN_CACHE_MAX_NSEC) {
+			slot = i;
+			break;
+		}
+		count = atomic_read(&cache->count);
+		period = duration;
+		do_div(period, (count + 1));
+		if (period > max_period) {
+			max_period = period;
+			slot = i;
+		}
+	}
+	rcu_read_unlock();
+
+	if (slot != -1) {
+		/*
+		 * Use release semantics to ensure that ghes_estatus_cached()
+		 * running on another CPU will see the updated cache fields if
+		 * it can see the new value of the pointer.
+		 */
+		victim = xchg_release(&ghes_estatus_caches[slot],
+				      RCU_INITIALIZER(new_cache));
+
+		/*
+		 * At this point, victim may point to a cached item different
+		 * from the one based on which we selected the slot. Instead of
+		 * going to the loop again to pick another slot, let's just
+		 * drop the other item anyway: this may cause a false cache
+		 * miss later on, but that won't cause any problems.
+		 */
+		if (victim)
+			call_rcu(&unrcu_pointer(victim)->rcu,
+				 ghes_estatus_cache_rcu_free);
+	}
+}
diff --git a/include/acpi/ghes_cper.h b/include/acpi/ghes_cper.h
index 6b7632cfaf66..1b5dbeca9bb6 100644
--- a/include/acpi/ghes_cper.h
+++ b/include/acpi/ghes_cper.h
@@ -16,6 +16,7 @@
 #ifndef ACPI_APEI_GHES_CPER_H
 #define ACPI_APEI_GHES_CPER_H
 
+#include <linux/atomic.h>
 #include <linux/workqueue.h>
 
 #include <acpi/ghes.h>
@@ -54,6 +55,8 @@
 	((struct acpi_hest_generic_data *)                              \
 	((struct ghes_vendor_record_entry *)(vendor_entry) + 1))
 
+extern struct gen_pool *ghes_estatus_pool;
+
 static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 {
 	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
@@ -98,5 +101,8 @@ int __ghes_read_estatus(struct acpi_hest_generic_status *estatus,
 			u64 buf_paddr, enum fixed_addresses fixmap_idx,
 			size_t buf_len);
 #endif
+int ghes_estatus_cached(struct acpi_hest_generic_status *estatus);
+void ghes_estatus_cache_add(struct acpi_hest_generic *generic,
+			    struct acpi_hest_generic_status *estatus);
 
 #endif /* ACPI_APEI_GHES_CPER_H */

-- 
2.43.0
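
For readers following the move, the slot-selection loop in
ghes_estatus_cache_add() can be modelled in user space roughly as below.
This is only a sketch and not part of the patch: struct entry, pick_slot(),
pick_slot_selfcheck(), CACHES_SIZE and IN_CACHE_MAX_NSEC are hypothetical
stand-ins, and the two constant values are assumptions rather than values
taken from this diff. RCU, the gen_pool allocator and the xchg_release()
publication step are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; they stand in for GHES_ESTATUS_CACHES_SIZE and
 * GHES_ESTATUS_IN_CACHE_MAX_NSEC, which this diff does not show. */
#define CACHES_SIZE		4
#define IN_CACHE_MAX_NSEC	10000000000ULL

/* Hypothetical user-space model of one cache slot. */
struct entry {
	int used;		/* 0 models a NULL slot pointer */
	uint64_t time_in;	/* insertion timestamp in ns */
	int count;		/* hits recorded since insertion */
};

/*
 * Pick the slot to replace: the first empty or expired slot wins
 * immediately; otherwise take the entry with the largest average
 * period between hits, duration / (count + 1). Returns -1 only if
 * every live entry has an average period of zero.
 */
static int pick_slot(const struct entry *e, size_t n, uint64_t now)
{
	uint64_t max_period = 0;
	int slot = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t duration, period;

		if (!e[i].used)
			return (int)i;
		duration = now - e[i].time_in;
		if (duration >= IN_CACHE_MAX_NSEC)
			return (int)i;
		period = duration / (uint64_t)(e[i].count + 1);
		if (period > max_period) {
			max_period = period;
			slot = (int)i;
		}
	}
	return slot;
}

/* Small usage example: the least frequently hit entry is evicted. */
void pick_slot_selfcheck(void)
{
	struct entry e[CACHES_SIZE] = {
		{ 1, 0, 9 }, { 1, 0, 0 }, { 1, 0, 1 }, { 1, 0, 4 },
	};

	/* Periods at now = 1000 are 100, 1000, 500, 200. */
	assert(pick_slot(e, CACHES_SIZE, 1000) == 1);
}
```

In this model an entry that is hit often has a small average period and
so tends to stay cached, while an entry that is rarely hit is the first
candidate for replacement.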
