* Re: [PATCH v3 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Mario Limonciello @ 2026-05-18 14:21 UTC (permalink / raw)
To: Sumit Gupta, rafael, viresh.kumar, pierre.gondois,
ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan, rdunlap,
linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <e1a546f2-6e7e-4236-97bb-f72bea0137f7@nvidia.com>
On 5/18/26 09:15, Sumit Gupta wrote:
>
> On 18/05/26 19:20, Mario Limonciello wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 5/18/26 08:44, Sumit Gupta wrote:
>>> Hi Mario,
>>>
>>>
>>> On 16/05/26 02:43, Mario Limonciello wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> On 5/15/26 07:26, Sumit Gupta wrote:
>>>>> Add a kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable
>>>>> CPPC autonomous performance selection on all CPUs at system startup.
>>>>> When autonomous mode is enabled, the hardware automatically adjusts
>>>>> CPU performance based on workload demands using Energy Performance
>>>>> Preference (EPP) hints.
>>>>>
>>>>> When the parameter is set:
>>>>> - Configure all CPUs for autonomous operation on first init
>>>>> - Use HW min/max_perf when available; otherwise initialize from caps
>>>>> - Initialize desired_perf to max_perf as a starting hint
>>>>> - Hardware controls frequency instead of the OS governor
>>>>> - EPP behavior depends on parameter value:
>>>>> - performance (or 1): override EPP to performance preference (0x0)
>>>>> - default_epp (or 2): preserve EPP value programmed by BIOS/
>>>>> firmware
>>>>>
>>>>> The boot parameter is applied only during first policy initialization.
>>>>> Skip applying it on CPU hotplug to preserve runtime sysfs
>>>>> configuration.
>>>>>
>>>>> This patch depends on patch series [1] ("cpufreq: Set policy->min and
>>>>> max as real QoS constraints") so that the policy->min/max set in
>>>>> cppc_cpufreq_cpu_init() are not overridden by cpufreq_set_policy()
>>>>> during init.
>>>>>
>>>>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>>>>> ---
>>>>> [1] https://lore.kernel.org/lkml/20260511135538.522653-1-
>>>>> pierre.gondois@arm.com/
>>>>> ---
>>>>> .../admin-guide/kernel-parameters.txt | 16 +++
>>>>> drivers/cpufreq/cppc_cpufreq.c | 122 +++++++++++++
>>>>> ++++-
>>>>> 2 files changed, 133 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/
>>>>> Documentation/admin-guide/kernel-parameters.txt
>>>>> index 0eb64aab3685..7e4b3a8fd76f 100644
>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>> @@ -1048,6 +1048,22 @@ Kernel parameters
>>>>> policy to use. This governor must be registered
>>>>> in the
>>>>> kernel before the cpufreq driver probes.
>>>>>
>>>>> + cppc_cpufreq.auto_sel_mode=
>>>>> + [CPU_FREQ] Enable ACPI CPPC autonomous
>>>>> performance
>>>>> + selection. When enabled, hardware automatically
>>>>> adjusts
>>>>> + CPU frequency on all CPUs based on workload
>>>>> demands.
>>>>> + In Autonomous mode, Energy Performance
>>>>> Preference (EPP)
>>>>> + hints guide hardware toward performance (0x0)
>>>>> or energy
>>>>> + efficiency (0xff).
>>>>> + Requires ACPI CPPC autonomous selection register
>>>>> + support.
>>>>> + Accepts:
>>>>> + performance, 1: enable auto_sel + set EPP to
>>>>> + performance (0x0)
>>>>> + default_epp, 2: enable auto_sel, preserve EPP
>>>>> value
>>>>> + programmed by BIOS/firmware
>>>>> + Unset: cpufreq governors are used (auto_sel
>>>>> disabled).
>>>>
>>>> Rather than unset doing nothing, have you considered having it take a
>>>> midpoint like 128? That's what we do in amd-pstate (default to
>>>> balance_performance). I think it turns into a reasonable balance.
>>>
>>> Thanks for the suggestion.
>>> I can add balance_performance that enables auto_sel with EPP=128 in v4.
>>>
>>> On changing the driver default (no param behavior) to auto enable
>>> balance_performance, it would be good to keep the current behavior for
>>> now since cppc_cpufreq is generic across ARM64/RISC-V platforms where
>>> EPP and Autonomous Selection registers are optional.
>>> A default change would affect existing users relying on governors.
>>>
>>> Thank you,
>>> Sumit Gupta
>>
>> But couldn't you make the "no module parameter set" follow the behavior
>> to only set the registers if they're available?
>>
>> So the systems that support it start using it, the ones that don't it's
>> a NOP.
>>
>
> Would it work to add balance_performance as a new mode in v4,
> and discuss changing the default separately as a follow-up?
>
Sure.
> Runtime detection helps for unsupported platforms. But platforms which
> support the registers use OS governors today, and silently switching
> them to autonomous mode on a kernel update is a behavior change for
> existing users. They would also have no way to boot into sw governor.
>
But hopefully it should be better battery life/responsiveness for those
scenarios too, right?
>
>
>>>
>>>
>>>>
>>>>> +
>>>>> cpu_init_udelay=N
>>>>> [X86,EARLY] Delay for N microsec between assert
>>>>> and de-assert
>>>>> of APIC INIT to start processors. This delay
>>>>> occurs
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/
>>>>> cppc_cpufreq.c
>>>>> index 6b54427b52e1..5f4d735e7c7d 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -28,6 +28,43 @@
>>>>>
>>>>> static struct cpufreq_driver cppc_cpufreq_driver;
>>>>>
>>>>> +/* Autonomous Selection boot parameter modes */
>>>>> +enum {
>>>>> + AUTO_SEL_PERFORMANCE = 1,
>>>>> + AUTO_SEL_DEFAULT_EPP = 2,
>>>>> +};
>>>>> +
>>>>> +static int auto_sel_mode;
>>>>> +
>>>>> +static int auto_sel_mode_set(const char *val, const struct
>>>>> kernel_param *kp)
>>>>> +{
>>>>> + if (sysfs_streq(val, "performance") || sysfs_streq(val, "1"))
>>>>> + *(int *)kp->arg = AUTO_SEL_PERFORMANCE;
>>>>> + else if (sysfs_streq(val, "default_epp") || sysfs_streq(val,
>>>>> "2"))
>>>>> + *(int *)kp->arg = AUTO_SEL_DEFAULT_EPP;
>>>>> + else
>>>>> + return -EINVAL;
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int auto_sel_mode_get(char *buffer, const struct kernel_param
>>>>> *kp)
>>>>> +{
>>>>> + switch (*(int *)kp->arg) {
>>>>> + case AUTO_SEL_PERFORMANCE:
>>>>> + return sysfs_emit(buffer, "performance\n");
>>>>> + case AUTO_SEL_DEFAULT_EPP:
>>>>> + return sysfs_emit(buffer, "default_epp\n");
>>>>> + default:
>>>>> + return sysfs_emit(buffer, "disabled\n");
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static const struct kernel_param_ops auto_sel_mode_ops = {
>>>>> + .set = auto_sel_mode_set,
>>>>> + .get = auto_sel_mode_get,
>>>>> +};
>>>>> +
>>>>> #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
>>>>> static enum {
>>>>> FIE_UNSET = -1,
>>>>> @@ -715,11 +752,75 @@ static int cppc_cpufreq_cpu_init(struct
>>>>> cpufreq_policy *policy)
>>>>> policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
>>>>> cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
>>>>>
>>>>> - ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>>> - if (ret) {
>>>>> - pr_debug("Err setting perf value:%d on CPU:%d. ret:
>>>>> %d\n",
>>>>> - caps->highest_perf, cpu, ret);
>>>>> - goto out;
>>>>> + /*
>>>>> + * Enable autonomous mode on first init if boot param is set.
>>>>> + * Check last_governor to detect first init and skip if auto_sel
>>>>> + * is already enabled.
>>>>> + */
>>>>> + if (auto_sel_mode && policy->last_governor[0] == '\0' &&
>>>>> + !cpu_data->perf_ctrls.auto_sel) {
>>>>> + /* Init min/max_perf from caps if not already set by
>>>>> HW. */
>>>>> + if (!cpu_data->perf_ctrls.min_perf)
>>>>> + cpu_data->perf_ctrls.min_perf = caps-
>>>>> >lowest_nonlinear_perf;
>>>>> + if (!cpu_data->perf_ctrls.max_perf)
>>>>> + cpu_data->perf_ctrls.max_perf = policy-
>>>>> >boost_enabled ?
>>>>> + caps->highest_perf : caps->nominal_perf;
>>>>> +
>>>>> + /*
>>>>> + * In autonomous mode desired_perf is only a hint;
>>>>> EPP and
>>>>> + * the platform drive actual selection within [min,
>>>>> max].
>>>>> + * Initialize it to max_perf so HW starts at the upper
>>>>> bound.
>>>>> + */
>>>>> + cpu_data->perf_ctrls.desired_perf = cpu_data-
>>>>> >perf_ctrls.max_perf;
>>>>> +
>>>>> + policy->cur = cppc_perf_to_khz(caps,
>>>>> + cpu_data->perf_ctrls.desired_perf);
>>>>> +
>>>>> + /*
>>>>> + * Override EPP only in 'performance' mode;
>>>>> 'default_epp' mode
>>>>> + * preserves the BIOS/firmware programmed EPP value.
>>>>> + * EPP is optional - some platforms may not support it.
>>>>> + */
>>>>> + if (auto_sel_mode == AUTO_SEL_PERFORMANCE) {
>>>>> + ret = cppc_set_epp(cpu,
>>>>> CPPC_EPP_PERFORMANCE_PREF);
>>>>> + if (ret && ret != -EOPNOTSUPP)
>>>>> + pr_warn("Failed to set EPP for CPU%d
>>>>> (%d)\n", cpu, ret);
>>>>> + else if (!ret)
>>>>> + cpu_data->perf_ctrls.energy_perf = CPPC_EPP_PERFORMANCE_PREF;
>>>>> + }
>>>>> +
>>>>> + /* Program min/max/desired into CPPC regs (non-fatal on
>>>>> failure). */
>>>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>>> + if (ret)
>>>>> + pr_warn("set_perf failed CPU%d (%d); using HW
>>>>> values\n",
>>>>> + cpu, ret);
>>>>> +
>>>>> + ret = cppc_set_auto_sel(cpu, true);
>>>>> + if (ret && ret != -EOPNOTSUPP)
>>>>> + pr_warn("auto_sel CPU%d failed (%d); using OS
>>>>> mode\n",
>>>>> + cpu, ret);
>>>>> + else if (!ret)
>>>>> + cpu_data->perf_ctrls.auto_sel = true;
>>>>> + }
>>>>> +
>>>>> + if (cpu_data->perf_ctrls.auto_sel) {
>>>>> + /* Sync policy limits from HW when autonomous mode is
>>>>> active */
>>>>> + policy->min = cppc_perf_to_khz(caps,
>>>>> + cpu_data->perf_ctrls.min_perf ?:
>>>>> + caps->lowest_nonlinear_perf);
>>>>> + policy->max = cppc_perf_to_khz(caps,
>>>>> + cpu_data->perf_ctrls.max_perf ?:
>>>>> + (policy->boost_enabled ?
>>>>> + caps->highest_perf :
>>>>> + caps->nominal_perf));
>>>>> + } else {
>>>>> + /* Normal mode: governors control frequency */
>>>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>>> + if (ret) {
>>>>> + pr_debug("Err setting perf value:%d on CPU:%d.
>>>>> ret:%d\n",
>>>>> + caps->highest_perf, cpu, ret);
>>>>> + goto out;
>>>>> + }
>>>>> }
>>>>>
>>>>> cppc_cpufreq_cpu_fie_init(policy);
>>>>> @@ -1079,10 +1180,21 @@ static int __init cppc_cpufreq_init(void)
>>>>>
>>>>> static void __exit cppc_cpufreq_exit(void)
>>>>> {
>>>>> + unsigned int cpu;
>>>>> +
>>>>> + for_each_present_cpu(cpu)
>>>>> + cppc_set_auto_sel(cpu, false);
>>>>> +
>>>>> cpufreq_unregister_driver(&cppc_cpufreq_driver);
>>>>> cppc_freq_invariance_exit();
>>>>> }
>>>>>
>>>>> +module_param_cb(auto_sel_mode, &auto_sel_mode_ops, &auto_sel_mode,
>>>>> 0444);
>>>>> +MODULE_PARM_DESC(auto_sel_mode,
>>>>> + "Enable CPPC autonomous performance selection at
>>>>> boot: "
>>>>> + "performance or 1 (EPP=performance), "
>>>>> + "default_epp or 2 (preserve BIOS/firmware EPP)");
>>>>> +
>>>>> module_exit(cppc_cpufreq_exit);
>>>>> MODULE_AUTHOR("Ashwin Chaugule");
>>>>> MODULE_DESCRIPTION("CPUFreq driver based on the ACPI CPPC v5.0+
>>>>> spec");
>>>>
>>
^ permalink raw reply
* Re: [PATCH v3 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-05-18 14:15 UTC (permalink / raw)
To: Mario Limonciello, rafael, viresh.kumar, pierre.gondois,
ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan, rdunlap,
linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <72fd2fcc-6303-4980-beb7-e4b711ad6406@amd.com>
On 18/05/26 19:20, Mario Limonciello wrote:
> External email: Use caution opening links or attachments
>
>
> On 5/18/26 08:44, Sumit Gupta wrote:
>> Hi Mario,
>>
>>
>> On 16/05/26 02:43, Mario Limonciello wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 5/15/26 07:26, Sumit Gupta wrote:
>>>> Add a kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable
>>>> CPPC autonomous performance selection on all CPUs at system startup.
>>>> When autonomous mode is enabled, the hardware automatically adjusts
>>>> CPU performance based on workload demands using Energy Performance
>>>> Preference (EPP) hints.
>>>>
>>>> When the parameter is set:
>>>> - Configure all CPUs for autonomous operation on first init
>>>> - Use HW min/max_perf when available; otherwise initialize from caps
>>>> - Initialize desired_perf to max_perf as a starting hint
>>>> - Hardware controls frequency instead of the OS governor
>>>> - EPP behavior depends on parameter value:
>>>> - performance (or 1): override EPP to performance preference (0x0)
>>>> - default_epp (or 2): preserve EPP value programmed by
>>>> BIOS/firmware
>>>>
>>>> The boot parameter is applied only during first policy initialization.
>>>> Skip applying it on CPU hotplug to preserve runtime sysfs
>>>> configuration.
>>>>
>>>> This patch depends on patch series [1] ("cpufreq: Set policy->min and
>>>> max as real QoS constraints") so that the policy->min/max set in
>>>> cppc_cpufreq_cpu_init() are not overridden by cpufreq_set_policy()
>>>> during init.
>>>>
>>>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>>>> ---
>>>> [1] https://lore.kernel.org/lkml/20260511135538.522653-1-
>>>> pierre.gondois@arm.com/
>>>> ---
>>>> .../admin-guide/kernel-parameters.txt | 16 +++
>>>> drivers/cpufreq/cppc_cpufreq.c | 122
>>>> +++++++++++++++++-
>>>> 2 files changed, 133 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/
>>>> Documentation/admin-guide/kernel-parameters.txt
>>>> index 0eb64aab3685..7e4b3a8fd76f 100644
>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>> @@ -1048,6 +1048,22 @@ Kernel parameters
>>>> policy to use. This governor must be registered
>>>> in the
>>>> kernel before the cpufreq driver probes.
>>>>
>>>> + cppc_cpufreq.auto_sel_mode=
>>>> + [CPU_FREQ] Enable ACPI CPPC autonomous
>>>> performance
>>>> + selection. When enabled, hardware automatically
>>>> adjusts
>>>> + CPU frequency on all CPUs based on workload
>>>> demands.
>>>> + In Autonomous mode, Energy Performance
>>>> Preference (EPP)
>>>> + hints guide hardware toward performance (0x0)
>>>> or energy
>>>> + efficiency (0xff).
>>>> + Requires ACPI CPPC autonomous selection register
>>>> + support.
>>>> + Accepts:
>>>> + performance, 1: enable auto_sel + set EPP to
>>>> + performance (0x0)
>>>> + default_epp, 2: enable auto_sel, preserve EPP
>>>> value
>>>> + programmed by BIOS/firmware
>>>> + Unset: cpufreq governors are used (auto_sel
>>>> disabled).
>>>
>>> Rather than unset doing nothing, have you considered having it take a
>>> midpoint like 128? That's what we do in amd-pstate (default to
>>> balance_performance). I think it turns into a reasonable balance.
>>
>> Thanks for the suggestion.
>> I can add balance_performance that enables auto_sel with EPP=128 in v4.
>>
>> On changing the driver default (no param behavior) to auto enable
>> balance_performance, it would be good to keep the current behavior for
>> now since cppc_cpufreq is generic across ARM64/RISC-V platforms where
>> EPP and Autonomous Selection registers are optional.
>> A default change would affect existing users relying on governors.
>>
>> Thank you,
>> Sumit Gupta
>
> But couldn't you make the "no module parameter set" follow the behavior
> to only set the registers if they're available?
>
> So the systems that support it start using it, the ones that don't it's
> a NOP.
>
Would it work to add balance_performance as a new mode in v4,
and discuss changing the default separately as a follow-up?
Runtime detection helps for unsupported platforms. But platforms which
support the registers use OS governors today, and silently switching
them to autonomous mode on a kernel update is a behavior change for
existing users. They would also have no way to boot into sw governor.
Thank you,
Sumit Gupta
>>
>>
>>>
>>>> +
>>>> cpu_init_udelay=N
>>>> [X86,EARLY] Delay for N microsec between assert
>>>> and de-assert
>>>> of APIC INIT to start processors. This delay
>>>> occurs
>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/
>>>> cppc_cpufreq.c
>>>> index 6b54427b52e1..5f4d735e7c7d 100644
>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>> @@ -28,6 +28,43 @@
>>>>
>>>> static struct cpufreq_driver cppc_cpufreq_driver;
>>>>
>>>> +/* Autonomous Selection boot parameter modes */
>>>> +enum {
>>>> + AUTO_SEL_PERFORMANCE = 1,
>>>> + AUTO_SEL_DEFAULT_EPP = 2,
>>>> +};
>>>> +
>>>> +static int auto_sel_mode;
>>>> +
>>>> +static int auto_sel_mode_set(const char *val, const struct
>>>> kernel_param *kp)
>>>> +{
>>>> + if (sysfs_streq(val, "performance") || sysfs_streq(val, "1"))
>>>> + *(int *)kp->arg = AUTO_SEL_PERFORMANCE;
>>>> + else if (sysfs_streq(val, "default_epp") || sysfs_streq(val,
>>>> "2"))
>>>> + *(int *)kp->arg = AUTO_SEL_DEFAULT_EPP;
>>>> + else
>>>> + return -EINVAL;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int auto_sel_mode_get(char *buffer, const struct kernel_param
>>>> *kp)
>>>> +{
>>>> + switch (*(int *)kp->arg) {
>>>> + case AUTO_SEL_PERFORMANCE:
>>>> + return sysfs_emit(buffer, "performance\n");
>>>> + case AUTO_SEL_DEFAULT_EPP:
>>>> + return sysfs_emit(buffer, "default_epp\n");
>>>> + default:
>>>> + return sysfs_emit(buffer, "disabled\n");
>>>> + }
>>>> +}
>>>> +
>>>> +static const struct kernel_param_ops auto_sel_mode_ops = {
>>>> + .set = auto_sel_mode_set,
>>>> + .get = auto_sel_mode_get,
>>>> +};
>>>> +
>>>> #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
>>>> static enum {
>>>> FIE_UNSET = -1,
>>>> @@ -715,11 +752,75 @@ static int cppc_cpufreq_cpu_init(struct
>>>> cpufreq_policy *policy)
>>>> policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
>>>> cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
>>>>
>>>> - ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>> - if (ret) {
>>>> - pr_debug("Err setting perf value:%d on CPU:%d.
>>>> ret:%d\n",
>>>> - caps->highest_perf, cpu, ret);
>>>> - goto out;
>>>> + /*
>>>> + * Enable autonomous mode on first init if boot param is set.
>>>> + * Check last_governor to detect first init and skip if auto_sel
>>>> + * is already enabled.
>>>> + */
>>>> + if (auto_sel_mode && policy->last_governor[0] == '\0' &&
>>>> + !cpu_data->perf_ctrls.auto_sel) {
>>>> + /* Init min/max_perf from caps if not already set by
>>>> HW. */
>>>> + if (!cpu_data->perf_ctrls.min_perf)
>>>> + cpu_data->perf_ctrls.min_perf = caps-
>>>> >lowest_nonlinear_perf;
>>>> + if (!cpu_data->perf_ctrls.max_perf)
>>>> + cpu_data->perf_ctrls.max_perf = policy-
>>>> >boost_enabled ?
>>>> + caps->highest_perf : caps->nominal_perf;
>>>> +
>>>> + /*
>>>> + * In autonomous mode desired_perf is only a hint;
>>>> EPP and
>>>> + * the platform drive actual selection within [min,
>>>> max].
>>>> + * Initialize it to max_perf so HW starts at the upper
>>>> bound.
>>>> + */
>>>> + cpu_data->perf_ctrls.desired_perf = cpu_data-
>>>> >perf_ctrls.max_perf;
>>>> +
>>>> + policy->cur = cppc_perf_to_khz(caps,
>>>> + cpu_data->perf_ctrls.desired_perf);
>>>> +
>>>> + /*
>>>> + * Override EPP only in 'performance' mode;
>>>> 'default_epp' mode
>>>> + * preserves the BIOS/firmware programmed EPP value.
>>>> + * EPP is optional - some platforms may not support it.
>>>> + */
>>>> + if (auto_sel_mode == AUTO_SEL_PERFORMANCE) {
>>>> + ret = cppc_set_epp(cpu,
>>>> CPPC_EPP_PERFORMANCE_PREF);
>>>> + if (ret && ret != -EOPNOTSUPP)
>>>> + pr_warn("Failed to set EPP for CPU%d
>>>> (%d)\n", cpu, ret);
>>>> + else if (!ret)
>>>> + cpu_data->perf_ctrls.energy_perf = CPPC_EPP_PERFORMANCE_PREF;
>>>> + }
>>>> +
>>>> + /* Program min/max/desired into CPPC regs (non-fatal on
>>>> failure). */
>>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>> + if (ret)
>>>> + pr_warn("set_perf failed CPU%d (%d); using HW
>>>> values\n",
>>>> + cpu, ret);
>>>> +
>>>> + ret = cppc_set_auto_sel(cpu, true);
>>>> + if (ret && ret != -EOPNOTSUPP)
>>>> + pr_warn("auto_sel CPU%d failed (%d); using OS
>>>> mode\n",
>>>> + cpu, ret);
>>>> + else if (!ret)
>>>> + cpu_data->perf_ctrls.auto_sel = true;
>>>> + }
>>>> +
>>>> + if (cpu_data->perf_ctrls.auto_sel) {
>>>> + /* Sync policy limits from HW when autonomous mode is
>>>> active */
>>>> + policy->min = cppc_perf_to_khz(caps,
>>>> + cpu_data->perf_ctrls.min_perf ?:
>>>> + caps->lowest_nonlinear_perf);
>>>> + policy->max = cppc_perf_to_khz(caps,
>>>> + cpu_data->perf_ctrls.max_perf ?:
>>>> + (policy->boost_enabled ?
>>>> + caps->highest_perf :
>>>> + caps->nominal_perf));
>>>> + } else {
>>>> + /* Normal mode: governors control frequency */
>>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>>> + if (ret) {
>>>> + pr_debug("Err setting perf value:%d on CPU:%d.
>>>> ret:%d\n",
>>>> + caps->highest_perf, cpu, ret);
>>>> + goto out;
>>>> + }
>>>> }
>>>>
>>>> cppc_cpufreq_cpu_fie_init(policy);
>>>> @@ -1079,10 +1180,21 @@ static int __init cppc_cpufreq_init(void)
>>>>
>>>> static void __exit cppc_cpufreq_exit(void)
>>>> {
>>>> + unsigned int cpu;
>>>> +
>>>> + for_each_present_cpu(cpu)
>>>> + cppc_set_auto_sel(cpu, false);
>>>> +
>>>> cpufreq_unregister_driver(&cppc_cpufreq_driver);
>>>> cppc_freq_invariance_exit();
>>>> }
>>>>
>>>> +module_param_cb(auto_sel_mode, &auto_sel_mode_ops, &auto_sel_mode,
>>>> 0444);
>>>> +MODULE_PARM_DESC(auto_sel_mode,
>>>> + "Enable CPPC autonomous performance selection at
>>>> boot: "
>>>> + "performance or 1 (EPP=performance), "
>>>> + "default_epp or 2 (preserve BIOS/firmware EPP)");
>>>> +
>>>> module_exit(cppc_cpufreq_exit);
>>>> MODULE_AUTHOR("Ashwin Chaugule");
>>>> MODULE_DESCRIPTION("CPUFreq driver based on the ACPI CPPC v5.0+
>>>> spec");
>>>
>
^ permalink raw reply
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Christian König @ 2026-05-18 14:06 UTC (permalink / raw)
To: Albert Esteve
Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <CADSE00Lh95ygoXGKJGsYvQGEsFV8sVmwEC3uvh8M6r3ERzaJwg@mail.gmail.com>
On 5/18/26 14:50, Albert Esteve wrote:
> On Mon, May 18, 2026 at 9:20 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/15/26 19:06, T.J. Mercier wrote:
>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>
>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
>>>>> On embedded platforms a central process often allocates dma-buf
>>>>> memory on behalf of client applications. Without a way to
>>>>> attribute the charge to the requesting client's cgroup, the
>>>>> cost lands on the allocator, making per-cgroup memory limits
>>>>> ineffective for the actual consumers.
>>>>>
>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
>>>>
>>>> Please be aware that pidfds come in two flavors:
>>>>
>>>> thread-group pidfds and thread-specific pidfds. Make sure that your API
>>>> doesn't implicitly depend on this distinction not existing.
>>>
>>> Hi Christian,
>>>
>>> Memcg is not a controller that supports "thread mode" so all threads
>>> in a group should belong to the same memcg.
>>
>> BTW: Exactly that is the requirement automotive has with their native context use case.
>>
>> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
>>
>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>>
>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
>
> Hi Christian,
>
> Thanks for sharing this atuomotive usecase. If I understand correctly,
> the actual requirement is attributing dma-buf charges to the right
> client, not putting each daemon thread in a different cgroup?
Nope, exactly that's the difference.
The thread acts as a filtering agent for both memory allocation and command submission for somebody else, the process on which behalve the daemon does things can even be in a client VM, completely remote over some network or even something like a microcontroller.
Everything the thread does regarding CPU time, GPU driver memory allocation as well as resources like GPU processing and I/O time etc.. needs to be accounted to one client which can be different for each thread of the process.
The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc() because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
The problem is now you can't do that with cgroups at the moment but unfortunately only the kernel has the information you need to know to do this.
So what you end up with is to define tons of interfaces just to get the necessary information from the kernel into userspace and then essentially duplicate the same infrastructure cgroup provides in the kernel in userspace again.
> If so,
> the `charge_pid_fd` approach achieves this directly by passing the
> client's `pid_fd`, without needing to add per-thread cgroup
> infrastructure.
Well it's already a massive improvemt, we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
Doing that automatically for CPU and I/O time would just be nice to have additionally.
Regards,
Christian.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Checking the flags from pidfd_get_pid would be the best way for an
>>> explicit check of the pidfd type?
>>>
>>>>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
>>>>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
>>>>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
>>>>> the mem_accounting module parameter enabled, the buffer is charged
>>>>> to the allocator's own cgroup.
>>>>>
>>>>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
>>>>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
>>>>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
>>>>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
>>>>> all accounting through a single MEMCG_DMABUF path.
>>>>>
>>>>> Usage examples:
>>>>>
>>>>> 1. Central allocator charging to a client at allocation time.
>>>>> The allocator knows the client's PID (e.g., from binder's
>>>>> sender_pid) and uses pidfd to attribute the charge:
>>>>>
>>>>> pid_t client_pid = txn->sender_pid;
>>>>> int pidfd = pidfd_open(client_pid, 0);
>>>>>
>>>>> struct dma_heap_allocation_data alloc = {
>>>>> .len = buffer_size,
>>>>> .fd_flags = O_RDWR | O_CLOEXEC,
>>>>> .charge_pid_fd = pidfd,
>>>>> };
>>>>> ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>>> close(pidfd);
>>>>> /* alloc.fd is now charged to client's cgroup */
>>>>>
>>>>> 2. Default allocation (no pidfd, mem_accounting=1).
>>>>> When charge_pid_fd is not set and the mem_accounting module
>>>>> parameter is enabled, the buffer is charged to the allocator's
>>>>> own cgroup:
>>>>>
>>>>> struct dma_heap_allocation_data alloc = {
>>>>> .len = buffer_size,
>>>>> .fd_flags = O_RDWR | O_CLOEXEC,
>>>>> };
>>>>> ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
>>>>> /* charged to current process's cgroup */
>>>>>
>>>>> Current limitations:
>>>>>
>>>>> - Single-owner model: a dma-buf carries one memcg charge regardless of
>>>>> how many processes share it. Means only the first owner (and exporter)
>>>>> of the shared buffer bears the charge.
>>>>> - Only memcg accounting supported. While this makes sense for system
>>>>> heap buffers, other heaps (e.g., CMA heaps) will require selectively
>>>>> charging also for the dmem controller.
>>>>>
>>>>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
>>>>> ---
>>>>> Documentation/admin-guide/cgroup-v2.rst | 5 ++--
>>>>> drivers/dma-buf/dma-buf.c | 16 ++++---------
>>>>> drivers/dma-buf/dma-heap.c | 42 ++++++++++++++++++++++++++++++---
>>>>> drivers/dma-buf/heaps/system_heap.c | 2 --
>>>>> include/uapi/linux/dma-heap.h | 6 +++++
>>>>> 5 files changed, 53 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>> index 8bdbc2e866430..824d269531eb1 100644
>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
>>>>> structures.
>>>>>
>>>>> dmabuf (npn)
>>>>> - Amount of memory used for exported DMA buffers allocated by the cgroup.
>>>>> - Stays with the allocating cgroup regardless of how the buffer is shared.
>>>>> + Amount of memory used for exported DMA buffers allocated by or on
>>>>> + behalf of the cgroup. Stays with the allocating cgroup regardless
>>>>> + of how the buffer is shared.
>>>>>
>>>>> workingset_refault_anon
>>>>> Number of refaults of previously evicted anonymous pages.
>>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>>> index ce02377f48908..23fb758b78297 100644
>>>>> --- a/drivers/dma-buf/dma-buf.c
>>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
>>>>> */
>>>>> BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
>>>>>
>>>>> - mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>>>> - mem_cgroup_put(dmabuf->memcg);
>>>>> + if (dmabuf->memcg) {
>>>>> + mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
>>>>> + PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
>>>>> + mem_cgroup_put(dmabuf->memcg);
>>>>> + }
>>>>>
>>>>> dmabuf->ops->release(dmabuf);
>>>>>
>>>>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>> dmabuf->resv = resv;
>>>>> }
>>>>>
>>>>> - dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
>>>>> - if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
>>>>> - GFP_KERNEL)) {
>>>>> - ret = -ENOMEM;
>>>>> - goto err_memcg;
>>>>> - }
>>>>> -
>>>>> file->private_data = dmabuf;
>>>>> file->f_path.dentry->d_fsdata = dmabuf;
>>>>> dmabuf->file = file;
>>>>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>>
>>>>> return dmabuf;
>>>>>
>>>>> -err_memcg:
>>>>> - mem_cgroup_put(dmabuf->memcg);
>>>>> err_file:
>>>>> fput(file);
>>>>> err_module:
>>>>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
>>>>> index ac5f8685a6494..ff6e259afcdc0 100644
>>>>> --- a/drivers/dma-buf/dma-heap.c
>>>>> +++ b/drivers/dma-buf/dma-heap.c
>>>>> @@ -7,13 +7,17 @@
>>>>> */
>>>>>
>>>>> #include <linux/cdev.h>
>>>>> +#include <linux/cgroup.h>
>>>>> #include <linux/device.h>
>>>>> #include <linux/dma-buf.h>
>>>>> #include <linux/dma-heap.h>
>>>>> +#include <linux/memcontrol.h>
>>>>> +#include <linux/sched/mm.h>
>>>>> #include <linux/err.h>
>>>>> #include <linux/export.h>
>>>>> #include <linux/list.h>
>>>>> #include <linux/nospec.h>
>>>>> +#include <linux/pidfd.h>
>>>>> #include <linux/syscalls.h>
>>>>> #include <linux/uaccess.h>
>>>>> #include <linux/xarray.h>
>>>>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
>>>>> "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
>>>>>
>>>>> static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>>> - u32 fd_flags,
>>>>> - u64 heap_flags)
>>>>> + u32 fd_flags, u64 heap_flags,
>>>>> + struct mem_cgroup *charge_to)
>>>>> {
>>>>> struct dma_buf *dmabuf;
>>>>> + unsigned int nr_pages;
>>>>> + struct mem_cgroup *memcg = charge_to;
>>>>> int fd;
>>>>>
>>>>> /*
>>>>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>>> if (IS_ERR(dmabuf))
>>>>> return PTR_ERR(dmabuf);
>>>>>
>>>>> + nr_pages = len / PAGE_SIZE;
>>>>> +
>>>>> + if (memcg)
>>>>> + css_get(&memcg->css);
>>>>> + else if (mem_accounting)
>>>>> + memcg = get_mem_cgroup_from_mm(current->mm);
>>>>> +
>>>>> + if (memcg) {
>>>>> + if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
>>>>> + mem_cgroup_put(memcg);
>>>>> + dma_buf_put(dmabuf);
>>>>> + return -ENOMEM;
>>>>> + }
>>>>> + dmabuf->memcg = memcg;
>>>>> + }
>>>>> +
>>>>> fd = dma_buf_fd(dmabuf, fd_flags);
>>>>> if (fd < 0) {
>>>>> dma_buf_put(dmabuf);
>>>>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>> {
>>>>> struct dma_heap_allocation_data *heap_allocation = data;
>>>>> struct dma_heap *heap = file->private_data;
>>>>> + struct mem_cgroup *memcg = NULL;
>>>>> + struct task_struct *task;
>>>>> + unsigned int pidfd_flags;
>>>>> int fd;
>>>>>
>>>>> if (heap_allocation->fd)
>>>>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>> if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
>>>>> return -EINVAL;
>>>>>
>>>>> + if (heap_allocation->charge_pid_fd) {
>>>>> + task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
>>>>
>>>> Will always get a thread-group leader pidfd and will fail if this is a
>>>> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
>>>> open a thread-specific pidfd.
>>>>
>>>>> + if (IS_ERR(task))
>>>>> + return PTR_ERR(task);
>>>>> +
>>>>> + memcg = get_mem_cgroup_from_mm(task->mm);
>>>>> + put_task_struct(task);
>>>>> + }
>>>>> +
>>>>> fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
>>>>> heap_allocation->fd_flags,
>>>>> - heap_allocation->heap_flags);
>>>>> + heap_allocation->heap_flags,
>>>>> + memcg);
>>>>> + mem_cgroup_put(memcg);
>>>>> if (fd < 0)
>>>>> return fd;
>>>>>
>>>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>>>> index 03c2b87cb1112..95d7688167b93 100644
>>>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>>>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
>>>>> if (max_order < orders[i])
>>>>> continue;
>>>>> flags = order_flags[i];
>>>>> - if (mem_accounting)
>>>>> - flags |= __GFP_ACCOUNT;
>>>>> page = alloc_pages(flags, orders[i]);
>>>>> if (!page)
>>>>> continue;
>>>>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
>>>>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
>>>>> --- a/include/uapi/linux/dma-heap.h
>>>>> +++ b/include/uapi/linux/dma-heap.h
>>>>> @@ -29,6 +29,10 @@
>>>>> * handle to the allocated dma-buf
>>>>> * @fd_flags: file descriptor flags used when allocating
>>>>> * @heap_flags: flags passed to heap
>>>>> + * @charge_pid_fd: optional pidfd of the process whose cgroup should be
>>>>> + * charged for this allocation; 0 means charge the calling
>>>>> + * process's cgroup
>>>>> + * @__padding: reserved, must be zero
>>>>> *
>>>>> * Provided by userspace as an argument to the ioctl
>>>>> */
>>>>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
>>>>> __u32 fd;
>>>>> __u32 fd_flags;
>>>>> __u64 heap_flags;
>>>>> + __u32 charge_pid_fd;
>>>>> + __u32 __padding;
>>>>> };
>>>>>
>>>>> #define DMA_HEAP_IOC_MAGIC 'H'
>>>>>
>>>>> --
>>>>> 2.53.0
>>>>>
>>
>
^ permalink raw reply
* Re: [PATCH v3 0/4] KVM: arm64: vgic: Fix IGROUPR writability and IIDR revision control
From: David Woodhouse @ 2026-05-18 13:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Jonathan Corbet, Shuah Khan, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
Will Deacon, Jonathan Cameron, Sascha Bischoff, Eric Auger,
Raghavendra Rao Ananta, Maxim Levitsky, Kees Cook, Timothy Hayes,
Arnd Bergmann, kvm, linux-doc, linux-kernel, linux-arm-kernel,
kvmarm, linux-kselftest, Peter Maydell, qemu-arm, qemu-devel
In-Reply-To: <20260511113558.3325004-1-dwmw2@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 876 bytes --]
On Mon, 2026-05-11 at 12:30 +0100, David Woodhouse wrote:
> Maintaining precise guest compatibility across host kernel upgrades —
> and even downgrades, since rollback is sometimes necessary — is not
> optional. That *shouldn't* need saying, but maybe it does:
> https://lore.kernel.org/all/6856b269d2af706eae397e0cf9c1231f89d9a932.camel@infradead.org/
>
> This series fixes the GICv2/v3 IGROUPR writability model to be
> consistently controlled by the GICD_IIDR implementation revision,
> replacing the ad-hoc v2_groups_user_writable flag.
The more generic question of whether KVM should actually maintain guest
compatibility from one version of the host kernel to another is being
discussed in that other thread.
Setting that aside, is there any *technical* feedback on this series? I
believe I've dealt with all of Marc's feedback on v1 (thanks).
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v5 02/13] iio: core: support 64-bit register through debugfs
From: Rodrigo Alencar @ 2026-05-18 13:56 UTC (permalink / raw)
To: rodrigo.alencar, linux-iio, devicetree, linux-kernel, linux-doc,
linux-hardening
Cc: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
David Lechner, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Philipp Zabel, Jonathan Corbet, Shuah Khan,
Kees Cook, Gustavo A. R. Silva
In-Reply-To: <20260517-ad9910-iio-driver-v5-2-31599c88314a@analog.com>
On 26/05/17 06:56PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Silent truncation of 64-bit values when written to a 32-bit register access function.
> --
>
> commit 64785a0c1416e2cf718b558aa7a2cc9ac39b2d81
> Author: Rodrigo Alencar <rodrigo.alencar@analog.com>
>
> iio: core: support 64-bit register through debugfs
>
> This commit introduces a debugfs_reg64_access function pointer to iio_info
> to allow 64-bit register reads and writes through debugfs. The file operation
> callbacks are modified to prefer the 64-bit variant when available.
>
> > diff --git a/drivers/iio/industrialio-core.c b/drivers/iio/industrialio-core.c
> > index e587aca79b8e7..5c8404efd0a56 100644
> > --- a/drivers/iio/industrialio-core.c
> > +++ b/drivers/iio/industrialio-core.c
>
> [ ... ]
>
> > @@ -429,7 +439,7 @@ static ssize_t iio_debugfs_write_reg(struct file *file,
> >
> > buf[ret] = '\0';
> >
> > - ret = sscanf(buf, "%i %i", ®, &val);
> > + ret = sscanf(buf, "%i %lli", ®, &val64);
> >
> > switch (ret) {
> > case 1:
> > @@ -437,8 +447,12 @@ static ssize_t iio_debugfs_write_reg(struct file *file,
> > break;
> > case 2:
> > iio_dev_opaque->cached_reg_addr = reg;
> > - ret = indio_dev->info->debugfs_reg_access(indio_dev, reg,
> > - val, NULL);
> > + if (indio_dev->info->debugfs_reg64_access)
> > + ret = indio_dev->info->debugfs_reg64_access(indio_dev, reg,
> > + val64, NULL);
> > + else
> > + ret = indio_dev->info->debugfs_reg_access(indio_dev, reg,
> > + val64, NULL);
>
> Could this introduce a regression where values are silently truncated if the
> provided input exceeds 32 bits?
>
> Since the input is now unconditionally parsed as a 64-bit integer, falling
> back to debugfs_reg_access (which takes an unsigned int for writeval) will
> implicitly cast val64. If a user provides a value larger than UINT_MAX to a
> driver that only provides the 32-bit callback, it appears the value will be
> truncated rather than rejected with an error like -EINVAL.
Maybe -EINVAL can be returned for that case indeed. For a debug interface I
did not see that as a problem.
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260517-ad9910-iio-driver-v5-0-31599c88314a@analog.com?part=2
--
Kind regards,
Rodrigo Alencar
^ permalink raw reply
* Re: [PATCH v3 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Mario Limonciello @ 2026-05-18 13:50 UTC (permalink / raw)
To: Sumit Gupta, rafael, viresh.kumar, pierre.gondois,
ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan, rdunlap,
linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <139d2f0e-72d9-4721-9d5a-d1d4a2a95fa1@nvidia.com>
On 5/18/26 08:44, Sumit Gupta wrote:
> Hi Mario,
>
>
> On 16/05/26 02:43, Mario Limonciello wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 5/15/26 07:26, Sumit Gupta wrote:
>>> Add a kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable
>>> CPPC autonomous performance selection on all CPUs at system startup.
>>> When autonomous mode is enabled, the hardware automatically adjusts
>>> CPU performance based on workload demands using Energy Performance
>>> Preference (EPP) hints.
>>>
>>> When the parameter is set:
>>> - Configure all CPUs for autonomous operation on first init
>>> - Use HW min/max_perf when available; otherwise initialize from caps
>>> - Initialize desired_perf to max_perf as a starting hint
>>> - Hardware controls frequency instead of the OS governor
>>> - EPP behavior depends on parameter value:
>>> - performance (or 1): override EPP to performance preference (0x0)
>>> - default_epp (or 2): preserve EPP value programmed by BIOS/firmware
>>>
>>> The boot parameter is applied only during first policy initialization.
>>> Skip applying it on CPU hotplug to preserve runtime sysfs configuration.
>>>
>>> This patch depends on patch series [1] ("cpufreq: Set policy->min and
>>> max as real QoS constraints") so that the policy->min/max set in
>>> cppc_cpufreq_cpu_init() are not overridden by cpufreq_set_policy()
>>> during init.
>>>
>>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>>> ---
>>> [1] https://lore.kernel.org/lkml/20260511135538.522653-1-
>>> pierre.gondois@arm.com/
>>> ---
>>> .../admin-guide/kernel-parameters.txt | 16 +++
>>> drivers/cpufreq/cppc_cpufreq.c | 122 +++++++++++++++++-
>>> 2 files changed, 133 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/
>>> Documentation/admin-guide/kernel-parameters.txt
>>> index 0eb64aab3685..7e4b3a8fd76f 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -1048,6 +1048,22 @@ Kernel parameters
>>> policy to use. This governor must be registered
>>> in the
>>> kernel before the cpufreq driver probes.
>>>
>>> + cppc_cpufreq.auto_sel_mode=
>>> + [CPU_FREQ] Enable ACPI CPPC autonomous performance
>>> + selection. When enabled, hardware automatically
>>> adjusts
>>> + CPU frequency on all CPUs based on workload
>>> demands.
>>> + In Autonomous mode, Energy Performance
>>> Preference (EPP)
>>> + hints guide hardware toward performance (0x0)
>>> or energy
>>> + efficiency (0xff).
>>> + Requires ACPI CPPC autonomous selection register
>>> + support.
>>> + Accepts:
>>> + performance, 1: enable auto_sel + set EPP to
>>> + performance (0x0)
>>> + default_epp, 2: enable auto_sel, preserve EPP
>>> value
>>> + programmed by BIOS/firmware
>>> + Unset: cpufreq governors are used (auto_sel
>>> disabled).
>>
>> Rather than unset doing nothing, have you considered having it take a
>> midpoint like 128? That's what we do in amd-pstate (default to
>> balance_performance). I think it turns into a reasonable balance.
>
> Thanks for the suggestion.
> I can add balance_performance that enables auto_sel with EPP=128 in v4.
>
> On changing the driver default (no param behavior) to auto enable
> balance_performance, it would be good to keep the current behavior for
> now since cppc_cpufreq is generic across ARM64/RISC-V platforms where
> EPP and Autonomous Selection registers are optional.
> A default change would affect existing users relying on governors.
>
> Thank you,
> Sumit Gupta
But couldn't you make the "no module parameter set" follow the behavior
to only set the registers if they're available?
So the systems that support it start using it, the ones that don't it's
a NOP.
>
>
>>
>>> +
>>> cpu_init_udelay=N
>>> [X86,EARLY] Delay for N microsec between assert
>>> and de-assert
>>> of APIC INIT to start processors. This delay
>>> occurs
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/
>>> cppc_cpufreq.c
>>> index 6b54427b52e1..5f4d735e7c7d 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -28,6 +28,43 @@
>>>
>>> static struct cpufreq_driver cppc_cpufreq_driver;
>>>
>>> +/* Autonomous Selection boot parameter modes */
>>> +enum {
>>> + AUTO_SEL_PERFORMANCE = 1,
>>> + AUTO_SEL_DEFAULT_EPP = 2,
>>> +};
>>> +
>>> +static int auto_sel_mode;
>>> +
>>> +static int auto_sel_mode_set(const char *val, const struct
>>> kernel_param *kp)
>>> +{
>>> + if (sysfs_streq(val, "performance") || sysfs_streq(val, "1"))
>>> + *(int *)kp->arg = AUTO_SEL_PERFORMANCE;
>>> + else if (sysfs_streq(val, "default_epp") || sysfs_streq(val, "2"))
>>> + *(int *)kp->arg = AUTO_SEL_DEFAULT_EPP;
>>> + else
>>> + return -EINVAL;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int auto_sel_mode_get(char *buffer, const struct kernel_param
>>> *kp)
>>> +{
>>> + switch (*(int *)kp->arg) {
>>> + case AUTO_SEL_PERFORMANCE:
>>> + return sysfs_emit(buffer, "performance\n");
>>> + case AUTO_SEL_DEFAULT_EPP:
>>> + return sysfs_emit(buffer, "default_epp\n");
>>> + default:
>>> + return sysfs_emit(buffer, "disabled\n");
>>> + }
>>> +}
>>> +
>>> +static const struct kernel_param_ops auto_sel_mode_ops = {
>>> + .set = auto_sel_mode_set,
>>> + .get = auto_sel_mode_get,
>>> +};
>>> +
>>> #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
>>> static enum {
>>> FIE_UNSET = -1,
>>> @@ -715,11 +752,75 @@ static int cppc_cpufreq_cpu_init(struct
>>> cpufreq_policy *policy)
>>> policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
>>> cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
>>>
>>> - ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> - if (ret) {
>>> - pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
>>> - caps->highest_perf, cpu, ret);
>>> - goto out;
>>> + /*
>>> + * Enable autonomous mode on first init if boot param is set.
>>> + * Check last_governor to detect first init and skip if auto_sel
>>> + * is already enabled.
>>> + */
>>> + if (auto_sel_mode && policy->last_governor[0] == '\0' &&
>>> + !cpu_data->perf_ctrls.auto_sel) {
>>> + /* Init min/max_perf from caps if not already set by
>>> HW. */
>>> + if (!cpu_data->perf_ctrls.min_perf)
>>> + cpu_data->perf_ctrls.min_perf = caps-
>>> >lowest_nonlinear_perf;
>>> + if (!cpu_data->perf_ctrls.max_perf)
>>> + cpu_data->perf_ctrls.max_perf = policy-
>>> >boost_enabled ?
>>> + caps->highest_perf : caps->nominal_perf;
>>> +
>>> + /*
>>> + * In autonomous mode desired_perf is only a hint; EPP and
>>> + * the platform drive actual selection within [min, max].
>>> + * Initialize it to max_perf so HW starts at the upper
>>> bound.
>>> + */
>>> + cpu_data->perf_ctrls.desired_perf = cpu_data-
>>> >perf_ctrls.max_perf;
>>> +
>>> + policy->cur = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.desired_perf);
>>> +
>>> + /*
>>> + * Override EPP only in 'performance' mode;
>>> 'default_epp' mode
>>> + * preserves the BIOS/firmware programmed EPP value.
>>> + * EPP is optional - some platforms may not support it.
>>> + */
>>> + if (auto_sel_mode == AUTO_SEL_PERFORMANCE) {
>>> + ret = cppc_set_epp(cpu,
>>> CPPC_EPP_PERFORMANCE_PREF);
>>> + if (ret && ret != -EOPNOTSUPP)
>>> + pr_warn("Failed to set EPP for CPU%d
>>> (%d)\n", cpu, ret);
>>> + else if (!ret)
>>> + cpu_data->perf_ctrls.energy_perf = CPPC_EPP_PERFORMANCE_PREF;
>>> + }
>>> +
>>> + /* Program min/max/desired into CPPC regs (non-fatal on
>>> failure). */
>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> + if (ret)
>>> + pr_warn("set_perf failed CPU%d (%d); using HW
>>> values\n",
>>> + cpu, ret);
>>> +
>>> + ret = cppc_set_auto_sel(cpu, true);
>>> + if (ret && ret != -EOPNOTSUPP)
>>> + pr_warn("auto_sel CPU%d failed (%d); using OS
>>> mode\n",
>>> + cpu, ret);
>>> + else if (!ret)
>>> + cpu_data->perf_ctrls.auto_sel = true;
>>> + }
>>> +
>>> + if (cpu_data->perf_ctrls.auto_sel) {
>>> + /* Sync policy limits from HW when autonomous mode is
>>> active */
>>> + policy->min = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.min_perf ?:
>>> + caps->lowest_nonlinear_perf);
>>> + policy->max = cppc_perf_to_khz(caps,
>>> + cpu_data->perf_ctrls.max_perf ?:
>>> + (policy->boost_enabled ?
>>> + caps->highest_perf :
>>> + caps->nominal_perf));
>>> + } else {
>>> + /* Normal mode: governors control frequency */
>>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>>> + if (ret) {
>>> + pr_debug("Err setting perf value:%d on CPU:%d.
>>> ret:%d\n",
>>> + caps->highest_perf, cpu, ret);
>>> + goto out;
>>> + }
>>> }
>>>
>>> cppc_cpufreq_cpu_fie_init(policy);
>>> @@ -1079,10 +1180,21 @@ static int __init cppc_cpufreq_init(void)
>>>
>>> static void __exit cppc_cpufreq_exit(void)
>>> {
>>> + unsigned int cpu;
>>> +
>>> + for_each_present_cpu(cpu)
>>> + cppc_set_auto_sel(cpu, false);
>>> +
>>> cpufreq_unregister_driver(&cppc_cpufreq_driver);
>>> cppc_freq_invariance_exit();
>>> }
>>>
>>> +module_param_cb(auto_sel_mode, &auto_sel_mode_ops, &auto_sel_mode,
>>> 0444);
>>> +MODULE_PARM_DESC(auto_sel_mode,
>>> + "Enable CPPC autonomous performance selection at boot: "
>>> + "performance or 1 (EPP=performance), "
>>> + "default_epp or 2 (preserve BIOS/firmware EPP)");
>>> +
>>> module_exit(cppc_cpufreq_exit);
>>> MODULE_AUTHOR("Ashwin Chaugule");
>>> MODULE_DESCRIPTION("CPUFreq driver based on the ACPI CPPC v5.0+
>>> spec");
>>
^ permalink raw reply
* Re: [PATCH v5 09/13] iio: frequency: ad9910: add RAM mode support
From: Rodrigo Alencar @ 2026-05-18 13:50 UTC (permalink / raw)
To: rodrigo.alencar, linux-iio, devicetree, linux-kernel, linux-doc,
linux-hardening
Cc: Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
David Lechner, Andy Shevchenko, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Philipp Zabel, Jonathan Corbet, Shuah Khan,
Kees Cook, Gustavo A. R. Silva
In-Reply-To: <20260517-ad9910-iio-driver-v5-9-31599c88314a@analog.com>
On 26/05/17 07:19PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
> - [High] Incomplete hardware rollback during a profile swap failure leaves AD9910 shadow registers corrupted.
> - [High] Missing software cache and hardware rollback when CFR1 register update fails.
> - [High] Unprotected software profile cache corruption on firmware upload failure.
> - [High] Firmware upload `.write` callback fails on chunked payloads and performs out-of-bounds memory reads.
> - [Medium] The RAM firmware upload sysfs node and debugfs symlinks are registered with a `(null)` name.
> --
>
> commit 3876f09fc86743085e5099fea5762c34dc150ecf
> Author: Rodrigo Alencar <rodrigo.alencar@analog.com>
>
> iio: frequency: ad9910: add RAM mode support
>
> Add RAM control channel, which includes RAM data loading via firmware
> upload interface, per-profile configuration, profile switching,
> sampling frequency control, and ram-enable-aware read/write paths.
>
> > diff --git a/drivers/iio/frequency/ad9910.c b/drivers/iio/frequency/ad9910.c
> > index 4ad80475139d7..86ed350011cff 100644
> > --- a/drivers/iio/frequency/ad9910.c
> > +++ b/drivers/iio/frequency/ad9910.c
>
> [ ... ]
>
> > @@ -1202,6 +1306,41 @@ static int ad9910_write_raw(struct iio_dev *indio_dev,
> > return ad9910_reg32_update(st, AD9910_REG_CFR2,
> > AD9910_CFR2_DRG_ENABLE_MSK,
> > tmp32, true);
> > + case AD9910_CHANNEL_RAM:
> > + if (AD9910_RAM_ENABLED(st) == !!val)
> > + return 0;
> > +
> > + /* swap profile configs */
> > + for (i = 0; i < AD9910_NUM_PROFILES; i++) {
> > + tmp64 = st->reg[AD9910_REG_PROFILE(i)].val64;
> > + ret = ad9910_reg64_write(st,
> > + AD9910_REG_PROFILE(i),
> > + st->reg_profile[i],
> > + false);
> > + if (ret)
> > + break;
> > + st->reg_profile[i] = tmp64;
> > + }
> > +
> > + if (ret) {
> > + /*
> > + * After the write failure, profiles 0..i-1 were
> > + * already swapped in SW, but Hw registers are
> > + * still pending an IO update, so swap them back
> > + * in SW to keep the state consistent.
> > + */
> > + while (i--) {
> > + tmp64 = st->reg[AD9910_REG_PROFILE(i)].val64;
> > + st->reg[AD9910_REG_PROFILE(i)].val64 = st->reg_profile[i];
> > + st->reg_profile[i] = tmp64;
> > + }
> > + return ret;
>
> If a write fails mid-loop, this rollback block restores the software cache
> but doesn't issue SPI writes to revert the hardware shadow registers.
> Could the next successful SPI write with an IO_UPDATE pulse inadvertently
> latch these corrupted, partially-swapped profiles into the active hardware
> state?
The approach reverts the changes on cached values, so we do not have RAM and single
tone profile values mixed in software. The hardware might be in an invalid state,
but an error is reported to the user.
> > + }
> > +
> > + tmp32 = FIELD_PREP(AD9910_CFR1_RAM_ENABLE_MSK, !!val);
> > + return ad9910_reg32_update(st, AD9910_REG_CFR1,
> > + AD9910_CFR1_RAM_ENABLE_MSK,
> > + tmp32, true);
>
> If ad9910_reg32_update() fails here, the function returns immediately without
> reverting the swapped profiles. Should there be a rollback of the swapped
> profiles in this error path to prevent the device from operating with corrupted
> synthesizer parameters?
Yes, I'd this part can be improved. Writing to CFR1 can be moved up (before the cache
restore).
>
> > default:
> > return -EINVAL;
> > }
>
> [ ... ]
>
> > @@ -1475,6 +1657,126 @@ static int ad9910_read_label(struct iio_dev *indio_dev,
> > +static enum fw_upload_err ad9910_ram_fwu_write(struct fw_upload *fw_upload,
> > + const u8 *data, u32 offset,
> > + u32 size, u32 *written)
> > +{
> > + const struct ad9910_ram_fw *fw_data = (const struct ad9910_ram_fw *)data;
> > + struct ad9910_state *st = fw_upload->dd_handle;
> > + int ret, ret2, idx, wcount;
> > + u64 tmp64, backup;
> > +
> > + if (offset != 0)
> > + return FW_UPLOAD_ERR_INVALID_SIZE;
>
> The sysfs firmware upload framework chunks data into blocks bounded by
> PAGE_SIZE (typically 4096 bytes). Since the maximum AD9910 firmware size
> can be 4176 bytes, this driver will reject any chunks after the first.
> Can this be adjusted to support firmware files larger than PAGE_SIZE?
That does not seem to be the case:
https://elixir.bootlin.com/linux/v7.1-rc3/source/drivers/base/firmware_loader/sysfs_upload.c#L184
Sysfs binary attribute writes from userspace are limited to PAGE_SIZE per
write() syscall (enforced in fs/sysfs/file.c). But the firmware loader
framework accumulates these into an internal buffer (fw->data).
The driver's write callback is called from fw_upload_main() in sysfs_upload.c.
So, the key question is whether that function passes fw->size - offset
(the full remaining size) or min(fw->size - offset, PAGE_SIZE) (PAGE_SIZE-bounded chunks)
to the driver callback.
The referenced code shows that the write callback is not called every
PAGE_SIZE chunks, but the callback implementation that defines the how
much it can handle at each call.
Also, I have already tested this code with a FW file bigger than 4k.
> > +
> > + guard(mutex)(&st->lock);
> > +
> > + if (st->ram_fwu_cancel)
> > + return FW_UPLOAD_ERR_CANCELED;
> > +
> > + if (AD9910_RAM_ENABLED(st))
> > + return FW_UPLOAD_ERR_HW_ERROR;
> > +
> > + for (idx = 0; idx < AD9910_NUM_PROFILES; idx++)
> > + st->reg_profile[idx] = get_unaligned_be64(&fw_data->profiles[idx]) |
> > + AD9910_PROFILE_RAM_OPEN_MSK;
>
> Here the software cache st->reg_profile[] is overwritten before the SPI
> hardware operations. If any subsequent hardware writes fail, the function
> aborts and the cache is left with partially uploaded profiles.
> Could the original cache be backed up and restored on failure?
I'd say that is not an issue. When RAM is disabled (requirement to enter this function),
this cache is exclusive to be managed by the RAM mode. This is meant to be
written here.
>
> > +
> > + ret = ad9910_reg32_update(st, AD9910_REG_CFR1,
> > + AD9910_CFR1_RAM_PLAYBACK_DEST_MSK |
> > + AD9910_CFR1_INT_PROFILE_CTL_MSK,
> > + get_unaligned_be32(&fw_data->cfr1), true);
> > + if (ret)
> > + return FW_UPLOAD_ERR_RW_ERROR;
> > +
> > + wcount = get_unaligned_be16(&fw_data->wcount);
> > + if (!wcount) {
> > + *written = size;
> > + return FW_UPLOAD_ERR_NONE; /* nothing else to write */
> > + }
> > +
> > + ret = ad9910_profile_set(st, st->profile);
> > + if (ret)
> > + return FW_UPLOAD_ERR_HW_ERROR;
> > +
> > + /* backup profile register and update it with required address range */
> > + backup = st->reg[AD9910_REG_PROFILE(st->profile)].val64;
> > + tmp64 = AD9910_PROFILE_RAM_STEP_RATE_MSK |
> > + FIELD_PREP(AD9910_PROFILE_RAM_START_ADDR_MSK, 0) |
> > + FIELD_PREP(AD9910_PROFILE_RAM_END_ADDR_MSK, wcount - 1);
> > + ret = ad9910_reg64_write(st, AD9910_REG_PROFILE(st->profile), tmp64, true);
> > + if (ret)
> > + return FW_UPLOAD_ERR_RW_ERROR;
> > +
> > + memcpy(&st->tx_buf[1], fw_data->words, wcount * AD9910_RAM_WORD_SIZE);
>
> If the firmware is larger than PAGE_SIZE, size will be 4096 on the first
> call, but wcount could request more bytes than are present in the chunk.
> Does this memcpy() risk reading out-of-bounds memory from the data buffer?
this is dicussed above, from drivers/base/firmware_loader/sysfs_upload.c it
does not seem to be true that the write happens in chunks of PAGE_SIZE.
> [ ... ]
>
> > @@ -1688,7 +2014,25 @@ static int ad9910_probe(struct spi_device *spi)
> > if (ret)
> > return dev_err_probe(dev, ret, "device setup failed\n");
> >
> > - return devm_iio_device_register(dev, indio_dev);
> > + snprintf(st->ram_fwu_name, sizeof(st->ram_fwu_name), "%s:ram",
> > + dev_name(&indio_dev->dev));
>
> The IIO core does not assign the device name to indio_dev->dev until
> devm_iio_device_register() is called. Will dev_name() return "(null)" here,
> resulting in sysfs nodes named "/sys/class/firmware/(null):ram"?
> This might also cause probe failures due to sysfs name collisions if multiple
> AD9910 devices are present on the system.
I suppose that is incorrect.
dev_set_name(&indio_dev->dev, "iio:device%d", iio_dev_opaque->id)
happens during iio_device_alloc().
>
> > + st->ram_fwu = firmware_upload_register(THIS_MODULE, dev, st->ram_fwu_name,
> > + &ad9910_ram_fwu_ops, st);
> > + if (IS_ERR(st->ram_fwu))
> > + return dev_err_probe(dev, PTR_ERR(st->ram_fwu),
> > + "failed to register ram upload ops\n");
> > +
> > + ret = devm_add_action_or_reset(dev, ad9910_ram_fwu_unregister, st->ram_fwu);
> > + if (ret)
> > + return dev_err_probe(dev, ret,
> > + "failed to add ram upload unregister action\n");
> > +
> > + ret = devm_iio_device_register(dev, indio_dev);
> > + if (ret)
> > + return ret;
> > +
> > + ad9910_debugfs_init(st, indio_dev);
> > + return 0;
> > }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260517-ad9910-iio-driver-v5-0-31599c88314a@analog.com?part=9
--
Kind regards,
Rodrigo Alencar
^ permalink raw reply
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Usama Arif @ 2026-05-18 13:49 UTC (permalink / raw)
To: Nico Pache
Cc: Usama Arif, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260511185817.686831-5-npache@redhat.com>
On Mon, 11 May 2026 12:58:04 -0600 Nico Pache <npache@redhat.com> wrote:
> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
>
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduces
> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
>
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> available mTHP order.
>
> This removes the possiblilty of "creep", while not modifying any uAPI
> expectations. A warning will be emitted if any non-supported
> max_ptes_none value is configured with mTHP enabled.
>
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
>
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
>
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> include/trace/events/huge_memory.h | 3 +-
> mm/khugepaged.c | 117 ++++++++++++++++++++---------
> 2 files changed, 85 insertions(+), 35 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index bcdc57eea270..443e0bd13fdb 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -39,7 +39,8 @@
> EM( SCAN_STORE_FAILED, "store_failed") \
> EM( SCAN_COPY_MC, "copy_poisoned_page") \
> EM( SCAN_PAGE_FILLED, "page_filled") \
> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
> + EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
>
> #undef EM
> #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index f68853b3caa7..27465161fa6d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,7 @@ enum scan_result {
> SCAN_COPY_MC,
> SCAN_PAGE_FILLED,
> SCAN_PAGE_DIRTY_OR_WRITEBACK,
> + SCAN_INVALID_PTES_NONE,
> };
>
> #define CREATE_TRACE_POINTS
> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
> * PTEs for the given collapse operation.
> * @cc: The collapse control struct
> * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of none-page or zero-page PTEs allowed for the
> * collapse operation.
> */
> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> - struct vm_area_struct *vma)
> +static int collapse_max_ptes_none(struct collapse_control *cc,
> + struct vm_area_struct *vma, unsigned int order)
> {
> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
> if (vma && userfaultfd_armed(vma))
> return 0;
> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> - // For all other cases repect the user defined maximum.
> - return khugepaged_max_ptes_none;
> + // for PMD collapse, respect the user defined maximum.
> + if (is_pmd_order(order))
> + return max_ptes_none;
> + /* Zero/non-present collapse disabled. */
> + if (!max_ptes_none)
> + return 0;
> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> + // scale the maximum number of PTEs to the order of the collapse.
> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> + return (1 << order) - 1;
> +
> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
> + // Emit a warning and return -EINVAL.
> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
> + KHUGEPAGED_MAX_PTES_LIMIT);
> + return -EINVAL;
> }
>
> /**
> * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> * anonymous pages for the given collapse operation.
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of PTEs that map shared anonymous pages for the
> * collapse operation
> */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> + unsigned int order)
> {
> // for MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> // anonymous pages.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + // for mTHP collapse do not allow collapsing anonymous memory pages that
> + // are shared between processes.
> + if (!is_pmd_order(order))
> + return 0;
> + // for PMD collapse, respect the user defined maximum.
> return khugepaged_max_ptes_shared;
> }
>
> @@ -391,16 +415,22 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> * maximum allowed non-present pagecache entries for the given collapse operation.
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> * pagecache entries for the collapse operation.
> */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> + unsigned int order)
> {
> // for MADV_COLLAPSE, do not restrict the number PTEs entries or
> // pagecache entries that are non-present.
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + // for mTHP collapse do not allow any non-present PTEs or pagecache entries.
> + if (!is_pmd_order(order))
> + return 0;
> + // for PMD collapse, respect the user defined maximum.
> return khugepaged_max_ptes_swap;
> }
>
> @@ -594,18 +624,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> - struct list_head *compound_pagelist)
> + unsigned int order, struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
> - unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> + int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> + unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> +
> + if (max_ptes_none < 0)
> + return SCAN_INVALID_PTES_NONE;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + nr_pages;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> @@ -738,18 +772,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct vm_area_struct *vma, unsigned long address,
> + spinlock_t *ptl, unsigned int order,
> + struct list_head *compound_pagelist)
> {
> - unsigned long end = address + HPAGE_PMD_SIZE;
> + const unsigned long nr_pages = 1UL << order;
> + unsigned long end = address + (PAGE_SIZE << order);
> struct folio *src, *tmp;
> pte_t pteval;
> pte_t *_pte;
> unsigned int nr_ptes;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> + for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> address += nr_ptes * PAGE_SIZE) {
> nr_ptes = 1;
> pteval = ptep_get(_pte);
> @@ -802,11 +836,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> }
>
> static void __collapse_huge_page_copy_failed(pte_t *pte,
> - pmd_t *pmd,
> - pmd_t orig_pmd,
> - struct vm_area_struct *vma,
> - struct list_head *compound_pagelist)
> + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> + unsigned int order, struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> spinlock_t *pmd_ptl;
>
> /*
> @@ -822,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> + release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> }
>
> /*
> @@ -842,16 +875,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> */
> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> - unsigned long address, spinlock_t *ptl,
> + unsigned long address, spinlock_t *ptl, unsigned int order,
> struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> unsigned int i;
> enum scan_result result = SCAN_SUCCEED;
>
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> + for (i = 0; i < nr_pages; i++) {
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -870,10 +904,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> - compound_pagelist);
> + order, compound_pagelist);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> - compound_pagelist);
> + order, compound_pagelist);
>
> return result;
> }
> @@ -1044,12 +1078,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> */
Can you add a comment above __collapse_huge_page_swapin function that says its only
done for PMD size only? Something like:
For PMD-order collapse this faults in any swap entries it finds. For mTHP
orders the function bails on the first swap entry with SCAN_EXCEED_SWAP_PTE,
because faulting pages back in during a lower-order collapse could re-populate
PTEs that push a later scan over the threshold for a higher-order collapse.
> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> - int referenced)
> + struct vm_area_struct *vma, unsigned long start_addr,
> + pmd_t *pmd, int referenced, unsigned int order)
> {
> int swapped_in = 0;
> vm_fault_t ret = 0;
> - unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> + unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> enum scan_result result;
> pte_t *pte = NULL;
> spinlock_t *ptl;
> @@ -1081,6 +1115,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> pte_present(vmf.orig_pte))
> continue;
>
> + /*
> + * TODO: Support swapin without leading to further mTHP
> + * collapses. Currently bringing in new pages via swapin may
> + * cause a future higher order collapse on a rescan of the same
> + * range.
> + */
> + if (!is_pmd_order(order)) {
> + pte_unmap(pte);
> + mmap_read_unlock(mm);
> + result = SCAN_EXCEED_SWAP_PTE;
> + goto out;
> + }
> +
> vmf.pte = pte;
> vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
> @@ -1200,7 +1247,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * that case. Continuing to collapse causes inconsistency.
> */
> result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced);
> + referenced, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1248,6 +1295,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> @@ -1278,6 +1326,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
> @@ -1313,9 +1362,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> - const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> pmd_t *pmd;
> pte_t *pte, *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2369,8 +2418,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> unsigned long addr, struct file *file, pgoff_t start,
> struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> - const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + const int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> XA_STATE(xas, &mapping->i_pages, start);
> --
> 2.54.0
>
>
^ permalink raw reply
* Re: [PATCH v3 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-05-18 13:49 UTC (permalink / raw)
To: Randy Dunlap, rafael, viresh.kumar, pierre.gondois,
ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan,
mario.limonciello, linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <b4516579-c4bf-4ddd-843a-30d4a4992519@infradead.org>
Hi Randy,
On 16/05/26 03:44, Randy Dunlap wrote:
>
> On 5/15/26 5:26 AM, Sumit Gupta wrote:
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 0eb64aab3685..7e4b3a8fd76f 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1048,6 +1048,22 @@ Kernel parameters
>> policy to use. This governor must be registered in the
>> kernel before the cpufreq driver probes.
>>
>> + cppc_cpufreq.auto_sel_mode=
>> + [CPU_FREQ] Enable ACPI CPPC autonomous performance
> I just noticed that we should have both CPU_FREQ and CPU_IDLE added to the
> legend (meanings) section at the very beginning of this file, but that
> doesn't have to be part of this patch.
Thanks.
Will send a separate patch adding CPU_FREQ and CPU_IDLE to the legend.
Thank you,
Sumit Gupta
>
>> + selection. When enabled, hardware automatically adjusts
>> + CPU frequency on all CPUs based on workload demands.
>> + In Autonomous mode, Energy Performance Preference (EPP)
>> + hints guide hardware toward performance (0x0) or energy
>> + efficiency (0xff).
>> + Requires ACPI CPPC autonomous selection register
>> + support.
>> + Accepts:
>> + performance, 1: enable auto_sel + set EPP to
>> + performance (0x0)
>> + default_epp, 2: enable auto_sel, preserve EPP value
>> + programmed by BIOS/firmware
>> + Unset: cpufreq governors are used (auto_sel disabled).
^ permalink raw reply
* Re: [PATCH RFC v4 09/10] Documentation: ABI: testing: add docs for ad9910 sysfs entries
From: Jonathan Cameron @ 2026-05-18 13:45 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
devicetree, linux-kernel, linux-doc, linux-hardening,
Lars-Peter Clausen, Michael Hennerich, David Lechner,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
Gustavo A. R. Silva
In-Reply-To: <yrabhhhdkzmiuxlqzrrj6a47ftlzwvva7r2korzeszdy4yqrin@xl6obhhnnas4>
On Sun, 17 May 2026 18:30:27 +0100
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> On 26/05/17 03:58PM, Jonathan Cameron wrote:
> > On Fri, 08 May 2026 18:00:25 +0100
> > Rodrigo Alencar via B4 Relay <devnull+rodrigo.alencar.analog.com@kernel.org> wrote:
> >
> > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > >
> > > Add custom ABI documentation file for the DDS AD9910 with sysfs entries to
> > > control Parallel Port, Digital Ramp Generator and OSK parameters.
> > >
> > > Signed-off-by: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > I'm fine with phase and frequency as defined, but for the scaling it made me wonder.
> > For outvoltage0 channels the assumption the value is the peak voltage so if
> > we know what input to be modulated by the ramp generator can we express them
> > in volts (well milivolts) rather than as a scaling multiplier?
>
> The DAC output is current-based and differential. Voltage conversion would happen
> outside the device...
Why aren't we representing this as out_altcurrentX-Y_xxxx?
> using a resistor load or an op-amp transimpedance stage,
> and I am no expert on that, but that often requires impedance matching so voltage
> levels may depend on the frequency. Then, I suppose that voltage is not the right
> unit to use.
Understood that it can get complex!
>
> The scale here controls the amplitude of the varying signal. Assuming the peak voltage
> (amplitude) is constant means we have a constant envelope, but that should not mean
> we can't control it or it should not mean that the hardware can have other ways to
> control it. That said, scale behaves as a "gain multiplier".
Understood. Given it's the envelope then if scale happened to be 1 always it would
be presented as _processed. So this is consistent with other channel types.
>
> >
> > That seems to me like it fits better with the overall ABI.
> >
> > > +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_offset
> > > +KernelVersion:
> > > +Contact: linux-iio@vger.kernel.org
> > > +Description:
> > > + For a channel that allows amplitude control through buffers, this
> > > + represents the value for a base amplitude scale. The actual output
> > > + amplitude scale is a result with the sum of this value.
> > > +
> >
> > > +
> > > +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale_roc
> >
> > Silly question perhaps but can work out how this related to millivolts/sec
> > That might make a more intuitive interface than scaling multiplier per sec
> > Perhaps the combination with offset makes this impossible though maybe that
> > could be a expressed as a voltage offset? Afterall if the amplitude being
> > scaled is 5V then 5 * (offset + scale) = 5 * offset + 5 * scale
> >
> > > +KernelVersion:
> > > +Contact: linux-iio@vger.kernel.org
> > > +Description:
> > > + Amplitude scale rate of change in 1/s for channels that ramp
> > > + amplitude. This value may be influenced by the channel's
> > > + sampling_frequency setting.
> >
> >
>
^ permalink raw reply
* Re: [PATCH v3 2/2] cpufreq: CPPC: add autonomous mode boot parameter support
From: Sumit Gupta @ 2026-05-18 13:44 UTC (permalink / raw)
To: Mario Limonciello, rafael, viresh.kumar, pierre.gondois,
ionela.voinescu, zhenglifeng1, zhanjie9, corbet, skhan, rdunlap,
linux-pm, linux-doc, linux-kernel
Cc: linux-tegra, treding, jonathanh, vsethi, ksitaraman, sanjayc,
mochs, bbasu
In-Reply-To: <bf521e4e-1aa5-49ce-bec5-52845f02214e@amd.com>
Hi Mario,
On 16/05/26 02:43, Mario Limonciello wrote:
> External email: Use caution opening links or attachments
>
>
> On 5/15/26 07:26, Sumit Gupta wrote:
>> Add a kernel boot parameter 'cppc_cpufreq.auto_sel_mode' to enable
>> CPPC autonomous performance selection on all CPUs at system startup.
>> When autonomous mode is enabled, the hardware automatically adjusts
>> CPU performance based on workload demands using Energy Performance
>> Preference (EPP) hints.
>>
>> When the parameter is set:
>> - Configure all CPUs for autonomous operation on first init
>> - Use HW min/max_perf when available; otherwise initialize from caps
>> - Initialize desired_perf to max_perf as a starting hint
>> - Hardware controls frequency instead of the OS governor
>> - EPP behavior depends on parameter value:
>> - performance (or 1): override EPP to performance preference (0x0)
>> - default_epp (or 2): preserve EPP value programmed by BIOS/firmware
>>
>> The boot parameter is applied only during first policy initialization.
>> Skip applying it on CPU hotplug to preserve runtime sysfs configuration.
>>
>> This patch depends on patch series [1] ("cpufreq: Set policy->min and
>> max as real QoS constraints") so that the policy->min/max set in
>> cppc_cpufreq_cpu_init() are not overridden by cpufreq_set_policy()
>> during init.
>>
>> Signed-off-by: Sumit Gupta <sumitg@nvidia.com>
>> ---
>> [1]
>> https://lore.kernel.org/lkml/20260511135538.522653-1-pierre.gondois@arm.com/
>> ---
>> .../admin-guide/kernel-parameters.txt | 16 +++
>> drivers/cpufreq/cppc_cpufreq.c | 122 +++++++++++++++++-
>> 2 files changed, 133 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt
>> b/Documentation/admin-guide/kernel-parameters.txt
>> index 0eb64aab3685..7e4b3a8fd76f 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -1048,6 +1048,22 @@ Kernel parameters
>> policy to use. This governor must be registered
>> in the
>> kernel before the cpufreq driver probes.
>>
>> + cppc_cpufreq.auto_sel_mode=
>> + [CPU_FREQ] Enable ACPI CPPC autonomous performance
>> + selection. When enabled, hardware automatically
>> adjusts
>> + CPU frequency on all CPUs based on workload
>> demands.
>> + In Autonomous mode, Energy Performance
>> Preference (EPP)
>> + hints guide hardware toward performance (0x0)
>> or energy
>> + efficiency (0xff).
>> + Requires ACPI CPPC autonomous selection register
>> + support.
>> + Accepts:
>> + performance, 1: enable auto_sel + set EPP to
>> + performance (0x0)
>> + default_epp, 2: enable auto_sel, preserve EPP
>> value
>> + programmed by BIOS/firmware
>> + Unset: cpufreq governors are used (auto_sel
>> disabled).
>
> Rather than unset doing nothing, have you considered having it take a
> midpoint like 128? That's what we do in amd-pstate (default to
> balance_performance). I think it turns into a reasonable balance.
Thanks for the suggestion.
I can add balance_performance that enables auto_sel with EPP=128 in v4.
On changing the driver default (no param behavior) to auto enable
balance_performance, it would be good to keep the current behavior for
now since cppc_cpufreq is generic across ARM64/RISC-V platforms where
EPP and Autonomous Selection registers are optional.
A default change would affect existing users relying on governors.
Thank you,
Sumit Gupta
>
>> +
>> cpu_init_udelay=N
>> [X86,EARLY] Delay for N microsec between assert
>> and de-assert
>> of APIC INIT to start processors. This delay
>> occurs
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>> b/drivers/cpufreq/cppc_cpufreq.c
>> index 6b54427b52e1..5f4d735e7c7d 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -28,6 +28,43 @@
>>
>> static struct cpufreq_driver cppc_cpufreq_driver;
>>
>> +/* Autonomous Selection boot parameter modes */
>> +enum {
>> + AUTO_SEL_PERFORMANCE = 1,
>> + AUTO_SEL_DEFAULT_EPP = 2,
>> +};
>> +
>> +static int auto_sel_mode;
>> +
>> +static int auto_sel_mode_set(const char *val, const struct
>> kernel_param *kp)
>> +{
>> + if (sysfs_streq(val, "performance") || sysfs_streq(val, "1"))
>> + *(int *)kp->arg = AUTO_SEL_PERFORMANCE;
>> + else if (sysfs_streq(val, "default_epp") || sysfs_streq(val, "2"))
>> + *(int *)kp->arg = AUTO_SEL_DEFAULT_EPP;
>> + else
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +static int auto_sel_mode_get(char *buffer, const struct kernel_param
>> *kp)
>> +{
>> + switch (*(int *)kp->arg) {
>> + case AUTO_SEL_PERFORMANCE:
>> + return sysfs_emit(buffer, "performance\n");
>> + case AUTO_SEL_DEFAULT_EPP:
>> + return sysfs_emit(buffer, "default_epp\n");
>> + default:
>> + return sysfs_emit(buffer, "disabled\n");
>> + }
>> +}
>> +
>> +static const struct kernel_param_ops auto_sel_mode_ops = {
>> + .set = auto_sel_mode_set,
>> + .get = auto_sel_mode_get,
>> +};
>> +
>> #ifdef CONFIG_ACPI_CPPC_CPUFREQ_FIE
>> static enum {
>> FIE_UNSET = -1,
>> @@ -715,11 +752,75 @@ static int cppc_cpufreq_cpu_init(struct
>> cpufreq_policy *policy)
>> policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
>> cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
>>
>> - ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>> - if (ret) {
>> - pr_debug("Err setting perf value:%d on CPU:%d. ret:%d\n",
>> - caps->highest_perf, cpu, ret);
>> - goto out;
>> + /*
>> + * Enable autonomous mode on first init if boot param is set.
>> + * Check last_governor to detect first init and skip if auto_sel
>> + * is already enabled.
>> + */
>> + if (auto_sel_mode && policy->last_governor[0] == '\0' &&
>> + !cpu_data->perf_ctrls.auto_sel) {
>> + /* Init min/max_perf from caps if not already set by
>> HW. */
>> + if (!cpu_data->perf_ctrls.min_perf)
>> + cpu_data->perf_ctrls.min_perf =
>> caps->lowest_nonlinear_perf;
>> + if (!cpu_data->perf_ctrls.max_perf)
>> + cpu_data->perf_ctrls.max_perf =
>> policy->boost_enabled ?
>> + caps->highest_perf : caps->nominal_perf;
>> +
>> + /*
>> + * In autonomous mode desired_perf is only a hint; EPP and
>> + * the platform drive actual selection within [min, max].
>> + * Initialize it to max_perf so HW starts at the upper
>> bound.
>> + */
>> + cpu_data->perf_ctrls.desired_perf =
>> cpu_data->perf_ctrls.max_perf;
>> +
>> + policy->cur = cppc_perf_to_khz(caps,
>> + cpu_data->perf_ctrls.desired_perf);
>> +
>> + /*
>> + * Override EPP only in 'performance' mode;
>> 'default_epp' mode
>> + * preserves the BIOS/firmware programmed EPP value.
>> + * EPP is optional - some platforms may not support it.
>> + */
>> + if (auto_sel_mode == AUTO_SEL_PERFORMANCE) {
>> + ret = cppc_set_epp(cpu,
>> CPPC_EPP_PERFORMANCE_PREF);
>> + if (ret && ret != -EOPNOTSUPP)
>> + pr_warn("Failed to set EPP for CPU%d
>> (%d)\n", cpu, ret);
>> + else if (!ret)
>> + cpu_data->perf_ctrls.energy_perf = CPPC_EPP_PERFORMANCE_PREF;
>> + }
>> +
>> + /* Program min/max/desired into CPPC regs (non-fatal on
>> failure). */
>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>> + if (ret)
>> + pr_warn("set_perf failed CPU%d (%d); using HW
>> values\n",
>> + cpu, ret);
>> +
>> + ret = cppc_set_auto_sel(cpu, true);
>> + if (ret && ret != -EOPNOTSUPP)
>> + pr_warn("auto_sel CPU%d failed (%d); using OS
>> mode\n",
>> + cpu, ret);
>> + else if (!ret)
>> + cpu_data->perf_ctrls.auto_sel = true;
>> + }
>> +
>> + if (cpu_data->perf_ctrls.auto_sel) {
>> + /* Sync policy limits from HW when autonomous mode is
>> active */
>> + policy->min = cppc_perf_to_khz(caps,
>> + cpu_data->perf_ctrls.min_perf ?:
>> + caps->lowest_nonlinear_perf);
>> + policy->max = cppc_perf_to_khz(caps,
>> + cpu_data->perf_ctrls.max_perf ?:
>> + (policy->boost_enabled ?
>> + caps->highest_perf :
>> + caps->nominal_perf));
>> + } else {
>> + /* Normal mode: governors control frequency */
>> + ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
>> + if (ret) {
>> + pr_debug("Err setting perf value:%d on CPU:%d.
>> ret:%d\n",
>> + caps->highest_perf, cpu, ret);
>> + goto out;
>> + }
>> }
>>
>> cppc_cpufreq_cpu_fie_init(policy);
>> @@ -1079,10 +1180,21 @@ static int __init cppc_cpufreq_init(void)
>>
>> static void __exit cppc_cpufreq_exit(void)
>> {
>> + unsigned int cpu;
>> +
>> + for_each_present_cpu(cpu)
>> + cppc_set_auto_sel(cpu, false);
>> +
>> cpufreq_unregister_driver(&cppc_cpufreq_driver);
>> cppc_freq_invariance_exit();
>> }
>>
>> +module_param_cb(auto_sel_mode, &auto_sel_mode_ops, &auto_sel_mode,
>> 0444);
>> +MODULE_PARM_DESC(auto_sel_mode,
>> + "Enable CPPC autonomous performance selection at boot: "
>> + "performance or 1 (EPP=performance), "
>> + "default_epp or 2 (preserve BIOS/firmware EPP)");
>> +
>> module_exit(cppc_cpufreq_exit);
>> MODULE_AUTHOR("Ashwin Chaugule");
>> MODULE_DESCRIPTION("CPUFreq driver based on the ACPI CPPC v5.0+
>> spec");
>
^ permalink raw reply
* Re: [PATCH RFC v4 03/10] iio: frequency: ad9910: initial driver implementation
From: Jonathan Cameron @ 2026-05-18 13:42 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: Rodrigo Alencar via B4 Relay, rodrigo.alencar, linux-iio,
devicetree, linux-kernel, linux-doc, linux-hardening,
Lars-Peter Clausen, Michael Hennerich, David Lechner,
Andy Shevchenko, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Philipp Zabel, Jonathan Corbet, Shuah Khan, Kees Cook,
Gustavo A. R. Silva
In-Reply-To: <is4rbxohz5icbaslatmjmzhb5oztnh6ytmntgkn3rssjijppb3@mur2heagprdi>
> > > + case IIO_CHAN_INFO_FREQUENCY:
> > > + switch (chan->channel) {
> > > + case AD9910_CHANNEL_PROFILE_0 ... AD9910_CHANNEL_PROFILE_7:
> > > + tmp32 = chan->channel - AD9910_CHANNEL_PROFILE_0;
> > > + tmp64 = FIELD_GET(AD9910_PROFILE_ST_FTW_MSK,
> > > + st->reg[AD9910_REG_PROFILE(tmp32)].val64);
> > > + break;
> > > + default:
> > > + return -EINVAL;
> > > + }
> > > + tmp64 *= st->data.sysclk_freq_hz;
> > > + *val = tmp64 >> 32;
> > > + *val2 = ((tmp64 & GENMASK_ULL(31, 0)) * MICRO) >> 32;
> >
> > Why in this particular case have this outside the switch / case whereas in others
> > you do the full maths and set inside? I'd put it inside and not worry about slightly
> > long lines.
>
> for frequency, those calculations are going to be common for the other channels that are
> going to be populated by other patches...
>
> DRG up/down and RAM will have tmp64 populated with a FTW value.
Makes sense. Thanks,
^ permalink raw reply
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Ian Kent @ 2026-05-18 13:39 UTC (permalink / raw)
To: Jan Kara
Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi,
Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner,
linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer
In-Reply-To: <yk2hem4zwinm4glenpc74to7sm5kyriksgwn6mxh7t4saotiba@7zik7jcnbs5m>
On 18/5/26 16:19, Jan Kara wrote:
> Hi Ian,
>
> On Mon 18-05-26 10:55:43, Ian Kent wrote:
>> On 18/5/26 07:55, NeilBrown wrote:
>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
>>> According to the email you linked, a problem arises when a directory has
>>> a great many negative children. Code which walks the list of children
>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
>>> and result in long lock-hold times. So maybe a limit on negative
>>> dentries for any parent is what we really want. That would be clumsy to
>>> implement I imagine.
>> But the notion of dropping the dentry in ->d_delete() on last dput() is
>> simple enough but did see regressions (the only other place in the VFS
>> besides dentry_kill() that the inode is unlinked from the dentry on
>> dput()). I wonder if the regression was related to the test itself
>> deliberately recreating deleted files and if that really is normal
>> behaviour. By itself that should prevent almost all negative dentries
>> being retained. Although file systems could do this as well (think XFS
>> inode recycling) it should be reasonable to require it be left to the
>> VFS.
>>
>> But even that's not enough given that, in my case, there would still be
>> around 4 million dentries in the LRU cache and in fsnotify there are
>> directory child traversals holding the parent i_lock "spinlock" that are
>> going to cause problems.
> Do you mean there are very many positive children of a directory?
Didn't quantify that.
The symptom is the "Spinlock held for more than ... seconds" occurring
in the log. So there are certainly a lot of children in the list, but
it's an assumption the ratio of positive to negative entries is roughly
the same as the overall ratio in the dcache.
>
>> That's all that much more puzzling when I see things like commit
>> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks
>> like it implies the child flag depends entirely on the parent state (what
>> am I missing Amir?)
> PARENT_WATCHED dentry flags (as the name suggests) are only caching the
> information whether the parent has notification marks receiving events from
> the child. So yes, the flag fully depends on the parent state.
Ok, this is something I was after, I will keep looking at the fsnotify
code since there is something to find, thanks for that.
>
>> so why is this traversal even retained in fsnotify?
> Not sure which traversal you mean but if you set watch on a parent, you
> have to walk all children to set PARENT_WATCHED flag so that you don't miss
> events on children...
Yes, that traversal is what I'm questioning ... again thanks.
I think the function name is still fsnotify_set_children_dentry_flags() in
recent kernels, the subject of commit 172e422ffea2 I mentioned above.
When you say miss events are you saying that accessing the parent dentry to
work out if the child needs to respond to an event is quite expensive in the
overall event processing context, that might make more sense to me ... or do
I completely not yet understand the reasoning behind the need for the flag?
>
>>> But what if we move dentries to the end of the list when they become
>>> negative, and to the start of the list when they become positive? Then
>>> code which walks the child list could simply abort on the first
>>> negative.
>>>
>>> I doubt that would be quite as easy as it sounds, but it would at least
>>> be more focused on the observed symptom rather than some whole-system
>>> number which only vaguely correlates with the observed symptom.
>>>
>>> Maybe a completely different approach: change children-walking code to
>>> drop and retake the lock (with appropriate validation) periodically.
>>> What too would address the specific symptom.
>> Another good question.
>>
>> I have assumed that dropping and re-taking the lock cannot be done but
>> this is a question I would like answered as well. Dropping and re-taking
>> lock would require, as Miklos pointed out to me off-list, recording the
>> list position with say a cursor, introducing unwanted complexity when it
>> would be better to accept the cost of a single extra access to the parent
>> flags (which I assume is one reason to set the flag in the child).
> The parent access is actually more expensive than you might think. Based on
> experience with past fsnotify related performance regression I expect some
> 20% performance hit for small tmpfs writes if you add unconditional parent
> access to the write path.
That sounds like a lot for what should be a memory access of an already in
memory structure since the parent must be accessed to traverse the list of
child entries. I clearly don't fully understand the implications of what
I'm saying but there has been mention of another context ...
Nevertheless more useful information, ;)
Thanks again,
Ian
^ permalink raw reply
* [PATCH v2] cpufreq: Documentation: fix sampling_down_factor range
From: Pengjie Zhang @ 2026-05-18 13:34 UTC (permalink / raw)
To: rafael, viresh.kumar, corbet
Cc: skhan, zhongqiu.han, linux-pm, linux-doc, zhanjie9, zhenglifeng1,
lihuisong, yubowen8, linhongye, linuxarm, zhangpengjie2,
wangzhi12
The ondemand governor implementation accepts sampling_down_factor values
from 1 to 100000 via MAX_SAMPLING_DOWN_FACTOR, but the documentation in
admin-guide/pm/cpufreq.rst still says the valid range is 1 to 100.
Update the documentation to match the actual code.
Fixes: 2a0e49279850 ("cpufreq: User/admin documentation update and consolidation")
Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Signed-off-by: Pengjie Zhang <zhangpengjie2@huawei.com>
---
Changes in v2:
- Modify the title.
- Add Reviewed-by tag.
Link to v1:https://lore.kernel.org/all/20260515094930.273599-1-zhangpengjie2@huawei.com/
---
Documentation/admin-guide/pm/cpufreq.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index dbe6d23a5d67..fdca59c955dc 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -516,7 +516,7 @@ This governor exposes the following tunables:
of those tasks above 0 and set this attribute to 1.
``sampling_down_factor``
- Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
+ Temporary multiplier, between 1 (default) and 100000 inclusive, to apply to
the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.
This causes the next execution of the governor's worker routine (after
--
2.33.0
^ permalink raw reply related
* Re: [PATCH v3] killswitch: add per-function short-circuit mitigation primitive
From: Sasha Levin @ 2026-05-18 13:33 UTC (permalink / raw)
To: Song Liu
Cc: linux-kernel, linux-doc, linux-kselftest, bpf, live-patching,
Greg Kroah-Hartman, Andrew Morton, Jonathan Corbet,
Mathieu Desnoyers, Joshua Peisach, Florian Weimer, Breno Leitao,
Anthony Iliopoulos, Michal Hocko, Jiri Olsa
In-Reply-To: <CAPhsuW4x8shWon8Moi5VgCq2n4E2EzaaauZ2HHpy42Rp1Y-J-g@mail.gmail.com>
On Sun, May 17, 2026 at 11:37:36PM -0700, Song Liu wrote:
>On Sun, May 17, 2026 at 6:49 AM Sasha Levin <sashal@kernel.org> wrote:
>> * fail_function (CONFIG_FUNCTION_ERROR_INJECTION) is disabled in
>> most production kernels. Even where enabled, it only works on
>> functions pre-annotated with ALLOW_ERROR_INJECTION() in source -
>> no help for a freshly-disclosed CVE. The debugfs UI is blocked by
>> lockdown=integrity and the override is probabilistic.
>>
>> * BPF override (bpf_override_return) honors the same
>> ALLOW_ERROR_INJECTION() whitelist, and BPF itself is off in many
>> production kernels. Even where on, the operator interface is
>> "load a verified BPF program," not a one-line write.
>
>If it is OK for killswitch to attach to any kernel functions, do we still
>need ALLOW_ERROR_INJECTION() for fail_function and BPF
>override? Shall we instead also allow fail_function and BPF override
>to attach to any kernel functions?
I don't think so. ALLOW_ERROR_INJECTION is not a security mechanism, it's an
integrity/safety mechanism for both bpf and fault injection.
It protects against a "developer or CI script doing legitimate fault injection
accidentally panics the box" scenario, not an "attacker gets in" one.
--
Thanks,
Sasha
^ permalink raw reply
* [PATCH] PCI/AER: Clear non-fatal errors on AER recovery failure
From: Yury Murashka @ 2026-05-18 13:23 UTC (permalink / raw)
To: bhelgaas, mahesh
Cc: oohall, corbet, skhan, linux-pci, linux-doc, linux-kernel,
linuxppc-dev, Yury Murashka
pci_aer_clear_nonfatal_status() is not called when AER recovery fails.
If a new AER error is subsequently reported, the AER driver calls
find_source_device() to find the source of the error. It rescans the
whole bus and picks the first device reporting an AER error. Because the
previous error was never cleared, the error is attributed to the wrong
device and AER recovery is started for the wrong device.
Add a kernel boot parameter pci=aer_clear_on_recovery_failure to clear
AER error status even when recovery fails, preventing stale errors from
causing incorrect device identification on subsequent AER events.
Signed-off-by: Yury Murashka <yurypm@arista.com>
---
Documentation/admin-guide/kernel-parameters.txt | 5 +++++
drivers/pci/pci.c | 2 ++
drivers/pci/pci.h | 2 ++
drivers/pci/pcie/err.c | 13 +++++++++++++
4 files changed, 22 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt
b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb..5a9e266f5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5301,6 +5301,11 @@ Kernel parameters
nomio [S390] Do not use MIO instructions.
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
+ aer_clear_on_recovery_failure
+ [PCIE] If the PCIEAER kernel config parameter is
+ enabled, this kernel boot option can be used to
+ enable AER errors cleanup even if error recovery
+ failed.
notph [PCIE] If the PCIE_TPH kernel config parameter
is enabled, this kernel boot option can be used
to disable PCIe TLP Processing Hints support
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651..701459c62 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6769,6 +6769,8 @@ static int __init pci_setup(char *str)
disable_acs_redir_param = str + 18;
} else if (!strncmp(str, "config_acs=", 11)) {
config_acs_param = str + 11;
+ } else if (!strncmp(str,
"aer_clear_on_recovery_failure", 29)) {
+ pci_enable_aer_clear_on_recovery_failure();
} else {
pr_err("PCI: Unknown option `%s'\n", str);
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e5..093a7c896 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1292,6 +1292,7 @@ int pci_aer_clear_status(struct pci_dev *dev);
int pci_aer_raw_clear_status(struct pci_dev *dev);
void pci_save_aer_state(struct pci_dev *dev);
void pci_restore_aer_state(struct pci_dev *dev);
+void pci_enable_aer_clear_on_recovery_failure(void);
#else
static inline void pci_no_aer(void) { }
static inline void pci_aer_init(struct pci_dev *d) { }
@@ -1301,6 +1302,7 @@ static inline int pci_aer_clear_status(struct
pci_dev *dev) { return -EINVAL; }
static inline int pci_aer_raw_clear_status(struct pci_dev *dev) {
return -EINVAL; }
static inline void pci_save_aer_state(struct pci_dev *dev) { }
static inline void pci_restore_aer_state(struct pci_dev *dev) { }
+static inline void pci_enable_aer_clear_on_recovery_failure(void) { }
#endif
#ifdef CONFIG_ACPI
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc11..29d655a34 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -21,6 +21,13 @@
#include "portdrv.h"
#include "../pci.h"
+static int enable_aer_clear_on_recovery_failure;
+
+void pci_enable_aer_clear_on_recovery_failure(void)
+{
+ enable_aer_clear_on_recovery_failure = 1;
+}
+
static pci_ers_result_t merge_result(enum pci_ers_result orig,
enum pci_ers_result new)
{
@@ -289,6 +296,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
failed:
+ if (enable_aer_clear_on_recovery_failure &&
+ (host->native_aer || pcie_ports_native)) {
+ pcie_clear_device_status(dev);
+ pci_aer_clear_nonfatal_status(dev);
+ }
+
pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
pci_walk_bridge(bridge, report_perm_failure_detected, NULL);
--
2.51.0
^ permalink raw reply related
* Re: [PATCH mm-unstable v17 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: David Hildenbrand (Arm) @ 2026-05-18 13:16 UTC (permalink / raw)
To: Wei Yang, Lance Yang
Cc: npache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, dev.jain,
gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, liam, ljs, mathieu.desnoyers, matthew.brost,
mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260514031009.f66cgop3ctgiqxz3@master>
On 5/14/26 05:10, Wei Yang wrote:
> On Tue, May 12, 2026 at 03:42:02PM +0800, Lance Yang wrote:
>>
>> On Mon, May 11, 2026 at 12:58:04PM -0600, Nico Pache wrote:
>>> generalize the order of the __collapse_huge_page_* and collapse_max_*
>>> functions to support future mTHP collapse.
>>>
>>> The current mechanism for determining collapse with the
>>> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
>>> raises a key design issue: if we support user defined max_pte_none values
>>> (even those scaled by order), a collapse of a lower order can introduces
>>> an feedback loop, or "creep", when max_ptes_none is set to a value greater
>>> than HPAGE_PMD_NR / 2. [1]
>>>
>>> With this configuration, a successful collapse to order N will populate
>>> enough pages to satisfy the collapse condition on order N+1 on the next
>>> scan. This leads to unnecessary work and memory churn.
>>>
>>> To fix this issue introduce a helper function that will limit mTHP
>>> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
>>> This effectively supports two modes: [2]
>>>
>>> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
>>> that maps the shared zeropage. Consequently, no memory bloat.
>>> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
>>> available mTHP order.
>>>
>>> This removes the possiblilty of "creep", while not modifying any uAPI
>>> expectations. A warning will be emitted if any non-supported
>>> max_ptes_none value is configured with mTHP enabled.
>>>
>>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
>>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
>>> shared or swapped entry.
>>>
>>> No functional changes in this patch; however it defines future behavior
>>> for mTHP collapse.
>>>
>>> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
>>> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>>>
>>> Co-developed-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>> include/trace/events/huge_memory.h | 3 +-
>>> mm/khugepaged.c | 117 ++++++++++++++++++++---------
>>> 2 files changed, 85 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>> index bcdc57eea270..443e0bd13fdb 100644
>>> --- a/include/trace/events/huge_memory.h
>>> +++ b/include/trace/events/huge_memory.h
>>> @@ -39,7 +39,8 @@
>>> EM( SCAN_STORE_FAILED, "store_failed") \
>>> EM( SCAN_COPY_MC, "copy_poisoned_page") \
>>> EM( SCAN_PAGE_FILLED, "page_filled") \
>>> - EMe(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback")
>>> + EM(SCAN_PAGE_DIRTY_OR_WRITEBACK, "page_dirty_or_writeback") \
>>> + EMe(SCAN_INVALID_PTES_NONE, "invalid_ptes_none")
>>>
>>> #undef EM
>>> #undef EMe
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index f68853b3caa7..27465161fa6d 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -61,6 +61,7 @@ enum scan_result {
>>> SCAN_COPY_MC,
>>> SCAN_PAGE_FILLED,
>>> SCAN_PAGE_DIRTY_OR_WRITEBACK,
>>> + SCAN_INVALID_PTES_NONE,
>>> };
>>>
>>> #define CREATE_TRACE_POINTS
>>> @@ -353,37 +354,60 @@ static bool pte_none_or_zero(pte_t pte)
>>> * PTEs for the given collapse operation.
>>> * @cc: The collapse control struct
>>> * @vma: The vma to check for userfaultfd
>>> + * @order: The folio order being collapsed to
>>> *
>>> * Return: Maximum number of none-page or zero-page PTEs allowed for the
>>> * collapse operation.
>>> */
>>> -static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
>>> - struct vm_area_struct *vma)
>>> +static int collapse_max_ptes_none(struct collapse_control *cc,
>>> + struct vm_area_struct *vma, unsigned int order)
>>> {
>>> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
>>> // If the vma is userfaultfd-armed, allow no none-page or zero-page PTEs.
>>
>> One thing I still want to call out: kernel code usually uses C-style
>> comments :)
>>
>>> if (vma && userfaultfd_armed(vma))
>>> return 0;
>>> // for MADV_COLLAPSE, allow any none-page or zero-page PTEs.
>>> if (!cc->is_khugepaged)
>>> return HPAGE_PMD_NR;
>>> - // For all other cases repect the user defined maximum.
>>> - return khugepaged_max_ptes_none;
>>> + // for PMD collapse, respect the user defined maximum.
>>> + if (is_pmd_order(order))
>>> + return max_ptes_none;
>>> + /* Zero/non-present collapse disabled. */
>>> + if (!max_ptes_none)
>>> + return 0;
>>> + // for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
>>> + // scale the maximum number of PTEs to the order of the collapse.
>>> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
>>> + return (1 << order) - 1;
>>> +
>>> + // We currently only support max_ptes_none values of 0 or KHUGEPAGED_MAX_PTES_LIMIT.
>>> + // Emit a warning and return -EINVAL.
>>> + pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
>>> + KHUGEPAGED_MAX_PTES_LIMIT);
>>
>> Maybe fallback to 0 instead, as David suggested earlier?
>>
>
> It looks reasonable to fallback to 0.
>
> But as the updated Document says in patch 14:
>
> For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other
> value will emit a warning and no mTHP collapse will be attempted.
>
> This is why it does like this now.
>
> mthp_collapse()
> max_ptes_none = collapse_max_ptes_none();
> if (max_ptes_none < 0)
> return collapsed;
>
>> max_ptes_none is mostly legacy PMD THP behavior. mTHP is new, and any
>> intermediate value in (0, KHUGEPAGED_MAX_PTES_LIMIT) would implicitly
>> disable it :(
>>
>
> So it depends on what we want to do here :-)
>
> For me, I would vote for fallback to 0.
At this point I'll prefer to not return errors from collapse_max_ptes_none().
It's just rather awkward to return an error deep down in collapse code for a
configuration problem.
For mthp collapse, we only support max_ptes_none==0 and
max_ptes_none=="HPAGE_PMD_NR - 1" (default).
If another value is specified while collapsing mTHP, print a warning and treat
it as 0 (save value, no creep, no memory waste).
In a sense, this is similar to how we handle max_ptes_shared + max_ptes_swap:
for mTHP: we always treat them as being 0 for mTHP collapse (and don't issue a
warning, because we would issue a warning with the default settings).
@Lorenzo, fine with you?
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] cpufreq: Documentation: fix sampling_down_factor documentation range
From: zhangpengjie (A) @ 2026-05-18 13:12 UTC (permalink / raw)
To: Zhongqiu Han, rafael, viresh.kumar, corbet
Cc: skhan, linux-pm, linux-doc, zhanjie9, zhenglifeng1, lihuisong,
yubowen8, linhongye, linuxarm, wangzhi12
In-Reply-To: <05980ed2-e591-468c-a528-5b2b74c192d8@oss.qualcomm.com>
On 5/17/2026 1:04 PM, Zhongqiu Han wrote:
> On 5/15/2026 5:49 PM, Pengjie Zhang wrote:
>> The ondemand governor implementation accepts sampling_down_factor values
>> from 1 to 100000 via MAX_SAMPLING_DOWN_FACTOR, but the documentation in
>> admin-guide/pm/cpufreq.rst still says the valid range is 1 to 100.
>>
>> Update the documentation to match the actual code.
>>
>> Fixes: 2a0e49279850 ("cpufreq: User/admin documentation update and
>> consolidation")
>
>
> Thanks Pengjie,
>
> Yes, commit 3f78a9f7fcee introduced MAX_SAMPLING_DOWN_FACTOR (100000),
> and commit 2a0e49279850 updated the documentation later, so the Fixes
> tag is correct.
>
> Small nit: "documentation range" feels a bit redundant; just "range"
> might be enough.
>
> Looks good to me.
>
> Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
>
Thanks for your review. I'll send out v2 shortly.
Best regards,
pengjie
>
>> Signed-off-by: Pengjie Zhang <zhangpengjie2@huawei.com>
>> ---
>> Documentation/admin-guide/pm/cpufreq.rst | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/pm/cpufreq.rst
>> b/Documentation/admin-guide/pm/cpufreq.rst
>> index dbe6d23a5d67..fdca59c955dc 100644
>> --- a/Documentation/admin-guide/pm/cpufreq.rst
>> +++ b/Documentation/admin-guide/pm/cpufreq.rst
>> @@ -516,7 +516,7 @@ This governor exposes the following tunables:
>> of those tasks above 0 and set this attribute to 1.
>> ``sampling_down_factor``
>> - Temporary multiplier, between 1 (default) and 100 inclusive, to
>> apply to
>> + Temporary multiplier, between 1 (default) and 100000 inclusive,
>> to apply to
>> the ``sampling_rate`` value if the CPU load goes above
>> ``up_threshold``.
>> This causes the next execution of the governor's worker
>> routine (after
>
>
^ permalink raw reply
* Re: [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: David Laight @ 2026-05-18 12:59 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Heiko Carstens, Kees Cook, Manuel Ebner, Andy Shevchenko,
Jonathan Corbet, Shuah Khan, Andy Whitcroft, Joe Perches,
Dwaipayan Ray, Lukas Bulwahn, Randy Dunlap, Jani Nikula,
open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
open list
In-Reply-To: <CAMuHMdXEezxGi1d=BCiQ57cbnG4D2PPXvt_FAHcyT5mgR7md3g@mail.gmail.com>
On Mon, 18 May 2026 09:11:04 +0200
Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> Hi David,
...
> > I don't really see why strlcat() should be deprecated.
> > Clearly there are many cases where there are better ways to do things.
>
> https://elixir.bootlin.com/linux/v7.0.8/source/include/linux/fortify-string.h#L346
> already says "Do not use this function. [...] Prefer building the
> * string with formatting, via scnprintf(), seq_buf, or similar.".
Trouble is that all requires a lot more rework.
I might try changing the type of the 'buffer' to sysfs_emit()
from 'char *' to 'sysfs_buf *'.
Initially the types will have to be the same, but propagating it through
will show where it can be used.
But last I looked I failed to even find the associated kmalloc().
Eventually it could be changed to a different type.
> > The only problem with strlcat() is that it returns the 'required length'.
> > So there are some broken uses.
> > - fs/nfs/flexfilelayout/flexfilelayout.c
> > - lib/kunit/string-stream.c (although the preceding vsnprintf() looks like the actual bug).
> > There is also some very strange code in security/selinus/ima.c - but it may be ok.
> >
> > In reality the return value of strlcat() isn't really much worse that that
> > of snprintf().
>
> So we need strscat()? ;-)
Indeed...
-- David
>
> Gr{oetje,eeting}s,
>
> Geert
>
^ permalink raw reply
* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Julian Orth @ 2026-05-18 12:58 UTC (permalink / raw)
To: Christian König
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
linux-media, linaro-mm-sig, linux-doc, wayland-devel,
Michel Dänzer
In-Reply-To: <69dcbcc1-da58-4d34-bfb0-5c8d33b75d59@amd.com>
On Mon, May 18, 2026 at 2:41 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/18/26 14:02, Julian Orth wrote:
> > On Mon, May 18, 2026 at 1:58 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/16/26 13:06, Julian Orth wrote:
> >>> This series adds a new device /dev/syncobj that can be used to create
> >>> and manipulate DRM syncobjs. Previously, these operations required the
> >>> use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
> >>> and DRIVER_SYNCOBJ_TIMELINE features.
> >>>
> >>> There are several issues with the existing API:
> >>>
> >>> - Syncobjs are the only explicit sync mechanism available on wayland.
> >>> Most compositors do not use GPU waits. Instead, they use the
> >>> DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
> >>> DRM devices means that compositors cannot consistently offer this
> >>> feature even though no device-specific logic is involved.
> >>
> >> Well the drm_syncobj is a container for device specific dma fences.
> >
> > Not necessarily. The DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL ioctl attaches
> > some kind of dummy fence that is already signaled. I don't believe
> > this is device specific. That is also the path that llvmpipe would
> > use.
>
> Yeah I feared that.
>
> This is the wait before signal path and if I'm not completely mistaken that one is not supported by a lot of compositors.
I believe this is supported by all compositors.
>
> The last time I looked for GPU support the compositor needs to spawn a separate thread for each client to support this approach.
>
> It could be that we have eventfd integration for that as well now, but in that case you could give the compositor an eventfd instead of a drm_syncobj fd in the first place.
Yes, all compositors use the DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to wait
async for the timeline point to materialize and/or be signaled. The
wayland protocol was the motivation for that ioctl.
>
> So as far as I can see using drm_syncobj for software rendering really doesn't make sense, eventfd is a much better fit for that use case.
Using eventfd has some disadvantages:
- We've just added syncobj support to vulkan:
https://github.com/KhronosGroup/Vulkan-Docs/issues/2473#issuecomment-4446117280.
For eventfd we would not only have to add yet another extension, that
would realistically only be exposed by llvmpipe, but also every
compositor and every client would have to support both extensions.
- Similarly, a new wayland protocol would need to be designed to
support sync over eventfd.
- Eventfd does not support timeline semantics. Meaning that you would
have to send two eventfds over the wire for each commit, one for the
acquire point and one for the release point. Whereas with syncobj you
only need to send two integers per commit.
I don't see the advantage when drm_syncobj already does everything we need.
You seem to believe that compositors would not be ready for this and
from that perspective I can understand your apprehension. But I can
assure you that compositors are already fully set up to support all of
the usecases I've described: The wayland protocol requires the
compositor to support wait before signal.
>
> Regards,
> Christian.
>
> >
> >>
> >> What could be possible instead is to pass an eventfd into Wayland, but that is something userspace needs to decide.
> >>
> >>> - llvmpipe currently cannot offer syncobj interop because it does not
> >>> have access to a DRM device. This means that applications using
> >>> llvmpipe cannot present images before they have finished rendering,
> >>> despite llvmpipe using threaded rendering.
> >>
> >> Yeah, but that is completely intentional. You *CAN'T* use a dma_fence as completion event for llvmpipe rendering. See the kernel documentation on that.
> >>
> >> What could be possible is to use the drm_syncobjs functionality to wait before signal, but that has different semantics.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
> >>> for devices that support the syncobj ioctls in order to use the
> >>> wayland syncobj protocol.
> >>> - Similarly, clients that want to use screen capture have no equivalent
> >>> to the WSI and are therefore forced into that path.
> >>> - Having to keep a DRM device open has potentially negative interactions
> >>> with GPU hotplug.
> >>> - Having to translate between syncobj FDs and handles is troublesome in
> >>> the compositor usecase since syncobjs come and go frequently and need
> >>> to be cleaned up when clients disconnect.
> >>>
> >>> /dev/syncobj solves these issues by providing all syncobj ioctls under a
> >>> consistent path that is not tied to any DRM device. It also operates
> >>> directly on file descriptors instead of syncobj handles.
> >>>
> >>> The series starts with a number of small refactorings in drm_syncobj.c
> >>> to make its functionality available outside of the file and without the
> >>> need for drm_file/handle pairs.
> >>>
> >>> The last commit adds the /dev/syncobj module. I've added it as a misc
> >>> device but maybe this should instead live somewhere under gpu/drm.
> >>>
> >>> An application using the new interface can be found at [1].
> >>>
> >>> [1]: https://github.com/mahkoh/jay/pull/947
> >>>
> >>> ---
> >>> Julian Orth (12):
> >>> drm/syncobj: add drm_syncobj_from_fd
> >>> drm/syncobj: add drm_syncobj_fence_lookup
> >>> drm/syncobj: make drm_syncobj_array_wait_timeout public
> >>> drm/syncobj: add drm_syncobj_register_eventfd
> >>> drm/syncobj: have transfer functions accept drm_syncobj directly
> >>> drm/syncobj: add drm_syncobj_transfer
> >>> drm/syncobj: add drm_syncobj_timeline_signal
> >>> drm/syncobj: add drm_syncobj_query
> >>> drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
> >>> drm/syncobj: add drm_syncobj_import_sync_file
> >>> drm/syncobj: add drm_syncobj_export_sync_file
> >>> misc/syncobj: add new device
> >>>
> >>> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
> >>> drivers/gpu/drm/drm_syncobj.c | 374 ++++++++++++++-----
> >>> drivers/misc/Kconfig | 10 +
> >>> drivers/misc/Makefile | 1 +
> >>> drivers/misc/syncobj.c | 404 +++++++++++++++++++++
> >>> include/drm/drm_syncobj.h | 21 ++
> >>> include/uapi/linux/syncobj.h | 75 ++++
> >>> 7 files changed, 795 insertions(+), 91 deletions(-)
> >>> ---
> >>> base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
> >>> change-id: 20260516-jorth-syncobj-d4d374c8c61b
> >>>
> >>> Best regards,
> >>> --
> >>> Julian Orth <ju.orth@gmail.com>
> >>>
> >>
>
^ permalink raw reply
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:50 UTC (permalink / raw)
To: Christian König
Cc: T.J. Mercier, Christian Brauner, Tejun Heo, Johannes Weiner,
Michal Koutný, Jonathan Corbet, Shuah Khan, Sumit Semwal,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
Paul Moore, James Morris, Serge E. Hallyn, Stephen Smalley,
Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <208fb820-d8eb-4832-a343-ef8b360e8120@amd.com>
On Mon, May 18, 2026 at 9:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/15/26 19:06, T.J. Mercier wrote:
> > On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
> >>
> >> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
> >>> On embedded platforms a central process often allocates dma-buf
> >>> memory on behalf of client applications. Without a way to
> >>> attribute the charge to the requesting client's cgroup, the
> >>> cost lands on the allocator, making per-cgroup memory limits
> >>> ineffective for the actual consumers.
> >>>
> >>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> >>
> >> Please be aware that pidfds come in two flavors:
> >>
> >> thread-group pidfds and thread-specific pidfds. Make sure that your API
> >> doesn't implicitly depend on this distinction not existing.
> >
> > Hi Christian,
> >
> > Memcg is not a controller that supports "thread mode" so all threads
> > in a group should belong to the same memcg.
>
> BTW: Exactly that is the requirement automotive has with their native context use case.
>
> The use case is that you have a deamon which has multiple threads were each one is acting on behalve of some other process.
>
> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>
> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
Hi Christian,
Thanks for sharing this atuomotive usecase. If I understand correctly,
the actual requirement is attributing dma-buf charges to the right
client, not putting each daemon thread in a different cgroup? If so,
the `charge_pid_fd` approach achieves this directly by passing the
client's `pid_fd`, without needing to add per-thread cgroup
infrastructure.
>
> Regards,
> Christian.
>
> >
> > Checking the flags from pidfd_get_pid would be the best way for an
> > explicit check of the pidfd type?
> >
> >>> a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> >>> memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> >>> inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> >>> the mem_accounting module parameter enabled, the buffer is charged
> >>> to the allocator's own cgroup.
> >>>
> >>> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> >>> system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> >>> page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> >>> twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> >>> all accounting through a single MEMCG_DMABUF path.
> >>>
> >>> Usage examples:
> >>>
> >>> 1. Central allocator charging to a client at allocation time.
> >>> The allocator knows the client's PID (e.g., from binder's
> >>> sender_pid) and uses pidfd to attribute the charge:
> >>>
> >>> pid_t client_pid = txn->sender_pid;
> >>> int pidfd = pidfd_open(client_pid, 0);
> >>>
> >>> struct dma_heap_allocation_data alloc = {
> >>> .len = buffer_size,
> >>> .fd_flags = O_RDWR | O_CLOEXEC,
> >>> .charge_pid_fd = pidfd,
> >>> };
> >>> ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>> close(pidfd);
> >>> /* alloc.fd is now charged to client's cgroup */
> >>>
> >>> 2. Default allocation (no pidfd, mem_accounting=1).
> >>> When charge_pid_fd is not set and the mem_accounting module
> >>> parameter is enabled, the buffer is charged to the allocator's
> >>> own cgroup:
> >>>
> >>> struct dma_heap_allocation_data alloc = {
> >>> .len = buffer_size,
> >>> .fd_flags = O_RDWR | O_CLOEXEC,
> >>> };
> >>> ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
> >>> /* charged to current process's cgroup */
> >>>
> >>> Current limitations:
> >>>
> >>> - Single-owner model: a dma-buf carries one memcg charge regardless of
> >>> how many processes share it. Means only the first owner (and exporter)
> >>> of the shared buffer bears the charge.
> >>> - Only memcg accounting supported. While this makes sense for system
> >>> heap buffers, other heaps (e.g., CMA heaps) will require selectively
> >>> charging also for the dmem controller.
> >>>
> >>> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> >>> ---
> >>> Documentation/admin-guide/cgroup-v2.rst | 5 ++--
> >>> drivers/dma-buf/dma-buf.c | 16 ++++---------
> >>> drivers/dma-buf/dma-heap.c | 42 ++++++++++++++++++++++++++++++---
> >>> drivers/dma-buf/heaps/system_heap.c | 2 --
> >>> include/uapi/linux/dma-heap.h | 6 +++++
> >>> 5 files changed, 53 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>> index 8bdbc2e866430..824d269531eb1 100644
> >>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>> @@ -1636,8 +1636,9 @@ The following nested keys are defined.
> >>> structures.
> >>>
> >>> dmabuf (npn)
> >>> - Amount of memory used for exported DMA buffers allocated by the cgroup.
> >>> - Stays with the allocating cgroup regardless of how the buffer is shared.
> >>> + Amount of memory used for exported DMA buffers allocated by or on
> >>> + behalf of the cgroup. Stays with the allocating cgroup regardless
> >>> + of how the buffer is shared.
> >>>
> >>> workingset_refault_anon
> >>> Number of refaults of previously evicted anonymous pages.
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index ce02377f48908..23fb758b78297 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
> >>> */
> >>> BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> >>>
> >>> - mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> - mem_cgroup_put(dmabuf->memcg);
> >>> + if (dmabuf->memcg) {
> >>> + mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
> >>> + PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
> >>> + mem_cgroup_put(dmabuf->memcg);
> >>> + }
> >>>
> >>> dmabuf->ops->release(dmabuf);
> >>>
> >>> @@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>> dmabuf->resv = resv;
> >>> }
> >>>
> >>> - dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
> >>> - if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
> >>> - GFP_KERNEL)) {
> >>> - ret = -ENOMEM;
> >>> - goto err_memcg;
> >>> - }
> >>> -
> >>> file->private_data = dmabuf;
> >>> file->f_path.dentry->d_fsdata = dmabuf;
> >>> dmabuf->file = file;
> >>> @@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
> >>>
> >>> return dmabuf;
> >>>
> >>> -err_memcg:
> >>> - mem_cgroup_put(dmabuf->memcg);
> >>> err_file:
> >>> fput(file);
> >>> err_module:
> >>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
> >>> index ac5f8685a6494..ff6e259afcdc0 100644
> >>> --- a/drivers/dma-buf/dma-heap.c
> >>> +++ b/drivers/dma-buf/dma-heap.c
> >>> @@ -7,13 +7,17 @@
> >>> */
> >>>
> >>> #include <linux/cdev.h>
> >>> +#include <linux/cgroup.h>
> >>> #include <linux/device.h>
> >>> #include <linux/dma-buf.h>
> >>> #include <linux/dma-heap.h>
> >>> +#include <linux/memcontrol.h>
> >>> +#include <linux/sched/mm.h>
> >>> #include <linux/err.h>
> >>> #include <linux/export.h>
> >>> #include <linux/list.h>
> >>> #include <linux/nospec.h>
> >>> +#include <linux/pidfd.h>
> >>> #include <linux/syscalls.h>
> >>> #include <linux/uaccess.h>
> >>> #include <linux/xarray.h>
> >>> @@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
> >>> "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
> >>>
> >>> static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>> - u32 fd_flags,
> >>> - u64 heap_flags)
> >>> + u32 fd_flags, u64 heap_flags,
> >>> + struct mem_cgroup *charge_to)
> >>> {
> >>> struct dma_buf *dmabuf;
> >>> + unsigned int nr_pages;
> >>> + struct mem_cgroup *memcg = charge_to;
> >>> int fd;
> >>>
> >>> /*
> >>> @@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
> >>> if (IS_ERR(dmabuf))
> >>> return PTR_ERR(dmabuf);
> >>>
> >>> + nr_pages = len / PAGE_SIZE;
> >>> +
> >>> + if (memcg)
> >>> + css_get(&memcg->css);
> >>> + else if (mem_accounting)
> >>> + memcg = get_mem_cgroup_from_mm(current->mm);
> >>> +
> >>> + if (memcg) {
> >>> + if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
> >>> + mem_cgroup_put(memcg);
> >>> + dma_buf_put(dmabuf);
> >>> + return -ENOMEM;
> >>> + }
> >>> + dmabuf->memcg = memcg;
> >>> + }
> >>> +
> >>> fd = dma_buf_fd(dmabuf, fd_flags);
> >>> if (fd < 0) {
> >>> dma_buf_put(dmabuf);
> >>> @@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>> {
> >>> struct dma_heap_allocation_data *heap_allocation = data;
> >>> struct dma_heap *heap = file->private_data;
> >>> + struct mem_cgroup *memcg = NULL;
> >>> + struct task_struct *task;
> >>> + unsigned int pidfd_flags;
> >>> int fd;
> >>>
> >>> if (heap_allocation->fd)
> >>> @@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
> >>> if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
> >>> return -EINVAL;
> >>>
> >>> + if (heap_allocation->charge_pid_fd) {
> >>> + task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
> >>
> >> Will always get a thread-group leader pidfd and will fail if this is a
> >> thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to
> >> open a thread-specific pidfd.
> >>
> >>> + if (IS_ERR(task))
> >>> + return PTR_ERR(task);
> >>> +
> >>> + memcg = get_mem_cgroup_from_mm(task->mm);
> >>> + put_task_struct(task);
> >>> + }
> >>> +
> >>> fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
> >>> heap_allocation->fd_flags,
> >>> - heap_allocation->heap_flags);
> >>> + heap_allocation->heap_flags,
> >>> + memcg);
> >>> + mem_cgroup_put(memcg);
> >>> if (fd < 0)
> >>> return fd;
> >>>
> >>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> >>> index 03c2b87cb1112..95d7688167b93 100644
> >>> --- a/drivers/dma-buf/heaps/system_heap.c
> >>> +++ b/drivers/dma-buf/heaps/system_heap.c
> >>> @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
> >>> if (max_order < orders[i])
> >>> continue;
> >>> flags = order_flags[i];
> >>> - if (mem_accounting)
> >>> - flags |= __GFP_ACCOUNT;
> >>> page = alloc_pages(flags, orders[i]);
> >>> if (!page)
> >>> continue;
> >>> diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
> >>> index a4cf716a49fa6..e02b0f8cbc6a1 100644
> >>> --- a/include/uapi/linux/dma-heap.h
> >>> +++ b/include/uapi/linux/dma-heap.h
> >>> @@ -29,6 +29,10 @@
> >>> * handle to the allocated dma-buf
> >>> * @fd_flags: file descriptor flags used when allocating
> >>> * @heap_flags: flags passed to heap
> >>> + * @charge_pid_fd: optional pidfd of the process whose cgroup should be
> >>> + * charged for this allocation; 0 means charge the calling
> >>> + * process's cgroup
> >>> + * @__padding: reserved, must be zero
> >>> *
> >>> * Provided by userspace as an argument to the ioctl
> >>> */
> >>> @@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
> >>> __u32 fd;
> >>> __u32 fd_flags;
> >>> __u64 heap_flags;
> >>> + __u32 charge_pid_fd;
> >>> + __u32 __padding;
> >>> };
> >>>
> >>> #define DMA_HEAP_IOC_MAGIC 'H'
> >>>
> >>> --
> >>> 2.53.0
> >>>
>
^ permalink raw reply
* Re: [PATCH] nios2: remove the architecture
From: Krzysztof Kozlowski @ 2026-05-18 12:50 UTC (permalink / raw)
To: Ethan Nelson-Moore
Cc: linux-doc, devicetree, workflows, linux-arch, dmaengine,
linux-i2c, linux-iio, netdev, linux-pci, linux-pwm,
linux-hardening, linux-kbuild, linux-csky, Jonathan Corbet,
Shuah Khan, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Daniel Lezcano, Thomas Gleixner, Alex Shi, Yanteng Si,
Dongliang Mu, Hu Haowen, Dinh Nguyen, Kees Cook, Oleg Nesterov,
Will Deacon, Aneesh Kumar K.V, Andrew Morton, Nick Piggin,
Peter Zijlstra, Vinod Koul, Frank Li, Dave Penkler, Andi Shyti,
Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Lorenzo Pieralisi, Krzysztof Wilczyński
In-Reply-To: <20260518042833.272221-1-enelsonmoore@gmail.com>
On Sun, May 17, 2026 at 09:28:33PM -0700, Ethan Nelson-Moore wrote:
> The Nios II architecture is a soft-core architecture developed by
> Altera (since acquired by Intel) and intended to run on their FPGAs.
>
> Licenses for the architecture have not been available for purchase
> since 2024 [1], and support for it has been removed from GCC 15 [2],
> Buildroot [3], and QEMU [4].
>
> Given all of these factors, it is time to remove Nios II support from
> the kernel. The maintainer stated in 2024 that they were planning to do
> so soon [5], but this did not come to pass.
>
> Remove Nios II support from the kernel and move the former maintainer
> to CREDITS. Thank you, Dinh Nguyen, for maintaining Nios II support!
>
> References:
> [1] https://docs.altera.com/v/u/docs/781327/is-discontinuing-ip-ordering-codes-listed-in-pdn2312-for-nios-ii-ip
> [2] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e876acab6cdd84bb2b32c98fc69fb0ba29c81153
> [3] https://github.com/buildroot/buildroot/commit/6775ccc5a199d574ad70b5f79ec58cce97a07c6f
> [4] https://github.com/qemu/qemu/commit/6c3014858c4c0024dd0560f08a6eda0f92f658d6
> [5] https://sourceware.org/pipermail/newlib/2024/021083.html
>
> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
> ---
Wearing DT hat:
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Best regards,
Krzysztof
^ permalink raw reply
* Re: [PATCH mm-unstable v17 00/14] khugepaged: mTHP support
From: Wei Yang @ 2026-05-18 12:50 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260511185817.686831-1-npache@redhat.com>
On Mon, May 11, 2026 at 12:58:00PM -0600, Nico Pache wrote:
>The following series provides khugepaged with the capability to collapse
>anonymous memory regions to mTHPs.
>
>To achieve this we generalize the khugepaged functions to no longer depend
>on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual
>pages that are occupied (!none/zero). After the PMD scan is done, we use
>the bitmap to find the optimal mTHP sizes for the PMD range. The
>restriction on max_ptes_none is removed during the scan, to make sure we
>account for the whole PMD range in the bitmap. When no mTHP size is
>enabled, the legacy behavior of khugepaged is maintained.
>
>We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1
>(ie 511). If any other value is specified, the kernel will emit a warning
>and no mTHP collapse will be attempted. If a mTHP collapse is attempted,
>but contains swapped out, or shared pages, we don't perform the collapse.
>It is now also possible to collapse to mTHPs without requiring the PMD THP
>size to be enabled. These limitations are to prevent collapse "creep"
>behavior. This prevents constantly promoting mTHPs to the next available
>size, which would occur because a collapse introduces more non-zero pages
>that would satisfy the promotion condition on subsequent scans.
>
>Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio
> for arbitrary orders.
>Patch 3: Rework max_ptes_* handling into helper functions
>Patch 4: Generalize __collapse_huge_page_* for mTHP support
>Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
>Patch 6: Generalize collapse_huge_page for mTHP collapse
>Patch 7: Skip collapsing mTHP to smaller orders
>Patch 8-9: Add per-order mTHP statistics and tracepoints
>Patch 10: Introduce collapse_allowable_orders helper function
>Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
>Patch 14: Documentation
>
>Testing:
>- Built for x86_64, aarch64, ppc64le, and s390x
>- ran all arches on test suites provided by the kernel-tests project
>- internal testing suites: functional testing and performance testing
>- selftests mm
>- I created a test script that I used to push khugepaged to its limits
> while monitoring a number of stats and tracepoints. The code is
> available here[1] (Run in legacy mode for these changes and set mthp
> sizes to inherit)
> The summary from my testings was that there was no significant
> regression noticed through this test. In some cases my changes had
> better collapse latencies, and was able to scan more pages in the same
> amount of time/work, but for the most part the results were consistent.
>- redis testing. I did some testing with these changes along with my defer
> changes (see followup [2] post for more details). We've decided to get
> the mTHP changes merged first before attempting the defer series.
>- some basic testing on 64k page size.
>- lots of general use.
>
Two links are missing. I got them from previous version.
[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/
And the test in [1] is a performance test. I am thinking whether we want a
functional test in selftests.
I did a quick try with following change and some hack.
@@ -744,6 +765,51 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
ksft_test_result_report(exit_status, "%s\n", __func__);
}
+static void collapse_mth_ptes(struct collapse_context *c, struct mem_ops *ops)
+{
+ struct thp_settings settings = *thp_current_settings();
+ void *p;
+ int i;
+
+ /* Disable mthp on fault */
+ for (i = 0; i < NR_ORDERS; i++) {
+ settings.hugepages[i].enabled = THP_NEVER;
+ }
+ thp_push_settings(&settings);
+
+ p = ops->setup_area(1);
+
+ ops->fault(p, 0, hpage_pmd_size);
+
+ /* Expect all order-0 folio after fault */
+ memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+ expected_orders[0] = hpage_pmd_nr;
+ if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+ kpageflags_fd, expected_orders,
+ (pmd_order + 1)))
+ ksft_exit_fail_msg("Unexpected huge page at fault\n");
+
+ /* Enable mthp before collapse */
+ thp_pop_settings();
+ settings.hugepages[2].enabled = THP_ALWAYS;
+ thp_push_settings(&settings);
+
+ c->collapse("Collapse fully populated PTE table with order 2", p, 1,
+ ops, true);
+
+ /* Expect all order-2 folio after collapse */
+ memset(expected_orders, 0, sizeof(int) * (pmd_order + 1));
+ expected_orders[2] = 1 << (pmd_order - 2);
+ if (check_folio_orders(p, hpage_pmd_size, pagemap_fd,
+ kpageflags_fd, expected_orders,
+ (pmd_order + 1)))
+ ksft_exit_fail_msg("Unexpected page order\n");
+
+ ops->cleanup_area(p, hpage_pmd_size);
+ thp_pop_settings();
+ ksft_test_result_report(exit_status, "%s\n", __func__);
+}
+
static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
{
void *p;
This leverage check_after_split_folio_orders() in split_huge_page_test.c to
check folio order in PMD range.
--
Wei Yang
Help you, Help me
^ permalink raw reply
* Re: [PATCH 00/12] misc/syncobj: add /dev/syncobj device
From: Christian König @ 2026-05-18 12:41 UTC (permalink / raw)
To: Julian Orth
Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
Simona Vetter, Sumit Semwal, Jonathan Corbet, Shuah Khan,
Arnd Bergmann, Greg Kroah-Hartman, dri-devel, linux-kernel,
linux-media, linaro-mm-sig, linux-doc, wayland-devel,
Michel Dänzer
In-Reply-To: <CAHijbEUzWZC4GAMU6YGV42gOYkrQaMZZPiwS4Erb4H1J-fh_8Q@mail.gmail.com>
On 5/18/26 14:02, Julian Orth wrote:
> On Mon, May 18, 2026 at 1:58 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/16/26 13:06, Julian Orth wrote:
>>> This series adds a new device /dev/syncobj that can be used to create
>>> and manipulate DRM syncobjs. Previously, these operations required the
>>> use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
>>> and DRIVER_SYNCOBJ_TIMELINE features.
>>>
>>> There are several issues with the existing API:
>>>
>>> - Syncobjs are the only explicit sync mechanism available on wayland.
>>> Most compositors do not use GPU waits. Instead, they use the
>>> DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
>>> DRM devices means that compositors cannot consistently offer this
>>> feature even though no device-specific logic is involved.
>>
>> Well the drm_syncobj is a container for device specific dma fences.
>
> Not necessarily. The DRM_IOCTL_SYNCOBJ_TIMELINE_SIGNAL ioctl attaches
> some kind of dummy fence that is already signaled. I don't believe
> this is device specific. That is also the path that llvmpipe would
> use.
Yeah I feared that.
This is the wait before signal path and if I'm not completely mistaken that one is not supported by a lot of compositors.
The last time I looked for GPU support the compositor needs to spawn a separate thread for each client to support this approach.
It could be that we have eventfd integration for that as well now, but in that case you could give the compositor an eventfd instead of a drm_syncobj fd in the first place.
So as far as I can see using drm_syncobj for software rendering really doesn't make sense, eventfd is a much better fit for that use case.
Regards,
Christian.
>
>>
>> What could be possible instead is to pass an eventfd into Wayland, but that is something userspace needs to decide.
>>
>>> - llvmpipe currently cannot offer syncobj interop because it does not
>>> have access to a DRM device. This means that applications using
>>> llvmpipe cannot present images before they have finished rendering,
>>> despite llvmpipe using threaded rendering.
>>
>> Yeah, but that is completely intentional. You *CAN'T* use a dma_fence as completion event for llvmpipe rendering. See the kernel documentation on that.
>>
>> What could be possible is to use the drm_syncobjs functionality to wait before signal, but that has different semantics.
>>
>> Regards,
>> Christian.
>>
>>> - Clients that do not use the Vulkan WSI need to manually probe /dev/dri
>>> for devices that support the syncobj ioctls in order to use the
>>> wayland syncobj protocol.
>>> - Similarly, clients that want to use screen capture have no equivalent
>>> to the WSI and are therefore forced into that path.
>>> - Having to keep a DRM device open has potentially negative interactions
>>> with GPU hotplug.
>>> - Having to translate between syncobj FDs and handles is troublesome in
>>> the compositor usecase since syncobjs come and go frequently and need
>>> to be cleaned up when clients disconnect.
>>>
>>> /dev/syncobj solves these issues by providing all syncobj ioctls under a
>>> consistent path that is not tied to any DRM device. It also operates
>>> directly on file descriptors instead of syncobj handles.
>>>
>>> The series starts with a number of small refactorings in drm_syncobj.c
>>> to make its functionality available outside of the file and without the
>>> need for drm_file/handle pairs.
>>>
>>> The last commit adds the /dev/syncobj module. I've added it as a misc
>>> device but maybe this should instead live somewhere under gpu/drm.
>>>
>>> An application using the new interface can be found at [1].
>>>
>>> [1]: https://github.com/mahkoh/jay/pull/947
>>>
>>> ---
>>> Julian Orth (12):
>>> drm/syncobj: add drm_syncobj_from_fd
>>> drm/syncobj: add drm_syncobj_fence_lookup
>>> drm/syncobj: make drm_syncobj_array_wait_timeout public
>>> drm/syncobj: add drm_syncobj_register_eventfd
>>> drm/syncobj: have transfer functions accept drm_syncobj directly
>>> drm/syncobj: add drm_syncobj_transfer
>>> drm/syncobj: add drm_syncobj_timeline_signal
>>> drm/syncobj: add drm_syncobj_query
>>> drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
>>> drm/syncobj: add drm_syncobj_import_sync_file
>>> drm/syncobj: add drm_syncobj_export_sync_file
>>> misc/syncobj: add new device
>>>
>>> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
>>> drivers/gpu/drm/drm_syncobj.c | 374 ++++++++++++++-----
>>> drivers/misc/Kconfig | 10 +
>>> drivers/misc/Makefile | 1 +
>>> drivers/misc/syncobj.c | 404 +++++++++++++++++++++
>>> include/drm/drm_syncobj.h | 21 ++
>>> include/uapi/linux/syncobj.h | 75 ++++
>>> 7 files changed, 795 insertions(+), 91 deletions(-)
>>> ---
>>> base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
>>> change-id: 20260516-jorth-syncobj-d4d374c8c61b
>>>
>>> Best regards,
>>> --
>>> Julian Orth <ju.orth@gmail.com>
>>>
>>
^ permalink raw reply
* Re: [PATCH RFC 2/5] dma-heap: charge dma-buf memory via explicit memcg
From: Albert Esteve @ 2026-05-18 12:16 UTC (permalink / raw)
To: Barry Song
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Shuah Khan, Sumit Semwal, Christian König, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Benjamin Gaignard, Brian Starkey, John Stultz, T.J. Mercier,
Christian Brauner, Paul Moore, James Morris, Serge E. Hallyn,
Stephen Smalley, Ondrej Mosnacek, Shuah Khan, cgroups, linux-doc,
linux-kernel, linux-media, dri-devel, linaro-mm-sig, linux-mm,
linux-security-module, selinux, linux-kselftest, mripard,
echanude
In-Reply-To: <CAGsJ_4xfznffbjOaNKwnN6oZk_H6pqOzYqd1zx4Q9XrocdzV8A@mail.gmail.com>
On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>
> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> >
> > On embedded platforms a central process often allocates dma-buf
> > memory on behalf of client applications. Without a way to
> > attribute the charge to the requesting client's cgroup, the
> > cost lands on the allocator, making per-cgroup memory limits
> > ineffective for the actual consumers.
> >
> > Add charge_pid_fd to struct dma_heap_allocation_data. When set to
> > a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's
> > memcg and charges the buffer there via mem_cgroup_charge_dmabuf()
> > inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with
> > the mem_accounting module parameter enabled, the buffer is charged
> > to the allocator's own cgroup.
> >
> > Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for
> > system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap
> > page allocations. Keeping __GFP_ACCOUNT would charge the same pages
> > twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route
> > all accounting through a single MEMCG_DMABUF path.
> >
> [...]
>
> > - if (mem_accounting)
> > - flags |= __GFP_ACCOUNT;
>
> Hi Albert,
>
> would it be better to move this and its description to patch 1? It
> looks like patch 1 already introduces the double accounting changes,
> and patch 2 is mainly just supporting remote charging.
Hi Barry,
Thanks for looking into this series! Yes, in my head I was trying to
keep patch 1, which was taken from a previous, different series, and
then diverge from it starting with patch 2. This would clarify the
difference between the two. But I can see it just added some confusion
(for example, patch 1 charges on dma_buf_export() and then it is moved
to dma_heap_buffer_alloc() in patch 2). I will reorganize it better
for the next version, including your suggestion.
>
> Also, mem_accounting is only used by system_heap.c; has this patchset
> also eliminated its need?
No, mem_accounting is still handled in this patch for the general case
where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
+ if (memcg)
+ css_get(&memcg->css);
+ else if (mem_accounting)
+ memcg = get_mem_cgroup_from_mm(current->mm);
>
> Thanks
> Barry
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox