* [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver
@ 2025-09-04  6:35 Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{} Penny Zheng
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Penny Zheng, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	Juergen Gross, Oleksii Kurochko, Community Manager

amd-cppc is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism on modern AMD APU and CPU series in
Xen. The new mechanism is based on Collaborative Processor Performance
Control (CPPC), which provides finer-grained frequency management than
legacy ACPI hardware P-states. Current AMD CPU/APU platforms use
the ACPI P-states driver to manage CPU frequency and clocks, switching
among only 3 P-states. CPPC replaces the ACPI P-states controls
and provides a flexible, low-latency interface for Xen to communicate
performance hints directly to hardware.

The amd-cppc driver has 2 operation modes: autonomous (active) mode
and non-autonomous (passive) mode. We register a different CPUFreq driver
for each mode: "amd-cppc" for passive mode and "amd-cppc-epp"
for active mode.

The passive mode leverages common governors such as *ondemand*,
*performance*, etc., to manage the performance tuning. The active mode
instead uses epp (energy performance preference) to give the hardware a
hint on whether software wants to bias toward performance (0x0) or energy
efficiency (0xff). The CPPC power algorithm in hardware then automatically
evaluates the runtime workload and adjusts the real-time CPU core frequency
according to the power supply, thermal conditions, core voltage and some
other hardware conditions.

amd-cppc is enabled in passive mode with the top-level `cpufreq=amd-cppc`
option; users add the extra `active` flag to select active mode.

With `cpufreq=amd-cppc,active`, we ran a 60s sampling test to observe the CPU
frequency change when switching the energy_perf preference from
`xenpm set-cpufreq-cppc powersave` to `xenpm set-cpufreq-cppc performance`.
The outputs are as follows:
```
Setting CPU in powersave mode
Sampling and Outputs:
  Avg freq      580000 KHz
  Avg freq      580000 KHz
  Avg freq      580000 KHz
Setting CPU in performance mode
Sampling and Outputs:
  Avg freq      4640000 KHz
  Avg freq      4220000 KHz
  Avg freq      4640000 KHz
```

Penny Zheng (8):
  xen/cpufreq: embed hwp into struct cpufreq_policy{}
  xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
  xen/cpufreq: get performance policy from governor set via xenpm
  tools/cpufreq: extract CPPC para from cpufreq para
  xen/cpufreq: bypass governor-related para for amd-cppc-epp
  xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc
    driver
  CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support

 CHANGELOG.md                         |   1 +
 docs/misc/xen-command-line.pandoc    |   9 +-
 tools/include/xenctrl.h              |   3 +-
 tools/libs/ctrl/xc_pm.c              |  25 +-
 tools/misc/xenpm.c                   |  94 ++--
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 703 ++++++++++++++++++++++++++-
 xen/arch/x86/acpi/cpufreq/hwp.c      |  32 +-
 xen/arch/x86/cpu/amd.c               |   8 +-
 xen/arch/x86/include/asm/amd.h       |   2 +
 xen/arch/x86/include/asm/msr-index.h |   6 +
 xen/drivers/acpi/pm-op.c             |  58 ++-
 xen/drivers/cpufreq/utility.c        |  15 +
 xen/include/acpi/cpufreq/cpufreq.h   |  44 ++
 xen/include/public/sysctl.h          |   5 +-
 14 files changed, 936 insertions(+), 69 deletions(-)

-- 
2.34.1




* [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04 11:50   ` Jan Beulich
  2025-09-04  6:35 ` [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode Penny Zheng
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Penny Zheng, Jan Beulich, Andrew Cooper, Roger Pau Monné

For CPUs sharing one cpufreq domain, cpufreq_driver.init() is
only invoked on the first CPU, so the current per-CPU hwp driver data
struct hwp_drv_data{} actually fails to be allocated for CPUs other than the
first one. There is no need to make it per-CPU.
We embed struct hwp_drv_data{} into struct cpufreq_policy{}, so that CPUs can
share the hwp driver data allocated for the first CPU, the same way they share
struct cpufreq_policy{}. We also make it a union, with "hwp", and later
"amd-cppc", as a sub-struct.
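
As a rough illustration of the sharing described above, here is a minimal
standalone sketch. All names in it (struct drv_data, struct policy,
cpu_policy[], domain_init(), cpu_drv_data()) are hypothetical stand-ins and
not Xen's actual definitions; it only assumes that every CPU of a domain
points at the same policy object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the Xen structures; only the shape matters. */
struct drv_data {
    int ret;
};

struct policy {
    unsigned int cpu;          /* first CPU of the cpufreq domain */
    union {
        struct drv_data *hwp;  /* driver data, reached through the policy */
    } u;
};

#define NR_CPUS 4

/* Per-CPU policy pointers: CPUs of one domain point at the same policy. */
static struct policy *cpu_policy[NR_CPUS];

/* Runs once per domain (on its first CPU) and allocates the driver data. */
static int domain_init(struct policy *p, const unsigned int *cpus, size_t n)
{
    p->u.hwp = calloc(1, sizeof(*p->u.hwp));
    if ( !p->u.hwp )
        return -1;

    for ( size_t i = 0; i < n; i++ )
    {
        assert(cpus[i] < NR_CPUS);
        cpu_policy[cpus[i]] = p;
    }

    return 0;
}

/* Any CPU of the domain finds the shared driver data via its policy. */
static struct drv_data *cpu_drv_data(unsigned int cpu)
{
    const struct policy *p = cpu < NR_CPUS ? cpu_policy[cpu] : NULL;

    return p ? p->u.hwp : NULL;
}
```

In this sketch, a lookup for any CPU in the domain yields the same pointer,
while a CPU without a policy yields NULL.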

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
v8 -> v9:
- new commit
---
 xen/arch/x86/acpi/cpufreq/hwp.c    | 32 +++++++++++++-----------------
 xen/include/acpi/cpufreq/cpufreq.h |  6 ++++++
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/hwp.c b/xen/arch/x86/acpi/cpufreq/hwp.c
index 240491c96a..5c98f3eb3e 100644
--- a/xen/arch/x86/acpi/cpufreq/hwp.c
+++ b/xen/arch/x86/acpi/cpufreq/hwp.c
@@ -67,7 +67,6 @@ struct hwp_drv_data
     uint8_t desired;
     uint8_t energy_perf;
 };
-static DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
 
 #define hwp_err(cpu, fmt, args...) \
     printk(XENLOG_ERR "HWP: CPU%u error: " fmt, cpu, ## args)
@@ -224,7 +223,7 @@ static bool __init hwp_available(void)
 
 static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
 {
-    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+    struct hwp_drv_data *data = policy->u.hwp;
 
     if ( !feature_hwp_activity_window && data->activity_window )
     {
@@ -239,7 +238,7 @@ static int cf_check hwp_cpufreq_verify(struct cpufreq_policy *policy)
 static void cf_check hwp_write_request(void *info)
 {
     const struct cpufreq_policy *policy = info;
-    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    struct hwp_drv_data *data = policy->u.hwp;
     union hwp_request hwp_req = data->curr_req;
 
     data->ret = 0;
@@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
                                        unsigned int relation)
 {
     unsigned int cpu = policy->cpu;
-    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    struct hwp_drv_data *data = policy->u.hwp;
     /* Zero everything to ensure reserved bits are zero... */
     union hwp_request hwp_req = { .raw = 0 };
 
@@ -350,7 +349,7 @@ static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
 static void cf_check hwp_init_msrs(void *info)
 {
     struct cpufreq_policy *policy = info;
-    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
+    struct hwp_drv_data *data = policy->u.hwp;
     uint64_t val;
 
     /*
@@ -426,15 +425,14 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
 
     policy->governor = &cpufreq_gov_hwp;
 
-    per_cpu(hwp_drv_data, cpu) = data;
+    policy->u.hwp = data;
 
     on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
 
     if ( data->curr_req.raw == -1 )
     {
         hwp_err(cpu, "Could not initialize HWP properly\n");
-        per_cpu(hwp_drv_data, cpu) = NULL;
-        xfree(data);
+        XFREE(policy->u.hwp);
         return -ENODEV;
     }
 
@@ -462,10 +460,8 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
 
 static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 {
-    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
-
-    per_cpu(hwp_drv_data, policy->cpu) = NULL;
-    xfree(data);
+    if ( policy->u.hwp )
+        XFREE(policy->u.hwp);
 
     return 0;
 }
@@ -480,7 +476,7 @@ static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 static void cf_check hwp_set_misc_turbo(void *info)
 {
     const struct cpufreq_policy *policy = info;
-    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
+    struct hwp_drv_data *data = policy->u.hwp;
     uint64_t msr;
 
     data->ret = 0;
@@ -511,7 +507,7 @@ static int cf_check hwp_cpufreq_update(unsigned int cpu, struct cpufreq_policy *
 {
     on_selected_cpus(cpumask_of(cpu), hwp_set_misc_turbo, policy, 1);
 
-    return per_cpu(hwp_drv_data, cpu)->ret;
+    return policy->u.hwp->ret;
 }
 #endif /* CONFIG_PM_OP */
 
@@ -531,9 +527,10 @@ hwp_cpufreq_driver = {
 int get_hwp_para(unsigned int cpu,
                  struct xen_get_cppc_para *cppc_para)
 {
-    const struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    const struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_policy, cpu);
+    const struct hwp_drv_data *data;
 
-    if ( data == NULL )
+    if ( !policy || !(data = policy->u.hwp) )
         return -ENODATA;
 
     cppc_para->features         =
@@ -554,8 +551,7 @@ int get_hwp_para(unsigned int cpu,
 int set_hwp_para(struct cpufreq_policy *policy,
                  struct xen_set_cppc_para *set_cppc)
 {
-    unsigned int cpu = policy->cpu;
-    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
+    struct hwp_drv_data *data = policy->u.hwp;
     bool cleared_act_window = false;
 
     if ( data == NULL )
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 5d4881eea8..c0ecd690c5 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -62,6 +62,7 @@ struct perf_limits {
     uint32_t min_policy_pct;
 };
 
+struct hwp_drv_data;
 struct cpufreq_policy {
     cpumask_var_t       cpus;          /* affected CPUs */
     unsigned int        shared_type;   /* ANY or ALL affected CPUs
@@ -81,6 +82,11 @@ struct cpufreq_policy {
     int8_t              turbo;  /* tristate flag: 0 for unsupported
                                  * -1 for disable, 1 for enabled
                                  * See CPUFREQ_TURBO_* below for defines */
+    union {
+#ifdef CONFIG_INTEL
+        struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
+#endif
+    } u;
 };
 DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
 
-- 
2.34.1




* [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{} Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04 12:04   ` Jan Beulich
  2025-09-04 12:11   ` Jan Beulich
  2025-09-04  6:35 ` [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode Penny Zheng
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Penny Zheng, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini

amd-cppc is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism. The new mechanism is based on
Collaborative Processor Performance Control (CPPC), which provides
finer-grained frequency management than legacy ACPI hardware P-states.
Current AMD CPU platforms use the ACPI P-states driver to
manage CPU frequency and clocks, switching among only 3 P-states, while the
new amd-cppc provides a more flexible, low-latency interface for Xen
to communicate performance hints directly to hardware.

The "amd-cppc" driver is responsible for implementing CPPC in passive mode,
which still leverages Xen governors such as *ondemand*, *performance*, etc., to
calculate the performance hints. In the future, we will introduce an advanced
active mode to enable autonomous performance level selection.
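
To make the "performance hint" concrete: in passive mode the governor's
target frequency has to be translated into an abstract CPPC perf value. The
sketch below mirrors the linear mapping through the (lowest perf, lowest
freq) and (nominal perf, nominal freq) points that amd_cppc_khz_to_perf()
uses in this patch. It is a simplified standalone version: struct perf_map
and khz_to_perf() are hypothetical names, the numbers in the note after it
are made up, and the fallback on Processor Max Speed is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two anchor points of the mapping. */
struct perf_map {
    uint8_t lowest_perf, nominal_perf;     /* abstract CPPC perf values */
    unsigned int lowest_mhz, nominal_mhz;  /* matching frequencies, in MHz */
};

/*
 * Map a frequency in kHz onto the line through (lowest_perf, lowest_mhz)
 * and (nominal_perf, nominal_mhz). Returns 0 and stores the perf value,
 * or -1 when the result would not be a positive perf value.
 */
static int khz_to_perf(const struct perf_map *m, unsigned int khz,
                       uint8_t *perf)
{
    unsigned int mul, div;
    int offset, res;

    /* Both deltas must be positive for the slope to be meaningful. */
    assert(m->nominal_perf > m->lowest_perf);
    assert(m->nominal_mhz > m->lowest_mhz);

    mul = m->nominal_perf - m->lowest_perf;
    div = m->nominal_mhz - m->lowest_mhz;

    /* The MHz unit cancels out in the division, so no kHz conversion. */
    offset = m->nominal_perf - (int)((mul * m->nominal_mhz) / div);
    res = offset + (int)((mul * khz) / (div * 1000));

    if ( res <= 0 )
        return -1;

    /* Clamp to the 8-bit field of the request register. */
    *perf = res > UINT8_MAX ? UINT8_MAX : (uint8_t)res;

    return 0;
}
```

With the made-up anchor points (20, 400 MHz) and (120, 2400 MHz), a target of
1400000 kHz maps to perf 70, and anything above the 8-bit range is clamped
to 255.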

The epp field (energy performance preference) only has meaning when active
mode is enabled, and will be introduced in detail later, so in passive mode
we read the pre-defined BIOS value for it.
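
A small sketch of what preserving the BIOS value means in practice, assuming
the request layout used in this patch (max_perf in bits 0-7, min_perf in
bits 8-15, des_perf in bits 16-23, epp in bits 24-31). The helper names
cppc_req_pack(), cppc_req_epp() and cppc_req_update() are hypothetical and
only illustrate the bit handling; they are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

#define EPP_SHIFT 24
#define EPP_MASK  (UINT64_C(0xff) << EPP_SHIFT)

/* Build a raw request value from its four 8-bit fields. */
static uint64_t cppc_req_pack(uint8_t max_perf, uint8_t min_perf,
                              uint8_t des_perf, uint8_t epp)
{
    assert(min_perf <= max_perf);

    return (uint64_t)max_perf |
           ((uint64_t)min_perf << 8) |
           ((uint64_t)des_perf << 16) |
           ((uint64_t)epp << EPP_SHIFT);
}

/* Extract the epp field from a raw request value. */
static uint8_t cppc_req_epp(uint64_t raw)
{
    return (uint8_t)((raw & EPP_MASK) >> EPP_SHIFT);
}

/* Update the perf fields while carrying the previous epp over unchanged. */
static uint64_t cppc_req_update(uint64_t prev, uint8_t max_perf,
                                uint8_t min_perf, uint8_t des_perf)
{
    return cppc_req_pack(max_perf, min_perf, des_perf, cppc_req_epp(prev));
}
```

The point of cppc_req_update() is that a passive-mode write never clears
the epp byte: whatever value was read at init time is written back as is.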

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v1 -> v2:
- re-construct union caps and req to have anonymous struct instead
- avoid "else" when the earlier if() ends in an unconditional control flow statement
- Add check to avoid chopping off set bits from cast
- make pointers pointer-to-const wherever possible
- remove noisy log
- exclude families before 0x17 before CPPC-feature MSR op
- remove useless variable helpers
- use xvzalloc and XVFREE
- refactor error handling as ENABLE bit can only be cleared by reset
---
v2 -> v3:
- Move all MSR definitions to msr-index.h and follow the required style
- Refactor opening figure braces for struct/union
- Sort overlong lines throughout the series
- Make offset/res int covering underflow scenario
- Error out when amd_max_freq_mhz isn't set
- Introduce amd_get_freq(name) macro to decrease redundancy
- Supported CPU family checked ahead of smp-function
- Nominal freq shall be checked between the [min, max]
- Use APERF/MPERF to calculate current frequency
- Use amd_cppc_cpufreq_cpu_exit() to tidy error path
---
v3 -> v4:
- verbose print shall come with a CPU number
- deal with res <= 0 in amd_cppc_khz_to_perf()
- introduce a single helper amd_get_lowest_or_nominal_freq() to cover both
lowest and nominal scenario
- reduce abuse of wrmsr_safe()/rdmsr_safe() with wrmsrl()/rdmsrl()
- move cf_check from amd_cppc_write_request() to amd_cppc_write_request_msrs()
- add comment to explain why setting non_linear_lowest in passive mode
- add check to ensure perf values in
lowest <= non_linear_lowest <= nominal <= highest
- refactor comment for "data->err != 0" scenario
- use "data->err" instead of -ENODEV
- add U suffixes for all msr macro
---
v4 -> v5:
- all freq-values shall be unsigned int type
- remove shortcuts as it is rarely taken
- checking cpc.nominal_mhz and cpc.lowest_mhz are non-zero values is enough
- drop the explicit type cast
- null pointer check is not needed for internal functions
- change amd_get_lowest_or_nominal_freq() to amd_get_cpc_freq()
- clarifying function-wide that the calculated frequency result is to be in kHz
- use array notation
- with cpu_has_cppc check, no need to do cpu family check
---
v5 -> v6
- replace "AMD_CPPC" with "AMD-CPPC" in message
- add equation(mul,div) non-zero check
- replace -EINVAL with -EOPNOTSUPP
- refactor comment
---
v6 -> v7
- used > in place of !=, to not only serve a doc aspect, but also allow to
drop one part
- unify with UINT8_MAX
- return -ERANGE as we reject perf values of 0 as invalid
- replace uint32_t with unsigned int
- Move some epp introduction here, otherwise we will mis-handle this field here
by always clearing it
---
v7 -> v8:
- refine message text by removing 0
---
v8 -> v9
- embed struct amd_cppc_drv_data{} into struct cpufreq_policy{}
---
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 414 ++++++++++++++++++++++++++-
 xen/arch/x86/cpu/amd.c               |   8 +-
 xen/arch/x86/include/asm/amd.h       |   2 +
 xen/arch/x86/include/asm/msr-index.h |   6 +
 xen/include/acpi/cpufreq/cpufreq.h   |   4 +
 xen/include/public/sysctl.h          |   1 +
 6 files changed, 430 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 3377783f7e..5cf8b85c9f 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -14,7 +14,96 @@
 #include <xen/domain.h>
 #include <xen/init.h>
 #include <xen/param.h>
+#include <xen/percpu.h>
+#include <xen/xvmalloc.h>
 #include <acpi/cpufreq/cpufreq.h>
+#include <asm/amd.h>
+#include <asm/msr.h>
+
+#define amd_cppc_err(cpu, fmt, args...)                             \
+    printk(XENLOG_ERR "AMD-CPPC: CPU%u error: " fmt, cpu, ## args)
+#define amd_cppc_warn(cpu, fmt, args...)                            \
+    printk(XENLOG_WARNING "AMD-CPPC: CPU%u warning: " fmt, cpu, ## args)
+#define amd_cppc_verbose(cpu, fmt, args...)                         \
+({                                                                  \
+    if ( cpufreq_verbose )                                          \
+        printk(XENLOG_DEBUG "AMD-CPPC: CPU%u " fmt, cpu, ## args);  \
+})
+
+/*
+ * Field highest_perf, nominal_perf, lowest_nonlinear_perf, and lowest_perf
+ * contain the values read from CPPC capability MSR. They represent the limits
+ * of managed performance range as well as the dynamic capability, which may
+ * change during processor operation
+ * Field highest_perf represents highest performance, which is the absolute
+ * maximum performance an individual processor may reach, assuming ideal
+ * conditions. This performance level may not be sustainable for long
+ * durations and may only be achievable if other platform components
+ * are in a specific state; for example, it may require other processors be
+ * in an idle state. This would be equivalent to the highest frequencies
+ * supported by the processor.
+ * Field nominal_perf represents maximum sustained performance level of the
+ * processor, assuming ideal operating conditions. All cores/processors are
+ * expected to be able to sustain their nominal performance state
+ * simultaneously.
+ * Field lowest_nonlinear_perf represents Lowest Nonlinear Performance, which
+ * is the lowest performance level at which nonlinear power savings are
+ * achieved. Above this threshold, lower performance levels should be
+ * generally more energy efficient than higher performance levels. So in
+ * traditional terms, this represents the P-state range of performance levels.
+ * Field lowest_perf represents the absolute lowest performance level of the
+ * platform. Selecting it may cause an efficiency penalty but should reduce
+ * the instantaneous power consumption of the processor. So in traditional
+ * terms, this represents the T-state range of performance levels.
+ *
+ * Field max_perf, min_perf, des_perf store the values for CPPC request MSR.
+ * Software passes performance goals through these fields.
+ * Field max_perf conveys the maximum performance level at which the platform
+ * may run. And it may be set to any performance value in the range
+ * [lowest_perf, highest_perf], inclusive.
+ * Field min_perf conveys the minimum performance level at which the platform
+ * may run. And it may be set to any performance value in the range
+ * [lowest_perf, highest_perf], inclusive but must be less than or equal to
+ * max_perf.
+ * Field des_perf conveys performance level Xen governor is requesting. And it
+ * may be set to any performance value in the range [min_perf, max_perf],
+ * inclusive.
+ * Field epp represents energy performance preference, which only has meaning
+ * when active mode is enabled.
+ */
+struct amd_cppc_drv_data
+{
+    const struct xen_processor_cppc *cppc_data;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int lowest_perf:8;
+            unsigned int lowest_nonlinear_perf:8;
+            unsigned int nominal_perf:8;
+            unsigned int highest_perf:8;
+            unsigned int :32;
+        };
+    } caps;
+    union {
+        uint64_t raw;
+        struct {
+            unsigned int max_perf:8;
+            unsigned int min_perf:8;
+            unsigned int des_perf:8;
+            unsigned int epp:8;
+            unsigned int :32;
+        };
+    } req;
+
+    int err;
+};
+
+/*
+ * Core max frequency read from PstateDef as anchor point
+ * for freq-to-perf transition
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, pxfreq_mhz);
+static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
 
 static bool __init amd_cppc_handle_option(const char *s, const char *end)
 {
@@ -50,10 +139,333 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
     return 0;
 }
 
+/*
+ * If CPPC lowest_freq and nominal_freq registers are exposed then we can
+ * use them to convert perf to freq and vice versa. The conversion is
+ * extrapolated as a linear function passing through the 2 points:
+ *  - (Low perf, Low freq)
+ *  - (Nominal perf, Nominal freq)
+ * Parameter freq is always in kHz.
+ */
+static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
+                                unsigned int freq, uint8_t *perf)
+{
+    const struct xen_processor_cppc *cppc_data = data->cppc_data;
+    unsigned int mul, div;
+    int offset = 0, res;
+
+    if ( cppc_data->cpc.lowest_mhz &&
+         data->caps.nominal_perf > data->caps.lowest_perf &&
+         cppc_data->cpc.nominal_mhz > cppc_data->cpc.lowest_mhz )
+    {
+        mul = data->caps.nominal_perf - data->caps.lowest_perf;
+        div = cppc_data->cpc.nominal_mhz - cppc_data->cpc.lowest_mhz;
+
+        /*
+         * We don't need to convert to kHz for computing offset and can
+         * directly use nominal_mhz and lowest_mhz as the division
+         * will remove the frequency unit.
+         */
+        offset = data->caps.nominal_perf -
+                 (mul * cppc_data->cpc.nominal_mhz) / div;
+    }
+    else
+    {
+        /* Read Processor Max Speed(MHz) as anchor point */
+        mul = data->caps.highest_perf;
+        div = this_cpu(pxfreq_mhz);
+        if ( !div )
+            return -EOPNOTSUPP;
+    }
+
+    res = offset + (mul * freq) / (div * 1000);
+    if ( res > UINT8_MAX )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value exceeds maximum value 255: %d\n", res);
+        *perf = UINT8_MAX;
+        return 0;
+    }
+    if ( res <= 0 )
+    {
+        printk_once(XENLOG_WARNING
+                    "Perf value smaller than minimum value: %d\n", res);
+        return -ERANGE;
+    }
+    *perf = res;
+
+    return 0;
+}
+
+/*
+ * _CPC may define nominal frequency and lowest frequency, if not, use
+ * Processor Max Speed as anchor point to calculate.
+ * Output freq stores cpc frequency in kHz
+ */
+static int amd_get_cpc_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int cpc_mhz, uint8_t perf,
+                            unsigned int *freq)
+{
+    unsigned int mul, div, res;
+
+    if ( cpc_mhz )
+    {
+        /* Switch to kHz */
+        *freq = cpc_mhz * 1000;
+        return 0;
+    }
+
+    /* Read Processor Max Speed(MHz) as anchor point */
+    mul = this_cpu(pxfreq_mhz);
+    if ( !mul )
+        return -EOPNOTSUPP;
+    div = data->caps.highest_perf;
+    res = (mul * perf * 1000) / div;
+    if ( unlikely(!res) )
+        return -EOPNOTSUPP;
+    *freq = res; /* Store the calculated frequency (kHz) for the caller */
+    return 0;
+}
+
+/* Output max_freq stores calculated maximum frequency in kHz */
+static int amd_get_max_freq(const struct amd_cppc_drv_data *data,
+                            unsigned int *max_freq)
+{
+    unsigned int nom_freq = 0;
+    int res;
+
+    res = amd_get_cpc_freq(data, data->cppc_data->cpc.nominal_mhz,
+                           data->caps.nominal_perf, &nom_freq);
+    if ( res )
+        return res;
+
+    *max_freq = (data->caps.highest_perf * nom_freq) / data->caps.nominal_perf;
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_verify(struct cpufreq_policy *policy)
+{
+    cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
+                                 policy->cpuinfo.max_freq);
+
+    return 0;
+}
+
+static void cf_check amd_cppc_write_request_msrs(void *info)
+{
+    const struct amd_cppc_drv_data *data = info;
+
+    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
+}
+
+static void amd_cppc_write_request(unsigned int cpu,
+                                   struct amd_cppc_drv_data *data,
+                                   uint8_t min_perf, uint8_t des_perf,
+                                   uint8_t max_perf, uint8_t epp)
+{
+    uint64_t prev = data->req.raw;
+
+    data->req.min_perf = min_perf;
+    data->req.max_perf = max_perf;
+    data->req.des_perf = des_perf;
+    data->req.epp = epp;
+
+    if ( prev == data->req.raw )
+        return;
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
+}
+
+static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
+                                            unsigned int target_freq,
+                                            unsigned int relation)
+{
+    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
+    uint8_t des_perf;
+    int res;
+
+    if ( unlikely(!target_freq) )
+        return 0;
+
+    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
+    if ( res )
+        return res;
+
+    /*
+     * Having a performance level lower than the lowest nonlinear
+     * performance level, i.e. lowest_perf <= perf <= lowest_nonlinear_perf,
+     * may actually cause an efficiency penalty. So when deciding the min_perf
+     * value, we prefer lowest nonlinear performance over lowest performance.
+     */
+    amd_cppc_write_request(policy->cpu, data, data->caps.lowest_nonlinear_perf,
+                           des_perf, data->caps.highest_perf,
+                           /* Pre-defined BIOS value for passive mode */
+                           per_cpu(epp_init, policy->cpu));
+    return 0;
+}
+
+static void cf_check amd_cppc_init_msrs(void *info)
+{
+    struct cpufreq_policy *policy = info;
+    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
+    uint64_t val;
+    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
+
+    /* Package level MSR */
+    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
+    /*
+     * Only when Enable bit is on, the hardware will calculate the processor’s
+     * performance capabilities and initialize the performance level fields in
+     * the CPPC capability registers.
+     */
+    if ( !(val & AMD_CPPC_ENABLE) )
+    {
+        val |= AMD_CPPC_ENABLE;
+        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
+    }
+
+    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
+
+    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
+         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
+         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
+         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
+         data->caps.nominal_perf > data->caps.highest_perf )
+    {
+        amd_cppc_err(policy->cpu,
+                     "Out of range values: highest(%u), lowest(%u), nominal(%u), lowest_nonlinear(%u)\n",
+                     data->caps.highest_perf, data->caps.lowest_perf,
+                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
+        goto err;
+    }
+
+    amd_process_freq(&cpu_data[policy->cpu],
+                     NULL, NULL, &this_cpu(pxfreq_mhz));
+
+    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.lowest_mhz,
+                                 data->caps.lowest_perf, &min_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.nominal_mhz,
+                                 data->caps.nominal_perf, &nominal_freq);
+    if ( data->err )
+        return;
+
+    data->err = amd_get_max_freq(data, &max_freq);
+    if ( data->err )
+        return;
+
+    if ( min_freq > nominal_freq || nominal_freq > max_freq )
+    {
+        amd_cppc_err(policy->cpu,
+                     "min(%u), or max(%u), or nominal(%u) freq value is incorrect\n",
+                     min_freq, max_freq, nominal_freq);
+        goto err;
+    }
+
+    policy->min = min_freq;
+    policy->max = max_freq;
+
+    policy->cpuinfo.min_freq = min_freq;
+    policy->cpuinfo.max_freq = max_freq;
+    policy->cpuinfo.perf_freq = nominal_freq;
+    /*
+     * Set after policy->cpuinfo.perf_freq, as we are taking
+     * APERF/MPERF average frequency as current frequency.
+     */
+    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
+
+    /* Store pre-defined BIOS value for passive mode */
+    rdmsrl(MSR_AMD_CPPC_REQ, val);
+    this_cpu(epp_init) = MASK_EXTR(val, AMD_CPPC_EPP_MASK);
+
+    return;
+
+ err:
+    /*
+     * No fallback scheme is available here, see more explanation at call
+     * site in amd_cppc_cpufreq_cpu_init().
+     */
+    data->err = -EINVAL;
+}
+
+/*
+ * The AMD CPPC driver differs from legacy ACPI hardware P-states in that
+ * it has a finer-grained frequency range between the highest and lowest
+ * frequency. And boost frequency is actually the frequency which is mapped on
+ * highest performance ratio. The legacy P0 frequency is actually mapped on
+ * nominal performance ratio.
+ */
+static void amd_cppc_boost_init(struct cpufreq_policy *policy,
+                                const struct amd_cppc_drv_data *data)
+{
+    if ( data->caps.highest_perf <= data->caps.nominal_perf )
+        return;
+
+    policy->turbo = CPUFREQ_TURBO_ENABLED;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
+{
+    XVFREE(policy->u.amd_cppc);
+
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    unsigned int cpu = policy->cpu;
+    struct amd_cppc_drv_data *data;
+
+    data = xvzalloc(struct amd_cppc_drv_data);
+    if ( !data )
+        return -ENOMEM;
+    policy->u.amd_cppc = data;
+
+    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
+
+    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
+
+    /*
+     * The enable bit is sticky, as we need to enable it at the very
+     * beginning, before the CPPC capability values sanity check.
+     * If the error path is taken, not only does the amd-cppc cpufreq core fail
+     * to initialize, but we also cannot fall back to the legacy P-states
+     * driver, irrespective of the command line specifying a fallback option.
+     */
+    if ( data->err )
+    {
+        amd_cppc_err(cpu, "Could not initialize cpufreq core in CPPC mode\n");
+        amd_cppc_cpufreq_cpu_exit(policy);
+        return data->err;
+    }
+
+    policy->governor = cpufreq_opt_governor ? : CPUFREQ_DEFAULT_GOVERNOR;
+
+    amd_cppc_boost_init(policy, data);
+
+    amd_cppc_verbose(policy->cpu,
+                     "CPU initialized with amd-cppc passive mode\n");
+
+    return 0;
+}
+
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_cpufreq_driver =
+{
+    .name   = XEN_AMD_CPPC_DRIVER_NAME,
+    .verify = amd_cppc_cpufreq_verify,
+    .target = amd_cppc_cpufreq_target,
+    .init   = amd_cppc_cpufreq_cpu_init,
+    .exit   = amd_cppc_cpufreq_cpu_exit,
+};
+
 int __init amd_cppc_register_driver(void)
 {
     if ( !cpu_has_cppc )
         return -ENODEV;
 
-    return -EOPNOTSUPP;
+    return cpufreq_register_driver(&amd_cppc_cpufreq_driver);
 }
diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
index 567b992a9f..9767f63539 100644
--- a/xen/arch/x86/cpu/amd.c
+++ b/xen/arch/x86/cpu/amd.c
@@ -613,10 +613,10 @@ static unsigned int attr_const amd_parse_freq(unsigned int family,
 	return freq;
 }
 
-static void amd_process_freq(const struct cpuinfo_x86 *c,
-			     unsigned int *low_mhz,
-			     unsigned int *nom_mhz,
-			     unsigned int *hi_mhz)
+void amd_process_freq(const struct cpuinfo_x86 *c,
+		      unsigned int *low_mhz,
+		      unsigned int *nom_mhz,
+		      unsigned int *hi_mhz)
 {
 	unsigned int idx = 0, h;
 	uint64_t hi, lo, val;
diff --git a/xen/arch/x86/include/asm/amd.h b/xen/arch/x86/include/asm/amd.h
index 9c9599a622..72df42a6f6 100644
--- a/xen/arch/x86/include/asm/amd.h
+++ b/xen/arch/x86/include/asm/amd.h
@@ -173,5 +173,7 @@ extern bool amd_virt_spec_ctrl;
 bool amd_setup_legacy_ssbd(void);
 void amd_set_legacy_ssbd(bool enable);
 void amd_set_cpuid_user_dis(bool enable);
+void amd_process_freq(const struct cpuinfo_x86 *c, unsigned int *low_mhz,
+                      unsigned int *nom_mhz, unsigned int *hi_mhz);
 
 #endif /* __AMD_H__ */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index bb48d16f0c..df52587c85 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -252,6 +252,12 @@
 
 #define MSR_AMD_CSTATE_CFG                  0xc0010296U
 
+#define MSR_AMD_CPPC_CAP1                   0xc00102b0U
+#define MSR_AMD_CPPC_ENABLE                 0xc00102b1U
+#define  AMD_CPPC_ENABLE                    (_AC(1, ULL) << 0)
+#define MSR_AMD_CPPC_REQ                    0xc00102b3U
+#define  AMD_CPPC_EPP_MASK                  (_AC(0xff, ULL) << 24)
+
 /*
  * Legacy MSR constants in need of cleanup.  No new MSRs below this comment.
  */
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index c0ecd690c5..baffb5bbe6 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -63,6 +63,7 @@ struct perf_limits {
 };
 
 struct hwp_drv_data;
+struct amd_cppc_drv_data;
 struct cpufreq_policy {
     cpumask_var_t       cpus;          /* affected CPUs */
     unsigned int        shared_type;   /* ANY or ALL affected CPUs
@@ -85,6 +86,9 @@ struct cpufreq_policy {
     union {
 #ifdef CONFIG_INTEL
         struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
+#endif
+#ifdef CONFIG_AMD
+        struct amd_cppc_drv_data *amd_cppc; /* Driver data for AMD CPPC */
 #endif
     } u;
 };
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index aafa7fcf2b..aa29a5401c 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -453,6 +453,7 @@ struct xen_set_cppc_para {
     uint32_t activity_window;
 };
 
+#define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
-- 
2.34.1




* [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{} Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04 12:12   ` Jan Beulich
  2025-09-04  6:35 ` [PATCH v9 4/8] xen/cpufreq: get performance policy from governor set via xenpm Penny Zheng
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Penny Zheng, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

amd-cppc has 2 operation modes: autonomous (active) mode and
non-autonomous (passive) mode.
In active mode, we don't need a Xen governor to calculate and tune the CPU
frequency. Instead, the hardware's built-in CPPC power algorithm calculates the
runtime workload and adjusts core frequency automatically according to the
power supply, thermal, core voltage and some other hardware conditions.
In active mode, CPPC ignores requests made in the desired performance field,
and takes into account only the values set in the minimum performance, maximum
performance, and energy performance preference registers.

A new field, EPP (energy performance preference), is introduced in the CPPC
request register. The CCLK DPM controller uses it to drive the frequency at
which a core operates during short periods of activity, called the minimum
active frequency. It can contain a value from 0 to 0xff.
An EPP of zero sets the min active frequency to the maximum frequency, while
an EPP of 0xff sets the min active frequency to approximately the idle
frequency.
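
As a rough illustration of the EPP field described above, the following sketch
packs and extracts the EPP byte of a CPPC request value. Only the EPP mask
(0xff << 24) comes from this series; the bit positions of the other three
fields, the SKETCH_* macros and the sketch_* helpers are assumptions made for
the example, not the driver's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* From this series: EPP occupies bits 31:24 of the CPPC request value. */
#define AMD_CPPC_EPP_MASK    (UINT64_C(0xff) << 24)

/* Assumed layout of the remaining fields, for illustration only. */
#define SKETCH_MAX_PERF_MASK (UINT64_C(0xff) << 0)
#define SKETCH_MIN_PERF_MASK (UINT64_C(0xff) << 8)
#define SKETCH_DES_PERF_MASK (UINT64_C(0xff) << 16)

/* Simplified stand-ins for mask extract/insert helpers. */
#define SKETCH_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define SKETCH_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* Build an active-mode request value; des_perf has to stay zero. */
static uint64_t sketch_build_request(uint8_t min_perf, uint8_t max_perf,
                                     uint8_t des_perf, uint8_t epp)
{
    /* Same idea as the ASSERT() this patch adds for active mode. */
    assert(des_perf == 0);

    return SKETCH_INSR(max_perf, SKETCH_MAX_PERF_MASK) |
           SKETCH_INSR(min_perf, SKETCH_MIN_PERF_MASK) |
           SKETCH_INSR(des_perf, SKETCH_DES_PERF_MASK) |
           SKETCH_INSR(epp, AMD_CPPC_EPP_MASK);
}

/* Read the EPP byte back out of a request value. */
static uint8_t sketch_request_epp(uint64_t req)
{
    return (uint8_t)SKETCH_EXTR(req, AMD_CPPC_EPP_MASK);
}
```

With these assumed positions, a request built with epp = 0 biases fully toward
performance, and one built with epp = 0xff biases fully toward energy saving.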

We implement a new AMD CPU frequency driver `amd-cppc-epp` for active mode.
Users must explicitly select active mode with the `active` sub-option on the
Xen cmdline.
In driver `amd-cppc-epp`, ->setpolicy() is hooked rather than ->target(), as
it does not depend on a Xen governor to do performance tuning.

We also introduce a new field "policy" (CPUFREQ_POLICY_xxx) to represent the
performance policy. Right now, it supports three values:
CPUFREQ_POLICY_PERFORMANCE for maximum performance, CPUFREQ_POLICY_POWERSAVE
for the least power consumption, and CPUFREQ_POLICY_ONDEMAND for no preference.
These correspond to the "performance", "powersave" and "ondemand" Xen
governors, so users can re-use "governor" in the Xen cmdline to select which
performance policy they want to apply.
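
The effect of each policy on the request can be sketched as below. This is a
simplified, standalone model of what amd_cppc_prepare_policy() in this patch
does, not the driver code itself: the sketch_* types and helper are
hypothetical, and the balance EPP value of 0x80 is an assumption (the patch
only says "medium value").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Policy values as defined by this patch in cpufreq.h. */
#define CPUFREQ_POLICY_UNKNOWN      0
#define CPUFREQ_POLICY_POWERSAVE    1
#define CPUFREQ_POLICY_PERFORMANCE  2
#define CPUFREQ_POLICY_ONDEMAND     3

/* EPP endpoints per the commit message; the balance value is assumed. */
#define SKETCH_EPP_MAX_PERFORMANCE  0x00
#define SKETCH_EPP_BALANCE          0x80
#define SKETCH_EPP_MAX_POWERSAVE    0xff

struct sketch_caps {
    uint8_t highest_perf;
    uint8_t lowest_nonlinear_perf;
};

struct sketch_request {
    uint8_t min_perf, max_perf, epp;
};

static struct sketch_request sketch_prepare_policy(
    const struct sketch_caps *caps, unsigned int policy, uint8_t epp_init)
{
    struct sketch_request req;

    assert(caps != NULL);

    /* Default window: the whole P-states range. */
    req.max_perf = caps->highest_perf;
    req.min_perf = caps->lowest_nonlinear_perf;

    switch ( policy )
    {
    case CPUFREQ_POLICY_PERFORMANCE:
        req.epp = SKETCH_EPP_MAX_PERFORMANCE;
        req.min_perf = req.max_perf;   /* pin the window to the top */
        break;

    case CPUFREQ_POLICY_POWERSAVE:
        req.epp = SKETCH_EPP_MAX_POWERSAVE;
        req.max_perf = req.min_perf;   /* pin the window to the bottom */
        break;

    case CPUFREQ_POLICY_ONDEMAND:
        req.epp = SKETCH_EPP_BALANCE;  /* no preference either way */
        break;

    default:
        req.epp = epp_init;            /* keep the initially read EPP */
        break;
    }

    return req;
}
```

In this model only the performance and powersave policies narrow the
[min_perf, max_perf] window; ondemand and unknown policies leave it spanning
the full range and differ only in the EPP hint.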

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v1 -> v2:
- Remove redundant epp_mode
- Remove pointless initializer
- Define sole caller read_epp_init_once and epp_init value to read
pre-defined BIOS epp value only once
- Combine the commit "xen/cpufreq: introduce policy type when
cpufreq_driver->setpolicy exists"
---
v2 -> v3:
- Combined with commit "x86/cpufreq: add "cpufreq=amd-cppc,active" para"
- Refactor doc about "active mode"
- Change opt_cpufreq_active to opt_active_mode
- Let caller pass epp_init when unspecified to allow the function parameter
to be of uint8_t
- Make epp_init per-cpu value
---
v3 -> v4:
- doc refinement
- use MASK_EXTR() to get epp value
- fix indentation
- replace if-else() with switch()
- combine successive comments and do refinement
- no need to introduce amd_cppc_epp_update_limit() as a wrapper
- rename cpufreq_parse_policy() with cpufreq_policy_from_governor()
- no need to use case-insensitive comparison
---
v4 -> v5:
- refine doc to state what the default is for "active" sub-option and it's of
boolean nature
- excess blank after << for AMD_CPPC_EPP_MASK
- set max_perf with lowest_perf to get utmost powersave
- refine commit message to include description about relation between "policy"
and "governor"
---
v5 -> v6:
- expand comment for "epp" field
- let min_perf set with lowest_nonlinear_perf, not lowest_perf, to constrain
  performance tuning in P-states range
- refactor doc and comments
- blank lines between non-fall-through case blocks
- introduce and add entry for "CPUFREQ_POLICY_ONDEMAND"
---
v6 -> v7
- make opt_active_mode __initdata when NDEBUG=y
- add assertion check for must-zero des_perf in active mode
- use the local variable max_perf and min_perf
- read_epp_init() isn't worth a separate function
---
v7 -> v8:
- use "ASSERT(!opt_active_mode || !des_perf);" to remove #ifndef NDEBUG
- add a new helper amd_cppc_prepare_policy()
---
v8 -> v9:
- Adapt to changes of "Embed struct amd_cppc_drv_data{} into struct
cpufreq_policy{}"
---
 docs/misc/xen-command-line.pandoc    |   9 +-
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 134 ++++++++++++++++++++++++++-
 xen/drivers/cpufreq/utility.c        |  15 +++
 xen/include/acpi/cpufreq/cpufreq.h   |  18 ++++
 xen/include/public/sysctl.h          |   1 +
 5 files changed, 172 insertions(+), 5 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4adcd7e762..adfdf71b40 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -515,7 +515,7 @@ If set, force use of the performance counters for oprofile, rather than detectin
 available support.
 
 ### cpufreq
-> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[verbose]]`
+> `= none | {{ <boolean> | xen } { [:[powersave|performance|ondemand|userspace][,[<maxfreq>]][,[<minfreq>]]] } [,verbose]} | dom0-kernel | hwp[:[<hdc>][,verbose]] | amd-cppc[:[active][,verbose]]`
 
 > Default: `xen`
 
@@ -537,6 +537,13 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels.
 * `amd-cppc` selects ACPI Collaborative Performance and Power Control (CPPC)
   on supported AMD hardware to provide finer grained frequency control
   mechanism. The default is disabled.
+* `active` is a boolean to enable amd-cppc driver in active(autonomous) mode.
+  In this mode, users don't rely on Xen governor to do performance monitoring
+  and tuning. Hardware built-in CPPC power algorithm will calculate the runtime
+  workload and adjust cores frequency automatically according to the power
+  supply, thermal, core voltage and some other hardware conditions.
+  The default is disabled, and the option only applies when `amd-cppc` is
+  enabled.
 
 There is also support for `;`-separated fallback options:
 `cpufreq=hwp;xen,verbose`.  This first tries `hwp` and falls back to `xen` if
diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 5cf8b85c9f..80b829b84e 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -67,9 +67,14 @@
  * max_perf.
  * Field des_perf conveys performance level Xen governor is requesting. And it
  * may be set to any performance value in the range [min_perf, max_perf],
- * inclusive.
+ * inclusive. In active mode, des_perf must be zero.
  * Field epp represents energy performance preference, which only has meaning
- * when active mode is enabled.
+ * when active mode is enabled. The EPP is used in the CCLK DPM controller
+ * to drive the frequency that a core is going to operate during short periods
+ * of activity, called minimum active frequency. It can contain a range of
+ * values from 0 to 0xff. An EPP of zero sets the min active frequency to
+ * maximum frequency, while an EPP of 0xff sets the min active frequency to
+ * approximately idle frequency.
  */
 struct amd_cppc_drv_data
 {
@@ -104,6 +109,12 @@ struct amd_cppc_drv_data
  */
 static DEFINE_PER_CPU_READ_MOSTLY(unsigned int, pxfreq_mhz);
 static DEFINE_PER_CPU_READ_MOSTLY(uint8_t, epp_init);
+#ifndef NDEBUG
+static bool __ro_after_init opt_active_mode;
+#else
+static bool __initdata opt_active_mode;
+#endif
+
 
 static bool __init amd_cppc_handle_option(const char *s, const char *end)
 {
@@ -116,6 +127,13 @@ static bool __init amd_cppc_handle_option(const char *s, const char *end)
         return true;
     }
 
+    ret = parse_boolean("active", s, end);
+    if ( ret >= 0 )
+    {
+        opt_active_mode = ret;
+        return true;
+    }
+
     return false;
 }
 
@@ -268,6 +286,7 @@ static void amd_cppc_write_request(unsigned int cpu,
 
     data->req.min_perf = min_perf;
     data->req.max_perf = max_perf;
+    ASSERT(!opt_active_mode || !des_perf);
     data->req.des_perf = des_perf;
     data->req.epp = epp;
 
@@ -414,7 +433,7 @@ static int cf_check amd_cppc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
     return 0;
 }
 
-static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+static int amd_cppc_cpufreq_init_perf(struct cpufreq_policy *policy)
 {
     unsigned int cpu = policy->cpu;
     struct amd_cppc_drv_data *data;
@@ -446,12 +465,102 @@ static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
 
     amd_cppc_boost_init(policy, data);
 
+    return 0;
+}
+
+static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
+{
+    int ret;
+
+    ret = amd_cppc_cpufreq_init_perf(policy);
+    if ( ret )
+        return ret;
+
     amd_cppc_verbose(policy->cpu,
                      "CPU initialized with amd-cppc passive mode\n");
 
     return 0;
 }
 
+static int cf_check amd_cppc_epp_cpu_init(struct cpufreq_policy *policy)
+{
+    int ret;
+
+    ret = amd_cppc_cpufreq_init_perf(policy);
+    if ( ret )
+        return ret;
+
+    policy->policy = cpufreq_policy_from_governor(policy->governor);
+
+    amd_cppc_verbose(policy->cpu,
+                     "CPU initialized with amd-cppc active mode\n");
+
+    return 0;
+}
+
+static void amd_cppc_prepare_policy(struct cpufreq_policy *policy,
+                                    uint8_t *max_perf, uint8_t *min_perf,
+                                    uint8_t *epp)
+{
+    const struct amd_cppc_drv_data *data = policy->u.amd_cppc;
+
+    /*
+     * By default, set min_perf to lowest_nonlinear_perf and max_perf to
+     * highest_perf, to keep performance scaling within the P-states range.
+     */
+    *max_perf = data->caps.highest_perf;
+    *min_perf = data->caps.lowest_nonlinear_perf;
+
+    /*
+     * In policy CPUFREQ_POLICY_PERFORMANCE, increase min_perf to
+     * highest_perf to achieve utmost performance.
+     * In policy CPUFREQ_POLICY_POWERSAVE, decrease max_perf to
+     * lowest_nonlinear_perf to achieve utmost power saving.
+     * Set governor only to help print proper policy info to users.
+     */
+    switch ( policy->policy )
+    {
+    case CPUFREQ_POLICY_PERFORMANCE:
+        /* Force the epp value to be zero for performance policy */
+        *epp = CPPC_ENERGY_PERF_MAX_PERFORMANCE;
+        *min_perf = *max_perf;
+        policy->governor = &cpufreq_gov_performance;
+        break;
+
+    case CPUFREQ_POLICY_POWERSAVE:
+        /* Force the epp value to be 0xff for powersave policy */
+        *epp = CPPC_ENERGY_PERF_MAX_POWERSAVE;
+        *max_perf = *min_perf;
+        policy->governor = &cpufreq_gov_powersave;
+        break;
+
+    case CPUFREQ_POLICY_ONDEMAND:
+        /*
+         * Set epp with medium value to show no preference over performance
+         * or powersave
+         */
+        *epp = CPPC_ENERGY_PERF_BALANCE;
+        policy->governor = &cpufreq_gov_dbs;
+        break;
+
+    default:
+        *epp = per_cpu(epp_init, policy->cpu);
+        break;
+    }
+}
+
+static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
+{
+    uint8_t max_perf, min_perf, epp;
+
+    amd_cppc_prepare_policy(policy, &max_perf, &min_perf, &epp);
+
+    amd_cppc_write_request(policy->cpu, policy->u.amd_cppc, min_perf,
+                           0 /* no des_perf in active mode */,
+                           max_perf, epp);
+    return 0;
+}
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
@@ -462,10 +571,27 @@ amd_cppc_cpufreq_driver =
     .exit   = amd_cppc_cpufreq_cpu_exit,
 };
 
+static const struct cpufreq_driver __initconst_cf_clobber
+amd_cppc_epp_driver =
+{
+    .name       = XEN_AMD_CPPC_EPP_DRIVER_NAME,
+    .verify     = amd_cppc_cpufreq_verify,
+    .setpolicy  = amd_cppc_epp_set_policy,
+    .init       = amd_cppc_epp_cpu_init,
+    .exit       = amd_cppc_cpufreq_cpu_exit,
+};
+
 int __init amd_cppc_register_driver(void)
 {
+    int ret;
+
     if ( !cpu_has_cppc )
         return -ENODEV;
 
-    return cpufreq_register_driver(&amd_cppc_cpufreq_driver);
+    if ( opt_active_mode )
+        ret = cpufreq_register_driver(&amd_cppc_epp_driver);
+    else
+        ret = cpufreq_register_driver(&amd_cppc_cpufreq_driver);
+
+    return ret;
 }
diff --git a/xen/drivers/cpufreq/utility.c b/xen/drivers/cpufreq/utility.c
index 987c3b5929..e2cc9ff2af 100644
--- a/xen/drivers/cpufreq/utility.c
+++ b/xen/drivers/cpufreq/utility.c
@@ -250,6 +250,7 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
     data->min = policy->min;
     data->max = policy->max;
     data->limits = policy->limits;
+    data->policy = policy->policy;
     if (cpufreq_driver.setpolicy)
         return alternative_call(cpufreq_driver.setpolicy, data);
 
@@ -281,3 +282,17 @@ int __cpufreq_set_policy(struct cpufreq_policy *data,
 
     return __cpufreq_governor(data, CPUFREQ_GOV_LIMITS);
 }
+
+unsigned int cpufreq_policy_from_governor(const struct cpufreq_governor *gov)
+{
+    if ( !strncmp(gov->name, "performance", CPUFREQ_NAME_LEN) )
+        return CPUFREQ_POLICY_PERFORMANCE;
+
+    if ( !strncmp(gov->name, "powersave", CPUFREQ_NAME_LEN) )
+        return CPUFREQ_POLICY_POWERSAVE;
+
+    if ( !strncmp(gov->name, "ondemand", CPUFREQ_NAME_LEN) )
+        return CPUFREQ_POLICY_ONDEMAND;
+
+    return CPUFREQ_POLICY_UNKNOWN;
+}
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index baffb5bbe6..274b7ea06e 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -91,6 +91,7 @@ struct cpufreq_policy {
         struct amd_cppc_drv_data *amd_cppc; /* Driver data for AMD CPPC */
 #endif
     } u;
+    unsigned int        policy; /* CPUFREQ_POLICY_* */
 };
 DECLARE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_policy);
 
@@ -141,6 +142,23 @@ extern int cpufreq_register_governor(struct cpufreq_governor *governor);
 extern struct cpufreq_governor *__find_governor(const char *governor);
 #define CPUFREQ_DEFAULT_GOVERNOR &cpufreq_gov_dbs
 
+/*
+ * Performance Policy
+ * If cpufreq_driver->target() exists, the ->governor decides what frequency
+ * within the limits is used. If cpufreq_driver->setpolicy() exists, these
+ * following policies are available:
+ * CPUFREQ_POLICY_PERFORMANCE represents maximum performance
+ * CPUFREQ_POLICY_POWERSAVE represents least power consumption
+ * CPUFREQ_POLICY_ONDEMAND represents no preference over performance or
+ * powersave
+ */
+#define CPUFREQ_POLICY_UNKNOWN      0
+#define CPUFREQ_POLICY_POWERSAVE    1
+#define CPUFREQ_POLICY_PERFORMANCE  2
+#define CPUFREQ_POLICY_ONDEMAND     3
+
+unsigned int cpufreq_policy_from_governor(const struct cpufreq_governor *gov);
+
 /* pass a target to the cpufreq driver */
 extern int __cpufreq_driver_target(struct cpufreq_policy *policy,
                                    unsigned int target_freq,
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index aa29a5401c..eb3a23b038 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -454,6 +454,7 @@ struct xen_set_cppc_para {
 };
 
 #define XEN_AMD_CPPC_DRIVER_NAME "amd-cppc"
+#define XEN_AMD_CPPC_EPP_DRIVER_NAME "amd-cppc-epp"
 #define XEN_HWP_DRIVER_NAME "hwp"
 
 /*
-- 
2.34.1




* [PATCH v9 4/8] xen/cpufreq: get performance policy from governor set via xenpm
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (2 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para Penny Zheng
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Penny Zheng, Jan Beulich

Even though the Xen governor is not used in amd-cppc active mode, we can
deduce which performance policy (CPUFREQ_POLICY_xxx) the user wants to apply
from the governor they choose:
- If the user chooses the performance governor, they want maximum performance,
  so the policy shall be CPUFREQ_POLICY_PERFORMANCE.
- If the user chooses the powersave governor, they want the least power
  consumption, so the policy shall be CPUFREQ_POLICY_POWERSAVE.
Function cpufreq_policy_from_governor() is responsible for this translation,
and it shall also take effect when users set a new governor through xenpm.

Userspace is a forbidden choice: if users specify it, we not only print a
warning suggesting "xenpm set-cpufreq-cppc", but also error out.
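
A minimal standalone sketch of that behaviour follows. The sketch_* helpers
are hypothetical and work on plain governor names rather than on struct
cpufreq_governor; the policy values are the ones introduced earlier in this
series, and the "ondemand" mapping follows cpufreq_policy_from_governor().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CPUFREQ_POLICY_UNKNOWN      0
#define CPUFREQ_POLICY_POWERSAVE    1
#define CPUFREQ_POLICY_PERFORMANCE  2
#define CPUFREQ_POLICY_ONDEMAND     3

/* Translate a governor name into a performance policy. */
static unsigned int sketch_policy_from_name(const char *name)
{
    assert(name != NULL);

    if ( !strcmp(name, "performance") )
        return CPUFREQ_POLICY_PERFORMANCE;

    if ( !strcmp(name, "powersave") )
        return CPUFREQ_POLICY_POWERSAVE;

    if ( !strcmp(name, "ondemand") )
        return CPUFREQ_POLICY_ONDEMAND;

    /* "userspace" and anything else has no matching policy. */
    return CPUFREQ_POLICY_UNKNOWN;
}

/*
 * Model of the governor-set path: store the new policy on success and
 * return -EINVAL, leaving *policy untouched, when no policy matches.
 */
static int sketch_set_governor(const char *name, unsigned int *policy)
{
    unsigned int p = sketch_policy_from_name(name);

    assert(policy != NULL);

    if ( p == CPUFREQ_POLICY_UNKNOWN )
        return -EINVAL;

    *policy = p;

    return 0;
}
```

Rejecting the request before anything is updated means a failed attempt to
select "userspace" leaves the previously applied policy in place.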

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v4 -> v5:
- new commit
---
v5 -> v6:
- refactor warning message
---
v6 -> v7:
- move policy->policy set where it firstly gets introduced
- refactor commit message
---
v7 -> v8:
- policy transition is only limited in CPPC mode
---
 xen/drivers/acpi/pm-op.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/xen/drivers/acpi/pm-op.c b/xen/drivers/acpi/pm-op.c
index 2f516e62b1..a7eaf29c31 100644
--- a/xen/drivers/acpi/pm-op.c
+++ b/xen/drivers/acpi/pm-op.c
@@ -207,6 +207,17 @@ static int set_cpufreq_gov(struct xen_sysctl_pm_op *op)
     if ( new_policy.governor == NULL )
         return -EINVAL;
 
+    if ( processor_pminfo[op->cpuid]->init & XEN_CPPC_INIT )
+    {
+        new_policy.policy = cpufreq_policy_from_governor(new_policy.governor);
+        if ( new_policy.policy == CPUFREQ_POLICY_UNKNOWN )
+        {
+            printk("Failed to get performance policy from %s, Try \"xenpm set-cpufreq-cppc\"\n",
+                   new_policy.governor->name);
+            return -EINVAL;
+        }
+    }
+
     return __cpufreq_set_policy(old_policy, &new_policy);
 }
 
-- 
2.34.1




* [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (3 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 4/8] xen/cpufreq: get performance policy from governor set via xenpm Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04 12:26   ` Jan Beulich
  2025-09-04  6:35 ` [PATCH v9 6/8] xen/cpufreq: bypass governor-related para for amd-cppc-epp Penny Zheng
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Penny Zheng, Anthony PERARD, Juergen Gross, Andrew Cooper,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

We move the CPPC info out of "struct xen_get_cpufreq_para", where it is a
member of a union and shares space with the governor info.
That layout fails in amd-cppc passive mode, in which governor info and CPPC
info co-exist and both need to be printed together by the xenpm tool.
Keeping it in "struct xen_get_cpufreq_para" (e.g. just moving it out of the
union) would enlarge the structure so much that xen_sysctl.u would exceed
128 bytes.

So we introduce a new sub-command GET_CPUFREQ_CPPC dedicated to acquiring
CPPC-related para, and make get-cpufreq-para invoke GET_CPUFREQ_CPPC
if available.
New helpers print_cppc_para() and get_cpufreq_cppc() are introduced to split
the handling of CPPC-related parameters out of the cpufreq para handling.
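
The tool-side "if available" handling can be sketched as follows. This is a
standalone model only: the callback type and the stub getters are hypothetical
stand-ins for xc_get_cppc_para(), and a single "highest" value stands in for
the full CPPC parameter structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the library call that fetches CPPC para. */
typedef int (*sketch_get_cppc_fn)(unsigned int cpuid, unsigned int *highest);

/* Print CPPC info if present; treat "no CPPC here" as success. */
static int sketch_show_cppc(sketch_get_cppc_fn get, unsigned int cpuid)
{
    unsigned int highest = 0;
    int ret;

    assert(get != NULL);

    ret = get(cpuid, &highest);
    if ( !ret )
        printf("[CPU%u] highest [%u]\n", cpuid, highest);
    else if ( errno == ENODEV )
        ret = 0; /* unsupported platform: nothing to print, not an error */
    else
        fprintf(stderr, "[CPU%u] failed to get cppc parameter: %s\n",
                cpuid, strerror(errno));

    return ret;
}

/* Stub getters used to exercise the three outcomes. */
static int sketch_get_ok(unsigned int cpuid, unsigned int *highest)
{
    (void)cpuid;
    *highest = 255;
    return 0;
}

static int sketch_get_nodev(unsigned int cpuid, unsigned int *highest)
{
    (void)cpuid;
    (void)highest;
    errno = ENODEV;
    return -1;
}

static int sketch_get_fail(unsigned int cpuid, unsigned int *highest)
{
    (void)cpuid;
    (void)highest;
    errno = EINVAL;
    return -1;
}
```

Mapping ENODEV to success lets the caller keep a single code path on
platforms without CPPC support, while any other failure is still reported.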

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com> # hypervisor
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
---
v4 -> v5:
- new commit
---
v5 -> v6:
- remove the changes for get-cpufreq-para
---
v6 -> v7:
- make get-cpufreq-para invoke GET_CPUFREQ_CPPC
---
v7 -> v8:
- use structure assignment as it is an alias
- add errno info to the error print
---
 tools/include/xenctrl.h     |  3 +-
 tools/libs/ctrl/xc_pm.c     | 25 +++++++++++-
 tools/misc/xenpm.c          | 79 ++++++++++++++++++++++++-------------
 xen/drivers/acpi/pm-op.c    | 19 +++++++--
 xen/include/public/sysctl.h |  3 +-
 5 files changed, 96 insertions(+), 33 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 965d3b585a..e5103453a9 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1938,7 +1938,6 @@ struct xc_get_cpufreq_para {
                 xc_ondemand_t ondemand;
             } u;
         } s;
-        xc_cppc_para_t cppc_para;
     } u;
 
     int32_t turbo_enabled;
@@ -1953,6 +1952,8 @@ int xc_set_cpufreq_para(xc_interface *xch, int cpuid,
                         int ctrl_type, int ctrl_value);
 int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
                         xc_set_cppc_para_t *set_cppc);
+int xc_get_cppc_para(xc_interface *xch, unsigned int cpuid,
+                     xc_cppc_para_t *cppc_para);
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq);
 
 int xc_set_sched_opt_smt(xc_interface *xch, uint32_t value);
diff --git a/tools/libs/ctrl/xc_pm.c b/tools/libs/ctrl/xc_pm.c
index 6fda973f1f..56e213018a 100644
--- a/tools/libs/ctrl/xc_pm.c
+++ b/tools/libs/ctrl/xc_pm.c
@@ -288,7 +288,6 @@ int xc_get_cpufreq_para(xc_interface *xch, int cpuid,
         CHK_FIELD(s.scaling_min_freq);
         CHK_FIELD(s.u.userspace);
         CHK_FIELD(s.u.ondemand);
-        CHK_FIELD(cppc_para);
 
 #undef CHK_FIELD
 
@@ -366,6 +365,30 @@ int xc_set_cpufreq_cppc(xc_interface *xch, int cpuid,
     return ret;
 }
 
+int xc_get_cppc_para(xc_interface *xch, unsigned int cpuid,
+                     xc_cppc_para_t *cppc_para)
+{
+    int ret;
+    struct xen_sysctl sysctl = {};
+
+    if ( !xch  || !cppc_para )
+    {
+        errno = EINVAL;
+        return -1;
+    }
+
+    sysctl.cmd = XEN_SYSCTL_pm_op;
+    sysctl.u.pm_op.cmd = GET_CPUFREQ_CPPC;
+    sysctl.u.pm_op.cpuid = cpuid;
+
+    ret = xc_sysctl(xch, &sysctl);
+    if ( ret )
+        return ret;
+
+    *cppc_para = sysctl.u.pm_op.u.get_cppc;
+    return ret;
+}
+
 int xc_get_cpufreq_avgfreq(xc_interface *xch, int cpuid, int *avg_freq)
 {
     int ret = 0;
diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 6b054b10a4..e83dd0d80c 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -801,6 +801,34 @@ static unsigned int calculate_activity_window(const xc_cppc_para_t *cppc,
     return mantissa * multiplier;
 }
 
+/* print out parameters about cpu cppc */
+static void print_cppc_para(unsigned int cpuid,
+                            const xc_cppc_para_t *cppc)
+{
+    printf("cppc variables       :\n");
+    printf("  hardware limits    : lowest [%"PRIu32"] lowest nonlinear [%"PRIu32"]\n",
+           cppc->lowest, cppc->lowest_nonlinear);
+    printf("                     : nominal [%"PRIu32"] highest [%"PRIu32"]\n",
+           cppc->nominal, cppc->highest);
+    printf("  configured limits  : min [%"PRIu32"] max [%"PRIu32"] energy perf [%"PRIu32"]\n",
+           cppc->minimum, cppc->maximum, cppc->energy_perf);
+
+    if ( cppc->features & XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW )
+    {
+        unsigned int activity_window;
+        const char *units;
+
+        activity_window = calculate_activity_window(cppc, &units);
+        printf("                     : activity_window [%"PRIu32" %s]\n",
+               activity_window, units);
+    }
+
+    printf("                     : desired [%"PRIu32"%s]\n",
+           cppc->desired,
+           cppc->desired ? "" : " hw autonomous");
+    printf("\n");
+}
+
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
@@ -826,33 +854,7 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
-    if ( hwp )
-    {
-        const xc_cppc_para_t *cppc = &p_cpufreq->u.cppc_para;
-
-        printf("cppc variables       :\n");
-        printf("  hardware limits    : lowest [%"PRIu32"] lowest nonlinear [%"PRIu32"]\n",
-               cppc->lowest, cppc->lowest_nonlinear);
-        printf("                     : nominal [%"PRIu32"] highest [%"PRIu32"]\n",
-               cppc->nominal, cppc->highest);
-        printf("  configured limits  : min [%"PRIu32"] max [%"PRIu32"] energy perf [%"PRIu32"]\n",
-               cppc->minimum, cppc->maximum, cppc->energy_perf);
-
-        if ( cppc->features & XEN_SYSCTL_CPPC_FEAT_ACT_WINDOW )
-        {
-            unsigned int activity_window;
-            const char *units;
-
-            activity_window = calculate_activity_window(cppc, &units);
-            printf("                     : activity_window [%"PRIu32" %s]\n",
-                   activity_window, units);
-        }
-
-        printf("                     : desired [%"PRIu32"%s]\n",
-               cppc->desired,
-               cppc->desired ? "" : " hw autonomous");
-    }
-    else
+    if ( !hwp )
     {
         if ( p_cpufreq->gov_num )
             printf("scaling_avail_gov    : %s\n",
@@ -898,6 +900,24 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
     printf("\n");
 }
 
+/* show cpu cppc parameters information on CPU cpuid */
+static int show_cppc_para_by_cpuid(xc_interface *xc_handle, unsigned int cpuid)
+{
+    int ret;
+    xc_cppc_para_t cppc_para;
+
+    ret = xc_get_cppc_para(xc_handle, cpuid, &cppc_para);
+    if ( !ret )
+        print_cppc_para(cpuid, &cppc_para);
+    else if ( errno == ENODEV )
+        ret = 0; /* Ignore unsupported platform */
+    else
+        fprintf(stderr, "[CPU%u] failed to get cppc parameter: %s\n",
+                cpuid, strerror(errno));
+
+    return ret;
+}
+
 /* show cpu frequency parameters information on CPU cpuid */
 static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
 {
@@ -957,7 +977,12 @@ static int show_cpufreq_para_by_cpuid(xc_interface *xc_handle, int cpuid)
     } while ( ret && errno == EAGAIN );
 
     if ( ret == 0 )
+    {
         print_cpufreq_para(cpuid, p_cpufreq);
+
+        /* Show CPPC parameters if available */
+        ret = show_cppc_para_by_cpuid(xc_handle, cpuid);
+    }
     else if ( errno == ENODEV )
     {
         ret = -ENODEV;
diff --git a/xen/drivers/acpi/pm-op.c b/xen/drivers/acpi/pm-op.c
index a7eaf29c31..19aedf6b0b 100644
--- a/xen/drivers/acpi/pm-op.c
+++ b/xen/drivers/acpi/pm-op.c
@@ -77,6 +77,17 @@ static int read_scaling_available_governors(char *scaling_available_governors,
     return 0;
 }
 
+static int get_cpufreq_cppc(unsigned int cpu,
+                            struct xen_get_cppc_para *cppc_para)
+{
+    int ret = -ENODEV;
+
+    if ( hwp_active() )
+        ret = get_hwp_para(cpu, cppc_para);
+
+    return ret;
+}
+
 static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
 {
     uint32_t ret = 0;
@@ -143,9 +154,7 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
-    if ( hwp_active() )
-        ret = get_hwp_para(policy->cpu, &op->u.get_para.u.cppc_para);
-    else
+    if ( !hwp_active() )
     {
         if ( !(scaling_available_governors =
                xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
@@ -385,6 +394,10 @@ int do_pm_op(struct xen_sysctl_pm_op *op)
         ret = set_cpufreq_para(op);
         break;
 
+    case GET_CPUFREQ_CPPC:
+        ret = get_cpufreq_cppc(op->cpuid, &op->u.get_cppc);
+        break;
+
     case SET_CPUFREQ_CPPC:
         ret = set_cpufreq_cppc(op);
         break;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index eb3a23b038..3f654f98ab 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -492,7 +492,6 @@ struct xen_get_cpufreq_para {
                 struct  xen_ondemand ondemand;
             } u;
         } s;
-        struct xen_get_cppc_para cppc_para;
     } u;
 
     int32_t turbo_enabled;
@@ -523,6 +522,7 @@ struct xen_sysctl_pm_op {
     #define SET_CPUFREQ_PARA           (CPUFREQ_PARA | 0x03)
     #define GET_CPUFREQ_AVGFREQ        (CPUFREQ_PARA | 0x04)
     #define SET_CPUFREQ_CPPC           (CPUFREQ_PARA | 0x05)
+    #define GET_CPUFREQ_CPPC           (CPUFREQ_PARA | 0x06)
 
     /* set/reset scheduler power saving option */
     #define XEN_SYSCTL_pm_op_set_sched_opt_smt    0x21
@@ -547,6 +547,7 @@ struct xen_sysctl_pm_op {
     uint32_t cpuid;
     union {
         struct xen_get_cpufreq_para get_para;
+        struct xen_get_cppc_para    get_cppc;
         struct xen_set_cpufreq_gov  set_gov;
         struct xen_set_cpufreq_para set_para;
         struct xen_set_cppc_para    set_cppc;
-- 
2.34.1




* [PATCH v9 6/8] xen/cpufreq: bypass governor-related para for amd-cppc-epp
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (4 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04  6:35 ` [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Penny Zheng, Anthony PERARD, Jan Beulich

HWP and amd-cppc-epp are both governor-less drivers, so we introduce the
"is_governor_less" flag and cpufreq_is_governorless() to help bypass
governor-related info when dealing with cpufreq para.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
---
v3 -> v4:
- Include validation check fix here
---
v4 -> v5:
- validation check has been moved to where XEN_PROCESSOR_PM_CPPC and
XEN_CPPC_INIT were first introduced
- adding "cpufreq_driver.setpolicy == NULL" check to exclude governor-related
para for amd-cppc-epp driver in get/set_cpufreq_para()
---
v5 -> v6:
- add helper cpufreq_is_governorless() to tell whether cpufreq driver is
governor-less
---
v6 -> v7:
- change "hw_auto" to "is_goverless"
- complement comment
- wrap around with PM_OP to avoid violating Misra rule 2.1
---
v7 -> v8:
- change "is_goverless" to "is_governor_less"
- make cpufreq_is_governorless() inline function
---
 tools/misc/xenpm.c                 | 10 +++++++---
 xen/drivers/acpi/pm-op.c           |  4 ++--
 xen/include/acpi/cpufreq/cpufreq.h | 12 ++++++++++++
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index e83dd0d80c..893a0afe11 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -832,9 +832,13 @@ static void print_cppc_para(unsigned int cpuid,
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
-    bool hwp = strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) == 0;
+    bool is_governor_less = false;
     int i;
 
+    if ( !strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) ||
+         !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
+        is_governor_less = true;
+
     printf("cpu id               : %d\n", cpuid);
 
     printf("affected_cpus        :");
@@ -842,7 +846,7 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
         printf(" %d", p_cpufreq->affected_cpus[i]);
     printf("\n");
 
-    if ( hwp )
+    if ( is_governor_less )
         printf("cpuinfo frequency    : base [%"PRIu32"] max [%"PRIu32"]\n",
                p_cpufreq->cpuinfo_min_freq,
                p_cpufreq->cpuinfo_max_freq);
@@ -854,7 +858,7 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 
     printf("scaling_driver       : %s\n", p_cpufreq->scaling_driver);
 
-    if ( !hwp )
+    if ( !is_governor_less )
     {
         if ( p_cpufreq->gov_num )
             printf("scaling_avail_gov    : %s\n",
diff --git a/xen/drivers/acpi/pm-op.c b/xen/drivers/acpi/pm-op.c
index 19aedf6b0b..371deaf678 100644
--- a/xen/drivers/acpi/pm-op.c
+++ b/xen/drivers/acpi/pm-op.c
@@ -154,7 +154,7 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
-    if ( !hwp_active() )
+    if ( !cpufreq_is_governorless(op->cpuid) )
     {
         if ( !(scaling_available_governors =
                xzalloc_array(char, gov_num * CPUFREQ_NAME_LEN)) )
@@ -240,7 +240,7 @@ static int set_cpufreq_para(struct xen_sysctl_pm_op *op)
     if ( !policy || !policy->governor )
         return -EINVAL;
 
-    if ( hwp_active() )
+    if ( cpufreq_is_governorless(op->cpuid) )
         return -EOPNOTSUPP;
 
     switch( op->u.set_para.ctrl_type )
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 274b7ea06e..85fbf772a0 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -304,4 +304,16 @@ int acpi_cpufreq_register(void);
 int amd_cppc_cmdline_parse(const char *s, const char *e);
 int amd_cppc_register_driver(void);
 
+/*
+ * Governor-less cpufreq driver indicates the driver doesn't rely on Xen
+ * governor to do performance tuning, mostly it has hardware built-in
+ * algorithm to calculate runtime workload and adjust cores frequency
+ * automatically, like Intel HWP, or CPPC in AMD.
+ */
+static inline bool cpufreq_is_governorless(unsigned int cpuid)
+{
+    return processor_pminfo[cpuid]->init && (hwp_active() ||
+                                             cpufreq_driver.setpolicy);
+}
+
 #endif /* __XEN_CPUFREQ_PM_H__ */
-- 
2.34.1
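
For readers following the tool side of this patch, the driver-name test that
print_cpufreq_para() performs can be sketched in isolation. This is a minimal
sketch, not the xenpm code: the helper name driver_is_governor_less() is
hypothetical, and the two string values are assumed stand-ins for
XEN_HWP_DRIVER_NAME and XEN_AMD_CPPC_EPP_DRIVER_NAME.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed values; the real macros are defined in Xen's public headers. */
#define HWP_DRIVER_NAME          "hwp"
#define AMD_CPPC_EPP_DRIVER_NAME "amd-cppc-epp"

/* Hypothetical helper: true when the named driver uses no Xen governor. */
static bool driver_is_governor_less(const char *scaling_driver)
{
    assert(scaling_driver != NULL);

    return !strcmp(scaling_driver, HWP_DRIVER_NAME) ||
           !strcmp(scaling_driver, AMD_CPPC_EPP_DRIVER_NAME);
}
```

Under these assumptions the passive-mode "amd-cppc" driver is not matched, so
its governor fields would still be printed.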




* [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (5 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 6/8] xen/cpufreq: bypass governor-related para for amd-cppc-epp Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04 12:33   ` Jan Beulich
  2025-09-04  6:35 ` [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support Penny Zheng
  2025-09-09 16:10 ` [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Jan Beulich
  8 siblings, 1 reply; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Penny Zheng, Anthony PERARD, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

Introduce helpers amd_cppc_set_para() and amd_cppc_get_para() to
SET/GET CPPC-related para for amd-cppc/amd-cppc-epp driver.

In get_cpufreq_cppc()/set_cpufreq_cppc(), we include
"processor_pminfo[cpuid]->init & XEN_CPPC_INIT" condition check to deal with
cpufreq driver in amd-cppc.
We borrow governor field to indicate policy info for CPPC active mode,
so we need to move the copying of the governor name out of the
!cpufreq_is_governorless() guard.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
---
v1 -> v2:
- Give the variable des_perf an initializer of 0
- Use the strncmp()s directly in the if()
---
v3 -> v4
- refactor comments
- remove double blank lines
- replace amd_cppc_in_use flag with XEN_PROCESSOR_PM_CPPC
---
v4 -> v5:
- add new field "policy" in "struct xen_cppc_para"
- add new performance policy XEN_CPUFREQ_POLICY_BALANCE
- drop string comparisons with "processor_pminfo[cpuid]->init & XEN_CPPC_INIT"
and "cpufreq.setpolicy == NULL"
- Blank line ahead of the main "return" of a function
- refactor comments, commit message and title
---
v5 -> v6:
- remove duplicated manifest constants, and just move it to public header
- use "else if" to avoid confusion that it looks as if both paths could be taken
- add check for legitimate perf values
- use "unknown" instead of "none"
- introduce "CPUFREQ_POLICY_END" for array overrun check in user space tools
---
v6 -> v7:
- use ARRAY_SIZE() instead
- ->policy print is avoided in passive mode and print "unknown" in invalid
cases
- let cpufreq_is_governorless() be the variable's initializer
- refactor with the conditional operator to increase readability
- move duplicated definition ahead and use local variable
- avoid using "else-condition" to bring "dead code" in Misra's nomenclature
- move the comment out of public header and into the respective internal
struct field
- wrap set{,get}_amd_cppc_para() with CONFIG_PM_OP
- add symmetry scenario for maximum check
---
v7 -> v8:
- change function name to amd_cppc_get{,set}_para()
- fix too deep indentation, and indent according to pending open parentheses
- missing -EINVAL when no flag is set at all
- use new helper amd_cppc_prepare_policy() to reduce redundancy
- borrow governor field to indicate policy info
---
v8 -> v9
- add description of "moving the copying of the governor name"
- Adapt to changes of "Embed struct amd_cppc_drv_data{} into struct
cpufreq_policy{}"
---
 tools/misc/xenpm.c                   |  13 ++-
 xen/arch/x86/acpi/cpufreq/amd-cppc.c | 163 +++++++++++++++++++++++++++
 xen/drivers/acpi/pm-op.c             |  28 +++--
 xen/include/acpi/cpufreq/cpufreq.h   |   4 +
 4 files changed, 195 insertions(+), 13 deletions(-)

diff --git a/tools/misc/xenpm.c b/tools/misc/xenpm.c
index 893a0afe11..bda9c62aa0 100644
--- a/tools/misc/xenpm.c
+++ b/tools/misc/xenpm.c
@@ -832,11 +832,14 @@ static void print_cppc_para(unsigned int cpuid,
 /* print out parameters about cpu frequency */
 static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
 {
-    bool is_governor_less = false;
+    bool is_governor_less = false, is_cppc_active = false;
     int i;
 
-    if ( !strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) ||
-         !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
+    if ( !strcmp(p_cpufreq->scaling_driver, XEN_AMD_CPPC_EPP_DRIVER_NAME) )
+        is_cppc_active = true;
+
+    if ( is_cppc_active ||
+         !strcmp(p_cpufreq->scaling_driver, XEN_HWP_DRIVER_NAME) )
         is_governor_less = true;
 
     printf("cpu id               : %d\n", cpuid);
@@ -899,6 +902,10 @@ static void print_cpufreq_para(int cpuid, struct xc_get_cpufreq_para *p_cpufreq)
                p_cpufreq->u.s.scaling_cur_freq);
     }
 
+    /* Translate governor info to policy info in CPPC active mode */
+    if ( is_cppc_active )
+        printf("policy               : %s\n", p_cpufreq->u.s.scaling_governor);
+
     printf("turbo mode           : %s\n",
            p_cpufreq->turbo_enabled ? "enabled" : "disabled or n/a");
     printf("\n");
diff --git a/xen/arch/x86/acpi/cpufreq/amd-cppc.c b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
index 80b829b84e..01203c65b1 100644
--- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
+++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
@@ -561,6 +561,169 @@ static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
     return 0;
 }
 
+#ifdef CONFIG_PM_OP
+int amd_cppc_get_para(const struct cpufreq_policy *policy,
+                      struct xen_get_cppc_para *cppc_para)
+{
+    const struct amd_cppc_drv_data *data = policy->u.amd_cppc;
+
+    if ( data == NULL )
+        return -ENODATA;
+
+    cppc_para->lowest           = data->caps.lowest_perf;
+    cppc_para->lowest_nonlinear = data->caps.lowest_nonlinear_perf;
+    cppc_para->nominal          = data->caps.nominal_perf;
+    cppc_para->highest          = data->caps.highest_perf;
+    cppc_para->minimum          = data->req.min_perf;
+    cppc_para->maximum          = data->req.max_perf;
+    cppc_para->desired          = data->req.des_perf;
+    cppc_para->energy_perf      = data->req.epp;
+
+    return 0;
+}
+
+int amd_cppc_set_para(struct cpufreq_policy *policy,
+                      const struct xen_set_cppc_para *set_cppc)
+{
+    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
+    uint8_t max_perf, min_perf, des_perf, epp;
+    bool active_mode = cpufreq_is_governorless(policy->cpu);
+
+    if ( data == NULL )
+        return -ENOENT;
+
+    /* Only allow values if params bit is set. */
+    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
+          set_cppc->desired) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
+          set_cppc->minimum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
+          set_cppc->maximum) ||
+         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
+          set_cppc->energy_perf) )
+        return -EINVAL;
+
+    /* Return if there is nothing to do. */
+    if ( set_cppc->set_params == 0 )
+        return 0;
+
+    /*
+     * Validate all parameters
+     * Maximum performance may be set to any performance value in the range
+     * [Nonlinear Lowest Performance, Highest Performance], inclusive but must
+     * be set to a value that is larger than or equal to minimum Performance.
+     */
+    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
+         (set_cppc->maximum > data->caps.highest_perf ||
+          (set_cppc->maximum <
+           (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM
+            ? set_cppc->minimum
+            : data->req.min_perf))) )
+        return -EINVAL;
+    /*
+     * Minimum performance may be set to any performance value in the range
+     * [Nonlinear Lowest Performance, Highest Performance], inclusive but must
+     * be set to a value that is less than or equal to Maximum Performance.
+     */
+    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
+         (set_cppc->minimum < data->caps.lowest_nonlinear_perf ||
+          (set_cppc->minimum >
+           (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM
+            ? set_cppc->maximum
+            : data->req.max_perf))) )
+        return -EINVAL;
+    /*
+     * Desired performance may be set to any performance value in the range
+     * [Minimum Performance, Maximum Performance], inclusive.
+     */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+    {
+        if ( active_mode )
+            return -EOPNOTSUPP;
+
+        if ( (set_cppc->desired >
+              (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM
+               ? set_cppc->maximum
+               : data->req.max_perf)) ||
+             (set_cppc->desired <
+              (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM
+               ? set_cppc->minimum
+               : data->req.min_perf)) )
+            return -EINVAL;
+    }
+    /*
+     * Energy Performance Preference may be set with a range of values
+     * from 0 to 0xFF
+     */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
+    {
+        if ( !active_mode )
+            return -EOPNOTSUPP;
+
+        if ( set_cppc->energy_perf > UINT8_MAX )
+            return -EINVAL;
+    }
+
+    /* Activity window not supported in MSR */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
+        return -EOPNOTSUPP;
+
+    des_perf = data->req.des_perf;
+    /*
+     * Apply presets:
+     * XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE/PERFORMANCE/ONDEMAND are
+     * only available when CPPC in active mode
+     */
+    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
+    {
+    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
+        if ( !active_mode )
+            return -EINVAL;
+        policy->policy = CPUFREQ_POLICY_POWERSAVE;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
+        if ( !active_mode )
+            return -EINVAL;
+        policy->policy = CPUFREQ_POLICY_PERFORMANCE;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_ONDEMAND:
+        if ( !active_mode )
+            return -EINVAL;
+        policy->policy = CPUFREQ_POLICY_ONDEMAND;
+        break;
+
+    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
+        if ( active_mode )
+            policy->policy = CPUFREQ_POLICY_UNKNOWN;
+        break;
+
+    default:
+        return -EINVAL;
+    }
+    amd_cppc_prepare_policy(policy, &max_perf, &min_perf, &epp);
+
+    /* Further customize presets if needed */
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
+        min_perf = set_cppc->minimum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
+        max_perf = set_cppc->maximum;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
+        epp = set_cppc->energy_perf;
+
+    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
+        des_perf = set_cppc->desired;
+
+    amd_cppc_write_request(policy->cpu, data,
+                           min_perf, des_perf, max_perf, epp);
+
+    return 0;
+}
+#endif /* CONFIG_PM_OP */
+
 static const struct cpufreq_driver __initconst_cf_clobber
 amd_cppc_cpufreq_driver =
 {
diff --git a/xen/drivers/acpi/pm-op.c b/xen/drivers/acpi/pm-op.c
index 371deaf678..bcb3b9b2a7 100644
--- a/xen/drivers/acpi/pm-op.c
+++ b/xen/drivers/acpi/pm-op.c
@@ -84,6 +84,8 @@ static int get_cpufreq_cppc(unsigned int cpu,
 
     if ( hwp_active() )
         ret = get_hwp_para(cpu, cppc_para);
+    else if ( processor_pminfo[cpu]->init & XEN_CPPC_INIT )
+        ret = amd_cppc_get_para(per_cpu(cpufreq_cpu_policy, cpu), cppc_para);
 
     return ret;
 }
@@ -154,6 +156,17 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
     else
         strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
 
+    /*
+     * In CPPC active mode, we are borrowing governor field to indicate
+     * policy info.
+     */
+    if ( policy->governor->name[0] )
+        strlcpy(op->u.get_para.u.s.scaling_governor,
+                policy->governor->name, CPUFREQ_NAME_LEN);
+    else
+        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
+                CPUFREQ_NAME_LEN);
+
     if ( !cpufreq_is_governorless(op->cpuid) )
     {
         if ( !(scaling_available_governors =
@@ -178,13 +191,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
         op->u.get_para.u.s.scaling_max_freq = policy->max;
         op->u.get_para.u.s.scaling_min_freq = policy->min;
 
-        if ( policy->governor->name[0] )
-            strlcpy(op->u.get_para.u.s.scaling_governor,
-                    policy->governor->name, CPUFREQ_NAME_LEN);
-        else
-            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
-                    CPUFREQ_NAME_LEN);
-
         /* governor specific para */
         if ( !strncasecmp(op->u.get_para.u.s.scaling_governor,
                           "userspace", CPUFREQ_NAME_LEN) )
@@ -321,10 +327,12 @@ static int set_cpufreq_cppc(struct xen_sysctl_pm_op *op)
     if ( !policy || !policy->governor )
         return -ENOENT;
 
-    if ( !hwp_active() )
-        return -EOPNOTSUPP;
+    if ( hwp_active() )
+        return set_hwp_para(policy, &op->u.set_cppc);
+    if ( processor_pminfo[op->cpuid]->init & XEN_CPPC_INIT )
+        return amd_cppc_set_para(policy, &op->u.set_cppc);
 
-    return set_hwp_para(policy, &op->u.set_cppc);
+    return -EOPNOTSUPP;
 }
 
 int do_pm_op(struct xen_sysctl_pm_op *op)
diff --git a/xen/include/acpi/cpufreq/cpufreq.h b/xen/include/acpi/cpufreq/cpufreq.h
index 85fbf772a0..adecf57e18 100644
--- a/xen/include/acpi/cpufreq/cpufreq.h
+++ b/xen/include/acpi/cpufreq/cpufreq.h
@@ -303,6 +303,10 @@ int acpi_cpufreq_register(void);
 
 int amd_cppc_cmdline_parse(const char *s, const char *e);
 int amd_cppc_register_driver(void);
+int amd_cppc_get_para(const struct cpufreq_policy *policy,
+                      struct xen_get_cppc_para *cppc_para);
+int amd_cppc_set_para(struct cpufreq_policy *policy,
+                      const struct xen_set_cppc_para *set_cppc);
 
 /*
  * Governor-less cpufreq driver indicates the driver doesn't rely on Xen
-- 
2.34.1
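
The range checks in amd_cppc_set_para() are easier to follow when separated
from the driver. What follows is a rough standalone sketch of that validation
order only, assuming simplified types: the flag bits, struct names and
check_cppc_request() are hypothetical stand-ins for the XEN_SYSCTL_CPPC_SET_*
constants and the driver structures, and the EPP, activity-window and preset
handling are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, standing in for XEN_SYSCTL_CPPC_SET_*. */
#define SET_MINIMUM (1u << 0)
#define SET_MAXIMUM (1u << 1)
#define SET_DESIRED (1u << 2)

struct caps { uint8_t lowest_nonlinear, highest; };
struct req  { uint8_t min, max; };      /* values currently programmed */
struct set  { uint32_t params; uint32_t minimum, maximum, desired; };

/* Returns 0 when the request is consistent, a negative errno otherwise. */
static int check_cppc_request(const struct caps *c, const struct req *cur,
                              const struct set *s, bool active_mode)
{
    /* Effective bound: the new value if it is being set, else the current. */
    uint32_t eff_min = (s->params & SET_MINIMUM) ? s->minimum : cur->min;
    uint32_t eff_max = (s->params & SET_MAXIMUM) ? s->maximum : cur->max;

    /* A non-zero value without its flag bit is rejected. */
    if ( (!(s->params & SET_MINIMUM) && s->minimum) ||
         (!(s->params & SET_MAXIMUM) && s->maximum) ||
         (!(s->params & SET_DESIRED) && s->desired) )
        return -EINVAL;

    /* Maximum: at most highest, and not below the effective minimum. */
    if ( (s->params & SET_MAXIMUM) &&
         (s->maximum > c->highest || s->maximum < eff_min) )
        return -EINVAL;

    /* Minimum: at least lowest nonlinear, and not above the effective maximum. */
    if ( (s->params & SET_MINIMUM) &&
         (s->minimum < c->lowest_nonlinear || s->minimum > eff_max) )
        return -EINVAL;

    /* Desired: passive mode only, and within [eff_min, eff_max]. */
    if ( s->params & SET_DESIRED )
    {
        if ( active_mode )
            return -EOPNOTSUPP;
        if ( s->desired > eff_max || s->desired < eff_min )
            return -EINVAL;
    }

    return 0;
}
```

Computing the effective bounds once is a presentation choice of this sketch;
the patch spells out the same conditional at each comparison.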




* [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (6 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
@ 2025-09-04  6:35 ` Penny Zheng
  2025-09-04  6:47   ` Jan Beulich
  2025-09-09 16:10 ` [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Jan Beulich
  8 siblings, 1 reply; 29+ messages in thread
From: Penny Zheng @ 2025-09-04  6:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Penny Zheng, Oleksii Kurochko, Community Manager

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
---
 CHANGELOG.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index cd34ea87b8..c1a57924f3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -33,6 +33,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
    - Support in hvmloader for new SMBIOS tables: 7 (Cache Info), 8 (Port
      Connector), 9 (System Slots), 26 (Voltage Probe), 27 (Cooling Device),
      and 28 (Temperature Probe).
+   - Support amd-cppc/amd-cppc-epp cpufreq driver
 
  - On Arm:
     - Ability to enable stack protector
-- 
2.34.1




* Re: [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support
  2025-09-04  6:35 ` [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support Penny Zheng
@ 2025-09-04  6:47   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04  6:47 UTC (permalink / raw)
  To: Penny Zheng; +Cc: Oleksii Kurochko, Community Manager, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> --- a/CHANGELOG.md
> +++ b/CHANGELOG.md
> @@ -33,6 +33,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
>     - Support in hvmloader for new SMBIOS tables: 7 (Cache Info), 8 (Port
>       Connector), 9 (System Slots), 26 (Voltage Probe), 27 (Cooling Device),
>       and 28 (Temperature Probe).
> +   - Support amd-cppc/amd-cppc-epp cpufreq driver

s/Support/New/ ? Otherwise this reads as if the driver had been there already
(no matter that this is in the "Added" section, but the adjacent entries are
ambiguous already as to "added" vs "changed").

Also a full stop please at the end, like all other entries here have it.

Both can easily be adjusted while committing, of course.

Jan



* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-04  6:35 ` [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{} Penny Zheng
@ 2025-09-04 11:50   ` Jan Beulich
  2025-09-04 18:53     ` Jason Andryuk
       [not found]     ` <DM4PR12MB8451C5D54EFEC8F6E0B76E43E103A@DM4PR12MB8451.namprd12.prod.outlook.com>
  0 siblings, 2 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 11:50 UTC (permalink / raw)
  To: Penny Zheng, Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> For cpus sharing one cpufreq domain, cpufreq_driver.init() is
> only invoked on the firstcpu, so current per-CPU hwp driver data
> struct hwp_drv_data{} actually fails to be allocated for cpus other than the
> first one. There is no need to make it per-CPU.
> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then cpus could
> share the hwp driver data allocated for the firstcpu, like the way they share
> struct cpufreq_policy{}. We also make it a union, with "hwp", and later
> "amd-cppc" as a sub-struct.

And ACPI, as per my patch (which then will need re-basing).

> Suggested-by: Jan Beulich <jbeulich@suse.com>

Not quite, this really is Reported-by: as it's a bug you fix, and in turn it
also wants to gain a Fixes: tag. This also will need backporting.

It would also have been nice if you had Cc-ed Jason right away, seeing that
this code was all written by him.

> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>                                         unsigned int relation)
>  {
>      unsigned int cpu = policy->cpu;
> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    struct hwp_drv_data *data = policy->u.hwp;
>      /* Zero everything to ensure reserved bits are zero... */
>      union hwp_request hwp_req = { .raw = 0 };

Further down in this same function we have

    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);

That's similarly problematic when the CPU denoted by policy->cpu isn't
online anymore. (It's not quite clear whether all related issues would
want fixing together, or in multiple patches.)

> @@ -350,7 +349,7 @@ static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
>  static void cf_check hwp_init_msrs(void *info)
>  {
>      struct cpufreq_policy *policy = info;
> -    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    struct hwp_drv_data *data = policy->u.hwp;
>      uint64_t val;
>  
>      /*
> @@ -426,15 +425,14 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  
>      policy->governor = &cpufreq_gov_hwp;
>  
> -    per_cpu(hwp_drv_data, cpu) = data;
> +    policy->u.hwp = data;
>  
>      on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);

If multiple CPUs are in a domain, not all of them will make it here. By
implication the MSRs accessed by hwp_init_msrs() would need to have wider
than thread scope. The SDM, afaics, says nothing either way in this regard
in the Architectural MSRs section. Later model-specific tables have some
data.

Which gets me back to my original question: Is "sharing" actually possible
for HWP? Note further how there are both HWP_REQUEST and HWP_REQUEST_PKG
MSRs, for example. Which one is (to be) used looks to be controlled by
HWP_CTL.PKG_CTL_POLARITY.

> @@ -462,10 +460,8 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  
>  static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  {
> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
> -
> -    per_cpu(hwp_drv_data, policy->cpu) = NULL;
> -    xfree(data);
> +    if ( policy->u.hwp )
> +        XFREE(policy->u.hwp);

No if() needed here.
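
Why the guard is redundant can be shown with a small standalone sketch. It
assumes a free-then-clear macro whose underlying free routine accepts NULL, as
standard free() does; FREE_AND_NULL() and struct policy below are hypothetical
stand-ins, not Xen's XFREE() or struct cpufreq_policy{}.

```c
#include <stdlib.h>

/* Hypothetical free-and-clear macro; free(NULL) is defined to do nothing. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while ( 0 )

struct policy { int *drv_data; };

static void policy_exit(struct policy *pol)
{
    /* No "if ( pol->drv_data )" needed: freeing NULL is harmless. */
    FREE_AND_NULL(pol->drv_data);
}
```

Calling policy_exit() twice on the same object is therefore safe in this
sketch, since the second call only frees NULL.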

> @@ -480,7 +476,7 @@ static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
>  static void cf_check hwp_set_misc_turbo(void *info)
>  {
>      const struct cpufreq_policy *policy = info;
> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, policy->cpu);
> +    struct hwp_drv_data *data = policy->u.hwp;
>      uint64_t msr;
>  
>      data->ret = 0;
> @@ -511,7 +507,7 @@ static int cf_check hwp_cpufreq_update(unsigned int cpu, struct cpufreq_policy *
>  {
>      on_selected_cpus(cpumask_of(cpu), hwp_set_misc_turbo, policy, 1);
>  
> -    return per_cpu(hwp_drv_data, cpu)->ret;
> +    return policy->u.hwp->ret;
>  }
>  #endif /* CONFIG_PM_OP */

Same concern here wrt MSR scope. MISC_ENABLE.TURBO_DISENGAGE's scope is
package as per the few tables which have the bit explicitly explained;
whether that extends to all models is unclear.

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -62,6 +62,7 @@ struct perf_limits {
>      uint32_t min_policy_pct;
>  };
>  
> +struct hwp_drv_data;

This shouldn't be needed.

> @@ -81,6 +82,11 @@ struct cpufreq_policy {
>      int8_t              turbo;  /* tristate flag: 0 for unsupported
>                                   * -1 for disable, 1 for enabled
>                                   * See CPUFREQ_TURBO_* below for defines */
> +    union {
> +#ifdef CONFIG_INTEL
> +        struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
> +#endif

While it may make for a smaller diff, ultimately I think we don't want
this to be a pointer, much like I've done in my patch for the ACPI driver.

> +    } u;

This wants to either not have a name at all, or be named e.g. drv_data.
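
To make the two remarks above concrete, here is a rough sketch of a policy
structure whose driver data is embedded by value in an unnamed union (a C11
feature). All struct and field names are illustrative assumptions; they do not
reflect the layout of struct cpufreq_policy{} or of the ACPI patch referred to.

```c
#include <stdint.h>

struct hwp_data      { int ret; uint16_t desired; };
struct amd_cppc_data { int err; uint8_t epp; };

struct policy_sketch {
    unsigned int cpu;
    /* Unnamed union: members are reached as pol->hwp, not pol->u.hwp. */
    union {
        struct hwp_data      hwp;       /* embedded, no separate allocation */
        struct amd_cppc_data amd_cppc;
    };
};

/* Only one driver is active at a time, so the storage can be shared. */
static int policy_hwp_result(const struct policy_sketch *pol)
{
    return pol->hwp.ret;
}
```

With the data embedded, the init path has no allocation that can fail and the
exit path has nothing to free.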

Jan



* Re: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-04  6:35 ` [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode Penny Zheng
@ 2025-09-04 12:04   ` Jan Beulich
  2025-09-05  5:15     ` Penny, Zheng
  2025-09-04 12:11   ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 12:04 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> amd-cppc is the AMD CPU performance scaling driver that introduces a
> new CPU frequency control mechanism. The new mechanism is based on
> Collaborative Processor Performance Control (CPPC) which is a finer grain
> frequency management than legacy ACPI hardware P-States.
> Current AMD CPU platforms are using the ACPI P-states driver to
> manage CPU frequency and clocks with switching only in 3 P-states, while the
> new amd-cppc allows a more flexible, low-latency interface for Xen
> to directly communicate the performance hints to hardware.
> 
> "amd-cppc" driver is responsible for implementing CPPC in passive mode, which
> still leverages Xen governors such as *ondemand*, *performance*, etc, to
> calculate the performance hints. In the future, we will introduce an advanced
> active mode to enable autonomous performence level selection.
> 
> Field epp, energy performance preference, which only has meaning when active
> mode is enabled and will be introduced later in details, so we read
> pre-defined BIOS value for it in passive mode.
> 
> Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>

With the issue I had pointed out, leading to ...

> ---
> v8 -> v9
> - embed struct amd_cppc_drv_data{} into struct cpufreq_policy{}

... this change, I think the tag would have needed to be dropped.

> +static void cf_check amd_cppc_write_request_msrs(void *info)
> +{
> +    const struct amd_cppc_drv_data *data = info;
> +
> +    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
> +}
> +
> +static void amd_cppc_write_request(unsigned int cpu,
> +                                   struct amd_cppc_drv_data *data,
> +                                   uint8_t min_perf, uint8_t des_perf,
> +                                   uint8_t max_perf, uint8_t epp)
> +{
> +    uint64_t prev = data->req.raw;
> +
> +    data->req.min_perf = min_perf;
> +    data->req.max_perf = max_perf;
> +    data->req.des_perf = des_perf;
> +    data->req.epp = epp;
> +
> +    if ( prev == data->req.raw )
> +        return;
> +
> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);

With "cpu" coming from ...

> +}
> +
> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> +                                            unsigned int target_freq,
> +                                            unsigned int relation)
> +{
> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> +    uint8_t des_perf;
> +    int res;
> +
> +    if ( unlikely(!target_freq) )
> +        return 0;
> +
> +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
> +    if ( res )
> +        return res;
> +
> +    /*
> +     * Having a performance level lower than the lowest nonlinear
> +     * performance level, such as, lowest_perf <= perf <= lowest_nonliner_perf,
> +     * may actually cause an efficiency penalty, So when deciding the min_perf
> +     * value, we prefer lowest nonlinear performance over lowest performance.
> +     */
> +    amd_cppc_write_request(policy->cpu, data, data->caps.lowest_nonlinear_perf,

... here, how can this work when this particular CPU isn't online anymore?

> +                           des_perf, data->caps.highest_perf,
> +                           /* Pre-defined BIOS value for passive mode */
> +                           per_cpu(epp_init, policy->cpu));
> +    return 0;
> +}
> +
> +static void cf_check amd_cppc_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> +    uint64_t val;
> +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
> +
> +    /* Package level MSR */
> +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);

Here you clarify the scope, yet what about ...

> +    /*
> +     * Only when Enable bit is on, the hardware will calculate the processor’s
> +     * performance capabilities and initialize the performance level fields in
> +     * the CPPC capability registers.
> +     */
> +    if ( !(val & AMD_CPPC_ENABLE) )
> +    {
> +        val |= AMD_CPPC_ENABLE;
> +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
> +    }
> +
> +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);

... this and ...

> +    if ( data->caps.highest_perf == 0 || data->caps.lowest_perf == 0 ||
> +         data->caps.nominal_perf == 0 || data->caps.lowest_nonlinear_perf == 0 ||
> +         data->caps.lowest_perf > data->caps.lowest_nonlinear_perf ||
> +         data->caps.lowest_nonlinear_perf > data->caps.nominal_perf ||
> +         data->caps.nominal_perf > data->caps.highest_perf )
> +    {
> +        amd_cppc_err(policy->cpu,
> +                     "Out of range values: highest(%u), lowest(%u), nominal(%u), lowest_nonlinear(%u)\n",
> +                     data->caps.highest_perf, data->caps.lowest_perf,
> +                     data->caps.nominal_perf, data->caps.lowest_nonlinear_perf);
> +        goto err;
> +    }
> +
> +    amd_process_freq(&cpu_data[policy->cpu],
> +                     NULL, NULL, &this_cpu(pxfreq_mhz));
> +
> +    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.lowest_mhz,
> +                                 data->caps.lowest_perf, &min_freq);
> +    if ( data->err )
> +        return;
> +
> +    data->err = amd_get_cpc_freq(data, data->cppc_data->cpc.nominal_mhz,
> +                                 data->caps.nominal_perf, &nominal_freq);
> +    if ( data->err )
> +        return;
> +
> +    data->err = amd_get_max_freq(data, &max_freq);
> +    if ( data->err )
> +        return;
> +
> +    if ( min_freq > nominal_freq || nominal_freq > max_freq )
> +    {
> +        amd_cppc_err(policy->cpu,
> +                     "min(%u), or max(%u), or nominal(%u) freq value is incorrect\n",
> +                     min_freq, max_freq, nominal_freq);
> +        goto err;
> +    }
> +
> +    policy->min = min_freq;
> +    policy->max = max_freq;
> +
> +    policy->cpuinfo.min_freq = min_freq;
> +    policy->cpuinfo.max_freq = max_freq;
> +    policy->cpuinfo.perf_freq = nominal_freq;
> +    /*
> +     * Set after policy->cpuinfo.perf_freq, as we are taking
> +     * APERF/MPERF average frequency as current frequency.
> +     */
> +    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
> +
> +    /* Store pre-defined BIOS value for passive mode */
> +    rdmsrl(MSR_AMD_CPPC_REQ, val);

... this?

> +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct amd_cppc_drv_data *data;
> +
> +    data = xvzalloc(struct amd_cppc_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +    policy->u.amd_cppc = data;
> +
> +    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
> +
> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
> +
> +    /*
> +     * The enable bit is sticky, as we need to enable it at the very first
> +     * begining, before CPPC capability values sanity check.
> +     * If error path is taken effective, not only amd-cppc cpufreq core fails
> +     * to initialize, but also we could not fall back to legacy P-states
> +     * driver, irrespective of the command line specifying a fallback option.
> +     */
> +    if ( data->err )
> +    {
> +        amd_cppc_err(cpu, "Could not initialize cpufreq core in CPPC mode\n");
> +        amd_cppc_cpufreq_cpu_exit(policy);
> +        return data->err;

amd_cppc_cpufreq_cpu_exit() has already freed what data points to.

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -63,6 +63,7 @@ struct perf_limits {
>  };
>  
>  struct hwp_drv_data;
> +struct amd_cppc_drv_data;
>  struct cpufreq_policy {
>      cpumask_var_t       cpus;          /* affected CPUs */
>      unsigned int        shared_type;   /* ANY or ALL affected CPUs
> @@ -85,6 +86,9 @@ struct cpufreq_policy {
>      union {
>  #ifdef CONFIG_INTEL
>          struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
> +#endif
> +#ifdef CONFIG_AMD
> +        struct amd_cppc_drv_data *amd_cppc; /* Driver data for AMD CPPC */
>  #endif
>      } u;
>  };

Same comments here as for the HWP patch.

Jan



* Re: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-04  6:35 ` [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode Penny Zheng
  2025-09-04 12:04   ` Jan Beulich
@ 2025-09-04 12:11   ` Jan Beulich
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 12:11 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> @@ -50,10 +139,333 @@ int __init amd_cppc_cmdline_parse(const char *s, const char *e)
>      return 0;
>  }
>  
> +/*
> + * If CPPC lowest_freq and nominal_freq registers are exposed then we can
> + * use them to convert perf to freq and vice versa. The conversion is
> + * extrapolated as an linear function passing by the 2 points:
> + *  - (Low perf, Low freq)
> + *  - (Nominal perf, Nominal freq)
> + * Parameter freq is always in kHz.
> + */
> +static int amd_cppc_khz_to_perf(const struct amd_cppc_drv_data *data,
> +                                unsigned int freq, uint8_t *perf)
> +{
> +    const struct xen_processor_cppc *cppc_data = data->cppc_data;
> +    unsigned int mul, div;
> +    int offset = 0, res;
> +
> +    if ( cppc_data->cpc.lowest_mhz &&
> +         data->caps.nominal_perf > data->caps.lowest_perf &&
> +         cppc_data->cpc.nominal_mhz > cppc_data->cpc.lowest_mhz )
> +    {
> +        mul = data->caps.nominal_perf - data->caps.lowest_perf;
> +        div = cppc_data->cpc.nominal_mhz - cppc_data->cpc.lowest_mhz;
> +
> +        /*
> +         * We don't need to convert to kHz for computing offset and can
> +         * directly use nominal_mhz and lowest_mhz as the division
> +         * will remove the frequency unit.
> +         */
> +        offset = data->caps.nominal_perf -
> +                 (mul * cppc_data->cpc.nominal_mhz) / div;
> +    }
> +    else
> +    {
> +        /* Read Processor Max Speed(MHz) as anchor point */
> +        mul = data->caps.highest_perf;
> +        div = this_cpu(pxfreq_mhz);

How do you know you ever initialized this instance of the per-CPU variable?
amd_cppc_init_msrs() may never have run for this particular CPU.
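
As a side note on the conversion described in the quoted comment, here is a
small standalone sketch of mapping a kHz value to a perf level through the two
anchor points. Only the mul/div/offset computation is taken from the hunk
above; the final step (offset + mul * freq / (div * 1000)), the function name
khz_to_perf() and the sample numbers are assumptions made for illustration,
not the driver's actual code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map freq_khz to a perf level on the line through
 * (lowest_perf, lowest_mhz) and (nominal_perf, nominal_mhz).
 * Returns -1 when the two anchor points do not define a rising line.
 */
static int khz_to_perf(unsigned int lowest_perf, unsigned int nominal_perf,
                       unsigned int lowest_mhz, unsigned int nominal_mhz,
                       unsigned int freq_khz)
{
    uint64_t mul, div;
    int offset;

    if ( nominal_perf <= lowest_perf || nominal_mhz <= lowest_mhz )
        return -1;

    mul = nominal_perf - lowest_perf;
    div = nominal_mhz - lowest_mhz;

    /* MHz can be used directly here, the division removes the unit. */
    offset = (int)nominal_perf - (int)((mul * nominal_mhz) / div);

    /* Assumed final step: scale the kHz input down to MHz in the divisor. */
    return offset + (int)((mul * freq_khz) / (div * 1000));
}
```

With lowest_perf=20, nominal_perf=100, lowest_mhz=400 and nominal_mhz=2000 the
offset is 0, and 1200000 kHz maps to perf level 60.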

> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> +                                            unsigned int target_freq,
> +                                            unsigned int relation)
> +{
> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> +    uint8_t des_perf;
> +    int res;
> +
> +    if ( unlikely(!target_freq) )
> +        return 0;
> +
> +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
> +    if ( res )
> +        return res;
> +
> +    /*
> +     * Having a performance level lower than the lowest nonlinear
> +     * performance level, such as, lowest_perf <= perf <= lowest_nonliner_perf,
> +     * may actually cause an efficiency penalty, So when deciding the min_perf
> +     * value, we prefer lowest nonlinear performance over lowest performance.
> +     */
> +    amd_cppc_write_request(policy->cpu, data, data->caps.lowest_nonlinear_perf,
> +                           des_perf, data->caps.highest_perf,
> +                           /* Pre-defined BIOS value for passive mode */
> +                           per_cpu(epp_init, policy->cpu));

This may access per-CPU data of an offline CPU.

Jan



* Re: [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
  2025-09-04  6:35 ` [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode Penny Zheng
@ 2025-09-04 12:12   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 12:12 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Roger Pau Monné, Stefano Stabellini, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> ---
> v8 -> v9:
> - Adapt to changes of "Embed struct amd_cppc_drv_data{} into struct
> cpufreq_policy{}"

With problems mentioned there also extending here.

Jan



* Re: [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para
  2025-09-04  6:35 ` [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para Penny Zheng
@ 2025-09-04 12:26   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 12:26 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Anthony PERARD, Juergen Gross, Andrew Cooper, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -492,7 +492,6 @@ struct xen_get_cpufreq_para {
>                  struct  xen_ondemand ondemand;
>              } u;
>          } s;
> -        struct xen_get_cppc_para cppc_para;
>      } u;

Which means the outer union could now be dropped as well. Which may help
proving (or perhaps even easily seeing) the safety of the change you're
making in patch 7.

Jan



* Re: [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver
  2025-09-04  6:35 ` [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
@ 2025-09-04 12:33   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-04 12:33 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Anthony PERARD, Andrew Cooper, Roger Pau Monné, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> Introduce helper set_amd_cppc_para() and get_amd_cppc_para() to
> SET/GET CPPC-related para for amd-cppc/amd-cppc-epp driver.
> 
> In get_cpufreq_cppc()/set_cpufreq_cppc(), we include
> "processor_pminfo[cpuid]->init & XEN_CPPC_INIT" condition check to deal with
> cpufreq driver in amd-cppc.
> We borrow governor field to indicate policy info for CPPC active mode,
> so we need to move the copying of the governor name out of the
> !cpufreq_is_governorless() guard.

Well, as said in my v8 comment - it's not so much the "what" that needs covering,
but the "why is it correct / safe to do so". See my respective reply to patch 5,
and also to Jason's response on the v8 thread. Perhaps with the union there
removed this doesn't need calling out explicitly anymore.

> ---
> v8 -> v9
> - add description of "moving the copying of the governor name"
> - Adapt to changes of "Embed struct amd_cppc_drv_data{} into struct
> cpufreq_policy{}"

Except that again problems extend to here as well.

> --- a/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> +++ b/xen/arch/x86/acpi/cpufreq/amd-cppc.c
> @@ -561,6 +561,169 @@ static int cf_check amd_cppc_epp_set_policy(struct cpufreq_policy *policy)
>      return 0;
>  }
>  
> +#ifdef CONFIG_PM_OP
> +int amd_cppc_get_para(const struct cpufreq_policy *policy,
> +                      struct xen_get_cppc_para *cppc_para)
> +{
> +    const struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> +
> +    if ( data == NULL )
> +        return -ENODATA;
> +
> +    cppc_para->lowest           = data->caps.lowest_perf;
> +    cppc_para->lowest_nonlinear = data->caps.lowest_nonlinear_perf;
> +    cppc_para->nominal          = data->caps.nominal_perf;
> +    cppc_para->highest          = data->caps.highest_perf;
> +    cppc_para->minimum          = data->req.min_perf;
> +    cppc_para->maximum          = data->req.max_perf;
> +    cppc_para->desired          = data->req.des_perf;
> +    cppc_para->energy_perf      = data->req.epp;
> +
> +    return 0;
> +}
> +
> +int amd_cppc_set_para(struct cpufreq_policy *policy,
> +                      const struct xen_set_cppc_para *set_cppc)
> +{
> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> +    uint8_t max_perf, min_perf, des_perf, epp;
> +    bool active_mode = cpufreq_is_governorless(policy->cpu);
> +
> +    if ( data == NULL )
> +        return -ENOENT;
> +
> +    /* Only allow values if params bit is set. */
> +    if ( (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED) &&
> +          set_cppc->desired) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
> +          set_cppc->minimum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
> +          set_cppc->maximum) ||
> +         (!(set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF) &&
> +          set_cppc->energy_perf) )
> +        return -EINVAL;
> +
> +    /* Return if there is nothing to do. */
> +    if ( set_cppc->set_params == 0 )
> +        return 0;
> +
> +    /*
> +     * Validate all parameters
> +     * Maximum performance may be set to any performance value in the range
> +     * [Nonlinear Lowest Performance, Highest Performance], inclusive but must
> +     * be set to a value that is larger than or equal to minimum Performance.
> +     */
> +    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM) &&
> +         (set_cppc->maximum > data->caps.highest_perf ||
> +          (set_cppc->maximum <
> +           (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM
> +            ? set_cppc->minimum
> +            : data->req.min_perf))) )
> +        return -EINVAL;
> +    /*
> +     * Minimum performance may be set to any performance value in the range
> +     * [Nonlinear Lowest Performance, Highest Performance], inclusive but must
> +     * be set to a value that is less than or equal to Maximum Performance.
> +     */
> +    if ( (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM) &&
> +         (set_cppc->minimum < data->caps.lowest_nonlinear_perf ||
> +          (set_cppc->minimum >
> +           (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM
> +            ? set_cppc->maximum
> +            : data->req.max_perf))) )
> +        return -EINVAL;
> +    /*
> +     * Desired performance may be set to any performance value in the range
> +     * [Minimum Performance, Maximum Performance], inclusive.
> +     */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +    {
> +        if ( active_mode )
> +            return -EOPNOTSUPP;
> +
> +        if ( (set_cppc->desired >
> +              (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM
> +               ? set_cppc->maximum
> +               : data->req.max_perf)) ||
> +             (set_cppc->desired <
> +              (set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM
> +               ? set_cppc->minimum
> +               : data->req.min_perf)) )
> +            return -EINVAL;
> +    }
> +    /*
> +     * Energy Performance Preference may be set with a range of values
> +     * from 0 to 0xFF
> +     */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
> +    {
> +        if ( !active_mode )
> +            return -EOPNOTSUPP;
> +
> +        if ( set_cppc->energy_perf > UINT8_MAX )
> +            return -EINVAL;
> +    }
> +
> +    /* Activity window not supported in MSR */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ACT_WINDOW )
> +        return -EOPNOTSUPP;
> +
> +    des_perf = data->req.des_perf;
> +    /*
> +     * Apply presets:
> +     * XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE/PERFORMANCE/ONDEMAND are
> +     * only available when CPPC in active mode
> +     */
> +    switch ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_PRESET_MASK )
> +    {
> +    case XEN_SYSCTL_CPPC_SET_PRESET_POWERSAVE:
> +        if ( !active_mode )
> +            return -EINVAL;
> +        policy->policy = CPUFREQ_POLICY_POWERSAVE;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_PERFORMANCE:
> +        if ( !active_mode )
> +            return -EINVAL;
> +        policy->policy = CPUFREQ_POLICY_PERFORMANCE;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_ONDEMAND:
> +        if ( !active_mode )
> +            return -EINVAL;
> +        policy->policy = CPUFREQ_POLICY_ONDEMAND;
> +        break;
> +
> +    case XEN_SYSCTL_CPPC_SET_PRESET_NONE:
> +        if ( active_mode )
> +            policy->policy = CPUFREQ_POLICY_UNKNOWN;
> +        break;
> +
> +    default:
> +        return -EINVAL;
> +    }
> +    amd_cppc_prepare_policy(policy, &max_perf, &min_perf, &epp);
> +
> +    /* Further customize presets if needed */
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MINIMUM )
> +        min_perf = set_cppc->minimum;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_MAXIMUM )
> +        max_perf = set_cppc->maximum;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_ENERGY_PERF )
> +        epp = set_cppc->energy_perf;
> +
> +    if ( set_cppc->set_params & XEN_SYSCTL_CPPC_SET_DESIRED )
> +        des_perf = set_cppc->desired;
> +
> +    amd_cppc_write_request(policy->cpu, data,

Like elsewhere, policy->cpu may not be online.
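
For readers following the validation in the quoted amd_cppc_set_para(), the
minimum/maximum checks can be condensed into the sketch below. It is a
standalone approximation: the SET_MINIMUM/SET_MAXIMUM flags, the caps and req
structs and check_limits() are hypothetical stand-ins for the
XEN_SYSCTL_CPPC_SET_* bits and the driver data, and the desired/EPP/preset
handling is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for XEN_SYSCTL_CPPC_SET_MINIMUM/MAXIMUM. */
#define SET_MINIMUM (1u << 0)
#define SET_MAXIMUM (1u << 1)

struct caps { uint8_t lowest_nonlinear, highest; };
struct req  { uint8_t min_perf, max_perf; };

static int check_limits(const struct caps *c, const struct req *cur,
                        unsigned int set_params,
                        unsigned int minimum, unsigned int maximum)
{
    /* A limit not being set falls back to the currently requested value. */
    unsigned int eff_min = (set_params & SET_MINIMUM) ? minimum
                                                      : cur->min_perf;
    unsigned int eff_max = (set_params & SET_MAXIMUM) ? maximum
                                                      : cur->max_perf;

    /* Only allow values if the corresponding params bit is set. */
    if ( (!(set_params & SET_MINIMUM) && minimum) ||
         (!(set_params & SET_MAXIMUM) && maximum) )
        return -EINVAL;

    /* Maximum must not exceed highest perf nor drop below the minimum. */
    if ( (set_params & SET_MAXIMUM) &&
         (maximum > c->highest || maximum < eff_min) )
        return -EINVAL;

    /* Minimum must not go below lowest nonlinear perf nor above maximum. */
    if ( (set_params & SET_MINIMUM) &&
         (minimum < c->lowest_nonlinear || minimum > eff_max) )
        return -EINVAL;

    return 0;
}
```

The point of the fallback is that a caller changing only one limit is still
checked against the other limit as currently programmed.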

> --- a/xen/drivers/acpi/pm-op.c
> +++ b/xen/drivers/acpi/pm-op.c
> @@ -84,6 +84,8 @@ static int get_cpufreq_cppc(unsigned int cpu,
>  
>      if ( hwp_active() )
>          ret = get_hwp_para(cpu, cppc_para);
> +    else if ( processor_pminfo[cpu]->init & XEN_CPPC_INIT )
> +        ret = amd_cppc_get_para(per_cpu(cpufreq_cpu_policy, cpu), cppc_para);
>  
>      return ret;
>  }
> @@ -154,6 +156,17 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>      else
>          strlcpy(op->u.get_para.scaling_driver, "Unknown", CPUFREQ_NAME_LEN);
>  
> +    /*
> +     * In CPPC active mode, we are borrowing governor field to indicate
> +     * policy info.
> +     */
> +    if ( policy->governor->name[0] )
> +        strlcpy(op->u.get_para.u.s.scaling_governor,
> +                policy->governor->name, CPUFREQ_NAME_LEN);
> +    else
> +        strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> +                CPUFREQ_NAME_LEN);
> +
>      if ( !cpufreq_is_governorless(op->cpuid) )
>      {
>          if ( !(scaling_available_governors =
> @@ -178,13 +191,6 @@ static int get_cpufreq_para(struct xen_sysctl_pm_op *op)
>          op->u.get_para.u.s.scaling_max_freq = policy->max;
>          op->u.get_para.u.s.scaling_min_freq = policy->min;
>  
> -        if ( policy->governor->name[0] )
> -            strlcpy(op->u.get_para.u.s.scaling_governor,
> -                    policy->governor->name, CPUFREQ_NAME_LEN);
> -        else
> -            strlcpy(op->u.get_para.u.s.scaling_governor, "Unknown",
> -                    CPUFREQ_NAME_LEN);

Just to re-iterate here: Pulling this out is okay because the union has
no other member anymore, and hence other data cannot be badly impacted.

Jan



* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-04 11:50   ` Jan Beulich
@ 2025-09-04 18:53     ` Jason Andryuk
  2025-09-08  9:35       ` Jan Beulich
       [not found]     ` <DM4PR12MB8451C5D54EFEC8F6E0B76E43E103A@DM4PR12MB8451.namprd12.prod.outlook.com>
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Andryuk @ 2025-09-04 18:53 UTC (permalink / raw)
  To: Jan Beulich, Penny Zheng; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 2025-09-04 07:50, Jan Beulich wrote:
> On 04.09.2025 08:35, Penny Zheng wrote:
>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is
>> only invoked on the firstcpu, so current per-CPU hwp driver data
>> struct hwp_drv_data{} actually fails to be allocated for cpus other than the
>> first one.
>> There is no need to make it per-CPU.
>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then cpus could
>> share the hwp driver data allocated for the firstcpu, like the way they share
>> struct cpufreq_policy{}. We also make it a union, with "hwp", and later
>> "amd-cppc" as a sub-struct.
> 
> And ACPI, as per my patch (which then will need re-basing).
> 
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
> 
> Not quite, this really is Reported-by: as it's a bug you fix, and in turn it
> also wants to gain a Fixes: tag. This also will need backporting.
> 
> It would also have been nice if you had Cc-ed Jason right away, seeing that
> this code was all written by him.
> 
>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>                                          unsigned int relation)
>>   {
>>       unsigned int cpu = policy->cpu;
>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>> +    struct hwp_drv_data *data = policy->u.hwp;
>>       /* Zero everything to ensure reserved bits are zero... */
>>       union hwp_request hwp_req = { .raw = 0 };
> 
> Further down in this same function we have
> 
>      on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> 
> That's similarly problematic when the CPU denoted by policy->cpu isn't
> online anymore. (It's not quite clear whether all related issues would
> want fixing together, or in multiple patches.)
> 
>> @@ -350,7 +349,7 @@ static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
>>   static void cf_check hwp_init_msrs(void *info)
>>   {
>>       struct cpufreq_policy *policy = info;
>> -    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
>> +    struct hwp_drv_data *data = policy->u.hwp;
>>       uint64_t val;
>>   
>>       /*
>> @@ -426,15 +425,14 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>   
>>       policy->governor = &cpufreq_gov_hwp;
>>   
>> -    per_cpu(hwp_drv_data, cpu) = data;
>> +    policy->u.hwp = data;
>>   
>>       on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> 
> If multiple CPUs are in a domain, not all of them will make it here. By
> implication the MSRs accessed by hwp_init_msrs() would need to have wider
> than thread scope. The SDM, afaics, says nothing either way in this regard
> in the Architectural MSRs section. Later model-specific tables have some
> data.

When I wrote the HWP driver, I expected there to be per-cpu 
hwp_drv_data.  policy->cpu looked like the correct way to identify each 
CPU.  I was unaware of the idea of cpufreq_domains, and didn't intend 
there to be any sharing.

> Which gets me back to my original question: Is "sharing" actually possible
> for HWP? Note further how there are both HWP_REQUEST and HWP_REQUEST_PKG
> MSRs, for example. Which one is (to be) used looks to be controlled by
> HWP_CTL.PKG_CTL_POLARITY.

I was aware of the Package Level MSRs, but chose not to support them. 
Topology information didn't seem readily available to the driver, and 
using non-Package Level MSRs is needed for backwards compatibility anyway.

I don't have access to an HWP system, so I cannot check if processors 
share a domain.  I'd feel a little silly if I only ever wrote to CPU 0 :/

I have no proof, but I want to say that at some point I had debug 
statements and saw hwp_cpufreq_target() called for each CPU.

Maybe forcing hw_all=1 in cpufreq_add_cpu()/cpufreq_del_cpu() would 
ensure per-cpu policies?

Regards,
Jason



* RE: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-04 12:04   ` Jan Beulich
@ 2025-09-05  5:15     ` Penny, Zheng
  2025-09-05  6:44       ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Penny, Zheng @ 2025-09-05  5:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, September 4, 2025 8:04 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>; Orzel,
> Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in
> passive mode
>
> On 04.09.2025 08:35, Penny Zheng wrote:
> > amd-cppc is the AMD CPU performance scaling driver that introduces a
> > new CPU frequency control mechanism. The new mechanism is based on
> > Collaborative Processor Performance Control (CPPC) which is a finer
> > grain frequency management than legacy ACPI hardware P-States.
> > Current AMD CPU platforms are using the ACPI P-states driver to manage
> > CPU frequency and clocks with switching only in 3 P-states, while the
> > new amd-cppc allows a more flexible, low-latency interface for Xen to
> > directly communicate the performance hints to hardware.
> >
> > "amd-cppc" driver is responsible for implementing CPPC in passive
> > mode, which still leverages Xen governors such as *ondemand*,
> > *performance*, etc, to calculate the performance hints. In the future,
> > we will introduce an advanced active mode to enable autonomous performance
> > level selection.
> >
> > Field epp, energy performance preference, which only has meaning when
> > active mode is enabled and will be introduced later in details, so we
> > read pre-defined BIOS value for it in passive mode.
> >
> > Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
> > Acked-by: Jan Beulich <jbeulich@suse.com>
>
> With the issue I had pointed out, leading to ...
>
> > ---
> > v8 -> v9
> > - embed struct amd_cppc_drv_data{} into struct cpufreq_policy{}
>
> ... this change, I think the tag would have needed to be dropped.
>

Understood, will remove

> > +static void cf_check amd_cppc_write_request_msrs(void *info)
> > +{
> > +    const struct amd_cppc_drv_data *data = info;
> > +
> > +    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
> > +}
> > +
> > +static void amd_cppc_write_request(unsigned int cpu,
> > +                                   struct amd_cppc_drv_data *data,
> > +                                   uint8_t min_perf, uint8_t des_perf,
> > +                                   uint8_t max_perf, uint8_t epp)
> > +{
> > +    uint64_t prev = data->req.raw;
> > +
> > +    data->req.min_perf = min_perf;
> > +    data->req.max_perf = max_perf;
> > +    data->req.des_perf = des_perf;
> > +    data->req.epp = epp;
> > +
> > +    if ( prev == data->req.raw )
> > +        return;
> > +
> > +    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
>
> With "cpu" coming from ...
>
> > +}
> > +
> > +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> > +                                            unsigned int target_freq,
> > +                                            unsigned int relation)
> > +{
> > +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> > +    uint8_t des_perf;
> > +    int res;
> > +
> > +    if ( unlikely(!target_freq) )
> > +        return 0;
> > +
> > +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
> > +    if ( res )
> > +        return res;
> > +
> > +    /*
> > +     * Having a performance level lower than the lowest nonlinear
> > +     * performance level, such as, lowest_perf <= perf <= lowest_nonliner_perf,
> > +     * may actually cause an efficiency penalty, So when deciding the min_perf
> > +     * value, we prefer lowest nonlinear performance over lowest performance.
> > +     */
> > +    amd_cppc_write_request(policy->cpu, data, data->caps.lowest_nonlinear_perf,
>
> ... here, how can this work when this particular CPU isn't online anymore?

Once any processor in the domain goes offline, the governor is stopped, so .target() can no longer be invoked:
```
        if ( hw_all || cpumask_weight(cpufreq_dom->map) == domain_info->num_processors )
                __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
```

>
> > +                           des_perf, data->caps.highest_perf,
> > +                           /* Pre-defined BIOS value for passive mode */
> > +                           per_cpu(epp_init, policy->cpu));
> > +    return 0;
> > +}
> > +
> > +static void cf_check amd_cppc_init_msrs(void *info)
> > +{
> > +    struct cpufreq_policy *policy = info;
> > +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> > +    uint64_t val;
> > +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
> > +
> > +    /* Package level MSR */
> > +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
>
> Here you clarify the scope, yet what about ...
>
> > +    /*
> > +     * Only when Enable bit is on, the hardware will calculate the processor’s
> > +     * performance capabilities and initialize the performance level fields in
> > +     * the CPPC capability registers.
> > +     */
> > +    if ( !(val & AMD_CPPC_ENABLE) )
> > +    {
> > +        val |= AMD_CPPC_ENABLE;
> > +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
> > +    }
> > +
> > +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
>
> ... this and ...
>
> > +    policy->cur = cpufreq_driver_getavg(policy->cpu, GOV_GETAVG);
> > +
> > +    /* Store pre-defined BIOS value for passive mode */
> > +    rdmsrl(MSR_AMD_CPPC_REQ, val);
>
> ... this?
>

They are all per-thread MSRs. I'll add descriptions.

> > +static int cf_check amd_cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > +    unsigned int cpu = policy->cpu;
> > +    struct amd_cppc_drv_data *data;
> > +
> > +    data = xvzalloc(struct amd_cppc_drv_data);
> > +    if ( !data )
> > +        return -ENOMEM;
> > +    policy->u.amd_cppc = data;
> > +
> > +    data->cppc_data = &processor_pminfo[cpu]->cppc_data;
> > +
> > +    on_selected_cpus(cpumask_of(cpu), amd_cppc_init_msrs, policy, 1);
> > +
> > +    /*
> > +     * The enable bit is sticky, as we need to enable it at the very first
> > +     * begining, before CPPC capability values sanity check.
> > +     * If error path is taken effective, not only amd-cppc cpufreq core fails
> > +     * to initialize, but also we could not fall back to legacy P-states
> > +     * driver, irrespective of the command line specifying a fallback option.
> > +     */
> > +    if ( data->err )
> > +    {
> > +        amd_cppc_err(cpu, "Could not initialize cpufreq core in CPPC mode\n");
> > +        amd_cppc_cpufreq_cpu_exit(policy);
> > +        return data->err;
>
> amd_cppc_cpufreq_cpu_exit() has already freed what data points to.
>

True, I'll record the error info
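
For illustration only, a minimal sketch of that kind of fix: the error code is
copied out of the driver data before the exit handler frees it. The types and
helpers below (drv_data, policy, cpu_exit(), cpu_init(), simulated_err) are
hypothetical stand-ins written in plain C, not the Xen definitions.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver data and the cpufreq policy. */
struct drv_data { int err; };
struct policy { struct drv_data *data; };

/* Plays the role of the exit handler: frees the driver data. */
static void cpu_exit(struct policy *p)
{
    free(p->data);
    p->data = NULL;
}

static int cpu_init(struct policy *p, int simulated_err)
{
    struct drv_data *data = calloc(1, sizeof(*data));
    int err;

    if ( !data )
        return -ENOMEM;
    p->data = data;

    /* Stands in for the result left behind by the MSR init callback. */
    data->err = simulated_err;

    /* Record the error first; data must not be touched after cpu_exit(). */
    err = data->err;
    if ( err )
    {
        cpu_exit(p);
        return err;
    }

    return 0;
}
```

Returning the local copy keeps the error path from reading memory that the
exit handler has already released.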

> > --- a/xen/include/acpi/cpufreq/cpufreq.h
> > +++ b/xen/include/acpi/cpufreq/cpufreq.h
> > @@ -63,6 +63,7 @@ struct perf_limits {
> >  };
> >
> >  struct hwp_drv_data;
> > +struct amd_cppc_drv_data;
> >  struct cpufreq_policy {
> >      cpumask_var_t       cpus;          /* affected CPUs */
> >      unsigned int        shared_type;   /* ANY or ALL affected CPUs
> > @@ -85,6 +86,9 @@ struct cpufreq_policy {
> >      union {
> >  #ifdef CONFIG_INTEL
> >          struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
> > +#endif
> > +#ifdef CONFIG_AMD
> > +        struct amd_cppc_drv_data *amd_cppc; /* Driver data for AMD CPPC */
> >  #endif
> >      } u;
> >  };
>
> Same comments here as for the HWP patch.

May I ask why a structure rather than a pointer here?

>
> Jan


* Re: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-05  5:15     ` Penny, Zheng
@ 2025-09-05  6:44       ` Jan Beulich
  2025-09-05  7:11         ` Penny, Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2025-09-05  6:44 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

On 05.09.2025 07:15, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Thursday, September 4, 2025 8:04 PM
>>
>> On 04.09.2025 08:35, Penny Zheng wrote:
>>> +static void cf_check amd_cppc_write_request_msrs(void *info)
>>> +{
>>> +    const struct amd_cppc_drv_data *data = info;
>>> +
>>> +    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
>>> +}
>>> +
>>> +static void amd_cppc_write_request(unsigned int cpu,
>>> +                                   struct amd_cppc_drv_data *data,
>>> +                                   uint8_t min_perf, uint8_t des_perf,
>>> +                                   uint8_t max_perf, uint8_t epp)
>>> +{
>>> +    uint64_t prev = data->req.raw;
>>> +
>>> +    data->req.min_perf = min_perf;
>>> +    data->req.max_perf = max_perf;
>>> +    data->req.des_perf = des_perf;
>>> +    data->req.epp = epp;
>>> +
>>> +    if ( prev == data->req.raw )
>>> +        return;
>>> +
>>> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs, data, 1);
>>
>> With "cpu" coming from ...
>>
>>> +}
>>> +
>>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
>>> +                                            unsigned int target_freq,
>>> +                                            unsigned int relation)
>>> +{
>>> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
>>> +    uint8_t des_perf;
>>> +    int res;
>>> +
>>> +    if ( unlikely(!target_freq) )
>>> +        return 0;
>>> +
>>> +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
>>> +    if ( res )
>>> +        return res;
>>> +
>>> +    /*
>>> +     * Having a performance level lower than the lowest nonlinear
>>> +     * performance level, such as, lowest_perf <= perf <= lowest_nonliner_perf,
>>> +     * may actually cause an efficiency penalty, So when deciding the min_perf
>>> +     * value, we prefer lowest nonlinear performance over lowest performance.
>>> +     */
>>> +    amd_cppc_write_request(policy->cpu, data,
>>> +                           data->caps.lowest_nonlinear_perf,
>>
>> ... here, how can this work when this particular CPU isn't online anymore?
> 
> Once any processor in the domain goes offline, the governor stops, so .target() can no longer be invoked:
> ```
>         if ( hw_all || cpumask_weight(cpufreq_dom->map) == domain_info->num_processors )
>                 __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> ```

I can't bring the code in line with what you say.

>>> +                           des_perf, data->caps.highest_perf,
>>> +                           /* Pre-defined BIOS value for passive mode */
>>> +                           per_cpu(epp_init, policy->cpu));
>>> +    return 0;
>>> +}
>>> +
>>> +static void cf_check amd_cppc_init_msrs(void *info)
>>> +{
>>> +    struct cpufreq_policy *policy = info;
>>> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
>>> +    uint64_t val;
>>> +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
>>> +
>>> +    /* Package level MSR */
>>> +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
>>
>> Here you clarify the scope, yet what about ...
>>
>>> +    /*
>>> +     * Only when Enable bit is on, the hardware will calculate the processor’s
>>> +     * performance capabilities and initialize the performance level fields in
>>> +     * the CPPC capability registers.
>>> +     */
>>> +    if ( !(val & AMD_CPPC_ENABLE) )
>>> +    {
>>> +        val |= AMD_CPPC_ENABLE;
>>> +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
>>> +    }
>>> +
>>> +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
>>
>> ... this and ...
>>
> GOV_GETAVG);
>>> +
>>> +    /* Store pre-defined BIOS value for passive mode */
>>> +    rdmsrl(MSR_AMD_CPPC_REQ, val);
>>
>> ... this?
> 
> They are all per-thread MSRs. I'll add descriptions.

If they're per-thread, coordination will be yet more difficult if any domain
had more than one thread in it. So question again: Is it perhaps disallowed
by the spec for there to be any "domain" covering more than a single thread?

>>> --- a/xen/include/acpi/cpufreq/cpufreq.h
>>> +++ b/xen/include/acpi/cpufreq/cpufreq.h
>>> @@ -63,6 +63,7 @@ struct perf_limits {  };
>>>
>>>  struct hwp_drv_data;
>>> +struct amd_cppc_drv_data;
>>>  struct cpufreq_policy {
>>>      cpumask_var_t       cpus;          /* affected CPUs */
>>>      unsigned int        shared_type;   /* ANY or ALL affected CPUs
>>> @@ -85,6 +86,9 @@ struct cpufreq_policy {
>>>      union {
>>>  #ifdef CONFIG_INTEL
>>>          struct hwp_drv_data *hwp; /* Driver data for Intel HWP */
>>> +#endif
>>> +#ifdef CONFIG_AMD
>>> +        struct amd_cppc_drv_data *amd_cppc; /* Driver data for AMD CPPC */
>>>  #endif
>>>      } u;
>>>  };
>>
>> Same comments here as for the HWP patch.
> 
> May I ask why structure over pointer here?

Efficiency: Fewer allocations, and one less indirection level. For relatively
small structures you also want to consider the storage overhead of the extra
pointer.
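
To illustrate the trade-off, here is a minimal standalone sketch. It is not
Xen code: struct drv_data, struct policy_ptr, struct policy_embedded and the
helper functions are all made-up names, and calloc()/free() merely stand in
for whatever allocator the hypervisor would use. It contrasts a policy that
holds a pointer to separately allocated driver data with one that embeds the
data directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a small driver-data structure. */
struct drv_data {
    uint64_t req;
    uint64_t caps;
};

/* Variant A: the policy only holds a pointer; the data lives elsewhere. */
struct policy_ptr {
    unsigned int cpu;
    struct drv_data *data;   /* extra pointer plus a separate allocation */
};

/* Variant B: the data is embedded directly in the policy. */
struct policy_embedded {
    unsigned int cpu;
    struct drv_data data;    /* no extra pointer, no second allocation */
};

/* Embedding never needs more storage than pointer plus pointee. */
static_assert(sizeof(struct policy_embedded) <=
              sizeof(struct policy_ptr) + sizeof(struct drv_data),
              "embedded variant should not be larger");

/* Variant A needs two allocations, either of which can fail. */
struct policy_ptr *policy_ptr_alloc(void)
{
    struct policy_ptr *p = calloc(1, sizeof(*p));

    if ( !p )
        return NULL;

    p->data = calloc(1, sizeof(*p->data));
    if ( !p->data )
    {
        free(p);
        return NULL;
    }

    return p;
}

void policy_ptr_free(struct policy_ptr *p)
{
    if ( p )
    {
        free(p->data);
        free(p);
    }
}

/* Variant B needs one allocation; fields are reached without a second load. */
struct policy_embedded *policy_embedded_alloc(void)
{
    return calloc(1, sizeof(struct policy_embedded));
}

/* Total bytes of storage needed per policy for each variant. */
size_t footprint_ptr(void)
{
    return sizeof(struct policy_ptr) + sizeof(struct drv_data);
}

size_t footprint_embedded(void)
{
    return sizeof(struct policy_embedded);
}
```

With the pointer variant every access goes through p->data->req, and the
error path has to unwind a partially built object; the embedded variant
avoids both, at the cost of the union growing to the size of its largest
member.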

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
  2025-09-05  6:44       ` Jan Beulich
@ 2025-09-05  7:11         ` Penny, Zheng
  0 siblings, 0 replies; 29+ messages in thread
From: Penny, Zheng @ 2025-09-05  7:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini,
	xen-devel@lists.xenproject.org

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, September 5, 2025 2:45 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>; Orzel,
> Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in
> passive mode
>
> On 05.09.2025 07:15, Penny, Zheng wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Thursday, September 4, 2025 8:04 PM
> >>
> >> On 04.09.2025 08:35, Penny Zheng wrote:
> >>> +static void cf_check amd_cppc_write_request_msrs(void *info)
> >>> +{
> >>> +    const struct amd_cppc_drv_data *data = info;
> >>> +
> >>> +    wrmsrl(MSR_AMD_CPPC_REQ, data->req.raw);
> >>> +}
> >>> +
> >>> +static void amd_cppc_write_request(unsigned int cpu,
> >>> +                                   struct amd_cppc_drv_data *data,
> >>> +                                   uint8_t min_perf, uint8_t des_perf,
> >>> +                                   uint8_t max_perf, uint8_t epp)
> >>> +{
> >>> +    uint64_t prev = data->req.raw;
> >>> +
> >>> +    data->req.min_perf = min_perf;
> >>> +    data->req.max_perf = max_perf;
> >>> +    data->req.des_perf = des_perf;
> >>> +    data->req.epp = epp;
> >>> +
> >>> +    if ( prev == data->req.raw )
> >>> +        return;
> >>> +
> >>> +    on_selected_cpus(cpumask_of(cpu), amd_cppc_write_request_msrs,
> >>> +                     data, 1);
> >>
> >> With "cpu" coming from ...
> >>
> >>> +}
> >>> +
> >>> +static int cf_check amd_cppc_cpufreq_target(struct cpufreq_policy *policy,
> >>> +                                            unsigned int target_freq,
> >>> +                                            unsigned int relation)
> >>> +{
> >>> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> >>> +    uint8_t des_perf;
> >>> +    int res;
> >>> +
> >>> +    if ( unlikely(!target_freq) )
> >>> +        return 0;
> >>> +
> >>> +    res = amd_cppc_khz_to_perf(data, target_freq, &des_perf);
> >>> +    if ( res )
> >>> +        return res;
> >>> +
> >>> +    /*
> >>> +     * Having a performance level lower than the lowest nonlinear
> >>> +     * performance level, such as, lowest_perf <= perf <= lowest_nonliner_perf,
> >>> +     * may actually cause an efficiency penalty, So when deciding the min_perf
> >>> +     * value, we prefer lowest nonlinear performance over lowest performance.
> >>> +     */
> >>> +    amd_cppc_write_request(policy->cpu, data,
> >>> +                           data->caps.lowest_nonlinear_perf,
> >>
> >> ... here, how can this work when this particular CPU isn't online anymore?
> >
> > Once any processor in the domain goes offline, the governor stops, so
> > .target() can no longer be invoked:
> > ```
> >         if ( hw_all || cpumask_weight(cpufreq_dom->map) == domain_info->num_processors )
> >                 __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> > ```
>
> I can't bring the code in line with what you say.

Only when all processors in the domain are online does the weight equal "num_processors". That is, the governor stops when the *first* processor tries to go offline.
Once the governor has stopped, cpufreq->target() will not be executed any more.
Also, in __cpufreq_driver_target(), we do the cpu_online(policy->cpu) check to ensure that the CPU registered in policy->cpu is online.
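
The argument above can be modelled with a small standalone sketch. This is
not the Xen implementation: struct dom_model, model_del_cpu() and
model_target_allowed() are hypothetical names, a plain bit mask stands in for
the cpumask, and the two checks are only a simplified rendering of what the
explanation describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of one cpufreq domain. */
struct dom_model {
    unsigned long map;           /* bit n set: CPU n is online in the domain */
    unsigned int num_processors; /* "Num Processors" as reported by _PSD */
    bool governor_running;
};

/* Count the set bits in the map. */
static unsigned int weight(unsigned long map)
{
    unsigned int n = 0;

    for ( ; map; map &= map - 1 )
        n++;

    return n;
}

/* Models the check made before a CPU is removed from the domain. */
void model_del_cpu(struct dom_model *d, unsigned int cpu, bool hw_all)
{
    assert(d->map & (1UL << cpu));

    /* The map is still complete when the first CPU leaves. */
    if ( hw_all || weight(d->map) == d->num_processors )
        d->governor_running = false;   /* stands in for CPUFREQ_GOV_STOP */

    d->map &= ~(1UL << cpu);
}

/* Models the guards in front of the driver's .target() hook. */
bool model_target_allowed(const struct dom_model *d, unsigned int policy_cpu)
{
    return d->governor_running && (d->map & (1UL << policy_cpu)) != 0;
}
```

In this model, taking any one CPU of a fully online four-CPU domain offline
stops the governor, so model_target_allowed() returns false from then on.
Whether the real code paths behave the same way is exactly the point under
discussion.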

>
> >>> +                           des_perf, data->caps.highest_perf,
> >>> +                           /* Pre-defined BIOS value for passive mode */
> >>> +                           per_cpu(epp_init, policy->cpu));
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static void cf_check amd_cppc_init_msrs(void *info)
> >>> +{
> >>> +    struct cpufreq_policy *policy = info;
> >>> +    struct amd_cppc_drv_data *data = policy->u.amd_cppc;
> >>> +    uint64_t val;
> >>> +    unsigned int min_freq = 0, nominal_freq = 0, max_freq;
> >>> +
> >>> +    /* Package level MSR */
> >>> +    rdmsrl(MSR_AMD_CPPC_ENABLE, val);
> >>
> >> Here you clarify the scope, yet what about ...
> >>
> >>> +    /*
> >>> +     * Only when Enable bit is on, the hardware will calculate the processor’s
> >>> +     * performance capabilities and initialize the performance level fields in
> >>> +     * the CPPC capability registers.
> >>> +     */
> >>> +    if ( !(val & AMD_CPPC_ENABLE) )
> >>> +    {
> >>> +        val |= AMD_CPPC_ENABLE;
> >>> +        wrmsrl(MSR_AMD_CPPC_ENABLE, val);
> >>> +    }
> >>> +
> >>> +    rdmsrl(MSR_AMD_CPPC_CAP1, data->caps.raw);
> >>
> >> ... this and ...
> >>
> > GOV_GETAVG);
> >>> +
> >>> +    /* Store pre-defined BIOS value for passive mode */
> >>> +    rdmsrl(MSR_AMD_CPPC_REQ, val);
> >>
> >> ... this?
> >
> > They are all per-thread MSRs. I'll add descriptions.
>
> If they're per-thread, coordination will be yet more difficult if any domain had more
> than one thread in it. So question again: Is it perhaps disallowed by the spec for
> there to be any "domain" covering more than a single thread?
>

I'll double-check with the hardware team about it.

Also, maybe Xen's current code already covers SW_ANY coordination. With the SW_ANY coordination type, the OS needs to coordinate the state for all processors in the domain by making a state request on the control interface of *only one* processor in the domain. In Xen, I guess, the "only one" is the CPU registered in policy->cpu.
But for SW_ALL, OSPM coordinates the state for all processors in the domain by making the same state request on the control interface of *each processor* in the domain. I haven't seen any code coordinating that synchronization in Xen.
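
The difference between the two software coordination types can be sketched
as follows. This is an illustration only: enum coord_type, apply_request(),
read_req() and the fake_req_msr[] array are hypothetical, and a plain array
stands in for the per-thread request register of each CPU.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 8

/* Hypothetical names mirroring the two _PSD coordination types above. */
enum coord_type { COORD_SW_ALL, COORD_SW_ANY };

/* Stand-in for the per-thread request register of each CPU. */
static uint64_t fake_req_msr[MAX_CPUS];

uint64_t read_req(unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    return fake_req_msr[cpu];
}

/*
 * Apply one request to a domain and return the number of register writes.
 * domain_mask: bit n set means CPU n belongs to the domain.
 * policy_cpu:  the single CPU recorded for the policy.
 */
unsigned int apply_request(enum coord_type type, unsigned int domain_mask,
                           unsigned int policy_cpu, uint64_t req)
{
    unsigned int cpu, writes = 0;

    assert(policy_cpu < MAX_CPUS && (domain_mask & (1u << policy_cpu)));

    if ( type == COORD_SW_ALL )
    {
        /* Software makes the same request on every CPU of the domain. */
        for ( cpu = 0; cpu < MAX_CPUS; cpu++ )
        {
            if ( domain_mask & (1u << cpu) )
            {
                fake_req_msr[cpu] = req;
                writes++;
            }
        }

        return writes;
    }

    /* SW_ANY: a request on one CPU of the domain is sufficient. */
    fake_req_msr[policy_cpu] = req;

    return 1;
}
```

For a four-CPU domain, the SW_ANY path performs a single write on
policy_cpu, whereas the SW_ALL path performs four identical writes; the
latter is the loop that, per the observation above, does not seem to exist
in Xen today.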

> Jan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-04 18:53     ` Jason Andryuk
@ 2025-09-08  9:35       ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-08  9:35 UTC (permalink / raw)
  To: Jason Andryuk; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Penny Zheng

On 04.09.2025 20:53, Jason Andryuk wrote:
> On 2025-09-04 07:50, Jan Beulich wrote:
>> On 04.09.2025 08:35, Penny Zheng wrote:
>>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>>                                          unsigned int relation)
>>>   {
>>>       unsigned int cpu = policy->cpu;
>>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>> +    struct hwp_drv_data *data = policy->u.hwp;
>>>       /* Zero everything to ensure reserved bits are zero... */
>>>       union hwp_request hwp_req = { .raw = 0 };
>>
>> Further down in this same function we have
>>
>>      on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>>
>> That's similarly problematic when the CPU denoted by policy->cpu isn't
>> online anymore. (It's not quite clear whether all related issues would
>> want fixing together, or in multiple patches.)
>>
>>> @@ -350,7 +349,7 @@ static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
>>>   static void cf_check hwp_init_msrs(void *info)
>>>   {
>>>       struct cpufreq_policy *policy = info;
>>> -    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
>>> +    struct hwp_drv_data *data = policy->u.hwp;
>>>       uint64_t val;
>>>   
>>>       /*
>>> @@ -426,15 +425,14 @@ static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>>   
>>>       policy->governor = &cpufreq_gov_hwp;
>>>   
>>> -    per_cpu(hwp_drv_data, cpu) = data;
>>> +    policy->u.hwp = data;
>>>   
>>>       on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
>>
>> If multiple CPUs are in a domain, not all of them will make it here. By
>> implication the MSRs accessed by hwp_init_msrs() would need to have wider
>> than thread scope. The SDM, afaics, says nothing either way in this regard
>> in the Architectural MSRs section. Later model-specific tables have some
>> data.
> 
> When I wrote the HWP driver, I expected there to be per-cpu 
> hwp_drv_data.  policy->cpu looked like the correct way to identify each 
> CPU.  I was unaware of the idea of cpufreq_domains, and didn't intend 
> there to be any sharing.
> 
>> Which gets me back to my original question: Is "sharing" actually possible
>> for HWP? Note further how there are both HWP_REQUEST and HWP_REQUEST_PKG
>> MSRs, for example. Which one is (to be) used looks to be controlled by
>> HWP_CTL.PKG_CTL_POLARITY.
> 
> I was aware of the Package Level MSRs, but chose not to support them. 
> Topology information didn't seem readily available to the driver, and 
> using non-Package Level MSRs is needed for backwards compatibility anyway.
> 
> I don't have access to an HWP system, so I cannot check if processors 
> share a domain.  I'd feel a little silly if I only ever wrote to CPU 0 :/
> 
> I have no proof, but I want to say that at some point I had debug 
> statements and saw hwp_cpufreq_target() called for each CPU.
> 
> Maybe forcing hw_all=1 in cpufreq_add_cpu()/cpufreq_del_cpu() would 
> ensure per-cpu policies?

No, domain info is handed to us from ACPI (_PSD). That's what
cpufreq_add_cpu() evaluates. Therefore if there was evidence that HWP (and
CPPC) can only ever have single-CPU domains, we could refuse such _PSD
being handed to us (ideally already in set_px_pminfo()). But I don't think
we can just go and override what we were told. I fear though that the mere
existence of a package-level (alternative) MSR suggests that such
configurations might be possible.
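
Purely as an illustration of what refusing such a _PSD could look like, here
is a standalone sketch. It is not taken from set_px_pminfo(): struct
psd_info is a trimmed-down, hypothetical view of the _PSD package, and
check_psd() and its per_thread_only parameter are made-up names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the _PSD fields that matter here. */
struct psd_info {
    uint64_t domain;
    uint64_t coord_type;
    uint64_t num_processors;
};

/*
 * Accept only single-CPU domains for drivers known to program per-thread
 * registers; otherwise accept any non-empty domain.
 */
int check_psd(const struct psd_info *psd, int per_thread_only)
{
    assert(psd != NULL);

    if ( !psd->num_processors )
        return -EINVAL;

    if ( per_thread_only && psd->num_processors > 1 )
        return -EOPNOTSUPP;

    return 0;
}
```

Such a check would only be justified if the hardware specifications ruled
out multi-CPU domains for these drivers, which is the open question here.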

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
       [not found]     ` <DM4PR12MB8451C5D54EFEC8F6E0B76E43E103A@DM4PR12MB8451.namprd12.prod.outlook.com>
@ 2025-09-08 10:02       ` Jan Beulich
  2025-09-08 11:28         ` Penny, Zheng
  2025-09-15  3:49         ` Penny, Zheng
  0 siblings, 2 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-08 10:02 UTC (permalink / raw)
  To: Penny, Zheng; +Cc: Andryuk, Jason, xen-devel@lists.xenproject.org

(re-adding the list)

On 05.09.2025 06:58, Penny, Zheng wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Thursday, September 4, 2025 7:51 PM
>> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
>> <Jason.Andryuk@amd.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
>>
>> On 04.09.2025 08:35, Penny Zheng wrote:
>>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is only
>>> invoked on the firstcpu, so current per-CPU hwp driver data struct
>>> hwp_drv_data{} actually fails to be allocated for cpus other than the
>>> first one. There is no need to make it per-CPU.
>>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then cpus
>>> could share the hwp driver data allocated for the firstcpu, like the
>>> way they share struct cpufreq_policy{}. We also make it a union, with
>>> "hwp", and later "amd-cppc" as a sub-struct.
>>
>> And ACPI, as per my patch (which then will need re-basing).
>>
>>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>>
>> Not quite, this really is Reported-by: as it's a bug you fix, and in turn it also wants to
>> gain a Fixes: tag. This also will need backporting.
>>
>> It would also have been nice if you had Cc-ed Jason right away, seeing that this
>> code was all written by him.
>>
>>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>>                                         unsigned int relation)
>>>  {
>>>      unsigned int cpu = policy->cpu;
>>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>> +    struct hwp_drv_data *data = policy->u.hwp;
>>>      /* Zero everything to ensure reserved bits are zero... */
>>>      union hwp_request hwp_req = { .raw = 0 };
>>
>> Further down in this same function we have
>>
>>     on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>>
>> That's similarly problematic when the CPU denoted by policy->cpu isn't online
>> anymore. (It's not quite clear whether all related issues would want fixing together,
>> or in multiple patches.)
> 
> Checking the logic in cpufreq_del_cpu(), once any processor in the
> domain gets offline, the governor will stop.

Yet with HWP and CPPC drivers being governor-less, how would that matter?

> That is to say, the cpufreq driver is only effective while all processors in the domain are online, which also complies with the description of "Num Processors" for _PSD in the ACPI spec [1]:
> ```
> The number of processors belonging to the domain for this logical processor’s P-states. OSPM will not start performing power state transitions to a particular P-state until this number of processors belonging to the same domain have been detected and started.
> ```
> [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#pstatedependency-package-values
> 
> I know that AMD CPPC obeys the _PSD dependency relation too, even though both the CPPC Request register (ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B3) and the CPPC Capability register (_ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B0) are per-thread MSRs.
> I don't have the hardware to test the "sharing" logic. All my platforms say "HW_ALL" in _PSD.

Aiui that's not mandated by the CPU spec, though. Plus HW_ALL still doesn't say
anything about the scope/size of each domain.

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-08 10:02       ` Jan Beulich
@ 2025-09-08 11:28         ` Penny, Zheng
  2025-09-08 11:31           ` Jan Beulich
  2025-09-15  3:49         ` Penny, Zheng
  1 sibling, 1 reply; 29+ messages in thread
From: Penny, Zheng @ 2025-09-08 11:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andryuk, Jason, xen-devel@lists.xenproject.org

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, September 8, 2025 6:02 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Andryuk, Jason <Jason.Andryuk@amd.com>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
>
> (re-adding the list)
>
> On 05.09.2025 06:58, Penny, Zheng wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Thursday, September 4, 2025 7:51 PM
> >> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
> >> <Jason.Andryuk@amd.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
> >> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct
> >> cpufreq_policy{}
> >>
> >> On 04.09.2025 08:35, Penny Zheng wrote:
> >>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is only
> >>> invoked on the firstcpu, so current per-CPU hwp driver data struct
> >>> hwp_drv_data{} actually fails to be allocated for cpus other than
> >>> the first one. There is no need to make it per-CPU.
> >>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then
> >>> cpus could share the hwp driver data allocated for the firstcpu,
> >>> like the way they share struct cpufreq_policy{}. We also make it a
> >>> union, with "hwp", and later "amd-cppc" as a sub-struct.
> >>
> >> And ACPI, as per my patch (which then will need re-basing).
> >>
> >>> Suggested-by: Jan Beulich <jbeulich@suse.com>
> >>
> >> Not quite, this really is Reported-by: as it's a bug you fix, and in
> >> turn it also wants to gain a Fixes: tag. This also will need backporting.
> >>
> >> It would also have been nice if you had Cc-ed Jason right away,
> >> seeing that this code was all written by him.
> >>
> >>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> >>>                                         unsigned int relation)
> >>>  {
> >>>      unsigned int cpu = policy->cpu;
> >>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> >>> +    struct hwp_drv_data *data = policy->u.hwp;
> >>>      /* Zero everything to ensure reserved bits are zero... */
> >>>      union hwp_request hwp_req = { .raw = 0 };
> >>
> >> Further down in this same function we have
> >>
> >>     on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> >>
> >> That's similarly problematic when the CPU denoted by policy->cpu
> >> isn't online anymore. (It's not quite clear whether all related
> >> issues would want fixing together, or in multiple patches.)
> >
> > Checking the logic in cpufreq_del_cpu(), once any processor in the
> > domain gets offline, the governor will stop.
>
> Yet with HWP and CPPC drivers being governor-less, how would that matter?

In CPPC passive mode, we still rely on the governor to do performance tuning.
In CPPC active mode, yes, it is governor-less. How about we disable the CPPC-enable bit for the offline CPUs?

> > That is to say, the cpufreq driver is only effective while all processors in
> > the domain are online, which also complies with the description of
> > "Num Processors" for _PSD in the ACPI spec [1]:
> > ```
> > The number of processors belonging to the domain for this logical processor’s P-states. OSPM will not start performing power state transitions to a particular P-state until this number of processors belonging to the same domain have been detected and started.
> > ```
> > [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#pstatedependency-package-values
> >
> > I know that AMD CPPC obeys the _PSD dependency relation too, even though both the CPPC Request register (ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B3) and the CPPC Capability register (_ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B0) are per-thread MSRs.
> > I don't have the hardware to test the "sharing" logic. All my platforms say "HW_ALL" in _PSD.
>
> Aiui that's not mandated by the CPU spec, though. Plus HW_ALL still doesn't say
> anything about the scope/size of each domain.
>
> Jan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-08 11:28         ` Penny, Zheng
@ 2025-09-08 11:31           ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-08 11:31 UTC (permalink / raw)
  To: Penny, Zheng; +Cc: Andryuk, Jason, xen-devel@lists.xenproject.org

On 08.09.2025 13:28, Penny, Zheng wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, September 8, 2025 6:02 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Andryuk, Jason <Jason.Andryuk@amd.com>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
>>
>> (re-adding the list)
>>
>> On 05.09.2025 06:58, Penny, Zheng wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Thursday, September 4, 2025 7:51 PM
>>>> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
>>>> <Jason.Andryuk@amd.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
>>>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
>>>> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct
>>>> cpufreq_policy{}
>>>>
>>>> On 04.09.2025 08:35, Penny Zheng wrote:
>>>>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is only
>>>>> invoked on the firstcpu, so current per-CPU hwp driver data struct
>>>>> hwp_drv_data{} actually fails to be allocated for cpus other than
>>>>> the first one. There is no need to make it per-CPU.
>>>>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then
>>>>> cpus could share the hwp driver data allocated for the firstcpu,
>>>>> like the way they share struct cpufreq_policy{}. We also make it a
>>>>> union, with "hwp", and later "amd-cppc" as a sub-struct.
>>>>
>>>> And ACPI, as per my patch (which then will need re-basing).
>>>>
>>>>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> Not quite, this really is Reported-by: as it's a bug you fix, and in
>>>> turn it also wants to gain a Fixes: tag. This also will need backporting.
>>>>
>>>> It would also have been nice if you had Cc-ed Jason right away,
>>>> seeing that this code was all written by him.
>>>>
>>>>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>>>>                                         unsigned int relation)
>>>>>  {
>>>>>      unsigned int cpu = policy->cpu;
>>>>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>>>> +    struct hwp_drv_data *data = policy->u.hwp;
>>>>>      /* Zero everything to ensure reserved bits are zero... */
>>>>>      union hwp_request hwp_req = { .raw = 0 };
>>>>
>>>> Further down in this same function we have
>>>>
>>>>     on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>>>>
>>>> That's similarly problematic when the CPU denoted by policy->cpu
>>>> isn't online anymore. (It's not quite clear whether all related
>>>> issues would want fixing together, or in multiple patches.)
>>>
>>> Checking the logic in cpufreq_del_cpu(), once any processor in the
>>> domain gets offline, the governor will stop.
>>
>> Yet with HWP and CPPC drivers being governor-less, how would that matter?
> 
> In CPPC passive mode, we still rely on the governor to do performance tuning.
> In CPPC active mode, yes, it is governor-less. How about we disable the
> CPPC-enable bit for the offline CPUs?

Didn't you say that's a sticky bit? Plus how would this help with the issue
at hand?

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver
  2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
                   ` (7 preceding siblings ...)
  2025-09-04  6:35 ` [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support Penny Zheng
@ 2025-09-09 16:10 ` Jan Beulich
  2025-09-10  9:27   ` Penny, Zheng
  8 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2025-09-09 16:10 UTC (permalink / raw)
  To: Penny Zheng
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, Juergen Gross, Oleksii Kurochko,
	Community Manager, xen-devel

On 04.09.2025 08:35, Penny Zheng wrote:
> amd-cppc is the AMD CPU performance scaling driver that introduces a
> new CPU frequency control mechanism on modern AMD APU and CPU series in
> Xen. The new mechanism is based on Collaborative Processor Performance
> Control (CPPC) which provides finer grain frequency management than
> legacy ACPI hardware P-States. Current AMD CPU/APU platforms are using
> the ACPI P-states driver to manage CPU frequency and clocks with
> switching only in 3 P-states. CPPC replaces the ACPI P-states controls
> and allows a flexible, low-latency interface for Xen to directly
> communicate the performance hints to hardware.
> 
> amd_cppc driver has 2 operation modes: autonomous (active) mode,
> and non-autonomous (passive) mode. We register different CPUFreq driver
> for different modes, "amd-cppc" for passive mode and "amd-cppc-epp"
> for active mode.
> 
> The passive mode leverages common governors such as *ondemand*,
> *performance*, etc, to manage the performance tuning. While the active mode
> uses epp to provides a hint to the hardware if software wants to bias
> toward performance (0x0) or energy efficiency (0xff). CPPC power algorithm
> in hardware will automatically calculate the runtime workload and adjust the
> realtime cpu cores frequency according to the power supply and thermal, core
> voltage and some other hardware conditions.
> 
> amd-cppc is enabled on passive mode with a top-level `cpufreq=amd-cppc` option,
> while users add extra `active` flag to select active mode.
> 
> With `cpufreq=amd-cppc,active`, we did a 60s sampling test to see the CPU
> frequency change, through tweaking the energy_perf preference from
> `xenpm set-cpufreq-cppc powersave` to `xenpm set-cpufreq-cppc performance`.
> The outputs are as follows:
> ```
> Setting CPU in powersave mode
> Sampling and Outputs:
>   Avg freq      580000 KHz
>   Avg freq      580000 KHz
>   Avg freq      580000 KHz
> Setting CPU in performance mode
> Sampling and Outputs:
>   Avg freq      4640000 KHz
>   Avg freq      4220000 KHz
>   Avg freq      4640000 KHz
> ```
> 
> Penny Zheng (8):
>   xen/cpufreq: embed hwp into struct cpufreq_policy{}
>   xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
>   xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
>   xen/cpufreq: get performance policy from governor set via xenpm
>   tools/cpufreq: extract CPPC para from cpufreq para
>   xen/cpufreq: bypass governor-related para for amd-cppc-epp
>   xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc
>     driver
>   CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support
> 
>  CHANGELOG.md                         |   1 +
>  docs/misc/xen-command-line.pandoc    |   9 +-
>  tools/include/xenctrl.h              |   3 +-
>  tools/libs/ctrl/xc_pm.c              |  25 +-
>  tools/misc/xenpm.c                   |  94 ++--
>  xen/arch/x86/acpi/cpufreq/amd-cppc.c | 703 ++++++++++++++++++++++++++-
>  xen/arch/x86/acpi/cpufreq/hwp.c      |  32 +-
>  xen/arch/x86/cpu/amd.c               |   8 +-
>  xen/arch/x86/include/asm/amd.h       |   2 +
>  xen/arch/x86/include/asm/msr-index.h |   6 +
>  xen/drivers/acpi/pm-op.c             |  58 ++-
>  xen/drivers/cpufreq/utility.c        |  15 +
>  xen/include/acpi/cpufreq/cpufreq.h   |  44 ++
>  xen/include/public/sysctl.h          |   5 +-
>  14 files changed, 936 insertions(+), 69 deletions(-)

As we're considering our options towards getting this work in, can you clarify
two things please:
(1) In v9, the sole changes were related to the use of per-CPU data and the
    adding of a ChangeLog entry?
(2) The driver is inactive by default, i.e. requires use of the command line
    option to come into play?

If the answer to both is yes, we're leaning towards committing v8 plus the
ChangeLog entry.

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver
  2025-09-09 16:10 ` [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Jan Beulich
@ 2025-09-10  9:27   ` Penny, Zheng
  2025-09-10  9:46     ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Penny, Zheng @ 2025-09-10  9:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini, Juergen Gross,
	Oleksii Kurochko, Community Manager,
	xen-devel@lists.xenproject.org

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, September 10, 2025 12:11 AM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>; Anthony PERARD <anthony.perard@vates.tech>; Orzel,
> Michal <Michal.Orzel@amd.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; Juergen Gross <jgross@suse.com>; Oleksii
> Kurochko <oleksii.kurochko@gmail.com>; Community Manager
> <community.manager@xenproject.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver
>
> On 04.09.2025 08:35, Penny Zheng wrote:
> > amd-cppc is the AMD CPU performance scaling driver that introduces a
> > new CPU frequency control mechanism on modern AMD APU and CPU series
> > in Xen. The new mechanism is based on Collaborative Processor
> > Performance Control (CPPC) which provides finer grain frequency
> > management than legacy ACPI hardware P-States. Current AMD CPU/APU
> > platforms are using the ACPI P-states driver to manage CPU frequency
> > and clocks with switching only in 3 P-states. CPPC replaces the ACPI
> > P-states controls and allows a flexible, low-latency interface for Xen
> > to directly communicate the performance hints to hardware.
> >
> > amd_cppc driver has 2 operation modes: autonomous (active) mode, and
> > non-autonomous (passive) mode. We register different CPUFreq driver
> > for different modes, "amd-cppc" for passive mode and "amd-cppc-epp"
> > for active mode.
> >
> > The passive mode leverages common governors such as *ondemand*,
> > *performance*, etc, to manage the performance tuning. While the active
> > mode uses epp to provides a hint to the hardware if software wants to
> > bias toward performance (0x0) or energy efficiency (0xff). CPPC power
> > algorithm in hardware will automatically calculate the runtime
> > workload and adjust the realtime cpu cores frequency according to the
> > power supply and thermal, core voltage and some other hardware conditions.
> >
> > amd-cppc is enabled on passive mode with a top-level
> > `cpufreq=amd-cppc` option, while users add extra `active` flag to
> > select active mode.
> >
> > With `cpufreq=amd-cppc,active`, we did a 60s sampling test to see the
> > CPU frequency change, through tweaking the energy_perf preference from
> > `xenpm set-cpufreq-cppc powersave` to `xenpm set-cpufreq-cppc performance`.
> > The outputs are as follows:
> > ```
> > Setting CPU in powersave mode
> > Sampling and Outputs:
> >   Avg freq      580000 KHz
> >   Avg freq      580000 KHz
> >   Avg freq      580000 KHz
> > Setting CPU in performance mode
> > Sampling and Outputs:
> >   Avg freq      4640000 KHz
> >   Avg freq      4220000 KHz
> >   Avg freq      4640000 KHz
> > ```
> >
> > Penny Zheng (8):
> >   xen/cpufreq: embed hwp into struct cpufreq_policy{}
> >   xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
> >   xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
> >   xen/cpufreq: get performance policy from governor set via xenpm
> >   tools/cpufreq: extract CPPC para from cpufreq para
> >   xen/cpufreq: bypass governor-related para for amd-cppc-epp
> >   xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc
> >     driver
> >   CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support
> >
> >  CHANGELOG.md                         |   1 +
> >  docs/misc/xen-command-line.pandoc    |   9 +-
> >  tools/include/xenctrl.h              |   3 +-
> >  tools/libs/ctrl/xc_pm.c              |  25 +-
> >  tools/misc/xenpm.c                   |  94 ++--
> >  xen/arch/x86/acpi/cpufreq/amd-cppc.c | 703 ++++++++++++++++++++++++++-
> >  xen/arch/x86/acpi/cpufreq/hwp.c      |  32 +-
> >  xen/arch/x86/cpu/amd.c               |   8 +-
> >  xen/arch/x86/include/asm/amd.h       |   2 +
> >  xen/arch/x86/include/asm/msr-index.h |   6 +
> >  xen/drivers/acpi/pm-op.c             |  58 ++-
> >  xen/drivers/cpufreq/utility.c        |  15 +
> >  xen/include/acpi/cpufreq/cpufreq.h   |  44 ++
> >  xen/include/public/sysctl.h          |   5 +-
> >  14 files changed, 936 insertions(+), 69 deletions(-)
>
> As we're considering our options towards getting this work in, can you clarify two
> things please:
> (1) In v9, the sole changes were related to the use of per-CPU data and the
>     adding of a ChangeLog entry?

Yes. In addition, in the commit "xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC" I added a description of "moving the copying of the governor name".

> (2) The driver is inactive by default, i.e. requires use of the command line
>     option to come into play?
>

Yes, the driver is only turned on by the specific command line option.

> If the answer to both is yes, we're leaning towards committing v8 plus the
> ChangeLog entry.
>
> Jan


* Re: [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver
  2025-09-10  9:27   ` Penny, Zheng
@ 2025-09-10  9:46     ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-10  9:46 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD,
	Orzel, Michal, Julien Grall, Stefano Stabellini, Juergen Gross,
	Oleksii Kurochko, Community Manager,
	xen-devel@lists.xenproject.org

On 10.09.2025 11:27, Penny, Zheng wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Wednesday, September 10, 2025 12:11 AM
>>
>> On 04.09.2025 08:35, Penny Zheng wrote:
>>> Penny Zheng (8):
>>>   xen/cpufreq: embed hwp into struct cpufreq_policy{}
>>>   xen/cpufreq: implement amd-cppc driver for CPPC in passive mode
>>>   xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode
>>>   xen/cpufreq: get performance policy from governor set via xenpm
>>>   tools/cpufreq: extract CPPC para from cpufreq para
>>>   xen/cpufreq: bypass governor-related para for amd-cppc-epp
>>>   xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc
>>>     driver
>>>   CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support
>>>
>>>  CHANGELOG.md                         |   1 +
>>>  docs/misc/xen-command-line.pandoc    |   9 +-
>>>  tools/include/xenctrl.h              |   3 +-
>>>  tools/libs/ctrl/xc_pm.c              |  25 +-
>>>  tools/misc/xenpm.c                   |  94 ++--
>>>  xen/arch/x86/acpi/cpufreq/amd-cppc.c | 703 ++++++++++++++++++++++++++-
>>>  xen/arch/x86/acpi/cpufreq/hwp.c      |  32 +-
>>>  xen/arch/x86/cpu/amd.c               |   8 +-
>>>  xen/arch/x86/include/asm/amd.h       |   2 +
>>>  xen/arch/x86/include/asm/msr-index.h |   6 +
>>>  xen/drivers/acpi/pm-op.c             |  58 ++-
>>>  xen/drivers/cpufreq/utility.c        |  15 +
>>>  xen/include/acpi/cpufreq/cpufreq.h   |  44 ++
>>>  xen/include/public/sysctl.h          |   5 +-
>>>  14 files changed, 936 insertions(+), 69 deletions(-)
>>
>> As we're considering our options towards getting this work in, can you clarify two
>> things please:
>> (1) In v9, the sole changes were related to the use of per-CPU data and the
>>     adding of a ChangeLog entry?
> 
> Yes. In addition, in the commit "xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC" I added a description of "moving the copying of the governor name".

Oh, right, and that description is either too little or possibly unnecessary,
with a change made to "tools/cpufreq: extract CPPC para from cpufreq para"
(both as per my v9 comments). IOW v8 also isn't exactly what would want to go
in. All not a good basis for rushing this in at (later than) the last minute.

Jan



* RE: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-08 10:02       ` Jan Beulich
  2025-09-08 11:28         ` Penny, Zheng
@ 2025-09-15  3:49         ` Penny, Zheng
  2025-09-15 14:17           ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Penny, Zheng @ 2025-09-15  3:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andryuk, Jason, xen-devel@lists.xenproject.org, Huang, Ray,
	Stefano Stabellini

[Public]

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, September 8, 2025 6:02 PM
> To: Penny, Zheng <penny.zheng@amd.com>
> Cc: Andryuk, Jason <Jason.Andryuk@amd.com>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
>
> (re-adding the list)
>
> On 05.09.2025 06:58, Penny, Zheng wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Thursday, September 4, 2025 7:51 PM
> >> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
> >> <Jason.Andryuk@amd.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
> >> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct
> >> cpufreq_policy{}
> >>
> >> On 04.09.2025 08:35, Penny Zheng wrote:
> >>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is only
> >>> invoked on the firstcpu, so current per-CPU hwp driver data struct
> >>> hwp_drv_data{} actually fails to be allocated for cpus other than
> >>> the first one. There is no need to make it per-CPU.
> >>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then
> >>> cpus could share the hwp driver data allocated for the firstcpu,
> >>> like the way they share struct cpufreq_policy{}. We also make it a
> >>> union, with "hwp", and later "amd-cppc" as a sub-struct.
> >>
> >> And ACPI, as per my patch (which then will need re-basing).
> >>
> >>> Suggested-by: Jan Beulich <jbeulich@suse.com>
> >>
> >> Not quite, this really is Reported-by: as it's a bug you fix, and in
> >> turn it also wants to gain a Fixes: tag. This also will need backporting.
> >>
> >> It would also have been nice if you had Cc-ed Jason right away,
> >> seeing that this code was all written by him.
> >>
> >>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> >>>                                         unsigned int relation)
> >>>  {
> >>>      unsigned int cpu = policy->cpu;
> >>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> >>> +    struct hwp_drv_data *data = policy->u.hwp;
> >>>      /* Zero everything to ensure reserved bits are zero... */
> >>>      union hwp_request hwp_req = { .raw = 0 };
> >>
> >> Further down in this same function we have
> >>
> >>     on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> >>
> >> That's similarly problematic when the CPU denoted by policy->cpu
> >> isn't online anymore. (It's not quite clear whether all related
> >> issues would want fixing together, or in multiple patches.)
> >
> > Checking the logic in cpufreq_del_cpu(), once any processor in the
> > domain gets offline, the governor will stop.
>
> Yet with HWP and CPPC drivers being governor-less, how would that matter?
>
> > That is to say, the cpufreq driver can only be effective while all
> > processors in the domain are online, which also complies with the
> > description of "Num Processors" for _PSD in the ACPI spec [1]:
> > ```
> > The number of processors belonging to the domain for this logical
> > processor’s P-states. OSPM will not start performing power state
> > transitions to a particular P-state until this number of processors
> > belonging to the same domain have been detected and started.
> > ```
> > [1]
> > https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#pstatedependency-package-values
> >
> > I know that AMD CPPC obeys the _PSD dependency relation too, even though
> > both the CPPC Request register (ccd[15:0]_lthree0_core[7:0]_thread[1:0];
> > MSRC001_02B3) and the CPPC Capability register
> > (_ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B0) are per-thread
> > MSRs.
> > I don't have the hardware to test the "sharing" logic. All my platforms
> > say "HW_ALL" in _PSD.
>
> Aiui that's not mandated by the CPU spec, though. Plus HW_ALL still doesn't say
> anything about the scope/size of each domain.
>

Sorry for the late reply.
I have now discussed this with the hardware team. They said that we should not expect any value other than "HW_ALL" from _PSD if we have a _CPC table, i.e. if CPPC is enabled. The possible exception is some initial implementations, which may or may not have reached product release; reaching a conclusion on that may still take some time.
And if it is HW_ALL, it means that, in the code, we invoke cpufreq_driver.init per CPU, allocate a per-CPU cpufreq_policy, etc. With HW_ALL, OSPM can make different state requests for processors in the domain, while hardware determines the resulting state for all processors in the domain.
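
To illustrate the difference between per-CPU and shared policies, here is a minimal standalone sketch. It is a simplified model, not the Xen code: `struct cpu_psd` and `count_policy_inits()` are hypothetical names, and only the coordination type values (0xFC/0xFD/0xFE) come from the ACPI spec. It counts how often a driver's init hook would run, assuming a fresh policy is set up for every CPU under HW_ALL and only for the first CPU of a domain otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

/* _PSD coordination types, values as defined by the ACPI specification. */
enum psd_coord_type {
    PSD_COORD_SW_ALL = 0xFC,
    PSD_COORD_SW_ANY = 0xFD,
    PSD_COORD_HW_ALL = 0xFE,
};

/* Hypothetical, simplified view of one CPU's _PSD data. */
struct cpu_psd {
    unsigned int domain;        /* _PSD domain number */
    enum psd_coord_type coord;  /* _PSD coordination type */
};

/*
 * Count how many times the driver's init hook would run (i.e. how many
 * policies get allocated) when CPUs 0..nr-1 are brought up in order.
 */
static unsigned int count_policy_inits(const struct cpu_psd *cpus, size_t nr)
{
    unsigned int inits = 0;

    for ( size_t i = 0; i < nr; i++ )
    {
        bool domain_seen = false;

        /* Has an earlier CPU already registered this domain? */
        for ( size_t j = 0; j < i; j++ )
            if ( cpus[j].domain == cpus[i].domain )
            {
                domain_seen = true;
                break;
            }

        /* HW_ALL: always a fresh policy; otherwise only for a new domain. */
        if ( cpus[i].coord == PSD_COORD_HW_ALL || !domain_seen )
            inits++;
    }

    return inits;
}
```

With four CPUs in two domains, this model runs init four times under HW_ALL but only twice under SW_ANY, which is the case where per-CPU driver data would be missing for the non-first CPUs.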

>

> Jan


* Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
  2025-09-15  3:49         ` Penny, Zheng
@ 2025-09-15 14:17           ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2025-09-15 14:17 UTC (permalink / raw)
  To: Penny, Zheng
  Cc: Andryuk, Jason, xen-devel@lists.xenproject.org, Huang, Ray,
	Stefano Stabellini

On 15.09.2025 05:49, Penny, Zheng wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, September 8, 2025 6:02 PM
>> To: Penny, Zheng <penny.zheng@amd.com>
>> Cc: Andryuk, Jason <Jason.Andryuk@amd.com>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{}
>>
>> (re-adding the list)
>>
>> On 05.09.2025 06:58, Penny, Zheng wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Thursday, September 4, 2025 7:51 PM
>>>> To: Penny, Zheng <penny.zheng@amd.com>; Andryuk, Jason
>>>> <Jason.Andryuk@amd.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau Monné
>>>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
>>>> Subject: Re: [PATCH v9 1/8] xen/cpufreq: embed hwp into struct
>>>> cpufreq_policy{}
>>>>
>>>> On 04.09.2025 08:35, Penny Zheng wrote:
>>>>> For cpus sharing one cpufreq domain, cpufreq_driver.init() is only
>>>>> invoked on the firstcpu, so current per-CPU hwp driver data struct
>>>>> hwp_drv_data{} actually fails to be allocated for cpus other than
>>>>> the first one. There is no need to make it per-CPU.
>>>>> We embed struct hwp_drv_data{} into struct cpufreq_policy{}, then
>>>>> cpus could share the hwp driver data allocated for the firstcpu,
>>>>> like the way they share struct cpufreq_policy{}. We also make it a
>>>>> union, with "hwp", and later "amd-cppc" as a sub-struct.
>>>>
>>>> And ACPI, as per my patch (which then will need re-basing).
>>>>
>>>>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> Not quite, this really is Reported-by: as it's a bug you fix, and in
>>>> turn it also wants to gain a Fixes: tag. This also will need backporting.
>>>>
>>>> It would also have been nice if you had Cc-ed Jason right away,
>>>> seeing that this code was all written by him.
>>>>
>>>>> @@ -259,7 +258,7 @@ static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
>>>>>                                         unsigned int relation)
>>>>>  {
>>>>>      unsigned int cpu = policy->cpu;
>>>>> -    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
>>>>> +    struct hwp_drv_data *data = policy->u.hwp;
>>>>>      /* Zero everything to ensure reserved bits are zero... */
>>>>>      union hwp_request hwp_req = { .raw = 0 };
>>>>
>>>> Further down in this same function we have
>>>>
>>>>     on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
>>>>
>>>> That's similarly problematic when the CPU denoted by policy->cpu
>>>> isn't online anymore. (It's not quite clear whether all related
>>>> issues would want fixing together, or in multiple patches.)
>>>
>>> Checking the logic in cpufreq_del_cpu(), once any processor in the
>>> domain gets offline, the governor will stop.
>>
>> Yet with HWP and CPPC drivers being governor-less, how would that matter?
>>
>>> That is to say, the cpufreq driver can only be effective while all
>>> processors in the domain are online, which also complies with the
>>> description of "Num Processors" for _PSD in the ACPI spec [1]:
>>> ```
>>> The number of processors belonging to the domain for this logical
>>> processor’s P-states. OSPM will not start performing power state
>>> transitions to a particular P-state until this number of processors
>>> belonging to the same domain have been detected and started.
>>> ```
>>> [1]
>>> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html?highlight=cppc#pstatedependency-package-values
>>>
>>> I know that AMD CPPC obeys the _PSD dependency relation too, even though
>>> both the CPPC Request register (ccd[15:0]_lthree0_core[7:0]_thread[1:0];
>>> MSRC001_02B3) and the CPPC Capability register
>>> (_ccd[15:0]_lthree0_core[7:0]_thread[1:0]; MSRC001_02B0) are per-thread
>>> MSRs.
>>> I don't have the hardware to test the "sharing" logic. All my platforms
>>> say "HW_ALL" in _PSD.
>>
>> Aiui that's not mandated by the CPU spec, though. Plus HW_ALL still doesn't say
>> anything about the scope/size of each domain.
> 
> Sorry for the late reply.

No worries. From feedback from Stefano I was fearing much of a delay.

> I have now discussed this with the hardware team. They said that we should not expect any value other than "HW_ALL" from _PSD if we have a _CPC table, i.e. if CPPC is enabled. The possible exception is some initial implementations, which may or may not have reached product release; reaching a conclusion on that may still take some time.
> And if it is HW_ALL, it means that, in the code, we invoke cpufreq_driver.init per CPU, allocate a per-CPU cpufreq_policy, etc. With HW_ALL, OSPM can make different state requests for processors in the domain, while hardware determines the resulting state for all processors in the domain.

Okay, so going back to v8 (and doing some unrelated adjustments there)
looks (to me) to be the best option. Eventually (I wouldn't insist on
this happening right away) we may want to actually reject non-HW_ALL
configurations when using this new driver.
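
Such a rejection could look roughly like the following standalone sketch. This is only an illustration under assumptions, not code from the series: the macro names and `cppc_check_psd_coord()` are hypothetical, the coordination type values are the ones the ACPI spec defines for _PSD, and the choice of -ENODEV as the error is arbitrary.

```c
#include <errno.h>

/* _PSD coordination types as defined by the ACPI specification. */
#define CPPC_PSD_COORD_SW_ALL 0xFC
#define CPPC_PSD_COORD_SW_ANY 0xFD
#define CPPC_PSD_COORD_HW_ALL 0xFE

/*
 * Hypothetical helper: return 0 if the _PSD coordination type is one the
 * driver would be willing to handle, or a negative errno value otherwise.
 */
static int cppc_check_psd_coord(unsigned int coord_type)
{
    /* Only hardware-coordinated domains (HW_ALL) are accepted. */
    if ( coord_type != CPPC_PSD_COORD_HW_ALL )
        return -ENODEV;

    return 0;
}
```

A driver's init path could call such a helper before allocating any per-policy data, so that SW_ALL / SW_ANY platforms fall back to another cpufreq driver.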

Jan



end of thread, other threads:[~2025-09-15 14:19 UTC | newest]

Thread overview: 29+ messages
2025-09-04  6:35 [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Penny Zheng
2025-09-04  6:35 ` [PATCH v9 1/8] xen/cpufreq: embed hwp into struct cpufreq_policy{} Penny Zheng
2025-09-04 11:50   ` Jan Beulich
2025-09-04 18:53     ` Jason Andryuk
2025-09-08  9:35       ` Jan Beulich
     [not found]     ` <DM4PR12MB8451C5D54EFEC8F6E0B76E43E103A@DM4PR12MB8451.namprd12.prod.outlook.com>
2025-09-08 10:02       ` Jan Beulich
2025-09-08 11:28         ` Penny, Zheng
2025-09-08 11:31           ` Jan Beulich
2025-09-15  3:49         ` Penny, Zheng
2025-09-15 14:17           ` Jan Beulich
2025-09-04  6:35 ` [PATCH v9 2/8] xen/cpufreq: implement amd-cppc driver for CPPC in passive mode Penny Zheng
2025-09-04 12:04   ` Jan Beulich
2025-09-05  5:15     ` Penny, Zheng
2025-09-05  6:44       ` Jan Beulich
2025-09-05  7:11         ` Penny, Zheng
2025-09-04 12:11   ` Jan Beulich
2025-09-04  6:35 ` [PATCH v9 3/8] xen/cpufreq: implement amd-cppc-epp driver for CPPC in active mode Penny Zheng
2025-09-04 12:12   ` Jan Beulich
2025-09-04  6:35 ` [PATCH v9 4/8] xen/cpufreq: get performance policy from governor set via xenpm Penny Zheng
2025-09-04  6:35 ` [PATCH v9 5/8] tools/cpufreq: extract CPPC para from cpufreq para Penny Zheng
2025-09-04 12:26   ` Jan Beulich
2025-09-04  6:35 ` [PATCH v9 6/8] xen/cpufreq: bypass governor-related para for amd-cppc-epp Penny Zheng
2025-09-04  6:35 ` [PATCH v9 7/8] xen/cpufreq: Adapt SET/GET_CPUFREQ_CPPC xen_sysctl_pm_op for amd-cppc driver Penny Zheng
2025-09-04 12:33   ` Jan Beulich
2025-09-04  6:35 ` [PATCH v9 8/8] CHANGELOG.md: add amd-cppc/amd-cppc-epp cpufreq driver support Penny Zheng
2025-09-04  6:47   ` Jan Beulich
2025-09-09 16:10 ` [PATCH v9 0/8] amd-cppc CPU Performance Scaling Driver Jan Beulich
2025-09-10  9:27   ` Penny, Zheng
2025-09-10  9:46     ` Jan Beulich
